Splitting the Dataset
Hmm… Ah… Looking back at the graphs we used for the Cats and Dogs, I've just realised something: they have no units ^_^
Ah well, that’s about to change, haha!
Okay, so this is the same as all previous graphs: the Circle represents a Cat and the Triangle represents a Dog. However, we now have Size in feet and Furriness as a percentage – so 100% is a very furry cat ;).
In order to split up the data, we need the set of coordinates for each point, as well as that point's corresponding correct class label. So, using the red lines above, we have one record each for a cat and a dog:
- Cat:
Size (ft) | Furriness (%) | Class Label |
0.4 | 80 | 0 |
- Dog:
Size (ft) | Furriness (%) | Class Label |
1 | 20 | 1 |
Remember from the previous post on Perceptrons that the class label (or output) for a cat was 0, and for a dog was 1.
However, this is just 2 of the 6 records needed, so for all of the points above (randomised) we have:
Size (ft) | Furriness (%) | Class Label |
0.5 | 70 | 0 |
1.2 | 40 | 1 |
1 | 20 | 1 |
0.4 | 80 | 0 |
0.5 | 90 | 0 |
1.5 | 10 | 1 |
Now that we have this, we need to split the data up into 2 sets. One set will be used for Training and the other for Testing – for both a Perceptron and a KNN. When I say split, I really mean split in half. So the top 3 records are for Training, and the last 3 for Testing – there's a quick sketch of this in code below.
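If you fancy seeing that split in code, here's a minimal Python sketch – the list layout and variable names are just my own choices, not from any library or the previous posts:

```python
# Each record is (size_ft, furriness_pct, class_label), matching the
# randomised table above (0 = Cat, 1 = Dog).
dataset = [
    (0.5, 70, 0),
    (1.2, 40, 1),
    (1.0, 20, 1),
    (0.4, 80, 0),
    (0.5, 90, 0),
    (1.5, 10, 1),
]

half = len(dataset) // 2
training_set = dataset[:half]  # top 3 records
testing_set = dataset[half:]   # bottom 3 records
```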
Training
The training set will consist of:
Size (ft) | Furriness (%) | Class Label |
0.5 | 70 | 0 |
1.2 | 40 | 1 |
1 | 20 | 1 |
Which is the top 3 records from above. We use the training data to train a KNN or a Perceptron. So for a Perceptron, this means calculating the best weight values – there's a rough sketch of that just below.
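As a hedged illustration, here's roughly what that weight-finding could look like, assuming the classic Perceptron update rule – the epochs and learning_rate values are placeholder choices of mine, not anything official:

```python
def train_perceptron(training_set, epochs=20, learning_rate=0.1):
    # Start with zero weights for Size, Furriness, and the bias.
    w_size, w_furry, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for size, furriness, label in training_set:
            # Step activation: guess 1 (Dog) if the weighted sum is positive.
            guess = 1 if (w_size * size + w_furry * furriness + bias) > 0 else 0
            error = label - guess
            # Nudge each weight towards the correct answer.
            w_size += learning_rate * error * size
            w_furry += learning_rate * error * furriness
            bias += learning_rate * error
    return w_size, w_furry, bias
```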
A training set is best kept small, to save time and space. HOWEVER, don't make it too small, else you'll make many mistakes in testing and could misclassify something – such as identifying a Camel as a Dog… which is just blatantly wrong!
Testing
The testing set will consist of:
Size (ft) | Furriness (%) | Class Label |
0.4 | 80 | 0 |
0.5 | 90 | 0 |
1.5 | 10 | 1 |
Which is the bottom 3 records from the total dataset above. The testing data is used to test the Perceptron or KNN that you just trained, simulating what it might be like to see new data in the future.
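For the KNN side, here's a minimal sketch of how testing a point might go – the function name and the default k are my own assumptions, and note that for a KNN, "training" is really just keeping hold of the training records:

```python
import math

def knn_predict(training_set, size, furriness, k=1):
    # Sort the stored training records by Euclidean distance to the query
    # point. (Furriness runs 0-100 while Size is in feet, so in practice
    # you'd likely want to rescale the features first – not shown here.)
    by_distance = sorted(
        training_set,
        key=lambda rec: math.hypot(rec[0] - size, rec[1] - furriness),
    )
    # Majority vote among the k nearest neighbours.
    labels = [label for _, _, label in by_distance[:k]]
    return 1 if sum(labels) > k / 2 else 0

# The first testing record (0.4 ft, 80% furry) sits nearest the training
# cat at (0.5, 70), so with k=1 this prints 0 (Cat).
print(knn_predict(training_set, 0.4, 80))
```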
This testing set needs to be big… very big… the bigger the better!
So basically – for a Perceptron – you will use the training data to calculate the weights that give you the best accuracy. Then you use those weights on the testing set to predict what each new data point might be, giving it a guessed class label. If the percentage error is too big, go back to training and calculate different weight values. However, the testing data will never have as high an accuracy as the training – it's impossible to get everything correct! There's a sketch of that testing loop below.
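Putting the earlier sketches together, here's roughly what scoring the testing set could look like – predict() just mirrors the step rule from the training sketch, and every name here is carried over from my examples above rather than from any library:

```python
def predict(weights, size, furriness):
    # Same step rule as in training: positive weighted sum means Dog (1).
    w_size, w_furry, bias = weights
    return 1 if (w_size * size + w_furry * furriness + bias) > 0 else 0

def accuracy(weights, testing_set):
    # Percentage of testing records whose guessed label matches the truth.
    correct = sum(
        1 for size, furriness, label in testing_set
        if predict(weights, size, furriness) == label
    )
    return correct / len(testing_set) * 100

weights = train_perceptron(training_set)
print(f"Testing accuracy: {accuracy(weights, testing_set):.0f}%")
```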
One last note: read the next post on Confusion Matrices – it's technically a continuation…