Cross Validation and Performance Measures in Machine Learning

Aditi Mittal
7 min read · Apr 1, 2020

Deciding which cross validation strategy and performance measures to use with a particular machine learning technique is very important. After training a model on a dataset, we can’t say for sure that it will perform well on data it hasn’t seen before. The process of deciding whether the numerical results quantifying hypothesised relationships between variables are acceptable as descriptions of the data is known as validation. Based on the performance on unseen data, we can say whether the model is overfitted, underfitted or well generalized.

Cross Validation

Cross validation is a technique used to evaluate a machine learning model by training it on a subset of the available data and then evaluating it on the remaining data. Put simply, we keep a portion of the data aside, train the model on the rest, and then test and evaluate the model’s performance on the portion that was kept aside.

Types of Cross Validation Techniques

1. Holdout Method: The holdout method is the simplest type of cross validation, where the data set is divided into two sets, called the training set and the testing set. The model is fitted using the training set only. It is then asked to predict the output values for the data in the testing set, which it has never seen before. The model is evaluated using an appropriate performance measure, such as the mean absolute error on the test set. Advantage — it is preferable to evaluating on the training data alone and takes little time to compute. However, its evaluation can have a high variance: the result depends entirely on which data points end up in the training set and which in the test set, so the evaluation can differ significantly depending on how the division is made.

2. K-Fold Cross Validation Method: This is a modification of the holdout method. The dataset is divided into k subsets (folds); k shouldn’t be too small or too large, and we typically choose 5 to 10 depending on the data size. A higher value of k gives a less biased estimate, whereas a lower value of k behaves more like the holdout approach. We train the model on k-1 folds and validate it on the remaining kth fold, noting down the error. This process is repeated until each of the k folds has served as the test set once. The average of the recorded scores is then taken as the performance metric for the model.

Advantage — It doesn’t matter how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased.

Disadvantage — The training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.

3. Leave-One-Out Cross Validation Method: This is k-fold cross validation taken to its logical extreme, with k equal to N, the number of data points in the set. That means the model is trained N separate times, each time on all the data except for one point, and a prediction is made for that held-out point. As before, the average error is computed and used to evaluate the model. The evaluation given by the leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute, since it requires N model fits. A short code sketch of all three techniques follows this list.
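A minimal sketch of the three techniques above, assuming scikit-learn; the built-in iris dataset and the logistic regression model are placeholders chosen for illustration, not part of the original article:

```python
# Sketch: holdout, k-fold and leave-one-out cross validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)

X, y = load_iris(return_X_y=True)          # placeholder dataset
model = LogisticRegression(max_iter=1000)  # placeholder model

# 1. Holdout: a single 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# 2. K-fold: k = 5 splits, then average the per-fold scores.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))

# 3. Leave-one-out: one fold per sample, so N model fits (expensive).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(holdout_score, kfold_scores.mean(), loo_scores.mean())
```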

Performance Measures

Classification Accuracy

It is the ratio of the number of correct predictions to the total number of input samples: Accuracy = (number of correct predictions) / (total number of predictions).

It works well only when there are roughly equal numbers of samples in each class. For example, suppose our training set has 95% samples of class A and 5% samples of class B. A model can easily reach 95% training accuracy by simply predicting class A for every sample. If the same model is then tested on a set with 55% samples of class A and 45% samples of class B, the test accuracy drops to 55%.
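The imbalance effect described above can be reproduced in a few lines; the labels below are hypothetical and chosen only to mirror the 95%/5% example:

```python
# Sketch: accuracy can look good even when one class is never detected.
from sklearn.metrics import accuracy_score

y_true = [0] * 95 + [1] * 5   # 95% class A (0), 5% class B (1)
y_pred = [0] * 100            # a "model" that always predicts class A

print(accuracy_score(y_true, y_pred))   # 0.95, despite missing every class B sample
```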

Logarithmic Loss

Logarithmic Loss penalises false classifications and works well for multi-class classification. The classifier must assign a probability to each class for every sample. If there are N samples belonging to M classes, then the Log Loss is calculated as:

Log Loss = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} y_ij * log(p_ij)

where y_ij indicates whether sample i belongs to class j or not (1 or 0) and p_ij is the predicted probability of sample i belonging to class j.

Log Loss has no upper bound and lies in the range [0, ∞). A Log Loss closer to 0 indicates higher accuracy, whereas values further from 0 indicate lower accuracy.
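As an illustration, here is a small sketch that evaluates the formula above by hand and compares it with scikit-learn’s implementation; the sample labels and probabilities are assumed values, not from the article:

```python
# Sketch: multi-class log loss, computed manually and via scikit-learn.
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 2, 1, 2]                      # N = 4 samples, M = 3 classes
p = np.array([[0.7, 0.2, 0.1],             # p_ij: predicted class probabilities
              [0.1, 0.3, 0.6],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

y_onehot = np.eye(3)[y_true]               # y_ij: 1 if sample i belongs to class j
manual = -np.mean(np.sum(y_onehot * np.log(p), axis=1))

print(manual, log_loss(y_true, p))         # the two values should match
```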

Confusion Matrix

The confusion matrix gives the output as a matrix that describes the complete performance of the model by comparing the predicted labels against the actual labels.

There are 4 important terms :

  • True Positives : The cases in which we predicted YES and the actual output was also YES.
  • True Negatives : The cases in which we predicted NO and the actual output was NO.
  • False Positives : The cases in which we predicted YES and the actual output was NO.
  • False Negatives : The cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated by dividing the sum of the values on the main diagonal (the correct predictions, TP + TN) by the total number of samples, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).
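A minimal sketch, using hypothetical labels, that extracts the four terms from scikit-learn’s confusion_matrix and computes accuracy from the diagonal as described above:

```python
# Sketch: binary confusion matrix and the accuracy derived from it.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)   # 4 1 1 4 0.8
```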

Area Under Curve

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before defining AUC, let us understand two basic terms:

  • True Positive Rate (Sensitivity) : True Positive Rate is calculated as TP / (TP + FN). It is the proportion of positive data points that are correctly classified as positive, out of all positive data points. It takes values in the range [0, 1].
  • False Positive Rate (1 − Specificity) : False Positive Rate is calculated as FP / (FP + TN). It is the proportion of negative data points that are mistakenly classified as positive, out of all negative data points. It also takes values in the range [0, 1].

AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate as the classification threshold is varied over [0, 1].

AUC also has a range of [0, 1], and the greater the value, the better the performance of our model.
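A short sketch of the ROC curve and AUC, assuming scikit-learn; the labels and scores below are made-up values for illustration:

```python
# Sketch: ROC curve points (FPR, TPR) and the area under the curve.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                  # area under that curve
print(auc)
```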

F1 Score

F1 Score is the harmonic mean of precision and recall, with range [0, 1]. It tells us how precise the classifier is (how many of the instances it labels positive are actually positive) as well as how robust it is (whether it misses a significant number of positive instances). The greater the F1 Score, the better the performance of the model: F1 = 2 * (Precision * Recall) / (Precision + Recall).

  • Precision : It is the number of correct positive results divided by the total number of positive results predicted by the classifier, i.e. Precision = TP / (TP + FP).
  • Recall : It is the number of correct positive results divided by the number of all samples that should have been identified as positive, i.e. Recall = TP / (TP + FN).
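A minimal sketch of precision, recall and F1 with scikit-learn, reusing the same hypothetical labels as in the confusion matrix example:

```python
# Sketch: precision, recall and F1 score for binary predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)
```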

Mean Absolute Error

Mean Absolute Error (MAE) is the average of the absolute differences between the original values and the predicted values. It measures how far the predictions are from the actual values, but it doesn’t give us any idea of the direction of the error, i.e. whether the model is under-predicting or over-predicting the data.
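A small numpy sketch with assumed values:

```python
# Sketch: Mean Absolute Error as the average absolute difference.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))
print(mae)   # 0.5
```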

Mean Squared Error

Mean Squared Error (MSE) is quite similar to Mean Absolute Error, with the difference that MSE takes the average of the squares of the differences between the original values and the predicted values.

Advantage — the gradient of MSE is easy to compute, whereas the absolute value in MAE is not differentiable at zero, which makes gradient-based optimisation more awkward.
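The same assumed values as above, now with MSE:

```python
# Sketch: Mean Squared Error as the average squared difference.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # 0.375
```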

Thank you for reading and looking forward to your feedback!
