K-Fold Cross-Validation in Sklearn

Splitting the collected data into a training and a validation dataset is the most common machine learning approach for estimating and improving a model's performance. The split ratio is usually 70:30 or 80:20. This holdout approach is the simplest and most common validation strategy. The issue with it is that we cannot be sure whether a good validation accuracy score really denotes a good model. What if the portion of the dataset we used for validation just happened to be easy to predict? If we used a different portion of the dataset as the validation dataset, would our model still achieve a high accuracy score? K-fold cross-validation provides answers to some of these questions.

What is Cross-Validation?

When we want to train a machine learning model with a good performance score, dividing the dataset into a training and a validation dataset is a fundamental and essential step. We must test our model on an unseen dataset (the validation dataset) to assess whether or not it is overfitted. A model that does not achieve a good accuracy score on the validation dataset will also give a low performance score when supplied with actual live data. Given this idea, cross-validation is perhaps one of the most crucial machine learning concepts for guaranteeing the robustness of our model. Cross-validation is essentially a procedure that sets aside a portion of the dataset for validating and testing the model, while the remaining dataset is used for training.
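As a point of reference, the plain holdout approach described above can be written in a few lines with scikit-learn's train_test_split. This is only a minimal sketch: the breast cancer dataset, the logistic regression model, and the 80:20 ratio are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed dataset and model, used here only for illustration
X, y = load_breast_cancer(return_X_y=True)

# A single 80:20 train/validation split (the holdout approach)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))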
Benefits of using Cross-Validation

Many types of Cross-Validation
What is K-Fold Cross-Validation?

The k-fold cross-validation method is widely used for estimating how well a machine learning model performs on a validation dataset. Although 10 is a typical choice for k, how can we be confident that this number of folds is suitable for our dataset and model? One method is to investigate the impact of various choices of k on the estimated model performance and compare it with an ideal test condition. This can aid in selecting an appropriate number of folds. Once a value of k is determined, we can use it to assess various models on the dataset. We can then compare the pattern of their scores to the scores of the same models under the ideal test condition to see whether or not they are strongly correlated. If the results are correlated, it confirms that the selected configuration is a reliable approximation of the ideal test condition. This tutorial will teach the reader how to set up and assess k-fold cross-validation configurations.

Procedure of K-Fold Cross-Validation Method

As a general procedure, the following happens:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups (folds) of roughly equal size.
3. For each unique group: take that group as the holdout or validation dataset, take the remaining k-1 groups as the training dataset, fit a model on the training dataset, evaluate it on the validation dataset, and retain the evaluation score.
4. Summarize the skill of the model using the collected evaluation scores, typically their mean.

It is significant that every item in the dataset is assigned to exactly one group and remains there throughout the process. This means that each data sample is used k-1 times as part of the training dataset and exactly once as part of the holdout or validation dataset. Any data preparation done before fitting the model must take place on the CV-assigned training dataset inside the loop rather than on the entire dataset. This also holds for any hyperparameter tuning. Failing to carry out these procedures within the loop may lead to data leakage and an exaggerated assessment of the model's skill.

Example of K-Fold Cross Validation Test
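The following is a minimal sketch of a manual 5-fold cross-validation loop of the kind that could produce the output below. The breast cancer dataset and a logistic regression classifier are assumptions; the exact scores depend on the chosen dataset, model, and settings.

Code:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Assumed dataset and model (for illustration)
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# Split the data into 5 folds
kf = KFold(n_splits=5)

acc_scores = []
for train_index, test_index in kf.split(X):
    # Train on k-1 folds and validate on the remaining fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    acc_scores.append(accuracy_score(y_test, model.predict(X_test)))

print("Accuracy score of each fold:", acc_scores)
print("Mean accuracy score:", np.mean(acc_scores))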
Output:

Accuracy score of each fold: [0.9122807017543859, 0.9473684210526315, 0.9736842105263158, 0.9736842105263158, 0.9557522123893806]
Mean accuracy score: 0.952553951249806

Cross Validation Using cross_val_score()

We can shorten the above code by using the cross_val_score() function.
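As a sketch under the same assumptions as the previous example (breast cancer dataset, logistic regression, 5 folds), cross_val_score runs the entire fit-and-score loop in a single call.

Code:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Same assumed dataset and model as in the previous example
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# cross_val_score fits and scores the model on every fold for us
acc_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5), scoring="accuracy")

print("Accuracy score of each fold:", list(acc_scores))
print("Mean accuracy score:", acc_scores.mean())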
Output:

Accuracy score of each fold: [0.9122807017543859, 0.9473684210526315, 0.9736842105263158, 0.9736842105263158, 0.9557522123893806]
Mean accuracy score: 0.952553951249806

The K-fold cross-validation method is most helpful when training our model on smaller datasets. If our data collection is extensive, K-fold cross-validation may not be necessary, because the validation dataset already has enough records for us to verify the performance of our machine learning model. Utilizing the K-fold cross-validation test on a vast dataset also requires much time, and using more folds to validate our model consumes even more computing power. The model takes longer to train for larger values of K: if K is 5, the model undergoes five separate training runs on five different train/validation splits, and if K = 10, it is trained ten times.

Sensitivity Analysis for k

The most important tuning parameter of the k-fold cross-validation test is the value of k, which specifies how many folds to divide a given dataset into. Typical values chosen for an average-sized dataset are k=3, k=5, and k=10, with k=10 being the most widely used value for evaluating the performance of a trained model. The reason for this specific choice is that research has shown k=10 offers a reasonable balance between low computing expense and moderate bias in assessing model performance.

When testing our models on a sample of our dataset, how can we determine what value of k to use? Although k=10 is an option, how can we ensure it does justice to our dataset? One way to answer this question is to run a sensitivity analysis for various values of k; in other words, compare the results of the same model trained on the same dataset with different values of k. The hypothesis is that small values of k will produce a noisy estimate of model performance, while large values of k will produce a less noisy estimate. Yet noisy in comparison to what?

We do not have access to the model's true performance on unseen data; if we knew it, we could use it in the model's assessment. However, we can select a test condition that serves as an "ideal", or as close to ideal as we can get, estimate of the model's performance. One strategy is to train the model on all the available data and then estimate its performance on a separate, large hold-out dataset. The performance on the hold-out data would indicate the "actual" performance of the model, while any cross-validation scores on the training dataset would be estimates of this true score. The approach we mentioned is rarely feasible, since we frequently lack enough data to put aside a portion of the primary dataset as a test dataset. Alternatively, we can simulate this condition with the leave-one-out cross-validation procedure (LOOCV), a computationally intensive variant of cross-validation in which k = N, where N is the total number of samples in the training dataset. In other words, each validation set contains a single sample from the training dataset. Given appropriate data, LOOCV can produce a decent estimate of model performance, but it is rarely utilized for large datasets because of its high computational cost. Then, on the same dataset, we can compare the average classification accuracy obtained for various values of k with the average classification accuracy of the LOOCV procedure. The discrepancy between the scores gives a rough measure of how closely a given value of k approximates the ideal test condition. Let's look at how to perform a sensitivity analysis of the k-fold cross-validation procedure.
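The analysis can be sketched as follows. A small synthetic classification dataset and a logistic regression model are assumed here for illustration; LOOCV provides the "ideal" score, and k values from 2 to 20 are evaluated and reported with their mean, minimum, and maximum accuracy.

Code:

from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Assumed synthetic dataset and model (for illustration only)
X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
model = LogisticRegression(max_iter=1000)

# "Ideal" test condition: leave-one-out cross-validation (k = N)
ideal_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
print("Ideal score: %.2f" % mean(ideal_scores))

# Evaluate the same model for k = 2..20, reporting mean, min, and max accuracy
for k in range(2, 21):
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print("Folds=%d, accuracy=%.2f (%.2f,%.2f)" % (k, mean(scores), min(scores), max(scores)))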
Output:

Ideal score: 0.83
Folds=2, accuracy=0.81 (0.81,0.81)
Folds=3, accuracy=0.83 (0.79,0.85)
Folds=4, accuracy=0.84 (0.8,0.86)
Folds=5, accuracy=0.83 (0.78,0.88)
Folds=6, accuracy=0.83 (0.77,0.88)
Folds=7, accuracy=0.83 (0.76,0.89)
Folds=8, accuracy=0.83 (0.77,0.88)
Folds=9, accuracy=0.83 (0.75,0.9)
Folds=10, accuracy=0.82 (0.74,0.9)
Folds=11, accuracy=0.83 (0.77,0.9)
Folds=12, accuracy=0.83 (0.7,0.9)
Folds=13, accuracy=0.82 (0.77,0.91)
Folds=14, accuracy=0.82 (0.66,0.93)
Folds=15, accuracy=0.82 (0.76,0.92)
Folds=16, accuracy=0.83 (0.67,0.92)
Folds=17, accuracy=0.83 (0.73,0.93)
Folds=18, accuracy=0.83 (0.68,0.93)
Folds=19, accuracy=0.82 (0.68,0.92)
Folds=20, accuracy=0.82 (0.66,0.96)

In this run, the mean accuracy for most values of k stays close to the LOOCV "ideal" score of 0.83, while the spread between the worst and best fold widens as k grows, which suggests that commonly used choices such as k=5 or k=10 are reasonable approximations of the ideal test condition for this dataset and model.