Python Cross-Validation Implementation

Contents

  • HoldOut cross-validation
  • K-fold cross validation
  • Stratified K-fold cross-validation
  • Leave P Out Cross Validation
  • Leave-one-out cross-validation
  • Monte Carlo cross-validation (Shuffle Split)
  • Time series cross-validation

HoldOut Cross Validation

In this cross-validation technique, the entire data set is randomly divided into training and validation sets. As a rule of thumb, nearly 70% of the entire dataset is used as a training set and the remaining 30% is used as a validation set.

  • Not suitable for imbalanced data sets: a random split may leave a class under-represented in the training or validation set (see the stratified variant after the code below)
  • A large portion of the data is never used for training, so the model may miss patterns contained in the held-out samples
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
Y = iris.target
print("Size of Dataset {}".format(len(X)))

# max_iter raised so that the lbfgs solver converges on iris without warnings
logreg = LogisticRegression(max_iter=1000)

# 70/30 hold-out split; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
logreg.fit(x_train, y_train)
predict = logreg.predict(x_test)
print("Accuracy score on training set is {}".format(accuracy_score(y_train, logreg.predict(x_train))))
print("Accuracy score on test set is {}".format(accuracy_score(y_test, predict)))

K-fold cross validation

  • In K-fold cross-validation, the entire data set is divided into K equal-sized parts. Each partition is called a “fold”, so with K parts we call it K-fold. One fold is used as the validation set and the remaining K-1 folds are used as the training set.
  • The process is repeated K times, so that each fold serves once as the validation set while the remaining folds form the training set.
  • The final score of the model is calculated by averaging the validation scores of the K runs.
  • Not recommended for imbalanced data sets
  • Not suitable for time series data
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

logreg = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5)  # 5 folds; KFold does not shuffle by default
score = cross_val_score(logreg, X, Y, cv=kf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

Stratified K-fold cross-validation

Stratified K-Fold is an enhanced version of K-Fold cross-validation, mainly used for imbalanced data sets. Just like K-fold, the entire data set is divided into K-folds of equal size.

  • In this technique, however, each fold preserves the same proportion of each target class as the entire dataset.
  • Very effective for imbalanced data
  • Not suitable for time series data
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

logreg = LogisticRegression(max_iter=1000)
stratifiedkf = StratifiedKFold(n_splits=5)  # folds preserve the class ratio of Y
score = cross_val_score(logreg, X, Y, cv=stratifiedkf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

Leave P Out Cross Validation

Leave-P-Out cross-validation is an exhaustive cross-validation technique in which p samples are used as the validation set and the remaining n-p samples are used as the training set.

  • Suppose we have 100 samples in the dataset. If we use p=10, then in each iteration, 10 values will be used as the validation set and the remaining 90 samples will be used as the training set.
  • The process is repeated for every possible combination of p validation samples, so each combination is used once as the validation set with the remaining n-p samples as the training set.
  • All data samples are used as both training and validation samples.
  • Long computation time: the number of iterations equals the number of ways to choose p samples from n, as illustrated after the code below.
  • Not suitable for imbalanced data sets
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
Y = iris.target

lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # 11175 splits for 150 samples, hence the long runtime

tree = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)
score = cross_val_score(tree, X, Y, cv=lpo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

Leave-one-out cross-validation

Leave-one-out cross-validation is an exhaustive cross-validation technique where 1 sample point is used as the validation set and the remaining n-1 samples are used as the training set.

  • Suppose we have 100 samples in the dataset. In each iteration, 1 sample is used as the validation set and the remaining 99 samples as the training set. The process is repeated until every sample in the dataset has been used as the validation point.
  • It is the same as LeavePOut cross-validation with p=1.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
X = iris.data
Y = iris.target

loo = LeaveOneOut()  # one split per sample: 150 iterations on iris
tree = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)
score = cross_val_score(tree, X, Y, cv=loo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

Monte Carlo Cross Validation (Shuffle Split)

Monte Carlo cross-validation, also known as Shuffle Split cross-validation, is a very flexible cross-validation strategy. In this technique, the data set is randomly divided into training and validation sets.

  • We decide what percentage of the dataset to use as the training set and what percentage as the validation set. If the two percentages do not sum to 100, the remaining portion of the dataset is used for neither training nor validation.
  • Suppose we have 100 samples and use 60% for training and 20% for validation; the remaining 20% (100 - (60 + 20)) is not used at all.
  • This split is repeated n times, where n is a number we must specify.
  • 1. We are free to choose the training and validation set sizes.
    2. We can choose the number of repetitions independently of the number of folds.
  • Some samples may never be selected for either the training or the validation set.
  • Not suitable for imbalanced data sets
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# 50% train, 30% test per split; the remaining 20% is left unused each time
shuffle_split = ShuffleSplit(test_size=0.3, train_size=0.5, n_splits=10)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
print("Cross Validation scores:\n{}".format(scores))
print("Average Cross Validation score :{}".format(scores.mean()))

Time series cross-validation

What is time series data?
Time series data is data collected at different points in time. Because data points are collected in adjacent time periods, there may be correlations between observations. This is one of the characteristics that distinguishes time series data from cross-sectional data.

How to do cross validation in case of time series data?
In the case of time series data, we cannot select random samples and assign them to the training set or validation set because it does not make sense to use values from future data to predict values from past data.

  • Since the order of the data is crucial for time-series problems, we split the data into training and validation sets by time. This is also known as the “forward chaining” method or rolling cross-validation.
  • We start with a small subset of the data as the training set, predict the later data points, and then check the accuracy of those predictions.
  • The predicted samples are then included in the next training set, and subsequent samples are predicted in turn.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

time_series = TimeSeriesSplit()  # default n_splits=5
print(time_series)
for train_index, test_index in time_series.split(X):
    # training indices always precede the test indices in time
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
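
With six samples and the default five splits, the loop prints an expanding training window: TRAIN: [0] TEST: [1], then TRAIN: [0 1] TEST: [2], and so on up to TRAIN: [0 1 2 3 4] TEST: [5]. Each fold trains only on past points and validates on the next one, which is exactly the forward-chaining behaviour described above.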

ref: https://zhuanlan.zhihu.com/p/435970393
