
Tune Hyperparameters for Classification Machine Learning Algorithms


Machine learning algorithms have hyperparameters that allow you to tailor the behavior of the algorithm to your specific dataset.

Hyperparameters are different from parameters, which are the internal coefficients or weights for a model found by the learning algorithm. Unlike parameters, hyperparameters are specified by the practitioner when configuring the model.

Typically, it is challenging to know what values to use for the hyperparameters of a given algorithm on a given dataset, therefore it is common to use random or grid search strategies for different hyperparameter values.
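As a rough illustration of the two strategies in scikit-learn, a grid search tries every combination in a fixed grid, while a random search samples a fixed number of candidates from a distribution. The model, dataset, and value ranges below are assumptions for the sake of the sketch, not part of the tutorial itself.

# sketch: grid search tries every combination; random search samples from a distribution
from scipy.stats import loguniform
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# illustrative synthetic dataset and model
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
model = LogisticRegression(solver='liblinear')
# grid search: evaluate every value in a fixed grid
grid = GridSearchCV(model, {'C': [100, 10, 1.0, 0.1, 0.01]}, scoring='accuracy', cv=3, n_jobs=-1)
print(grid.fit(X, y).best_params_)
# random search: draw a fixed number of candidates from a log-uniform distribution
rand = RandomizedSearchCV(model, {'C': loguniform(0.01, 100)}, n_iter=10, scoring='accuracy', cv=3, n_jobs=-1, random_state=1)
print(rand.fit(X, y).best_params_)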

The more hyperparameters of an algorithm that you need to tune, the slower the tuning process. Therefore, it is desirable to select a minimum subset of model hyperparameters to search or tune.

Not all model hyperparameters are equally important. Some hyperparameters have an outsized effect on the behavior, and in turn, the performance of a machine learning algorithm.

As a machine learning practitioner, you must know which hyperparameters to focus on to get a good result quickly.

In this tutorial, you will discover those hyperparameters that are most important for some of the top machine learning algorithms.

Let’s get started.

  • Update Jan/2020: Updated for changes in scikit-learn v0.22 API.
Hyperparameters for Classification Machine Learning Algorithms
Photo by shuttermonkey, some rights reserved.

Classification Algorithms Overview

We will take a closer look at the important hyperparameters of the top machine learning algorithms that you may use for classification.

We will look at the hyperparameters you need to focus on and suggested values to try when tuning the model on your dataset.

The suggestions are based on advice from textbooks on the algorithms, practical advice from practitioners, and a little of my own experience.

The seven classification algorithms we will look at are as follows:

  1. Logistic Regression
  2. Ridge Classifier
  3. K-Nearest Neighbors (KNN)
  4. Support Vector Machine (SVM)
  5. Bagged Decision Trees (Bagging)
  6. Random Forest
  7. Stochastic Gradient Boosting

We will consider these algorithms in the context of their scikit-learn implementation (Python); nevertheless, you can use the same hyperparameter suggestions with other platforms, such as Weka and R.

A small grid searching example is also given for each algorithm that you can use as a starting point for your own classification predictive modeling project.

Note: if you have had success with different hyperparameter values or even different hyperparameters than those suggested in this tutorial, let me know in the comments below. I’d love to hear about it.

Let’s dive in.

Logistic Regression

Logistic regression does not really have any critical hyperparameters to tune.

Sometimes, you can see useful differences in performance or convergence with different solvers (solver).

  • solver in [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]

Regularization (penalty) can sometimes be helpful.

  • penalty in [‘none’, ‘l1’, ‘l2’, ‘elasticnet’]

Note: not all solvers support all regularization terms.
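For example (a hedged illustration, not from the original tutorial), fitting a pairing that the solver does not support raises an error, which is why the grid search below omits some combinations and sets error_score=0:

# sketch: an unsupported solver/penalty pairing raises an error at fit time
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# small illustrative dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=10, random_state=1)
try:
    # 'liblinear' does not support the elastic net penalty
    LogisticRegression(solver='liblinear', penalty='elasticnet', l1_ratio=0.5).fit(X, y)
except ValueError as e:
    print('Unsupported combination:', e)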

The C parameter controls the strength of the regularization penalty, and tuning it can also be effective.

  • C in [100, 10, 1.0, 0.1, 0.01]

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for LogisticRegression on a synthetic binary classification dataset.

Some combinations were omitted to cut back on the warnings/errors.

# example of grid searching key hyperparameters for logistic regression
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.945333 using {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
0.936333 (0.016829) with: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
0.937667 (0.017259) with: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
0.938667 (0.015861) with: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
0.936333 (0.017413) with: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
0.938333 (0.017904) with: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
0.939000 (0.016401) with: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
0.937333 (0.017114) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.939000 (0.017195) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.939000 (0.015780) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.940000 (0.015706) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
0.940333 (0.014941) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
0.941000 (0.017000) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
0.943000 (0.016763) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
0.943000 (0.016763) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
0.945333 (0.017651) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}

Ridge Classifier

Ridge regression is a penalized linear regression model for predicting a numerical value.

Nevertheless, it can be very effective when applied to classification.

Perhaps the most important parameter to tune is the regularization strength (alpha). A good starting point might be values in the range [0.1, 1.0].

  • alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for RidgeClassifier on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for ridge classifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = RidgeClassifier()
alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# define grid search
grid = dict(alpha=alpha)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.974667 using {'alpha': 0.1}
0.974667 (0.014545) with: {'alpha': 0.1}
0.974667 (0.014545) with: {'alpha': 0.2}
0.974667 (0.014545) with: {'alpha': 0.3}
0.974667 (0.014545) with: {'alpha': 0.4}
0.974667 (0.014545) with: {'alpha': 0.5}
0.974667 (0.014545) with: {'alpha': 0.6}
0.974667 (0.014545) with: {'alpha': 0.7}
0.974667 (0.014545) with: {'alpha': 0.8}
0.974667 (0.014545) with: {'alpha': 0.9}
0.974667 (0.014545) with: {'alpha': 1.0}

K-Nearest Neighbors (KNN)

The most important hyperparameter for KNN is the number of neighbors (n_neighbors).

Test values from 1 to at least 21, perhaps trying just the odd numbers to avoid ties.

  • n_neighbors in [1 to 21]

It may also be interesting to test different distance metrics (metric) for choosing the composition of the neighborhood.

  • metric in [‘euclidean’, ‘manhattan’, ‘minkowski’]

For a fuller list see:

It may also be interesting to test the contribution of members of the neighborhood via different weightings (weights).

  • weights in [‘uniform’, ‘distance’]

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for KNeighborsClassifier on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for KNeighborsClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = KNeighborsClassifier()
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.937667 using {'metric': 'manhattan', 'n_neighbors': 13, 'weights': 'uniform'}
0.833667 (0.031674) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'uniform'}
0.833667 (0.031674) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'distance'}
0.895333 (0.030081) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
0.895333 (0.030081) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
0.909000 (0.021810) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
0.909000 (0.021810) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
0.925333 (0.020774) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'uniform'}
0.925333 (0.020774) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
0.929000 (0.027368) with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'uniform'}
0.929000 (0.027368) with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'distance'}
...

Support Vector Machine (SVM)

The SVM algorithm, like gradient boosting, is very popular, very effective, and provides a large number of hyperparameters to tune.

Perhaps the first important parameter is the choice of kernel, which controls how the input variables are projected. There are many to choose from, but linear, polynomial, and RBF are the most common, perhaps just linear and RBF in practice.

  • kernel in [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]

If the polynomial kernel works out, then it is a good idea to dive into the degree hyperparameter.
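A minimal sketch of grid searching the degree once the polynomial kernel looks promising is given below; the degree and C values are assumptions, and the synthetic dataset matches the examples used throughout.

# sketch: grid search the polynomial degree for an SVC with the 'poly' kernel
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define model and grid over the polynomial degree
model = SVC(kernel='poly', gamma='scale')
grid = dict(degree=[2, 3, 4, 5], C=[10, 1.0, 0.1])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')
grid_result = grid_search.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))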

Another critical parameter is the penalty (C) that can take on a range of values and has a dramatic effect on the shape of the resulting regions for each class. A log scale might be a good starting point.

  • C in [100, 10, 1.0, 0.1, 0.001]

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for SVC on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define model and parameters
model = SVC()
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
# define grid search
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.974333 using {'C': 1.0, 'gamma': 'scale', 'kernel': 'poly'}
0.973667 (0.012512) with: {'C': 50, 'gamma': 'scale', 'kernel': 'poly'}
0.970667 (0.018062) with: {'C': 50, 'gamma': 'scale', 'kernel': 'rbf'}
0.945333 (0.024594) with: {'C': 50, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.973667 (0.012512) with: {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
0.970667 (0.018062) with: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.957000 (0.016763) with: {'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.974333 (0.012565) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'poly'}
0.971667 (0.016948) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}
0.966333 (0.016224) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.972333 (0.013585) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'poly'}
0.974000 (0.013317) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}
0.971667 (0.015934) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.972333 (0.013585) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'poly'}
0.973667 (0.014716) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'rbf'}
0.974333 (0.013828) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'sigmoid'}

Bagged Decision Trees (Bagging)

The most important parameter for bagged decision trees is the number of trees (n_estimators).

Ideally, this should be increased until no further improvement is seen in the model.

Good values might be a log scale from 10 to 1,000.

  • n_estimators in [10, 100, 1000]
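One quick way to judge when adding more trees stops helping (a hedged alternative to the full grid search that follows) is to monitor the out-of-bag estimate as n_estimators grows; the tree counts below are assumptions.

# sketch: use the out-of-bag score to see when more trees stop improving the model
from sklearn.datasets import make_blobs
from sklearn.ensemble import BaggingClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
for n in [10, 50, 100, 500, 1000]:
    # oob_score=True estimates accuracy from the samples left out of each bootstrap
    model = BaggingClassifier(n_estimators=n, oob_score=True, random_state=1)
    model.fit(X, y)
    print('n_estimators=%d, OOB accuracy=%.3f' % (n, model.oob_score_))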

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for BaggingClassifier on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for BaggingClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = BaggingClassifier()
n_estimators = [10, 100, 1000]
# define grid search
grid = dict(n_estimators=n_estimators)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.873667 using {'n_estimators': 1000}
0.839000 (0.038588) with: {'n_estimators': 10}
0.869333 (0.030434) with: {'n_estimators': 100}
0.873667 (0.035070) with: {'n_estimators': 1000}

Random Forest

The most important parameter is the number of random features to sample at each split point (max_features).

You could try a range of integer values, such as 1 to 20, or 1 to half the number of input features.

  • max_features in [1 to 20]

Alternately, you could try a suite of built-in heuristics that compute the value from the number of input features.

  • max_features in [‘sqrt’, ‘log2’]
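If you prefer to search the integer range described above rather than the heuristics, a minimal sketch is given below; capping the range at 20 is an assumption, and the synthetic dataset matches the example that follows.

# sketch: grid search integer values of max_features for a random forest
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define model and grid over integer max_features values
model = RandomForestClassifier(n_estimators=100)
grid = dict(max_features=range(1, 21))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')
print(grid_search.fit(X, y).best_params_)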

Another important parameter for random forest is the number of trees (n_estimators).

Ideally, this should be increased until no further improvement is seen in the model.

Good values might be a log scale from 10 to 1,000.

  • n_estimators in [10, 100, 1000]

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for RandomForestClassifier on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for RandomForestClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = RandomForestClassifier()
n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']
# define grid search
grid = dict(n_estimators=n_estimators,max_features=max_features)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.952000 using {'max_features': 'log2', 'n_estimators': 1000}
0.841000 (0.032078) with: {'max_features': 'sqrt', 'n_estimators': 10}
0.938333 (0.020830) with: {'max_features': 'sqrt', 'n_estimators': 100}
0.944667 (0.024998) with: {'max_features': 'sqrt', 'n_estimators': 1000}
0.817667 (0.033235) with: {'max_features': 'log2', 'n_estimators': 10}
0.940667 (0.021592) with: {'max_features': 'log2', 'n_estimators': 100}
0.952000 (0.019562) with: {'max_features': 'log2', 'n_estimators': 1000}

Stochastic Gradient Boosting

Stochastic gradient boosting is also called the Gradient Boosting Machine (GBM), or it may be referred to by the name of a specific implementation, such as XGBoost.

The gradient boosting algorithm has many parameters to tune.

There are some parameter pairings that are important to consider. The first is the learning rate, also called shrinkage or eta (learning_rate), and the number of trees in the model (n_estimators). Both could be considered on a log scale, although in different directions.

  • learning_rate in [0.001, 0.01, 0.1]
  • n_estimators in [10, 100, 1000]

Another pairing is the fraction of rows, or subset of the data, to consider for each tree (subsample) and the depth of each tree (max_depth). These could be grid searched in intervals of 0.1 and 1 respectively, although common values can be tested directly.

  • subsample in [0.5, 0.7, 1.0]
  • max_depth in [3, 7, 9]

For more detailed advice on tuning the XGBoost implementation, see:

For the full list of hyperparameters, see:

The example below demonstrates grid searching the key hyperparameters for GradientBoostingClassifier on a synthetic binary classification dataset.

# example of grid searching key hyperparameters for GradientBoostingClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# define models and parameters
model = GradientBoostingClassifier()
n_estimators = [10, 100, 1000]
learning_rate = [0.001, 0.01, 0.1]
subsample = [0.5, 0.7, 1.0]
max_depth = [3, 7, 9]
# define grid search
grid = dict(learning_rate=learning_rate, n_estimators=n_estimators, subsample=subsample, max_depth=max_depth)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

Best: 0.936667 using {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.5}
0.803333 (0.042058) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.5}
0.783667 (0.042386) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.7}
0.711667 (0.041157) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 1.0}
0.832667 (0.040244) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.5}
0.809667 (0.040040) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
0.741333 (0.043261) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
0.881333 (0.034130) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.5}
0.866667 (0.035150) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.7}
0.838333 (0.037424) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 1.0}
0.838333 (0.036614) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 0.5}
0.821667 (0.040586) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 0.7}
0.729000 (0.035903) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 1.0}
0.884667 (0.036854) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.5}
0.871333 (0.035094) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.7}
0.729000 (0.037625) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 1.0}
0.905667 (0.033134) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 1000, 'subsample': 0.5}
...

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the top hyperparameters and how to configure them for top machine learning algorithms.

Do you have other hyperparameter suggestions? Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



Best Results for Standard Machine Learning Datasets


It is important that beginner machine learning practitioners practice on small real-world datasets.

So-called standard machine learning datasets contain actual observations, fit into memory, and are well studied and well understood. As such, they can be used by beginner practitioners to quickly test, explore, and practice data preparation and modeling techniques.

A practitioner can confirm whether they have the data skills required to achieve a good result on a standard machine learning dataset. A good result is one that is above the 80th or 90th percentile of what may be technically possible for a given dataset.

The skills developed by practitioners on standard machine learning datasets can provide the foundation for tackling larger, more challenging projects.

In this post, you will discover standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.

After reading this post, you will know:

  • The importance of standard machine learning datasets.
  • How to systematically evaluate a model on a standard machine learning dataset.
  • Standard datasets for classification and regression and the baseline and good performance expected on each.

Let’s get started.

  • Updated Jun/2020: Added improved results for the glass and horse colic datasets.
Results for Standard Classification and Regression Machine Learning Datasets
Photo by Don Dearing, some rights reserved.

Overview

This tutorial is divided into seven parts; they are:

  1. Value of Small Machine Learning Datasets
  2. Definition of a Standard Machine Learning Dataset
  3. Standard Machine Learning Datasets
  4. Good Results for Standard Datasets
  5. Model Evaluation Methodology
  6. Results for Classification Datasets
    1. Binary Classification Datasets
      1. Ionosphere
      2. Pima Indian Diabetes
      3. Sonar
      4. Wisconsin Breast Cancer
      5. Horse Colic
    2. Multiclass Classification Datasets
      1. Iris Flowers
      2. Glass
      3. Wine
      4. Wheat Seeds
  7. Results for Regression Datasets
    1. Housing
    2. Auto Insurance
    3. Abalone
    4. Auto Imports

Value of Small Machine Learning Datasets

There are a number of small machine learning datasets for classification and regression predictive modeling problems that are frequently reused.

Sometimes the datasets are used as the basis for demonstrating a machine learning or data preparation technique. Other times, they are used as a basis for comparing different techniques.

These datasets were collected and made publicly available in the early days of applied machine learning when data and real-world datasets were scarce. As such, they have become standard, or canonical, through their wide adoption and reuse alone, not because of any intrinsic interest in the problems themselves.

Finding a good model on one of these datasets does not mean you have “solved” the general problem. Also, some of the datasets may contain names or indicators that might be considered questionable or culturally insensitive (which was very likely not the intent when the data was collected). As such, they are also sometimes referred to as “toy” datasets.

Such datasets are not really useful as points of comparison between machine learning algorithms, as most empirical experiments are nearly impossible to reproduce.

Nevertheless, such datasets remain valuable in the field of applied machine learning today, even in the era of standard machine learning libraries, big data, and the abundance of data.

There are three main reasons why they are valuable; they are:

  1. The datasets are real.
  2. The datasets are small.
  3. The datasets are understood.

Real datasets are useful compared to contrived datasets because they are messy. There may be measurement errors, missing values, mislabeled examples, and more. Some or all of these issues must be found and addressed, and they are exactly the kinds of properties we encounter when working on our own projects.

Small datasets are useful compared to large datasets that may be many gigabytes in size. Small datasets fit easily into memory and allow many different data visualization, data preparation, and modeling algorithms to be tested and explored quickly. The speed of testing ideas and getting feedback is critical for beginners, and small datasets facilitate exactly this.

Understood datasets are useful as compared to new or newly created datasets. The features are well defined, the units of the features are specified, the source of the data is known, and the dataset has been well studied in tens, hundreds, and in some cases, thousands of research projects and papers. This provides a context in which results can be compared and evaluated, a property not available in entirely new domains.

Given these properties, I strongly advocate that machine learning beginners (and practitioners who are new to a specific technique) start with standard machine learning datasets.

Definition of a Standard Machine Learning Dataset

I would like to go one step further and define some more specific properties of a “standard” machine learning dataset.

A standard machine learning dataset has the following properties.

  • Less than 10,000 rows (samples).
  • Less than 100 columns (features).
  • Last column is the target variable.
  • Stored in a single file in CSV format, without a header line.
  • Missing values marked with a question mark character (‘?’).
  • It is possible to achieve a better than naive result.
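As a minimal sketch of loading a dataset in this format, using the ionosphere dataset from the examples later in the post (the na_values argument handles the ‘?’ marker):

# sketch: load a standard dataset (no header, '?' for missing, last column is the target)
from pandas import read_csv
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output columns
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)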

Now that we have a clear definition of a dataset, let’s look at what a “good” result means.

Standard Machine Learning Datasets

A dataset is a standard machine learning dataset if it is frequently used in books, research papers, tutorials, presentations, and more.

The best repository for these so-called classical or standard machine learning datasets is the University of California at Irvine (UCI) machine learning repository. This website categorizes datasets by type and provides downloads of the data, additional information about each dataset, and references to relevant papers.

I have chosen five or fewer datasets for each problem type as a starting point.

All standard datasets used in this post are available on GitHub here:

Download links are also provided for each dataset and for additional details about the dataset (the so-called “.names” file).

Each code example will automatically download a given dataset for you. If this is a problem, you can download the CSV file manually, place it in the same directory as the code example, then change the code example to use the filename instead of the URL.

For example:

...
# load dataset
dataframe = read_csv('ionosphere.csv', header=None)

Good Results for Standard Datasets

A challenge for beginners working with standard machine learning datasets is knowing what represents a good result.

In general, a model is skillful if it can demonstrate a performance that is better than a naive method, such as predicting the majority class in classification or the mean value in regression. This is called a baseline model or a baseline of performance that provides a relative measure of performance specific to a dataset. You can learn more about this here:

Given that we now have a method for determining whether a model has skill on a dataset, beginners are naturally interested in the upper limits of performance for a given dataset. This information is required to know whether you are “getting good” at the process of applied machine learning.

Good does not mean perfect predictions. All models will have prediction errors, and perfect predictions are rarely possible, or tractable, on real-world datasets.

Defining “good” or “best” results for a dataset is challenging because it depends generally on the model evaluation methodology and specifically on the versions of the dataset and libraries used in the evaluation.

Good means “good-enough” given available resources. Often, this means a skill score that is above the 80th or 90th percentile of what might be possible for a dataset given unbounded skill, time, and computational resources.

In this tutorial, you will discover how to calculate the baseline performance and “good” (near-best) performance that is possible on each dataset. You will also discover how to specify the data preparation and model used to achieve the performance.

Rather than explain how to do this, a short Python code example is given that you can use to reproduce the baseline and good result.

Model Evaluation Methodology

The evaluation methodology is simple and fast, and generally recommended when working with small predictive modeling problems.

The evaluation procedure is as follows:

  • A model is evaluated using 10-fold cross-validation.
  • The evaluation procedure is repeated three times.
  • The random seed for the cross-validation split is the repeat number (1, 2, or 3).

This results in 30 estimates of model performance from which a mean and standard deviation can be calculated to summarize the performance of a given model.

Using the repeat number as the seed for each cross-validation split ensures that each algorithm evaluated on the dataset gets the same splits of the data, ensuring a fair direct comparison.
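Interpreted literally, this procedure can be implemented with one StratifiedKFold per repeat, seeding each split with the repeat number. The sketch below is only an illustration; the model and synthetic dataset are placeholders.

# sketch: 3 repeats of 10-fold CV, seeding each split with the repeat number
from numpy import mean
from numpy import std
from sklearn.datasets import make_blobs
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# placeholder dataset and model
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
model = RidgeClassifier()
scores = list()
for repeat in [1, 2, 3]:
    # the repeat number is used as the random seed for the split
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
    scores.extend(cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1))
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))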

Using the scikit-learn Python machine learning library, the example below can be used to evaluate a given model (or Pipeline). The RepeatedStratifiedKFold class defines the number of folds and repeats for classification, and the cross_val_score() function defines the score and performs the evaluation and returns a list of scores from which a mean and standard deviation can be calculated.

...
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

For regression we can use the RepeatedKFold class and the MAE score.

...
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')

The “good” scores reported are the best that I can get out of my own personal set of “get a good result fast on a given dataset” scripts. I believe they represent good scores, perhaps in the 90th or 95th percentile of what is possible for each dataset, if not better.

That being said, I am not claiming that they are the best possible scores, as I have not performed hyperparameter tuning for the well-performing models. I leave this as an exercise for interested practitioners. Best scores are not required; achieving a top-percentile score on a given dataset is more than sufficient to demonstrate competence.

Note: I will update the results and models as I improve my own personal scripts and achieve better scores.

Can you get a better score for a dataset?
I would love to know. Share your model and score in the comments below and I will try to reproduce it and update the post (and give you full credit!)

Let’s dive in.

Results for Classification Datasets

Classification is a predictive modeling problem that predicts one label given one or more input variables.

The baseline model for classification tasks is a model that predicts the majority label. This can be achieved in scikit-learn using the DummyClassifier class with the ‘most_frequent‘ strategy; for example:

...
model = DummyClassifier(strategy='most_frequent')

The standard evaluation metric for classification models is classification accuracy, although this is not ideal for imbalanced and some multiclass problems. Nevertheless, for better or worse, this score will be used (for now).

Accuracy is reported as a fraction between 0 (0% or no skill) and 1 (100% or perfect skill).

There are two main types of classification tasks, binary and multiclass classification, depending on whether the number of labels to be predicted for a given dataset is two or more than two, respectively. Given the prevalence of classification tasks in machine learning, we will treat these two subtypes of classification problems separately.

Binary Classification Datasets

In this section, we will review the baseline and good performance on the following binary classification predictive modeling datasets:

  1. Ionosphere
  2. Pima Indian Diabetes
  3. Sonar
  4. Wisconsin Breast Cancer
  5. Horse Colic

Ionosphere

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Ionosphere
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVC(kernel='rbf', gamma='scale', C=10)
steps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (351, 34), (351,)
Baseline: 0.641 (0.006)
Good: 0.948 (0.033)

Pima Indian Diabetes

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Pima Indian Diabetes
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = LogisticRegression(solver='newton-cg',penalty='l2',C=1)
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Note: you may see some warnings, but they can be safely ignored.

Shape: (768, 8), (768,)
Baseline: 0.651 (0.003)
Good: 0.774 (0.055)

Sonar

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Sonar
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = KNeighborsClassifier(n_neighbors=2, metric='minkowski', weights='distance')
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (208, 60), (208,)
Baseline: 0.534 (0.012)
Good: 0.882 (0.071)

Wisconsin Breast Cancer

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wisconsin Breast Cancer
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer-wisconsin.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVC(kernel='sigmoid', gamma='scale', C=0.1)
steps = [('i',SimpleImputer(strategy='median')), ('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Note: you may see some warnings, but they can be safely ignored.

Shape: (699, 9), (699,)
Baseline: 0.655 (0.003)
Good: 0.973 (0.019)

Horse Colic

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Horse Colic
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RandomForestClassifier(n_estimators=1000)
imputer = SimpleImputer(strategy='median', add_indicator=True)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (300, 27), (300,)
Baseline: 0.670 (0.010)
Good: 0.878 (0.042)

Multiclass Classification Datasets

In this section, we will review the baseline and good performance on the following multiclass classification predictive modeling datasets:

  1. Iris Flowers
  2. Glass
  3. Wine
  4. Wheat Seeds

Iris Flowers

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Iris
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = LinearDiscriminantAnalysis()
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (150, 4), (150,)
Baseline: 0.333 (0.000)
Good: 0.980 (0.039)

Glass

The complete code example for achieving baseline and a good result on this dataset is listed below.

Note: The test harness was changed from 10-fold to 5-fold cross-validation to ensure each fold had examples of all classes and avoid warning messages.

# baseline and good result for Glass
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights, max_features=2)
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (214, 9), (214,)
Baseline: 0.355 (0.009)
Good: 0.815 (0.048)

Wine

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wine
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.dummy import DummyClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = QuadraticDiscriminantAnalysis()
steps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (178, 13), (178,)
Baseline: 0.399 (0.017)
Good: 0.992 (0.020)

Wheat Seeds

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wheat Seeds
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import RidgeClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wheat-seeds.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RidgeClassifier(alpha=0.2)
steps = [('s',StandardScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (210, 7), (210,)
Baseline: 0.333 (0.000)
Good: 0.973 (0.036)

Results for Regression Datasets

Regression is a predictive modeling problem that predicts a numerical value given one or more input variables.

The baseline model for regression tasks is a model that predicts the mean or median value. This can be achieved in scikit-learn using the DummyRegressor class with the ‘median‘ strategy; for example:

...
model = DummyRegressor(strategy='median')

The standard evaluation for regression models is mean absolute error (MAE), although this is not ideal for all regression problems. Nevertheless, for better or worse, this score will be used (for now).

MAE is reported as an error score between 0 (perfect skill) and a very large number or infinity (no skill).
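Note that cross_val_score returns the negated MAE when using the ‘neg_mean_absolute_error‘ scoring string (scikit-learn maximizes scores), which is why the examples below take the absolute value before reporting. A minimal illustration, using a contrived regression dataset as an assumption:

# sketch: neg_mean_absolute_error returns negated errors; take the absolute value to report MAE
from numpy import absolute
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
# contrived regression dataset for illustration only
X, y = make_regression(n_samples=100, n_features=5, noise=0.5, random_state=1)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DummyRegressor(strategy='median'), X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f' % mean(absolute(scores)))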

In this section, we will review the baseline and good performance on the following regression predictive modeling datasets:

  1. Housing
  2. Auto Insurance
  3. Abalone
  4. Auto Imports

Housing

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Housing
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.dummy import DummyRegressor
from xgboost import XGBRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = XGBRegressor(learning_rate=0.1, n_estimators=100, subsample=0.7, max_depth=9, colsample_bynode=0.6, objective='reg:squarederror')
m_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (506, 13), (506,)
Baseline: 6.660 (0.706)
Good: 1.955 (0.279)

Auto Insurance

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Auto Insurance
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import HuberRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = HuberRegressor(epsilon=1.0, alpha=0.001)
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
target = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
m_scores = cross_val_score(target, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (63, 1), (63,)
Baseline: 66.624 (19.303)
Good: 28.358 (9.747)

Abalone

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Abalone
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.svm import SVR
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
transform = ColumnTransformer(transformers=[('c', OneHotEncoder(), [0])], remainder='passthrough')
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVR(kernel='rbf',gamma='scale',C=10)
target = TransformedTargetRegressor(regressor=model, transformer=PowerTransformer(), check_inverse=False)
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',target)])
m_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (4177, 8), (4177,)
Baseline: 2.363 (0.116)
Good: 1.460 (0.075)

Auto Imports

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Auto Imports
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto_imports.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cat_ix = [2,3,4,5,6,7,8,14,15,17]
num_ix = [0,1,9,10,11,12,13,16,18,19,20,21,22,23,24]
steps = [('c', Pipeline(steps=[('s',SimpleImputer(strategy='most_frequent')),('oe',OneHotEncoder(handle_unknown='ignore'))]), cat_ix), ('n', SimpleImputer(strategy='median'), num_ix)]
transform = ColumnTransformer(transformers=steps, remainder='passthrough')
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RandomForestRegressor(n_estimators=100,max_features=10)
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',model)])
m_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (201, 25), (201,)
Baseline: 5880.718 (1197.967)
Good: 1405.420 (317.683)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Articles

Summary

In this post, you discovered standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.

Specifically, you learned:

  • The importance of standard machine learning datasets.
  • How to systematically evaluate a model on a standard machine learning dataset.
  • Standard datasets for classification and regression and the baseline and good performance expected on each.

Did I miss your favorite dataset?
Let me know in the comments and I will calculate a score for it, or perhaps even add it to this post.

Can you get a better score for a dataset?
I would love to know; share your model and score in the comments below and I will try to reproduce it and update the post (and give you full credit!)

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Best Results for Standard Machine Learning Datasets appeared first on Machine Learning Mastery.

4 Distance Measures for Machine Learning


Distance measures play an important role in machine learning.

They provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.

Different distance measures must be chosen and used depending on the types of the data. As such, it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

In this tutorial, you will discover distance measures in machine learning.

After completing this tutorial, you will know:

  • The role and importance of distance measures in machine learning algorithms.
  • How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
  • How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Let’s get started.

Distance Measures for Machine Learning
Photo by Prince Roy, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Role of Distance Measures
  2. Hamming Distance
  3. Euclidean Distance
  4. Manhattan Distance (Taxicab or City Block)
  5. Minkowski Distance

Role of Distance Measures

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way you will encounter distance measures is when you are using a specific machine learning algorithm that uses distance measures at its core. The most famous algorithm of this type is the k-nearest neighbors algorithm, or KNN for short.

In the KNN algorithm, a classification or regression prediction is made for new examples by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k examples in the training dataset with the smallest distance are then selected and a prediction is made by averaging the outcome (mode of the class label or mean of the real value for regression).
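To make that neighbor-selection step concrete, here is a minimal sketch (not the original KNN implementation; the training rows, labels, new row, and k=3 below are made up purely for illustration) that finds the k closest rows using Euclidean distance, covered later in this tutorial, and predicts the most common class label among them.

# minimal sketch of a knn prediction using a distance measure (illustrative only)
from math import sqrt
from collections import Counter

# euclidean distance between two rows
def euclidean_distance(a, b):
	return sqrt(sum((e1 - e2)**2 for e1, e2 in zip(a, b)))

# made-up training rows and class labels
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]]
y_train = [0, 0, 1, 1, 0]
# new row to classify
new_row = [1.4, 1.6]
# distance from the new row to every training row, paired with the label
distances = [(euclidean_distance(new_row, row), label) for row, label in zip(X_train, y_train)]
# select the k nearest neighbors
k = 3
neighbors = sorted(distances)[:k]
# predict the mode of the class labels among the neighbors
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print(prediction)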

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is the learning vector quantization, or LVQ, algorithm that may also be considered a type of neural network.

Related is the self-organizing map algorithm, or SOM, that also uses distance measures and can be used for supervised or unsupervised learning. Another unsupervised learning algorithm that uses distance measures at its core is the K-means clustering algorithm.

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

A short list of some of the more popular machine learning algorithms that use distance measures at their core is as follows:

  • K-Nearest Neighbors
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • K-Means Clustering

There are many kernel-based methods that may also be considered distance-based algorithms. Perhaps the most widely known kernel method is the support vector machine algorithm, or SVM for short.

Do you know more algorithms that use distance measures?
Let me know in the comments below.

When calculating the distance between two examples or rows of data, it is possible that different data types are used for different columns of the examples. An example might have real values, boolean values, categorical values, and ordinal values. Different distance measures may be required for each that are summed together into a single distance score.
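As a rough sketch of how this might look (the example rows, the split into categorical and numerical parts, and the equal weighting below are assumptions made up for illustration), we can sum a Hamming distance over the one-hot encoded bits and a Euclidean distance over the numerical values into a single score.

# combining distance measures across mixed column types (illustrative sketch)
from math import sqrt

# hamming distance for the one-hot encoded categorical part
def hamming_distance(a, b):
	return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# euclidean distance for the numerical part
def euclidean_distance(a, b):
	return sqrt(sum((e1 - e2)**2 for e1, e2 in zip(a, b)))

# two rows where the first three values are one-hot bits and the last two are numeric
row1 = [1, 0, 0, 0.2, 0.5]
row2 = [0, 1, 0, 0.3, 0.4]
# split each row into its categorical and numerical parts
cat1, num1 = row1[:3], row1[3:]
cat2, num2 = row2[:3], row2[3:]
# sum the per-type distances into a single score (equal weighting assumed)
dist = hamming_distance(cat1, cat2) + euclidean_distance(num1, num2)
print(dist)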

Numerical values may have different scales. This can greatly impact the calculation of distance measure and it is often a good practice to normalize or standardize numerical values prior to calculating the distance measure.

Numerical error in regression problems may also be considered a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance measure that can be summed or averaged over all examples in a test set to give a total distance between the expected and predicted outcomes in the dataset. The calculation of the error, such as the mean squared error or mean absolute error, may resemble a standard distance measure.

As we can see, distance measures play an important role in machine learning. Perhaps four of the most commonly used distance measures in machine learning are as follows:

  • Hamming Distance
  • Euclidean Distance
  • Manhattan Distance
  • Minkowski Distance

What are some other distance measures you have used or heard of?
Let me know in the comments below.

You need to know how to calculate each of these distance measures when implementing algorithms from scratch and the intuition for what is being calculated when using algorithms that make use of these distance measures.

Let’s take a closer look at each in turn.

Hamming Distance

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns of data.

For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one hot encode each example as a bitstring with one bit for each category.

  • red = [1, 0, 0]
  • green = [0, 1, 0]
  • blue = [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

For one-hot encoded strings, it might make more sense to summarize the distance as the sum of the bit differences between the strings, where each bit difference will always be a 0 or 1.

  • HammingDistance = sum for i to N abs(v1[i] – v2[i])

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a hamming distance score between 0 (identical) and 1 (all different).

  • HammingDistance = (sum for i to N abs(v1[i] – v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings, listed below.

# calculating hamming distance between bit strings

# calculate hamming distance
def hamming_distance(a, b):
	return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)

Running the example reports the Hamming distance between the two bitstrings.

We can see that there are two differences between the strings, or 2 out of 6 bit positions different, which averaged (2/6) is about 1/3 or 0.333.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

# calculating hamming distance between bit strings
from scipy.spatial.distance import hamming
# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333

Euclidean Distance

Euclidean distance calculates the distance between two real-valued vectors.

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.

Although there are other possible choices, most instance-based learners use Euclidean distance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
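To make the scaling advice above concrete, the small sketch below (an illustrative addition that assumes scikit-learn and SciPy are installed and uses two made-up rows; in practice the scaler would be fit on the full training dataset) applies min-max normalization before calculating the Euclidean distance.

# normalizing columns before calculating euclidean distance (illustrative sketch)
from numpy import array
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import euclidean
# two rows with columns on very different scales
data = array([[1000.0, 0.5],
              [2000.0, 0.7]])
# scale each column to the range [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
# the raw distance is dominated by the large first column
print(euclidean(data[0], data[1]))
# the distance after scaling treats both columns equally
print(euclidean(scaled[0], scaled[1]))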

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

  • EuclideanDistance = sqrt(sum for i to N (v1[i] – v2[i])^2)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores will have the same relative proportions after this modification and can still be used effectively within a machine learning algorithm for finding the most similar examples.

  • EuclideanDistance = sum for i to N (v1[i] – v2[i])^2

This calculation is related to the L2 vector norm and is equivalent to the sum squared error (and to the root sum squared error if the square root is added).
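Before moving on to the standard calculation, here is a minimal sketch of the squared (square-root-free) version described above; the helper function is made up for illustration, and, assuming SciPy is installed, the sqeuclidean() function should report the same value.

# calculating squared euclidean distance (no square root) between vectors
from scipy.spatial.distance import sqeuclidean

# squared euclidean distance without the square root
def squared_euclidean_distance(a, b):
	return sum((e1 - e2)**2 for e1, e2 in zip(a, b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate the squared distance manually and with scipy
print(squared_euclidean_distance(row1, row2))
print(sqeuclidean(row1, row2))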

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

# calculating euclidean distance between vectors
from math import sqrt

# calculate euclidean distance
def euclidean_distance(a, b):
	return sqrt(sum((e1-e2)**2 for e1, e2 in zip(a,b)))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean_distance(row1, row2)
print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example is listed below.

# calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

6.082762530298219

Manhattan Distance (Taxicab or City Block Distance)

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

  • ManhattanDistance = sum for i to N |v1[i] – v2[i]|

The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metrics.

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

# calculating manhattan distance between vectors

# calculate manhattan distance
def manhattan_distance(a, b):
	return sum(abs(e1-e2) for e1, e2 in zip(a,b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = manhattan_distance(row1, row2)
print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = cityblock(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

13

Minkowski Distance

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “order” or “p“, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

  • MinkowskiDistance = (sum for i to N (abs(v1[i] – v2[i]))^p)^(1/p)

Where “p” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

  • p=1: Manhattan distance.
  • p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “p” that can be tuned.

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

# calculating minkowski distance between vectors

# calculate minkowski distance
def minkowski_distance(a, b, p):
	return sum(abs(e1-e2)**p for e1, e2 in zip(a,b))**(1/p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example first calculates and prints the Minkowski distance with p set to 1 to give the Manhattan distance, then with p set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

13.0
6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

# calculating minkowski distance between vectors
from scipy.spatial import minkowski_distance
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

13.0
6.082762530298219

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

APIs

Articles

Summary

In this tutorial, you discovered distance measures in machine learning.

Specifically, you learned:

  • The role and importance of distance measures in machine learning algorithms.
  • How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
  • How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post 4 Distance Measures for Machine Learning appeared first on Machine Learning Mastery.

Train-Test Split for Evaluating Machine Learning Algorithms


The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is a fast and easy procedure to perform, the results of which allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, or in situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced.

In this tutorial, you will discover how to evaluate machine learning models using the train-test split.

After completing this tutorial, you will know:

  • The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
  • How to use the scikit-learn machine learning library to perform the train-test split procedure.
  • How to evaluate machine learning algorithms for classification and regression using the train-test split.

Let’s get started.

Train-Test Split for Evaluating Machine Learning Algorithms
Photo by Paul VanDerWerf, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Train-Test Split Evaluation
    1. When to Use the Train-Test Split
    2. How to Configure the Train-Test Split
  2. Train-Test Split Procedure in Scikit-Learn
    1. Repeatable Train-Test Splits
    2. Stratified Train-Test Splits
  3. Train-Test Split to Evaluate Machine Learning Models
    1. Train-Test Split for Classification
    2. Train-Test Split for Regression

Train-Test Split Evaluation

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

  • Train Dataset: Used to fit the machine learning model.
  • Test Dataset: Used to evaluate the fit machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.

The train-test procedure is appropriate when there is a sufficiently large dataset available.

When to Use the Train-Test Split

The idea of “sufficiently large” is specific to each predictive modeling problem. It means that there is enough data to split the dataset into train and test datasets and each of the train and test datasets are suitable representations of the problem domain. This requires that the original dataset is also a suitable representation of the problem domain.

A suitable representation of the problem domain means that there are enough records to cover all common cases and most uncommon cases in the domain. This might mean combinations of input variables observed in practice. It might require thousands, hundreds of thousands, or millions of examples.

Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).

If you have insufficient data, then a suitable alternate model evaluation procedure would be the k-fold cross-validation procedure.

In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.

Some models are very costly to train, and in that case, repeated evaluation used in other procedures is intractable. An example might be deep neural network models. In this case, the train-test procedure is commonly used.

Alternately, a project may have an efficient model and a vast dataset, although it may require an estimate of model performance quickly. Again, the train-test split procedure is used in this situation.

Samples from the original training dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.

How to Configure the Train-Test Split

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a proportion between 0 and 1 for either the train or the test dataset. For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

  • Computational cost in training the model.
  • Computational cost in evaluating the model.
  • Training set representativeness.
  • Test set representativeness.

Nevertheless, common split percentages include:

  • Train: 80%, Test: 20%
  • Train: 67%, Test: 33%
  • Train: 50%, Test: 50%

Now that we are familiar with the train-test split model evaluation procedure, let’s look at how we can use this procedure in Python.

Train-Test Split Procedure in Scikit-Learn

The scikit-learn Python machine learning library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

The function takes a loaded dataset as input and returns the dataset split into two subsets.

...
# split into train test sets
train, test = train_test_split(dataset, ...)

Ideally, you can split your original dataset into input (X) and output (y) columns, then call the function passing both arrays and have them split appropriately into train and test subsets.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)

The size of the split can be specified via the “test_size” argument that takes a number of rows (integer) or a percentage (float) of the size of the dataset between 0 and 1.

The latter is the most common, with values used such as 0.33 where 33 percent of the dataset will be allocated to the test set and 67 percent will be allocated to the training set.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can demonstrate this using a synthetic classification dataset with 1,000 examples.

The complete example is listed below.

# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=1000)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset into train and test sets, then prints the size of the new dataset.

We can see that 670 examples (67 percent) were allocated to the training set and 330 examples (33 percent) were allocated to the test set, as we specified.

(670, 2) (330, 2) (670,) (330,)

Alternatively, the dataset can be split by specifying the “train_size” argument that can be either a number of rows (integer) or a percentage of the original dataset between 0 and 1, such as 0.67 for 67 percent.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)

Repeatable Train-Test Splits

Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset. If you are new to pseudo-random number generators, see the tutorial:

This can be achieved by setting the “random_state” to an integer value. Any value will do; it is not a tunable hyperparameter.

...
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

The example below demonstrates this and shows that two separate splits of the data result in the same result.

# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])

Running the example splits the dataset and prints the first five rows of the training dataset.

The dataset is split again and the first five rows of the training dataset are printed showing identical values, confirming that when we fix the seed for the pseudorandom number generator, we get an identical split of the original dataset.

[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]

[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]

Stratified Train-Test Splits

One final consideration is for classification problems only.

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

We can demonstrate this with an example of a classification dataset with 94 examples in one class and six examples in a second class.

First, we can split the dataset into train and test sets without the “stratify” argument. The complete example is listed below.

# split imbalanced dataset into train and test sets without stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
print(Counter(y_train))
print(Counter(y_test))

Running the example first reports the composition of the dataset by class label, showing the expected 94 percent vs. 6 percent.

Then the dataset is split and the composition of the train and test sets is reported. We can see that the train set has 45/5 examples while the test set has 49/1 examples. The compositions of the train and test sets differ, and this is not desirable.

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})

Next, we can stratify the train-test split and compare the results.

# split imbalanced dataset into train and test sets with stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))

Given that we have used a 50 percent split for the train and test sets, we would expect both the train and test sets to have 47/3 examples of the majority/minority classes respectively.

Running the example, we can see that in this case, the stratified version of the train-test split has created both the train and test datasets with the expected 47/3 split of examples.

Counter({0: 94, 1: 6})
Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})

Now that we are familiar with the train_test_split() function, let’s look at how we can use it to evaluate a machine learning model.

Train-Test Split to Evaluate Machine Learning Models

In this section, we will explore using the train-test split procedure to evaluate machine learning models on standard classification and regression predictive modeling datasets.

Train-Test Split for Classification

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the sonar dataset.

The sonar dataset is a standard machine learning dataset composed of 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

Then use the fit model to make predictions and evaluate the predictions using the classification accuracy performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the sonar dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The dataset is split into train and test sets and we can see that there are 139 rows for training and 69 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data has an accuracy of about 78.3 percent.

(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783

Train-Test Split for Regression

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the housing dataset.

The housing dataset is a standard machine learning dataset composed of 506 rows of data with 13 numerical input variables and a numerical target variable.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)

Running the example confirms the 506 rows of data and 13 input variables and single numeric target variables (14 in total).

(506, 14)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)

Then use the fit model to make predictions and evaluate the predictions using the mean absolute error (MAE) performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the housing dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The dataset is split into train and test sets and we can see that there are 339 rows for training and 167 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data is a mean absolute error of about 2.157 (thousands of dollars).

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
MAE: 2.157

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to evaluate machine learning models using the train-test split.

Specifically, you learned:

  • The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
  • How to use the scikit-learn machine learning library to perform the train-test split procedure.
  • How to evaluate machine learning algorithms for classification and regression using the train-test split.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Train-Test Split for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

LOOCV for Evaluating Machine Learning Algorithms


The Leave-One-Out Cross-Validation, or LOOCV, procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is a computationally expensive procedure to perform, although it results in a reliable and unbiased estimate of model performance. Although it is simple to use and has no configuration to specify, there are times when the procedure should not be used, such as when you have a very large dataset or a computationally expensive model to evaluate.

In this tutorial, you will discover how to evaluate machine learning models using leave-one-out cross-validation.

After completing this tutorial, you will know:

  • The leave-one-out cross-validation procedure is appropriate when you have a small dataset or when an accurate estimate of model performance is more important than the computational cost of the method.
  • How to use the scikit-learn machine learning library to perform the leave-one-out cross-validation procedure.
  • How to evaluate machine learning algorithms for classification and regression using leave-one-out cross-validation.

Let’s get started.

LOOCV for Evaluating Machine Learning Algorithms
Photo by Heather Harvey, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. LOOCV Model Evaluation
  2. LOOCV Procedure in Scikit-Learn
  3. LOOCV to Evaluate Machine Learning Models
    1. LOOCV for Classification
    2. LOOCV for Regression

LOOCV Model Evaluation

Cross-validation, or k-fold cross-validation, is a procedure used to estimate the performance of a machine learning algorithm when making predictions on data not used during the training of the model.

The cross-validation has a single hyperparameter “k” that controls the number of subsets that a dataset is split into. Once split, each subset is given the opportunity to be used as a test set while all other subsets together are used as a training dataset.

This means that k-fold cross-validation involves fitting and evaluating k models. This, in turn, provides k estimates of a model’s performance on the dataset, which can be reported using summary statistics such as the mean and standard deviation. This score can then be used to compare and ultimately select a model and configuration to use as the “final model” for a dataset.

Typical values for k are k=3, k=5, and k=10, with 10 representing the most common value. This is because, given extensive testing, 10-fold cross-validation provides a good balance of low computational cost and low bias in the estimate of model performance as compared to other k values and a single train-test split.

For more on k-fold cross-validation, see the tutorial:

Leave-one-out cross-validation, or LOOCV, is a configuration of k-fold cross-validation where k is set to the number of examples in the dataset.

LOOCV is an extreme version of k-fold cross-validation that has the maximum computational cost. It requires one model to be created and evaluated for each example in the training dataset.

The benefit of so many fit and evaluated models is a more robust estimate of model performance as each row of data is given an opportunity to represent the entirety of the test dataset.

Given the computational cost, LOOCV is not appropriate for very large datasets such as more than tens or hundreds of thousands of examples, or for models that are costly to fit, such as neural networks.

  • Don’t Use LOOCV: Large datasets or costly models to fit.

Given the improved estimate of model performance, LOOCV is appropriate when an accurate estimate of model performance is critical. This is particularly the case when the dataset is small, such as fewer than thousands of examples, where the limited data can lead to model overfitting during training and biased estimates of model performance.

Further, given that no sampling of the training dataset is used, this estimation procedure is deterministic, unlike train-test splits and other k-fold cross-validation configurations that provide a stochastic estimate of model performance.

  • Use LOOCV: Small datasets or when estimated model performance is critical.

Once models have been evaluated using LOOCV and a final model and configuration chosen, a final model is then fit on all available data and used to make predictions on new data.
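As a minimal sketch of that final step (an illustrative addition; the new row of input values below is made up), we can fit a model on all available data and use it to predict a single new example.

# fit a final model on all data and predict a new example (illustrative sketch)
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
# create the full dataset
X, y = make_blobs(n_samples=100, random_state=1)
# fit the final model on all available data
model = RandomForestClassifier(random_state=1)
model.fit(X, y)
# define a new, made-up row of input data
new_row = [[-5.0, -10.0]]
# predict the class label for the new row
print(model.predict(new_row))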

Now that we are familiar with the LOOCV procedure, let’s look at how we can use the method in Python.

LOOCV Procedure in Scikit-Learn

The scikit-learn Python machine learning library provides an implementation of the LOOCV via the LeaveOneOut class.

The method has no configuration; therefore, no arguments are provided to create an instance of the class.

...
# create loocv procedure
cv = LeaveOneOut()

Once created, the split() function can be called and provided the dataset to enumerate.

Each iteration will return the row indices that can be used for the train and test sets from the provided dataset.

...
for train_ix, test_ix in cv.split(X):
	...

These indices can be used on the input (X) and output (y) columns of the dataset array to split the dataset.

...
# split data
X_train, X_test = X[train_ix, :], X[test_ix, :]
y_train, y_test = y[train_ix], y[test_ix]

The training set can be used to fit a model and the test set can be used to evaluate it by first making a prediction and calculating a performance metric on the predicted values versus the expected values.

...
# fit model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# evaluate model
yhat = model.predict(X_test)

Scores can be saved from each evaluation and a final mean estimate of model performance can be presented.

We can tie this together and demonstrate how to use LOOCV to evaluate a RandomForestClassifier model for a synthetic binary classification dataset created with the make_blobs() function.

The complete example is listed below.

# loocv to manually evaluate the performance of a random forest classifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create loocv procedure
cv = LeaveOneOut()
# enumerate splits
y_true, y_pred = list(), list()
for train_ix, test_ix in cv.split(X):
	# split data
	X_train, X_test = X[train_ix, :], X[test_ix, :]
	y_train, y_test = y[train_ix], y[test_ix]
	# fit model
	model = RandomForestClassifier(random_state=1)
	model.fit(X_train, y_train)
	# evaluate model
	yhat = model.predict(X_test)
	# store
	y_true.append(y_test[0])
	y_pred.append(yhat[0])
# calculate accuracy
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)

Running the example manually estimates the performance of the random forest classifier on the synthetic dataset.

Given that the dataset has 100 examples, it means that 100 train/test splits of the dataset were created, with each single row of the dataset given an opportunity to be used as the test set. Similarly, 100 models are created and evaluated.

The classification accuracy across all predictions is then reported, in this case as 99 percent.

Accuracy: 0.990

A downside of enumerating the folds manually is that it is slow and involves a lot of code that could introduce bugs.

An alternative to evaluating a model using LOOCV is to use the cross_val_score() function.

This function takes the model, the dataset, and the instantiated LOOCV object set via the “cv” argument. A sample of accuracy scores is then returned that can be summarized by calculating the mean and standard deviation.

We can also set the “n_jobs” argument to -1 to use all CPU cores, greatly decreasing the computational cost in fitting and evaluating so many models.

The example below demonstrates evaluating the RandomForestClassifier using LOOCV on the same synthetic dataset using the cross_val_score() function.

# loocv to automatically evaluate the performance of a random forest classifier
from numpy import mean
from numpy import std
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestClassifier(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example automatically estimates the performance of the random forest classifier on the synthetic dataset.

The mean classification accuracy across all folds matches our manual estimate previously.

Accuracy: 0.990 (0.099)

Now that we are familiar with how to use the LeaveOneOut class, let’s look at how we can use it to evaluate a machine learning model on real datasets.

LOOCV to Evaluate Machine Learning Models

In this section, we will explore using the LOOCV procedure to evaluate machine learning models on standard classification and regression predictive modeling datasets.

LOOCV for Classification

We will demonstrate how to use LOOCV to evaluate a random forest algorithm on the sonar dataset.

The sonar dataset is a standard machine learning dataset comprising 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We can now evaluate a model using LOOCV.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we define the LOOCV procedure.

...
# create loocv procedure
cv = LeaveOneOut()

We can then define the model to evaluate.

...
# create model
model = RandomForestClassifier(random_state=1)

Then use the cross_val_score() function to enumerate the folds, fit models, then make and evaluate predictions. We can then report the mean and standard deviation of model performance.

...
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example is listed below.

# loocv evaluate random forest on the sonar dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestClassifier(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The model is then evaluated using LOOCV and the estimated performance when making predictions on new data has an accuracy of about 82.2 percent.

(208, 60) (208,)
Accuracy: 0.822 (0.382)

LOOCV for Regression

We will demonstrate how to use LOOCV to evaluate a random forest algorithm on the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)

Running the example confirms the 506 rows of data and 13 input variables and single numeric target variables (14 in total).

(506, 14)

We can now evaluate a model using LOOCV.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we define the LOOCV procedure.

...
# create loocv procedure
cv = LeaveOneOut()

We can then define the model to evaluate.

...
# create model
model = RandomForestRegressor(random_state=1)

Then use the cross_val_score() function to enumerate the folds, fit models, then make and evaluate predictions. We can then report the mean and standard deviation of model performance.

In this case, we use the mean absolute error (MAE) performance metric appropriate for regression.

...
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example is listed below.

# loocv evaluate random forest on the housing dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestRegressor(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The model is evaluated using LOOCV and the performance of the model when making predictions on new data is a mean absolute error of about 2.180 (thousands of dollars).

(506, 13) (506,)
MAE: 2.180 (2.346)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Summary

In this tutorial, you discovered how to evaluate machine learning models using leave-one-out cross-validation.

Specifically, you learned:

  • The leave-one-out cross-validation procedure is appropriate when you have a small dataset or when an accurate estimate of model performance is more important than the computational cost of the method.
  • How to use the scikit-learn machine learning library to perform the leave-one-out cross-validation procedure.
  • How to evaluate machine learning algorithms for classification and regression using leave-one-out cross-validation.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post LOOCV for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

Nested Cross-Validation for Machine Learning with Python

The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.

In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.

After completing this tutorial, you will know:

  • Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
  • Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
  • How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Let’s get started.

Nested Cross-Validation for Machine Learning with Python

Nested Cross-Validation for Machine Learning with Python
Photo by Andrew Bone, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Combined Hyperparameter Tuning and Model Selection
  2. What Is Nested Cross-Validation
  3. Nested Cross-Validation With Scikit-Learn

Combined Hyperparameter Tuning and Model Selection

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.

For more on the k-fold cross-validation procedure, see the tutorial:

The procedure provides an estimate of the model performance on the dataset when making predictions on data not used during training. It is less biased than some other techniques, such as a single train-test split, for small- to modestly-sized datasets. Common values for k are k=3, k=5, and k=10.

Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics on how to configure the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that perform well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.

This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.

The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.

Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset. Specifically, an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.

A procedure is required that allows both the selection of well-performing hyperparameters for the dataset and the selection among a collection of well-configured models on the dataset.

One approach to this problem is called nested cross-validation.

What Is Nested Cross-Validation

Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.

In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.

As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”

Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”

Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.

This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.

In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

What Is the Cost of Nested Cross-Validation?

A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.

If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to k * n * k as the search is repeated once for each of the k folds in the outer loop of nested cross-validation.

To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 5,000 models. A 10x increase in this case.
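
As a minimal sketch, the arithmetic above can be written out directly. The values below are the illustrative numbers from this section (100 candidate configurations, an inner k of 5, an outer k of 10) and are assumptions for demonstration only.

# back-of-the-envelope count of model fits for nested cross-validation
# (illustrative values only)
n_configs = 100  # candidate hyperparameter combinations
k_inner = 5      # folds used by the inner hyperparameter search
k_outer = 10     # folds used by the outer loop of nested cross-validation
# a traditional search fits one model per configuration per inner fold
traditional_fits = n_configs * k_inner
# nested cross-validation repeats that search once for every outer fold
nested_fits = k_outer * n_configs * k_inner
print('Traditional search: %d model fits' % traditional_fits)
print('Nested cross-validation: %d model fits' % nested_fits)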

How Do You Set k?

The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.

You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) and unbiased estimate of model performance.

It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.
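
As a minimal sketch, and assuming the scikit-learn KFold class used throughout this tutorial, the two loops might be configured with different k values as follows:

...
# configure the outer loop with a larger k, e.g. k=10
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# configure the inner loop with a smaller k, e.g. k=3, to limit the computational cost
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)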

How Do You Configure the Final Model?

The final model is configured and fit by applying the procedure used inside the outer loop to the entire dataset, as follows:

  1. An algorithm is selected based on its performance on the outer loop of nested cross-validation.
  2. The inner procedure (the hyperparameter search) is then applied to the entire dataset.
  3. The hyperparameters found during this final search are then used to configure a final model.
  4. The final model is fit on the entire dataset.

This model can then be used to make predictions on new data. We know roughly how well it will perform on average based on the estimate provided by the outer loop of the nested cross-validation procedure.
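
The sketch below shows one way this final step might look in scikit-learn, assuming a GridSearchCV search object configured with refit=True and a dataset X, y like those used in the examples that follow; it is an illustration of the procedure rather than a prescribed implementation.

...
# apply the inner procedure (the hyperparameter search) to the entire dataset
result = search.fit(X, y)
# report the best hyperparameters found using all available data
print('Best hyperparameters: %s' % result.best_params_)
# with refit=True, the best configuration is refit on the entire dataset
final_model = result.best_estimator_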

Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.

Nested Cross-Validation With Scikit-Learn

The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.

The class is configured with the number of folds (splits), then the split() function is called, passing in the dataset. The results of the split() function are enumerated to give the row indexes for the train and test sets for each fold.

For example:

...
# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate the folds of the outer cross-validation procedure
for train_ix, test_ix in cv_outer.split(X):
	# split data
	X_train, X_test = X[train_ix, :], X[test_ix, :]
	y_train, y_test = y[train_ix], y[test_ix]
	# fit and evaluate a model
	...

This class can be used to perform the outer loop of the nested cross-validation procedure.

The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The search is configured by creating the class and specifying the model, the hyperparameters to search, and the cross-validation procedure; the dataset is provided later when the fit() function is called.

For example:

...
# configure the cross-validation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define search space
space = dict()
...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)

These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.

We can tie these elements together and implement the nested cross-validation procedure.

Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.

...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_

This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.

...
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.

We will keep things simple and tune just two hyperparameters with three values each, e.g. (3 * 3) or 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.

The complete example is listed below.

# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
	# split data
	X_train, X_test = X[train_ix, :], X[test_ix, :]
	y_train, y_test = y[train_ix], y[test_ix]
	# configure the cross-validation procedure
	cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
	# define the model
	model = RandomForestClassifier(random_state=1)
	# define search space
	space = dict()
	space['n_estimators'] = [10, 100, 500]
	space['max_features'] = [2, 4, 6]
	# define search
	search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
	# execute search
	result = search.fit(X_train, y_train)
	# get the best performing model fit on the whole training set
	best_model = result.best_estimator_
	# evaluate model on the hold out dataset
	yhat = best_model.predict(X_test)
	# evaluate the model
	acc = accuracy_score(y_test, yhat)
	# store the result
	outer_results.append(acc)
	# report progress
	print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))

Running the example evaluates random forest using nested cross-validation on a synthetic classification dataset.

You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.

Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.

This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.

A final mean classification accuracy is then reported.

>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}
>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}
>acc=0.900, est=0.937, cfg={'max_features': 4, 'n_estimators': 500}
Accuracy: 0.927 (0.019)

A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. It can be applied to the configured GridSearchCV directly, which will automatically refit the best performing model on each outer training set and evaluate it on the corresponding test set from the outer loop.

This greatly reduces the amount of code required to perform the nested cross-validation.

The complete example is listed below.

# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.

Accuracy: 0.927 (0.019)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

APIs

Summary

In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.

Specifically, you learned:

  • Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
  • Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
  • How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Nested Cross-Validation for Machine Learning with Python appeared first on Machine Learning Mastery.

How to Configure k-Fold Cross-Validation

The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.

A common value for k is 10, although how do we know that this configuration is appropriate for our dataset and our algorithms?

One approach is to explore the effect of different k values on the estimate of model performance and compare this to an ideal test condition. This can help to choose an appropriate value for k.

Once a k-value is chosen, it can be used to evaluate a suite of different algorithms on the dataset and the distribution of results can be compared to an evaluation of the same algorithms using an ideal test condition to see if they are highly correlated or not. If correlated, it confirms the chosen configuration is a robust approximation for the ideal test condition.

In this tutorial, you will discover how to configure and evaluate configurations of k-fold cross-validation.

After completing this tutorial, you will know:

  • How to evaluate a machine learning algorithm using k-fold cross-validation on a dataset.
  • How to perform a sensitivity analysis of k-values for k-fold cross-validation.
  • How to calculate the correlation between a cross-validation test harness and an ideal test condition.

Let’s get started.

How to Configure k-Fold Cross-Validation

How to Configure k-Fold Cross-Validation
Photo by Patricia Farrell, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. k-Fold Cross-Validation
  2. Sensitivity Analysis for k
  3. Correlation of Test Harness With Target

k-Fold Cross-Validation

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.

For more on the k-fold cross-validation procedure, see the tutorial:

The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.

First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.

The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 100 samples each with 20 input features, 15 of which contribute to the target variable.

The example below creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it contains 100 samples and 20 input variables.

The fixed seed for the pseudorandom number generator ensures that we get the same samples each time the dataset is generated.

(100, 20) (100,)

Next, we can evaluate a model on this dataset using k-fold cross-validation.

We will evaluate a LogisticRegression model and use the KFold class to perform the cross-validation, configured to shuffle the dataset and set k=10, a popular default.

The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.

The complete example is listed below.

# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation. The mean classification accuracy on the dataset is then reported.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an estimated classification accuracy of about 85.0 percent.

Accuracy: 0.850 (0.128)

Now that we are familiar with k-fold cross-validation, let’s look at how we might configure the procedure.

Sensitivity Analysis for k

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds into which a given dataset is split.

Common values are k=3, k=5, and k=10, and by far the most popular value used in applied machine learning to evaluate models is k=10. The reason for this is that empirical studies found k=10 to provide a good trade-off between low computational cost and low bias in the estimate of model performance.

How do we know what value of k to use when evaluating models on our own dataset?

You can choose k=10, but how do you know this makes sense for your dataset?

One approach to answering this question is to perform a sensitivity analysis for different k values. That is, evaluate the performance of the same model on the same dataset with different values of k and see how they compare.

The expectation is that low values of k will result in a noisy estimate of model performance and large values of k will result in a less noisy estimate of model performance.

But noisy compared to what?

We don’t know the true performance of the model when making predictions on new/unseen data, as we don’t have access to new/unseen data. If we did, we would make use of it in the evaluation of the model.

Nevertheless, we can choose a test condition that represents an “ideal,” or as close to ideal as we can achieve, estimate of model performance.

One approach would be to train the model on all available data and estimate the performance on a separate large and representative hold-out dataset. The performance on this hold-out dataset would represent the “true” performance of the model and any cross-validation performances on the training dataset would represent an estimate of this score.

This is rarely possible as we often do not have enough data to hold some back and use it as a test set. Kaggle machine learning competitions are one exception to this, where we do have a hold-out test set, a sample of which is evaluated via submissions.

Instead, we can simulate this case using leave-one-out cross-validation (LOOCV), a computationally expensive version of cross-validation where k=N, and N is the total number of examples in the training dataset. That is, each sample in the training set is given an opportunity to be used alone as the test dataset. It is rarely used for large datasets as it is computationally expensive, although it can provide a good estimate of model performance given the available data.

We can then compare the mean classification accuracy for different k values to the mean classification accuracy from LOOCV on the same dataset. The difference between the scores provides a rough proxy for how well a k value approximates the ideal model evaluation test condition.
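
As an aside, LOOCV can be viewed as k-fold cross-validation with k set to the number of rows in the dataset. The small sketch below uses a tiny made-up dataset (an assumption for illustration only) to show that LeaveOneOut and KFold configured this way produce the same number of splits.

# show that LOOCV is k-fold cross-validation with k equal to the number of rows
from numpy import array
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
# tiny made-up dataset with six rows (illustration only)
X = array([[1], [2], [3], [4], [5], [6]])
# both procedures yield one split per row, six in total
print(LeaveOneOut().get_n_splits(X))
print(KFold(n_splits=len(X)).get_n_splits(X))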

Let’s explore how to implement a sensitivity analysis of k-fold cross-validation.

First, let’s define a function to create the dataset. This allows you to change the dataset to your own if you desire.

# create the dataset
def get_dataset(n_samples=100):
	X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y

Next, we can define a function to create the model to evaluate.

Again, this separation allows you to change the model to your own if you desire.

# retrieve the model to be evaluated
def get_model():
	model = LogisticRegression()
	return model

Next, you can define a function to evaluate the model on the dataset given a test condition. The test condition could be an instance of the KFold configured with a given k-value, or it could be an instance of LeaveOneOut that represents our ideal test condition.

The function returns the mean classification accuracy as well as the min and max accuracy from the folds. We can use the min and max to summarize the distribution of scores.

# evaluate the model using a given test condition
def evaluate_model(cv):
	# get the dataset
	X, y = get_dataset()
	# get the model
	model = get_model()
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return scores
	return mean(scores), scores.min(), scores.max()

Next, we can calculate the model performance using the LOOCV procedure.

...
# calculate the ideal test condition
ideal, _, _ = evaluate_model(LeaveOneOut())
print('Ideal: %.3f' % ideal)

We can then define the k values to evaluate. In this case, we will test values between 2 and 30.

...
# define folds to test
folds = range(2,31)

We can then evaluate each value in turn and store the results as we go.

...
# record mean and min/max of each set of results
means, mins, maxs = list(),list(),list()
# evaluate each k value
for k in folds:
	# define the test condition
	cv = KFold(n_splits=k, shuffle=True, random_state=1)
	# evaluate k value
	k_mean, k_min, k_max = evaluate_model(cv)
	# report performance
	print('> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
	# store mean accuracy
	means.append(k_mean)
	# store min and max relative to the mean
	mins.append(k_mean - k_min)
	maxs.append(k_max - k_mean)

Finally, we can plot the results for interpretation.

...
# line plot of k mean values with min/max error bars
pyplot.errorbar(folds, means, yerr=[mins, maxs], fmt='o')
# plot the ideal case in a separate color
pyplot.plot(folds, [ideal for _ in range(len(folds))], color='r')
# show the plot
pyplot.show()

Tying this together, the complete example is listed below.

# sensitivity analysis of k in k-fold cross-validation
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# create the dataset
def get_dataset(n_samples=100):
	X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y

# retrieve the model to be evaluated
def get_model():
	model = LogisticRegression()
	return model

# evaluate the model using a given test condition
def evaluate_model(cv):
	# get the dataset
	X, y = get_dataset()
	# get the model
	model = get_model()
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return scores
	return mean(scores), scores.min(), scores.max()

# calculate the ideal test condition
ideal, _, _ = evaluate_model(LeaveOneOut())
print('Ideal: %.3f' % ideal)
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs = list(),list(),list()
# evaluate each k value
for k in folds:
	# define the test condition
	cv = KFold(n_splits=k, shuffle=True, random_state=1)
	# evaluate k value
	k_mean, k_min, k_max = evaluate_model(cv)
	# report performance
	print('> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
	# store mean accuracy
	means.append(k_mean)
	# store min and max relative to the mean
	mins.append(k_mean - k_min)
	maxs.append(k_max - k_mean)
# line plot of k mean values with min/max error bars
pyplot.errorbar(folds, means, yerr=[mins, maxs], fmt='o')
# plot the ideal case in a separate color
pyplot.plot(folds, [ideal for _ in range(len(folds))], color='r')
# show the plot
pyplot.show()

Running the example first reports the LOOCV, then the mean, min, and max accuracy for each k value that was evaluated.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the LOOCV result was about 84 percent, slightly lower than the k=10 result of 85 percent.

Ideal: 0.840
> folds=2, accuracy=0.740 (0.700,0.780)
> folds=3, accuracy=0.749 (0.697,0.824)
> folds=4, accuracy=0.790 (0.640,0.920)
> folds=5, accuracy=0.810 (0.600,0.950)
> folds=6, accuracy=0.820 (0.688,0.941)
> folds=7, accuracy=0.799 (0.571,1.000)
> folds=8, accuracy=0.811 (0.385,0.923)
> folds=9, accuracy=0.829 (0.636,1.000)
> folds=10, accuracy=0.850 (0.600,1.000)
> folds=11, accuracy=0.829 (0.667,1.000)
> folds=12, accuracy=0.785 (0.250,1.000)
> folds=13, accuracy=0.839 (0.571,1.000)
> folds=14, accuracy=0.807 (0.429,1.000)
> folds=15, accuracy=0.821 (0.571,1.000)
> folds=16, accuracy=0.827 (0.500,1.000)
> folds=17, accuracy=0.816 (0.600,1.000)
> folds=18, accuracy=0.831 (0.600,1.000)
> folds=19, accuracy=0.826 (0.600,1.000)
> folds=20, accuracy=0.830 (0.600,1.000)
> folds=21, accuracy=0.814 (0.500,1.000)
> folds=22, accuracy=0.820 (0.500,1.000)
> folds=23, accuracy=0.802 (0.250,1.000)
> folds=24, accuracy=0.804 (0.250,1.000)
> folds=25, accuracy=0.810 (0.250,1.000)
> folds=26, accuracy=0.804 (0.250,1.000)
> folds=27, accuracy=0.818 (0.250,1.000)
> folds=28, accuracy=0.821 (0.250,1.000)
> folds=29, accuracy=0.822 (0.250,1.000)
> folds=30, accuracy=0.822 (0.333,1.000)

A line plot is created comparing the mean accuracy scores to the LOOCV result with the min and max of each result distribution indicated using error bars.

The results suggest that for this model on this dataset, most k values underestimate the performance of the model compared to the ideal case. The results suggest that perhaps k=10 alone is slightly optimistic and perhaps k=13 might be a more accurate estimate.

Line Plot of Mean Accuracy for Cross-Validation k-Values With Error Bars (Blue) vs. the Ideal Case (Red)

This provides a template that you can use to perform a sensitivity analysis of k values of your chosen model on your dataset against a given ideal test condition.

Correlation of Test Harness With Target

Once a test harness is chosen, another consideration is how well it matches the ideal test condition across different algorithms.

It is possible that for some algorithms and some configurations, the k-fold cross-validation will be a better approximation of the ideal test condition compared to other algorithms and algorithm configurations.

We can evaluate and report on this relationship explicitly. This can be achieved by calculating how well the k-fold cross-validation results across a range of algorithms match the evaluation of the same algorithms on the ideal test condition.

The Pearson’s correlation coefficient can be calculated between the two groups of scores to measure how closely they match. That is, do they change together in the same ways: when one algorithm looks better than another via k-fold cross-validation, does this hold on the ideal test condition?

We expect to see a strong positive correlation between the scores, such as 0.5 or higher. A low correlation suggests the need to change the k-fold cross-validation test harness to better match the ideal test condition.

First, we can define a function that will create a list of standard machine learning models to evaluate via each test harness.

# get a list of models to evaluate
def get_models():
	models = list()
	models.append(LogisticRegression())
	models.append(RidgeClassifier())
	models.append(SGDClassifier())
	models.append(PassiveAggressiveClassifier())
	models.append(KNeighborsClassifier())
	models.append(DecisionTreeClassifier())
	models.append(ExtraTreeClassifier())
	models.append(LinearSVC())
	models.append(SVC())
	models.append(GaussianNB())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier())
	models.append(RandomForestClassifier())
	models.append(ExtraTreesClassifier())
	models.append(GaussianProcessClassifier())
	models.append(GradientBoostingClassifier())
	models.append(LinearDiscriminantAnalysis())
	models.append(QuadraticDiscriminantAnalysis())
	return models

We will use k=10 for the chosen test harness.

We can then enumerate each model and evaluate it using 10-fold cross-validation and our ideal test condition, in this case, LOOCV.

...
# define test conditions
ideal_cv = LeaveOneOut()
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# get the list of models to consider
models = get_models()
# collect results
ideal_results, cv_results = list(), list()
# evaluate each model
for model in models:
	# evaluate model using each test condition
	cv_mean = evaluate_model(cv, model)
	ideal_mean = evaluate_model(ideal_cv, model)
	# check for invalid results
	if isnan(cv_mean) or isnan(ideal_mean):
		continue
	# store results
	cv_results.append(cv_mean)
	ideal_results.append(ideal_mean)
	# summarize progress
	print('>%s: ideal=%.3f, cv=%.3f' % (type(model).__name__, ideal_mean, cv_mean))

We can then calculate the correlation between the mean classification accuracy from the 10-fold cross-validation test harness and the LOOCV test harness.

...
# calculate the correlation between each test condition
corr, _ = pearsonr(cv_results, ideal_results)
print('Correlation: %.3f' % corr)

Finally, we can create a scatter plot of the two sets of results and draw a line of best fit to visually see how well they change together.

...
# scatter plot of results
pyplot.scatter(cv_results, ideal_results)
# plot the line of best fit
coeff, bias = polyfit(cv_results, ideal_results, 1)
line = coeff * asarray(cv_results) + bias
pyplot.plot(cv_results, line, color='r')
# show the plot
pyplot.show()

Tying all of this together, the complete example is listed below.

# correlation between test harness and ideal test condition
from numpy import mean
from numpy import isnan
from numpy import asarray
from numpy import polyfit
from scipy.stats import pearsonr
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# create the dataset
def get_dataset(n_samples=100):
	X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y

# get a list of models to evaluate
def get_models():
	models = list()
	models.append(LogisticRegression())
	models.append(RidgeClassifier())
	models.append(SGDClassifier())
	models.append(PassiveAggressiveClassifier())
	models.append(KNeighborsClassifier())
	models.append(DecisionTreeClassifier())
	models.append(ExtraTreeClassifier())
	models.append(LinearSVC())
	models.append(SVC())
	models.append(GaussianNB())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier())
	models.append(RandomForestClassifier())
	models.append(ExtraTreesClassifier())
	models.append(GaussianProcessClassifier())
	models.append(GradientBoostingClassifier())
	models.append(LinearDiscriminantAnalysis())
	models.append(QuadraticDiscriminantAnalysis())
	return models

# evaluate the model using a given test condition
def evaluate_model(cv, model):
	# get the dataset
	X, y = get_dataset()
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return scores
	return mean(scores)

# define test conditions
ideal_cv = LeaveOneOut()
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# get the list of models to consider
models = get_models()
# collect results
ideal_results, cv_results = list(), list()
# evaluate each model
for model in models:
	# evaluate model using each test condition
	cv_mean = evaluate_model(cv, model)
	ideal_mean = evaluate_model(ideal_cv, model)
	# check for invalid results
	if isnan(cv_mean) or isnan(ideal_mean):
		continue
	# store results
	cv_results.append(cv_mean)
	ideal_results.append(ideal_mean)
	# summarize progress
	print('>%s: ideal=%.3f, cv=%.3f' % (type(model).__name__, ideal_mean, cv_mean))
# calculate the correlation between each test condition
corr, _ = pearsonr(cv_results, ideal_results)
print('Correlation: %.3f' % corr)
# scatter plot of results
pyplot.scatter(cv_results, ideal_results)
# plot the line of best fit
coeff, bias = polyfit(cv_results, ideal_results, 1)
line = coeff * asarray(cv_results) + bias
pyplot.plot(cv_results, line, color='r')
# label the plot
pyplot.title('10-fold CV vs LOOCV Mean Accuracy')
pyplot.xlabel('Mean Accuracy (10-fold CV)')
pyplot.ylabel('Mean Accuracy (LOOCV)')
# show the plot
pyplot.show()

Running the example reports the mean classification accuracy for each algorithm calculated via each test harness.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

You may see some warnings that you can safely ignore, such as:

Variables are collinear

We can see that for some algorithms, the test harness over-estimates the accuracy compared to LOOCV, and in other cases, it under-estimates the accuracy. This is to be expected.

At the end of the run, we can see that the correlation between the two sets of results is reported. In this case, we can see that a correlation of 0.746 is reported, which is a good strong positive correlation. The results suggest that 10-fold cross-validation does provide a good approximation for the LOOCV test harness on this dataset as calculated with 18 popular machine learning algorithms.

>LogisticRegression: ideal=0.840, cv=0.850
>RidgeClassifier: ideal=0.830, cv=0.830
>SGDClassifier: ideal=0.730, cv=0.790
>PassiveAggressiveClassifier: ideal=0.780, cv=0.760
>KNeighborsClassifier: ideal=0.760, cv=0.770
>DecisionTreeClassifier: ideal=0.690, cv=0.630
>ExtraTreeClassifier: ideal=0.710, cv=0.620
>LinearSVC: ideal=0.850, cv=0.830
>SVC: ideal=0.900, cv=0.880
>GaussianNB: ideal=0.730, cv=0.720
>AdaBoostClassifier: ideal=0.740, cv=0.740
>BaggingClassifier: ideal=0.770, cv=0.740
>RandomForestClassifier: ideal=0.810, cv=0.790
>ExtraTreesClassifier: ideal=0.820, cv=0.820
>GaussianProcessClassifier: ideal=0.790, cv=0.760
>GradientBoostingClassifier: ideal=0.820, cv=0.820
>LinearDiscriminantAnalysis: ideal=0.830, cv=0.830
>QuadraticDiscriminantAnalysis: ideal=0.610, cv=0.760
Correlation: 0.746

Finally, a scatter plot is created comparing the distribution of mean accuracy scores for the test harness (x-axis) vs. the accuracy scores via LOOCV (y-axis).

A red line of best fit is drawn through the results showing the strong linear correlation.

Scatter Plot of Cross-Validation vs. Ideal Test Mean Accuracy With Line of Best Fit

This provides a harness for comparing your chosen test harness to an ideal test condition on your own dataset.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to configure and evaluate configurations of k-fold cross-validation.

Specifically, you learned:

  • How to evaluate a machine learning algorithm using k-fold cross-validation on a dataset.
  • How to perform a sensitivity analysis of k-values for k-fold cross-validation.
  • How to calculate the correlation between a cross-validation test harness and an ideal test condition.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Configure k-Fold Cross-Validation appeared first on Machine Learning Mastery.

Repeated k-Fold Cross-Validation for Model Evaluation in Python

The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm or configuration on a dataset.

A single run of the k-fold cross-validation procedure may result in a noisy estimate of model performance. Different splits of the data may result in very different results.

Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.

In this tutorial, you will discover repeated k-fold cross-validation for model evaluation.

After completing this tutorial, you will know:

  • The mean performance reported from a single run of k-fold cross-validation may be noisy.
  • Repeated k-fold cross-validation provides a way to reduce the error in the estimate of mean model performance.
  • How to evaluate machine learning models using repeated k-fold cross-validation in Python.

Let’s get started.

Repeated k-Fold Cross-Validation for Model Evaluation in Python

Repeated k-Fold Cross-Validation for Model Evaluation in Python
Photo by lina smith, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. k-Fold Cross-Validation
  2. Repeated k-Fold Cross-Validation
  3. Repeated k-Fold Cross-Validation in Python

k-Fold Cross-Validation

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.

For more on the k-fold cross-validation procedure, see the tutorial:

The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.

First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.

The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 1,000 samples each with 20 input features, 15 of which contribute to the target variable.

The example below creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it contains 1,000 samples and 20 input variables.

The fixed seed for the pseudorandom number generator ensures that we get the same samples each time the dataset is generated.

(1000, 20) (1000,)

Next, we can evaluate a model on this dataset using k-fold cross-validation.

We will evaluate a LogisticRegression model and use the KFold class to perform the cross-validation, configured to shuffle the dataset and set k=10, a popular default.

The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.

The complete example is listed below.

# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation. The mean classification accuracy on the dataset is then reported.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an estimated classification accuracy of about 86.8 percent.

Accuracy: 0.868 (0.032)

Now that we are familiar with k-fold cross-validation, let’s look at an extension that repeats the procedure.

Repeated k-Fold Cross-Validation

The estimate of model performance via k-fold cross-validation can be noisy.

This means that each time the procedure is run, a different split of the dataset into k-folds can be implemented, and in turn, the distribution of performance scores can be different, resulting in a different mean estimate of model performance.

The amount of difference in the estimated performance from one run of k-fold cross-validation to another is dependent upon the model that is being used and on the dataset itself.

A noisy estimate of model performance can be frustrating as it may not be clear which result should be used to compare and select a final model to address the problem.

One solution to reduce the noise in the estimated model performance is to increase the k-value. This will reduce the bias in the model’s estimated performance, although it will increase the variance: that is, tie the result more closely to the specific dataset used in the evaluation.

An alternate approach is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats. This approach is generally referred to as repeated k-fold cross-validation.

… repeated k-fold cross-validation replicates the procedure […] multiple times. For example, if 10-fold cross-validation was repeated five times, 50 different held-out sets would be used to estimate model efficacy.

— Page 70, Applied Predictive Modeling, 2013.

Importantly, each repeat of the k-fold cross-validation process must be performed on the same dataset split into different folds.

Repeated k-fold cross-validation has the benefit of improving the estimate of the mean model performance at the cost of fitting and evaluating many more models.

Common numbers of repeats include 3, 5, and 10. For example, if 3 repeats of 10-fold cross-validation are used to estimate the model performance, this means that (3 * 10) or 30 different models would need to be fit and evaluated.

  • Appropriate: for small datasets and simple models (e.g. linear).

As such, the approach is suited for small- to modestly-sized datasets and/or models that are not too computationally costly to fit and evaluate. This suggests that the approach may be appropriate for linear models and not appropriate for slow-to-fit models like deep learning neural networks.

Like k-fold cross-validation itself, repeated k-fold cross-validation is easy to parallelize, where each fold or each repeated cross-validation process can be executed on different cores or different machines.

Repeated k-Fold Cross-Validation in Python

The scikit-learn Python machine learning library provides an implementation of repeated k-fold cross-validation via the RepeatedKFold class.

The main parameters are the number of folds (n_splits), which is the “k” in k-fold cross-validation, and the number of repeats (n_repeats).

A good default for k is k=10.

A good default for the number of repeats depends on how noisy the estimate of model performance is on the dataset. A value of 3, 5, or 10 repeats is probably a good start. More repeats than 10 are probably not required.

...
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

The example below demonstrates repeated k-fold cross-validation of our test dataset.

# evaluate a logistic regression model using repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation with three repeats. The mean classification accuracy on the dataset is then reported.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an estimated classification accuracy of about 86.7 percent, which is lower than the single run result reported previously of 86.8 percent. This may suggest that the single run result may be optimistic and that the result from three repeats might be a better estimate of the true mean model performance.

Accuracy: 0.867 (0.031)

The expectation of repeated k-fold cross-validation is that the repeated mean would be a more reliable estimate of model performance than the result of a single k-fold cross-validation procedure.

This may mean less statistical noise.

One way this could be measured is by comparing the distributions of mean performance scores under differing numbers of repeats.

We can imagine that there is a true unknown underlying mean performance of a model on a dataset and that repeated k-fold cross-validation runs estimate this mean. We can estimate the error in the mean performance from the true unknown underlying mean performance using a statistical tool called the standard error.

For a given sample size, the standard error indicates how much error, or spread of error, can be expected between the sample mean and the underlying, unknown population mean.

Standard error can be calculated as follows:

  • standard_error = sample_standard_deviation / sqrt(number of scores)

We can calculate the standard error for a sample using the sem() scipy function.
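
As a quick check of the formula above, the sketch below computes the standard error of a small made-up sample of accuracy scores both manually and with the scipy sem() function; the scores are arbitrary values for demonstration only.

# demonstrate the standard error calculation on a made-up sample of scores
from math import sqrt
from numpy import std
from scipy.stats import sem
# arbitrary sample of accuracy scores (illustration only)
scores = [0.86, 0.88, 0.85, 0.87, 0.86]
# manual calculation: sample standard deviation / sqrt(number of scores)
manual_se = std(scores, ddof=1) / sqrt(len(scores))
# the same calculation using scipy
scipy_se = sem(scores)
print('Manual: %.4f, scipy: %.4f' % (manual_se, scipy_se))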

Ideally, we would like to select a number of repeats that shows both minimization of the standard error and stabilizing of the mean estimated performance compared to other numbers of repeats.

The example below demonstrates this by reporting model performance with 10-fold cross-validation with 1 to 15 repeats of the procedure.

We would expect more repeats of the procedure to result in a more accurate estimate of the mean model performance, given the law of large numbers. However, the trials are not independent, so the underlying statistical assumptions are only approximately met.

# compare the number of repeats for repeated k-fold cross-validation
from scipy.stats import sem
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# evaluate a model with a given number of repeats
def evaluate_model(X, y, repeats):
	# prepare the cross-validation procedure
	cv = RepeatedKFold(n_splits=10, n_repeats=repeats, random_state=1)
	# create model
	model = LogisticRegression()
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# configurations to test
repeats = range(1,16)
results = list()
for r in repeats:
	# evaluate using a given number of repeats
	scores = evaluate_model(X, y, r)
	# summarize
	print('>%d mean=%.4f se=%.3f' % (r, mean(scores), sem(scores)))
	# store
	results.append(scores)
# plot the results
pyplot.boxplot(results, labels=[str(r) for r in repeats], showmeans=True)
pyplot.show()

Running the example reports the mean and standard error classification accuracy using 10-fold cross-validation with different numbers of repeats.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the default of one repeat appears optimistic compared to the other configurations, with an accuracy of about 86.80 percent compared to 86.73 percent and lower for larger numbers of repeats.

We can see that the mean seems to coalesce around a value of about 86.5 percent. We might take this as the stable estimate of model performance and, in turn, choose 5 or 6 repeats, the first configurations that approximate this value.

Looking at the standard error, we can see that it decreases with an increase in the number of repeats and stabilizes around 0.003 at about 9 or 10 repeats, although 5 repeats already achieve a standard error of 0.005, less than half of that achieved with a single repeat (0.011).

>1 mean=0.8680 se=0.011
>2 mean=0.8675 se=0.008
>3 mean=0.8673 se=0.006
>4 mean=0.8670 se=0.006
>5 mean=0.8658 se=0.005
>6 mean=0.8655 se=0.004
>7 mean=0.8651 se=0.004
>8 mean=0.8651 se=0.004
>9 mean=0.8656 se=0.003
>10 mean=0.8658 se=0.003
>11 mean=0.8655 se=0.003
>12 mean=0.8654 se=0.003
>13 mean=0.8652 se=0.003
>14 mean=0.8651 se=0.003
>15 mean=0.8653 se=0.003

A box and whisker plot is created to summarize the distribution of scores for each number of repeats.

The orange line indicates the median of the distribution and the green triangle represents the arithmetic mean. If these symbols (values) coincide, it suggests a reasonably symmetric distribution and that the mean may capture the central tendency well.

This might provide an additional heuristic for choosing an appropriate number of repeats for your test harness.

Taking this into consideration, using five repeats with this chosen test harness and algorithm appears to be a good choice.

Box and Whisker Plots of Classification Accuracy vs Repeats for k-Fold Cross-Validation

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered repeated k-fold cross-validation for model evaluation.

Specifically, you learned:

  • The mean performance reported from a single run of k-fold cross-validation may be noisy.
  • Repeated k-fold cross-validation provides a way to reduce the error in the estimate of mean model performance.
  • How to evaluate machine learning models using repeated k-fold cross-validation in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Repeated k-Fold Cross-Validation for Model Evaluation in Python appeared first on Machine Learning Mastery.


How to use Seaborn Data Visualization for Machine Learning

Data visualization provides insight into the distribution and relationships between variables in a dataset.

This insight can be helpful in selecting data preparation techniques to apply prior to modeling and the types of algorithms that may be most suited to the data.

Seaborn is a data visualization library for Python that runs on top of the popular Matplotlib data visualization library, although it provides a simple interface and aesthetically better-looking plots.

In this tutorial, you will discover a gentle introduction to Seaborn data visualization for machine learning.

After completing this tutorial, you will know:

  • How to summarize the distribution of variables using bar charts, histograms, and box and whisker plots.
  • How to summarize relationships using line plots and scatter plots.
  • How to compare the distribution and relationships of variables for different class values on the same plot.

Let’s get started.

How to use Seaborn Data Visualization for Machine Learning
Photo by Martin Pettitt, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  • Seaborn Data Visualization Library
  • Line Plots
  • Bar Chart Plots
  • Histogram Plots
  • Box and Whisker Plots
  • Scatter Plots

Seaborn Data Visualization Library

The primary plotting library for Python is called Matplotlib.

Seaborn is a plotting library that offers a simpler interface, sensible defaults for plots needed for machine learning, and most importantly, the plots are aesthetically better looking than those in Matplotlib.

Seaborn requires that Matplotlib is installed first.

You can install Matplotlib directly using pip, as follows:

sudo pip install matplotlib

Once installed, you can confirm that the library can be loaded and used by printing the version number, as follows:

# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)

Running the example prints the current version of the Matplotlib library.

matplotlib: 3.1.2

Next, the Seaborn library can be installed, also using pip:

sudo pip install seaborn

Once installed, we can also confirm the library can be loaded and used by printing the version number, as follows:

# seaborn
import seaborn
print('seaborn: %s' % seaborn.__version__)

Running the example prints the current version of the Seaborn library.

seaborn: 0.10.0

To create Seaborn plots, you must import the Seaborn library and call functions to create the plots.

Importantly, Seaborn plotting functions expect data to be provided as Pandas DataFrames. This means that if you are loading your data from CSV files, you must use Pandas functions like read_csv() to load your data as a DataFrame. When plotting, columns can then be specified via the column name or column index.
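
For example, a minimal sketch of loading a CSV file into a DataFrame for plotting might look as follows (the file name here is hypothetical):

# load a CSV file as a Pandas DataFrame for plotting (hypothetical file name)
from pandas import read_csv
dataset = read_csv('my_data.csv', header=0)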

To show the plot, you can call the show() function from the Matplotlib library.

...
# display the plot
pyplot.show()

Alternatively, the plots can be saved to file, such as a PNG formatted image file. The savefig() Matplotlib function can be used to save images.

...
# save the plot
pyplot.savefig('my_image.png')

Now that we have Seaborn installed, let’s look at some common plots we may need when working with machine learning data.

Line Plots

A line plot is generally used to present observations collected at regular intervals.

The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.

A line plot can be created in Seaborn by calling the lineplot() function and passing the x-axis data for the regular interval, and y-axis for the observations.

We can demonstrate a line plot using a time series dataset of monthly car sales.

The dataset has two columns: “Month” and “Sales.” Month will be used as the x-axis and Sales will be plotted on the y-axis.

...
# create line plot
lineplot(x='Month', y='Sales', data=dataset)

Tying this together, the complete example is listed below.

# line plot of a time series dataset
from pandas import read_csv
from seaborn import lineplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
dataset = read_csv(url, header=0)
# create line plot
lineplot(x='Month', y='Sales', data=dataset)
# show plot
pyplot.show()

Running the example first loads the time series dataset and creates a line plot of the data, clearly showing a trend and seasonality in the sales data.

Line Plot of a Time Series Dataset

For more great examples of line plots with Seaborn, see: Visualizing statistical relationships.

Bar Chart Plots

A bar chart is generally used to present relative quantities for multiple categories.

The x-axis represents the categories that are spaced evenly. The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.

A bar chart can be created in Seaborn by calling the countplot() function and passing the data.

We will demonstrate a bar chart with a variable from the breast cancer classification dataset that is comprised of categorical input variables.

We will just plot one variable, in this case, the first variable which is the age bracket.

...
# create bar chart plot
countplot(x=0, data=dataset)

Tying this together, the complete example is listed below.

# bar chart plot of a categorical variable
from pandas import read_csv
from seaborn import countplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
# create bar chart plot
countplot(x=0, data=dataset)
# show plot
pyplot.show()

Running the example first loads the breast cancer dataset and creates a bar chart plot of the data, showing each age group and the number of individuals (samples) that fall within each group.

Bar Chart Plot of Age Range Categorical Variable

We might also want to plot the counts for each category for a variable, such as the first variable, against the class label.

This can be achieved using the countplot() function and specifying the class variable (column index 9) via the “hue” argument, as follows:

...
# create bar chart plot
countplot(x=0, hue=9, data=dataset)

Tying this together, the complete example is listed below.

# bar chart plot of a categorical variable against a class variable
from pandas import read_csv
from seaborn import countplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
# create bar chart plot
countplot(x=0, hue=9, data=dataset)
# show plot
pyplot.show()

Running the example first loads the breast cancer dataset and creates a bar chart plot of the data, showing each age group and the number of individuals (samples) that fall within each group separated by the two class labels for the dataset.

Bar Chart Plot of Age Range Categorical Variable by Class Label

For more great examples of bar chart plots with Seaborn, see: Plotting with categorical data.

Histogram Plots

A histogram plot is generally used to summarize the distribution of a numerical data sample.

The x-axis represents discrete bins or intervals for the observations. For example, observations with values between 1 and 10 may be split into five bins; the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on.

The y-axis represents the frequency or count of the number of observations in the dataset that belong to each bin.

Essentially, a data sample is transformed into a bar chart where each category on the x-axis represents an interval of observation values.
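
The binning itself can be sketched directly with NumPy; the example below (a small sample of made-up values, not part of the original tutorial) splits the sample into five bins and reports the counts and bin edges:

# bin a small sample of made-up values into five bins
from numpy import histogram
sample = [1, 2, 3, 3, 4, 5, 6, 7, 9, 10]
counts, bin_edges = histogram(sample, bins=5)
print(counts)
print(bin_edges)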

A histogram can be created in Seaborn by calling the distplot() function and passing the variable.

We will demonstrate a histogram with a numerical variable from the diabetes classification dataset. We will just plot one variable, in this case, the first variable, which is the number of times that a patient was pregnant.

...
# create histogram plot
distplot(dataset[[0]])

Tying this together, the complete example is listed below.

# histogram plot of a numerical variable
from pandas import read_csv
from seaborn import distplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataset = read_csv(url, header=None)
# create histogram plot
distplot(dataset[[0]])
# show plot
pyplot.show()

Running the example first loads the diabetes dataset and creates a histogram plot of the variable, showing the distribution of the values with a hard cut-off at zero.

The plot shows both the histogram (counts of bins) as well as a smooth estimate of the probability density function.

Histogram Plot of Number of Times Pregnant Numerical Variable

For more great examples of histogram plots with Seaborn, see: Visualizing the distribution of a dataset.

Box and Whisker Plots

A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample.

The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired.

The y-axis represents the observation values. A box is drawn to summarize the middle 50 percent of the dataset starting at the observation at the 25th percentile and ending at the 75th percentile. This is called the interquartile range, or IQR. The median, or 50th percentile, is drawn with a line.

Lines called whiskers are drawn extending from both ends of the box, with a length of 1.5 * IQR, to demonstrate the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn with small circles.
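
As a rough sketch (using made-up values, not part of the original tutorial), the whisker limits can be computed from the 25th and 75th percentiles as follows:

# calculate whisker limits from the IQR for a made-up sample
from numpy import percentile
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]
q25, q75 = percentile(sample, 25), percentile(sample, 75)
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
print('Whiskers: %.1f to %.1f' % (lower, upper))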

A boxplot can be created in Seaborn by calling the boxplot() function and passing the data.

We will demonstrate a boxplot with a numerical variable from the diabetes classification dataset. We will just plot one variable, in this case, the first variable, which is the number of times that a patient was pregnant.

...
# create box and whisker plot
boxplot(x=0, data=dataset)

Tying this together, the complete example is listed below.

# box and whisker plot of a numerical variable
from pandas import read_csv
from seaborn import boxplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataset = read_csv(url, header=None)
# create box and whisker plot
boxplot(x=0, data=dataset)
# show plot
pyplot.show()

Running the example first loads the diabetes dataset and creates a boxplot plot of the first input variable, showing the distribution of the number of times patients were pregnant.

We can see the median just above 2.5 times and some outliers up around 15 times (wow!).

Box and Whisker Plot of Number of Times Pregnant Numerical Variable

We might also want to plot the distribution of the numerical variable for each value of a categorical variable, such as the first variable, against the class label.

This can be achieved by calling the boxplot() function and passing the class variable as the x-axis and the numerical variable as the y-axis.

...
# create box and whisker plot
boxplot(x=8, y=0, data=dataset)

Tying this together, the complete example is listed below.

# box and whisker plot of a numerical variable vs class label
from pandas import read_csv
from seaborn import boxplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataset = read_csv(url, header=None)
# create box and whisker plot
boxplot(x=8, y=0, data=dataset)
# show plot
pyplot.show()

Running the example first loads the diabetes dataset and creates a boxplot of the data, showing the distribution of the number of times pregnant as a numerical variable for the two-class labels.

Box and Whisker Plot of Number of Times Pregnant Numerical Variable by Class Label

Scatter Plots

A scatter plot, or scatterplot, is generally used to summarize the relationship between two paired data samples.

Paired data samples mean that two measures were recorded for a given observation, such as the weight and height of a person.

The x-axis represents observation values for the first sample, and the y-axis represents the observation values for the second sample. Each point on the plot represents a single observation.

A scatterplot can be created in Seaborn by calling the scatterplot() function and passing the two numerical variables.

We will demonstrate a scatterplot with two numerical variables from the diabetes classification dataset. We will plot the first versus the second variable, in this case, the first variable, which is the number of times that a patient was pregnant, and the second is the plasma glucose concentration after a two hour oral glucose tolerance test (more details of the variables here).

...
# create scatter plot
scatterplot(x=0, y=1, data=dataset)

Tying this together, the complete example is listed below.

# scatter plot of two numerical variables
from pandas import read_csv
from seaborn import scatterplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataset = read_csv(url, header=None)
# create scatter plot
scatterplot(x=0, y=1, data=dataset)
# show plot
pyplot.show()

Running the example first loads the diabetes dataset and creates a scatter plot of the first two input variables.

We can see a somewhat uniform relationship between the two variables.

Scatter Plot of Number of Times Pregnant vs. Plasma Glucose Numerical Variables

We might also want to plot the relationship for the pair of numerical variables against the class label.

This can be achieved using the scatterplot() function and specifying the class variable (column index 8) via the “hue” argument, as follows:

...
# create scatter plot
scatterplot(x=0, y=1, hue=8, data=dataset)

Tying this together, the complete example is listed below.

# scatter plot of two numerical variables vs class label
from pandas import read_csv
from seaborn import scatterplot
from matplotlib import pyplot
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataset = read_csv(url, header=None)
# create scatter plot
scatterplot(x=0, y=1, hue=8, data=dataset)
# show plot
pyplot.show()

Running the example first loads the diabetes dataset and creates a scatter plot of the first two variables vs. class label.

Scatter Plot of Number of Times Pregnant vs. Plasma Glucose Numerical Variables by Class Label

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Summary

In this tutorial, you discovered a gentle introduction to Seaborn data visualization for machine learning.

Specifically, you learned:

  • How to summarize the distribution of variables using bar charts, histograms, and box and whisker plots.
  • How to summarize relationships using line plots and scatter plots.
  • How to compare the distribution and relationships of variables for different class values on the same plot.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to use Seaborn Data Visualization for Machine Learning appeared first on Machine Learning Mastery.

Plot a Decision Surface for Machine Learning Algorithms in Python

Classification algorithms learn how to assign class labels to examples, although their decisions can appear opaque.

A popular diagnostic for understanding the decisions made by a classification algorithm is the decision surface. This is a plot that shows how a fit machine learning algorithm predicts across a coarse grid of the input feature space.

A decision surface plot is a powerful tool for understanding how a given model “sees” the prediction task and how it has decided to divide the input feature space by class label.

In this tutorial, you will discover how to plot a decision surface for a classification machine learning algorithm.

After completing this tutorial, you will know:

  • Decision surface is a diagnostic tool for understanding how a classification algorithm divides up the feature space.
  • How to plot a decision surface using crisp class labels for a machine learning algorithm.
  • How to plot and interpret a decision surface using predicted probabilities.

Let’s get started.

Plot a Decision Surface for Machine Learning Algorithms in Python
Photo by Tony Webster, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Decision Surface
  2. Dataset and Model
  3. Plot a Decision Surface

Decision Surface

Classification machine learning algorithms learn to assign labels to input examples.

Consider numeric input features for the classification task defining a continuous input feature space.

We can think of each input feature defining an axis or dimension on a feature space. Two input features would define a feature space that is a plane, with dots representing input coordinates in the input space. If there were three input variables, the feature space would be a three-dimensional volume.

Each point in the space can be assigned a class label. In terms of a two-dimensional feature space, we can think of each point on the plane having a different color, according to its assigned class.

The goal of a classification algorithm is to learn how to divide up the feature space such that labels are assigned correctly to points in the feature space, or at least, as correctly as is possible.

This is a useful geometric understanding of classification predictive modeling. We can take it one step further.

Once a classification machine learning algorithm divides a feature space, we can then classify each point in the feature space, on some arbitrary grid, to get an idea of how exactly the algorithm chose to divide up the feature space.

This is called a decision surface or decision boundary, and it provides a diagnostic tool for understanding a model on a classification predictive modeling task.

Although the notion of a “surface” suggests a two-dimensional feature space, the method can be used with feature spaces with more than two dimensions, where a surface is created for each pair of input features.

Now that we are familiar with what a decision surface is, next, let’s define a dataset and model for which we later explore the decision surface.

Dataset and Model

In this section, we will define a classification task and predictive model to learn the task.

Synthetic Classification Dataset

We can use the make_blobs() scikit-learn function to define a classification task with a two-dimensional numerical feature space and each point assigned one of two class labels, e.g. a binary classification task.

...
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)

Once defined, we can then create a scatter plot of the feature space with the first feature defining the x-axis, the second feature defining the y axis, and each sample represented as a point in the feature space.

We can then color points in the scatter plot according to their class label as either 0 or 1.

...
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Tying this together, the complete example of defining and plotting a synthetic classification dataset is listed below.

# generate binary classification dataset and plot
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example creates the dataset, then plots the dataset as a scatter plot with points colored by class label.

We can see a clear separation between examples from the two classes and we can imagine how a machine learning model might draw a line to separate the two classes, e.g. perhaps a diagonal line right through the middle of the two groups.

Scatter Plot of Binary Classification Dataset With 2D Feature Space

Fit Classification Predictive Model

We can now fit a model on our dataset.

In this case, we will fit a logistic regression algorithm because we can predict both crisp class labels and probabilities, both of which we can use in our decision surface.

We can define the model, then fit it on the training dataset.

...
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)

Once defined, we can use the model to make a prediction for the training dataset to get an idea of how well it learned to divide the feature space of the training dataset and assign labels.

...
# make predictions
yhat = model.predict(X)

The predictions can be evaluated using classification accuracy.

...
# evaluate the predictions
acc = accuracy_score(y, yhat)
print('Accuracy: %.3f' % acc)

Tying this together, the complete example of fitting and evaluating a model on the synthetic binary classification dataset is listed below.

# example of fitting and evaluating a model on the classification dataset
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# evaluate the predictions
acc = accuracy_score(y, yhat)
print('Accuracy: %.3f' % acc)

Running the example fits the model and makes a prediction for each example.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a performance of about 97.2 percent.

Accuracy: 0.972

Now that we have a dataset and model, let’s explore how we can develop a decision surface.

Plot a Decision Surface

We can create a decision surface by fitting a model on the training dataset, then using the model to make predictions for a grid of values across the input domain.

Once we have the grid of predictions, we can plot the values and their class label.

A scatter plot could be used if a fine enough grid was taken. A better approach is to use a contour plot that can interpolate the colors between the points.

The contourf() Matplotlib function can be used.

This requires a few steps.

First, we need to define a grid of points across the feature space.

To do this, we can find the minimum and maximum values for each feature and expand the grid one step beyond that to ensure the whole feature space is covered.

...
# define bounds of the domain
min1, max1 = X[:, 0].min()-1, X[:, 0].max()+1
min2, max2 = X[:, 1].min()-1, X[:, 1].max()+1

We can then create a uniform sample across each dimension using the arange() function at a chosen resolution. We will use a resolution of 0.1 in this case.

...
# define the x and y scale
x1grid = arange(min1, max1, 0.1)
x2grid = arange(min2, max2, 0.1)

Now we need to turn this into a grid.

We can use the meshgrid() NumPy function to create a grid from these two vectors.

If the first feature x1 is our x-axis of the feature space, then we need one row of x1 values of the grid for each point on the y-axis.

Similarly, if we take x2 as our y-axis of the feature space, then we need one column of x2 values of the grid for each point on the x-axis.

The meshgrid() function will do this for us, duplicating the rows and columns for us as needed. It returns two grids for the two input vectors. The first grid of x-values and the second of y-values, organized in an appropriately sized grid of rows and columns across the feature space.

...
# create all of the lines and rows of the grid
xx, yy = meshgrid(x1grid, x2grid)

We then need to flatten out the grid to create samples that we can feed into the model and make a prediction.

To do this, first, we flatten each grid into a vector.

...
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))

Then we stack the vectors side by side as columns in an input dataset, e.g. like our original training dataset, but at a much higher resolution.

...
# horizontal stack vectors to create x1,x2 input for the model
grid = hstack((r1,r2))

We can then feed this into our model and get a prediction for each point in the grid.

...
# make predictions for the grid
yhat = model.predict(grid)
# reshape the predictions back into a grid

So far, so good.

We have a grid of values across the feature space and the class labels as predicted by our model.

Next, we need to plot the grid of values as a contour plot.

The contourf() function takes separate grids for each axis, just like what was returned from our prior call to meshgrid(). Great!

So we can use xx and yy that we prepared earlier and simply reshape the predictions (yhat) from the model to have the same shape.

...
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)

We then plot the decision surface with a two-color colormap.

...
# plot the grid of x, y and z values as a surface
pyplot.contourf(xx, yy, zz, cmap='Paired')

We can then plot the actual points of the dataset over the top to see how well they were separated by the logistic regression decision surface.

The complete example of plotting a decision surface for a logistic regression model on our synthetic binary classification dataset is listed below.

# decision surface for logistic regression on a binary classification dataset
from numpy import where
from numpy import meshgrid
from numpy import arange
from numpy import hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# define bounds of the domain
min1, max1 = X[:, 0].min()-1, X[:, 0].max()+1
min2, max2 = X[:, 1].min()-1, X[:, 1].max()+1
# define the x and y scale
x1grid = arange(min1, max1, 0.1)
x2grid = arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = hstack((r1,r2))
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# make predictions for the grid
yhat = model.predict(grid)
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
pyplot.contourf(xx, yy, zz, cmap='Paired')
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Paired')
# show the plot
pyplot.show()

Running the example fits the model and uses it to predict outcomes for the grid of values across the feature space and plots the result as a contour plot.

We can see, as we might have suspected, that logistic regression divides the feature space using a straight line. It is a linear model, after all; this is all it can do.

Creating a decision surface is almost like magic. It gives immediate and meaningful insight into how the model has learned the task.

Try it with different algorithms, like an SVM or decision tree.
Post your resulting maps as links in the comments below!
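
To do so, only the model definition in the complete example above needs to change; for example, a decision tree can be swapped in as follows:

...
# define the model (swap in a decision tree instead of logistic regression)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()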

Decision Surface for Logistic Regression on a Binary Classification Task

We can add more depth to the decision surface by using the model to predict probabilities instead of class labels.

...
# make predictions for the grid
yhat = model.predict_proba(grid)
# keep just the probabilities for class 0
yhat = yhat[:, 0]

When plotted, we can see how confident or likely it is that each point in the feature space belongs to each of the class labels, as seen by the model.

We can use a different color map that has gradations, and show a legend so we can interpret the colors.

...
# plot the grid of x, y and z values as a surface
c = pyplot.contourf(xx, yy, zz, cmap='RdBu')
# add a legend, called a color bar
pyplot.colorbar(c)

The complete example of creating a decision surface using probabilities is listed below.

# probability decision surface for logistic regression on a binary classification dataset
from numpy import where
from numpy import meshgrid
from numpy import arange
from numpy import hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# define bounds of the domain
min1, max1 = X[:, 0].min()-1, X[:, 0].max()+1
min2, max2 = X[:, 1].min()-1, X[:, 1].max()+1
# define the x and y scale
x1grid = arange(min1, max1, 0.1)
x2grid = arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = hstack((r1,r2))
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# make predictions for the grid
yhat = model.predict_proba(grid)
# keep just the probabilities for class 0
yhat = yhat[:, 0]
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
c = pyplot.contourf(xx, yy, zz, cmap='RdBu')
# add a legend, called a color bar
pyplot.colorbar(c)
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Paired')
# show the plot
pyplot.show()

Running the example predicts the probability of class membership for each point on the grid across the feature space and plots the result.

Here, we can see that the model is unsure (lighter colors) around the middle of the domain, given the sampling noise in that area of the feature space. We can also see that the model is very confident (full colors) in the bottom-left and top-right halves of the domain.

Together, the crisp class and probability decision surfaces are powerful diagnostic tools for understanding your model and how it divides the feature space for your predictive modeling task.

Probability Decision Surface for Logistic Regression on a Binary Classification Task

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to plot a decision surface for a classification machine learning algorithm.

Specifically, you learned:

  • Decision surface is a diagnostic tool for understanding how a classification algorithm divides up the feature space.
  • How to plot a decision surface using crisp class labels for a machine learning algorithm.
  • How to plot and interpret a decision surface using predicted probabilities.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Plot a Decision Surface for Machine Learning Algorithms in Python appeared first on Machine Learning Mastery.

How to Calculate the Bias-Variance Trade-off with Python

The performance of a machine learning model can be characterized in terms of the bias and the variance of the model.

A model with high bias makes strong assumptions about the form of the unknown underlying function that maps inputs to outputs in the dataset, such as linear regression. A model with high variance is highly dependent upon the specifics of the training dataset, such as unpruned decision trees. We desire models with low bias and low variance, although there is often a trade-off between these two concerns.

The bias-variance trade-off is a useful conceptualization for selecting and configuring models, although generally cannot be computed directly as it requires full knowledge of the problem domain, which we do not have. Nevertheless, in some cases, we can estimate the error of a model and divide the error down into bias and variance components, which may provide insight into a given model’s behavior.

In this tutorial, you will discover how to calculate the bias and variance for a machine learning model.

After completing this tutorial, you will know:

  • Model error consists of model variance, model bias, and irreducible error.
  • We seek models with low bias and variance, although typically reducing one results in a rise in the other.
  • How to decompose mean squared error into model bias and variance terms.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

How to Calculate the Bias-Variance Trade-off in Python
Photo by Nathalie, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Bias, Variance, and Irreducible Error
  2. Bias-Variance Trade-off
  3. Calculate the Bias and Variance

Bias, Variance, and Irreducible Error

Consider a machine learning model that makes predictions for a predictive modeling task, such as regression or classification.

The performance of the model on the task can be described in terms of the prediction error on all examples not used to train the model. We will refer to this as the model error.

  • Error(Model)

The model error can be decomposed into three sources of error: the variance of the model, the bias of the model, and the variance of the irreducible error in the data.

  • Error(Model) = Variance(Model) + Bias(Model) + Variance(Irreducible Error)

Let’s take a closer look at each of these three terms.

Model Bias

The bias is a measure of how closely the model can capture the mapping function between inputs and outputs.

It captures the rigidity of the model: the strength of the assumption the model has about the functional form of the mapping between inputs and outputs.

This reflects how close the functional form of the model can get to the true relationship between the predictors and the outcome.

— Page 97, Applied Predictive Modeling, 2013.

A model with high bias is helpful when the bias matches the true but unknown underlying mapping function for the predictive modeling problem. Yet, a model with a large bias will be completely useless when the functional form for the problem is mismatched with the assumptions of the model, e.g. assuming a linear relationship for data with a high non-linear relationship.

  • Low Bias: Weak assumptions regarding the functional form of the mapping of inputs to outputs.
  • High Bias: Strong assumptions regarding the functional form of the mapping of inputs to outputs.

The bias is always positive.

Model Variance

The variance of the model is the amount the performance of the model changes when it is fit on different training data.

It captures the impact that the specifics of the training data have on the model.

Variance refers to the amount by which [the model] would change if we estimated it using a different training data set.

— Page 34, An Introduction to Statistical Learning with Applications in R, 2014.

A model with high variance will change a lot with small changes to the training dataset. Conversely, a model with low variance will change little with small or even large changes to the training dataset.

  • Low Variance: Small changes to the model with changes to the training dataset.
  • High Variance: Large changes to the model with changes to the training dataset.

The variance is always positive.

Irreducible Error

On the whole, the error of a model consists of reducible error and irreducible error.

  • Model Error = Reducible Error + Irreducible Error

The reducible error is the element that we can improve. It is the quantity that we reduce when the model is learning on a training dataset and we try to get this number as close to zero as possible.

The irreducible error is the error that we can not remove with our model, or with any model.

The error is caused by elements outside our control, such as statistical noise in the observations.

… usually called “irreducible noise” and cannot be eliminated by modeling.

— Page 97, Applied Predictive Modeling, 2013.

As such, although we may be able to squash the reducible error to a very small value close to zero, or even zero in some cases, we will also have some irreducible error. It defines a lower bound on the error, and in turn an upper bound on the performance, that can be achieved on a problem.

It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.

— Page 19, An Introduction to Statistical Learning with Applications in R, 2014.

It is a reminder that no model is perfect.

Bias-Variance Trade-off

The bias and the variance of a model’s performance are connected.

Ideally, we would prefer a model with low bias and low variance, although in practice, this is very challenging. In fact, this could be described as the goal of applied machine learning for a given predictive modeling problem.

Reducing the bias can typically only be achieved by increasing the variance. Conversely, reducing the variance can typically only be achieved by increasing the bias.

This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance […] or a method with very low variance but high bias …

— Page 36, An Introduction to Statistical Learning with Applications in R, 2014.

This relationship is generally referred to as the bias-variance trade-off. It is a conceptual framework for thinking about how to choose models and model configuration.

We can choose a model based on its bias or variance. Simple models, such as linear regression and logistic regression, generally have a high bias and a low variance. Complex models, such as random forest, generally have a low bias but a high variance.

We may also choose model configurations based on their effect on the bias and variance of the model. The k hyperparameter in k-nearest neighbors controls the bias-variance trade-off. Small values, such as k=1, result in a low bias and a high variance, whereas large k values, such as k=21, result in a high bias and a low variance.
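
As a rough illustration (a sketch on a synthetic dataset, not part of the original text), the snippet below compares cross-validated accuracy for a small and a large value of k to show how the configuration changes behavior:

# compare knn with a small and a large k on a synthetic dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# define a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# evaluate a low-bias/high-variance and a high-bias/low-variance configuration
for k in [1, 21]:
	model = KNeighborsClassifier(n_neighbors=k)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1)
	print('k=%d: mean accuracy=%.3f' % (k, mean(scores)))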

High bias is not always bad, nor is high variance, but they can lead to poor results.

We often must test a suite of different models and model configurations in order to discover what works best for a given dataset. A model with a large bias may be too rigid and underfit the problem. Conversely, a large variance may overfit the problem.

We may decide to increase the bias or the variance as long as it decreases the overall estimate of model error.

Calculate the Bias and Variance

I get this question all the time:

How can I calculate the bias-variance trade-off for my algorithm on my dataset?

Technically, we cannot perform this calculation.

We cannot calculate the actual bias and variance for a predictive modeling problem.

This is because we do not know the true mapping function for a predictive modeling problem.

Instead, we use the bias, variance, irreducible error, and the bias-variance trade-off as tools to help select models, configure models, and interpret results.

In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias-variance trade-off in mind.

— Page 36, An Introduction to Statistical Learning with Applications in R, 2014.

Even though the bias-variance trade-off is a conceptual tool, we can estimate it in some cases.

The mlxtend library by Sebastian Raschka provides the bias_variance_decomp() function that can estimate the bias and variance for a model over multiple bootstrap samples.

First, you must install the mlxtend library; for example:

sudo pip install mlxtend

The example below loads the Boston housing dataset directly via URL, splits it into train and test sets, then estimates the mean squared error (MSE) for a linear regression as well as the bias and variance for the model error over 200 bootstrap samples.

# estimate the bias and variance for a regression model
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# separate into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the model
model = LinearRegression()
# estimate bias and variance
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test, loss='mse', num_rounds=200, random_seed=1)
# summarize results
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)

Running the example reports the estimated error as well as the estimated bias and variance for the model error.

Your specific results may vary given the stochastic nature of the evaluation routine. Try running the example a few times.

In this case, we can see that the model has a high bias and a low variance. This is to be expected given that we are using a linear regression model. We can also see that the sum of the estimated bias and variance equals the estimated error of the model, e.g. 20.726 + 1.761 = 22.487.

MSE: 22.487
Bias: 20.726
Variance: 1.761

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Books

Articles

Summary

In this tutorial, you discovered how to calculate the bias and variance for a machine learning model.

Specifically, you learned:

  • Model error consists of model variance, model bias, and irreducible error.
  • We seek models with low bias and variance, although typically reducing one results in a rise in the other.
  • How to decompose mean squared error into model bias and variance terms.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate the Bias-Variance Trade-off with Python appeared first on Machine Learning Mastery.

Scikit-Optimize for Hyperparameter Tuning in Machine Learning

Hyperparameter optimization refers to performing a search in order to discover the set of specific model configuration arguments that result in the best performance of the model on a specific dataset.

There are many ways to perform hyperparameter optimization, although modern methods, such as Bayesian Optimization, are fast and effective. The Scikit-Optimize library is an open-source Python library that provides an implementation of Bayesian Optimization that can be used to tune the hyperparameters of machine learning models from the scikit-learn Python library.

You can easily use the Scikit-Optimize library to tune the models on your next machine learning project.

In this tutorial, you will discover how to use the Scikit-Optimize library to use Bayesian Optimization for hyperparameter tuning.

After completing this tutorial, you will know:

  • Scikit-Optimize provides a general toolkit for Bayesian Optimization that can be used for hyperparameter tuning.
  • How to manually use the Scikit-Optimize library to tune the hyperparameters of a machine learning model.
  • How to use the built-in BayesSearchCV class to perform model hyperparameter tuning.

Let’s get started.

Scikit-Optimize for Hyperparameter Tuning in Machine Learning
Photo by Dan Nevill, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Scikit-Optimize
  2. Machine Learning Dataset and Model
  3. Manually Tune Algorithm Hyperparameters
  4. Automatically Tune Algorithm Hyperparameters

Scikit-Optimize

Scikit-Optimize, or skopt for short, is an open-source Python library for performing optimization tasks.

It offers efficient optimization algorithms, such as Bayesian Optimization, and can be used to find the minimum or maximum of arbitrary cost functions.

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.
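
As a toy illustration of this idea (a sketch that is not part of this tutorial), the gp_minimize() function can be used to find the minimum of a simple one-dimensional function directly:

# minimize a simple quadratic function with bayesian optimization
from skopt import gp_minimize
from skopt.space import Real

# objective function with a known minimum at x=2.0
def objective(params):
	x = params[0]
	return (x - 2.0) ** 2

# run the optimization for a small budget of evaluations
result = gp_minimize(objective, [Real(-5.0, 5.0)], n_calls=20, random_state=1)
print('Best x: %.3f' % result.x[0])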

For more on the topic of Bayesian Optimization, see the tutorial:

Importantly, the library provides support for tuning the hyperparameters of machine learning algorithms offered by the scikit-learn library, so-called hyperparameter optimization. As such, it offers an efficient alternative to less efficient hyperparameter optimization procedures such as grid search and random search.

The scikit-optimize library can be installed using pip, as follows:

sudo pip install scikit-optimize

Once installed, we can import the library and print the version number to confirm the library was installed successfully and can be accessed.

The complete example is listed below.

# report scikit-optimize version number
import skopt
print('skopt %s' % skopt.__version__)

Running the example reports the currently installed version number of scikit-optimize.

Your version number should be the same or higher.

skopt 0.7.2

For more installation instructions, see the documentation:

Now that we are familiar with what Scikit-Optimize is and how to install it, let’s explore how we can use it to tune the hyperparameters of a machine learning model.

Machine Learning Dataset and Model

First, let’s select a standard dataset and a model to address it.

We will use the ionosphere machine learning dataset. This is a standard machine learning dataset comprising 351 rows of data with 34 numerical input variables and a target variable with two class values, e.g. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 64 percent. A top performing model can achieve accuracy on this same test harness of about 94 percent. This provides the bounds of expected performance on this dataset.
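
If desired, this naive baseline can be confirmed with a model that always predicts the majority class; the sketch below (not part of the original tutorial) evaluates scikit-learn's DummyClassifier on the same test harness:

# estimate a naive baseline for the ionosphere dataset
from numpy import mean
from pandas import read_csv
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define a majority-class model and the test harness
model = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate and report the baseline accuracy
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Baseline Accuracy: %.3f' % mean(scores))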

The dataset involves predicting whether measurements of the ionosphere indicate a specific structure or not.

You can learn more about the dataset here:

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the ionosphere dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 351 rows of data with 34 input variables.

(351, 34) (351,)

We can evaluate a support vector machine (SVM) model on this dataset using repeated stratified cross-validation.

We can report the mean model performance on the dataset averaged over all folds and repeats, which will provide a reference for model hyperparameter tuning performed in later sections.

The complete example is listed below.

# evaluate an svm for the ionosphere dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# define model
model = SVC()
# define test harness
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example first loads and prepares the dataset, then evaluates the SVM model on the dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the SVM with default hyperparameters achieved a mean classification accuracy of about 93.7 percent, which is skillful and close to the top performance on the problem of 94 percent.

(351, 34) (351,)
Accuracy: 0.937 (0.038)

Next, let’s see if we can improve performance by tuning the model hyperparameters using the scikit-optimize library.

Manually Tune Algorithm Hyperparameters

The Scikit-Optimize library can be used to tune the hyperparameters of a machine learning model.

We can achieve this manually by using the Bayesian Optimization capabilities of the library.

This requires that we first define a search space. In this case, this will be the hyperparameters of the model that we wish to tune, and the scope or range of each hyperparameter.

We will tune the following hyperparameters of the SVM model:

  • C, the regularization parameter.
  • kernel, the type of kernel used in the model.
  • degree, used for the polynomial kernel.
  • gamma, used in most other kernels.

For the numeric hyperparameters C and gamma, we will define a log scale to search between a small value of 1e-6 and 100. Degree is an integer and we will search values between 1 and 5. Finally, the kernel is a categorical variable with specific named values.

We can define the search space for these four hyperparameters as a list of dimensions using data types from the skopt library, as follows:

...
# define the space of hyperparameters to search
search_space = list()
search_space.append(Real(1e-6, 100.0, 'log-uniform', name='C'))
search_space.append(Categorical(['linear', 'poly', 'rbf', 'sigmoid'], name='kernel'))
search_space.append(Integer(1, 5, name='degree'))
search_space.append(Real(1e-6, 100.0, 'log-uniform', name='gamma'))

Note the data type, the range, and the name of the hyperparameter specified for each.

We can then define a function that will be called by the search procedure. This function is expected by the optimization procedure later; it takes a set of specific hyperparameters for the model, configures and evaluates the model, and returns a score for that set of hyperparameters.

In our case, we want to evaluate the model using repeated stratified 10-fold cross-validation on our ionosphere dataset. We want to maximize classification accuracy, e.g. find the set of model hyperparameters that give the best accuracy. By default, the process minimizes the score returned from this function, therefore, we will return one minus the accuracy, e.g. perfect skill will be (1 – accuracy) or 0.0, and the worst skill will be 1.0.

The evaluate_model() function below implements this and takes a specific set of hyperparameters.

# define the function used to evaluate a given configuration
@use_named_args(search_space)
def evaluate_model(**params):
	# configure the model with specific hyperparameters
	model = SVC()
	model.set_params(**params)
	# define test harness
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate using repeated stratified 10-fold cross-validation
	result = cross_val_score(model, X, y, cv=cv, n_jobs=-1, scoring='accuracy')
	# calculate the mean of the scores
	estimate = mean(result)
	# convert from a maximizing score to a minimizing score
	return 1.0 - estimate

Next, we can execute the search by calling the gp_minimize() function and passing the name of the function to call to evaluate each model and the search space to optimize.

...
# perform optimization
result = gp_minimize(evaluate_model, search_space)

The procedure will run for a fixed budget of objective evaluations and return a result.
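
If you want more control over how long the search runs, the gp_minimize() function also takes an 'n_calls' argument that sets the total number of objective evaluations (100 by default, to the best of my knowledge) and a 'random_state' for reproducibility. A minimal sketch, assuming the evaluate_model() function and search_space defined above:

...
# perform optimization with an explicit budget of 50 objective evaluations
result = gp_minimize(evaluate_model, search_space, n_calls=50, random_state=1)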

The result object contains lots of details, but importantly, we can access the score of the best performing configuration and the hyperparameters used by the best-performing model.

...
# summarize findings
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: %s' % (result.x))

Tying this together, the complete example of manually tuning the hyperparameters of an SVM on the ionosphere dataset is listed below.

# manually tune svm model hyperparameters using skopt on the ionosphere dataset
from numpy import mean
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC
from skopt.space import Integer
from skopt.space import Real
from skopt.space import Categorical
from skopt.utils import use_named_args
from skopt import gp_minimize

# define the space of hyperparameters to search
search_space = list()
search_space.append(Real(1e-6, 100.0, 'log-uniform', name='C'))
search_space.append(Categorical(['linear', 'poly', 'rbf', 'sigmoid'], name='kernel'))
search_space.append(Integer(1, 5, name='degree'))
search_space.append(Real(1e-6, 100.0, 'log-uniform', name='gamma'))

# define the function used to evaluate a given configuration
@use_named_args(search_space)
def evaluate_model(**params):
	# configure the model with specific hyperparameters
	model = SVC()
	model.set_params(**params)
	# define test harness
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate using repeated stratified 10-fold cross-validation
	result = cross_val_score(model, X, y, cv=cv, n_jobs=-1, scoring='accuracy')
	# calculate the mean of the scores
	estimate = mean(result)
	# convert from a maximizing score to a minimizing score
	return 1.0 - estimate

# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# perform optimization
result = gp_minimize(evaluate_model, search_space)
# summarize findings
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: %s' % (result.x))

Running the example may take a few moments, depending on the speed of your machine.

You may see some warning messages that you can safely ignore, such as:

UserWarning: The objective has been evaluated at this point before.

At the end of the run, the best-performing configuration is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best configuration, reported in the order of the search space list, used a modest C value, an RBF kernel, a degree of 2 (ignored by the RBF kernel), and a modest gamma value.

Importantly, we can see that the skill of this model was approximately 94.8 percent, which is a top-performing model on this dataset.

(351, 34) (351,)
Best Accuracy: 0.948
Best Parameters: [1.2852670137769258, 'rbf', 2, 0.18178016885627174]

This is not the only way to use the Scikit-Optimize library for hyperparameter tuning. In the next section, we will look at a more automated approach.

Automatically Tune Algorithm Hyperparameters

The Scikit-Learn machine learning library provides tools for tuning model hyperparameters.

Specifically, it provides the GridSearchCV and RandomizedSearchCV classes that take a model, a search space, and a cross-validation configuration.

The benefit of these classes is that the search procedure is performed automatically, requiring minimal configuration.

Similarly, the Scikit-Optimize library provides a similar interface for performing a Bayesian Optimization of model hyperparameters via the BayesSearchCV class.

This class can be used in the same way as the Scikit-Learn equivalents.

First, the search space must be defined as a dictionary with hyperparameter names used as the key and the scope of the variable as the value.

...
# define search space
params = dict()
params['C'] = (1e-6, 100.0, 'log-uniform')
params['gamma'] = (1e-6, 100.0, 'log-uniform')
params['degree'] = (1,5)
params['kernel'] = ['linear', 'poly', 'rbf', 'sigmoid']

We can then define the BayesSearchCV configuration taking the model we wish to evaluate, the hyperparameter search space, and the cross-validation configuration.

...
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the search
search = BayesSearchCV(estimator=SVC(), search_spaces=params, n_jobs=-1, cv=cv)

We can then execute the search and report the best result and configuration at the end.

...
# perform the search
search.fit(X, y)
# report the best result
print(search.best_score_)
print(search.best_params_)

Tying this together, the complete example of automatically tuning SVM hyperparameters using the BayesSearchCV class on the ionosphere dataset is listed below.

# automatic svm hyperparameter tuning using skopt for the ionosphere dataset
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold
from skopt import BayesSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# define search space
params = dict()
params['C'] = (1e-6, 100.0, 'log-uniform')
params['gamma'] = (1e-6, 100.0, 'log-uniform')
params['degree'] = (1,5)
params['kernel'] = ['linear', 'poly', 'rbf', 'sigmoid']
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the search
search = BayesSearchCV(estimator=SVC(), search_spaces=params, n_jobs=-1, cv=cv)
# perform the search
search.fit(X, y)
# report the best result
print(search.best_score_)
print(search.best_params_)

Running the example may take a few moments, depending on the speed of your machine.

You may see some warning messages that you can safely ignore, such as:

UserWarning: The objective has been evaluated at this point before.

At the end of the run, the best-performing configuration is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model performed above the top-performing models described earlier, achieving a mean classification accuracy of about 95.2 percent.

The search discovered a large C value, an RBF kernel, and a small gamma value.

(351, 34) (351,)
0.9525166191832859
OrderedDict([('C', 4.8722263953328735), ('degree', 4), ('gamma', 0.09805881007239009), ('kernel', 'rbf')])

This provides a template that you can use to tune the hyperparameters on your machine learning project.
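
For example, the same pattern can be adapted to a different algorithm by swapping the estimator and the search space. The sketch below tunes a RandomForestClassifier instead, reusing the dataset and cross-validation configuration defined above; the search ranges are illustrative assumptions only:

...
# sketch: reuse the BayesSearchCV pattern to tune a random forest (illustrative ranges)
from sklearn.ensemble import RandomForestClassifier
# define search space for the random forest hyperparameters
rf_params = dict()
rf_params['n_estimators'] = (10, 500)
rf_params['max_features'] = (1, X.shape[1])
# define and run the search using the same cross-validation configuration
rf_search = BayesSearchCV(estimator=RandomForestClassifier(), search_spaces=rf_params, n_jobs=-1, cv=cv)
rf_search.fit(X, y)
# report the best result
print(rf_search.best_score_)
print(rf_search.best_params_)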

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to use the Scikit-Optimize library to use Bayesian Optimization for hyperparameter tuning.

Specifically, you learned:

  • Scikit-Optimize provides a general toolkit for Bayesian Optimization that can be used for hyperparameter tuning.
  • How to manually use the Scikit-Optimize library to tune the hyperparameters of a machine learning model.
  • How to use the built-in BayesSearchCV class to perform model hyperparameter tuning.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Scikit-Optimize for Hyperparameter Tuning in Machine Learning appeared first on Machine Learning Mastery.

Auto-Sklearn for Automated Machine Learning in Python

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

Auto-Sklearn is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset.

In this tutorial, you will discover how to use Auto-Sklearn for AutoML with Scikit-Learn machine learning algorithms in Python.

After completing this tutorial, you will know:

  • Auto-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use Auto-Sklearn to automatically discover top-performing models for classification tasks.
  • How to use Auto-Sklearn to automatically discover top-performing models for regression tasks.

Let’s get started.

Auto-Sklearn for Automated Machine Learning in Python

Auto-Sklearn for Automated Machine Learning in Python
Photo by Richard, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. AutoML With Auto-Sklearn
  2. Install and Use Auto-Sklearn
  3. Auto-Sklearn for Classification
  4. Auto-Sklearn for Regression

AutoML With Auto-Sklearn

Automated Machine Learning, or AutoML for short, is a process of discovering the best-performing pipeline of data transforms, model, and model configuration for a dataset.

AutoML often involves the use of sophisticated optimization algorithms, such as Bayesian Optimization, to efficiently navigate the space of possible models and model configurations and quickly discover what works well for a given predictive modeling task. It allows non-expert machine learning practitioners to quickly and easily discover what works well or even best for a given dataset with very little technical background or direct input.

Auto-Sklearn is an open-source Python library for AutoML using machine learning models from the scikit-learn machine learning library.

It was developed by Matthias Feurer, et al. and described in their 2015 paper titled “Efficient and Robust Automated Machine Learning.”

… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters).

Efficient and Robust Automated Machine Learning, 2015.

The benefit of Auto-Sklearn is that, in addition to discovering the data preparation and model that perform well for a dataset, it is also able to learn from models that performed well on similar datasets and to automatically create an ensemble of top-performing models discovered as part of the optimization process.

This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.

Efficient and Robust Automated Machine Learning, 2015.

The authors provide a useful depiction of their system in the paper, provided below.

Overview of the Auto-Sklearn System

Overview of the Auto-Sklearn System.
Taken from: Efficient and Robust Automated Machine Learning, 2015.

Install and Use Auto-Sklearn

The first step is to install the Auto-Sklearn library, which can be achieved using pip, as follows:

sudo pip install auto-sklearn

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# print autosklearn version
import autosklearn
print('autosklearn: %s' % autosklearn.__version__)

Running the example prints the version number.

Your version number should be the same or higher.

autosklearn: 0.6.0

Using Auto-Sklearn is straightforward.

Depending on whether your prediction task is classification or regression, you create and configure an instance of the AutoSklearnClassifier or AutoSklearnRegressor class, fit it on your dataset, and that’s it. The resulting model can then be used to make predictions directly or saved to file (using pickle) for later use.

...
# define search
model = AutoSklearnClassifier()
# perform the search
model.fit(X_train, y_train)
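
For example, since the fitted search object behaves like a scikit-learn model, a rough sketch of saving it with pickle and reusing it later might look as follows (the file name is an arbitrary choice):

...
# sketch: save the fitted model to file and load it later to make predictions
import pickle
with open('autosklearn_model.pkl', 'wb') as f:
	pickle.dump(model, f)
# later, load the model and predict on new data
with open('autosklearn_model.pkl', 'rb') as f:
	loaded_model = pickle.load(f)
yhat = loaded_model.predict(X_test)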

There are a ton of configuration options provided as arguments to the AutoSklearnClassifier (and AutoSklearnRegressor) class.

By default, a train-test split of your dataset is used during the search, and this default is recommended for both speed and simplicity.

Importantly, you should set the “n_jobs” argument to the number of cores in your system, e.g. 8 if you have 8 cores.

The optimization process will run for as long as you allow, measured in minutes. By default, it will run for one hour.

I recommend setting the “time_left_for_this_task” argument to the number of seconds you want the process to run. E.g. less than 5-10 minutes is probably plenty for many small predictive modeling tasks (sub 1,000 rows).

We will use 5 minutes (300 seconds) for the examples in this tutorial. We will also limit the time allocated to each model evaluation to 30 seconds via the “per_run_time_limit” argument. For example:

...
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)

You can limit the algorithms considered in the search, as well as the data transforms.
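
For example, a sketch of restricting the search might look like the following; note that the argument names used here are my assumption for the 0.x API covered in this tutorial and may differ in newer releases, so check the documentation for your installed version:

...
# sketch: restrict the search to selected algorithms and transforms
# note: include_estimators/include_preprocessors are assumed for the 0.x API; newer versions use an 'include' dict
model = AutoSklearnClassifier(include_estimators=['random_forest', 'gradient_boosting'], include_preprocessors=['no_preprocessing'])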

By default, the search will create an ensemble of top-performing models discovered as part of the search. Sometimes, this can lead to overfitting and can be disabled by setting the “ensemble_size” argument to 1 and “initial_configurations_via_metalearning” to 0.

...
# define search
model = AutoSklearnClassifier(ensemble_size=1, initial_configurations_via_metalearning=0)

At the end of a run, the list of models can be accessed, as well as other details.

Perhaps the most useful feature is the sprint_statistics() function that summarizes the search and the performance of the final model.

...
# summarize performance
print(model.sprint_statistics())
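
The models that make up the final ensemble can also be printed; I believe this is exposed via the show_models() function, although the exact output format depends on your Auto-Sklearn version:

...
# sketch: inspect the ensemble members discovered during the search (assumed API)
print(model.show_models())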

Now that we are familiar with the Auto-Sklearn library, let’s look at some worked examples.

Auto-Sklearn for Classification

In this section, we will use Auto-Sklearn to discover a model for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprised of 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.
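
As a point of reference, the naive accuracy can be checked with a majority-class model; the sketch below uses scikit-learn's DummyClassifier on the same test harness and is separate from the Auto-Sklearn search itself:

# sketch: estimate the naive baseline accuracy on the sonar dataset
from numpy import mean
from pandas import read_csv
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# evaluate a majority-class model with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Baseline Accuracy: %.3f' % mean(scores))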

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We will use Auto-Sklearn to find a good model for the sonar dataset.

First, we will split the dataset into train and test sets and allow the process to find a good model on the training set, then later evaluate the performance of what was found on the holdout test set.

...
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

The AutoSklearnClassifier is configured to run for 5 minutes with 8 cores and limit each model evaluation to 30 seconds.

...
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)

The search is then performed on the training dataset.

...
# perform the search
model.fit(X_train, y_train)

Afterward, a summary of the search and best-performing model is reported.

...
# summarize
print(model.sprint_statistics())

Finally, we evaluate the performance of the model that was prepared on the holdout test dataset.

...
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Tying this together, the complete example is listed below.

# example of auto-sklearn for the sonar classification dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# print(dataframe.head())
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
# perform the search
model.fit(X_train, y_train)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Running the example will take about five minutes, given the hard limit we imposed on the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

At the end of the run, a summary is printed showing that 1,054 models were evaluated and the estimated performance of the final model was about 91.3 percent.

auto-sklearn results:
Dataset name: f4c282bd4b56d4db7e5f7fe1a6a8edeb
Metric: accuracy
Best validation score: 0.913043
Number of target algorithm runs: 1054
Number of successful target algorithm runs: 952
Number of crashed target algorithm runs: 94
Number of target algorithms that exceeded the time limit: 8
Number of target algorithms that exceeded the memory limit: 0

We then evaluate the model on the holdout dataset and see that classification accuracy of 81.2 percent was achieved, which is reasonably skillful.

Accuracy: 0.812

Auto-Sklearn for Regression

In this section, we will use Auto-Sklearn to discover a model for the auto insurance dataset.

The auto insurance dataset is a standard machine learning dataset comprised of 63 rows of data with one numerical input variable and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 66. A top-performing model can achieve a MAE on this same test harness of about 28. This provides the bounds of expected performance on this dataset.
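
As before, the naive MAE can be checked with a model that simply predicts the mean of the training target; a minimal sketch using scikit-learn's DummyRegressor on the same test harness:

# sketch: estimate the naive baseline MAE on the auto insurance dataset
from numpy import absolute, mean
from pandas import read_csv
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values.astype('float32')
X, y = data[:, :-1], data[:, -1]
# evaluate a mean-predicting model with repeated 10-fold cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DummyRegressor(strategy='mean'), X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('Baseline MAE: %.3f' % mean(absolute(scores)))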

The dataset involves predicting the total amount in claims (thousands of Swedish Kronor) given the number of claims for different geographical regions.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the auto insurance dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 63 rows of data with one input variable.

(63, 1) (63,)

We will use Auto-Sklearn to find a good model for the auto insurance dataset.

We can use the same process as was used in the previous section, although we will use the AutoSklearnRegressor class instead of the AutoSklearnClassifier.

...
# define search
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)

By default, the regressor will optimize the R^2 metric.

In this case, we are interested in the mean absolute error, or MAE, which we can specify via the “metric” argument when calling the fit() function.

...
# perform the search
model.fit(X_train, y_train, metric=auto_mean_absolute_error)

The complete example is listed below.

# example of auto-sklearn for the insurance regression dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from autosklearn.regression import AutoSklearnRegressor
from autosklearn.metrics import mean_absolute_error as auto_mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
# perform the search
model.fit(X_train, y_train, metric=auto_mean_absolute_error)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
mae = mean_absolute_error(y_test, y_hat)
print("MAE: %.3f" % mae)

Running the example will take about five minutes, given the hard limit we imposed on the run.

You might see some warning messages during the run and you can safely ignore them, such as:

Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 1.0 for quality scenarios. (Change value through "cost_for_crash"-option.)

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

At the end of the run, a summary is printed showing that 1,759 models were evaluated and the estimated performance of the final model was a MAE of about 29.9.

auto-sklearn results:
Dataset name: ff51291d93f33237099d48c48ee0f9ad
Metric: mean_absolute_error
Best validation score: 29.911203
Number of target algorithm runs: 1759
Number of successful target algorithm runs: 1362
Number of crashed target algorithm runs: 394
Number of target algorithms that exceeded the time limit: 3
Number of target algorithms that exceeded the memory limit: 0

We then evaluate the model on the holdout dataset and see that a MAE of about 26.5 was achieved, which is a great result.

MAE: 26.498

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to use Auto-Sklearn for AutoML with Scikit-Learn machine learning algorithms in Python.

Specifically, you learned:

  • Auto-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use Auto-Sklearn to automatically discover top-performing models for classification tasks.
  • How to use Auto-Sklearn to automatically discover top-performing models for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Auto-Sklearn for Automated Machine Learning in Python appeared first on Machine Learning Mastery.

TPOT for Automated Machine Learning in Python

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

TPOT is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Genetic Programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset.

In this tutorial, you will discover how to use TPOT for AutoML with Scikit-Learn machine learning algorithms in Python.

After completing this tutorial, you will know:

  • TPOT is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use TPOT to automatically discover top-performing models for classification tasks.
  • How to use TPOT to automatically discover top-performing models for regression tasks.

Let’s get started.

TPOT for Automated Machine Learning in Python

TPOT for Automated Machine Learning in Python
Photo by Gwen, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. TPOT for Automated Machine Learning
  2. Install and Use TPOT
  3. TPOT for Classification
  4. TPOT for Regression

TPOT for Automated Machine Learning

Tree-based Pipeline Optimization Tool, or TPOT for short, is a Python library for automated machine learning.

TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation and modeling algorithms and model hyperparameters.

… an evolutionary algorithm called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

An optimization procedure is then performed to find a tree structure that performs best for a given dataset. Specifically, it uses a genetic programming algorithm designed to perform a stochastic global optimization over programs represented as trees.

TPOT uses a version of genetic programming to automatically design and optimize a series of data transformations and machine learning models that attempt to maximize the classification accuracy for a given supervised learning data set.

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

The figure below taken from the TPOT paper shows the elements involved in the pipeline search, including data cleaning, feature selection, feature processing, feature construction, model selection, and hyperparameter optimization.

Overview of the TPOT Pipeline Search

Overview of the TPOT Pipeline Search
Taken from: Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

Now that we are familiar with what TPOT is, let’s look at how we can install and use TPOT to find an effective model pipeline.

Install and Use TPOT

The first step is to install the TPOT library, which can be achieved using pip, as follows:

pip install tpot

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# check tpot version
import tpot
print('tpot: %s' % tpot.__version__)

Running the example prints the version number.

Your version number should be the same or higher.

tpot: 0.11.1

Using TPOT is straightforward.

It involves creating an instance of the TPOTRegressor or TPOTClassifier class, configuring it for the search, and then exporting the model pipeline that was found to achieve the best performance on your dataset.

Configuring the class involves two main elements.

The first is how models will be evaluated, e.g. the cross-validation scheme and performance metric. I recommend explicitly specifying a cross-validation class with your chosen configuration and the performance metric to use.

For example, RepeatedKFold with the ‘neg_mean_absolute_error‘ metric for regression:

...
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTRegressor(... scoring='neg_mean_absolute_error', cv=cv)

Or RepeatedStratifiedKFold with the ‘accuracy‘ metric for classification:

...
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTClassifier(... scoring='accuracy', cv=cv)

The other element is the nature of the stochastic global search procedure.

As an evolutionary algorithm, this involves setting configuration options, such as the size of the population, the number of generations to run, and potentially the crossover and mutation rates. The population size and number of generations importantly control the extent of the search; the crossover and mutation rates can be left at their default values if evolutionary search is new to you.

For example, a modest population size of 50 or 100 and five or 10 generations is a good starting point.

...
# define search
model = TPOTClassifier(generations=5, population_size=50, ...)

At the end of a search, a Pipeline is found that performs the best.

This Pipeline can be exported as code into a Python file that you can later copy-and-paste into your own project.

...
# export the best model
model.export('tpot_model.py')

Now that we are familiar with how to use TPOT, let’s look at some worked examples with real data.

TPOT for Classification

In this section, we will use TPOT to discover a model for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprised of 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

Next, let’s use TPOT to find a good model for the sonar dataset.

First, we can define the method for evaluating models. We will use a good practice of repeated stratified k-fold cross-validation with three repeats and 10 folds.

...
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

We will use a population size of 50 for five generations for the search and use all cores on the system by setting “n_jobs” to -1.

...
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)

Finally, we can start the search and ensure that the best-performing model is saved at the end of the run.

...
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')

Tying this together, the complete example is listed below.

# example of tpot for the sonar classification dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')

Running the example may take a few minutes, and you will see a progress bar on the command line.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The accuracy of top-performing models will be reported along the way.

Generation 1 - Current best internal CV score: 0.8650793650793651
Generation 2 - Current best internal CV score: 0.8650793650793651
Generation 3 - Current best internal CV score: 0.8650793650793651
Generation 4 - Current best internal CV score: 0.8650793650793651
Generation 5 - Current best internal CV score: 0.8667460317460318

Best pipeline: GradientBoostingClassifier(GaussianNB(input_matrix), learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)

In this case, we can see that the top-performing pipeline achieved the mean accuracy of about 86.6 percent. This is a skillful model, and close to a top-performing model on this dataset.

The top-performing pipeline is then saved to a file named “tpot_sonar_best_model.py“.

Opening this file, you can see that there is some generic code for loading a dataset and fitting the pipeline. An example is listed below.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: 0.8667460317460318
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Note: as-is, this code does not execute, by design. It is a template that you can copy-and-paste into your project.

In this case, we can see that the best-performing model is a pipeline comprised of a Naive Bayes model and a Gradient Boosting model.

We can adapt this code to fit a final model on all available data and make a prediction for new data.

The complete example is listed below.

# example of fitting a final model and making a prediction on the sonar dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# Average CV score on the training set was: 0.8667460317460318
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)
# fit the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
row = [0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])

Running the example fits the best-performing model on the dataset and makes a prediction for a single row of new data.

Predicted: 1.000
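
The prediction is the encoded class integer. If you keep a reference to the LabelEncoder used to prepare the target, you can map the prediction back to the original string label; a small sketch under that assumption, reusing the variables from the example above:

...
# sketch: keep the label encoder so predictions can be mapped back to the original labels
encoder = LabelEncoder()
y = encoder.fit_transform(dataframe.values[:, -1].astype('str'))
# fit exported_pipeline on X, y as in the example above, then:
yhat = exported_pipeline.predict([row])
print('Predicted label: %s' % encoder.inverse_transform(yhat)[0])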

TPOT for Regression

In this section, we will use TPOT to discover a model for the auto insurance dataset.

The auto insurance dataset is a standard machine learning dataset comprised of 63 rows of data with one numerical input variable and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 66. A top-performing model can achieve a MAE on this same test harness of about 28. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the total amount in claims (thousands of Swedish Kronor) given the number of claims for different geographical regions.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the auto insurance dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 63 rows of data with one input variable.

(63, 1) (63,)

Next, we can use TPOT to find a good model for the auto insurance dataset.

First, we can define the method for evaluating models. We will use a good practice of repeated k-fold cross-validation with three repeats and 10 folds.

...
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

We will use a population size of 50 for 5 generations for the search and use all cores on the system by setting “n_jobs” to -1.

...
# define search
model = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)

Finally, we can start the search and ensure that the best-performing model is saved at the end of the run.

...
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_insurance_best_model.py')

Tying this together, the complete example is listed below.

# example of tpot for the insurance regression dataset
from pandas import read_csv
from sklearn.model_selection import RepeatedKFold
from tpot import TPOTRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_insurance_best_model.py')

Running the example may take a few minutes, and you will see a progress bar on the command line.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The MAE of top-performing models will be reported along the way.

Generation 1 - Current best internal CV score: -29.147625969129034
Generation 2 - Current best internal CV score: -29.147625969129034
Generation 3 - Current best internal CV score: -29.147625969129034
Generation 4 - Current best internal CV score: -29.147625969129034
Generation 5 - Current best internal CV score: -29.147625969129034

Best pipeline: LinearSVR(input_matrix, C=1.0, dual=False, epsilon=0.0001, loss=squared_epsilon_insensitive, tol=0.001)

In this case, we can see that the top-performing pipeline achieved the mean MAE of about 29.14. This is a skillful model, and close to a top-performing model on this dataset.

The top-performing pipeline is then saved to a file named “tpot_insurance_best_model.py“.

Opening this file, you can see that there is some generic code for loading a dataset and fitting the pipeline. An example is listed below.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -29.147625969129034
exported_pipeline = LinearSVR(C=1.0, dual=False, epsilon=0.0001, loss="squared_epsilon_insensitive", tol=0.001)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Note: as-is, this code does not execute, by design. It is a template that you can copy-paste into your project.

In this case, we can see that the best-performing model is a pipeline comprised of a linear support vector machine model.

We can adapt this code to fit a final model on all available data and make a prediction for new data.

The complete example is listed below.

# example of fitting a final model and making a prediction on the insurance dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# Average CV score on the training set was: -29.147625969129034
exported_pipeline = LinearSVR(C=1.0, dual=False, epsilon=0.0001, loss="squared_epsilon_insensitive", tol=0.001)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)
# fit the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
row = [108]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])

Running the example fits the best-performing model on the dataset and makes a prediction for a single row of new data.

Predicted: 389.612

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to use TPOT for AutoML with Scikit-Learn machine learning algorithms in Python.

Specifically, you learned:

  • TPOT is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use TPOT to automatically discover top-performing models for classification tasks.
  • How to use TPOT to automatically discover top-performing models for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post TPOT for Automated Machine Learning in Python appeared first on Machine Learning Mastery.

HyperOpt for Automated Machine Learning With Scikit-Learn

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

HyperOpt is an open-source library for large-scale AutoML, and HyperOpt-Sklearn is a wrapper for HyperOpt that supports AutoML for the popular Scikit-Learn machine learning library, including its suite of data preparation transforms and classification and regression algorithms.

In this tutorial, you will discover how to use HyperOpt for automatic machine learning with Scikit-Learn in Python.

After completing this tutorial, you will know:

  • Hyperopt-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use Hyperopt-Sklearn to automatically discover top-performing models for classification tasks.
  • How to use Hyperopt-Sklearn to automatically discover top-performing models for regression tasks.

Let’s get started.

HyperOpt for Automated Machine Learning With Scikit-Learn

HyperOpt for Automated Machine Learning With Scikit-Learn
Photo by Neil Williamson, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. HyperOpt and HyperOpt-Sklearn
  2. How to Install and Use HyperOpt-Sklearn
  3. HyperOpt-Sklearn for Classification
  4. HyperOpt-Sklearn for Regression

HyperOpt and HyperOpt-Sklearn

HyperOpt is an open-source Python library for Bayesian optimization developed by James Bergstra.

It is designed for large-scale optimization for models with hundreds of parameters and allows the optimization procedure to be scaled across multiple cores and multiple machines.

The library was explicitly used to optimize machine learning pipelines, including data preparation, model selection, and model hyperparameters.

Our approach is to expose the underlying expression graph of how a performance metric (e.g. classification accuracy on validation examples) is computed from hyperparameters that govern not only how individual processing steps are applied, but even which processing steps are included.

Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures, 2013.

HyperOpt is challenging to use directly, requiring the optimization procedure and search space to be carefully specified.
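
To give a sense of what this looks like, a minimal example of using HyperOpt directly involves writing your own objective function and search space; for instance, minimizing a simple quadratic:

# minimal example of using hyperopt directly on a toy objective function
from hyperopt import fmin, tpe, hp
# search for the value of x in [-10, 10] that minimizes x^2
best = fmin(fn=lambda x: x ** 2, space=hp.uniform('x', -10, 10), algo=tpe.suggest, max_evals=100)
print(best)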

An extension to HyperOpt, called HyperOpt-Sklearn, was created that allows the HyperOpt procedure to be applied to the data preparation and machine learning models provided by the popular Scikit-Learn open-source machine learning library.

HyperOpt-Sklearn wraps the HyperOpt library and allows for the automatic search of data preparation methods, machine learning algorithms, and model hyperparameters for classification and regression tasks.

… we introduce Hyperopt-Sklearn: a project that brings the benefits of automatic algorithm configuration to users of Python and scikit-learn. Hyperopt-Sklearn uses Hyperopt to describe a search space over possible configurations of Scikit-Learn components, including preprocessing and classification modules.

Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn, 2014.

Now that we are familiar with HyperOpt and HyperOpt-Sklearn, let’s look at how to use HyperOpt-Sklearn.

How to Install and Use HyperOpt-Sklearn

The first step is to install the HyperOpt library.

This can be achieved using the pip package manager as follows:

sudo pip install hyperopt

Once installed, we can confirm that the installation was successful and check the version of the library by typing the following command:

sudo pip show hyperopt

This will summarize the installed version of HyperOpt, confirming that a modern version is being used.

Name: hyperopt
Version: 0.2.3
Summary: Distributed Asynchronous Hyperparameter Optimization
Home-page: http://hyperopt.github.com/hyperopt/
Author: James Bergstra
Author-email: james.bergstra@gmail.com
License: BSD
Location: ...
Requires: tqdm, six, networkx, future, scipy, cloudpickle, numpy
Required-by:

Next, we must install the HyperOpt-Sklearn library.

This too can be installed using pip, although we must perform this operation manually by cloning the repository and running the installation from the local files, as follows:

git clone git@github.com:hyperopt/hyperopt-sklearn.git
cd hyperopt-sklearn
sudo pip install .
cd ..

Again, we can confirm that the installation was successful by checking the version number with the following command:

sudo pip show hpsklearn

This will summarize the installed version of HyperOpt-Sklearn, confirming that a modern version is being used.

Name: hpsklearn
Version: 0.0.3
Summary: Hyperparameter Optimization for sklearn
Home-page: http://hyperopt.github.com/hyperopt-sklearn/
Author: James Bergstra
Author-email: anon@anon.com
License: BSD
Location: ...
Requires: nose, scikit-learn, numpy, scipy, hyperopt
Required-by:

Now that the required libraries are installed, we can review the HyperOpt-Sklearn API.

Using HyperOpt-Sklearn is straightforward. The search process is defined by creating and configuring an instance of the HyperoptEstimator class.

The algorithm used for the search can be specified via the “algo” argument, the number of evaluations performed in the search is specified via the “max_evals” argument, and a limit can be imposed on evaluating each pipeline via the “trial_timeout” argument.

...
# define search
model = HyperoptEstimator(..., algo=tpe.suggest, max_evals=50, trial_timeout=120)

Many different optimization algorithms are available, including:

  • Random Search
  • Tree of Parzen Estimators
  • Annealing
  • Tree
  • Gaussian Process Tree

The “Tree of Parzen Estimators” (TPE) is a good default, and you can learn more about the types of algorithms in the paper “Algorithms for Hyper-Parameter Optimization.”
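
For example, random search can be used instead by passing a different suggest function to the “algo” argument; a sketch of this, assuming the rand module exposes a suggest function as in recent HyperOpt versions:

...
# sketch: use random search instead of TPE for the optimization algorithm
from hyperopt import rand
model = HyperoptEstimator(algo=rand.suggest, max_evals=50, trial_timeout=120)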

For classification tasks, the “classifier” argument specifies the search space of models, and for regression, the “regressor” argument specifies the search space of models, both of which can be set to use predefined lists of models provided by the library, e.g. “any_classifier” and “any_regressor“.

Similarly, the search space of data preparation is specified via the “preprocessing” argument and can also use a pre-defined list of preprocessing steps via “any_preprocessing”.

...
# define search
model = HyperoptEstimator(classifier=any_classifier('cla'), preprocessing=any_preprocessing('pre'), ...)

For more on the other arguments to the search, you can review the source code for the HyperoptEstimator class directly.

Once the search is defined, it can be executed by calling the fit() function.

...
# perform the search
model.fit(X_train, y_train)

At the end of the run, the best-performing model can be evaluated on new data by calling the score() function.

...
# summarize performance
acc = model.score(X_test, y_test)
print("Accuracy: %.3f" % acc)

Finally, we can retrieve the Pipeline of transforms, models, and model configurations that performed the best on the training dataset via the best_model() function.

...
# summarize the best model
print(model.best_model())

Now that we are familiar with the API, let’s look at some worked examples.

HyperOpt-Sklearn for Classification

In this section, we will use HyperOpt-Sklearn to discover a model for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprised of 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

Next, let’s use HyperOpt-Sklearn to find a good model for the sonar dataset.

We can perform some basic data preparation, including converting the target string to class labels, then split the dataset into train and test sets.

...
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Next, we can define the search procedure. We will explore all classification algorithms and all data transforms available to the library and use the TPE, or Tree of Parzen Estimators, search algorithm, described in “Algorithms for Hyper-Parameter Optimization.”

The search will evaluate 50 pipelines and limit each evaluation to 30 seconds.

...
# define search
model = HyperoptEstimator(classifier=any_classifier('cla'), preprocessing=any_preprocessing('pre'), algo=tpe.suggest, max_evals=50, trial_timeout=30)

We then start the search.

...
# perform the search
model.fit(X_train, y_train)

At the end of the run, we will report the performance of the model on the holdout dataset and summarize the best performing pipeline.

...
# summarize performance
acc = model.score(X_test, y_test)
print("Accuracy: %.3f" % acc)
# summarize the best model
print(model.best_model())

Tying this together, the complete example is listed below.

# example of hyperopt-sklearn for the sonar classification dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from hpsklearn import HyperoptEstimator
from hpsklearn import any_classifier
from hpsklearn import any_preprocessing
from hyperopt import tpe
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = HyperoptEstimator(classifier=any_classifier('cla'), preprocessing=any_preprocessing('pre'), algo=tpe.suggest, max_evals=50, trial_timeout=30)
# perform the search
model.fit(X_train, y_train)
# summarize performance
acc = model.score(X_test, y_test)
print("Accuracy: %.3f" % acc)
# summarize the best model
print(model.best_model())

Running the example may take a few minutes.

The progress of the search will be reported and you will see some warnings that you can safely ignore.

At the end of the run, the best-performing model is evaluated on the holdout dataset and the Pipeline discovered is printed for later use.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the chosen model achieved an accuracy of about 85.5 percent on the holdout test set. The Pipeline involves a gradient boosting model with no pre-processing.

Accuracy: 0.855
{'learner': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.009132299586303643, loss='deviance',
                           max_depth=None, max_features='sqrt',
                           max_leaf_nodes=None, min_impurity_decrease=0.0,
                           min_impurity_split=None, min_samples_leaf=1,
                           min_samples_split=2, min_weight_fraction_leaf=0.0,
                           n_estimators=342, n_iter_no_change=None,
                           presort='auto', random_state=2,
                           subsample=0.6844206624548879, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False), 'preprocs': (), 'ex_preprocs': ()}

The printed model can then be used directly, e.g. the code copy-pasted into another project.
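
For example, a minimal sketch of reusing the discovered configuration is given below. It assumes the variables from the complete example above are still in memory; the hyperparameter values are those printed by the run above and will differ for your run.

...
# re-create the discovered model using the printed hyperparameters (values from the run above)
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(learning_rate=0.009132299586303643, max_features='sqrt',
    n_estimators=342, subsample=0.6844206624548879, random_state=2)
# fit on the training data and predict the first row of the test set
model.fit(X_train, y_train)
print(model.predict(X_test[:1]))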

Next, let’s take a look at using HyperOpt-Sklearn for a regression predictive modeling problem.

HyperOpt-Sklearn for Regression

In this section, we will use HyperOpt-Sklearn to discover a model for the housing dataset.

The housing dataset is a standard machine learning dataset comprised of 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 506 rows of data with 13 input variables.

(506, 13) (506,)

Next, we can use HyperOpt-Sklearn to find a good model for the housing dataset.

Using HyperOpt-Sklearn for regression is the same as using it for classification, except the “regressor” argument must be specified.

In this case, we want to optimize the MAE, therefore, we will set the “loss_fn” argument to the mean_absolute_error() function provided by the scikit-learn library.

...
# define search
model = HyperoptEstimator(regressor=any_regressor('reg'), preprocessing=any_preprocessing('pre'), loss_fn=mean_absolute_error, algo=tpe.suggest, max_evals=50, trial_timeout=30)

Tying this together, the complete example is listed below.

# example of hyperopt-sklearn for the housing regression dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from hpsklearn import HyperoptEstimator
from hpsklearn import any_regressor
from hpsklearn import any_preprocessing
from hyperopt import tpe
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = HyperoptEstimator(regressor=any_regressor('reg'), preprocessing=any_preprocessing('pre'), loss_fn=mean_absolute_error, algo=tpe.suggest, max_evals=50, trial_timeout=30)
# perform the search
model.fit(X_train, y_train)
# summarize performance
mae = model.score(X_test, y_test)
print("MAE: %.3f" % mae)
# summarize the best model
print(model.best_model())

Running the example may take a few minutes.

The progress of the search will be reported and you will see some warnings that you can safely ignore.

At the end of the run, the best performing model is evaluated on the holdout dataset and the Pipeline discovered is printed for later use.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the chosen model achieved a MAE of about 0.883 on the holdout test set, which appears skillful. The Pipeline involves an XGBRegressor model with no pre-processing.

Note: for the search to use XGBoost, you must have the XGBoost library installed.

MAE: 0.883
{'learner': XGBRegressor(base_score=0.5, booster='gbtree',
             colsample_bylevel=0.5843250948679669, colsample_bynode=1,
             colsample_bytree=0.6635160670570662, gamma=6.923399395303031e-05,
             importance_type='gain', learning_rate=0.07021104887683309,
             max_delta_step=0, max_depth=3, min_child_weight=5, missing=nan,
             n_estimators=4000, n_jobs=1, nthread=None, objective='reg:linear',
             random_state=0, reg_alpha=0.5690202874759704,
             reg_lambda=3.3098341637038, scale_pos_weight=1, seed=1,
             silent=None, subsample=0.7194797262656784, verbosity=1), 'preprocs': (), 'ex_preprocs': ()}

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to use HyperOpt for automatic machine learning with Scikit-Learn in Python.

Specifically, you learned:

  • Hyperopt-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
  • How to use Hyperopt-Sklearn to automatically discover top-performing models for classification tasks.
  • How to use Hyperopt-Sklearn to automatically discover top-performing models for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post HyperOpt for Automated Machine Learning With Scikit-Learn appeared first on Machine Learning Mastery.


Hyperparameter Optimization With Random Search and Grid Search


Machine learning models have hyperparameters that you must set in order to customize the model to your dataset.

Often the general effects of hyperparameters on a model are known, but how to best set a hyperparameter and combinations of interacting hyperparameters for a given dataset is challenging. There are often general heuristics or rules of thumb for configuring hyperparameters.

A better approach is to objectively search different values for model hyperparameters and choose a subset that results in a model that achieves the best performance on a given dataset. This is called hyperparameter optimization or hyperparameter tuning and is available in the scikit-learn Python machine learning library. The result of a hyperparameter optimization is a single set of well-performing hyperparameters that you can use to configure your model.

In this tutorial, you will discover hyperparameter optimization for machine learning in Python.

After completing this tutorial, you will know:

  • Hyperparameter optimization is required to get the most out of your machine learning models.
  • How to configure random and grid search hyperparameter optimization for classification tasks.
  • How to configure random and grid search hyperparameter optimization for regression tasks.

Let’s get started.

Hyperparameter Optimization With Random Search and Grid Search

Hyperparameter Optimization With Random Search and Grid Search
Photo by James St. John, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Model Hyperparameter Optimization
  2. Hyperparameter Optimization Scikit-Learn API
  3. Hyperparameter Optimization for Classification
    1. Random Search for Classification
    2. Grid Search for Classification
  4. Hyperparameter Optimization for Regression
    1. Random Search for Regression
    2. Grid Search for Regression
  5. Common Questions About Hyperparameter Optimization

Model Hyperparameter Optimization

Machine learning models have hyperparameters.

Hyperparameters are points of choice or configuration that allow a machine learning model to be customized for a specific task or dataset.

  • Hyperparameter: Model configuration argument specified by the developer to guide the learning process for a specific dataset.

Machine learning models also have parameters, which are the internal coefficients set by training or optimizing the model on a training dataset.

Parameters are different from hyperparameters. Parameters are learned automatically; hyperparameters are set manually to help guide the learning process.

For more on the difference between parameters and hyperparameters, see the tutorial:

Typically a hyperparameter has a known effect on a model in the general sense, but it is not clear how to best set a hyperparameter for a given dataset. Further, many machine learning models have a range of hyperparameters and they may interact in nonlinear ways.

As such, it is often required to search for a set of hyperparameters that result in the best performance of a model on a dataset. This is called hyperparameter optimization, hyperparameter tuning, or hyperparameter search.

An optimization procedure involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of the dimension are the values that the hyperparameter may take on, such as real-valued, integer-valued, or categorical.

  • Search Space: Volume to be searched where each dimension represents a hyperparameter and each point represents one model configuration.

A point in the search space is a vector with a specific value for each hyperparameter value. The goal of the optimization procedure is to find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

  • Random Search. Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
  • Grid Search. Define a search space as a grid of hyperparameter values and evaluate every position in the grid.

Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.

More advanced methods are sometimes used, such as Bayesian Optimization and Evolutionary Optimization.

Now that we are familiar with hyperparameter optimization, let’s look at how we can use this method in Python.

Hyperparameter Optimization Scikit-Learn API

The scikit-learn Python open-source machine learning library provides techniques to tune model hyperparameters.

Specifically, it provides the RandomizedSearchCV for random search and GridSearchCV for grid search. Both techniques evaluate models for a given hyperparameter vector using cross-validation, hence the “CV” suffix of each class name.

Both classes require two arguments. The first is the model that you are optimizing. This is an instance of the model with values of hyperparameters set that you do not want to optimize. The second is the search space. This is defined as a dictionary where the names are the hyperparameter arguments to the model and the values are discrete values or a distribution of values to sample in the case of a random search.

...
# define model
model = LogisticRegression()
# define search space
space = dict()
...
# define search
search = GridSearchCV(model, space)

Both classes provide a “cv” argument that allows either an integer number of folds to be specified, e.g. 5, or a configured cross-validation object. I recommend defining and specifying a cross-validation object to gain more control over model evaluation and make the evaluation procedure obvious and explicit.

In the case of classification tasks, I recommend using the RepeatedStratifiedKFold class, and for regression tasks, I recommend using the RepeatedKFold with an appropriate number of folds and repeats, such as 10 folds and three repeats.

...
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
search = GridSearchCV(..., cv=cv)

Both hyperparameter optimization classes also provide a “scoring” argument that takes a string indicating the metric to optimize.

The metric must be maximizing, meaning better models result in larger scores. For classification, this may be ‘accuracy‘. For regression, this is a negative error measure, such as ‘neg_mean_absolute_error‘ for a negative version of the mean absolute error, where values closer to zero represent less prediction error by the model.

...
# define search
search = GridSearchCV(..., scoring='neg_mean_absolute_error')

You can see a list of built-in scoring metrics here:

Finally, the search can be made parallel, e.g. use all of the CPU cores by specifying the “n_jobs” argument as an integer with the number of cores in your system, e.g. 8. Or you can set it to be -1 to automatically use all of the cores in your system.

...
# define search
search = GridSearchCV(..., n_jobs=-1)

Once defined, the search is performed by calling the fit() function and providing a dataset used to train and evaluate model hyperparameter combinations using cross-validation.

...
# execute search
result = search.fit(X, y)

Running the search may take minutes or hours, depending on the size of the search space and the speed of your hardware. You will often want to tailor the search to how much time you have rather than to the full space of what could be searched.

At the end of the search, you can access all of the results via attributes on the class. Perhaps the most important attributes are the best score observed and the hyperparameters that achieved the best score.

...
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Once you know the set of hyperparameters that achieve the best result, you can then define a new model, set the values of each hyperparameter, then fit the model on all available data. This model can then be used to make predictions on new data.
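
For example, a minimal sketch of this final step is given below. It assumes the model being tuned was the LogisticRegression from the earlier snippet and that result is the completed search. Note that with the default refit=True, result.best_estimator_ already provides an equivalent model refit on the data passed to fit().

...
# define a final model using the best hyperparameters found by the search
final_model = LogisticRegression(**result.best_params_)
# fit the final model on all available data
final_model.fit(X, y)
# make a prediction for new data (here, the first row of X as a stand-in for a new sample)
yhat = final_model.predict(X[:1])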

Now that we are familiar with the hyperparameter optimization API in scikit-learn, let’s look at some worked examples.

Hyperparameter Optimization for Classification

In this section, we will use hyperparameter optimization to discover a well-performing model configuration for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprising 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

Next, let’s use random search to find a good model configuration for the sonar dataset.

To keep things simple, we will focus on a linear model, the logistic regression model, and the common hyperparameters tuned for this model.

Random Search for Classification

In this section, we will explore hyperparameter optimization of the logistic regression model on the sonar dataset.

First, we will define the model that will be optimized and use default values for the hyperparameters that will not be optimized.

...
# define model
model = LogisticRegression()

We will evaluate model configurations using repeated stratified k-fold cross-validation with three repeats and 10 folds.

...
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

Next, we can define the search space.

This is a dictionary where names are arguments to the model and values are distributions from which to draw samples. We will optimize the solver, the penalty, and the C hyperparameters of the model with discrete distributions for the solver and penalty type and a log-uniform distribution from 1e-5 to 100 for the C value.

Log-uniform is useful for searching penalty values as we often explore values at different orders of magnitude, at least as a first step.
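
As a quick illustration (separate from the search itself), drawing a few samples from a log-uniform distribution shows how the values spread across several orders of magnitude:

# illustrate sampling values from a log-uniform distribution
from scipy.stats import loguniform
# draw five example values between 1e-5 and 100
print(loguniform(1e-5, 100).rvs(5, random_state=1))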

...
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)

Next, we can define the search procedure with all of these elements.

Importantly, we must set the number of iterations or samples to draw from the search space via the “n_iter” argument. In this case, we will set it to 500.

...
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

Finally, we can perform the optimization and report the results.

...
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Tying this together, the complete example is listed below.

# random search logistic regression model on the sonar dataset
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = LogisticRegression()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Running the example may take a minute. It is fast because we are using a small search space and a fast model to fit and evaluate. You may see some warnings during the optimization for invalid configuration combinations. These can be safely ignored.

At the end of the run, the best score and hyperparameter configuration that achieved the best performance are reported.

Your specific results will vary given the stochastic nature of the optimization procedure. Try running the example a few times.

In this case, we can see that the best configuration achieved an accuracy of about 78.9 percent, which is fair, as well as the specific values for the solver, penalty, and C hyperparameters used to achieve that score.

Best Score: 0.7897619047619049
Best Hyperparameters: {'C': 4.878363034905756, 'penalty': 'l2', 'solver': 'newton-cg'}

Next, let’s use grid search to find a good model configuration for the sonar dataset.

Grid Search for Classification

Using the grid search is much like using the random search for classification.

The main difference is that the search space must be a discrete grid to be searched. This means that instead of using a log-uniform distribution for C, we can specify discrete values on a log scale.

...
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

Additionally, the GridSearchCV class does not take a number of iterations, as we are only evaluating combinations of hyperparameters in the grid.

...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

Tying this together, the complete example of grid searching logistic regression configurations for the sonar dataset is listed below.

# grid search logistic regression model on the sonar dataset
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = LogisticRegression()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Running the example may take a moment. It is fast because we are using a small search space and a fast model to fit and evaluate. Again, you may see some warnings during the optimization for invalid configuration combinations. These can be safely ignored.

At the end of the run, the best score and hyperparameter configuration that achieved the best performance are reported.

Your specific results will vary given the stochastic nature of the optimization procedure. Try running the example a few times.

In this case, we can see that the best configuration achieved an accuracy of about 78.2 percent, which is also fair, as well as the specific values for the solver, penalty, and C hyperparameters used to achieve that score. Interestingly, the results are very similar to those found via the random search.

Best Score: 0.7828571428571429
Best Hyperparameters: {'C': 1, 'penalty': 'l2', 'solver': 'newton-cg'}

Hyperparameter Optimization for Regression

In this section, we will use hyperparameter optimization to discover a top-performing model configuration for the auto insurance dataset.

The auto insurance dataset is a standard machine learning dataset comprising 63 rows of data with 1 numerical input variable and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 66. A top-performing model can achieve a MAE on this same test harness of about 28. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the total amount in claims (thousands of Swedish Kronor) given the number of claims for different geographical regions.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the auto insurance dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 63 rows of data with 1 input variable.

(63, 1) (63,)

Next, we can use hyperparameter optimization to find a good model configuration for the auto insurance dataset.

To keep things simple, we will focus on a linear model, the ridge regression model (a regularized linear regression), and the common hyperparameters tuned for this model.

Random Search for Regression

Configuring and using the random search hyperparameter optimization procedure for regression is much like using it for classification.

In this case, we will configure the important hyperparameters of the ridge regression implementation (the Ridge class), including the solver, alpha, fit_intercept, and normalize.

We will use a discrete distribution of values in the search space for all except the “alpha” argument which is a penalty term, in which case we will use a log-uniform distribution as we did in the previous section for the “C” argument of logistic regression.

...
# define search space
space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = loguniform(1e-5, 100)
space['fit_intercept'] = [True, False]
space['normalize'] = [True, False]

The main difference in regression compared to classification is the choice of the scoring method.

For regression, performance is often measured using an error, which is minimized, with zero representing a model with perfect skill. The hyperparameter optimization procedures in scikit-learn assume a maximizing score. Therefore a version of each error metric is provided that is made negative.

This means that large positive errors become large negative errors, good performance is indicated by small negative values close to zero, and perfect skill is zero.

The sign of the negative MAE can be ignored when interpreting the result.
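
For example, if the search reports a best score of -29.2, the corresponding MAE is simply 29.2. A minimal sketch of the conversion after the search has run:

...
# convert the maximizing (negative) score reported by the search back into a MAE
mae = -result.best_score_
print('MAE: %.3f' % mae)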

In this case, we will use the mean absolute error (MAE), and a maximizing version of this error is available by setting the “scoring” argument to “neg_mean_absolute_error“.

...
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=1)

Tying this together, the complete example is listed below.

# random search linear regression model on the auto insurance dataset
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge()
# define evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = loguniform(1e-5, 100)
space['fit_intercept'] = [True, False]
space['normalize'] = [True, False]
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=1)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Running the example may take a moment. It is fast because we are using a small search space and a fast model to fit and evaluate. You may see some warnings during the optimization for invalid configuration combinations. These can be safely ignored.

At the end of the run, the best score and hyperparameter configuration that achieved the best performance are reported.

Your specific results will vary given the stochastic nature of the optimization procedure. Try running the example a few times.

In this case, we can see that the best configuration achieved a MAE of about 29.2, which is close to the best expected performance on this dataset of about 28. We can then see the specific hyperparameter values that achieved this result.

Best Score: -29.23046315344758
Best Hyperparameters: {'alpha': 0.008301451461243866, 'fit_intercept': True, 'normalize': True, 'solver': 'sag'}

Next, let’s use grid search to find a good model configuration for the auto insurance dataset.

Grid Search for Regression

As with the grid search for classification, we cannot define a distribution to sample from and instead must define a discrete grid of hyperparameter values. As such, we will specify the “alpha” argument as a range of values on a log-10 scale.

...
# define search space
space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
space['fit_intercept'] = [True, False]
space['normalize'] = [True, False]

Grid search for regression requires that the “scoring” be specified, much as we did for random search.

In this case, we will again use the negative MAE scoring function.

...
# define search
search = GridSearchCV(model, space, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)

Tying this together, the complete example of grid searching linear regression configurations for the auto insurance dataset is listed below.

# grid search linear regression model on the auto insurance dataset
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge()
# define evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
space['fit_intercept'] = [True, False]
space['normalize'] = [True, False]
# define search
search = GridSearchCV(model, space, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Running the example may take a minute. It is fast because we are using a small search space and a fast model to fit and evaluate. Again, you may see some warnings during the optimization for invalid configuration combinations. These can be safely ignored.

At the end of the run, the best score and hyperparameter configuration that achieved the best performance are reported.

Your specific results will vary given the stochastic nature of the optimization procedure. Try running the example a few times.

In this case, we can see that the best configuration achieved a MAE of about 29.2, which is nearly identical to what we achieved with the random search in the previous section. Interestingly, the hyperparameters are also nearly identical, which is good confirmation.

Best Score: -29.275708614337326
Best Hyperparameters: {'alpha': 0.1, 'fit_intercept': True, 'normalize': False, 'solver': 'sag'}

Common Questions About Hyperparameter Optimization

This section addresses some common questions about hyperparameter optimization.

How to Choose Between Random and Grid Search?

Choose the method based on your needs. I recommend starting with a grid search and then performing a random search if you have the time.

Grid search is appropriate for small and quick searches of hyperparameter values that are known to perform well generally.

Random search is appropriate for discovering new hyperparameter values or new combinations of hyperparameters, often resulting in better performance, although it may take more time to complete.

How to Speed-Up Hyperparameter Optimization?

Ensure that you set the “n_jobs” argument to the number of cores on your machine.

After that, more suggestions include:

  • Evaluate on a smaller sample of your dataset.
  • Explore a smaller search space.
  • Use fewer repeats and/or folds for cross-validation.
  • Execute the search on a faster machine, such as AWS EC2.
  • Use an alternate model that is faster to evaluate.

How to Choose Hyperparameters to Search?

Most algorithms have a subset of hyperparameters that have the most influence over model performance.

These are listed in most descriptions of the algorithm. For example, here are some algorithms and their most important hyperparameters:

If you are unsure:

  • Review papers that use the algorithm to get ideas.
  • Review the API and algorithm documentation to get ideas.
  • Search all hyperparameters.

How to Use Best-Performing Hyperparameters?

Define a new model and set the hyperparameter values of the model to the values found by the search.

Then fit the model on all available data and use the model to start making predictions on new data.

This is called preparing a final model. See more here:

How to Make a Prediction?

First, fit a final model (previous question).

Then call the predict() function to make a prediction.

For examples of making a prediction with a final model, see the tutorial:

Do you have another question about hyperparameter optimization?
Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered hyperparameter optimization for machine learning in Python.

Specifically, you learned:

  • Hyperparameter optimization is required to get the most out of your machine learning models.
  • How to configure random and grid search hyperparameter optimization for classification tasks.
  • How to configure random and grid search hyperparameter optimization for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Hyperparameter Optimization With Random Search and Grid Search appeared first on Machine Learning Mastery.

Combined Algorithm Selection and Hyperparameter Optimization (CASH Optimization)


Machine learning model selection and configuration may be the biggest challenge in applied machine learning.

Controlled experiments must be performed in order to discover what works best for a given classification or regression predictive modeling task. This can feel overwhelming given the large number of data preparation schemes, learning algorithms, and model hyperparameters that could be considered.

The common approach is to use a shortcut, such as using a popular algorithm or testing a small number of algorithms with default hyperparameters.

A modern alternative is to consider the selection of data preparation, learning algorithm, and algorithm hyperparameters one large global optimization problem. This characterization is generally referred to as Combined Algorithm Selection and Hyperparameter Optimization, or “CASH Optimization” for short.

In this post, you will discover the challenge of machine learning model selection and the modern solution referred to as CASH Optimization.

After reading this post, you will know:

  • The challenge of machine learning model and hyperparameter selection.
  • The shortcuts of using popular models or making a series of sequential decisions.
  • The characterization of Combined Algorithm Selection and Hyperparameter Optimization that underlies modern AutoML.

Let’s get started.

Combined Algorithm Selection and Hyperparameter Optimization (CASH Optimization)

Combined Algorithm Selection and Hyperparameter Optimization (CASH Optimization)
Photo by Bernard Spragg. NZ, some rights reserved.

Overview

This tutorial is divided into three parts; they are:

  1. Challenge of Model and Hyperparameter Selection
  2. Solutions to Model and Hyperparameter Selection
  3. Combined Algorithm Selection and Hyperparameter Optimization

Challenge of Model and Hyperparameter Selection

There is no definitive mapping of machine learning algorithms to predictive modeling tasks.

We cannot look at a dataset and know the best algorithm to use, let alone the best data transforms to use to prepare the data or the best configuration for a given model.

Instead, we must use controlled experiments to discover what works best for a given dataset.

As such, applied machine learning is an empirical discipline. It is engineering and art more than science.

The problem is that there are tens, if not hundreds, of machine learning algorithms to choose from. Each algorithm may have up to tens of hyperparameters to be configured.

To a beginner, the scope of the problem is overwhelming.

  • Where do you start?
  • What do you start with?
  • When do you discard a model?
  • When do you double down on a model?

There are a few standard solutions to this problem adopted by most practitioners, experienced and otherwise.

Solutions to Model and Hyperparameter Selection

Let’s look at two of the most common short-cuts to this problem of selecting data transforms, machine learning models, and model hyperparameters.

Use a Popular Algorithm

One approach is to use a popular machine learning algorithm.

It can be challenging to make the right choice when faced with these degrees of freedom, leaving many users to select algorithms based on reputation or intuitive appeal, and/or to leave hyperparameters set to default values. Of course, this approach can yield performance far worse than that of the best method and hyperparameter settings.

Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms, 2012.

For example, if it seems like everyone is talking about “random forest,” then random forest becomes the right algorithm for all classification and regression problems you encounter, and you limit the experimentation to the hyperparameters of the random forest algorithm.

  • Short-Cut #1: Use a popular algorithm like “random forest” or “xgboost“.

Random forest indeed performs well on a wide range of prediction tasks. But we cannot know if it will be good or even best for a given dataset. The risk is that we may be able to achieve better results with a much simpler linear model.

A workaround might be to test a range of popular algorithms, leading into the next shortcut.

Sequentially Test Transforms, Models, and Hyperparameters

Another approach is to treat the problem as a series of sequential decisions.

For example, review the data and select data transforms that make data more Gaussian, remove outliers, etc. Then test a suite of algorithms with default hyperparameters and select one or a few that perform well. Then tune the hyperparameters of those top-performing models.

  • Short-Cut #2: Sequentially select data transforms, models, and model hyperparameters.

This is the approach that I recommend for getting good results quickly; for example:

This short-cut too can be effective and reduces the likelihood of missing an algorithm that performs well on your dataset. The downside here is more subtle and impacts you if you are seeking great or excellent results rather than merely good results quickly.

The risk is that selecting data transforms prior to selecting models might mean that you miss the data preparation sequence that gets the most out of an algorithm.

Similarly, selecting a model or subset of models prior to selecting model hyperparameters means that you might be missing a model with hyperparameters other than the default values that performs better than any of the subset of models selected and their subsequent configurations.

Two important problems in AutoML are that (1) no single machine learning method performs best on all datasets and (2) some machine learning methods (e.g., non-linear SVMs) crucially rely on hyperparameter optimization.

— Page 115, Automated Machine Learning: Methods, Systems, Challenges, 2019.

A workaround might be to spot check good or well-performing configurations of each algorithm as part of the algorithm spot check. This is only a partial solution.

There is a better approach.

Combined Algorithm Selection and Hyperparameter Optimization

Selecting a data preparation pipeline, machine learning model, and model hyperparameters is a search problem.

The possible choices at each step define a search space, and a single combination represents a point in that space that can be evaluated with a dataset.

Navigating the search space efficiently is referred to as global optimization.

This has been well understood for a long time in the field of machine learning, although perhaps tacitly, with focus typically on one element of the problem, such as hyperparameter optimization.

The important insight is that there are dependencies between each step, which influences the size and structure of the search space.

… [the problem] can be viewed as a single hierarchical hyperparameter optimization problem, in which even the choice of algorithm itself is considered a hyperparameter.

— Page 82, Automated Machine Learning: Methods, Systems, Challenges, 2019.

This requires that the data preparation and machine learning model, along with the model hyperparameters, must form the scope of the optimization problem and that the optimization algorithm must be aware of the dependencies between them.
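
As a rough, hypothetical sketch of what such a hierarchical search space can look like (using the notation of the hyperopt library; the algorithms and value ranges below are purely illustrative), the choice of algorithm is itself a top-level hyperparameter with conditional sub-spaces:

# hypothetical sketch of a hierarchical (conditional) search space using hyperopt
from hyperopt import hp
# note: hp.loguniform bounds are given in log space, i.e. values range from exp(low) to exp(high)
space = hp.choice('algorithm', [
    # each branch is only active when its algorithm is chosen
    {'type': 'logistic', 'C': hp.loguniform('lr_C', -5, 2)},
    {'type': 'svm', 'C': hp.loguniform('svm_C', -5, 2), 'kernel': hp.choice('svm_kernel', ['rbf', 'poly'])},
    {'type': 'random_forest', 'n_estimators': hp.quniform('rf_trees', 10, 1000, 10)},
])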

This is a challenging global optimization problem, notably because of the dependencies, but also because estimating the performance of a machine learning model on a dataset is stochastic, resulting in a noisy distribution of performance scores (e.g. via repeated k-fold cross-validation).

… the combined space of learning algorithms and their hyperparameters is very challenging to search: the response function is noisy and the space is high dimensional, involves both categorical and continuous choices, and contains hierarchical dependencies (e.g., the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen; the algorithm choices in an ensemble method are only meaningful if that ensemble method is chosen; etc).

Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms, 2012.

This challenge was perhaps best characterized by Chris Thornton, et al. in their 2013 paper titled “Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms.” In the paper, they refer to this problem as “Combined Algorithm Selection And Hyperparameter Optimization,” or “CASH Optimization” for short.

… a natural challenge for machine learning: given a dataset, to automatically and simultaneously choose a learning algorithm and set its hyperparameters to optimize empirical performance. We dub this the combined algorithm selection and hyperparameter optimization problem (short: CASH).

Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms, 2012.

This characterization is also sometimes referred to as “Full Model Selection,” or FMS for short.

The FMS problem consists of the following: given a pool of preprocessing methods, feature selection and learning algorithms, select the combination of these that obtains the lowest classification error for a given data set. This task also includes the selection of hyperparameters for the considered methods, resulting in a vast search space that is well suited for stochastic optimization techniques.

Particle Swarm Model Selection, 2009.

Thornton, et al. proceeded to use global optimization algorithms that are aware of these dependencies, so-called sequential global optimization algorithms, such as specific versions of Bayesian Optimization. They then implemented their approach for the WEKA machine learning workbench in a project called Auto-WEKA.

A promising approach is Bayesian Optimization, and in particular Sequential Model-Based Optimization (SMBO), a versatile stochastic optimization framework that can work with both categorical and continuous hyperparameters, and that can exploit hierarchical structure stemming from conditional parameters.

— Page 85, Automated Machine Learning: Methods, Systems, Challenges, 2019.

This now provides the dominant paradigm for a field of study referred to as “Automated Machine Learning,” or AutoML for short. AutoML is concerned with providing tools that allow practitioners with modest technical skill to quickly find effective solutions to machine learning tasks, such as classification and regression predictive modeling.

AutoML aims to provide effective off-the-shelf learning systems to free experts and non-experts alike from the tedious and time-consuming tasks of selecting the right algorithm for a dataset at hand, along with the right preprocessing method and the various hyperparameters of all involved components.

— Page 136, Automated Machine Learning: Methods, Systems, Challenges, 2019.

AutoML techniques are provided by machine learning libraries and increasingly as services, so-called machine learning as a service, or MLaaS for short.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

Articles

Summary

In this post, you discovered the challenge of machine learning model selection and the modern solution referred to as CASH Optimization.

Specifically, you learned:

  • The challenge of machine learning model and hyperparameter selection.
  • The shortcuts of using popular models or making a series of sequential decisions.
  • The characterization of Combined Algorithm Selection and Hyperparameter Optimization that underlies modern AutoML.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Combined Algorithm Selection and Hyperparameter Optimization (CASH Optimization) appeared first on Machine Learning Mastery.

Automated Machine Learning (AutoML) Libraries for Python


AutoML provides tools to automatically discover good machine learning model pipelines for a dataset with very little user intervention.

It is ideal for domain experts new to machine learning or machine learning practitioners looking to get good results quickly for a predictive modeling task.

Open-source libraries are available for using AutoML methods with popular machine learning libraries in Python, such as the scikit-learn machine learning library.

In this tutorial, you will discover how to use top open-source AutoML libraries for scikit-learn in Python.

After completing this tutorial, you will know:

  • AutoML refers to techniques for automatically and quickly discovering a well-performing machine learning model pipeline for a predictive modeling task.
  • The three most popular AutoML libraries for Scikit-Learn are Hyperopt-Sklearn, Auto-Sklearn, and TPOT.
  • How to use AutoML libraries to discover well-performing models for predictive modeling tasks in Python.

Let’s get started.

Automated Machine Learning (AutoML) Libraries for Python

Automated Machine Learning (AutoML) Libraries for Python
Photo by Michael Coghlan, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Automated Machine Learning
  2. Auto-Sklearn
  3. Tree-based Pipeline Optimization Tool (TPOT)
  4. Hyperopt-Sklearn

Automated Machine Learning

Automated Machine Learning, or AutoML for short, involves the automatic selection of data preparation, machine learning model, and model hyperparameters for a predictive modeling task.

It refers to techniques that allow semi-sophisticated machine learning practitioners and non-experts to discover a good predictive model pipeline for their machine learning task quickly, with very little intervention other than providing a dataset.

… the user simply provides data, and the AutoML system automatically determines the approach that performs best for this particular application. Thereby, AutoML makes state-of-the-art machine learning approaches accessible to domain scientists who are interested in applying machine learning but do not have the resources to learn about the technologies behind it in detail.

— Page ix, Automated Machine Learning: Methods, Systems, Challenges, 2019.

Central to the approach is defining a large hierarchical optimization problem that involves identifying data transforms and the machine learning models themselves, in addition to the hyperparameters for the models.

Many companies now offer AutoML as a service, where a dataset is uploaded and a model pipeline can be downloaded or hosted and used via web service (i.e. MLaaS). Popular examples include service offerings from Google, Microsoft, and Amazon.

Additionally, open-source libraries are available that implement AutoML techniques, focusing on the specific data transforms, models, and hyperparameters used in the search space and the types of algorithms used to navigate or optimize the search space of possibilities, with versions of Bayesian Optimization being the most common.

There are many open-source AutoML libraries, although, in this tutorial, we will focus on the best-of-breed libraries that can be used in conjunction with the popular scikit-learn Python machine learning library.

They are: Hyperopt-Sklearn, Auto-Sklearn, and TPOT.

Did I miss your favorite AutoML library for scikit-learn?
Let me know in the comments below.

We will take a closer look at each, providing the basis for you to evaluate and consider which library might be appropriate for your project.

Auto-Sklearn

Auto-Sklearn is an open-source Python library for AutoML using machine learning models from the scikit-learn machine learning library.

It was developed by Matthias Feurer, et al. and described in their 2015 paper titled “Efficient and Robust Automated Machine Learning.”

… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters).

Efficient and Robust Automated Machine Learning, 2015.

The first step is to install the Auto-Sklearn library, which can be achieved using pip, as follows:

sudo pip install auto-sklearn

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# print autosklearn version
import autosklearn
print('autosklearn: %s' % autosklearn.__version__)

Running the example prints the version number. Your version number should be the same or higher.

autosklearn: 0.6.0

Next, we can demonstrate using Auto-Sklearn on a synthetic classification task.

We can define an AutoSklearnClassifier class that controls the search and configure it to run for two minutes (120 seconds) and kill any single model that takes more than 30 seconds to evaluate. At the end of the run, we can report the statistics of the search and evaluate the best performing model on a holdout dataset.

The complete example is listed below.

# example of auto-sklearn for a classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier
# define dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = AutoSklearnClassifier(time_left_for_this_task=2*60, per_run_time_limit=30, n_jobs=8)
# perform the search
model.fit(X_train, y_train)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Running the example will take about two minutes, given the hard limit we imposed on the run.

At the end of the run, a summary is printed showing that 653 models were evaluated (599 of them successfully) and that the best model achieved an estimated validation accuracy of about 95.6 percent.

auto-sklearn results:
Dataset name: 771625f7c0142be6ac52bcd108459927
Metric: accuracy
Best validation score: 0.956522
Number of target algorithm runs: 653
Number of successful target algorithm runs: 599
Number of crashed target algorithm runs: 54
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0

We then evaluate the model on the holdout dataset and see that a classification accuracy of 97 percent was achieved, which is reasonably skillful.

Accuracy: 0.970

For more on the Auto-Sklearn library, see:

Tree-based Pipeline Optimization Tool (TPOT)

Tree-based Pipeline Optimization Tool, or TPOT for short, is a Python library for automated machine learning.

TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation and modeling algorithms, and model hyperparameters.

… an evolutionary algorithm called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016.

The first step is to install the TPOT library, which can be achieved using pip, as follows:

pip install tpot

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# check tpot version
import tpot
print('tpot: %s' % tpot.__version__)

Running the example prints the version number. Your version number should be the same or higher.

tpot: 0.11.1

Next, we can demonstrate using TPOT on a synthetic classification task.

This involves configuring a TPOTClassifier instance with the population size and number of generations for the evolutionary search, as well as the cross-validation procedure and metric used to evaluate models. The algorithm will then run the search procedure and save the best discovered model pipeline to file.

The complete example is listed below.

# example of tpot for a classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier
# define dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_best_model.py')

Running the example may take a few minutes, and you will see a progress bar on the command line.

The accuracy of top-performing models will be reported along the way.

Your specific results will vary given the stochastic nature of the search procedure.

Generation 1 - Current best internal CV score: 0.9166666666666666
Generation 2 - Current best internal CV score: 0.9166666666666666
Generation 3 - Current best internal CV score: 0.9266666666666666
Generation 4 - Current best internal CV score: 0.9266666666666666
Generation 5 - Current best internal CV score: 0.9266666666666666

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.35000000000000003, min_samples_leaf=2, min_samples_split=6, n_estimators=100)

In this case, we can see that the top-performing pipeline achieved a mean accuracy of about 92.7 percent.

The top-performing pipeline is then saved to a file named “tpot_best_model.py“.

Opening this file, you can see that there is some generic code for loading a dataset and fitting the pipeline. An example is listed below.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: 0.9266666666666666
exported_pipeline = ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.35000000000000003, min_samples_leaf=2, min_samples_split=6, n_estimators=100)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

You can then retrieve the code for creating the model pipeline and integrate it into your project.
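As a quick, hedged sketch, you could adapt the exported code to the synthetic dataset used above instead of the CSV placeholder. The version below simply recreates the dataset from earlier and copies the hyperparameters from "tpot_best_model.py"; the values your own run exports will likely differ.

# adapt the exported TPOT pipeline to the synthetic dataset (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
# recreate the dataset used during the search
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# hyperparameters copied from the exported tpot_best_model.py file
exported_pipeline = ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.35000000000000003, min_samples_leaf=2, min_samples_split=6, n_estimators=100, random_state=1)
# fit on the training set and score on the holdout set
exported_pipeline.fit(X_train, y_train)
print('Accuracy: %.3f' % exported_pipeline.score(X_test, y_test))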

For more on TPOT, see the following resources:

Hyperopt-Sklearn

HyperOpt is an open-source Python library for Bayesian optimization developed by James Bergstra.

It is designed for large-scale optimization for models with hundreds of parameters and allows the optimization procedure to be scaled across multiple cores and multiple machines.

HyperOpt-Sklearn wraps the HyperOpt library and allows for the automatic search of data preparation methods, machine learning algorithms, and model hyperparameters for classification and regression tasks.

… we introduce Hyperopt-Sklearn: a project that brings the benefits of automatic algorithm configuration to users of Python and scikit-learn. Hyperopt-Sklearn uses Hyperopt to describe a search space over possible configurations of Scikit-Learn components, including preprocessing and classification modules.

Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn, 2014.

Now that we are familiar with HyperOpt and HyperOpt-Sklearn, let’s look at how to use HyperOpt-Sklearn.

The first step is to install the HyperOpt library.

This can be achieved using the pip package manager as follows:

sudo pip install hyperopt

Next, we must install the HyperOpt-Sklearn library.

This too can be installed using pip, although we must perform this operation manually by cloning the repository and running the installation from the local files, as follows:

git clone git@github.com:hyperopt/hyperopt-sklearn.git
cd hyperopt-sklearn
sudo pip install .
cd ..

We can confirm that the installation was successful by checking the version number with the following command:

sudo pip show hpsklearn

This will summarize the installed version of HyperOpt-Sklearn, confirming that a modern version is being used.

Name: hpsklearn
Version: 0.0.3
Summary: Hyperparameter Optimization for sklearn
Home-page: http://hyperopt.github.com/hyperopt-sklearn/
Author: James Bergstra
Author-email: anon@anon.com
License: BSD
Location: ...
Requires: nose, scikit-learn, numpy, scipy, hyperopt
Required-by:

Next, we can demonstrate using Hyperopt-Sklearn on a synthetic classification task.

We can configure a HyperoptEstimator instance that runs the search, including the classifiers to consider in the search space, the pre-processing steps, and the search algorithm to use. In this case, we will use TPE, or Tree of Parzen Estimators, and perform 50 evaluations.

At the end of the search, the best performing model pipeline is evaluated and summarized.

The complete example is listed below.

# example of hyperopt-sklearn for a classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from hpsklearn import HyperoptEstimator
from hpsklearn import any_classifier
from hpsklearn import any_preprocessing
from hyperopt import tpe
# define dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = HyperoptEstimator(classifier=any_classifier('cla'), preprocessing=any_preprocessing('pre'), algo=tpe.suggest, max_evals=50, trial_timeout=30)
# perform the search
model.fit(X_train, y_train)
# summarize performance
acc = model.score(X_test, y_test)
print("Accuracy: %.3f" % acc)
# summarize the best model
print(model.best_model())

Running the example may take a few minutes.

The progress of the search will be reported and you will see some warnings that you can safely ignore.

At the end of the run, the best-performing model is evaluated on the holdout dataset and the Pipeline discovered is printed for later use.

Your specific results may differ given the stochastic nature of the learning algorithm and search process. Try running the example a few times.

In this case, we can see that the chosen model achieved an accuracy of about 84.8 percent on the holdout test set. The Pipeline involves an SGDClassifier model with no pre-processing.

Accuracy: 0.848
{'learner': SGDClassifier(alpha=0.0012253733891387925, average=False,
              class_weight='balanced', early_stopping=False, epsilon=0.1,
              eta0=0.0002555872679483392, fit_intercept=True,
              l1_ratio=0.628343459087075, learning_rate='optimal',
              loss='perceptron', max_iter=64710625.0, n_iter_no_change=5,
              n_jobs=1, penalty='l2', power_t=0.42312829309173644,
              random_state=1, shuffle=True, tol=0.0005437535215080966,
              validation_fraction=0.1, verbose=False, warm_start=False), 'preprocs': (), 'ex_preprocs': ()}

The printed model can then be used directly, e.g. the code copy-pasted into another project.
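For example, a minimal, hedged sketch of reusing the discovered configuration might look like the following. Only a few of the printed hyperparameters are copied over for readability, and the values will differ on your run.

# reuse the model configuration discovered by Hyperopt-Sklearn (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
# recreate the dataset and split used during the search
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# a simplified version of the printed best model
model = SGDClassifier(loss='perceptron', penalty='l2', alpha=0.0012253733891387925, class_weight='balanced', random_state=1)
# fit on the training set and evaluate on the holdout set
model.fit(X_train, y_train)
print('Accuracy: %.3f' % model.score(X_test, y_test))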

For more on Hyperopt-Sklearn, see:

Summary

In this tutorial, you discovered how to use top open-source AutoML libraries for scikit-learn in Python.

Specifically, you learned:

  • AutoML refers to techniques for automatically and quickly discovering a well-performing machine learning model pipeline for a predictive modeling task.
  • The three most popular AutoML libraries for Scikit-Learn are Hyperopt-Sklearn, Auto-Sklearn, and TPOT.
  • How to use AutoML libraries to discover well-performing models for predictive modeling tasks in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Automated Machine Learning (AutoML) Libraries for Python appeared first on Machine Learning Mastery.

Multi-Core Machine Learning in Python With Scikit-Learn

Many computationally expensive tasks for machine learning can be made parallel by splitting the work across multiple CPU cores, referred to as multi-core processing.

Common machine learning tasks that can be made parallel include training models like ensembles of decision trees, evaluating models using resampling procedures like k-fold cross-validation, and tuning model hyperparameters, such as grid and random search.

Using multiple cores for common machine learning tasks can dramatically decrease execution time, roughly in proportion to the number of cores available on your system. A common laptop and desktop computer may have 2, 4, or 8 cores. Larger server systems may have 32, 64, or more cores available, allowing machine learning tasks that take hours to be completed in minutes.

In this tutorial, you will discover how to configure scikit-learn for multi-core machine learning.

After completing this tutorial, you will know:

  • How to train machine learning models using multiple cores.
  • How to make the evaluation of machine learning models parallel.
  • How to use multiple cores to tune machine learning model hyperparameters.

Let’s get started.

Multi-Core Machine Learning in Python With Scikit-Learn

Multi-Core Machine Learning in Python With Scikit-Learn
Photo by ER Bauer, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Multi-Core Scikit-Learn
  2. Multi-Core Model Training
  3. Multi-Core Model Evaluation
  4. Multi-Core Hyperparameter Tuning
  5. Recommendations

Multi-Core Scikit-Learn

Machine learning can be computationally expensive.

There are three main centers of this computational cost; they are:

  • Training machine learning models.
  • Evaluating machine learning models.
  • Hyperparameter tuning machine learning models.

Worse, these concerns compound.

For example, evaluating machine learning models using a resampling technique like k-fold cross-validation requires that the training process is repeated multiple times.

  • Evaluation Requires Repeated Training

Tuning model hyperparameters compounds this further, as it requires that the evaluation procedure is repeated for each combination of hyperparameter values tested.

  • Tuning Requires Repeated Evaluation

Most, if not all, modern computers have multi-core CPUs. This includes your workstation, your laptop, as well as larger servers.

You can configure your machine learning models to harness multiple cores of your computer, dramatically speeding up computationally expensive operations.

The scikit-learn Python machine learning library provides this capability via the n_jobs argument on key machine learning tasks, such as model training, model evaluation, and hyperparameter tuning.

This configuration argument allows you to specify the number of cores to use for the task. The default is None, which will use a single core. You can also specify a number of cores as an integer, such as 1 or 2. Finally, you can specify -1, in which case the task will use all of the cores available on your system.

  • n_jobs: Specify the number of cores to use for key machine learning tasks.

Common values are:

  • n_jobs=None: Use a single core or the default configured by your backend library.
  • n_jobs=4: Use the specified number of cores, in this case 4.
  • n_jobs=-1: Use all available cores.

What is a core?

A CPU may have multiple physical CPU cores, which is essentially like having multiple CPUs. Each core may also support hyper-threading, a technology that in many circumstances effectively doubles the number of logical cores.

For example, my workstation has four physical cores, which are doubled to eight cores due to hyper-threading. Therefore, I can experiment with 1-8 cores or specify -1 to use all cores on my workstation.
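If you are unsure how many logical cores your system has, you can check programmatically. The minimal sketch below uses the standard library; joblib, which scikit-learn uses for parallelism, provides a similar cpu_count() function.

# report the number of logical CPU cores available on the system
import os
print('Logical cores: %d' % os.cpu_count())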

Now that we are familiar with the scikit-learn library’s capability to support multi-core parallel processing for machine learning, let’s work through some examples.

You will get different timings for all of the examples in this tutorial; share your results in the comments. You may also need to change the number of cores to match the number of cores on your system.

Note: Yes, I am aware of the timeit API, but chose not to use it for this tutorial. We are not profiling the code examples per se; instead, I want you to focus on how and when to use the multi-core capabilities of scikit-learn and to see that they offer real benefits. I wanted the code examples to be clean and simple to read, even for beginners. I leave it as an extension for you to update all examples to use the timeit API and get more accurate timings. Share your results in the comments.

Multi-Core Model Training

Many machine learning algorithms support multi-core training via an n_jobs argument when the model is defined.

This affects not just the training of the model, but also the use of the model when making predictions.

A popular example is the ensemble of decision trees, such as bagged decision trees, random forest, and gradient boosting.

In this section we will explore accelerating the training of a RandomForestClassifier model using multiple cores. We will use a synthetic classification task for our experiments.

In this case, we will define a random forest model with 500 trees and use a single core to train the model.

...
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=1)

We can record the time before and after the call to the fit() function using the time() function. We can then subtract the start time from the end time and report the execution time in seconds.

The complete example of evaluating the execution time of training a random forest model with a single core is listed below.

# example of timing the training of a random forest model on one core
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=1)
# record current time
start = time()
# fit the model
model.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example reports the time taken to train the model with a single core.

In this case, we can see that it takes about 10 seconds.

How long does it take on your system? Share your results in the comments below.

10.702 seconds

We can now change the example to use all of the physical cores on the system, in this case, four.

...
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=4)

The complete example of multi-core training of the model with four cores is listed below.

# example of timing the training of a random forest model on 4 cores
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=4)
# record current time
start = time()
# fit the model
model.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example reports the time taken to train the model with four cores.

In this case, we can see that the execution time dropped to less than a third of the single-core time, from about 10.702 seconds down to about 3.151 seconds.

How long does it take on your system? Share your results in the comments below.

3.151 seconds

We can now change the number of cores to eight to account for the hyper-threading supported by the four physical cores.

...
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=8)

We can achieve the same effect by setting n_jobs to -1 to automatically use all cores; for example:

...
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=-1)

We will stick to manually specifying the number of cores for now.

The complete example of multi-core training of the model with eight cores is listed below.

# example of timing the training of a random forest model on 8 cores
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=500, n_jobs=8)
# record current time
start = time()
# fit the model
model.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example reports the time taken to train the model with eight cores.

In this case, we can see that we got a further drop in execution time, from about 3.151 seconds to about 2.521 seconds, by using all cores.

How long does it take on your system? Share your results in the comments below.

2.521 seconds

We can make the relationship between the number of cores used during training and execution speed more concrete by comparing all values between one and eight and plotting the result.

The complete example is listed below.

# example of comparing number of cores used during training to execution speed
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
results = list()
# compare timing for number of cores
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
for n in n_cores:
	# capture current time
	start = time()
	# define the model
	model = RandomForestClassifier(n_estimators=500, n_jobs=n)
	# fit the model
	model.fit(X, y)
	# capture current time
	end = time()
	# store execution time
	result = end - start
	print('>cores=%d: %.3f seconds' % (n, result))
	results.append(result)
pyplot.plot(n_cores, results)
pyplot.show()

Running the example first reports the execution speed for each number of cores used during training.

We can see a steady decrease in execution time from one to eight cores, although the dramatic benefit stops after the four physical cores.

How long does it take on your system? Share your results in the comments below.

>cores=1: 10.798 seconds
>cores=2: 5.743 seconds
>cores=3: 3.964 seconds
>cores=4: 3.158 seconds
>cores=5: 2.868 seconds
>cores=6: 2.631 seconds
>cores=7: 2.528 seconds
>cores=8: 2.440 seconds

A plot is also created to show the relationship between the number of cores used during training and the execution speed, showing that we continue to see a benefit all the way to eight cores.

Line Plot of Number of Cores Used During Training vs. Execution Speed

Line Plot of Number of Cores Used During Training vs. Execution Speed

Now that we are familiar with the benefit of multi-core training of machine learning models, let’s look at multi-core model evaluation.

Multi-Core Model Evaluation

The gold standard for model evaluation is k-fold cross-validation.

This is a resampling procedure that requires that the model is trained and evaluated k times on different partitioned subsets of the dataset. The result is an estimate of the performance of a model when making predictions on data not used during training that can be used to compare and select a good or best model for a dataset.

In addition, it is also a good practice to repeat this evaluation process multiple times, referred to as repeated k-fold cross-validation.

The evaluation procedure can be configured to use multiple cores, where each model training and evaluation happens on a separate core. This can be done by setting the n_jobs argument on the call to cross_val_score() function; for example:
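...
# evaluate the model using multiple cores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)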

We can explore the effect of multiple cores on model evaluation.

First, let’s evaluate the model using a single core.

...
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)

We will evaluate the random forest model and use a single core in the training of the model (for now).

...
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=1)

The complete example is listed below.

# example of evaluating a model using a single core
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# record current time
start = time()
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example evaluates the model using 10-fold cross-validation with three repeats.

In this case, we see that the evaluation of the model took about 6.412 seconds.

How long does it take on your system? Share your results in the comments below.

6.412 seconds

We can update the example to use all eight cores of the system and expect a large speedup.

...
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=8)

The complete example is listed below.

# example of evaluating a model using 8 cores
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# record current time
start = time()
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=8)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example evaluates the model using multiple cores.

In this case, we can see the execution time dropped from about 6.412 seconds to about 2.371 seconds, giving a welcome speedup.

How long does it take on your system? Share your results in the comments below.

2.371 seconds

As we did in the previous section, we can time the execution speed for each number of cores from one to eight to get an idea of the relationship.

The complete example is listed below.

# compare execution speed for model evaluation vs number of cpu cores
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
results = list()
# compare timing for number of cores
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
for n in n_cores:
	# define the model
	model = RandomForestClassifier(n_estimators=100, n_jobs=1)
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# record the current time
	start = time()
	# evaluate the model
	n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=n)
	# record the current time
	end = time()
	# store execution time
	result = end - start
	print('>cores=%d: %.3f seconds' % (n, result))
	results.append(result)
pyplot.plot(n_cores, results)
pyplot.show()

Running the example first reports the execution time in seconds for each number of cores for evaluating the model.

We can see that there is not a dramatic improvement above four physical cores.

We can also see a difference between the eight-core result here and the previous experiment. In this case, evaluating the model took about 1.492 seconds, whereas the standalone eight-core run took about 2.371 seconds.

This highlights the limitation of the evaluation methodology we are using, where we are reporting the performance of a single run rather than repeated runs. There is some spin-up time required to start the worker processes and load classes into memory.

Regardless of the accuracy of our flimsy profiling, we do see the familiar speedup of model evaluation with the increase of cores used during the process.

How long does it take on your system? Share your results in the comments below.

>cores=1: 6.339 seconds
>cores=2: 3.765 seconds
>cores=3: 2.404 seconds
>cores=4: 1.826 seconds
>cores=5: 1.806 seconds
>cores=6: 1.686 seconds
>cores=7: 1.587 seconds
>cores=8: 1.492 seconds

A plot of the relationship between the number of cores and the execution speed is also created.

Line Plot of Number of Cores Used During Evaluation vs. Execution Speed

Line Plot of Number of Cores Used During Evaluation vs. Execution Speed

We can also make the model training process parallel during the model evaluation procedure.

Although this is possible, should we?

To explore this question, let’s first consider the case where model training uses all cores and model evaluation uses a single core.

...
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=8)
...
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)

The complete example is listed below.

# example of using multiple cores for model training but not model evaluation
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=8)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# record current time
start = time()
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example evaluates the model using a single core, but each model is trained using all eight cores.

In this case, we can see that the model evaluation takes more than 10 seconds, much longer than the 1 or 2 seconds when we use a single core for training and all cores for parallel model evaluation.

How long does it take on your system? Share your results in the comments below.

10.461 seconds

What if we split the number of cores between the training and evaluation procedures?

...
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=4)
...
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=4)

The complete example is listed below.

# example of using multiple cores for model training and evaluation
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=4)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# record current time
start = time()
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=4)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example evaluates the model using four cores, and each model is trained using four cores.

We can see an improvement over training with all cores and evaluating with one core, but at least for this model on this dataset, it is more efficient to use all cores for model evaluation and a single core for model training.

How long does it take on your system? Share your results in the comments below.

3.434 seconds

Multi-Core Hyperparameter Tuning

It is common to tune the hyperparameters of a machine learning model using a grid search or a random search.

The scikit-learn library provides these capabilities via the GridSearchCV and RandomizedSearchCV classes respectively.

Both of these search procedures can be made parallel by setting the n_jobs argument, assigning each hyperparameter configuration to a core for evaluation.
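As a hedged sketch, a randomized search is parallelized in exactly the same way. The example below mirrors the small grid used later in this section and is illustrative only.

# illustrative sketch of a parallel randomized search of random forest hyperparameters
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model and evaluation procedure
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the search space and a randomized search that uses all cores
space = dict()
space['max_features'] = [1, 2, 3, 4, 5]
search = RandomizedSearchCV(model, space, n_iter=3, n_jobs=-1, cv=cv, random_state=1)
# perform the search
search.fit(X, y)
print(search.best_params_)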

The model evaluation itself could also be multi-core, as we saw in the previous section, and the model training within each evaluation can also be multi-core, as we saw in the section before that. Therefore, the stack of potentially multi-core processes is starting to get challenging to configure.

In this specific implementation, we can make the model training parallel, but we don't have fine-grained control over how each hyperparameter configuration and each model evaluation is made multi-core. The documentation is not clear at the time of writing, but I would guess that the evaluation of each hyperparameter configuration (with single-core model training) is treated as a set of jobs that is split across the available cores.

Let’s explore the benefits of performing model hyperparameter tuning using multiple cores.

First, let’s evaluate a grid of different configurations of the random forest algorithm using a single core.

...
# define grid search
search = GridSearchCV(model, grid, n_jobs=1, cv=cv)

The complete example is listed below.

# example of tuning model hyperparameters with a single core
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
# define grid search
search = GridSearchCV(model, grid, n_jobs=1, cv=cv)
# record current time
start = time()
# perform search
search.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example tests different values of the max_features configuration for random forest, where each configuration is evaluated using repeated k-fold cross-validation.

In this case, the grid search on a single core takes about 28.838 seconds.

How long does it take on your system? Share your results in the comments below.

28.838 seconds

We can now configure the grid search to use all available cores on the system, in this case, eight cores.

...
# define grid search
search = GridSearchCV(model, grid, n_jobs=8, cv=cv)

We can then evaluate how long this multi-core grid search takes to execute. The complete example is listed below.

# example of tuning model hyperparameters with 8 cores
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
# define grid search
search = GridSearchCV(model, grid, n_jobs=8, cv=cv)
# record current time
start = time()
# perform search
search.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

Running the example reports execution time for the grid search.

In this case, we see a factor of about four speed up from roughly 28.838 seconds to around 7.418 seconds.

How long does it take on your system? Share your results in the comments below.

7.418 seconds

Intuitively, we would expect that making the grid search multi-core should be the focus and not model training.

Nevertheless, we can divide the number of cores between model training and the grid search to see if it offers a benefit for this model on this dataset.

...
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=4)
...
# define grid search
search = GridSearchCV(model, grid, n_jobs=4, cv=cv)

The complete example of multi-core model training and multi-core hyperparameter tuning is listed below.

# example of multi-core model training and hyperparameter tuning
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=4)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
# define grid search
search = GridSearchCV(model, grid, n_jobs=4, cv=cv)
# record current time
start = time()
# perform search
search.fit(X, y)
# record current time
end = time()
# report execution time
result = end - start
print('%.3f seconds' % result)

In this case, we do see a decrease in execution time compared to the single-core case, but not as much benefit as assigning all cores to the grid search process.

How long does it take on your system? Share your results in the comments below.

14.148 seconds

Recommendations

This section lists some general recommendations when using multiple cores for machine learning.

  • Confirm the number of cores available on your system.
  • Consider using an AWS EC2 instance with many cores to get an immediate speed up.
  • Check the API documentation to see if the model/s you are using support multi-core training.
  • Confirm multi-core training offers a measurable benefit on your system.
  • When using k-fold cross-validation, it is probably better to assign cores to the resampling procedure and leave model training single core.
  • When using hyperparameter tuning, it is probably better to make the search multi-core and leave the model training and evaluation single core.

Do you have any recommendations of your own?

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to configure scikit-learn for multi-core machine learning.

Specifically, you learned:

  • How to train machine learning models using multiple cores.
  • How to make the evaluation of machine learning models parallel.
  • How to use multiple cores to tune machine learning model hyperparameters.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Multi-Core Machine Learning in Python With Scikit-Learn appeared first on Machine Learning Mastery.

Linear Discriminant Analysis With Python

Linear Discriminant Analysis is a linear classification machine learning algorithm.

The algorithm involves developing a probabilistic model per class based on the specific distribution of observations for each input variable. A new example is then classified by calculating the conditional probability of it belonging to each class and selecting the class with the highest probability.

As such, it is a relatively simple probabilistic classification model that makes strong assumptions about the distribution of each input variable, although it can make effective predictions even when these expectations are violated (e.g. it fails gracefully).

In this tutorial, you will discover the Linear Discriminant Analysis classification machine learning algorithm in Python.

After completing this tutorial, you will know:

  • The Linear Discriminant Analysis is a simple linear machine learning algorithm for classification.
  • How to fit, evaluate, and make predictions with the Linear Discriminant Analysis model with Scikit-Learn.
  • How to tune the hyperparameters of the Linear Discriminant Analysis algorithm on a given dataset.

Let’s get started.

Linear Discriminant Analysis With Python

Linear Discriminant Analysis With Python
Photo by Mihai Lucîț, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Linear Discriminant Analysis
  2. Linear Discriminant Analysis With scikit-learn
  3. Tune LDA Hyperparameters

Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA for short, is a classification machine learning algorithm.

It works by calculating summary statistics for the input features by class label, such as the mean and standard deviation. These statistics represent the model learned from the training data. In practice, linear algebra operations are used to calculate the required quantities efficiently via matrix decomposition.

Predictions are made by estimating the probability that a new example belongs to each class label based on the values of each input feature. The class that results in the largest probability is then assigned to the example. As such, LDA may be considered a simple application of Bayes Theorem for classification.
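To make the Bayes Theorem framing concrete, the sketch below shows the per-class discriminant function that LDA effectively evaluates, assuming a shared covariance matrix across the classes. It is purely illustrative; scikit-learn performs these calculations efficiently for you.

# illustrative sketch of the per-class LDA discriminant function
import numpy as np

def lda_discriminant(x, class_mean, shared_cov_inv, class_prior):
	# delta_k(x) = x^T Sigma^-1 mu_k - 0.5 * mu_k^T Sigma^-1 mu_k + log(pi_k)
	return x @ shared_cov_inv @ class_mean - 0.5 * class_mean @ shared_cov_inv @ class_mean + np.log(class_prior)

The class with the largest discriminant value is assigned to the example.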

LDA assumes that the input variables are numeric and normally distributed and that they have the same variance (spread). If this is not the case, it may be desirable to transform the data to have a Gaussian distribution and standardize or normalize the data prior to modeling.

… the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance

— Page 142, An Introduction to Statistical Learning with Applications in R, 2014.

It also assumes that the input variables are not correlated; if they are, a PCA transform may be helpful to remove the linear dependence.

… practitioners should be particularly rigorous in pre-processing data before using LDA. We recommend that predictors be centered and scaled and that near-zero variance predictors be removed.

— Page 293, Applied Predictive Modeling, 2013.

Nevertheless, the model can perform well, even when violating these expectations.
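As a hedged sketch of the pre-processing advice above, the input variables can be transformed, standardized, and decorrelated in a pipeline before the LDA step. The specific transforms below are illustrative choices rather than requirements.

# illustrative pipeline of data transforms prior to LDA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# power transform (more Gaussian-like and standardized), decorrelate with PCA, then LDA
steps = [('power', PowerTransformer()), ('pca', PCA()), ('lda', LinearDiscriminantAnalysis())]
model = Pipeline(steps=steps)

The pipeline can then be fit and evaluated in exactly the same way as the standalone model in the next section.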

The LDA model is naturally multi-class. This means that it supports two-class classification problems and extends to more than two classes (multi-class classification) without modification or augmentation.

It is a linear classification algorithm, like logistic regression. This means that classes are separated in the feature space by lines or hyperplanes. Extensions of the method can be used that allow other shapes, like Quadratic Discriminant Analysis (QDA), which allows curved shapes in the decision boundary.

… unlike LDA, QDA assumes that each class has its own covariance matrix.

— Page 149, An Introduction to Statistical Learning with Applications in R, 2014.

Now that we are familiar with LDA, let’s look at how to fit and evaluate models using the scikit-learn library.

Linear Discriminant Analysis With scikit-learn

The Linear Discriminant Analysis is available in the scikit-learn Python machine learning library via the LinearDiscriminantAnalysis class.

The method can be used directly without configuration, although the implementation does offer arguments for customization, such as the choice of solver and the use of shrinkage (a penalty).

...
# create the lda model
model = LinearDiscriminantAnalysis()

We can demonstrate the Linear Discriminant Analysis method with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 10 input variables.

The example creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 10) (1000,)

We can fit and evaluate a Linear Discriminant Analysis model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

The complete example of evaluating the Linear Discriminant Analysis model for the synthetic binary classification task is listed below.

# evaluate a lda model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = LinearDiscriminantAnalysis()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Linear Discriminant Analysis algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 89.3 percent.

Mean Accuracy: 0.893 (0.033)

We may decide to use the Linear Discriminant Analysis as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a lda model on the dataset
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = LinearDiscriminantAnalysis()
# fit model
model.fit(X, y)
# define new data
row = [0.12777556,-3.64400522,-2.23268854,-1.82114386,1.75466361,0.1243966,1.03397657,2.35822076,1.01001752,0.56768485]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat)

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 1

Next, we can look at configuring the model hyperparameters.

Tune LDA Hyperparameters

The hyperparameters for the Linear Discriminant Analysis method must be configured for your specific dataset.

An important hyperparameter is the solver, which defaults to ‘svd‘ but can also be set to other values for solvers that support the shrinkage capability.

The example below demonstrates this using the GridSearchCV class with a grid of different solver values.

# grid search solver for lda
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = LinearDiscriminantAnalysis()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['solver'] = ['svd', 'lsqr', 'eigen']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the default SVD solver performs the best compared to the other built-in solvers.

Mean Accuracy: 0.893
Config: {'solver': 'svd'}

Next, we can explore whether using shrinkage with the model improves performance.

Shrinkage adds a penalty to the model that acts as a type of regularizer, reducing the complexity of the model.

Regularization reduces the variance associated with the sample based estimate at the expense of potentially increased bias. This bias variance trade-off is generally regulated by one or more (degree-of-belief) parameters that control the strength of the biasing towards the “plausible” set of (population) parameter values.

Regularized Discriminant Analysis, 1989.

This can be set via the "shrinkage" argument, which takes a value between 0 and 1. We will test values on a grid with a spacing of 0.01.

In order to use shrinkage, a solver must be chosen that supports this capability, such as ‘eigen’ or ‘lsqr‘. We will use the latter in this case.

The complete example of tuning the shrinkage hyperparameter is listed below.

# grid search shrinkage for lda
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = LinearDiscriminantAnalysis(solver='lsqr')
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrinkage'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that using shrinkage offers a slight lift in performance from about 89.3 percent to about 89.4 percent, with a value of 0.02.

Mean Accuracy: 0.894
Config: {'shrinkage': 0.02}

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

Books

APIs

Articles

Summary

In this tutorial, you discovered the Linear Discriminant Analysis classification machine learning algorithm in Python.

Specifically, you learned:

  • The Linear Discriminant Analysis is a simple linear machine learning algorithm for classification.
  • How to fit, evaluate, and make predictions with the Linear Discriminant Analysis model with Scikit-Learn.
  • How to tune the hyperparameters of the Linear Discriminant Analysis algorithm on a given dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Linear Discriminant Analysis With Python appeared first on Machine Learning Mastery.
