
Multiprocessing in Python


When you are working on a computer vision project, you probably need to preprocess a lot of image data. This is time-consuming, and it would be great if you could process multiple images in parallel. Multiprocessing is the ability of a system to run multiple processors at one time. If you had a computer with a single processor, it would switch between multiple processes to keep all of them running. However, most computers today have at least a multi-core processor, which allows several processes to be executed at once. The Python multiprocessing module is a tool for you to increase your scripts’ efficiency by allocating tasks to different processes.

After completing this tutorial, you will know:

  • Why we would want to use multiprocessing
  • How to use basic tools in the Python multiprocessing module

Let’s get started.

Multiprocessing in Python
Photo by Thirdman. Some rights reserved.

Overview

This tutorial is divided into four parts:

  • Benefits of multiprocessing
  • Basic multiprocessing
  • Multiprocessing for real use
  • Using joblib

Benefits of multiprocessing

You may ask, “Why Multiprocessing?” Multiprocessing can make a program substantially more efficient by running multiple tasks in parallel instead of sequentially. A similar term is multithreading, but they are different.

A process is a program loaded into memory to run and does not share its memory with other processes. A thread is an execution unit within a process. Multiple threads run in a process and share the process’s memory space with each other.

Python’s Global Interpreter Lock (GIL) only allows one thread to be run at a time under the interpreter, which means you can’t enjoy the performance benefit of multithreading if the Python interpreter is required. This is what gives multiprocessing an upper hand over threading in Python. Multiple processes can be run in parallel because each process has its own interpreter that executes the instructions allocated to it. Also, the OS sees your program as multiple processes and schedules them separately, i.e., your program gets a larger share of computer resources in total. So, multiprocessing is faster when the program is CPU-bound. In cases where there is a lot of I/O in your program, threading may be more efficient because most of the time your program is waiting for the I/O to complete. For CPU-bound work, however, multiprocessing is generally more efficient because the processes run truly in parallel.
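
To make the CPU-bound case concrete, here is a minimal sketch (not part of the original tutorial) that times the same pure-Python loop with a thread pool and with a process pool. On a multi-core machine, the process pool is typically much faster because each process has its own interpreter and its own GIL:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def countdown(n=10_000_000):
    # a pure-Python loop: CPU-bound, so it holds the GIL while it runs
    while n > 0:
        n -= 1

if __name__ == "__main__":
    for executor_class in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with executor_class(max_workers=4) as executor:
            futures = [executor.submit(countdown) for _ in range(4)]
            for future in futures:
                future.result()  # wait for each task to finish
        print(executor_class.__name__, time.perf_counter() - start, "seconds")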

Basic multiprocessing

Let’s use the Python Multiprocessing module to write a basic program that demonstrates how to do concurrent programming.

Let’s look at this function, task(), that sleeps for 0.5 seconds and prints before and after the sleep:

import time

def task():
    print('Sleeping for 0.5 seconds')
    time.sleep(0.5)
    print('Finished sleeping')

To create a process, we simply create a Process object from the multiprocessing module:

...
import multiprocessing
p1 = multiprocessing.Process(target=task)
p2 = multiprocessing.Process(target=task)

The target argument to Process() specifies the target function that the process runs. But these processes do not run immediately until we start them:

...
p1.start()
p2.start()

A complete concurrent program would be as follows:

import multiprocessing
import time

def task():
    print('Sleeping for 0.5 seconds')
    time.sleep(0.5)
    print('Finished sleeping')

if __name__ == "__main__":
    start_time = time.perf_counter()

    # Creates two processes
    p1 = multiprocessing.Process(target=task)
    p2 = multiprocessing.Process(target=task)

    # Starts both processes
    p1.start()
    p2.start()

    finish_time = time.perf_counter()

    print(f"Program finished in {finish_time-start_time} seconds")

We must fence our main program under if __name__ == "__main__", or otherwise the multiprocessing module will complain. This safety construct ensures that the process-creation code is not re-executed when a child process imports the main script.

However, there is a problem with the code, as the program timer is printed before the processes we created are even executed. Here’s the output for the code above:

Program finished in 0.012921249988721684 seconds
Sleeping for 0.5 seconds
Sleeping for 0.5 seconds
Finished sleeping
Finished sleeping

We need to call the join() function on the two processes so that the main process waits for them to finish before the time prints. This is because there are three processes going on: p1, p2, and the main process. The main process is the one that keeps track of the time and prints the time taken to execute. The line that sets finish_time should run no earlier than processes p1 and p2 have finished. We just need to add this snippet of code immediately after the start() function calls:

...
p1.join()
p2.join()

The join() function makes the calling process wait until the process it is called on has completed. Here’s the output with the join statements added:

Sleeping for 0.5 seconds
Sleeping for 0.5 seconds
Finished sleeping
Finished sleeping
Program finished in 0.5688213340181392 seconds

By similar reasoning, we can run more processes. The following is the complete code, modified from the above to use 10 processes:

import multiprocessing
import time

def task():
    print('Sleeping for 0.5 seconds')
    time.sleep(0.5)
    print('Finished sleeping')

if __name__ == "__main__": 
    start_time = time.perf_counter()
    processes = []

    # Creates 10 processes then starts them
    for i in range(10):
        p = multiprocessing.Process(target = task)
        p.start()
        processes.append(p)
    
    # Joins all the processes 
    for p in processes:
        p.join()

    finish_time = time.perf_counter()

    print(f"Program finished in {finish_time-start_time} seconds")

Multiprocessing for real use

Starting a new process and then joining it back to the main process is how multiprocessing works in Python (as in many other languages). The reason we want to run multiprocessing is probably to execute many different tasks concurrently for speed. It could be an image processing function that we need to apply to thousands of images. It could also be converting PDFs into plaintext for subsequent natural language processing tasks, where we need to process a thousand PDFs. Usually, we will create a function that takes an argument (e.g., a filename) for such tasks.

Let’s consider a function:

def cube(x):
    return x**3

If we want to run it with arguments 1 to 1,000, we could create 1,000 processes and run them all in parallel:

import multiprocessing

def cube(x):
    return x**3

if __name__ == "__main__":
    # this does not work
    processes = [multiprocessing.Process(target=cube, args=(x,)) for x in range(1,1000)]
    [p.start() for p in processes]
    result = [p.join() for p in processes]
    print(result)

But this is not going to work, as you probably have only a handful of cores in your computer. Running 1,000 processes creates too much overhead and overwhelms the capacity of your OS. You may also exhaust your memory. The better way is to run a process pool to limit the number of processes that can run at a time:

import multiprocessing
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    pool = multiprocessing.Pool(3)
    start_time = time.perf_counter()
    processes = [pool.apply_async(cube, args=(x,)) for x in range(1,1000)]
    result = [p.get() for p in processes]
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

The argument to multiprocessing.Pool() is the number of processes to create in the pool. If omitted, Python will make it equal to the number of cores you have in your computer.
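
You can check what that default would be on your machine:

...
import multiprocessing
# the number of CPUs the OS reports; also the default pool size if the argument is omitted
print(multiprocessing.cpu_count())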

We used the apply_async() function to pass the arguments to the function cube in a list comprehension. This creates tasks for the pool to run. It is called “async” (asynchronous) because we don’t wait for the task to finish, and the main process may continue to run. Therefore, the apply_async() function does not return the result, but an object on which we can call get() to wait for the task to finish and retrieve the result. Since we collect the results in a list comprehension, the order of the results corresponds to the order of the arguments we used to create the asynchronous tasks. However, this does not mean the processes are started or finished in this order inside the pool.
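
If you care about throughput rather than the order of the results, the pool also provides imap_unordered(), which yields results in completion order instead of submission order. A minimal sketch, reusing the same cube() function:

import multiprocessing

def cube(x):
    return x**3

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:
        # apply_async() + get() preserves the submission order of the arguments
        ordered = [p.get() for p in [pool.apply_async(cube, args=(x,)) for x in range(1,10)]]
        # imap_unordered() yields each result as soon as a worker finishes it
        unordered = list(pool.imap_unordered(cube, range(1,10)))
    print(ordered)
    print(unordered)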

If you think writing lines of code to start processes and join them is too explicit, you can consider using map() instead:

import multiprocessing
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    pool = multiprocessing.Pool(3)
    start_time = time.perf_counter()
    result = pool.map(cube, range(1,1000))
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

We don’t have the start and join calls here because they are hidden behind the pool.map() function. What it does is split the iterable range(1,1000) into chunks and run each chunk in the pool. The map function is a parallel version of the list comprehension:

result = [cube(x) for x in range(1,1000)]
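
If you want some control over how the iterable is split, pool.map() also accepts an optional chunksize argument; a larger chunk means fewer but bigger tasks handed to each worker. For example, continuing from the pool created above:

...
# send the work to the pool in chunks of 100 arguments per task
result = pool.map(cube, range(1,1000), chunksize=100)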

But the modern-day alternative is to use map from concurrent.futures, as follows:

import concurrent.futures
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(3) as executor:
        start_time = time.perf_counter()
        result = list(executor.map(cube, range(1,1000)))
        finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

This code uses the multiprocessing module under the hood. The beauty of doing so is that we can change the program from multiprocessing to multithreading by simply replacing ProcessPoolExecutor with ThreadPoolExecutor. Of course, you have to consider whether the global interpreter lock is an issue for your code.
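
For example, the thread-based variant below is a one-line change from the program above. It is shown only to illustrate the interchangeable API; for a CPU-bound function like cube(), the threads will not run faster because of the global interpreter lock:

import concurrent.futures
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(3) as executor:
        start_time = time.perf_counter()
        result = list(executor.map(cube, range(1,1000)))
        finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)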

Using joblib

The joblib package is a set of tools to make parallel computing easier. It is a common third-party library for multiprocessing. It also provides caching and serialization functions. To install the joblib package, use the following command in the terminal:

pip install joblib

We can convert our previous example into the following to use joblib:

import time
from joblib import Parallel, delayed

def cube(x):
    return x**3

start_time = time.perf_counter()
result = Parallel(n_jobs=3)(delayed(cube)(i) for i in range(1,1000))
finish_time = time.perf_counter()
print(f"Program finished in {finish_time-start_time} seconds")
print(result)

Indeed, it is intuitive to see what it does. The delayed() function is a wrapper to another function that makes a “delayed” version of the function call, which means it will not execute the function immediately when it is called.

Then we call the delayed function multiple times with the different sets of arguments we want to pass to it. For example, when we give the integer 1 to the delayed version of the function cube, instead of computing the result, it produces a tuple, (cube, (1,), {}), for the function object, the positional arguments, and the keyword arguments, respectively.
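
You can verify this yourself. The minimal sketch below (not part of the original example) shows that calling the delayed version of cube builds the tuple instead of computing the cube:

from joblib import delayed

def cube(x):
    return x**3

job = delayed(cube)(1)
print(job)                     # (<function cube at ...>, (1,), {}) -- nothing computed yet
func, args, kwargs = job
print(func(*args, **kwargs))   # 1, computed only when we call it ourselves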

We create the engine instance with Parallel(). When it is invoked like a function with the list of tuples as an argument, it will actually execute the job specified by each tuple in parallel and collect the results as a list after all jobs have finished. Here we created the Parallel() instance with n_jobs=3, so there will be three processes running in parallel.

We can also write the tuples directly. Hence the code above can be rewritten as:

result = Parallel(n_jobs=3)((cube, (i,), {}) for i in range(1,1000))

The benefit of using joblib is that we can run the code with multithreading by simply adding an additional argument:

result = Parallel(n_jobs=3, prefer="threads")(delayed(cube)(i) for i in range(1,1000))

and this hides all the details of running functions in parallel. We simply use a syntax not too much different from a plain list comprehension.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

APIs

Summary

In this tutorial, you learned how we run Python functions in parallel for speed. In particular, you learned:

  • How to use the multiprocessing module in Python to create new processes that run a function
  • The mechanism of launching and completing a process
  • The use of a process pool in multiprocessing for controlled multiprocessing, and the counterpart syntax in concurrent.futures
  • How to use the third-party library joblib for multiprocessing



How to Identify Overfitting Machine Learning Models in Scikit-Learn


Overfitting is a common explanation for the poor performance of a predictive model.

An analysis of learning dynamics can help to identify whether a model has overfit the training dataset and may suggest an alternate configuration to use that could result in better predictive performance.

Performing an analysis of learning dynamics is straightforward for algorithms that learn incrementally, like neural networks, but it is less clear how we might perform the same analysis with other algorithms that do not learn incrementally, such as decision trees, k-nearest neighbors, and other general algorithms in the scikit-learn machine learning library.

In this tutorial, you will discover how to identify overfitting for machine learning models in Python.

After completing this tutorial, you will know:

  • Overfitting is a possible cause of poor generalization performance of a predictive model.
  • Overfitting can be analyzed for machine learning models by varying key model hyperparameters.
  • Although overfitting is a useful tool for analysis, it must not be confused with model selection.

Let’s get started.

Identify Overfitting Machine Learning Models With Scikit-Learn
Photo by Bonnie Moreland, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. What Is Overfitting
  2. How to Perform an Overfitting Analysis
  3. Example of Overfitting in Scikit-Learn
  4. Counterexample of Overfitting in Scikit-Learn
  5. Separate Overfitting Analysis From Model Selection

What Is Overfitting

Overfitting refers to an unwanted behavior of a machine learning algorithm used for predictive modeling.

It is the case where model performance on the training dataset is improved at the cost of worse performance on data not seen during training, such as a holdout test dataset or new data.

We can identify if a machine learning model has overfit by first evaluating the model on the training dataset and then evaluating the same model on a holdout test dataset.

If the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model may have overfit the training dataset.
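
As a minimal sketch of that check (using a synthetic dataset and an unconstrained decision tree purely for illustration), we can fit a model once and score it on both the training and test splits; a large gap between the two scores suggests overfitting:

# fit one model and compare its accuracy on the train and test splits
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
# create a synthetic dataset and split it
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# fit an unconstrained decision tree
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# a large gap between the two scores suggests overfitting
print('train:', accuracy_score(y_train, model.predict(X_train)))
print('test:', accuracy_score(y_test, model.predict(X_test)))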

We care about overfitting because it is a common cause of “poor generalization” of the model, as measured by high “generalization error.” That is, the error made by the model when making predictions on new data.

This means that if our model has poor performance, it may be because it has overfit.

But what does it mean if a model’s performance is “significantly better” on the training set compared to the test set?

For example, it is common and perhaps normal for the model to have better performance on the training set than the test set.

As such, we can perform an analysis of the algorithm on the dataset to better expose the overfitting behavior.

How to Perform an Overfitting Analysis

An overfitting analysis is an approach for exploring how and when a specific model is overfitting on a specific dataset.

It is a tool that can help you learn more about the learning dynamics of a machine learning model.

This might be achieved by reviewing the model behavior during a single run for algorithms like neural networks that are fit on the training dataset incrementally.

Model performance on the train and test sets can be calculated at each point during training and plotted. This plot is often called a learning curve plot, showing one curve for model performance on the training set and one curve for the test set for each increment of learning.

If you would like to learn more about learning curves for algorithms that learn incrementally, see the tutorial:

The common pattern for overfitting can be seen on learning curve plots, where model performance on the training dataset continues to improve (e.g. loss or error continues to fall or accuracy continues to rise) and performance on the test or validation set improves to a point and then begins to get worse.

If this pattern is observed, then training should stop at the point where performance begins to get worse on the test or validation set for algorithms that learn incrementally.

This makes sense for algorithms that learn incrementally like neural networks, but what about other algorithms?

  • How do you perform an overfitting analysis for machine learning algorithms in scikit-learn?

One approach for performing an overfitting analysis on algorithms that do not learn incrementally is by varying a key model hyperparameter and evaluating the model performance on the train and test sets for each configuration.

To make this clear, let’s explore a case of analyzing a model for overfitting in the next section.

Example of Overfitting in Scikit-Learn

In this section, we will look at an example of overfitting a machine learning model to a training dataset.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to define a binary (two class) classification prediction problem with 10,000 examples (rows) and 20 input features (columns).

The example below creates the dataset and summarizes the shape of the input and output components.

# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and reports the shape, confirming our expectations.

(10000, 20) (10000,)

Next, we need to split the dataset into train and test subsets.

We will use the train_test_split() function and split the data into 70 percent for training a model and 30 percent for evaluating it.

# split a dataset into train and test sets
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset and we can confirm that we have 7,000 examples for training and 3,000 for evaluating a model.

(7000, 20) (3000, 20) (7000,) (3000,)

Next, we can explore a machine learning model overfitting the training dataset.

We will use a decision tree via the DecisionTreeClassifier and test different tree depths with the “max_depth” argument.

Shallow decision trees (e.g. few levels) generally do not overfit but have poor performance (high bias, low variance), whereas deep trees (e.g. many levels) generally do overfit and have good performance (low bias, high variance). A desirable tree is one that is not so shallow that it has low skill and not so deep that it overfits the training dataset.

We evaluate decision tree depths from 1 to 20.

...
# define the tree depths to evaluate
values = [i for i in range(1, 21)]

We will enumerate each tree depth, fit a tree with a given depth on the training dataset, then evaluate the tree on both the train and test sets.

The expectation is that as the depth of the tree increases, performance on train and test will improve to a point, and as the tree gets too deep, it will begin to overfit the training dataset at the expense of worse performance on the holdout test set.

...
# evaluate a decision tree for each depth
for i in values:
	# configure the model
	model = DecisionTreeClassifier(max_depth=i)
	# fit model on the training dataset
	model.fit(X_train, y_train)
	# evaluate on the train dataset
	train_yhat = model.predict(X_train)
	train_acc = accuracy_score(y_train, train_yhat)
	train_scores.append(train_acc)
	# evaluate on the test dataset
	test_yhat = model.predict(X_test)
	test_acc = accuracy_score(y_test, test_yhat)
	test_scores.append(test_acc)
	# summarize progress
	print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))

At the end of the run, we will then plot all model accuracy scores on the train and test sets for visual comparison.

...
# plot of train and test scores vs tree depth
pyplot.plot(values, train_scores, '-o', label='Train')
pyplot.plot(values, test_scores, '-o', label='Test')
pyplot.legend()
pyplot.show()

Tying this together, the complete example of exploring different tree depths on the synthetic binary classification dataset is listed below.

# evaluate decision tree performance on train and test sets with different tree depths
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot
# create dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# define lists to collect scores
train_scores, test_scores = list(), list()
# define the tree depths to evaluate
values = [i for i in range(1, 21)]
# evaluate a decision tree for each depth
for i in values:
	# configure the model
	model = DecisionTreeClassifier(max_depth=i)
	# fit model on the training dataset
	model.fit(X_train, y_train)
	# evaluate on the train dataset
	train_yhat = model.predict(X_train)
	train_acc = accuracy_score(y_train, train_yhat)
	train_scores.append(train_acc)
	# evaluate on the test dataset
	test_yhat = model.predict(X_test)
	test_acc = accuracy_score(y_test, test_yhat)
	test_scores.append(test_acc)
	# summarize progress
	print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))
# plot of train and test scores vs tree depth
pyplot.plot(values, train_scores, '-o', label='Train')
pyplot.plot(values, test_scores, '-o', label='Test')
pyplot.legend()
pyplot.show()

Running the example fits and evaluates a decision tree on the train and test sets for each tree depth and reports the accuracy scores.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a trend of increasing accuracy on the training dataset with the tree depth to a point around a depth of 19-20 levels where the tree fits the training dataset perfectly.

We can also see that the accuracy on the test set improves with tree depth until a depth of about eight or nine levels, after which accuracy begins to get worse with each increase in tree depth.

This is exactly what we would expect to see in a pattern of overfitting.

We would choose a tree depth of eight or nine before the model begins to overfit the training dataset.

>1, train: 0.769, test: 0.761
>2, train: 0.808, test: 0.804
>3, train: 0.879, test: 0.878
>4, train: 0.902, test: 0.896
>5, train: 0.915, test: 0.903
>6, train: 0.929, test: 0.918
>7, train: 0.942, test: 0.921
>8, train: 0.951, test: 0.924
>9, train: 0.959, test: 0.926
>10, train: 0.968, test: 0.923
>11, train: 0.977, test: 0.925
>12, train: 0.983, test: 0.925
>13, train: 0.987, test: 0.926
>14, train: 0.992, test: 0.921
>15, train: 0.995, test: 0.920
>16, train: 0.997, test: 0.913
>17, train: 0.999, test: 0.918
>18, train: 0.999, test: 0.918
>19, train: 1.000, test: 0.914
>20, train: 1.000, test: 0.913

A figure is also created that shows line plots of the model accuracy on the train and test sets with different tree depths.

The plot clearly shows that increasing the tree depth in the early stages results in a corresponding improvement in both train and test sets.

This continues until a depth of around 10 levels, after which the model is shown to overfit the training dataset at the cost of worse performance on the holdout dataset.

Line Plot of Decision Tree Accuracy on Train and Test Datasets for Different Tree Depths

This analysis is interesting. It shows why the model has a worse hold-out test set performance when “max_depth” is set to large values.

But it is not required.

We can just as easily choose a “max_depth” using a grid search without performing an analysis on why some values result in better performance and some result in worse performance.
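
For example, a minimal sketch of that alternative, assuming the X_train and y_train arrays from the complete example above, would select “max_depth” by cross-validated accuracy alone:

...
# choose max_depth by cross-validated grid search instead of a manual analysis
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
grid = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': [i for i in range(1, 21)]}, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)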

In fact, in the next section, we will show where this analysis can be misleading.

Counterexample of Overfitting in Scikit-Learn

Sometimes, we may perform an analysis of machine learning model behavior and be deceived by the results.

A good example of this is varying the number of neighbors for the k-nearest neighbors algorithm, which we can implement using the KNeighborsClassifier class and configure via the “n_neighbors” argument.

Let’s forget how KNN works for the moment.

We can perform the same analysis of the KNN algorithm as we did in the previous section for the decision tree and see whether our model overfits for different configuration values. In this case, we will vary the number of neighbors from 1 to 50 to better expose the effect.

The complete example is listed below.

# evaluate knn performance on train and test sets with different numbers of neighbors
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot
# create dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# define lists to collect scores
train_scores, test_scores = list(), list()
# define the numbers of neighbors to evaluate
values = [i for i in range(1, 51)]
# evaluate a knn model for each number of neighbors
for i in values:
	# configure the model
	model = KNeighborsClassifier(n_neighbors=i)
	# fit model on the training dataset
	model.fit(X_train, y_train)
	# evaluate on the train dataset
	train_yhat = model.predict(X_train)
	train_acc = accuracy_score(y_train, train_yhat)
	train_scores.append(train_acc)
	# evaluate on the test dataset
	test_yhat = model.predict(X_test)
	test_acc = accuracy_score(y_test, test_yhat)
	test_scores.append(test_acc)
	# summarize progress
	print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))
# plot of train and test scores vs number of neighbors
pyplot.plot(values, train_scores, '-o', label='Train')
pyplot.plot(values, test_scores, '-o', label='Test')
pyplot.legend()
pyplot.show()

Running the example fits and evaluates a KNN model on the train and test sets for each number of neighbors and reports the accuracy scores.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Recall, we are looking for a pattern where performance on the test set improves and then starts to get worse, and performance on the training set continues to improve.

We do not see this pattern.

Instead, we see that accuracy on the training dataset starts at perfect accuracy and falls with almost every increase in the number of neighbors.

We also see that the performance of the model on the holdout test set improves to a value of about five neighbors, holds level, and then begins a downward trend after that.

>1, train: 1.000, test: 0.919
>2, train: 0.965, test: 0.916
>3, train: 0.962, test: 0.932
>4, train: 0.957, test: 0.932
>5, train: 0.954, test: 0.935
>6, train: 0.953, test: 0.934
>7, train: 0.952, test: 0.932
>8, train: 0.951, test: 0.933
>9, train: 0.949, test: 0.933
>10, train: 0.950, test: 0.935
>11, train: 0.947, test: 0.934
>12, train: 0.947, test: 0.933
>13, train: 0.945, test: 0.932
>14, train: 0.945, test: 0.932
>15, train: 0.944, test: 0.932
>16, train: 0.944, test: 0.934
>17, train: 0.943, test: 0.932
>18, train: 0.943, test: 0.935
>19, train: 0.942, test: 0.933
>20, train: 0.943, test: 0.935
>21, train: 0.942, test: 0.933
>22, train: 0.943, test: 0.933
>23, train: 0.941, test: 0.932
>24, train: 0.942, test: 0.932
>25, train: 0.942, test: 0.931
>26, train: 0.941, test: 0.930
>27, train: 0.941, test: 0.932
>28, train: 0.939, test: 0.932
>29, train: 0.938, test: 0.931
>30, train: 0.938, test: 0.931
>31, train: 0.937, test: 0.931
>32, train: 0.938, test: 0.931
>33, train: 0.937, test: 0.930
>34, train: 0.938, test: 0.931
>35, train: 0.937, test: 0.930
>36, train: 0.937, test: 0.928
>37, train: 0.936, test: 0.930
>38, train: 0.937, test: 0.930
>39, train: 0.935, test: 0.929
>40, train: 0.936, test: 0.929
>41, train: 0.936, test: 0.928
>42, train: 0.936, test: 0.929
>43, train: 0.936, test: 0.930
>44, train: 0.935, test: 0.929
>45, train: 0.935, test: 0.929
>46, train: 0.934, test: 0.929
>47, train: 0.935, test: 0.929
>48, train: 0.934, test: 0.929
>49, train: 0.934, test: 0.929
>50, train: 0.934, test: 0.929

A figure is also created that shows line plots of the model accuracy on the train and test sets with different numbers of neighbors.

The plots make the situation clearer. It looks as though the line plot for the training set is dropping to converge with the line for the test set. Indeed, this is exactly what is happening.

Line Plot of KNN Accuracy on Train and Test Datasets for Different Numbers of Neighbors

Now, recall how KNN works.

The “model” is really just the entire training dataset stored in an efficient data structure. Skill for the “model” on the training dataset should be 100 percent and anything less is unforgivable.

In fact, this argument holds for any machine learning algorithm and cuts to the core of the confusion around overfitting for beginners.
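
A quick sketch makes the point for KNN: with one neighbor, each training point’s nearest neighbor is itself, so accuracy on the training set is perfect by construction:

# a 1-nearest-neighbor "model" simply stores the training data
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
# scoring on the same data it was fit on returns perfect accuracy
print(model.score(X, y))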

Separate Overfitting Analysis From Model Selection

Overfitting can be an explanation for poor performance of a predictive model.

Creating learning curve plots that show the learning dynamics of a model on the train and test dataset is a helpful analysis for learning more about a model on a dataset.

But overfitting should not be confused with model selection.

We choose a predictive model or model configuration based on its out-of-sample performance. That is, its performance on new data not seen during training.

The reason we do this is that in predictive modeling, we are primarily interested in a model that makes skillful predictions. We want the model that can make the best possible predictions given the time and computational resources we have available.

This might mean we choose a model that looks like it has overfit the training dataset, in which case an overfitting analysis might be misleading.

It might also mean that the model has poor or terrible performance on the training dataset.

In general, if we cared about model performance on the training dataset in model selection, then we would expect a model to have perfect performance on the training dataset. It’s data we have available; we should not tolerate anything less.

As we saw with the KNN example above, we can achieve perfect performance on the training set by storing the training set directly and returning predictions with one neighbor at the cost of poor performance on any new data.

  • Wouldn’t a model that performs well on both train and test datasets be a better model?

Maybe. But, maybe not.

This argument is based on the idea that a model that performs well on both train and test sets has a better understanding of the underlying problem.

A corollary is that a model that performs well on the test set but poorly on the training set is lucky (e.g. a statistical fluke), and a model that performs well on the train set but poorly on the test set is overfit.

I believe this is the sticking point for beginners who often ask how to fix overfitting for their scikit-learn machine learning model.

The worry is that a model must perform well on both train and test sets, otherwise, they are in trouble.

This is not the case.

Performance on the training set is not relevant during model selection. You must focus on the out-of-sample performance only when choosing a predictive model.
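
As a minimal sketch of what this looks like in practice (not part of the original tutorial), we can compare configurations using cross-validated out-of-sample scores only and ignore training-set accuracy entirely:

# compare two KNN configurations by out-of-sample (cross-validated) accuracy only
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
for k in (1, 5):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print('n_neighbors=%d, mean cv accuracy: %.3f' % (k, scores.mean()))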

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to identify overfitting for machine learning models in Python.

Specifically, you learned:

  • Overfitting is a possible cause of poor generalization performance of a predictive model.
  • Overfitting can be analyzed for machine learning models by varying key model hyperparameters.
  • Although overfitting is a useful tool for analysis, it must not be confused with model selection.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


A Gentle Introduction to PyCaret for Machine Learning


PyCaret is a Python open source machine learning library designed to make performing standard tasks in a machine learning project easy.

It is a Python version of the Caret machine learning package in R, popular because it allows models to be evaluated, compared, and tuned on a given dataset with just a few lines of code.

The PyCaret library provides these features, allowing the machine learning practitioner in Python to spot check a suite of standard machine learning algorithms on a classification or regression dataset with a single function call.

In this tutorial, you will discover the PyCaret Python open source library for machine learning.

After completing this tutorial, you will know:

  • PyCaret is a Python version of the popular and widely used caret machine learning package in R.
  • How to use PyCaret to easily evaluate and compare standard machine learning models on a dataset.
  • How to use PyCaret to easily tune the hyperparameters of a well-performing machine learning model.

Let’s get started.

A Gentle Introduction to PyCaret for Machine Learning
Photo by Thomas, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. What Is PyCaret?
  2. Sonar Dataset
  3. Comparing Machine Learning Models
  4. Tuning Machine Learning Models

What Is PyCaret?

PyCaret is an open source Python machine learning library inspired by the caret R package.

The goal of the caret package is to automate the major steps for evaluating and comparing machine learning algorithms for classification and regression. The main benefit of the library is that a lot can be achieved with very few lines of code and little manual configuration. The PyCaret library brings these capabilities to Python.

PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding.

PyCaret Homepage

The PyCaret library automates many steps of a machine learning project, such as:

  • Defining the data transforms to perform (setup())
  • Evaluating and comparing standard models (compare_models())
  • Tuning model hyperparameters (tune_model())

As well as many more features, including creating ensembles, saving models, and deploying models.

The PyCaret library has a wealth of documentation for using the API; you can get started here:

We will not explore all of the features of the library in this tutorial; instead, we will focus on simple machine learning model comparison and hyperparameter tuning.

You can install PyCaret using your Python package manager, such as pip. For example:

pip install pycaret

Once installed, we can confirm that the library is available in your development environment and is working correctly by printing the installed version.

# check pycaret version
import pycaret
print('PyCaret: %s' % pycaret.__version__)

Running the example will load the PyCaret library and print the installed version number.

Your version number should be the same or higher.

PyCaret: 2.0.0

If you need help installing PyCaret for your system, you can see the installation instructions here:

Now that we are familiar with what PyCaret is, let’s explore how we might use it on a machine learning project.

Sonar Dataset

We will use the Sonar standard binary classification dataset. You can learn more about it here:

We can download the dataset directly from the URL and load it as a Pandas DataFrame.

...
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize the shape of the dataset
print(df.shape)

PyCaret seems to require that the dataset has column names, and our dataset does not have column names, so we can use the column number as the column name directly.

...
# set column names as the column number
n_cols = df.shape[1]
df.columns = [str(i) for i in range(n_cols)]

Finally, we can summarize the first few rows of data.

...
# summarize the first few rows of data
print(df.head())

Tying this together, the complete example of loading and summarizing the Sonar dataset is listed below.

# load the sonar dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize the shape of the dataset
print(df.shape)
# set column names as the column number
n_cols = df.shape[1]
df.columns = [str(i) for i in range(n_cols)]
# summarize the first few rows of data
print(df.head())

Running the example first loads the dataset and reports the shape, showing it has 208 rows and 61 columns.

The first five rows are then printed showing that the input variables are all numeric and the target variable is column “60” and has string labels.

(208, 61)
0 1 2 3 4 ... 56 57 58 59 60
0 0.0200 0.0371 0.0428 0.0207 0.0954 ... 0.0180 0.0084 0.0090 0.0032 R
1 0.0453 0.0523 0.0843 0.0689 0.1183 ... 0.0140 0.0049 0.0052 0.0044 R
2 0.0262 0.0582 0.1099 0.1083 0.0974 ... 0.0316 0.0164 0.0095 0.0078 R
3 0.0100 0.0171 0.0623 0.0205 0.0205 ... 0.0050 0.0044 0.0040 0.0117 R
4 0.0762 0.0666 0.0481 0.0394 0.0590 ... 0.0072 0.0048 0.0107 0.0094 R

Next, we can use PyCaret to evaluate and compare a suite of standard machine learning algorithms to quickly discover what works well on this dataset.

PyCaret for Comparing Machine Learning Models

In this section, we will evaluate and compare the performance of standard machine learning models on the Sonar classification dataset.

First, we must set the dataset with the PyCaret library via the setup() function. This requires that we provide the Pandas DataFrame and specify the name of the column that contains the target variable.

The setup() function also allows you to configure simple data preparation, such as scaling, power transforms, missing data handling, and PCA transforms.
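
As a rough sketch, some of these preparation steps can be switched on directly in setup(); the normalize, transformation, pca, and numeric_imputation argument names below are assumed from the PyCaret 2.x documentation and may differ in other versions:

...
# setup the dataset with optional data preparation (argument names assumed from PyCaret 2.x)
grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False,
    normalize=True,             # scale the numeric input features
    transformation=True,        # apply a power transform to the inputs
    pca=True,                   # apply PCA dimensionality reduction
    numeric_imputation='mean')  # fill missing numeric values with the column mean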

We will specify the data, target variable, and turn off HTML output, verbose output, and requests for user feedback.

...
# setup the dataset
grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)

Next, we can compare standard machine learning models by calling the compare_models() function.

By default, it will evaluate models using 10-fold cross-validation, sort results by classification accuracy, and return the single best model.

These are good defaults, and we don’t need to change a thing.

...
# evaluate models and compare models
best = compare_models()

Calling the compare_models() function will also report a table of results summarizing all of the models that were evaluated and their performance.

Finally, we can report the best-performing model and its configuration.

Tying this together, the complete example of evaluating a suite of standard models on the Sonar classification dataset is listed below.

# compare machine learning algorithms on the sonar classification dataset
from pandas import read_csv
from pycaret.classification import setup
from pycaret.classification import compare_models
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
# load the dataset
df = read_csv(url, header=None)
# set column names as the column number
n_cols = df.shape[1]
df.columns = [str(i) for i in range(n_cols)]
# setup the dataset
grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)
# evaluate models and compare models
best = compare_models()
# report the best model
print(best)

Running the example will load the dataset, configure the PyCaret library, evaluate a suite of standard models, and report the best model found for the dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the “Extra Trees Classifier” has the best accuracy on the dataset with a score of about 86.95 percent.

We can then see the configuration of the model that was used, which looks like it used default hyperparameter values.

Model  Accuracy     AUC  Recall   Prec.      F1  \
0            Extra Trees Classifier    0.8695  0.9497  0.8571  0.8778  0.8631
1               CatBoost Classifier    0.8695  0.9548  0.8143  0.9177  0.8508
2   Light Gradient Boosting Machine    0.8219  0.9096  0.8000  0.8327  0.8012
3      Gradient Boosting Classifier    0.8010  0.8801  0.7690  0.8110  0.7805
4              Ada Boost Classifier    0.8000  0.8474  0.7952  0.8071  0.7890
5            K Neighbors Classifier    0.7995  0.8613  0.7405  0.8276  0.7773
6         Extreme Gradient Boosting    0.7995  0.8934  0.7833  0.8095  0.7802
7          Random Forest Classifier    0.7662  0.8778  0.6976  0.8024  0.7345
8          Decision Tree Classifier    0.7533  0.7524  0.7119  0.7655  0.7213
9                  Ridge Classifier    0.7448  0.0000  0.6952  0.7574  0.7135
10                      Naive Bayes    0.7214  0.8159  0.8286  0.6700  0.7308
11              SVM - Linear Kernel    0.7181  0.0000  0.6286  0.7146  0.6309
12              Logistic Regression    0.7100  0.8104  0.6357  0.7263  0.6634
13     Linear Discriminant Analysis    0.6924  0.7510  0.6667  0.6762  0.6628
14  Quadratic Discriminant Analysis    0.5800  0.6308  0.1095  0.5000  0.1750

     Kappa     MCC  TT (Sec)
0   0.7383  0.7446    0.1415
1   0.7368  0.7552    1.9930
2   0.6410  0.6581    0.0134
3   0.5989  0.6090    0.1413
4   0.5979  0.6123    0.0726
5   0.5957  0.6038    0.0019
6   0.5970  0.6132    0.0287
7   0.5277  0.5438    0.1107
8   0.5028  0.5192    0.0035
9   0.4870  0.5003    0.0030
10  0.4488  0.4752    0.0019
11  0.4235  0.4609    0.0024
12  0.4143  0.4285    0.0059
13  0.3825  0.3927    0.0034
14  0.1172  0.1792    0.0033
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=2728, verbose=0,
                     warm_start=False)

We could use this configuration directly and fit a model on the entire dataset and use it to make predictions on new data.
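
For example, because the returned best model is a scikit-learn estimator, one rough sketch is to refit it on the entire dataset and call predict() for new rows (the input/target slicing below assumes the last column of the DataFrame is the target, as in this tutorial):

...
# split the DataFrame into input and target columns (the last column is the target)
X, y = df.values[:, :-1], df.values[:, -1]
# refit the best-performing configuration on all available data
best.fit(X, y)
# make a prediction for the first row of data as an example
yhat = best.predict(X[:1])
print('Predicted Class: %s' % yhat[0])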

We can also use the table of results to get an idea of the types of models that perform well on the dataset, in this case, ensembles of decision trees.

Now that we are familiar with how to compare machine learning models using PyCaret, let’s look at how we might use the library to tune model hyperparameters.

Tuning Machine Learning Models

In this section, we will tune the hyperparameters of a machine learning model on the Sonar classification dataset.

We must load and set up the dataset as we did before when comparing models.

...
# setup the dataset
grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)

We can tune model hyperparameters using the tune_model() function in the PyCaret library.

The function takes an instance of the model to tune as input and knows what hyperparameters to tune automatically. A random search of model hyperparameters is performed and the total number of evaluations can be controlled via the “n_iter” argument.

By default, the function will optimize the ‘Accuracy‘ and will evaluate the performance of each configuration using 10-fold cross-validation, although this sensible default configuration can be changed.

We can perform a random search of the extra trees classifier as follows:

...
# tune model hyperparameters
best = tune_model(ExtraTreesClassifier(), n_iter=200)

The function will return the best-performing model, which can be used directly or printed to determine the hyperparameters that were selected.

It will also print a table of the results for the best configuration across the number of folds in the k-fold cross-validation (e.g. 10 folds).

Tying this together, the complete example of tuning the hyperparameters of the extra trees classifier on the Sonar dataset is listed below.

# tune model hyperparameters on the sonar classification dataset
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
from pycaret.classification import setup
from pycaret.classification import tune_model
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
# load the dataset
df = read_csv(url, header=None)
# set column names as the column number
n_cols = df.shape[1]
df.columns = [str(i) for i in range(n_cols)]
# setup the dataset
grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)
# tune model hyperparameters
best = tune_model(ExtraTreesClassifier(), n_iter=200, choose_better=True)
# report the best model
print(best)

Running the example first loads the dataset and configures the PyCaret library.

A random search is then performed, reporting the performance of the best-performing configuration across the 10 folds of cross-validation and the mean accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the random search found a configuration with an accuracy of about 75.29 percent, which is not better than the default configuration from the previous section that achieved a score of about 86.95 percent.

Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
0       0.8667  1.0000  1.0000  0.7778  0.8750  0.7368  0.7638
1       0.6667  0.8393  0.4286  0.7500  0.5455  0.3119  0.3425
2       0.6667  0.8036  0.2857  1.0000  0.4444  0.2991  0.4193
3       0.7333  0.7321  0.4286  1.0000  0.6000  0.4444  0.5345
4       0.6667  0.5714  0.2857  1.0000  0.4444  0.2991  0.4193
5       0.8571  0.8750  0.6667  1.0000  0.8000  0.6957  0.7303
6       0.8571  0.9583  0.6667  1.0000  0.8000  0.6957  0.7303
7       0.7857  0.8776  0.5714  1.0000  0.7273  0.5714  0.6325
8       0.6429  0.7959  0.2857  1.0000  0.4444  0.2857  0.4082
9       0.7857  0.8163  0.5714  1.0000  0.7273  0.5714  0.6325
Mean    0.7529  0.8270  0.5190  0.9528  0.6408  0.4911  0.5613
SD      0.0846  0.1132  0.2145  0.0946  0.1571  0.1753  0.1485
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=1, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=4, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=120,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

We might be able to improve upon the random search by specifying to the tune_model() function which hyperparameters to search and what ranges to search, as shown in the sketch below.
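
A rough sketch of this, assuming the tune_model() function accepts a "custom_grid" dictionary of hyperparameter values (as described in the PyCaret 2.x documentation):

...
# define the hyperparameter values to search (names match the scikit-learn ExtraTreesClassifier)
params = {'n_estimators': [10, 100, 500], 'max_features': ['sqrt', 'log2'], 'min_samples_leaf': [1, 2, 4]}
# tune model hyperparameters over the custom grid
best = tune_model(ExtraTreesClassifier(), custom_grid=params, n_iter=200)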

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the PyCaret Python open source library for machine learning.

Specifically, you learned:

  • PyCaret is a Python version of the popular and widely used caret machine learning package in R.
  • How to use PyCaret to easily evaluate and compare standard machine learning models on a dataset.
  • How to use PyCaret to easily tune the hyperparameters of a well-performing machine learning model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to PyCaret for Machine Learning appeared first on Machine Learning Mastery.

Perceptron Algorithm for Classification in Python

The Perceptron is a linear machine learning algorithm for binary classification tasks.

It may be considered one of the first and one of the simplest types of artificial neural networks. It is definitely not “deep” learning but is an important building block.

Like logistic regression, it can quickly learn a linear separation in feature space for two-class classification tasks, although unlike logistic regression, it learns using the stochastic gradient descent optimization algorithm and does not predict calibrated probabilities.

In this tutorial, you will discover the Perceptron classification machine learning algorithm.

After completing this tutorial, you will know:

  • The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
  • How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
  • How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

Let’s get started.

Perceptron Algorithm for Classification in Python
Photo by Belinda Novika, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Perceptron Algorithm
  2. Perceptron With Scikit-Learn
  3. Tune Perceptron Hyperparameters

Perceptron Algorithm

The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.

It is a type of neural network model, perhaps the simplest type of neural network model.

It consists of a single node or neuron that takes a row of data as input and predicts a class label. This is achieved by calculating the weighted sum of the inputs and a bias (set to 1). The weighted sum of the input of the model is called the activation.

  • Activation = Weights * Inputs + Bias

If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.

  • Predict 1: If Activation > 0.0
  • Predict 0: If Activation <= 0.0

Given that the inputs are multiplied by model coefficients, like linear regression and logistic regression, it is good practice to normalize or standardize data prior to using the model.
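
For example, a minimal sketch that standardizes the inputs before the model, using the scikit-learn classes introduced later in this tutorial:

...
# standardize the inputs before fitting the Perceptron
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
# the pipeline scales the data and then fits the model
model = Pipeline([('scaler', StandardScaler()), ('perceptron', Perceptron())])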

The Perceptron is a linear classification algorithm. This means that it learns a decision boundary that separates two classes using a line (called a hyperplane) in the feature space. As such, it is appropriate for those problems where the classes can be separated well by a line or linear model, referred to as linearly separable.

The coefficients of the model are referred to as input weights and are trained using the stochastic gradient descent optimization algorithm.

Examples from the training dataset are shown to the model one at a time, the model makes a prediction, and error is calculated. The weights of the model are then updated to reduce the errors for the example. This is called the Perceptron update rule. This process is repeated for all examples in the training dataset, called an epoch. This process of updating the model using examples is then repeated for many epochs.

Model weights are updated with a small proportion of the error on each update, and the proportion is controlled by a hyperparameter called the learning rate, typically set to a small value. This is to ensure learning does not occur too quickly, resulting in a possibly lower-skill model, referred to as premature convergence of the optimization (search) procedure for the model weights.

  • weights(t + 1) = weights(t) + learning_rate * (expected_i – predicted_i) * input_i

Training is stopped when the error made by the model falls to a low level or no longer improves, or a maximum number of epochs is performed.

The initial values for the model weights are set to small random values. Additionally, the training dataset is shuffled prior to each training epoch. This is by design to accelerate and improve the model training process. Because of this, the learning algorithm is stochastic and may achieve different results each time it is run. As such, it is good practice to summarize the performance of the algorithm on a dataset using repeated evaluation and reporting the mean classification accuracy.

The learning rate and number of training epochs are hyperparameters of the algorithm that can be set using heuristics or hyperparameter tuning.
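
The prediction and update rules above can be written as a short sketch; this is a simplified from-scratch illustration (fixed learning rate and number of epochs assumed), not the scikit-learn implementation used later in this tutorial:

# simplified perceptron training loop illustrating the prediction and update rules
import numpy as np

def train_perceptron(X, y, learning_rate=0.01, epochs=100):
    rng = np.random.default_rng(1)
    # initial weights are small random values
    weights = rng.normal(scale=0.01, size=X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        # shuffle the training examples before each epoch
        for i in rng.permutation(len(X)):
            # activation = weights * inputs + bias
            activation = np.dot(weights, X[i]) + bias
            # predict 1 if the activation is above 0.0, otherwise 0
            predicted = 1.0 if activation > 0.0 else 0.0
            error = y[i] - predicted
            # perceptron update rule: nudge the weights to reduce the error
            weights += learning_rate * error * X[i]
            bias += learning_rate * error
    return weights, bias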

For more about the Perceptron algorithm, see the tutorial:

Now that we are familiar with the Perceptron algorithm, let’s explore how we can use the algorithm in Python.

Perceptron With Scikit-Learn

The Perceptron algorithm is available in the scikit-learn Python machine learning library via the Perceptron class.

The class allows you to configure the learning rate (eta0), which defaults to 1.0.

...
# define model
model = Perceptron(eta0=1.0)

The implementation also allows you to configure the total number of training epochs (max_iter), which defaults to 1,000.

...
# define model
model = Perceptron(max_iter=1000)

The scikit-learn implementation of the Perceptron algorithm also provides other configuration options that you may want to explore, such as early stopping and the use of a penalty loss.
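
For example, a rough sketch enabling early stopping and an L2 penalty (these are documented options of the scikit-learn Perceptron class):

...
# define a Perceptron with early stopping and an L2 penalty
model = Perceptron(early_stopping=True, validation_fraction=0.1, n_iter_no_change=5, penalty='l2', alpha=0.0001)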

We can demonstrate the Perceptron classifier with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 10 input variables.

The example creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 10) (1000,)

We can fit and evaluate a Perceptron model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration.

...
# create the model
model = Perceptron()

The complete example of evaluating the Perceptron model for the synthetic binary classification task is listed below.

# evaluate a perceptron model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Perceptron algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 84.7 percent.

Mean Accuracy: 0.847 (0.052)

We may decide to use the Perceptron classifier as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a perceptron model on the dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# fit model
model.fit(X, y)
# define new data
row = [0.12777556,-3.64400522,-2.23268854,-1.82114386,1.75466361,0.1243966,1.03397657,2.35822076,1.01001752,0.56768485]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat)

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 1

Next, we can look at configuring the model hyperparameters.

Tune Perceptron Hyperparameters

The hyperparameters for the Perceptron algorithm must be configured for your specific dataset.

Perhaps the most important hyperparameter is the learning rate.

A large learning rate can cause the model to learn fast, but perhaps at the cost of lower skill. A smaller learning rate can result in a better-performing model but may take a long time to train the model.

You can learn more about exploring learning rates in the tutorial:

It is common to test learning rates on a log scale between a small value such as 1e-4 (or smaller) and 1.0. We will test the following values in this case:

...
# define grid
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

# grid search learning rate for the perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that a smaller learning rate than the default results in better performance with learning rate 0.0001 and 0.001 both achieving a classification accuracy of about 85.7 percent as compared to the default of 1.0 that achieved an accuracy of about 84.7 percent.

Mean Accuracy: 0.857
Config: {'eta0': 0.0001}
>0.857 with: {'eta0': 0.0001}
>0.857 with: {'eta0': 0.001}
>0.853 with: {'eta0': 0.01}
>0.847 with: {'eta0': 0.1}
>0.847 with: {'eta0': 1.0}

Another important hyperparameter is how many epochs are used to train the model.

This may depend on the training dataset and could vary greatly. Again, we will explore configuration values on a log scale between 1 and 1e+4.

...
# define grid
grid = dict()
grid['max_iter'] = [1, 10, 100, 1000, 10000]

We will use our well-performing learning rate of 0.0001 found in the previous search.

...
# define model
model = Perceptron(eta0=0.0001)

The complete example of grid searching the number of training epochs is listed below.

# grid search total epochs for the perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron(eta0=0.0001)
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['max_iter'] = [1, 10, 100, 1000, 10000]
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that epochs 10 to 10,000 result in about the same classification accuracy. An interesting extension would be to explore configuring the learning rate and the number of training epochs at the same time to see if better results can be achieved; a sketch of such a joint search follows the results below.

Mean Accuracy: 0.857
Config: {'max_iter': 10}
>0.850 with: {'max_iter': 1}
>0.857 with: {'max_iter': 10}
>0.857 with: {'max_iter': 100}
>0.857 with: {'max_iter': 1000}
>0.857 with: {'max_iter': 10000}
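
As suggested above, a sketch of searching the learning rate and the number of training epochs at the same time might look as follows; the grid values are illustrative:

# grid search learning rate and total epochs together for the perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define a joint grid over the learning rate and the number of epochs
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01]
grid['max_iter'] = [10, 100, 1000]
# define and perform the search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
results = search.fit(X, y)
# summarize the best result
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)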

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Books

APIs

Articles

Summary

In this tutorial, you discovered the Perceptron classification machine learning algorithm.

Specifically, you learned:

  • The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
  • How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
  • How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Perceptron Algorithm for Classification in Python appeared first on Machine Learning Mastery.

Semi-Supervised Learning With Label Propagation

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label propagation algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

Semi-Supervised Learning With Label Propagation
Photo by TheBluesDude, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Label Propagation Algorithm
  2. Semi-Supervised Classification Dataset
  3. Label Propagation for Semi-Supervised Learning

Label Propagation Algorithm

Label Propagation is a semi-supervised learning algorithm.

The algorithm was proposed in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning From Labeled And Unlabeled Data With Label Propagation.”

The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then have soft labels or label distributions based on the labels or label distributions of examples connected nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labeled and unlabeled examples to improve on supervised methods that use only the labeled data. This geometry can be naturally represented by an empirical graph g = (V,E) where nodes V = {1,…,n} represent the training data and edges E represent similarities between them

— Page 193, Semi-Supervised Learning, 2006.

Propagation refers to the iterative nature that labels are assigned to nodes in the graph and propagate along the edges of the graph to connected nodes.

This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices.

— Page 48, Introduction to Semi-Supervised Learning, 2009.

The process is repeated for a fixed number of iterations to strengthen the labels assigned to unlabeled examples.

Starting with nodes 1, 2,…,l labeled with their known label (1 or −1) and nodes l + 1,…,n labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

— Page 194, Semi-Supervised Learning, 2006.

Now that we are familiar with the Label Propagation algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label propagation algorithm to the dataset.

Label Propagation for Semi-Supervised Learning

The Label Propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

Importantly, the training dataset provided to the fit() function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “transduction_” attribute on the LabelPropagation class.

...
# get labels for entire training dataset data
tran_labels = model.transduction_

Now that we are familiar with how to use the Label Propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (marking them as unlabeled) for each row in the unlabeled portion of the training dataset.

...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the LabelPropagation model on the entire training dataset.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

# evaluate label propagation on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.600

So far, so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

...
# get labels for entire training dataset data
tran_labels = model.transduction_

We can then use these labels along with all of the input data to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label propagation for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the semi-supervised learning used alone that achieved an accuracy of about 85.6 percent.

Accuracy: 86.200

Can you achieve better results by tuning the hyperparameters of the LabelPropagation model?
Let me know what you discover in the comments below.
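
As a starting point, a rough sketch (continuing from the complete example above) that manually compares a few LabelPropagation configurations on the same holdout test set; the kernel, gamma, and n_neighbors arguments are documented hyperparameters of the class:

...
# try a few LabelPropagation configurations and compare them on the holdout test set
configs = [LabelPropagation(kernel='rbf', gamma=20),
    LabelPropagation(kernel='rbf', gamma=5),
    LabelPropagation(kernel='knn', n_neighbors=7),
    LabelPropagation(kernel='knn', n_neighbors=15)]
for candidate in configs:
    # fit on the mixed labeled/unlabeled training data
    candidate.fit(X_train_mixed, y_train_mixed)
    # evaluate on the holdout test set
    yhat = candidate.predict(X_test)
    print(candidate, 'Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))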

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

APIs

Articles

Summary

In this tutorial, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Propagation appeared first on Machine Learning Mastery.

Multinomial Logistic Regression With Python

Multinomial logistic regression is an extension of logistic regression that adds native support for multi-class classification problems.

Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.

Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model that involves changing the loss function to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution, in order to natively support multi-class classification problems.

In this tutorial, you will discover how to develop multinomial logistic regression models in Python.

After completing this tutorial, you will know:

  • Multinomial logistic regression is an extension of logistic regression for multi-class classification.
  • How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
  • How to tune the penalty hyperparameter for the multinomial logistic regression model.

Let’s get started.

Multinomial Logistic Regression With Python
Photo by Nicolas Rénac, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Multinomial Logistic Regression
  2. Evaluate Multinomial Logistic Regression Model
  3. Tune Penalty for Multinomial Logistic Regression

Multinomial Logistic Regression

Logistic regression is a classification algorithm.

It is intended for datasets that have numerical input variables and a categorical target variable that has two values or classes. Problems of this type are referred to as binary classification problems.

Logistic regression is designed for two-class problems, modeling the target using a binomial probability distribution function. The class labels are mapped to 1 for the positive class or outcome and 0 for the negative class or outcome. The fit model predicts the probability that an example belongs to class 1.

By default, logistic regression cannot be used for classification tasks that have more than two class labels, so-called multi-class classification.

Instead, it requires modification to support multi-class classification problems.

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem. Techniques of this type include one-vs-rest and one-vs-one wrapper models.
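
For example, a minimal sketch of the one-vs-rest wrapper approach using scikit-learn, shown here only for contrast with the native multinomial approach used in the rest of this tutorial:

...
# wrap a binary logistic regression in a one-vs-rest strategy for multi-class classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# one binary logistic regression model is fit per class
ovr = OneVsRestClassifier(LogisticRegression())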

An alternate approach involves changing the logistic regression model to support the prediction of multiple class labels directly. Specifically, to predict the probability that an input example belongs to each known class label.

The probability distribution that defines multi-class probabilities is called a multinomial probability distribution. A logistic regression model that is adapted to learn and predict a multinomial probability distribution is referred to as Multinomial Logistic Regression. Similarly, we might refer to default or standard logistic regression as Binomial Logistic Regression.

  • Binomial Logistic Regression: Standard logistic regression that predicts a binomial probability (i.e. for two classes) for each input example.
  • Multinomial Logistic Regression: Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

If you are new to binomial and multinomial probability distributions, you may want to read the tutorial:

Changing logistic regression from binomial to multinomial probability requires a change to the loss function used to train the model (e.g. log loss to cross-entropy loss), and a change to the output from a single probability value to one probability for each class label.

Now that we are familiar with multinomial logistic regression, let’s look at how we might develop and evaluate multinomial logistic regression models in Python.

Evaluate Multinomial Logistic Regression Model

In this section, we will develop and evaluate a multinomial logistic regression model using the scikit-learn Python machine learning library.

First, we will define a synthetic multi-class classification dataset to use as the basis of the investigation. This is a generic dataset that you can easily replace with your own loaded dataset later.

The make_classification() function can be used to generate a dataset with a given number of rows, columns, and classes. In this case, we will generate a dataset with 1,000 rows, 10 input variables or columns, and 3 classes.

The example below generates the dataset and summarizes the shape of the arrays and the distribution of examples across the three classes.

# test classification dataset
from collections import Counter
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
print(Counter(y))

Running the example confirms that the dataset has 1,000 rows and 10 columns, as we expected, and that the rows are distributed approximately evenly across the three classes, with about 334 examples in each class.

(1000, 10) (1000,)
Counter({1: 334, 2: 334, 0: 332})

Logistic regression is supported in the scikit-learn library via the LogisticRegression class.

The LogisticRegression class can be configured for multinomial logistic regression by setting the “multi_class” argument to “multinomial” and the “solver” argument to a solver that supports multinomial logistic regression, such as “lbfgs“.

...
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

The multinomial logistic regression model will be fit using cross-entropy loss and will predict the integer value for each integer encoded class label.

Now that we are familiar with the multinomial logistic regression API, we can look at how we might evaluate a multinomial logistic regression model on our synthetic multi-class classification dataset.

It is a good practice to evaluate classification models using repeated stratified k-fold cross-validation. The stratification ensures that each cross-validation fold has approximately the same distribution of examples in each class as the whole training dataset.

We will use three repeats with 10 folds, which is a good default, and evaluate model performance using classification accuracy given that the classes are balanced.

The complete example of evaluating multinomial logistic regression for multi-class classification is listed below.

# evaluate multinomial logistic regression model
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the model performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy across all folds and repeats of the evaluation procedure.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the multinomial logistic regression model with default penalty achieved a mean classification accuracy of about 68.1 percent on our synthetic classification dataset.

Mean Accuracy: 0.681 (0.042)

We may decide to use the multinomial logistic regression model as our final model and make predictions on new data.

This can be achieved by first fitting the model on all available data, then calling the predict() function to make a prediction for new data.

The example below demonstrates how to make a prediction for new data using the multinomial logistic regression model.

# make a prediction with a multinomial logistic regression model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# fit the model on the whole dataset
model.fit(X, y)
# define a single row of input data
row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
# predict the class label
yhat = model.predict([row])
# summarize the predicted class
print('Predicted Class: %d' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to make a prediction.

In this case, we can see that the model predicted the class “1” for the single row of data.

Predicted Class: 1

A benefit of multinomial logistic regression is that it can predict calibrated probabilities across all known class labels in the dataset.

This can be achieved by calling the predict_proba() function on the model.

The example below demonstrates how to predict a multinomial probability distribution for a new example using the multinomial logistic regression model.

# predict probabilities with a multinomial logistic regression model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# fit the model on the whole dataset
model.fit(X, y)
# define a single row of input data
row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
# predict a multinomial probability distribution
yhat = model.predict_proba([row])
# summarize the predicted probabilities
print('Predicted Probabilities: %s' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to predict class probabilities.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that class 1 (i.e., the array index maps to the class integer value) has the largest predicted probability of about 0.50.

Predicted Probabilities: [0.16470456 0.50297138 0.33232406]

Now that we are familiar with evaluating and using multinomial logistic regression models, let’s explore how we might tune the model hyperparameters.

Tune Penalty for Multinomial Logistic Regression

An important hyperparameter to tune for multinomial logistic regression is the penalty term.

This term imposes pressure on the model to seek smaller model weights. This is achieved by adding a weighted sum of the model coefficients to the loss function, encouraging the model to reduce the size of the weights along with the error while fitting the model.

A popular type of penalty is the L2 penalty that adds the (weighted) sum of the squared coefficients to the loss function. A weighting can be applied to this penalty to scale its strength from a full penalty down to a very slight penalty.
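
Schematically, and ignoring scaling constants, the L2-penalized objective being minimized has the following form, where lambda is a generic penalty weight used here for illustration (it is not a scikit-learn argument):

  • penalized loss = log loss + lambda * sum for j to M (coefficient_j)^2

Larger values of lambda push the coefficients more aggressively toward zero.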

By default, the LogisticRegression class uses the L2 penalty with the weighting (the "C" argument) set to 1.0. The type of penalty can be set via the "penalty" argument with values of "l1", "l2", or "elasticnet" (a mix of both), although not all solvers support all penalty types. The strength of the penalty is controlled via the "C" argument.

...
# define the multinomial logistic regression model with a default penalty
LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=1.0)

The "C" argument is actually the inverse of the regularization strength, meaning that smaller values of C specify a stronger penalty.

From the documentation:

C : float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

This means that, within the range we will explore, values close to 1.0 indicate very little penalty and values close to zero indicate a strong penalty. Note that C=1.0 does not disable the penalty; it simply applies a relatively light one.

  • C close to 1.0: Light penalty.
  • C close to 0.0: Strong penalty.

The penalty can be disabled by setting the “penalty” argument to the string “none“.

...
# define the multinomial logistic regression model without a penalty
LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')

Now that we are familiar with the penalty, let’s look at how we might explore the effect of different penalty values on the performance of the multinomial logistic regression model.

It is common to test penalty values on a log scale in order to quickly discover the scale of penalty that works well for a model. Once found, further tuning at that scale may be beneficial.
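
For example, candidate C values on a log scale can be generated with NumPy; the logspace() call below is one illustrative way to do this and is not part of the worked example that follows.

# generate candidate C values on a log scale (illustrative only)
from numpy import logspace
# five values from 10^-4 to 10^0, i.e. 0.0001 to 1.0
c_values = logspace(-4, 0, num=5)
print(c_values)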

We will explore the L2 penalty with weighting values on a log scale between 0.0001 and 1.0, in addition to no penalty (labeled 0.0 in the results).

The complete example of evaluating L2 penalty values for multinomial logistic regression is listed below.

# tune regularization for multinomial logistic regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for p in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]:
		# create name for model
		key = '%.4f' % p
		# turn off penalty in some cases
		if p == 0.0:
			# no penalty in this case
			models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')
		else:
			models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=p)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model and collect the scores
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# summarize progress along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example reports the mean classification accuracy for each configuration along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a C value of 1.0 achieves the best score of about 77.7 percent, matching the score achieved when no penalty is used at all.

>0.0000 0.777 (0.037)
>0.0001 0.683 (0.049)
>0.0010 0.762 (0.044)
>0.0100 0.775 (0.040)
>0.1000 0.774 (0.038)
>1.0000 0.777 (0.037)

A box and whisker plot is created for the accuracy scores for each configuration and all plots are shown side by side on a figure on the same scale for direct comparison.

In this case, we can see that the larger penalty we use on this dataset (i.e. the smaller the C value), the worse the performance of the model.

Box and Whisker Plots of L2 Penalty Configuration vs. Accuracy for Multinomial Logistic Regression

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to develop multinomial logistic regression models in Python.

Specifically, you learned:

  • Multinomial logistic regression is an extension of logistic regression for multi-class classification.
  • How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
  • How to tune the penalty hyperparameter for the multinomial logistic regression model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

Semi-Supervised Learning With Label Spreading


Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagates known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label spreading algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

Semi-Supervised Learning With Label Spreading
Photo by Jernej Furman, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Label Spreading Algorithm
  2. Semi-Supervised Classification Dataset
  3. Label Spreading for Semi-Supervised Learning

Label Spreading Algorithm

Label Spreading is a semi-supervised learning algorithm.

The algorithm was introduced by Dengyong Zhou, et al. in their 2003 paper titled “Learning With Local And Global Consistency.”

The intuition for the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.

Learning With Local And Global Consistency, 2003.

Label spreading is inspired by a technique from experimental psychology called spreading activation networks.

This algorithm can be understood intuitively in terms of spreading activation networks from experimental psychology.

Learning With Local And Global Consistency, 2003.

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space.

The approach is very similar to the label propagation algorithm for semi-supervised learning.

Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value

— Page 196, Semi-Supervised Learning, 2006.

After convergence, labels are applied based on nodes that passed on the most information.

Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process.

Learning With Local And Global Consistency, 2003.
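
To make the mechanics concrete, the snippet below is a minimal NumPy sketch of the update described above, assuming an RBF affinity: the weight matrix is normalized symmetrically (S = D^-1/2 W D^-1/2) and label information is spread iteratively, with each node keeping a small contribution from its initial labels via the clamping factor alpha. It is a simplified illustration, not the scikit-learn implementation.

# minimal sketch of symmetric normalization and label spreading (illustrative only)
import numpy as np

def spread_labels(X, y, alpha=0.2, gamma=20.0, n_iter=30):
	# y uses -1 for unlabeled examples and class integers otherwise
	classes = np.unique(y[y != -1])
	# rbf affinity matrix with a zero diagonal
	dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
	W = np.exp(-gamma * dists)
	np.fill_diagonal(W, 0.0)
	# symmetric normalization: S = D^-1/2 W D^-1/2
	d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
	S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
	# one-hot initial labels, all zeros for unlabeled points
	Y0 = np.zeros((len(y), len(classes)))
	for k, c in enumerate(classes):
		Y0[y == c, k] = 1.0
	# iterate F <- alpha * S * F + (1 - alpha) * Y0 until (approximately) converged
	F = Y0.copy()
	for _ in range(n_iter):
		F = alpha * (S @ F) + (1 - alpha) * Y0
	# assign each point the class from which it received the most information
	return classes[F.argmax(axis=1)]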

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve a better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label spreading algorithm to the dataset.

Label Spreading for Semi-Supervised Learning

The label spreading algorithm is available in the scikit-learn Python machine learning library via the LabelSpreading class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

Importantly, the training dataset provided to the fit() function must include labeled examples that are ordinal encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the "transduction_" attribute on the LabelSpreading class.

...
# get labels for entire training dataset data
tran_labels = model.transduction_

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (marking "no label") for each row in the unlabeled portion of the training dataset.

...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the LabelSpreading model on the entire training dataset.

...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below.

# evaluate label spreading on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.400

So far so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label spreading model as follows:

...
# get labels for entire training dataset data
tran_labels = model.transduction_

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label spreading for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of a semi-supervised model followed by a supervised model achieves a classification accuracy of about 85.8 percent on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone, which achieved an accuracy of about 85.4 percent.

Accuracy: 85.800

Can you achieve better results by tuning the hyperparameters of the LabelSpreading model?
Let me know what you discover in the comments below.
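
As a starting point, the sketch below sweeps a handful of candidate values for the "alpha" clamping factor, reusing the X_train_mixed, y_train_mixed, X_test, and y_test arrays prepared above; the specific values are illustrative choices rather than an exhaustive search.

...
# simple sweep over the alpha clamping factor (illustrative only)
for alpha in [0.01, 0.1, 0.2, 0.5, 0.9]:
	# define and fit the model with the candidate alpha
	model = LabelSpreading(alpha=alpha)
	model.fit(X_train_mixed, y_train_mixed)
	# evaluate on the holdout test set
	yhat = model.predict(X_test)
	print('>alpha=%.2f accuracy=%.3f' % (alpha, accuracy_score(y_test, yhat) * 100))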

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

APIs

Articles

Summary

In this tutorial, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Spreading appeared first on Machine Learning Mastery.


A Gentle Introduction to Machine Learning Modeling Pipelines


Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.

Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).

Nevertheless, working with modeling pipelines can be confusing to beginners as it requires a shift in perspective of the applied machine learning process.

In this tutorial, you will discover modeling pipelines for applied machine learning.

After completing this tutorial, you will know:

  • Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
  • Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
  • Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Let’s get started.

A Gentle Introduction to Machine Learning Modeling Pipelines
Photo by Jay Huang, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Finding a Skillful Model Is Not Enough
  2. What Is a Modeling Pipeline?
  3. Implications of a Modeling Pipeline

Finding a Skillful Model Is Not Enough

Applied machine learning is the process of discovering the model that performs best for a given predictive modeling dataset.

In fact, it’s more than this.

In addition to discovering which model performs the best on your dataset, you must also discover:

  • Data transforms that best expose the unknown underlying structure of the problem to the learning algorithms.
  • Model hyperparameters that result in a good or best configuration of a chosen model.

There may also be additional considerations such as techniques that transform the predictions made by the model, like threshold moving or model calibration for predicted probabilities.

As such, it is common to think of applied machine learning as a large combinatorial search problem across data transforms, models, and model configurations.

This can be quite challenging in practice, as it requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes be evaluated consistently and correctly on a given test harness.

Although tricky, it may be manageable with a simple train-test split but becomes quite unmanageable when using k-fold cross-validation or even repeated k-fold cross-validation.

The solution is to use a modeling pipeline to keep everything straight.

What Is a Modeling Pipeline?

A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.

  • Pipeline: A linear sequence of data preparation and modeling steps that can be treated as an atomic unit.

To make the idea clear, let’s look at two simple examples:

The first example uses data normalization for the input variables and fits a logistic regression model:

  • [Input], [Normalization], [Logistic Regression], [Predictions]

The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine.

  • [Input], [Standardization], [RFE], [SVM], [Predictions]

You can imagine other examples of modeling pipelines.
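
In scikit-learn, both examples can be expressed directly with the Pipeline class; the sketch below is illustrative, and the step names are arbitrary labels chosen here.

# sketch of the two example pipelines (illustrative only)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# normalization followed by logistic regression
pipeline1 = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])
# standardization, RFE feature selection, then a support vector machine
pipeline2 = Pipeline(steps=[('std', StandardScaler()), ('rfe', RFE(estimator=LogisticRegression(), n_features_to_select=5)), ('model', SVC())])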

As an atomic unit, the pipeline can be evaluated using a preferred resampling scheme such as a train-test split or k-fold cross-validation.

This is important for two main reasons:

  • Avoid data leakage.
  • Consistency and reproducibility.

A modeling pipeline avoids the most common type of data leakage where data preparation techniques, such as scaling input values, are applied to the entire dataset. This is data leakage because it shares knowledge of the test dataset (such as observations that contribute to a mean or maximum known value) with the training dataset, and in turn, may result in overly optimistic model performance.

Instead, data transforms must be prepared on the training dataset only, then applied to the training dataset, test dataset, validation dataset, and any other datasets that require the transform prior to being used with the model.
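
For example, passing a whole pipeline to a resampling procedure ensures the transform is fit on the training folds only; the snippet below is a hedged sketch using a synthetic dataset.

# sketch of evaluating a pipeline without data leakage (illustrative only)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
# define the pipeline: normalization then logistic regression
pipeline = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])
# within each split, the scaler is fit on the training folds only
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print('Mean Accuracy: %.3f' % mean(scores))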

A modeling pipeline ensures that the sequence of data preparation operations performed is reproducible.

Without a modeling pipeline, the data preparation steps may be performed manually twice: once for evaluating the model and once for making predictions. Any changes to the sequence must be kept consistent in both cases, otherwise differences will impact the capability and skill of the model.

A pipeline ensures that the sequence of operations is defined once and is consistent when used for model evaluation or making predictions.

The Python scikit-learn machine learning library provides a machine learning modeling pipeline via the Pipeline class.

You can learn more about how to use this Pipeline API in this tutorial:

Implications of a Modeling Pipeline

The modeling pipeline is an important tool for machine learning practitioners.

Nevertheless, there are important implications that must be considered when using them.

The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

For example, a pipeline may use a data transform that configures itself automatically, such as the RFECV technique for feature selection.

  • When evaluating a pipeline that uses an automatically-configured data transform, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

The answer is, it doesn’t matter.

Another example is the use of hyperparameter tuning as the final step of the pipeline.

The grid search will be performed on the data provided by any prior transform steps in the pipeline and will then search for the best combination of hyperparameters for the model using that data, then fit a model with those hyperparameters on the data.

  • When evaluating a pipeline that grid searches model hyperparameters, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

The answer again is, it doesn’t matter.

The same answer applies when using a threshold-moving or probability calibration step at the end of the pipeline.

The reason is the same reason that we are not concerned about the specific internal structure or coefficients of the chosen model.

For example, when evaluating a logistic regression model, we don't need to inspect the coefficients chosen on each k-fold cross-validation round in order to choose the model. Instead, we focus on its out-of-fold predictive skill.

Similarly, when using a logistic regression model as the final model for making predictions on new data, we do not need to inspect the coefficients chosen when fitting the model on the entire dataset before making predictions.

We can inspect and discover the coefficients used by the model as an exercise in analysis, but it does not impact the selection and use of the model.

This same answer generalizes when considering a modeling pipeline.

We are not concerned about which features may have been automatically selected by a data transform in the pipeline. We are also not concerned about which hyperparameters were chosen for the model when using a grid search as the final step in the modeling pipeline.

In all three cases: the single model, the pipeline with automatic feature selection, and the pipeline with a grid search, we are evaluating the “model” or “modeling pipeline” as an atomic unit.

The pipeline allows us as machine learning practitioners to move up one level of abstraction and be less concerned with the specific outcomes of the algorithms and more concerned with the capability of a sequence of procedures.

As such, we can focus on evaluating the capability of the algorithms on the dataset, not the product of the algorithms, i.e. the model. Once we have an estimate of the pipeline's performance, we can apply it and be confident that we will get similar performance, on average.

It is a shift in thinking and may take some time to get used to.

It is also the philosophy behind modern AutoML (automatic machine learning) techniques that treat applied machine learning as a large combinatorial search problem.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered modeling pipelines for applied machine learning.

Specifically, you learned:

  • Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
  • Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
  • Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

Regression Metrics for Machine Learning


Regression refers to predictive modeling problems that involve predicting a numeric value.

It is different from classification that involves predicting a class label. Unlike classification, you cannot use classification accuracy to evaluate the predictions made by a regression model.

Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.

In this tutorial, you will discover how to calculate error metrics for regression predictive modeling projects.

After completing this tutorial, you will know:

  • Regression predictive modeling problems are those that involve predicting a numeric value.
  • Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
  • How to calculate and report mean squared error, root mean squared error, and mean absolute error.

Let’s get started.

Regression Metrics for Machine Learning
Photo by Gael Varoquaux, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Regression Predictive Modeling
  2. Evaluating Regression Models
  3. Metrics for Regression
    1. Mean Squared Error
    2. Root Mean Squared Error
    3. Mean Absolute Error

Regression Predictive Modeling

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

For more on approximating functions in applied machine learning, see the post:

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

Regression is different from classification, which involves predicting a category or class label.

For more on the difference between classification and regression, see the tutorial:

A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

  • A regression problem requires the prediction of a quantity.
  • A regression problem can have real-valued or discrete input variables.
  • A problem with multiple input variables is often called a multivariate regression problem.
  • A regression problem where input variables are ordered by time is called a time series forecasting problem.

Now that we are familiar with regression predictive modeling, let’s look at how we might evaluate a regression model.

Evaluating Regression Models

A common question by beginners to regression predictive modeling projects is:

How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

We cannot calculate accuracy for a regression model.

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

  • Mean Squared Error (MSE).
  • Root Mean Squared Error (RMSE).
  • Mean Absolute Error (MAE).

There are many other metrics for regression, although these are the most commonly used. You can see the full list of regression metrics supported by the scikit-learn Python machine learning library here:

In the next section, let’s take a closer look at each in turn.

Metrics for Regression

In this section, we will take a closer look at the popular metrics for regression models and how to calculate them for your predictive modeling project.

Mean Squared Error

Mean Squared Error, or MSE for short, is a popular error metric for regression problems.

It is also an important loss function for algorithms fit or optimized using the least squares framing of a regression problem. Here “least squares” refers to minimizing the mean squared error between predictions and expected values.

The MSE is calculated as the mean or average of the squared differences between predicted and expected target values in a dataset.

  • MSE = 1 / N * sum for i to N (y_i – yhat_i)^2

Where y_i is the i’th expected value in the dataset and yhat_i is the i’th predicted value. The difference between these two values is squared, which has the effect of removing the sign, resulting in a positive error value.

The squaring also has the effect of inflating or magnifying large errors. That is, the larger the difference between the predicted and expected values, the larger the resulting squared positive error. This has the effect of “punishing” models more for larger errors when MSE is used as a loss function. It also has the effect of “punishing” models by inflating the average error score when used as a metric.

We can create a plot to get a feeling for how the change in prediction error impacts the squared error.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The squared error between each prediction and expected value is calculated and plotted to show the quadratic increase in squared error.

...
# calculate error
err = (expected[i] - predicted[i])**2

The complete example is listed below.

# example of increase in mean squared error
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
	# calculate error
	err = (expected[i] - predicted[i])**2
	# store error
	errors.append(err)
	# report error
	print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Squared Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and squared error for each case.

We can see that the error rises quickly, faster than linear (a straight line).

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.010
>1.0, 0.8 = 0.040
>1.0, 0.7 = 0.090
>1.0, 0.6 = 0.160
>1.0, 0.5 = 0.250
>1.0, 0.4 = 0.360
>1.0, 0.3 = 0.490
>1.0, 0.2 = 0.640
>1.0, 0.1 = 0.810
>1.0, 0.0 = 1.000

A line plot is created showing the curved or super-linear increase in the squared error value as the difference between the expected and predicted value is increased.

The curve is not a straight line as we might naively assume for an error metric.

Line Plot of the Increase Square Error With Predictions

The individual error terms are averaged so that we can report the performance of a model with regard to how much error the model makes generally when making predictions, rather than specifically for a given example.

The units of the MSE are squared units.

For example, if your target value represents “dollars,” then the MSE will be “squared dollars.” This can be confusing for stakeholders; therefore, when reporting results, often the root mean squared error is used instead (discussed in the next section).

The mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted)

The example below gives an example of calculating the mean squared error between a list of contrived expected and predicted values.

# example of calculate the mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean squared error.

0.35000000000000003

A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MSE is relative to your specific dataset.

It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.
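
One way to get such a baseline is with the DummyRegressor class from scikit-learn, which simply predicts the mean of the training target values; the snippet below is an illustrative sketch on a synthetic regression dataset.

# sketch of a naive baseline MSE using DummyRegressor (illustrative only)
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# define a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# naive model that always predicts the mean of the training target
model = DummyRegressor(strategy='mean')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
# baseline error that a skillful model should improve upon
print('Baseline MSE: %.3f' % mean_squared_error(y_test, yhat))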

Root Mean Squared Error

The Root Mean Squared Error, or RMSE, is an extension of the mean squared error.

Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value that is being predicted.

For example, if your target variable has the units “dollars,” then the RMSE error score will also have the unit “dollars” and not “squared dollars” like the MSE.

As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.

The RMSE can be calculated as follows:

  • RMSE = sqrt(1 / N * sum for i to N (y_i – yhat_i)^2)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value, and sqrt() is the square root function.

We can restate the RMSE in terms of the MSE as:

  • RMSE = sqrt(MSE)

Note that the RMSE cannot be calculated as the average of the square roots of the individual squared error values. This is a common error made by beginners and is an example of Jensen's inequality.

You may recall that the square root is the inverse of the square operation. MSE uses the square operation to remove the sign of each error value and to punish large errors. The square root reverses this operation, although it ensures that the result remains positive.
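
A tiny NumPy example using the contrived values from above illustrates the point: averaging the per-example square roots does not give the same result as taking the square root of the mean squared error.

# show that the mean of roots differs from the root of the mean (illustrative only)
from numpy import array, sqrt, mean
expected = array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
predicted = array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
squared_errors = (expected - predicted) ** 2
# correct: square root of the mean squared error (the RMSE)
print('sqrt(mean): %.3f' % sqrt(mean(squared_errors)))
# incorrect: mean of the square roots of the squared errors
print('mean(sqrt): %.3f' % mean(sqrt(squared_errors)))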

The root mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

By default, the function calculates the MSE, but we can configure it to calculate the square root of the MSE by setting the “squared” argument to False.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)

The example below gives an example of calculating the root mean squared error between a list of contrived expected and predicted values.

# example of calculate the root mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)
# report error
print(errors)

Running the example calculates and prints the root mean squared error.

0.5916079783099616

A perfect RMSE value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good RMSE is relative to your specific dataset.

It is a good idea to first establish a baseline RMSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an RMSE better than the RMSE for the naive model has skill.

Mean Absolute Error

Mean Absolute Error, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is being predicted.

Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the squaring of the error value. The MAE does not give more or less weight to different types of errors; instead, the scores increase linearly with increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is forced to be positive when calculating the MAE.

The MAE can be calculated as follows:

  • MAE = 1 / N * sum for i to N abs(y_i – yhat_i)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value and abs() is the absolute function.

We can create a plot to get a feeling for how the change in prediction error impacts the MAE.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The absolute error between each prediction and expected value is calculated and plotted to show the linear increase in error.

...
# calculate error
err = abs(expected[i] - predicted[i])

The complete example is listed below.

# plot of the increase of mean absolute error with prediction error
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
	# calculate error
	err = abs(expected[i] - predicted[i])
	# store error
	errors.append(err)
	# report error
	print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Absolute Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and absolute error for each case.

We can see that the error rises linearly, which is intuitive and easy to understand.

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.100
>1.0, 0.8 = 0.200
>1.0, 0.7 = 0.300
>1.0, 0.6 = 0.400
>1.0, 0.5 = 0.500
>1.0, 0.4 = 0.600
>1.0, 0.3 = 0.700
>1.0, 0.2 = 0.800
>1.0, 0.1 = 0.900
>1.0, 0.0 = 1.000

A line plot is created showing the straight line or linear increase in the absolute error value as the difference between the expected and predicted value is increased.

Line Plot of the Increase Absolute Error With Predictions

The mean absolute error between your expected and predicted values can be calculated using the mean_absolute_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean absolute error value.

...
# calculate errors
errors = mean_absolute_error(expected, predicted)

The example below gives an example of calculating the mean absolute error between a list of contrived expected and predicted values.

# example of calculate the mean absolute error
from sklearn.metrics import mean_absolute_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_absolute_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean absolute error.

0.5

A perfect mean absolute error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MAE is relative to your specific dataset.

It is a good idea to first establish a baseline MAE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves a MAE better than the MAE for the naive model has skill.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to calculate error for regression predictive modeling projects.

Specifically, you learned:

  • Regression predictive modeling problems are those that involve predicting a numeric value.
  • Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
  • How to calculate and report mean squared error, root mean squared error, and mean absolute error.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Regression Metrics for Machine Learning appeared first on Machine Learning Mastery.

Sensitivity Analysis of Dataset Size vs. Model Performance


Machine learning model performance often improves with dataset size for predictive modeling.

This depends on the specific datasets and on the choice of model, although it often means that using more data can result in better performance and that discoveries made using smaller datasets to estimate model performance often scale to using larger datasets.

The problem is the relationship is unknown for a given dataset and model, and may not exist for some datasets and models. Additionally, if such a relationship does exist, there may be a point or points of diminishing returns where adding more data may not improve model performance or where datasets are too small to effectively capture the capability of a model at a larger scale.

These issues can be addressed by performing a sensitivity analysis to quantify the relationship between dataset size and model performance. Once calculated, we can interpret the results of the analysis and make decisions about how much data is enough, and how small a dataset may be to effectively estimate performance on larger datasets.

In this tutorial, you will discover how to perform a sensitivity analysis of dataset size vs. model performance.

After completing this tutorial, you will know:

  • Selecting a dataset size for machine learning is a challenging open problem.
  • Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
  • How to perform a sensitivity analysis of dataset size and interpret the results.

Let’s get started.

Sensitivity Analysis of Dataset Size vs. Model Performance
Photo by Graeme Churchard, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Dataset Size Sensitivity Analysis
  2. Synthetic Prediction Task and Baseline Model
  3. Sensitivity Analysis of Dataset Size

Dataset Size Sensitivity Analysis

The amount of training data required for a machine learning predictive model is an open question.

It depends on your choice of model, on the way you prepare the data, and on the specifics of the data itself.

For more on the challenge of selecting a training dataset size, see the tutorial:

One way to approach this problem is to perform a sensitivity analysis and discover how the performance of your model on your dataset varies with more or less data.

This might involve evaluating the same model with different sized datasets and looking for a relationship between dataset size and performance or a point of diminishing returns.
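
As a minimal sketch of this idea, the snippet below generates the same synthetic problem at several sample sizes and records a cross-validated score for each; the sizes and model shown here are illustrative choices only.

# sketch: evaluate one model across several dataset sizes (illustrative only)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
for n in [100, 500, 1000, 5000]:
	# generate the same base problem with n samples
	X, y = make_classification(n_samples=n, n_features=20, n_informative=15, n_redundant=5, random_state=1)
	# estimate performance with 10-fold cross-validation
	scores = cross_val_score(DecisionTreeClassifier(), X, y, scoring='accuracy', cv=10, n_jobs=-1)
	print('>n=%d, accuracy=%.3f' % (n, mean(scores)))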

Typically, there is a strong relationship between training dataset size and model performance, especially for nonlinear models. The relationship often involves an improvement in performance to a point and a general reduction in the expected variance of the model as the dataset size is increased.

Knowing this relationship for your model and dataset can be helpful for a number of reasons, such as:

  • Evaluate more models.
  • Find a better model.
  • Decide to gather more data.

You can evaluate a large number of models and model configurations quickly on a smaller sample of the dataset with confidence that the performance will likely generalize in a specific way to a larger training dataset.

This may allow you to evaluate many more models and configurations than you would otherwise be able to in the time available and, in turn, perhaps discover a better overall performing model.

You may also be able to generalize and estimate the expected model performance on much larger datasets and estimate whether it is worth the effort or expense of gathering more training data.

Now that we are familiar with the idea of performing a sensitivity analysis of model performance to dataset size, let’s look at a worked example.

Synthetic Prediction Task and Baseline Model

Before we dive into a sensitivity analysis, let’s select a dataset and baseline model for the investigation.

We will use a synthetic binary (two-class) classification dataset in this tutorial. This is ideal as it allows us to scale the number of generated samples for the same problem as needed.

The make_classification() scikit-learn function can be used to create a synthetic classification dataset. In this case, we will use 20 input features (columns) and generate 1,000 samples (rows). The seed for the pseudo-random number generator is fixed to ensure the same base “problem” is used each time samples are generated.

The example below generates the synthetic classification dataset and summarizes the shape of the generated data.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example generates the data and reports the size of the input and output components, confirming the expected shape.

(1000, 20) (1000,)

Next, we can evaluate a predictive model on this dataset.

We will use a decision tree (DecisionTreeClassifier) as the predictive model. It was chosen because it is a nonlinear algorithm and has a high variance, which means that we would expect performance to improve with increases in the size of the training dataset.

We will use a best practice of repeated stratified k-fold cross-validation to evaluate the model on the dataset, with 3 repeats and 10 folds.

The complete example of evaluating the decision tree model on the synthetic classification dataset is listed below.

# evaluate a decision tree model on the synthetic classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# load dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Running the example creates the dataset then estimates the performance of the model on the problem using the chosen test harness.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the mean classification accuracy is about 82.7%.

Mean Accuracy: 0.827 (0.042)

Next, let’s look at how we might perform a sensitivity analysis of dataset size on model performance.

Sensitivity Analysis of Dataset Size

The previous section showed how to evaluate a chosen model on the available dataset.

It raises questions, such as:

Will the model perform better on more data?

More generally, we may have sophisticated questions such as:

Does the estimated performance hold on smaller or larger samples from the problem domain?

These are hard questions to answer, but we can approach them by using a sensitivity analysis. Specifically, we can use a sensitivity analysis to learn:

How sensitive is model performance to dataset size?

Or more generally:

What is the relationship of dataset size to model performance?

There are many ways to perform a sensitivity analysis, but perhaps the simplest approach is to define a test harness to evaluate model performance and then evaluate the same model on the same problem with differently sized datasets.

This will allow the train and test portions of the dataset to increase with the size of the overall dataset.

To make the code easier to read, we will split it up into functions.

First, we can define a function that will prepare (or load) the dataset of a given size. The number of rows in the dataset is specified by an argument to the function.

If you are using this code as a template, this function can be changed to load your dataset from file and select a random sample of a given size.

# load dataset
def load_dataset(n_samples):
	# define the dataset
	X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y
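
If you are adapting this code to your own data, a minimal sketch of the file-based variant described above might look as follows. Note that the file name 'your_dataset.csv' and the assumption that the target is the last column are placeholders for illustration only.

# load a random sample of a given size from a CSV file (illustrative sketch)
from pandas import read_csv

def load_dataset(n_samples):
	# load the full dataset from file (hypothetical file name)
	data = read_csv('your_dataset.csv')
	# select a random sample of the requested size without replacement
	# (n_samples must not exceed the number of rows in the file)
	sample = data.sample(n=int(n_samples), random_state=1)
	# split into input and output elements, assuming the target is the last column
	X, y = sample.values[:, :-1], sample.values[:, -1]
	return X, y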

Next, we need a function to evaluate a model on a loaded dataset.

We will define a function that takes a dataset and returns a summary of the performance of the model evaluated using the test harness on the dataset.

This function is listed below, taking the input and output elements of a dataset and returning the mean and standard deviation of the decision tree model on the dataset.

# evaluate a model
def evaluate_model(X, y):
	# define model evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return summary stats
	return [scores.mean(), scores.std()]

Next, we can define a range of different dataset sizes to evaluate.

The sizes should be chosen proportional to the amount of data you have available and the amount of running time you are willing to expend.

In this case, we will keep the sizes modest to limit running time, from 50 to one million rows on a rough log10 scale.

...
# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

Next, we can enumerate each dataset size, create the dataset, evaluate a model on the dataset, and store the results for later analysis.

...
# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
	# get a dataset
	X, y = load_dataset(n_samples)
	# evaluate a model on this dataset size
	mean, std = evaluate_model(X, y)
	# store
	means.append(mean)
	stds.append(std)

Next, we can summarize the relationship between the dataset size and model performance.

In this case, we will simply plot the result with error bars so we can spot any trends visually.

We will use the standard deviation as a measure of uncertainty on the estimated model performance. This can be achieved by multiplying the value by 2 to cover approximately 95% of the expected performance if the performance follows a normal distribution.

This can be shown on the plot as an error bar around the mean expected performance for a dataset size.

...
# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]
# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')

To make the plot more readable, we can change the scale of the x-axis to log, given that our dataset sizes are on a rough log10 scale.

...
# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')
# show the plot
pyplot.show()

And that’s it.

We would generally expect mean model performance to increase with dataset size. We would also expect the uncertainty in model performance to decrease with dataset size.

Tying this all together, the complete example of performing a sensitivity analysis of dataset size on model performance is listed below.

# sensitivity analysis of model performance to dataset size
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot

# load dataset
def load_dataset(n_samples):
	# define the dataset
	X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y

# evaluate a model
def evaluate_model(X, y):
	# define model evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return summary stats
	return [scores.mean(), scores.std()]

# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
	# get a dataset
	X, y = load_dataset(n_samples)
	# evaluate a model on this dataset size
	mean, std = evaluate_model(X, y)
	# store
	means.append(mean)
	stds.append(std)
	# summarize performance
	print('>%d: %.3f (%.3f)' % (n_samples, mean, std))
# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]
# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')
# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')
# show the plot
pyplot.show()

Running the example reports the status along the way of dataset size vs. estimated model performance.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the expected trend of increasing mean model performance with dataset size and decreasing model variance measured using the standard deviation of classification accuracy.

We can see that there is perhaps a point of diminishing returns in estimating model performance at perhaps 10,000 or 50,000 rows.

Specifically, we do see an improvement in performance with more rows, but we can probably capture this relationship with little variance with 10K or 50K rows of data.

We can also see a drop-off in estimated performance with 1,000,000 rows of data, suggesting that we are probably maxing out the capability of the model above 100,000 rows and are instead measuring statistical noise in the estimate.

This might indicate an upper bound on expected performance, and that more data beyond this point will likely not improve the specific model and configuration on the chosen test harness.

>50: 0.673 (0.141)
>100: 0.703 (0.135)
>500: 0.809 (0.055)
>1000: 0.826 (0.044)
>5000: 0.835 (0.016)
>10000: 0.866 (0.011)
>50000: 0.900 (0.005)
>100000: 0.912 (0.003)
>500000: 0.938 (0.001)
>1000000: 0.936 (0.001)

The plot makes the relationship between dataset size and estimated model performance much clearer.

The relationship is nearly linear with the log of the dataset size. The uncertainty, shown as error bars on the plot, also decreases dramatically: from very large values with 50 or 100 samples, to modest values with 5,000 and 10,000 samples, to practically nothing beyond these sizes.

Given the modest spread with 5,000 and 10,000 samples and the practically log-linear relationship, we could probably get away with using 5K or 10K rows to approximate model performance.

Line Plot With Error Bars of Dataset Size vs. Model Performance

We could use these findings as the basis for testing additional model configurations and even different model types.

The danger is that different models may perform very differently with more or less data and it may be wise to repeat the sensitivity analysis with a different chosen model to confirm the relationship holds. Alternately, it may be interesting to repeat the analysis with a suite of different model types.
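
For example, a minimal sketch of repeating the same procedure across a small suite of model types is listed below. The specific models and the reduced list of sizes are illustrative choices rather than recommendations.

# sensitivity analysis repeated for a suite of model types (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# evaluate a given model on a synthetic dataset of a given size
def evaluate_sized(model, n_samples):
	X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores.mean(), scores.std()

# define a small suite of models and dataset sizes to compare
models = {'tree': DecisionTreeClassifier(), 'logistic': LogisticRegression(max_iter=1000), 'knn': KNeighborsClassifier()}
sizes = [100, 1000, 10000]
for name, model in models.items():
	for n_samples in sizes:
		# evaluate this model at this dataset size and report the result
		mean, std = evaluate_sized(model, n_samples)
		print('>%s %d: %.3f (%.3f)' % (name, n_samples, mean, std))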

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to perform a sensitivity analysis of dataset size vs. model performance.

Specifically, you learned:

  • Selecting a dataset size for machine learning is a challenging open problem.
  • Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
  • How to perform a sensitivity analysis of dataset size and interpret the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Sensitivity Analysis of Dataset Size vs. Model Performance appeared first on Machine Learning Mastery.

What Is Semi-Supervised Learning

Semi-supervised learning is a learning problem that involves a small number of labeled examples and a large number of unlabeled examples.

Learning problems of this type are challenging, as neither supervised nor unsupervised learning algorithms are able to make effective use of the mixture of labeled and unlabeled data. As such, specialized semi-supervised learning algorithms are required.

In this tutorial, you will discover a gentle introduction to the field of semi-supervised learning for machine learning.

After completing this tutorial, you will know:

  • Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
  • Top books on semi-supervised learning designed to get you up to speed in the field.
  • Additional resources on semi-supervised learning, such as review papers and APIs.

Let’s get started.

What Is Semi-Supervised Learning
Photo by Paul VanDerWerf, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Semi-Supervised Learning
  2. Books on Semi-Supervised Learning
  3. Additional Resources

Semi-Supervised Learning

Semi-supervised learning is a type of machine learning.

It refers to a learning problem (and algorithms designed for the learning problem) that involves a small portion of labeled examples and a large number of unlabeled examples from which a model must learn and make predictions on new examples.

… dealing with the situation where relatively few labeled training points are available, but a large number of unlabeled points are given, it is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data …

— Page xiii, Semi-Supervised Learning, 2006.

As such, it is a learning problem that sits between supervised learning and unsupervised learning.

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples.

— Page 2, Semi-Supervised Learning, 2006.

We require semi-supervised learning algorithms when working with data where labeling examples is challenging or expensive.

Semi-supervised learning has tremendous practical value. In many tasks, there is a paucity of labeled data. The labels y may be difficult to obtain because they require human annotators, special devices, or expensive and slow experiments.

— Page 9, Introduction to Semi-Supervised Learning, 2009.

The hallmark of an effective semi-supervised learning algorithm is that it can achieve better performance than a supervised learning algorithm fit only on the labeled training examples.

Semi-supervised learning algorithms are generally able to clear this low bar.

… in comparison with a supervised algorithm that uses only labeled data, can one hope to have a more accurate prediction by taking into account the unlabeled points? […] in principle the answer is ‘yes.’

— Page 4, Semi-Supervised Learning, 2006.
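
As a concrete illustration of this comparison, a minimal sketch using scikit-learn is listed below. The synthetic dataset, the choice of keeping only 10 percent of the training labels, and the use of LogisticRegression and LabelPropagation are assumptions made for the example; whether the semi-supervised model actually clears the bar will depend on the data and the algorithm.

# compare a supervised baseline (labeled data only) to a semi-supervised model (illustrative sketch)
from numpy import concatenate, full
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score

# define a synthetic dataset and hold out a test set
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1, stratify=y)
# pretend only 10 percent of the training data is labeled
X_lab, X_unlab, y_lab, _ = train_test_split(X_train, y_train, test_size=0.9, random_state=1, stratify=y_train)
# supervised baseline: fit on the labeled examples only
baseline = LogisticRegression().fit(X_lab, y_lab)
print('Supervised baseline accuracy: %.3f' % accuracy_score(y_test, baseline.predict(X_test)))
# semi-supervised: mark the unlabeled examples with -1 and fit on everything
X_mixed = concatenate((X_lab, X_unlab))
y_mixed = concatenate((y_lab, full(len(X_unlab), -1)))
semi = LabelPropagation().fit(X_mixed, y_mixed)
print('Semi-supervised accuracy: %.3f' % accuracy_score(y_test, semi.predict(X_test)))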

Finally, semi-supervised learning may be used for inductive or transductive learning, and it is useful to contrast the two.

Generally, inductive learning refers to a learning algorithm that learns from labeled training data and generalizes to new data, such as a test dataset. Transductive learning refers to learning from labeled training data and generalizing to available unlabeled (training) data. Both types of learning tasks may be performed by a semi-supervised learning algorithm.

… there are two distinct goals. One is to predict the labels on future test data. The other goal is to predict the labels on the unlabeled instances in the training sample. We call the former inductive semi-supervised learning, and the latter transductive learning.

— Page 12, Introduction to Semi-Supervised Learning, 2009.
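
To make the distinction concrete, a minimal sketch using scikit-learn's LabelSpreading is shown below; the dataset and the labeled/unlabeled split are illustrative assumptions. The transduction_ attribute holds the labels inferred for the unlabeled points in the training sample (transductive), while predict() produces labels for new, unseen data (inductive).

# inductive vs. transductive predictions with a semi-supervised model (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# synthetic problem: 100 labeled and 400 unlabeled training examples, 100 held-out examples
X, y = make_classification(n_samples=600, random_state=1)
X_train, y_train, X_test = X[:500], y[:500].copy(), X[500:]
y_train[100:] = -1  # mark the last 400 training examples as unlabeled
model = LabelSpreading().fit(X_train, y_train)
# transductive: labels inferred for the unlabeled points in the training sample
print(model.transduction_[100:110])
# inductive: labels predicted for new, unseen examples
print(model.predict(X_test[:10]))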

If you are new to the idea of transduction vs. induction, the following tutorial has more information:

Now that we are familiar with semi-supervised learning from a high-level, let’s take a look at top books on the topic.

Books on Semi-Supervised Learning

Semi-supervised learning is a new and fast-moving field of study, and as such, there are very few books on the topic.

There are perhaps two key books on semi-supervised learning that you should consider if you are new to the topic; they are:

  • Semi-Supervised Learning, 2006.
  • Introduction to Semi-Supervised Learning, 2009.

Let’s take a closer look at each in turn.

Semi-Supervised Learning, 2006

The book “Semi-Supervised Learning” was published in 2006 and was edited by Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien.

Semi-Supervised Learning

This book provides a large number of chapters, each written by top researchers in the field.

It is designed to take you on a tour of the field of research including intuitions, top techniques, and open problems.

The full table of contents is listed below.

Table of Contents

  • Chapter 01: Introduction to Semi-Supervised Learning
  • Part I: Generative Models
    • Chapter 02: A Taxonomy for Semi-Supervised Learning Methods
    • Chapter 03: Semi-Supervised Text Classification Using EM
    • Chapter 04: Risks of Semi-Supervised Learning
    • Chapter 05: Probabilistic Semi-Supervised Clustering with Constraints
  • Part II: Low-Density Separation
    • Chapter 06: Transductive Support Vector Machines
    • Chapter 07: Semi-Supervised Learning Using Semi-Definite Programming
    • Chapter 08: Gaussian Processes and the Null-Category Noise Model
    • Chapter 09: Entropy Regularization
    • Chapter 10: Data-Dependent Regularization
  • Part III: Graph-Based Methods
    • Chapter 11: Label Propagation and Quadratic Criterion
    • Chapter 12: The Geometric Basis of Semi-Supervised Learning
    • Chapter 13: Discrete Regularization
    • Chapter 14: Semi-Supervised Learning with Conditional Harmonic Mixing
  • Part IV: Change of Representation
    • Chapter 15: Graph Kernels by Spectral Transforms
    • Chapter 16: Spectral Methods for Dimensionality Reduction
    • Chapter 17: Modifying Distances
  • Part V: Semi-Supervised Learning in Practice
    • Chapter 18: Large-Scale Algorithms
    • Chapter 19: Semi-Supervised Protein Classification Using Cluster Kernels
    • Chapter 20: Prediction of Protein Function from Networks
    • Chapter 21: Analysis of Benchmarks
  • Part VI: Perspectives
    • Chapter 22: An Augmented PAC Model for Semi-Supervised Learning
    • Chapter 23: Metric-Based Approaches for Semi-Supervised Regression and Classification
    • Chapter 24: Transductive Inference and Semi-Supervised Learning
    • Chapter 25: A Discussion of Semi-Supervised Learning and Transduction

I highly recommend this book and reading it cover to cover if you are starting out in this field.

Introduction to Semi-Supervised Learning, 2009

The book “Introduction to Semi-Supervised Learning” was published in 2009 and was written by Xiaojin Zhu and Andrew Goldberg.

Introduction to Semi-Supervised Learning

This book is aimed at students, researchers, and engineers just getting started in the field.

The book is a beginner’s guide to semi-supervised learning. It is aimed at advanced undergraduates, entry-level graduate students and researchers in areas as diverse as Computer Science, Electrical Engineering, Statistics, and Psychology.

— Page xiii, Introduction to Semi-Supervised Learning, 2009.

It’s a shorter read than the above book and a great introduction.

The full table of contents is listed below.

Table of Contents

  • Chapter 01: Introduction to Statistical Machine Learning
  • Chapter 02: Overview of Semi-Supervised Learning
  • Chapter 03: Mixture Models and EM
  • Chapter 04: Co-Training
  • Chapter 05: Graph-Based Semi-Supervised Learning
  • Chapter 06: Semi-Supervised Support Vector Machines
  • Chapter 07: Human Semi-Supervised Learning
  • Chapter 08: Theory and Outlook

I also recommend this book if you’re just starting out for a quick review of the key elements of the field.

Other Books

There are some additional books on semi-supervised learning that you might also like to consider; they are:

Have you read any of the above books?
What did you think?

Did I miss your favorite book?
Let me know in the comments below.

Additional Resources

There are additional resources that may be helpful when getting started in the field of semi-supervised learning.

I would recommend reading some review papers.

Some examples of good review papers on semi-supervised learning include:

In this paper, we provide a comprehensive overview of deep semi-supervised learning, starting with an introduction to the field, followed by a summarization of the dominant semi-supervised approaches in deep learning.

An Overview of Deep Semi-Supervised Learning, 2020.

An Overview of Deep Semi-Supervised Learning

It is also a good idea to try out some of the algorithms.

The scikit-learn Python machine learning library provides a few graph-based semi-supervised learning algorithms that you can try, namely LabelPropagation and LabelSpreading in the sklearn.semi_supervised module.

The Wikipedia article may also provide some useful links for further reading:

Summary

In this tutorial, you discovered a gentle introduction to the field of semi-supervised learning for machine learning.

Specifically, you learned:

  • Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
  • Top books on semi-supervised learning designed to get you up to speed in the field.
  • Additional resources on semi-supervised learning, such as review papers and APIs.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post What Is Semi-Supervised Learning appeared first on Machine Learning Mastery.

A Practical Guide to Building Recommender Systems

Recommender systems enhance user experiences in Internet-based applications by recommending items tailored to individual preferences or needs, such as products, services, or content.

Demystifying Ensemble Methods: Boosting, Bagging, and Stacking Explained

The Beginner’s Guide to Natural Language Processing with Python

Learning natural language processing can be a super useful addition to your developer toolkit.

5 Free Courses for Mastering LLMs

Without any doubt, Large Language Models (LLMs) have emerged as one of the biggest AI breakthroughs in recent years: they excel in understanding and generating human-like text, making them versatile for a wide range of applications.

Building a Graph RAG System: A Step-by-Step Approach

Graph RAG, Graph RAG, Graph RAG! This term has become the talk of the town, and you might have come across it as well.

Machine Learning vs. Traditional Analytics: When to Use Which?

This article focuses on demystifying the difference between traditional data analytics methods vs.

5 Free APIs for Building AI Applications

The AI field is rapidly evolving, becoming one of the most dynamic areas within machine learning.

5 Beginner-Friendly Projects to Learn LLMs & RAG

I believe in the 'learning by doing' approach—you learn more this way.