"Given computers the ability to learn and make decisions from data without being explicitly programmed."
Uncovering hidden patterns from unlabels data.
Software agents interact with an environment by taking actions.
Applications are economocis, genetics and game playing.
Data is commonly represented in a data structure.
We have the predictor variables/features as the independent variables and the target variable as the dependent variable that we wish to make predictions about.
The aim is to predict the target variable, given the predictor variables.
If target variable consists of categories (like spam or not spam), then it is a classification task.
If the target variable is a continuously varying variable (like house pricing), then it is a regression task.
The goal is to:
These are many ways to do supervised learning in Python. The focus for this will be on
scikit-learn (also known as
It integrates well with
Other libraries include Tensorflow and Keras.
We will use some
sklearn for the first exercises.
from sklearn import datasets import pandas as pd import numpy as np import matplotlib.pyplot as plt plt.style.use('ggplot') iris = datasets.load_iris() type(iris) # sklearn.datasets.base.Bunch - a dictionary-like object with key-value pairs print(iris.keys()) # dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names']) type(iris.data) # numpy.ndarray type(iris.target) # numpy.ndarray) iris.data.shape # (150, 4) - 150 rows and 4 columns iris.target_names # array(['setosa', 'versicolor', 'virginica'], dtype='<U10') these will be encoded as 0, 1, 2 X = iris.data y = iris.target df = pd.DataFrame(X, columns=iris.feature_names) print(df.head()) # print the first 5 rows # Help visualize the data. # c stands for color so we're all colors by species. # figsize will be the size of the figure. # marker is the shape of the points. _ = pd.plotting.scatter_matrix(df, c = y, figsize = [8,8], s = 150, marker = 'D')
Take unlabeled data as input and outputs the class label.
We choose a type of classifier that learns from the already labelled data. This is known as
We will use the
k-n-neighbors classifier in today's example (also known as
The basic idea is that we can predict the label of a data point by looking at the
k closest labeled data points and taking a "majority vote".
k=3, we would take the 3 latest labeled data points and take the majority vote.
k=5, we would take the 5 latest labeled data points and take the majority vote.
k=1, we would take the closest labeled data point.
What it algorithm does is create a set of decision boundaries. Anything within those decision boundaries are considered to be in the same class.
All ML models are implemented as Python classes.
We can use the
fit method to train the classifier and
predict method to assign a label.
To use it, we can do the following:
from sklearn.neighbors import KNeighborsClassifier # Set this to create boundaries based on 6 closest neight. knn = KNeighborsClassifier(n_neighbors=6) knn.fit(iris.data, iris.target)
The arguments for
fit requires both arguments to be a numpy array.
The data for Scikit learn must always be a numpy array or data frame.
It also requires that features take on continuous values as opposed to categories.
There also cannot be any missing data (these will be explored later on how to deal with them).
The scikit learn API requires that features are in an array where each column is a feature and each for is a data point.
iris.data.shape will have the number of data points and the number of features. The
iris.target.shape requires that the number of data points is the same as the labeled data.
# A set of unlabeled data. X_new = np.array([[5.6, 2.8, 3.9, 1.1], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]]) prediction = knn.predict(X_new) X_new.shape # (1, 4) - 1 data point and 4 features (assuming in the example about you just used the first example and not more) print(prediction) # array() - 0 is the label for the first example which will map to one of the iris labels # The prediction is [1 1 0] which maps to [versicolor versicolor setosa]
How was our classifiers performance? We need a metric.
accuracy is a common metric to use.
Accuracy = Fraction of correct predictions
Which data do we use to compute accuray? We want to know how well the model will work on new data. Testing against training data is NOT indicative of ability to generalize.
Generally, we want to train and fit on a training set, then make predications on a test set.
Then, we compare and compute the accuracy of the model on the test set.
It is best practice to perform split so that split reflects labels on the data. We want labels to be split in the test and train set as they are in the original data set. We do this with
from sklearn.model_selection import train_test_split # Use the function to randomly split out data into a training set and a test set. # We can reproduce the split we did in the first example by setting the random_state keyword arg. # @see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y) from sklearn.neighbors import KNeighborsClassifier # Set this to create boundaries based on 6 closest neight. knn = KNeighborsClassifier(n_neighbors=6) knn.fit(X_train, y_train) y_pred = knn.predict(X_test) # Prints list with predictions of the test set labels (encoded as ints). print(y_pred) # Checking the score by printing it out print(knn.score(X_test, y_test)) # 0.9111111 for 6 neighbors
k increases, the decision boundary becomes smoothers (a less complex model).
k is more complex and can lead to
If you increase
k even more, you can end up
There is a sweet spot in the middle that gives us the be fit.
We can check overfitting/underfitting by iterating over a range of
k and calculating the accuracy.
# Setup arrays to store train and test accuracies neighbors = np.arange(1, 9) train_accuracy = np.empty(len(neighbors)) test_accuracy = np.empty(len(neighbors)) # Loop over different values of k for i, k in enumerate(neighbors): # Setup a k-NN Classifier with k neighbors: knn knn = KNeighborsClassifier(n_neighbors = k) # Fit the classifier to the training data knn.fit(X_train, y_train) #Compute accuracy on the training set train_accuracy[i] = knn.score(X_train, y_train) #Compute accuracy on the testing set test_accuracy[i] = knn.score(X_test, y_test) # Generate plot plt.title('k-NN: Varying Number of Neighbors') plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy') plt.plot(neighbors, train_accuracy, label = 'Training Accuracy') plt.legend() plt.xlabel('Number of Neighbors') plt.ylabel('Accuracy') plt.show()