
Machine Learning - Udemy A-Z


Part 1 - Data Preprocessing

1. The initial data

| Dataset | Example set |
| --- | --- |
| Country | String |
| Age | Int |
| Salary | Int |
| Purchased | Boolean |

This dataset also has independent vs dependent variables, with the dependent variable being the Purchased data.

So using the first three variables, we will predict the fourth column.

Importing the Libraries

In Python

| Libraries | What for? |
| --- | --- |
| matplotlib | A bunch of very useful and intuitive plotting tools |
| numpy | Helps with the maths |
| pandas | Imports and manages datasets |

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In R

Here, we don't need to import any libraries since RStudio comes with the essential ones built in!

Importing the Dataset

Here, we will import the variables and create a matrix of observations.

In Python

Set the working directory to where we need to be.

# given the pandas import
dataset = pd.read_csv('Data.csv')

# iloc[rows, columns] -> :-1 means all columns except the last
X = dataset.iloc[:, :-1].values
# printing X gives a matrix of the data along with its datatype

y = dataset.iloc[:, 3].values
# printing y gives the values of the last column

In R

REMEMBER - R Arrays begin from 1

# Importing the dataset
dataset = read.csv('Data.csv')

Missing Data

How can we handle the problem of null values where data is missing?

One way to get around this is to take the mean of the columns.

For this dataset, we will replace the missing Age value with the mean of the column.

In Python

The library we will use is sklearn.

sklearn is scikit-learn and is an amazing library. We import Imputer to help with the preprocessing.

from sklearn.preprocessing import Imputer

# set NaN since we will see that the missing values are NaN
# strategy defaults to 'mean' anyway, but we'll be verbose
# axis = 0 imputes along the columns
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)

# lower bound included, upper bound excluded
imputer = imputer.fit(X[:, 1:3])

# the transform method replaces the missing data
X[:, 1:3] = imputer.transform(X[:, 1:3])

In R

# ifelse is like a ternary
# is.na checks whether a value is missing or not
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

Categorical Variables

What happens when we have strings instead of numbers for defining data? We must convert them to numbers. For example, we have country strings and a boolean column in the given data.

# encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
# pass in the index of the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

However, the problem is that since the encodings are integer values, the model could treat a higher integer as being of greater importance when it is not.

Instead, what we will do is essentially set up three dummy columns, one per country (a one-hot encoding):

1 where the country matches the row, 0 otherwise.

# encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# pass in the index of the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
# ensure that X is transformed
X = onehotencoder.fit_transform(X).toarray()

However, we will need to keep track of which encoded column corresponds to which category.

Let's look at the encoding for the dependent variable, where we only need the LabelEncoder.

# ...
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In the case of the boolean, we simply want the values encoded as 0 and 1.

In R

For R, we just need to use the factor function the way we want.

Since we have the factor function, the numeric encodings don't need to be set up in the same way they were for Python.

# Encoding categorical data
# remember c() is a Vector!
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

Splitting the data into a Training Set and Test Set

With any model, we should split the data into the training set and the test set.

We build the model on the training set and then test it on the test set, data that was held out from training.

The performance should not differ too much.

For this section, we use from sklearn.model_selection import train_test_split to split the data into training and test sets.

train_test_split(*arrays, test_size, train_size)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=0)

# use below if using python-shell in node
res = X_train.tolist()
send(res, 0)
res = X_test.tolist()
send(res, 0)

Feature Scaling

With two variables, we can find the Euclidean distance between point one and point two as sqrt((x[1] - x[0])^2 + (y[1] - y[0])^2).

However, with two variables on very different scales, the squared difference of the larger-scale variable can dominate so completely that the smaller, less dominant one effectively does not exist.
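A small numeric illustration (with made-up age and salary values, not from the course) of how the larger-scale feature dominates the squared distance until each column is standardised:

import numpy as np

# Two hypothetical observations: (age, salary)
a = np.array([30.0, 50000.0])
b = np.array([40.0, 60000.0])

# Salary completely dominates the Euclidean distance...
print(np.sqrt(np.sum((a - b) ** 2)))   # ~10000.005, the age difference barely registers

# ...but after standardising each column, both features contribute comparably
X = np.array([[30.0, 50000.0], [40.0, 60000.0], [50.0, 80000.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.sqrt(np.sum((X_scaled[0] - X_scaled[1]) ** 2)))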

#
# FEATURE SCALING
#
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

How about the dummy variables? It won't break the model if you don't scale them, but you might lose the ability to interpret which country is which.

Even when no Euclidean distance is required, feature scaling helps the algorithms run and converge much faster.

Templating Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
# Not compulsory - only if data is missing
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Not compulsory - only if we need to convert the data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Encode strings
# Think example of countries to [0|1] matrix
# Encoding the Independent Variable
labelencoder_X = LabelEncoder()
# pass in the index of the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
# ensure that X is transformed
# details here http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

#
# SPLITTING THE SET INTO THE TRAINING AND TEST SET
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#
# FEATURE SCALING
#
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

2. Regression

Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise your model is predicting present but unknown values. Regression techniques vary from Linear Regression to SVR and Random Forest Regression.

In this part, you will understand and learn how to implement the following Machine Learning Regression models:

  • Simple Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Support Vector for Regression (SVR)
  • Decision Tree Regression
  • Random Forest Regression

2.1: Simple Linear Regression

Looking at years of experience vs salary.

The issue: what is the correlation between years of experience and salary?

Ask the question: what value do we get from this model? A business could come back to this model and apply it to get an idea of the salaries it should be willing to offer.

Intuition

Simple linear regression is basically y = b[0] + b[1]*x[1] (essentially y = mx + c).

# Example - How does salary change with years of experience?
y    - dependent variable (DV), e.g. salary
x[1] - independent variable (IV), e.g. years of experience
b[1] - coefficient of the IV (how a unit change in x[1] affects y)
b[0] - constant

Regression - look at the hard facts.

The simple linear regression will basically be a best fit for the data.

In the case of b[0], that is the y-intercept; b[1] is the slope of the line.

On the XY graph, the data points are the actual observations. If we draw vertical lines from these points to the regression line, we can see where the model says each person should be sitting. If y[i] is the data point, y[hat][i] is the point the model says it should be.

To get the best-fitting line, we minimise sum((y[i] - y[hat][i])^2) over all points (ordinary least squares).
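For intuition, the best-fit intercept and slope can be computed directly from this criterion. A small sketch with made-up numbers (not the course dataset):

import numpy as np

# Hypothetical years of experience vs salary (in thousands)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 45.0, 52.0, 60.0, 65.0])

# Ordinary least squares: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)   # intercept and slope of the best-fitting line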

IN PYTHON

In this example, YearsExperience is the independent variable and Salary is the dependent variable.

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for the Node.js python-shell lib
def send(arg, type):
    if type == 1:
        print json.dumps(json.loads(arg))
    elif type == 2:
        print arg
    else:
        print json.dumps(arg)

# Importing the dataset
dataset = pd.read_csv('data/Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

send(X.tolist(), 0)
send(y.tolist(), 0)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

If we run the above, we may get an error from sklearn.preprocessing saying that 1D arrays need to be reshaped.

In simple linear regression, we also don't need to worry about Feature Scaling.

Fitting Simple Linear Regression to the Training Set

  • fit the regressor

# Add to the above code
# Fitting Simple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

send(str(regressor), 0)

Now that we have the regressor, we can start making basic predictions! With the Linear Regression object, we can now do this using the predict method.

# Add to the code above
# Predicting the test set results
y_pred = regressor.predict(X_test)

# send(X_test.tolist(), 0)  # see the test set years for the IV
# send(y_test.tolist(), 0)  # check what the actual results were
# send(y_pred.tolist(), 0)  # check the predictions

Visualizing the Model

We use the training set to train the line, and now we can see how it performs against, first, the training set and then, secondly, the test set!

Note that the blue line is the prediction while the red dots are the actual data points.

# Visualizing the Training Set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
# plt.savefig('plot.png')

As for checking the test set:

# Visualizing the Test Set results
plt.scatter(X_test, y_test, color = 'red')
# We do not change this since the regressor is already trained
# with the training set
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
# plt.savefig('plot.png')

2.2 Multiple Linear Regression

The challenge: you have 50 companies, each with its Profit and the independent variables it may depend on: R&D Spend, Administration and Marketing Spend.

Intuition

Multiple linear regression is used where there are multiple IVs driving the DV.

# Simple Linear Regression
y = b[0] + b[1]*x[1]

# Multiple Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n]

# Multiple Linear Regression after replacing categorical data
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n] + b[n+1]*D[1] + ... + b[n+m]*D[m]

The Assumptions of Linear Regression

  1. Linearity
  2. Homoscedasticity
  3. Multivariate normality
  4. Independence of errors
  5. Lack of multicollinearity

Dummy Variables

For data with categorical columns, we use the LabelEncoder and OneHotEncoder to expand the column into one column per distinct state value, producing a binary matrix across those columns and rows.

Note: There is a dummy variable trap we will talk about later.

This might look biased towards the omitted category, but the coefficients on the included dummy variables adjust the prediction relative to it, so each category is still handled correctly.

You cannot have the constant b[0] plus all of the dummy variables in the model at once. You should always omit one dummy variable.
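In code, avoiding the trap can look like this (a sketch, assuming the three one-hot country columns come first in X, as in the preprocessing earlier):

# Drop the first dummy column; its effect is absorbed into the constant b[0]
X = X[:, 1:]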

How to build MLR models (step-by-step)

Back with one IV and one DV, life was great, but now that we have many columns we need to decide what we can use as useful predictors.

Why throw out columns rather than use everything?

  1. Garbage in -> Garbage out. If you throw everything in, you may also add in garbage.
  2. Shows an understanding of variables

5 Methods of Building Models

  1. All-in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison

2, 3 and 4 are sometimes referred to as Stepwise Regression or sometimes just 4.

All in

Throw in everything. When to do it?

  • You have prior knowledge that these are the true predictors
  • You have to: maybe a framework where you have to use them
  • Preparing for Backward Elimination type of regression

Backward Elimination

  1. Select a significance level to stay in the model (eg SL = 0.05)
  2. Fit the full model with all possible predictors
  3. Consider the predictor with the highest P-value - if P > SL, go to step 4, otherwise go to FIN
  4. Remove the predictor
  5. Fit the model without this variable*, i.e. rebuild the entire model with the remaining vars
  6. Return to step 3 with the new model
  FIN. When no predictor has P > SL, you come here and the model is ready

Forward Selection

  1. Select a significance level to enter the model (eg SL = 0.05)
  2. Fit all simple regression models y ~ x[n] - select the one with the lowest P-value
  3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
  4. Consider the predictor with the lowest P-value. If P < SL, go to step 3, otherwise go to FIN
  FIN. Keep the previous model

Bidirectional Elimination

  1. Select a significance level to enter and one to stay in the model (eg SLENTER = 0.05, SLSTAY = 0.05)
  2. Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)
  3. Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay) - a very iterative process
  4. When no new variables can enter and no old variables can exit, go to FIN
  FIN. The model is ready

All Possible Models

The most thorough approach, but also the most resource-consuming.

  1. Select a criterion of goodness of fit (eg Akaike criterion)
  2. Construct all possible regression models: (2^N) - 1 total combinations
  3. Select the one with the best criterion
  FIN. Your model is ready

If you have 10 columns in your data, that means 1023 models (ridiculous)

IN PYTHON

# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for the Node.js python-shell lib
def send(arg, type):
    if type == 1:
        print json.dumps(json.loads(arg))
    elif type == 2:
        print arg
    else:
        print json.dumps(arg)

# Importing the dataset
dataset = pd.read_csv('data/50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
# send(X.tolist(), 0)
# send(y.tolist(), 0)

# # Taking care of missing data
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
# imputer = imputer.fit(X[:, 1:3])
# X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
send(X.tolist(), 0)

# Avoiding the Dummy Variable Trap
# The library takes care of it for us in this case
# X = X[:, 1:]
# send(X.tolist(), 0)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Library for Multiple Linear Regression

Add the following to the above.

# Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

Backward Elimination - Multiple Linear Regression

To get ready, you need to import the required library. Add the following to the previous code.

The library that we use doesn't take into account the constant x[0] = 1, so we will need to add this column ourselves. Most other libraries include it automatically.

# Backward Elimination Preparation
import statsmodels.formula.api as sm

# Add in a column of ones for X[0]
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)
send(X.tolist(), 0)

Now that we are ready to start Backward Elimination, we can go ahead and begin with this...

The following table shows us some useful information about the multiple linear regression model - the R-squared, the Adjusted R-squared, P values and more.

The lower the P-value in this case, the more significant the predictor.

# Backward Elimination Preparation
import statsmodels.formula.api as sm

# Add in a column of ones for X[0]
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)

X_opt = X[:, [0, 1, 2, 3, 4, 5]]
# Stay if P < SL
SL = 0.05

# Create a new regressor
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)

On each run-through, get rid of the variable with the highest P-value. We continue this until every remaining variable is under the 0.05 SL value.

# Because of how everything went, we run the BE algorithm iteratively
# For now, we are not focused on improving the model

# Create a new regressor and run iteration
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 1, 3, 4, 5]]
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3, 4, 5]]
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3, 5]]
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3]]
regressorOLS = sm.OLS(endog = y, exog = X_opt).fit()
send(str(regressorOLS.summary()), 0)
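The manual iterations above can also be automated. A sketch (not from the course) that repeatedly drops the predictor with the highest P-value until everything remaining is below SL, reusing the same statsmodels OLS setup:

import numpy as np
import statsmodels.formula.api as sm

def backward_elimination(X, y, sl=0.05):
    # Keep a list of surviving column indices and drop one per pass
    cols = list(range(X.shape[1]))
    while True:
        regressorOLS = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvalues = np.asarray(regressorOLS.pvalues)
        worst = int(np.argmax(pvalues))
        if pvalues[worst] > sl:
            del cols[worst]           # remove the least significant predictor
        else:
            return X[:, cols], regressorOLS

X_opt, final_model = backward_elimination(X, y, sl=0.05)
print(final_model.summary())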

2.3 Polynomial Linear Regression

# Simple Linear Regression
y = b[0] + b[1]*x[1]

# Multiple Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n]

# Multiple Linear Regression after replacing categorical data
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n] + b[n+1]*D[1] + ... + b[n+m]*D[m]

# Polynomial Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[1]^2 + ... + b[n]*x[1]^n

Interpretation

Depending on the power, the curve takes on a parabolic or higher-order shape - think of how each power graphs and how many minima/maxima it allows.

Use cases could be things such as understanding how epidemics have spread etc.

Why is it still called Linear?

The trick here is that we're not talking about the X variables. When talking about the class of the regression, we're talking about the coefficients.

These models aren't necessarily more advanced than the other linear regression models that we have looked at so far.

In this model, we basically only require one independent variable, Level, and the Salary column becomes the DV y.

Note: always ensure that X is a matrix and that y is a vector.

We also won't need to split the data into a training and test set simply because we don't have enough data to train one and test the performance of the other. We also want to make an accurate prediction, and not miss the target.
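To make the matrix-versus-vector point concrete (illustrative; dataset here is the Position_Salaries data loaded in the block below):

X = dataset.iloc[:, 1:2].values  # 1:2 keeps a 2D slice -> shape (n, 1), a matrix
y = dataset.iloc[:, 2].values    # a single index gives a 1D array -> shape (n,), a vector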

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for the Node.js python-shell lib
def send(arg, type):
    if type == 1:
        print json.dumps(json.loads(arg))
    elif type == 2:
        print arg
    else:
        print json.dumps(arg)

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

send(X.tolist(), 0)
send(y.tolist(), 0)

# Fitting simple Linear Regression to the dataset
# Feature Scaling is not required with this library
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
# This transforms the original features to have
# the associated polynomial terms
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)

# Fit the polynomial features to another linear regression
# to have e.g. two independent vars etc - using the Poly
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff for salary for job (LR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryLR.png')
plt.close()

In order to plot and predict with the polynomial regression, we need to pass the output of poly_reg.fit_transform() into the LinearRegression predict() method.

# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for the Node.js python-shell lib
def send(arg, type):
    if type == 1:
        print json.dumps(json.loads(arg))
    elif type == 2:
        print arg
    else:
        print json.dumps(arg)

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

send(X.tolist(), 0)
send(y.tolist(), 0)

# Fitting simple Linear Regression to the dataset
# Feature Scaling is not required with this library
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
# This transforms the original features to have
# the associated polynomial terms
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)

# Fit the polynomial features to another linear regression
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff for salary for job (LR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryLR.png')
plt.close()

# Visualising the Polynomial Regression results
# For higher resolution
X_grid = np.arange(min(X), max(X), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))  # reshape the grid into a column matrix
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, lin_reg2.predict(poly_reg.fit_transform(X_grid)), color = 'green')
plt.title('Truth or Bluff for salary for job (PR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryPR-x.png')
plt.close()

prediction = lin_reg2.predict(X_poly)
send(prediction.tolist(), 0)

# Predicting a new result with the Linear Regression model
y_pred = lin_reg.predict(6.5)
# This will be an awful result
send(y_pred.tolist(), 0)

# Predicting a new result with the Polynomial Regression model
y_pred_poly = lin_reg2.predict(poly_reg.fit_transform(6.5))
# This will be a great result!
send(y_pred_poly.tolist(), 0)

2.4 Support Vector Regression

Very similar to Polynomial Linear Regression in regards to code, but we use Feature Scaling and the SVR class for the regressor. The kernel refers to the type of fit eg poly, rbf etc.

# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for the Node.js python-shell lib
def send(arg, type = 0):
    if type == 1:
        print json.dumps(json.loads(arg))
    elif type == 2:
        print arg
    else:
        print json.dumps(arg)

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

# Create the SVR regressor
# SVR doesn't apply Feature Scaling automatically
from sklearn.svm import SVR
# kernel can be 'linear', 'poly', 'rbf', etc.
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

# Predicting a new result
# Scale the input the same way the training data was scaled
y_pred = regressor.predict(sc_X.transform(np.array([[6.5]])))
# Invert the scaling to get the prediction back in salary units
y_pred = sc_y.inverse_transform(y_pred)
send(y_pred.tolist())

# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# plt.savefig('svr.png')
# plt.close()

3. Classification

3.1 Logistic Regression

The code can be found in ~/Learning/ML-Course/ml-a-z-course/part-3-classification/1-logistical-regression/.

First, start by adding in the Python Preprocessing template (search SnippetsLab).

In this first example, we are going to see if we can predict the purchase of an SUV given the Age and EstimatedSalary.

Since we are using columns 2,3 and we are attempting to predict 4, update the import of the dataset to look like the following:

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

Since there are 400 observations, let's use 300 for the training set and the rest for the test set.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Because we want an accurate prediction of whether or not a user is going to buy an SUV, we WANT feature scaling. Just uncommenting this section is enough to include feature scaling.

For an intuitive example of the "why" behind feature scaling, checkout Stack Overflow.

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Fitting LR to the training set

We import the LogisticRegression class from sklearn.linear_model and use the constructor to build the object that we will fit and use for predictions.

# Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)  # teach it the correlations

Making the prediction

Here, we just need to use our X_test variable with the predict method.

# Predicting the results
y_pred = classifier.predict(X_test)
print(y_pred)

Investigating the confusion matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. See more here.

Import the function (not a class) from sklearn.metrics.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# prints [[65  3]
#         [ 8 24]]

Note: From the above, 65 and 24 are the correct predictions, and 3 and 8 are the incorrect predictions.

Visualising the results

The best way to check the results is to use a graph!

To interpret the graph, you will have a split of red and green points. The points represent our data points on an X/Y graph with the two IVs on the axes. The colour of each point indicates whether the user actually bought or not, while the background colour represents the "prediction regions".

What is the goal of this? Since we want to classify the right users into the right category, we can use the model to target a particular demographic.

The line in between the regions is called the prediction boundary. Why is it a straight line? Because we are using a linear logistic classifier.

The last important point is that the graph is a representation of the training set.

# Visualizing the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# 0.01 pixel resolution and apply the classifier on it
# min - 1 and max + 1 for the range
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
# create the plt contour split
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
# set x and y limits
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# for each element in the set, create a scatter element
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualizing the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

3.2 K-Nearest Neighbours Algorithm

Intuition

What K-NN does for you: help define if a new data point added should fall into the red category or into the green category.

How does it work?

  1. K-NN works by choosing the number K of nearest neighbours. One of the most common default values is 5.
  2. These neighbours are chosen with Euclidean distance.
  3. Among the K neighbours, count the number of data points in each category.
  4. Assign new data point to category based on most neighbours.

It is a very simple model.
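To make the steps above concrete, here is a minimal from-scratch sketch of the K-NN vote (illustrative only, with made-up points; the course itself uses sklearn's KNeighborsClassifier as shown below):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # 1-2. Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 3. Take the k nearest neighbours and count the categories among them
    nearest = y_train[np.argsort(distances)[:k]]
    # 4. Assign the new point to the category with the most neighbours
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical toy data: two clusters of points labelled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # expected: 0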

K-NN in Python

Using our classification template, we can just add the necessary lines to import the classifier and use the fit method.

# Fitting the classifier to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

The confusion matrix for this data using the K-NN method has some more success than the linear classifier!

[[64  4]
 [ 3 29]]

Final Code

# Classification template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting the classifier to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN Classifier (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

3.3 Support Vector Machine (SVM)

SVM Intuition

SVM searches for the maximum margin: the line that separates the two classes of points with the largest distance between them. The closest points on either side are called the support vectors - they are the only points that contribute to the result of the algorithm.

The line in the middle is called the maximum margin hyperplane/classifier.

So why SVM? As opposed to most machine learning algorithms, which rely on the most common elements of each class, SVM looks at the extreme elements - the points closest to being classified as the other class - right at the boundary. This in itself makes SVMs both special and very different. At times it means they can work a lot better, precisely because they ignore the bulk of the other data points.

SVM in Python

Again, the important part is importing the classifier, creating the instance and running the fit method:

# Fitting SVM to the Training set
from sklearn.svm import SVC
# specify "linear" since that is what we want
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)

In this case, the CM comes as the following:

[[66  2]
 [ 8 24]]

In the case of the model, the SVM actually turns out pretty good for this training set!

3.4 Kernel SVM

Kernel SVM Intuition

Think about a situation with data that is not linearly separable (i.e. data arranged in circles etc.). It is not possible to set a useful boundary with a plain linear SVM.

Mapping to a higher dimension

We map the data to a higher dimension in order to then use linear separation.

After we add the linear separator to this higher dimension, we use projection to again bring it back down a dimension and we will have our non-linear separator.

Warning: this could require a whole bunch of compute power.

The Kernel trick

It uses an intense-looking maths function involving Euler's number. The function calculated is the Gaussian RBF kernel and is worth revisiting down the track. To visualise it, think of a 3D surface: the calculated value depends on the distance of a point from a central landmark on the XY plane. A reference image can be found here.

The landmark for the kernel itself is abstracted from the intuition, but it is calculated for us.

Sigma's role in this whole process is to define how wide the circle of influence around the landmark becomes.

You can also take multiple kernel functions and add them if required.

Classifications are generally assigned based on the kernel value being = 0 or > 0.
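As a rough numeric illustration of the kernel value (a sketch; sigma, the landmark and the points are made-up):

import numpy as np

def gaussian_rbf(x, landmark, sigma=1.0):
    # K(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

landmark = np.array([0.0, 0.0])
print(gaussian_rbf(np.array([0.1, 0.2]), landmark))  # close to the landmark -> near 1
print(gaussian_rbf(np.array([3.0, 4.0]), landmark))  # far from the landmark -> near 0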

Types of Kernel Functions

To see some of these in 3d, head to this website

  • Gaussian RBF Kernel
  • Sigmoid Kernel
  • Polynomial Kernel

Kernel SVM Example

Classification code:

# Fitting the classifier to the Training set
from sklearn.svm import SVC
# Set 'rbf' for the Gaussian RBF kernel
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)

Confusion matrix:

[[64  4]
 [ 3 29]]

3.5 Naive Bayes

Bayes Theorem

This is more a prefix to using Naive Bayes.

To picture how this works, think of spanners. There are two machines that both produce spanners, each spanner marked by which machine created it. If we go through and pull out the defective ones, what we want to find out is the probability that a spanner produced by machine 2 is defective.

Mach 1: 30 wrenches/hr
Mach 2: 20 wrenches/hr

> Out of all parts, 1% are defective
> Out of all defective parts, 50% come from Mach 1 and 50% from Mach 2
> Question: What is the probability that a part produced by Mach 2 is defective?

Given that we know the totals:
> P(Mach1) = 30/50 = 0.6
> P(Mach2) = 20/50 = 0.4

And for defects:
# Probability of any part being defective
> P(Defect) = 1%
# Probability that a part picked from the defect pile came from each machine
> P(Mach1 | Defect) = 50%
> P(Mach2 | Defect) = 50%

# Therefore
> P(Defect | Mach2) = (P(Mach2 | Defect) * P(Defect)) / P(Mach2)
                    = (0.5 * 0.01) / 0.4
                    = 0.0125
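A quick numerical check of the calculation above:

# Bayes' theorem check for the spanner example
p_mach2 = 20 / 50.0          # P(Mach2)
p_defect = 0.01              # P(Defect)
p_mach2_given_defect = 0.5   # P(Mach2 | Defect)

p_defect_given_mach2 = p_mach2_given_defect * p_defect / p_mach2
print(p_defect_given_mach2)  # 0.0125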

Naive Bayes Intuition

Consider a graph with two (there can be more!) labelled categories (Walks and Drives) and axes labelled Salary and Age.

Armed with Bayes' theorem and the previous data points, what is the likelihood that a person with these features walks?

> Posterior Probability = (Likelihood * Prior Probability) / Marginal Likelihood
> P(Walks|X) = (P(X|Walks) * P(Walks)) / P(X)
> P(Drives|X) = (P(X|Drives) * P(Drives)) / P(X)

# After calculating both, compare
> P(Walks|X) vs P(Drives|X)

Naive Bayes Example

Classifier code:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

Confusion matrix:

[[65  3]
 [ 7 25]]

With Naive Bayes, you'll have a nice curve without irregularities.

3.6 Decision Trees Classification

Decision Tree Intuition

CART = Classification and Regression Trees. This is an umbrella term for:

  1. Classification trees ie red/green apples
  2. Regression trees ie temperature outside, cost for things etc

For the intuition behind it, the graph is divided by splits, and each split is chosen to maximise the number of points of a certain category on each side. That's a simple explanation, although there is some complex mathematics behind how it works.

With the first split, we begin building a decision tree, e.g. the first split at 60 asks: is X2 < 60? The second might be X1 < 50 on a particular branch. The final leaves on each branch are called terminal leaves, and these leaves give the final classification.

Decision trees are also old. They had started to die off as more sophisticated methods came along. Recently they were "reborn" through additional methods like random forests and gradient boosting, which brought them back into the game. While not very powerful on their own, they are leveraged by these other methods.

Decision Tree Classification example

Classification code:

# Fitting the decision tree to the Training set
from sklearn.tree import DecisionTreeClassifier
# To be as homogeneous as possible, we use entropy as the criterion we look to reduce
# information gain is what we want to maximise after each split
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

As for the confusion matrix:

[[62  6]
 [ 3 29]]

Checking the visualisation of the output is pretty intuitive if you understand the idea of the decision trees and splitting used.

3.7 Random Forest Classification

Random Forest Intuition

Ensemble learning is when you combine multiple ML algorithms to come up with a final one. The random forest method combines a number of decision trees.

Steps:

  1. Pick at random K data points from Training set.
  2. Build Decision Tree associated at these K data points.
  3. Choose the number Ntree of trees you want to build and repeat steps 1 and 2.
  4. For a new data point, make each one of your Ntree trees predict the category to which the data point belongs and assign the new data point to the category that wins the majority vote.

With the "power of the crowd", it helps this classification become quite useful to get rid of particular uncertainties. It was used for things such as "Konnect" for Xbox.

Random Forest Classification Example

Classification code:

# Fitting Random Forest to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(
    n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

Confusion Matrix:

[[63  5]
 [ 4 28]]

Be careful - we want to prevent overfitting. Remember: Overfitting is an issue within machine learning and statistics. It occurs when we build models that closely explain a training data set, but fail to generalize when applied to other data sets.

3.8 Evaluating Classification Model Performance

False Positives and False Negatives

Given a set of results, we wanted to take what we already know and use that projected data to build out a prediction model.

Example - we start by taking four independent variable values; any prediction below a particular value (0.5) is projected down to 0 and anything above is projected up to 1. Given the actual dependent variable data, we project what actually happened onto the model and see what the predicted equivalent would be.

We can get True Positive, False Positive, False Negative and True Negative. In both cases, we want the True values!

Confusion matrix

The y axis is the Actual DV, and the x axis is the Predicted DV

| TP | FP |
| --- | --- |
| FN | TN |

> accuracy rate = correct / total
> error rate = wrong / total
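For example, computing both rates from the logistic regression confusion matrix shown earlier:

# Accuracy and error rate from a confusion matrix
cm = [[65, 3],
      [8, 24]]
correct = cm[0][0] + cm[1][1]             # true predictions sit on the diagonal
total = sum(cm[0]) + sum(cm[1])
print(correct / float(total))             # accuracy rate = 0.89
print((total - correct) / float(total))   # error rate = 0.11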

Accuracy Paradox

If we simply predicted the same outcome every time (effectively abandoning the model), the accuracy rate from the confusion matrix could actually go up. Be wary of this paradox.

Cumulative Accuracy Profile (CAP)

Imagine a horizontal axis Total Contacted up to 100000, and a vertical axis Purchased up to 10000.

Can we get more customers to purchase while contacting fewer of them? How can we pick and choose which customers to contact? The larger the area under the model curve (above the random line), the better; this curve is known as the CAP.

The ideal line would be reaching 100% of purchasers after contacting only the 10% of customers who purchase. Ideal, but unlikely.

CAP Curve Analysis

Now that we know how it works, how can we analyse the CAP?

The standard approach to calculate the efficiency is this:

AR = a[r] / a[p] = (area between the model and the random line) / (area between the perfect model and the random line)

The second approach is to take the 50% point on the horizontal axis, find where it intersects the model line, read off the corresponding value on the vertical axis and use that number for the assessment.

Generally, the numbers go like so:

| X | Value |
| --- | --- |
| X < 60% | Rubbish |
| 60% < X < 70% | Poor |
| 70% < X < 80% | Good |
| 80% < X < 90% | Very Good |
| 90% < X < 100% | Too Good (be wary) |

Be careful if it goes over 90%. You could be overfitting, and anomalies down the track might not fit the trained model.

Classification Summary

How do I know which model to choose for my problem?

Same as for regression models, you first need to figure out whether your problem is linear or non linear. You will learn how to do that in Part 10 - Model Selection. Then:

If your problem is linear, you should go for Logistic Regression or SVM.

If your problem is non linear, you should go for K-NN, Naive Bayes, Decision Tree or Random Forest.

Then which one should you choose in each case? You will learn that in Part 10 - Model Selection with k-Fold Cross Validation.

Then from a business point of view, you would rather use:

  • Logistic Regression or Naive Bayes when you want to rank your predictions by their probability. For example if you want to rank your customers from the highest probability that they buy a certain product, to the lowest probability. Eventually that allows you to target your marketing campaigns. And of course for this type of business problem, you should use Logistic Regression if your problem is linear, and Naive Bayes if your problem is non linear.

  • SVM when you want to predict to which segment your customers belong to. Segments can be any kind of segments, for example some market segments you identified earlier with clustering.

  • Decision Tree when you want to have a clear interpretation of your model results.

  • Random Forest when you are just looking for high performance with less need for interpretation.

How can I improve each of these models?

In Part 10 - Model Selection, you will find the second section dedicated to Parameter Tuning, that will allow you to improve the performance of your models, by tuning them. You probably already noticed that each model is composed of two types of parameters:

  1. the parameters that are learnt, for example the coefficients in Linear Regression
  2. the hyperparameters

The hyperparameters are the parameters that are not learnt and that are fixed values inside the model equations. For example, the regularization parameter lambda or the penalty parameter C are hyperparameters. So far we used the default value of these hyperparameters, and we haven't searched for their optimal value so that your model reaches even higher performance. Finding their optimal value is exactly what Parameter Tuning is about. So for those of you already interested in improving your model performance and doing some parameter tuning, feel free to jump directly to Part 10 - Model Selection.

4. Clustering

4.1 K-Means Clustering

K-Means Clustering Intuition

Think of a scatter plot. K-Means is used to help create clusters of groups. You can have as many IV as required.

Steps:

  1. Choose number of clusters K
  2. Select at random K points, the centroids (not necessarily from your dataset)
  3. Assign each data point to the closest centroid -> this forms K clusters
  4. Compute and place the new centroid of each cluster
  5. Reassign each data point to the new closest centroid. If any reassignment took place, go back to step 4.

It is basically an iterative process that continues until the centroids converge to positions where data points are no longer reassigned.

K-Means Random Initialization Trap

Think of three easily determined clusters. The question is: if we choose bad random initialisation points for the centroids, can we run into an issue with the final clusters computed? Yes. This is the initialisation trap.

How can we combat this? It is not so straightforward. The solution is the K-Means++ algorithm, which is quite an involved approach.

Choosing the right number of clusters

What is the metric to evaluate how a certain number of clusters performs? It's called the Within Cluster Sum of Squares (WCSS).

For each cluster i, you sum the squared distance of every point in that cluster to its centroid, and then add these totals across all clusters.

Since WCSS keeps improving as we add more clusters, how do we find the optimal number? We look at the drop-off after each increment in the number of clusters, using the elbow method to see where the improvement goes from dramatic to small. That is a judgement call you need to make as a data scientist.
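A minimal sketch of computing WCSS by hand for one clustering (illustrative, with made-up points; sklearn exposes the same quantity as kmeans.inertia_, which the elbow-method code below uses):

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared distances of each point to its assigned centroid
    total = 0.0
    for i, centroid in enumerate(centroids):
        cluster_points = X[labels == i]
        total += np.sum((cluster_points - centroid) ** 2)
    return total

# Hypothetical 1D example with two clusters
X = np.array([[1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [10.5]])
print(wcss(X, labels, centroids))  # 4 * 0.25 = 1.0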

K-Means Clustering Example

# K-Means++ Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
"""
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++',
                    max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
# plt.show()
"""

# Applying k-means to the mall dataset
kmeans = KMeans(n_clusters=5, init='k-means++',
                max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
# y_kmeans == 0 is cluster 1
# 0, 1 in the second index are the x, y coordinates
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1 - Careful')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2 - Standard')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3 - Target')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4 - Careless')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5 - Sensible')

# Plotting the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

4.2 Hierarchical Clustering

HC Intuition

As for intuition, the before and after HC can end up with similar results to K-Means Clustering.

There are two ways to do this:

  1. Agglomerative
  2. Divisive

Agglomerative HC

  1. Make each data point a single-point cluster that forms N clusters
  2. Take two closest data points and make them one cluster -> N - 1 clusters
  3. Take the two closest clusters and make them one -> N - 2 clusters
  4. Repeat step three until there is only one cluster left

Calculating distance between clusters

This is a crucial part. Distance between two clusters can be measured as:

  1. Closest points
  2. Furthest points
  3. Average distances
  4. Distance between centroids

How dendrograms work

Dendrograms have all the points on the X axis and the Euclidean distances on the Y axis. You repeat the merging process, and each merge is drawn at the height of the distance between the two clusters being joined.

HC Using Dendrograms

After setting a Euclidean distance threshold, we want the within-cluster distance to stay below that threshold; this is what defines the clusters.

If you have a dendrogram and reduce the threshold, the number of clusters will be equal to how many vertical lines the threshold goes through.

To decide the threshold distance, you generally look for the longest vertical arm in the dendrogram.

HC Example

Determining optimum clusters

Plot the dendrogram, then look at it to decide how many clusters there should be by identifying the longest arm.

# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

HC Full example

# Hierarchical Clustering Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

# Fitting hierarchical clustering to the mall dataset
# Note we use AgglomerativeClustering here
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(
    n_clusters=5, affinity='euclidean', linkage='ward')
y_hc = hc.fit_predict(X)

# Visualising the clusters
# y_hc == 0 is cluster 1
# 0, 1 in the second index are the x, y coordinates
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=100, c='red', label='Cluster 1 - Careful')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=100, c='blue', label='Cluster 2 - Standard')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=100, c='green', label='Cluster 3 - Target')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4 - Careless')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5 - Sensible')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

5. Association Rule Learning

5.1 Apriori

Apriori Intuition

Think of the correlation of why customers would buy nappies and beers.

People who bought also bought...

Apriori can also help us build rules based on what else has been done.

Steps

  1. Set a minimum support and confidence
  2. Take all subsets in transactions having higher support than minimum support
  3. Take all the rules of these subsets having higher confidence than minimum confidence
  4. Sort the rules by decreasing lift

Support

support(M) = (# user watchlists containing M) / (# user watchlists)
support(I) = (# transactions containing I) / (# transactions)

Confidence

confidence(M1 -> M2) = (# user watchlists containing M1 and M2) / (# user watchlists containing M1)
confidence(I1 -> I2) = (# transactions containing I1 and I2) / (# transactions containing I1)

Lift

Chances of people who liked movie 1 liking movie 2.

What are the chances of recommending Ex Machina if they've seen Interstellar?

Say 10 out of 100 liked Ex Machina from the total but 17.5% of those who watched Interstellar liked Ex Machina:

Lift = 17.5% / 10% = 1.75
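A small sketch of the three metrics with made-up counts chosen to mirror the percentages above:

# Hypothetical counts for the movie example
total_users = 100
users_ex_machina = 10            # liked Ex Machina overall
users_interstellar = 40          # watched Interstellar
users_both = 7                   # liked both (7 / 40 = 17.5%)

support = users_ex_machina / float(total_users)        # 0.10
confidence = users_both / float(users_interstellar)    # 0.175
lift = confidence / support                            # 1.75
print(support, confidence, lift)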

Apriori

For this example, we will actually use a file instead of a library.

Python code:

# Apriori

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# 7500 customers and what they had in their baskets
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
print(dataset)

# Prepare the data as a list of lists
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Training Apriori on the dataset
# REMEMBER: these arguments depend on your dataset. You need to spend some time choosing them.
from apyori import apriori
rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)

# Visualising the results
results = list(rules)
for item in results:
    # first index of the inner list
    # contains the base item and the added item
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # second index of the inner list
    print("Support: " + str(item[1]))
    # third index of the list located at 0th
    # of the third index of the inner list
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

6. Reinforcement Learning

Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to solve interacting problems where the data observed up to time t is considered to decide which action to take at time t + 1. It is also used for Artificial Intelligence when training machines to perform tasks such as walking. Desired outcomes provide the AI with reward, undesired with punishment. Machines learn through trial and error.

6.1 Upper Confidence Bound (UCB)

The Multi-Armed Bandit Problem

Think of a robotic dog: We can either give it an algorithm to follow, or we can give it all the options it has and give it a reward or punishment based on the choices it makes.

What is the problem?

A one-armed bandit is a slot machine (there is history behind the name): it has to do with the old lever, and "bandit" comes from the fact that it takes your money. The "multi" comes into it when you think of many of these machines. How do you play them to maximise your return? Without knowing each machine's distribution of payouts, we need to figure this out while spending the least amount of time and money.

A use case for things like this could be to figure out which ad gives us the best return.

UCB Intuition

  • We have d arms eg ads displayed to user on a page
  • Each time a user connects to the page, this makes a round
  • At each round n, we choose one ad to display to the user
  • At each round n, ad i gives reward ri as an element of {0, 1}: ri = 1 if the user clicked on the ad i and 0 if they did not
  • The goal is to maximize the total reward we get over many rounds

For each distribution at the start, we assume that they are all the same. The first couple of rounds are basically trial runs, used to create the initial confidence bounds. As the agent observes data, the estimated value for each option moves up or down based on the observed average.

Because each round gives us an extra observation, the confidence bound gets smaller as we become more confident.

The confidence bound has only one task: to contain the true expected value within its bounds.

Confidence bound

Confidence bound interval

As time goes on, a confidence bound shrinks around a good observed average, but the algorithm still explores the other options because their upper bounds may be higher. At each round we choose the option with the highest "upper" confidence bound as the next choice.

UCB Implementation in Python

Given a scenario of ads being shown to 10,000 users, we want to see whether each user clicks on a specific ad (denoted 0 or 1).

Beginning here, we start with a purely random selection over the 10,000 rounds as a baseline, and we see it gets 1246 rewards in total. If we keep running the algorithm, we continually get around 1,200.

If we run a histogram of the distribution of the ads, we notice that they are all chosen a roughly similar number of times, since the selection is random rather than an educated guess.

Implementing UCB from scratch

There is no easy package that implements UCB, so we implement it from scratch.

Steps:

  1. At each round n, we consider two numbers for each ad i.
    • $N_i(n)$ - the number of times ad i was selected up to round n
    • $R_i(n)$ - the sum of rewards of the ad i up to round n
  2. From these two numbers we compute:
    • The average reward of ad i up to round n
    • The confidence interval at round n (both are written out below)
  3. We select the ad i that has the maximum UCB.
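
As used in the implementation below, the two quantities in step 2 work out to the observed average plus a shrinking exploration term (the code uses $n+1$ because Python rounds are zero-indexed):

$\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)} \qquad \Delta_i(n) = \sqrt{\frac{3}{2}\frac{\log(n)}{N_i(n)}} \qquad UCB_i(n) = \bar{r}_i(n) + \Delta_i(n)$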

# Implementing UCB
import math
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (10,000 rounds, 10 ads, reward 0 or 1)
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sum_of_rewards = [0] * d
total_reward = 0

for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        # if statement for initial conditions
        if numbers_of_selections[i] > 0:
            average_reward = sum_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            # this large value ensures we select each of the 10 ads once over the first 10 rounds
            upper_bound = 1e400
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    # sum of rewards for the specific ad
    sum_of_rewards[ad] = sum_of_rewards[ad] + reward
    # sum of all rewards over all ads
    total_reward = total_reward + reward

print(total_reward)  # 2178

# Visualising
plt.hist(ads_selected)
plt.title('Histogram of Ad Selections')
plt.xlabel('Ads')
plt.ylabel('Selections')
plt.show()

6.2 Thompson Sampling

Thompson Sampling Intuition

Thompson Sampling

Bayesian Inference

Without delving deep into the math, this requires Bayesian Inference.

Thompson Sampling Algorithm

Again, we have no prior knowledge of the current situation. What Thompson Sampling will end up doing is actually creating a distribution based on returns.

These distributions represent where we think the actual expected value might lie. Using this, we can generate our own "bandit" configuration.

Steps:

  1. At each round $n$, we consider two numbers for each ad $i$

    • $N_i^1(n)$ - the number of times the ad $i$ got rewards 1 up to round $n$

    • $N_i^0(n)$ - the number of times the ad $i$ got rewards 0 up to round $n$

  2. For each ad $i$ we take a random draw from the distribution below:

    • $\theta_i(n)=\beta(N_i^1(n)+1,N_i^0(n)+1)$
  3. We select the ad that has the highest $\theta_i(n)$

Algorithm Comparison vs UCB

They both solve the same problem, but there are pros and cons to each.

UCB Pros/Cons

  • Deterministic
  • Requires an update every round

Thompson Sampling Pros/Cons

  • Probabilistic
  • Can accommodate delayed feedback
  • Better empirical evidence

Implementing the algorithm

# Thompson Sampling

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# Note this dataset is just for simulation
# Observing if 10,000 users click on an ad (0 or 1)
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

# Implementing Thompson Sampling
import random
N = 10000
d = 10
ads_selected = []
numbers_of_rewards_1 = [0] * d
numbers_of_rewards_0 = [0] * d
total_reward = 0

for n in range(0, N):
    ad = 0
    max_random = 0
    for i in range(0, d):
        # random draw from the Beta distribution for ad i
        random_beta = random.betavariate(
            numbers_of_rewards_1[i] + 1, numbers_of_rewards_0[i] + 1)
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    reward = dataset.values[n, ad]
    if reward == 1:
        numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1
    else:
        numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1
    total_reward = total_reward + reward

print(total_reward)

# Visualising
plt.hist(ads_selected)
plt.title('Histogram of Ad Selections')
plt.xlabel('Ads')
plt.ylabel('Selections')
plt.show()

8. Deep Learning

You need a lot of data and a lot of processing power.

But what is deep learning? A lot of it is based on mimicking the mind, and a lot of the terminology is built on terms like "neurons".

Thinking of the network, we can think of an input layer that consists of the input values, an output layer at the end with the output value, and a hidden layer between them. The input nodes are connected to the hidden layer, and the hidden layer is connected to the output layer.

Deep learning occurs when we have lots and lots of hidden layers.

Geoffrey Hinton is a good reference of someone leading the field.

8.1 Artificial Neural Networks

Plan of attack

What will we learn?

  • The neuron
  • The activation functions and examples
  • How neural networks work
  • How neural networks learn
  • Gradient Descent
  • Stochastic Gradient Descent
  • Backpropagation

The Neuron

How can we recreate the neuron in the machine? We mimic how neurons and their networks are structured in the brain. The neuron itself consists of the main body, dendrites and the axon.

Neuron diagram

Conceptually, the dendrites of a neuron are connected to the axons of other neurons. The connection across which an impulse is passed is the synapse; synapse is an important term.

The neuron in our case gets a number of input signals and gives an output signal. For the sake of understanding, the input values can be represented in yellow and the neuron in green. The joining between input and neuron is the synapse.

Input values can themselves just be standardized independent variables.

Additional reading: Efficient BackProp by Yann LeCun.

What can the output value be? It can be continuous (e.g. price), binary (e.g. will exit yes/no) or categorical.

An important thing to note is that both the inputs and the output are for a single observation.

On the synapses, there are also weights. These values are important for the process, and it is the weights that are adjusted during the learning process.

Basic diagram

Activation functions

Threshold function

A simple function: if the value is less than 0, it outputs 0, otherwise it outputs 1.

Threshold

Sigmoid function

$\phi(x)=\frac{1}{1+e^{-x}}$

What is good about this function is that it is smooth. A gradual progression. It is very useful for the final layer - especially for things like probability.

Sigmoid

Rectifier function

$\phi(x)=max(x,0)$

It outputs 0 for negative inputs and then increases linearly with the input from 0 onwards.

Rectifier

The rectifier function is one of the most widely used.

Hyperbolic Tangent Function (tanh)

$\phi(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$
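
To make the four functions concrete, here is a minimal NumPy sketch (my own illustration, not course code) that evaluates each one on a few sample inputs:

import numpy as np

def threshold(x):
    # 1 if x >= 0, else 0
    return (x >= 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    # the rectifier: 0 for negative inputs, linear afterwards
    return np.maximum(x, 0)

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (threshold, sigmoid, relu, tanh):
    print(f.__name__, f(x))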

How do NNs work?

An important part of NNs are training them up, but for now let's just focus on how they work (pretend it is already trained up) so we can see the application we are working towards.

Say we have 4 input variables: Area (square feet), Bedrooms, Distance to city (miles) and Age.

The output variable we want is the price, so what we want are the weights that produce a good price prediction.

Hidden layer

The power here comes from that hidden layer.

Hidden layer

If we walk through this, we assume that some weights will have a zero value and others will have a non-zero value.

Each neuron itself could pick up all or some of the input values after being trained.

How do NNs learn?

If you think of a perceptron (single-layer feed forward), think of $y$ as the actual value and $\hat{y}$ is the output value from the output layer. To learn, we need to compare this output value to the actual value. We can plot both on a graph and calculate the cost function $C = 1/2(\hat{y}-y)^2$ - we will then feed this information back into the neural network to adjust the weights. We want the cost function to tend towards zero.
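
One step worth making explicit: the derivative of this cost with respect to the prediction is just the error itself, and this is the quantity that gets fed back through the network to adjust the weights:

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$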

One epoch is when we go through all the data and train based on the rows.

One epoch

Again, optimal weights are found by reducing the cost function as much as possible. The process of feeding this error back through the network to adjust the weights is known as backpropagation.

Gradient descent

Gradient descent is finding the lowest value of C over the iterations. If we plot the cost function against a weight, it can look like a parabola. We look at the slope of the cost function at the current point and move downhill towards the minimum.

Stochastic Gradient Descent

What happens if our cost function is not convex, i.e. not just a parabola? How do we find the global minimum of the cost function?

Normal gradient descent is when we take all of our rows and adjust the weights only after calculating the cost over all of them. This is also known as Batch Gradient Descent. With Stochastic Gradient Descent, we take the rows one by one: run the network, look at the cost, and then adjust the weights, as in the sketch below.
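
A minimal sketch of the difference, using a toy one-weight model $\hat{y} = wx$ and the cost above (this is just an illustration, not the course's code):

import numpy as np

# toy data: y is roughly 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
lr = 0.01

# Batch gradient descent: one weight update per pass over ALL rows
w = 0.0
for epoch in range(100):
    y_hat = w * X
    grad = np.sum((y_hat - y) * X)   # dC/dw summed over the whole batch
    w -= lr * grad
print("batch GD weight:", w)

# Stochastic gradient descent: update the weight after EACH row
w = 0.0
for epoch in range(100):
    for xi, yi in zip(X, y):
        grad = (w * xi - yi) * xi    # dC/dw for a single observation
        w -= lr * grad
print("SGD weight:", w)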

Backpropagation

We know there is a process called forward propagation to get the values for our $\hat{y}$'s. Backpropagation is an advanced algorithm that allows us to adjust all the weights simultaneously.

Training the ANN with Stochastic Gradient Descent

  1. Randomly initialise the weights to small numbers close to 0 (but not 0)
  2. Input the first observation of your dataset in the input layer, each feature in one input node
  3. Forward propagation (left to right): the neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result $\hat{y}$.
  4. Compare the predicted result to the actual result. Measure the generated error.
  5. Back-propagation (right to left): the error is back-propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.

Business Problem Description

The Churn_Modelling.csv file contains bank data; the bank is trying to figure out why customers are churning, i.e. leaving at high rates. They want you to assess and address the problem.

They've observed 10,000 customers over a period of time. If a customer left, the Exited field is 1, else 0.

We need to create a model of customers that are at high risk of leaving.

Installing Theano, Tensorflow and Keras

pip install --upgrade git+https://github.com/Theano/Theano.git#egg=Theano
pip install --upgrade tensorflow
pip install --upgrade keras

# if installing with conda
conda install -c conda-forge tensorflow
conda install -c conda-forge keras

  • Theano - a library that helps make use of the graphics card
  • Tensorflow - an open-source numerical computation library
  • Keras - a library that wraps both Theano and Tensorflow, abstracting them so you can build powerful deep learning models with small amounts of code

For install issues, check Stack Overflow

Preparing the data for the future deep learning model

Given the answer we are looking for, we are dealing with a classification problem.

As we look through all the data, we need to decide which independent variables may impact the dependent variable.

For this, we first start by setting our dataframes for our IV and DV.

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

Since we have some string categories, we need to encode these categories.

# Encoding categorical data
# We need to encode both country and gender
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode country
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

# Encode gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# One-hot encode country, then drop one dummy variable to avoid the dummy variable trap
onehotencoder = OneHotEncoder(categorical_features=[1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]

A good link on the Dummy Variable Trap.

Dummy Variable Trap

Building the ANN

Steps:

  1. Randomly initialise the weights to small numbers close to 0
  2. Input the first observation of your dataset in the input layer, each feature in one input node
  3. Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y.
  4. Compare the predicted result to the actual result. Measure the generated error.
  5. Back-Propagation: from right to left, the error is back-propagated.
  6. Repeat steps 1 to 5, updating the weights either after each observation (Reinforcement Learning) or only after a batch of observations (Batch Learning).
  7. When the whole training set has passed through the ANN, that makes an epoch. Redo more epochs.

# Classification template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Encoding categorical data
# We need to encode both country and gender
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode country
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

# Encode gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# One-hot encode the country feature
onehotencoder = OneHotEncoder(categorical_features=[1])
X = onehotencoder.fit_transform(X).toarray()

# Drop one dummy variable to avoid the dummy variable trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Building the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform',
                     activation='relu', input_shape=(11,)))

# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))

# Adding the output layer - use softmax for activation if the DV has more than 2 categories
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

# Compiling the ANN
classifier.compile(
    optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size=10, epochs=100)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

8.2 Convolutional Neural Networks

What are CNNs

The example given is dealing with vague images that could be two images in one.

Mixed image

The example above is the rabbit vs duck. The brain changes how it processes the image based on what it sees.

The other example given was the face with four eyes and two mouths to illustrate the point that the brain finds it hard to comprehend certain images.

Yann LeCun is the grandfather of CNNs. His big findings were made in the 80s and 90s.

How does it work? We have an input image that goes through the CNN to an output label (image class). For each pixel, the computer sees a value between 0 and 255. For a black and white image there is only one channel, so the computer sees a 2D array; for coloured images it is a 3D array with red, green and blue channels.
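
As a quick illustration of those shapes (a sketch, not course code):

import numpy as np

# A black and white image: one channel, a 2D array of pixel intensities 0-255
bw_image = np.array([
    [  0, 255],
    [128,  64],
], dtype=np.uint8)
print(bw_image.shape)        # (2, 2)

# A coloured image: three channels (red, green, blue), a 3D array
colour_image = np.zeros((2, 2, 3), dtype=np.uint8)
colour_image[0, 0] = [255, 0, 0]   # a pure red pixel
print(colour_image.shape)    # (2, 2, 3)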

Steps:

  1. Convolution
  2. Max Pooling
  3. Flattening
  4. Full Connection

The initial paper can be found here if you want to read it.

Convolution Operation

$(f*g)(t)\stackrel{\text{def}}{=}\int^{\infty}_{-\infty}f(\tau)g(t-\tau)d\tau$

More reading if you want a good intro can be found here.

What is a convolution in intuitive terms? We have an input image and a feature detector. The feature detector can also be known as a kernel or a filter.

Feature detector

The step size by which the filter moves across the image is called the stride.

The result of applying the feature detector is the feature map, which can also be called a convolved map.

Are we losing information when we apply the feature detector? Yes, but we are looking to detect the parts of the image that are integral. In our personal lives we don't look at every pixel, we look at features.

To create our first convolution layer, we create many feature maps.

Many maps

Applying these filters is also what happens when we apply image filters in photo editing. Things like sharpen, blur and edge detect are basic applications of these filters; the output feature map is the filtered image. A small sketch of the convolution operation is below.
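
A minimal NumPy sketch of the convolution operation (an illustration, not the course's code): slide a 3x3 feature detector over the image with a stride of 1 and sum the element-wise products at each position.

import numpy as np

image = np.array([
    [0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1],
], dtype=float)

# 3x3 feature detector (kernel/filter)
kernel = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

stride = 1
out_size = (image.shape[0] - kernel.shape[0]) // stride + 1
feature_map = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        patch = image[i*stride:i*stride+3, j*stride:j*stride+3]
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(feature_map)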

Rectified Linear Unit (ReLU) Layer

During this small step, you apply a rectifier function to increase non-linearity. The example given was a filtered image going from light to dark through white, grey and black; the applied rectifier function breaks up this linearity.

A link to more information on ReLU vs the other activation functions can be found here. For more in-depth information on rectifiers, read here.

Pooling

What is pooling and why do we need it? Think of a cheetah image where it is positioned properly, one where it is rotated, and another where it is squashed.

In our case, we are going to apply max pooling. Max pooling looks at small sections of the matrix, and the pooled feature map outputs the max of each section. This reduction helps with a number of things, including processing power, the number of parameters (preventing overfitting), and preserving features while accounting for spatial or featural distortions.

Here is a good read on max pooling. Pooling can also be known as downsampling.

A great visualiation tool can be found here.

Max pooling
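
A minimal sketch of 2x2 max pooling on a small feature map (an illustration only; Keras' MaxPooling2D does this for us later):

import numpy as np

feature_map = np.array([
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
], dtype=float)

pool_size = 2  # 2x2 window, stride equal to the pool size
out = feature_map.reshape(
    feature_map.shape[0] // pool_size, pool_size,
    feature_map.shape[1] // pool_size, pool_size
).max(axis=(1, 3))

print(out)
# [[6. 8.]
#  [3. 4.]]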

Flattening

With the pooled feature map, we flatten it into a column.

Flattening

Full Connection

After the flattening, we feed the flattened vector into a fully connected layer (the hidden layer, which for CNNs needs to be fully connected) and then the output layer.

Full connection

The above is an example of a classification output layer. The layers prior to the classification may, for example, carry strong feature-based probabilities in each node, which contribute towards the weights of the features required for each classification.

Summary

Link to The 9 Deep Learning Papers...

Softmax and Cross-Entropy

Softmax

The softmax function helps our output layer sum to 1.

After applying the softmax function, we apply the cross-entropy function. This gives the loss function, which we want to minimize.

Cross entropy function

$H(p,q) = - \sum_{x}p(x)\log{q(x)}$

Say we have two neural networks and a few images of dogs and cats, and we want to see what our NNs predict. After evaluating, we can check things like the classification error (not a great measure), the mean squared error (more accurate) and cross-entropy (also more accurate). So why use cross-entropy over mean squared error? It has a few advantages. One is:

  1. At the start of backpropagation, if the predicted probability is far off, the mean squared error gradient can be tiny and gradient descent barely moves. Cross-entropy helps with this because the $\log$ in the calculation amplifies these small values (see the sketch below).
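
A small NumPy sketch of softmax followed by cross-entropy against a one-hot label (an illustration, not course code):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the output sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * log(q(x))
    return -np.sum(p * np.log(q))

scores = np.array([2.0, 1.0, 0.1])    # raw output layer values
probs = softmax(scores)               # roughly [0.66, 0.24, 0.10]
target = np.array([1.0, 0.0, 0.0])    # one-hot label, e.g. "dog"

print(probs, probs.sum())
print("loss:", cross_entropy(target, probs))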

To get a better understanding, check out this YouTube video.

For some reading on cross entropy, checkout this reading.

CNN in Python

With these images, we cannot put the DV in the same array as the image data. We could write some code to extract the word "cat" or "dog" from the file name to create the DV, but a better solution is to use Keras' directory structure.

The final code looks like so:

# Part 1 - Building the CNN

# Importing the dataset structure with Keras
# First structure pillar: separate the images into test_set and training_set folders
# Second: within these folders, split into the DV classes, cats and dogs
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
# Dense is used for the fully connected layers
from keras.layers import Dense

# Initialising the CNN
classifier = Sequential()

# Step 1 - Adding a convolution layer
# kernel_size is the feature detector matrix size
# input_shape: 64x64 pixels (kept small for CPU) with 3 channels (TensorFlow backend ordering)
classifier.add(Conv2D(
    filters=32, kernel_size=(3, 3), input_shape=(64, 64, 3), activation='relu'))

# Step 2 - Max Pooling
# pool_size sets how big we want our pooling matrix
# Since we don't set the strides tuple, strides default to the pool_size
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Optional: create another Conv + Max Pooling layer
classifier.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Step 3 - Flattening
# This huge, flat array will relate to specific features
classifier.add(Flatten())

# Step 4 - Full connection
# Adding the first fully connected hidden layer
classifier.add(Dense(units=128, activation='relu'))
# If it wasn't a binary output, we would use softmax
classifier.add(Dense(units=1, activation='sigmoid'))

classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Part 2 - Fitting the CNN to the images
from keras.preprocessing.image import ImageDataGenerator

# Some of the args apply random transformations (augmentation) for training
train_datagen = ImageDataGenerator(
    rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)

# Only requires rescale since the test set doesn't need transforms
test_datagen = ImageDataGenerator(rescale=1./255)

# target_size matches the 64x64 input we expect
training_set = train_datagen.flow_from_directory(
    'dataset/training_set', target_size=(64, 64), batch_size=32, class_mode='binary')
test_set = test_datagen.flow_from_directory(
    'dataset/test_set', target_size=(64, 64), batch_size=32, class_mode='binary')

# steps_per_epoch = number of images in the training set
classifier.fit_generator(
    training_set, steps_per_epoch=8000, epochs=25,
    validation_data=test_set, validation_steps=2000)

After training the model, if you want to use it and make predictions, you can save and reload that model. Check here for more information.

9. Dimensionality reduction

In Classification, we only worked with datasets comprised of two independent variables. This is because:

  1. We needed two dimensions to visualise how ML models work.
  2. Whatever the original number of IVs, we can often end up with two independent variables by applying an appropriate Dimensionality Reduction technique.

Feature selection techniques covered in Regression (Part 2) included Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison and more.

In this part, we will cover the following Feature Extraction techniques:

  1. Principal Component Analysis (PCA)
  2. Linear Discriminant Analysis (LDA)
  3. Kernel PCA
  4. Quadratic Discriminant Analysis (QDA)

9.1 Principal Component Analysis - PCA

One of the most used unsupervised algorithms. It is used for applications such as:

  • Noise filtering
  • Visualization
  • Feature extraction
  • Stock market predictions
  • Gene data analysis

It is used to:

  1. Identify patterns in data

  2. Detect the correlation between variables

PCA

The goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a (k)-dimensional subspace (where k<d). We want to:

  1. Standardize the data.
  2. Obtain the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
  3. Sort eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues where $k$ is the number of dimensions of the new feature subspace $(k\le{d})$.
  4. Construct the projection matrix W from the selected $k$ eigenvectors.
  5. Transform the original dataset X via W to obtain a $k$-dimensional feature subspace Y.

A great link on the mathematics behind it can be found here.

A great visual link for intuition can be found here.

For 2D, we can see how the relationship works for dimensionality reduction. The real power can be seen for the 3 dimensional space.

PCA in summary helps us to learn about the relationship between the X and Y values and find the list of principal axes. Be careful though, PCA is highly affected by outliers.
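
A minimal NumPy sketch of the five steps listed above on toy data (an illustration only; in practice we use sklearn's PCA as in the code below):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)           # toy data: 100 samples, d = 5 features
k = 2

# 1. Standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigenvectors/eigenvalues of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# 3. Sort eigenvalues in descending order and keep the top k eigenvectors
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:k]]                    # 4. projection matrix W (d x k)

# 5. Project onto the k-dimensional subspace
Y = X_std @ W
print(Y.shape)                                    # (100, 2)
print(eigenvalues[order] / eigenvalues.sum())     # explained variance ratio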

PCA in Python

If we have $m$ independent variables, PCA extracts $p\le{m}$ new independent variables that explain the most variance in the dataset, regardless of the dependent variable. The fact that the DV is not considered is what makes PCA an unsupervised model.

What we want to do with the Wine.csv file is take the data and build a classification model such as logistic regression, which will help us recommend wines to customers. We cannot visualise the predictions with all the independent variables, so we apply dimensionality reduction to obtain two variables that we can plot instead.

Confusion matrix

The confusion matrix will be 3x3 in this case since there are three classes. The diagonal still contains the correct predictions, while the off-diagonal entries are the incorrect ones.

Code

# PCA

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling - must be applied for PCA and LDA
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Applying PCA
from sklearn.decomposition import PCA
# n_components is the number of principal components we want
# Note: use None at first to find which components explain the most variance
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# We want to find which components explain the variance
# Check the printout and then use it if we need to
"""
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
"""

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Producing the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

Plot

Note that in this example we had to set a third colour for the third class.

PCA Plot

9.2 Linear Discriminant Analysis (LDA)

LDA Intuition

While it may seem similar to PCA, there are important differences. LDA is used as a preprocessing step for pattern classification.

LDA differs because, in addition to finding the component axes, we are interested in the axes that maximise the separation between multiple classes.

LDA

The goal of LDA is to project a feature space onto a smaller subspace while maintaining the class-discriminatory information.

Here is a good intro into Linear Discriminant Analysis.

Steps for LDA

  1. Compute the $d$-dimensional mean vectors for the different classes from the dataset.
  2. Compute the scatter matrices (in-between-class and within-class scatter matrix).
  3. Compute the eigenvectors $(e_1,e_2,...,e_d)$ and corresponding eigenvalues $(λ_1,λ_2,...,λ_d)$ for the scatter matrices.
  4. Sort the eigenvectors by decreasing eigenvalues and choose $k$ eigenvectors with the largest eigenvalues to form a $d\times{k}$ dimensional matrix $W$ (where every column represents an eigenvector).
  5. Use this $d\times{k}$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication $Y=X\times{W}$ (where $X$ is an $n\times{d}$-dimensional matrix representing the $n$ samples, and $Y$ are the transformed $n\times{k}$-dimensional samples in the new subspace).

LDA in Python

As opposed to PCA, LDA is a supervised model since it takes the dependent variable into consideration.

# LDA

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling - must be applied for PCA and LDA
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# n_components is the number of linear discriminants we want
# Note: for LDA we include y_train in the fit, as LDA is supervised
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Producing the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()

9.3 Kernel PCA

Know when to apply it: it is useful when the data is not linearly separable.

Mapping function

Before and after

# Kernel PCA

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
# n_components is the number of components we want; the RBF kernel handles
# the non-linear mapping to the higher-dimensional space
kpca = KernelPCA(n_components=2, kernel='rbf', random_state=0)
X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Producing the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

Help and Issue Tracking

  • If you run into a MKL error, check here.
  • Updates to sklearn mean that train_test_split comes from sklearn.model_selection.