- Machine Learning - Udemy A-Z
- Part 1 - Data Preprocessing
- 1. The initial data
- 2. Regression
- 2.1: Simple Linear Regression
- 2.2 Multiple Linear Regression
- 2.3 Polynomial Linear Regression
- 2.4 Support Vector Regression
- 3. Classification
- 3.2 K-Nearest Neighbours Algorithm
- 3.3 Support Vector Machine (SVM)
- 3.4 Kernel SVM
- 3.5 Naive Bayes
- 3.6 Decision Trees Classification
- 3.7 Random Forest Classification
- 3.8 Evaluating Classification Model Performance
- Classification Summary
- 4. Clustering
- 4.1 K-Means Clustering
- 4.2 Hierarchical Clustering
- 5. Association Rule Learning
- 5.1 Apriori
- 6. Reinforcement Learning
- 6.1 Upper Confidence Bound (UCB)

- Part 1 - Data Preprocessing

Dataset | Example set |
---|---|
Country | String |
Age | Int |
Salary | Int |
Purchased | Boolean |

This dataset also has `independent` and `dependent` variables, with the `dependent` variable being the `Purchased` column.

So using the first three columns, we will predict the fourth.

**In Python**

Libraries | What for? |
---|---|
matplotlib | Plotting, with a bunch of very useful and intuitive tools |
numpy | Helps with math |
pandas | Imports and manages data sets |

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

**In R**

Here, we don't need to import any libraries since R Studio comes with a bunch of them!

Here, we will import the variables and create a matrix of observations.

**In Python**

Set the working directory to where we need to be.

```python
# given the pandas import
dataset = pd.read_csv('Data.csv')

# iloc[rows, columns] -> :-1 selects all columns except the last
X = dataset.iloc[:, :-1].values
# printing X shows a matrix of the data along with its dtype

y = dataset.iloc[:, 3].values
# printing y shows the values of the last column
```

**In R**

REMEMBER - R Arrays begin from 1

```r
# importing the dataset
dataset = read.csv('Data.csv')
```

How can we handle the problem of missing (null) data?

One way to get around this is to take the mean of the column.

For this dataset, where `Age` or `Salary` is missing, we will replace the missing value with the column mean.

**In Python**

The library we will use is `sklearn`.

`sklearn` is scikit-learn, an amazing library. We import `Imputer` to help with the preprocessing.

```python
# Note: Imputer was removed in scikit-learn 0.22; newer versions
# provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer

# missing values show up as NaN
# strategy defaults to mean anyway, but we'll be verbose
# axis = 0 imputes along columns
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# lower bound is included, upper bound is excluded
imputer = imputer.fit(X[:, 1:3])
# the transform method replaces the missing data
X[:, 1:3] = imputer.transform(X[:, 1:3])
```

**In R**

```r
# ifelse works like a ternary
# is.na checks whether a value is missing
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
```

What happens when we have strings instead of numbers in the data? We must convert them to numbers. For example, the given data has a column of country strings and a boolean column.

```python
# encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
# index 0 is the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
```

However, since the encodings are integer values, the model could treat a higher integer as being of greater importance when it is not.

Instead, we set up three columns, one per country, that work a bit like an `adjacency list`: `1` where the row belongs to that country, `0` otherwise.

```python
# encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# index 0 is the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# Note: the categorical_features argument was removed in scikit-learn 0.22
onehotencoder = OneHotEncoder(categorical_features=[0])
# ensure that X is transformed
X = onehotencoder.fit_transform(X).toarray()
```

However, we still need to keep track of which column corresponds to which variable.

Let's look at the encoding for the `dependent` variable, where we only need the `LabelEncoder`.

```python
# ...
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
```

In the case of the boolean, we basically want the values to be encoded to 0 and 1.

**In R**

For R, we just need to factor the columns the way we want to.

Since we have the `factor` function, the number encodings don't need to be set up in the same way as they were for Python.

```r
# Encoding categorical data
# remember c() is a vector!
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
```

With any model, we should split the data into a training set and a test set.

We build the model on the training set and then test it on a new set that the model has not seen.

The performance should not differ too much between the two.

For this section, we use `from sklearn.model_selection import train_test_split` to split the data into training and test sets.

`train_test_split(*arrays, test_size, train_size)`

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, train_size=0.8, random_state=0)

# use below if using python-shell in node
res = X_train.tolist()
send(res, 0)
res = X_test.tolist()
send(res, 0)
```

With two variables, we can find the Euclidean distance between point one and point two as `sqrt((x[1] - x[0])^2 + (y[1] - y[0])^2)`.

However, when the two variables are on very different scales, the squared difference of the larger one dominates the distance. Basically, the smaller, less dominant one might as well not exist.
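To make the scale problem concrete, here is a minimal sketch using two made-up observations (the age and salary values are assumptions for illustration, not rows from `Data.csv`). It shows how much of the squared distance comes from the salary column alone.

```python
import math

# Two hypothetical observations: (age in years, salary in dollars)
a = (27, 48000)
b = (48, 79000)

age_term = (b[0] - a[0]) ** 2        # squared age difference
salary_term = (b[1] - a[1]) ** 2     # squared salary difference
distance = math.sqrt(age_term + salary_term)

# Fraction of the squared distance contributed by salary
salary_share = salary_term / (age_term + salary_term)

print(age_term)       # 441
print(salary_term)    # 961000000
print(salary_share)   # very close to 1: age barely matters
```

Without scaling, the age difference of 21 years has almost no influence on the distance, which is why the features are standardised before using distance-based models.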

```python
#
# FEATURE SCALING
#
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# fit on the training set only, then apply the same scaling to the test set
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

How about the dummy variables? It won't break the model if you don't scale them, and if you do scale them you lose the ability to interpret which country is which.

Even when no Euclidean distance is required, feature scaling can allow the algorithm to run much faster.

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
# Not compulsory - only if data is missing
# (Imputer was removed in scikit-learn 0.22; see sklearn.impute.SimpleImputer)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Not compulsory - only if we need to convert the data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Encode strings
# Think of the example of countries to a [0|1] matrix
# Encoding the Independent Variable
labelencoder_X = LabelEncoder()
# index 0 is the country column
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
# ensure that X is transformed
# details here http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

#
# SPLITTING THE SET INTO THE TRAINING AND TEST SET
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#
# FEATURE SCALING
#
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise your model is predicting present but unknown values. Regression techniques vary from Linear Regression to SVR and Random Forest Regression.

In this part, you will understand and learn how to implement the following Machine Learning Regression models:

- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Support Vector for Regression (SVR)
- Decision Tree Regression
- Random Forest Regression

Looking at years of experience vs salary.

The issue: what is the correlation between `Years of experience` and `Salary`?

Ask the question: what value do we get from this model? A business could go back to this model and apply it to get an idea of the salaries it is willing to offer.

Simple linear regression is basically `y = b[0] + b[1]*x[1]` (or even `y = mx + c`).

```
# Example - How does salary change with years of experience?
y    - dependent variable (DV), e.g. salary
x    - independent variable (IV), e.g. years of experience
b[1] - coefficient of the IV (how a unit change in x[1] affects y)
b[0] - constant
```

Regression - look at the hard facts.

The simple linear regression will basically be a best fit for the data.

`b[0]` is the `y-intercept` and `b[1]` is the slope of the line.

On the `XY Graph`, each data point is an observation plotted against the independent variable. If we draw vertical lines from these points to the regression line, we can see where the model says that person should be sitting. If `y[i]` is the observed data point, `y[hat][i]` is the value the model predicts for it.

To get the best fitting line, we look for the line that gives the `min` of `sum((y - y[hat])^2)`.
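As a rough illustration of that minimisation, the sketch below computes `b[0]` and `b[1]` with the standard closed-form least squares solution for one IV and compares the sum of squared residuals against another candidate line. The five data points are made up for the example and the helper names are hypothetical; in the notes the fitting is done by `LinearRegression`.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimising sum((y - (b0 + b1*x))^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # y-intercept
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """sum((y - y_hat)^2) for the line y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data: years of experience vs salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [40, 45, 52, 55, 63]

b0, b1 = fit_simple_linear_regression(years, salary)
best_ssr = sum_squared_residuals(years, salary, b0, b1)
other_ssr = sum_squared_residuals(years, salary, 34.0, 5.0)  # another candidate line

print(b0, b1)
print(best_ssr, other_ssr)  # the fitted line has the smaller value
```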

In this example, `YearsExperience` is the independent variable and `Salary` is the dependent variable.

```python
# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for Node.js Python Shell lib
def send(arg, type):
    if type == 1:
        print(json.dumps(json.loads(arg)))
    elif type == 2:
        print(arg)
    else:
        print(json.dumps(arg))

# Importing the dataset
dataset = pd.read_csv('data/Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
send(X.tolist(), 0)
send(y.tolist(), 0)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```

If we run the above, we may get an error from `sklearn.preprocessing` saying that 1d arrays need to be reshaped.

In simple linear regression, we also don't need to worry about `Feature Scaling`.

**Fitting Simple Linear Regression to the Training Set**

`fit` the `regressor`:

```python
# Add to the above code
# Fitting Simple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
send(str(regressor), 0)
```

Now that we have the `regressor`, we can start making basic predictions! With the Linear Regression object, we can now do this using the `predict` method.

```python
# Add to code above
# Predicting the test set results
y_pred = regressor.predict(X_test)
# send(X_test.tolist(), 0)  # see test set years for IV
# send(y_test.tolist(), 0)  # check what the results were
# send(y_pred.tolist(), 0)  # check the predictions
```

**Visualizing the Model**

We used the training set to fit a line; now we can see how it compares against, first, the training set and then the test set!

Note that the blue line is the prediction, while the red dots are the actual data points.

```python
# Visualizing the Training Set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
# plt.savefig('plot.png')
```

As for checking the test set:

```python
# Visualizing the Test Set results
plt.scatter(X_test, y_test, color='red')
# We do not change this line since the regressor is already trained
# with the training set
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
# plt.savefig('plot.png')
```

The challenge: you have data on 50 companies, with the `Profit` of each and the independent variables it depends on, including `R&D Spend`, `Administration` and `Marketing Spend`.

Multiple linear regression applies when there are multiple IVs influencing the DV.

```
# Simple Linear Regression
y = b[0] + b[1]*x[1]

# Multiple Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n]

# Multiple Linear Regression after replacing categorical data
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n] + b[n+1]*D[1] + ... + b[n+m]*D[m]
```

**The Assumptions of Linear Regression**

- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of multicollinearity

**Dummy Variables**

For categorical data, we use the `LabelEncoder` and `OneHotEncoder` to expand the column into one column per distinct value of `state`, making a binary matrix of those columns and rows.

**Note:** There is a dummy variable trap we will talk about later.

Leaving one category out may look biased, but it is not: the omitted category becomes the default case absorbed by `b[0]`, and the coefficients of the remaining dummy variables adjust the prediction relative to that default.

You cannot have the constant `b[0]` together with all of the dummy variables. You should always omit one dummy variable.
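The sketch below illustrates the trap with a hand-rolled one-hot encoder (the `one_hot` helper and the sample state names are assumptions for illustration, not part of the course code). With every category kept, each row of dummy columns sums to 1, which duplicates the constant column; dropping one category removes that redundancy.

```python
def one_hot(values, drop_first=False):
    """Encode a list of category labels as rows of 0/1 dummy variables."""
    categories = sorted(set(values))
    if drop_first:
        # Omit one dummy variable to avoid the dummy variable trap
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


states = ['New York', 'California', 'Florida', 'New York']

cats_all, rows_all = one_hot(states)
cats_drop, rows_drop = one_hot(states, drop_first=True)

print(cats_all)    # ['California', 'Florida', 'New York']
print(rows_all)    # every row sums to 1, same as the constant column
print(cats_drop)   # ['Florida', 'New York']
print(rows_drop)   # an all-zero row now means 'California'
```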

Back with one IV and one DV, life was great, but now that we have many columns we need to decide what we can use as useful predictors.

**Why throw out columns instead of using everything?**

- Garbage in -> Garbage out. If you throw everything in, you may also add in garbage.
- Shows an understanding of variables

**5 Methods of Building Models**

- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison

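Methods `2, 3 and 4` (Backward Elimination, Forward Selection and Bidirectional Elimination) are sometimes referred to as `Stepwise Regression`; sometimes the term refers to just method `4`.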

**All in**

Throw in `everything`. When to do it?

- You have prior knowledge that these are the true predictors
- You have to: maybe a framework where you have to use them
- Preparing for the `Backward Elimination` type of regression

**Backward Elimination**

1. Select a significance level to stay in the model (e.g. SL = 0.05)
2. Fit the full model with all possible predictors
3. Consider the predictor with the `highest P-value`. If `P > SL`, go to step 4, otherwise go to FIN
4. Remove the predictor
5. Fit the model without this variable (rebuild the entire model with the remaining variables), then return to step 3 with the new model

FIN. When the highest P-value is no longer above SL, you come here and the model is ready.

**Forward Selection**

1. Select a significance level to enter the model (e.g. SL = 0.05)
2. Fit all simple regression models `y ~ x[n]` and select the one with the lowest P-value
3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
4. Consider the predictor with the `lowest P-value`. If `P < SL`, go to step 3, otherwise go to FIN

FIN. Keep the previous model.

**Bidirectional Elimination**

1. Select a significance level to enter and one to stay in the model (e.g. SLENTER = 0.05, SLSTAY = 0.05)
2. Perform the next step of `Forward Selection` (new variables must have `P < SLENTER` to enter)
3. Perform ALL steps of Backward Elimination (old variables must have `P < SLSTAY` to stay). This is a very iterative process
4. When no new variables can enter and no old variables can exit, go to FIN

FIN. Model is ready.

**All Possible Models**

Most thorough approach, but also the most resource-consuming.

1. Select a criterion of goodness of fit (e.g. Akaike criterion)
2. Construct all possible regression models: `(2^N) - 1` total combinations
3. Select the one with the best criterion

FIN. Your model is ready.

If you have 10 columns in your data, that means 1023 models (ridiculous)
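The `(2^N) - 1` count comes from every non-empty subset of the predictors being a candidate model. A small sketch that enumerates them (the `all_predictor_subsets` helper is hypothetical; the column names are the ones from the 50 startups example):

```python
from itertools import combinations


def all_predictor_subsets(predictors):
    """Return every non-empty subset of predictors, one per candidate model."""
    subsets = []
    for size in range(1, len(predictors) + 1):
        subsets.extend(combinations(predictors, size))
    return subsets


predictors = ['R&D Spend', 'Administration', 'Marketing Spend']
subsets = all_predictor_subsets(predictors)

print(len(subsets))                                   # 7 == 2**3 - 1
print(len(all_predictor_subsets(list(range(10)))))    # 1023 == 2**10 - 1
```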

```python
# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for Node.js Python Shell lib
def send(arg, type):
    if type == 1:
        print(json.dumps(json.loads(arg)))
    elif type == 2:
        print(arg)
    else:
        print(json.dumps(arg))

# Importing the dataset
dataset = pd.read_csv('data/50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
# send(X.tolist(), 0)
# send(y.tolist(), 0)

# # Taking care of missing data
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
# imputer = imputer.fit(X[:, 1:3])
# X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()
send(X.tolist(), 0)

# Avoiding the Dummy Variable Trap
# The library takes care of it for us in this case
# X = X[:, 1:]
# send(X.tolist(), 0)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```

**Library for Multiple Linear Regression**

Add the following to the above:

```python
# Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)
```

To get Backward Elimination ready, you need to import the required library. Add the following to the previous code.

The library that we use doesn't take into account the `x[0]` constant = 1, so we will need to add this column ourselves. Most other libraries normally include it.

```python
# Backward Elimination Preparation
# OLS is exposed by statsmodels.api in current statsmodels releases
import statsmodels.api as sm
# Add in a column of ones for X[0]
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
send(X.tolist(), 0)
```

Now that we are ready to start Backward Elimination, we can go ahead and begin.

The summary table shows us some useful information about the multiple linear regression model: the `R-squared`, the `Adjusted R-squared`, `P` values and more.

The lower the `P` value, the more statistically significant the predictor.

```python
# Backward Elimination Preparation
# OLS is exposed by statsmodels.api in current statsmodels releases
import statsmodels.api as sm
# Add in a column of ones for X[0]
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]

# Stay if < SL
SL = 0.05

# Create a new regressor
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)
```

After the first run through, get rid of the variable with the highest `P` value. We need to continue this until every remaining `P` value is under the `0.05` SL value.

```python
# We step through the BE algorithm manually, one iteration at a time
# For now, we are not focused on improving the model

# Create a new regressor and run iteration
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 1, 3, 4, 5]]
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3, 4, 5]]
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3, 5]]
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)

# Create a new regressor and run iteration
X_opt = X[:, [0, 3]]
regressorOLS = sm.OLS(endog=y, exog=X_opt).fit()
send(str(regressorOLS.summary()), 0)
```

```
# Simple Linear Regression
y = b[0] + b[1]*x[1]

# Multiple Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n]

# Multiple Linear Regression after replacing categorical data
y = b[0] + b[1]*x[1] + b[2]*x[2] + ... + b[n]*x[n] + b[n+1]*D[1] + ... + b[n+m]*D[m]

# Polynomial Linear Regression
y = b[0] + b[1]*x[1] + b[2]*x[1]^2 + ... + b[n]*x[1]^n
```

Depending on the power, we begin to have a parabolic shape - think of how it all graphs and the amount of min/max for each power.

Use cases could be things such as understanding how epidemics have spread etc.

**Why is it still called Linear?**

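The trick here is that "linear" does not refer to the X variables. When talking about the class of the regression, we're talking about the coefficients: `y` is still a linear combination of `b[0] ... b[n]`, even though the terms they multiply are powers of `x[1]`.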
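A small sketch of that idea, with hypothetical helper names and made-up coefficients: once `x` is expanded into the features `[1, x, x^2, ...]`, the prediction is an ordinary weighted sum of those features, so scaling the coefficients scales the prediction in exactly the same way.

```python
def polynomial_features(x, degree):
    """Expand a single value x into [x^0, x^1, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


def predict(coefficients, features):
    """Linear combination of the features: sum(b[k] * feature[k])."""
    return sum(b * f for b, f in zip(coefficients, features))


coefficients = [1.0, 2.0, 3.0]            # made-up b[0], b[1], b[2]
features = polynomial_features(2.0, 2)    # [1.0, 2.0, 4.0]

y_hat = predict(coefficients, features)
y_hat_doubled = predict([2 * b for b in coefficients], features)

print(features)        # [1.0, 2.0, 4.0]
print(y_hat)           # 17.0
print(y_hat_doubled)   # 34.0: doubling the coefficients doubles the output
```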

These models aren't necessarily more advanced than the other linear regression models that we have looked at so far.

In this model, we basically only require one independent variable, `level`, and the `salaries` column will become the DV y.

**Note:** always ensure that X is a matrix and that y is a vector.

We also won't split the data into a training and test set, simply because we don't have enough data to train on one and test performance on the other. We also want the prediction to be as accurate as possible, so the model should get all the information available.

```python
# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for Node.js Python Shell lib
def send(arg, type):
    if type == 1:
        print(json.dumps(json.loads(arg)))
    elif type == 2:
        print(arg)
    else:
        print(json.dumps(arg))

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
send(X.tolist(), 0)
send(y.tolist(), 0)

# Fitting Simple Linear Regression to the dataset
# Feature Scaling not required with the following library
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
# This transforms the original features to have
# associated polynomial terms
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)

# Fit a second linear regression on the polynomial terms
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

# Visualising the Linear Regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg.predict(X), color='blue')
plt.title('Truth or Bluff for salary for job (LR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryLR.png')
plt.close()
```

In order to plot and predict with polynomial regression, we need to pass the output of the `fit_transform` method into the `LinearRegression.predict()` method.

```python
# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for Node.js Python Shell lib
def send(arg, type):
    if type == 1:
        print(json.dumps(json.loads(arg)))
    elif type == 2:
        print(arg)
    else:
        print(json.dumps(arg))

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
send(X.tolist(), 0)
send(y.tolist(), 0)

# Fitting Simple Linear Regression to the dataset
# Feature Scaling not required with the following library
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
# This transforms the original features to have
# associated polynomial terms
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)

# Fit a second linear regression on the polynomial terms
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

# Visualising the Linear Regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg.predict(X), color='blue')
plt.title('Truth or Bluff for salary for job (LR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryLR.png')
plt.close()

# Visualising the Poly Regression results
# Use a finer grid for a higher resolution curve;
# predict() expects a 2D array, so reshape into a single column
X_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)
plt.scatter(X, y, color='red')
plt.plot(X_grid, lin_reg2.predict(poly_reg.fit_transform(X_grid)), color='green')
plt.title('Truth or Bluff for salary for job (PR)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.savefig('SalaryPR-x.png')
plt.close()

prediction = lin_reg2.predict(X_poly)
send(prediction.tolist(), 0)

# Predicting a new result with the Linear Regression model
# (a single sample still has to be passed as a 2D array)
y_pred = lin_reg.predict([[6.5]])
# This will be an awful result
send(y_pred.tolist(), 0)

# Predicting a new result with the Polynomial Regression model
y_pred_poly = lin_reg2.predict(poly_reg.fit_transform([[6.5]]))
# This will be a great result!
send(y_pred_poly.tolist(), 0)
```

Very similar to Polynomial Linear Regression with regard to code, but we use feature scaling and the `SVR` class for the regressor. The kernel refers to the type of fit, e.g. poly, rbf etc.

```python
# Data Preprocessing Template

# Importing the libraries
import sys, json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# send() for Node.js Python Shell lib
def send(arg, type=0):
    if type == 1:
        print(json.dumps(json.loads(arg)))
    elif type == 2:
        print(arg)
    else:
        print(json.dumps(arg))

# Importing the dataset
dataset = pd.read_csv('data/Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Feature Scaling
# StandardScaler expects 2D input, so y is reshaped into a column
# and flattened again afterwards
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# Create the SVR regressor
# SVR doesn't auto Feature Scale
from sklearn.svm import SVR
# kernel for linear, poly, rbf etc
regressor = SVR(kernel='rbf')
regressor.fit(X, y)

# Predicting a new result
# The input has to be scaled the same way as the training data
y_pred = regressor.predict(sc_X.transform([[6.5]]))
# We have to invert the scaling to get back to a salary
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))
send(y_pred.tolist())

# Visualising the SVR results
plt.scatter(X, y, color='red')
plt.plot(X, regressor.predict(X), color='blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# plt.savefig('svr.png')
# plt.close()
```

The code can be found in `~/Learning/ML-Course/ml-a-z-course/part-3-classification/1-logistical-regression/`.

First, start by adding in the Python Preprocessing template (search SnippetsLab).

In this first example, we are going to see if we can predict the purchase of an SUV given the `Age` and `EstimatedSalary`.

Since we are using columns `2,3` and we are attempting to predict `4`, update the import of the dataset to look like the following:

```python
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
```

Since there are 400 observations, let's use 300 for the training set and the rest for the test set.

```python
# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation was replaced by sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```

Because we want an accurate prediction of whether or not a user is going to buy an SUV, we WANT feature scaling. Just uncommenting this part of the template will be enough to include feature scaling.

For an intuitive example of the "why" behind feature scaling, check out Stack Overflow.

```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

We import the `LogisticRegression` class from `sklearn.linear_model` and use the constructor to build the object that we will fit and use for predictions.

```python
# Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)  # teach the correlations
```

Here, we just need to use our `X_test` variable with the method `predict`.

```python
# Predicting the results
y_pred = classifier.predict(X_test)
print(y_pred)
```

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Import the `confusion_matrix` function (not a class) from `sklearn.metrics`.

```python
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# prints [[65  3]
#         [ 8 24]]
```

Note: From the above, 65 and 24 are the *correct* predictions, and 3 and 8 are the *incorrect* predictions.
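As a quick worked example, the accuracy can be read straight off that matrix. This sketch hard-codes the values printed above and assumes the `sklearn.metrics.confusion_matrix` layout, where rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix from the logistic regression run above
cm = [[65, 3],
      [8, 24]]

true_negatives, false_positives = cm[0]   # actual class 0
false_negatives, true_positives = cm[1]   # actual class 1

total = true_negatives + false_positives + false_negatives + true_positives
correct = true_negatives + true_positives
accuracy = correct / total

print(total)      # 100
print(correct)    # 89
print(accuracy)   # 0.89
```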

The best way to check the results is to use a graph!

To interpret the graph, you will have a split of red and green points. Each point represents one of our data points on an X/Y graph with the two IVs on the axes. The colour of the point itself shows whether the user actually bought or did not buy, with the background colour representing the "prediction regions".

What is the goal of this? Since we want to classify the right users and put them into the right category, we can use that to help target a particular demographic.

The line in between the regions is called a prediction boundary. Why is it a straight line? That is because we are using a *linear* logistic classifier.

The last important part is that the graph is a representation of the *training* set.

```python
# Visualizing the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# build a grid with a 0.01 step and apply the classifier to it
# min - 1 and max + 1 give a margin around the data range
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
# create the plt contour split
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
# set x and y limits
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# for each class in the set, create a scatter of its points
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualizing the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```

What K-NN does for you: help define if a new data point added should fall into the red category or into the green category.

How does it work?

- K-NN works by choosing the number K of nearest neighbours. One of the most common default values is 5.
- These neighbours are chosen with Euclidean distance.
- Among the K neighbours, count the number of data points in each category.
- Assign new data point to category based on most neighbours.

It is a very simple model.
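The steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea with made-up points and a hypothetical `knn_predict` helper (it needs Python 3.8+ for `math.dist`); the notes themselves use `KNeighborsClassifier` from `sklearn`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the categories among the k nearest neighbours
    nearest_labels = [label for _, label in distances[:k]]
    # Assign the category with the most neighbours
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: two small clusters
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ['red', 'red', 'red', 'green', 'green', 'green']

print(knn_predict(points, labels, (2, 2), k=3))   # red
print(knn_predict(points, labels, (6, 5), k=3))   # green
```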

Using our classification template, we can just add the necessary lines to import the classifier and use the `fit` method.

```python
# Fitting classifier to the Training set
from sklearn.neighbors import KNeighborsClassifier
# minkowski with p=2 is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
```

The confusion matrix for this data using the K-NN method shows more success than the linear classifier!

```
[[64  4]
 [ 3 29]]
```

```python
# Classification template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting classifier to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN Classifier (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```

SVM searches for the `maximum margin`: the line that separates the two classes of points while keeping the largest possible distance to the nearest point of each class. Those nearest points are called the `support vectors` - they are the only points that contribute to the result of the algorithm.

The line in the middle is called the `maximum margin hyperplane/classifier`.

So why SVM? As opposed to most machine learning algorithms that will try to use the most common elements of a set of information, SVM will try to find the extreme elements close to being classified as the other at the boundary. This in itself makes SVMs both special and very different. At times, this means that they could work a lot better because of this ignorance of other data points.

Again, the important part is importing the classifier, creating the instance and running the `fit` method:

```python
# Fitting SVM to the Training set
from sklearn.svm import SVC
# specify "linear" since that is what we want
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
```

In this case, the CM comes as the following:

[[66 2] [ 8 24]]

In this case, the SVM actually turns out pretty good on the test set: 90 of the 100 predictions are correct.

Think about a situation where the data is not linearly separable (i.e. one class sits in a circle around the other). A linear SVM cannot set a useful boundary for this.

We map the data to a higher dimension in order to then use linear separation.

After we add the linear separator to this higher dimension, we use projection to again bring it back down a dimension and we will have our non-linear separator.

Warning: this could require a whole bunch of compute power.

It uses an intense-looking math function with Euler's number. This function is the *Gaussian RBF Kernel*, and it is worth revisiting down the track. To visualise it, think of a 3D surface above the XY plane: the height at any point is the calculated kernel value, and it peaks directly above a central point (the landmark) on the XY plane.

The landmark for the kernel itself is abstracted from the intuition, but it is calculated for us.

Sigma's role in this whole process is to define how wide the region around the landmark is in which the kernel value stays noticeably above 0.

You can also take multiple kernel functions and add them if required.

Classifications are generally assigned based on the kernel value being = 0 or > 0.
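As a rough sketch of the idea above, the Gaussian RBF kernel can be written as $K(x, l) = e^{-\lVert x - l \rVert^2 / (2\sigma^2)}$, where $l$ is the landmark. The function name `gaussian_rbf` and the sample points are assumptions made for illustration, not course code:

```python
import math

def gaussian_rbf(x, landmark, sigma=1.0):
    """Gaussian RBF kernel value between a point x and a landmark."""
    squared_distance = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))

landmark = (0.0, 0.0)  # hypothetical landmark at the origin

at_landmark = gaussian_rbf((0.0, 0.0), landmark)   # the peak of the surface
nearby = gaussian_rbf((1.0, 0.0), landmark)        # lower, but clearly above 0
far_away = gaussian_rbf((5.0, 5.0), landmark)      # practically 0

# A larger sigma widens the region that still counts as "close"
nearby_wide = gaussian_rbf((1.0, 0.0), landmark, sigma=3.0)
```

Points near the landmark get a value close to 1 and points far away get a value close to 0, which is what lets a threshold on the kernel value act as a non-linear boundary.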

To see some of these in 3d, head to this website

- Gaussian RBF Kernel
- Sigmoid Kernel
- Polynomial Kernel

Classification code:

```python
# Fitting classifier to the Training set
from sklearn.svm import SVC
# Set rbf for Gaussian RBF
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)
```

Confusion matrix:

[[64 4] [ 3 29]]

Bayes' Theorem comes first, as a prelude to using *Naive Bayes*.

To picture how this works, think of a spanner. There are two machines that both produce spanners, each spanner marked by which machine created it. What we want to find out is that if we go through and throw out the "defects", what is the probability that machine two will have a defect.

```
Mach 1: 30 wrenches/hr
Mach 2: 20 wrenches/hr

> Out of all parts produced, 1% are defective
> Out of all defective parts, 50% came from mach1 and 50% from mach2
> Question: What is the probability that a part produced by mach2 is defective?

Given that we know the totals
> P(Mach1) = 30/50 = 0.6
> P(Mach2) = 20/50 = 0.4

And for defects
# Prob of a part being defective
> P(Defect) = 1%
# Prob that a part picked from the defect pile came from each machine
> P(Mach1 | Defect) = 50%
> P(Mach2 | Defect) = 50%

# Therefore
> P(Defect | Mach2) = (P(Mach2 | Defect) * P(Defect)) / P(Mach2) = (0.5 * 0.01) / 0.4 = 0.0125
```
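The same arithmetic can be checked in a few lines of Python. This is only a sketch of the worked example above; the helper name `bayes` is an assumption introduced here:

```python
def bayes(likelihood, prior, marginal):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Numbers taken from the wrench example
p_mach2 = 20 / 50            # share of all parts produced by machine 2
p_defect = 0.01              # share of all parts that are defective
p_mach2_given_defect = 0.5   # share of defective parts that came from machine 2

p_defect_given_mach2 = bayes(p_mach2_given_defect, p_defect, p_mach2)
print(round(p_defect_given_mach2, 4))
```

Swapping in the machine 1 figures (`30 / 50` and `0.5`) gives the defect probability for machine 1 in the same way.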

Picture a graph with two (can be more!) labeled categories (Walks and Drives) and axes labeled `Salary` and `Age`.

Armed with the knowledge of `Bayes Theorem` and the previous data points, what is the likelihood that some new person with features X walks?

```
> Posterior Probability = (Likelihood * Prior Probability) / Marginal Likelihood
> P(Walks|X) = (P(X|Walks) * P(Walks)) / P(X)
> P(Drives|X) = (P(X|Drives) * P(Drives)) / P(X)
# After calculating, compare
> P(Walks|X) vs P(Drives|X)
```

The new point is assigned to whichever category has the larger posterior probability.

Classifier code:

```python
# Fitting Naive Bayes to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
```

Confusion matrix:

[[65 3] [ 7 25]]

With Naive Bayes, the decision boundary in the plot is a nice smooth curve without irregularities.

CART = Classification and Regression Trees. This is an umbrella term for:

- Classification trees ie red/green apples
- Regression trees ie temperature outside, cost for things etc

For the intuition behind it, the scatter plot is cut up by *splits*, each chosen to maximise the number of points of a single category on each side. That's a simple explanation, although there is some complex mathematics behind how it works.

With the initial split, we begin building a decision tree. E.g. the first split might be at 60 (is `X2 < 60`?), then the second might be `X1 < 50` on a particular branch. The final nodes on each branch are called the *terminal leaves*, and these leaves give the final classification.

Decision trees are also old. They had started to die off as more sophisticated methods came to replace them. Recently, they were "reborn" through additional methods like `random forest`, `gradient boosting` etc. that have brought them back into the game. While not very powerful on their own, they are leveraged by these other methods.

Classification code:

```python
# Fitting decision tree to the Training set
from sklearn.tree import DecisionTreeClassifier
# To be as homogeneous as possible, we want to use entropy as we are looking to reduce this
# information gain is what we want to improve after the split
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
```
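To make the `entropy` criterion a little more concrete, here is a minimal sketch of how entropy and information gain could be computed for one candidate split. The function names and the toy labels are assumptions for illustration; scikit-learn does this internally and more efficiently:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

parent = ['red'] * 5 + ['green'] * 5

# A split that separates the classes perfectly
good_gain = information_gain(parent, ['red'] * 5, ['green'] * 5)

# A split that leaves both sides almost as mixed as before
poor_gain = information_gain(
    parent,
    ['red', 'red', 'green', 'green', 'red'],
    ['green', 'green', 'red', 'red', 'green'])
```

The tree prefers the split with the higher information gain, i.e. the one whose children are the most homogeneous.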

As for the confusion matrix:

[[62 6] [ 3 29]]

Checking the visualisation of the output is pretty intuitive if you understand the idea of the decision trees and splitting used.

*Ensemble Learning* is when you take multiple ML algorithms and combine them into a final one. The *random forest* method combines a number of *decision tree* algorithms.

Steps:

1. Pick at random K data points from the Training set.
2. Build the `Decision Tree` associated with these K data points.
3. Choose the number `Ntree` of trees you want to build and repeat steps 1 and 2.
4. For a new data point, make each one of your Ntree trees predict the category to which the data point belongs and assign the new data point to the category that wins the majority vote.
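The sampling and voting steps above can be sketched in plain Python. The tree predictions are hypothetical placeholders, and sampling with replacement is an assumption (the steps only say "at random"):

```python
import random
from collections import Counter

def pick_random_points(rows, k, rng):
    """Step 1: pick K data points at random (with replacement, by assumption)."""
    return [rng.choice(rows) for _ in range(k)]

def majority_vote(predictions):
    """Step 4: the category predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
training_rows = list(range(100))   # stand-in for row indices of the training set
sample = pick_random_points(training_rows, k=10, rng=rng)

# Hypothetical predictions of Ntree = 10 trees for one new data point
tree_predictions = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
final_prediction = majority_vote(tree_predictions)
```

Each tree sees a different random sample, so individual trees make different mistakes and the vote tends to cancel them out.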

The "power of the crowd" makes this classification quite useful for getting rid of particular uncertainties. It was used for things such as the Kinect for Xbox.

Classification code:

```python
# Fitting Random Forest to the Training set
# Create your Random Forest here
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(
    n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
```

Confusion Matrix:

[[63 5] [ 4 28]]

Be careful - we want to *prevent overfitting*. Remember: Overfitting is an issue within machine learning and statistics. It occurs when we build models that closely explain a training data set, but fail to generalize when applied to other data sets.

Given a set of results, we wanted to take what we already know and use that projected data to build out a prediction model.

Example - We started by taking four random independent variable values, and any prediction below a particular value (0.5) was projected to the floor (0), and any above it to the ceiling (1). Given the actual dependent variable data, we compare what we know against the model to see what the predicted equivalent would be.

We can get True Positive, False Positive, False Negative and True Negative. In both cases, we want the `True` values!

The y axis is the `Actual DV`, and the x axis is the `Predicted DV`.

| TP | FP |
| --- | --- |
| FN | TN |

```
> accuracy rate = correct / total
> error rate = wrong / total
```

If we stopped using the model and simply predicted 0 for everything, the accuracy rate could actually go up, even though we are no longer predicting anything useful. Be wary of this paradox (the accuracy paradox).
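A small sketch of the two rates and of the paradox. The counts below are hypothetical numbers for an imbalanced data set of 10000 rows; they are chosen for illustration and are not results from the course data sets:

```python
def accuracy_rate(tn, fp, fn, tp):
    """Correct predictions divided by all predictions."""
    return (tn + tp) / (tn + fp + fn + tp)

def error_rate(tn, fp, fn, tp):
    """Wrong predictions divided by all predictions."""
    return (fp + fn) / (tn + fp + fn + tp)

# Hypothetical model: most rows are negatives and the model does a decent job
model_accuracy = accuracy_rate(tn=9700, fp=150, fn=50, tp=100)

# Same data, but we "predict 0" for every row: the 150 predicted positives
# become negatives, so the 100 true positives turn into false negatives
always_zero_accuracy = accuracy_rate(tn=9850, fp=0, fn=150, tp=0)
```

Here the "always 0" strategy scores a higher accuracy rate than the model, which is why accuracy alone is not enough to judge a classifier.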

Imagine a horizontal axis `Total Contacted` up to 100000, and a vertical axis `Purchased` up to 10000.

Can we get more customers to purchase while contacting fewer customers? How can we pick and choose which customers to contact? The curve that plots purchases against customers contacted is known as the CAP (Cumulative Accuracy Profile), and the larger the area underneath the model's curve, the better.

The ideal line would come from contacting exactly the 10% of customers who purchase, so that 100% of those contacted buy. Ideal, but unlikely.

Now that we know how it can work, how can we analyse the CAP?

The standard approach to calculate the efficiency is this:

$AR = \frac{a_R}{a_P}$ = (area between the model curve and the random line) / (area between the perfect model curve and the random line)

The second approach is to look at the 50% mark on the horizontal axis, find where it intersects the model line, and read off the corresponding value on the vertical axis. That number is then used for assessing the model.
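The first approach can be sketched as follows. This is an illustrative implementation under stated assumptions: the scores and purchase flags are made-up toy data, the areas use the trapezoidal rule, and the random line is taken as the diagonal with area 0.5:

```python
def cap_curve(scores, purchased):
    """Cumulative share of purchasers found when contacting customers
    in order of decreasing model score."""
    ranked = [p for _, p in sorted(zip(scores, purchased), key=lambda pair: -pair[0])]
    total_purchasers = sum(ranked)
    curve, found = [0.0], 0
    for p in ranked:
        found += p
        curve.append(found / total_purchasers)
    return curve

def area_under(curve):
    """Trapezoidal area with the horizontal axis scaled to [0, 1]."""
    step = 1 / (len(curve) - 1)
    return sum((a + b) / 2 * step for a, b in zip(curve, curve[1:]))

def accuracy_ratio(scores, purchased):
    model_area = area_under(cap_curve(scores, purchased))
    perfect_area = area_under(cap_curve(purchased, purchased))
    random_area = 0.5  # the diagonal random line
    return (model_area - random_area) / (perfect_area - random_area)

purchased = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
perfect_scores = [0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3, 0.2]
weaker_scores = [0.9, 0.2, 0.3, 0.1, 0.5, 0.4, 0.2, 0.6, 0.3, 0.2]

ar_perfect = accuracy_ratio(perfect_scores, purchased)
ar_weaker = accuracy_ratio(weaker_scores, purchased)
```

A model that ranks every purchaser first reaches an AR of 1, and the ratio drops as purchasers slip further down the ranking.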

Generally, the numbers go like so:

X | Value |
---|---|
X < 60% | Rubbish |
60% < X < 70% | Poor |
70% < X < 80% | Good |
80% < X < 90% | Very Good |
90% < X < 100% | Too Good (be wary) |

Be careful if it goes over 90%. You could be overfitting, and the anomalies you meet down the track might not relate to the trained model.

Same as for regression models, you first need to figure out whether your problem is linear or non linear. You will learn how to do that in Part 10 - Model Selection. Then:

If your problem is linear, you should go for Logistic Regression or SVM.

If your problem is non linear, you should go for K-NN, Naive Bayes, Decision Tree or Random Forest.

Then which one should you choose in each case ? You will learn that in Part 10 - Model Selection with k-Fold Cross Validation.

Then from a business point of view, you would rather use:

Logistic Regression or Naive Bayes when you want to rank your predictions by their probability. For example if you want to rank your customers from the highest probability that they buy a certain product, to the lowest probability. Eventually that allows you to target your marketing campaigns. And of course for this type of business problem, you should use Logistic Regression if your problem is linear, and Naive Bayes if your problem is non linear.

SVM when you want to predict to which segment your customers belong to. Segments can be any kind of segments, for example some market segments you identified earlier with clustering.

Decision Tree when you want to have clear interpretation of your model results,

Random Forest when you are just looking for high performance with less need for interpretation.

In Part 10 - Model Selection, you will find the second section dedicated to Parameter Tuning, that will allow you to improve the performance of your models, by tuning them. You probably already noticed that each model is composed of two types of parameters:

- the parameters that are learnt, for example the coefficients in Linear Regression,
- the hyperparameters.

The hyperparameters are the parameters that are not learnt and that are fixed values inside the model equations. For example, the regularization parameter lambda or the penalty parameter C are hyperparameters. So far we used the default value of these hyperparameters, and we haven't searched for their optimal value so that your model reaches even higher performance. Finding their optimal value is exactly what Parameter Tuning is about. So for those of you already interested in improving your model performance and doing some parameter tuning, feel free to jump directly to Part 10 - Model Selection.

Think of a scatter plot. K-Means is used to help create *clusters* of groups. You can have as many IV as required.

Steps:

1. Choose the number of clusters K
2. Select at random K points, the centroids (not necessarily from your dataset)
3. Assign each data point to the closest centroid -> this forms K clusters
4. Compute and place the new centroid of each cluster
5. Reassign each data point to the new closest centroid. If any reassignment took place, go back to step 4.

It is basically just an iterative process that you continue until the centroids converge to a place where data points are no longer reassigned.
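The loop can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper names are assumptions, and the starting centroids are fixed instead of random so the run is repeatable:

```python
import math

def closest_centroid(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def centre_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, initial_centroids):
    centroids = list(initial_centroids)
    assignment = None
    while True:
        # Assign (or reassign) every point to its closest centroid
        new_assignment = [closest_centroid(p, centroids) for p in points]
        if new_assignment == assignment:
            return centroids, assignment   # no reassignment took place: converged
        assignment = new_assignment
        # Compute and place the new centroid of each cluster
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                    # an empty cluster keeps its old centroid
                centroids[k] = centre_of(members)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, assignment = k_means(points, [(0, 0), (10, 10)])
```

With these toy points the first three end up in one cluster and the last three in the other after a single centroid update.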

Think of three easily determined clusters. The question is, if we pick bad random initialisation points for the centroids, can we run into an issue with the final clusters computed? Yes. This is the initialization trap.

How can we combat this? It is not so straight forward. The solution is a K-Means++ algorithm. It is quite an involved approach.

What is the metric to evaluate how a certain number of clusters performs? It's called the *Within Cluster Sum of Squares* (WCSS).

For each cluster, you take the distance from every point in that cluster to the cluster's centroid, square it, and sum the results. You then add up these sums over all clusters: $WCSS = \sum_{k=1}^{K} \sum_{P_i \in \text{cluster } k} \text{distance}(P_i, C_k)^2$.
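As a small worked sketch of WCSS, with made-up points and centroids (the function name `wcss` and the data layout are assumptions for illustration):

```python
import math

def wcss(clusters):
    """Within Cluster Sum of Squares.

    `clusters` is a list of (centroid, points) pairs.
    """
    return sum(math.dist(point, centroid) ** 2
               for centroid, points in clusters
               for point in points)

clusters = [
    ((1.0, 1.0), [(0.0, 1.0), (2.0, 1.0)]),   # each point is 1 away from its centroid
    ((5.0, 5.0), [(5.0, 4.0), (5.0, 6.0)]),   # same here
]
total = wcss(clusters)
```

In scikit-learn this value is available after fitting as `kmeans.inertia_`, which is what the elbow method code in these notes collects.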

How do we find the optimal number of clusters, given that the WCSS keeps improving as we add more clusters? We look at the drop-off after each increment in the number of clusters. We use the *elbow method* to see where the drop-off goes from dramatic to small. That is a judgement call that you need to make as a data scientist.

```python
# K-Means++ Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find optimal number of clusters
from sklearn.cluster import KMeans
"""
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
# plt.show()
"""

# Applying k-means to the mall dataset
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualising the cluster
# y_kmeans == 0 is cluster 1
# 0, 1 for second arg are x,y
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1 - Careful')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2 - Standard')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3 - Target')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4 - Careless')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5 - Sensible')

# Plotting the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```

As for intuition, HC can end up with results very similar to K-Means Clustering, although the process for getting there is different.

There are two ways to do this:

- Agglomerative
- Divisive

1. Make each data point a single-point cluster -> this forms N clusters
2. Take the two closest data points and make them one cluster -> N - 1 clusters
3. Take the two closest *clusters* and make them one -> N - 2 clusters
4. Repeat step 3 until there is only one cluster left

This is a crucial part. Distance between two clusters can be measured as:

- Closest points
- Furthest points
- Average distances
- Distance between centroids
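The four options above can be sketched as small functions. The names and the two toy clusters are assumptions made for illustration:

```python
import math
from itertools import product

def closest_points(a, b):
    """Distance between the two closest points of clusters a and b."""
    return min(math.dist(p, q) for p, q in product(a, b))

def furthest_points(a, b):
    """Distance between the two furthest points of clusters a and b."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_distance(a, b):
    """Average of all pairwise distances between the clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the centroids of the two clusters."""
    centre = lambda pts: tuple(sum(axis) / len(pts) for axis in zip(*pts))
    return math.dist(centre(a), centre(b))

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

distances = {
    'closest': closest_points(cluster_a, cluster_b),
    'furthest': furthest_points(cluster_a, cluster_b),
    'average': average_distance(cluster_a, cluster_b),
    'centroids': centroid_distance(cluster_a, cluster_b),
}
```

The same two clusters get a different distance under each definition, which is why the choice affects which clusters are merged next.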

Dendrograms have all points on the X axis and the Euclidean distances on the Y axis. Each time two clusters are merged, they are joined by a bar drawn at the height of the distance between them, and you repeat this for every merge.

After setting a Euclidean distance threshold, we only accept merges whose dissimilarity is lower than the threshold, and that defines the clusters.

If you draw the threshold as a horizontal line across the dendrogram, the number of clusters will be equal to how many vertical lines the threshold goes through.

To decide the threshold, generally you look for the longest vertical arm that is not crossed by any extended horizontal line, and cut through it.

Plot the dendrogram, then look at it to decide how many clusters there should be by identifying the longest arm.

```python
# Using the dendrogram to find optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
```

```python
# Hierarchical Clustering Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the dendrogram to find optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

# Fitting hierarchical clustering to the mall dataset
# Note we use AgglomerativeClustering here
# Ward linkage always uses Euclidean distance, so no distance argument is passed
# (the original affinity='euclidean' argument was renamed in newer scikit-learn releases)
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)

# Visualising the cluster
# y_hc == 0 is cluster 1
# 0, 1 for second arg are x,y
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=100, c='red', label='Cluster 1 - Careful')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=100, c='blue', label='Cluster 2 - Standard')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=100, c='green', label='Cluster 3 - Target')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4 - Careless')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5 - Sensible')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```

Think of the correlation of why customers would buy nappies and beers.

`People who bought also bought...`

Apriori can also help us build rules based on what else has been done.

- Set a minimum support and confidence
- Take all subsets in transactions having higher support than minimum support
- Take all the rules of these subsets having higher confidence than minimum confidence
- Sort the rules by decreasing lift

```
support(M) = number of user watchlists containing M / number of user watchlists
support(I) = number of transactions containing I / number of transactions

confidence(M1 -> M2) = number of user watchlists containing M1 and M2 / number of user watchlists containing M1
confidence(I1 -> I2) = number of transactions containing I1 and I2 / number of transactions containing I1
```

Chances of people who liked movie 1 liking movie 2.

`What are the chances of recommending Ex Machina if they've seen Interstellar?`

Say 10 out of 100 liked Ex Machina from the total but 17.5% of those who watched Interstellar liked Ex Machina:

Lift = 17.5% / 10% = 1.75
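Support, confidence and lift can be sketched directly from their definitions. The toy baskets and function names below are assumptions for illustration and are unrelated to the `apyori` file used in the course code:

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Share of transactions containing `lhs` that also contain `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of lhs -> rhs divided by the plain support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

transactions = [
    ['nappies', 'beer', 'milk'],
    ['nappies', 'beer'],
    ['nappies', 'bread'],
    ['beer', 'crisps'],
    ['milk', 'bread'],
]

nappies_support = support(transactions, ['nappies'])
rule_confidence = confidence(transactions, ['nappies'], ['beer'])
rule_lift = lift(transactions, ['nappies'], ['beer'])
```

A lift above 1 means that buying the first item makes the second more likely than it is across all baskets, which is the signal Apriori sorts the rules by.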

For this example, we will actually use a file instead of a library.

Python code:

```python
# Apriori

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# 7500 customers and what they had in their basket
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
print(dataset)

# Need to prepare the data correctly for a list of lists
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Training Apriori on dataset
# REMEMBER: this depends on your dataset. You need to spend some time on the args here.
from apyori import apriori
rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)

# Visualising the results
results = list(rules)
for item in results:
    # first index of the inner list
    # Contains base item and add item
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # second index of the inner list
    print("Support: " + str(item[1]))

    # third index of the list located at 0th
    # of the third index of the inner list
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
```

Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to solve interacting problems where the data observed up to time t is considered to decide which action to take at time t + 1. It is also used for Artificial Intelligence when training machines to perform tasks such as walking. Desired outcomes provide the AI with reward, undesired with punishment. Machines learn through trial and error.

Think of a robotic dog: We can either give it an algorithm to follow, or we can give it all the options it has and give it a *reward* or *punishment* based on the choices it makes.

What is the problem?

A one-armed bandit is a slot machine (there is history behind the name). It is to do with the old levers and bandit comes from the fact that it takes your money. The *multi* comes into it when you think of many of these machines. How do you play them to maximise your return? Without knowing the distribution of chances how to win, we need to figure this out while spending the least amount of time.

A use case for things like this could be to figure out which ad gives us the best return.

- We have *d* arms, e.g. ads displayed to users on a page
- Each time a user connects to the page, this makes a round
- At each round *n*, we choose one ad to display to the user
- At each round *n*, ad *i* gives reward $r_i(n) \in \{0, 1\}$: $r_i(n) = 1$ if the user clicked on the ad *i* and 0 if they did not
- The goal is to maximize the total reward we get over many rounds

For each distribution at the start, we assume that they are all the same. For the first couple of rounds, they are basically all trial runs. We do that to create the initial *confidence bound*. As the agent observes the data, the value goes up or down based on each new observation.

Because we now have an extra observation, our confidence bound will get smaller as we become more confident.

The confidence bounds only have one task: have the expected value within the box bound.

Confidence bound

Confidence bound interval

As time goes on and the confidence bound converges because the result is *good*, it moves on to others to remain unbiased. We continually choose the option with the "upper" confidence bound as the next choice.

Given a scenario of 10000 rounds of ads being shown to users, we want to see how many users click on a specific ad (denoted 0 and 1).

Beginning here, we start with random selection over the 10000 rounds as a test, and we see it gets 1246 *rewards* in total. If we keep running the algorithm, you can see we continually get ~1200.

If we plot a histogram of the distribution of the ads, we notice that they are almost the same in the number of times chosen, since the selection is random and not using an educated guess.

There is no real easy package to implement UCB, so we write it from scratch.

**Steps:**

1. At each round *n*, we consider two numbers for each ad *i*:
   - $N_i(n)$ - the number of times ad *i* was selected up to round *n*
   - $R_i(n)$ - the sum of rewards of the ad *i* up to round *n*
2. From these two numbers we compute:
   - The average reward of ad *i* up to round *n*: $\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}$
   - The confidence interval at round *n*: $[\bar{r}_i(n) - \Delta_i(n), \bar{r}_i(n) + \Delta_i(n)]$ with $\Delta_i(n) = \sqrt{\frac{3}{2} \frac{\log(n)}{N_i(n)}}$
3. We select the ad *i* that has the maximum UCB $\bar{r}_i(n) + \Delta_i(n)$.

```python
# Implementing UCB
import math
import matplotlib.pyplot as plt
import pandas as pd

# Simulated click data: one row per round, one column per ad (0 or 1)
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sum_of_rewards = [0] * d
total_reward = 0

for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        # if statement for initial conditions
        if (numbers_of_selections[i] > 0):
            average_reward = sum_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upperbound = average_reward + delta_i
        else:
            # this large value is used to ensure we select the 10 different ads over first 10 rounds
            upperbound = 1e400
        if upperbound > max_upper_bound:
            max_upper_bound = upperbound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    # sum of rewards for specific ad
    sum_of_rewards[ad] = sum_of_rewards[ad] + reward
    # sum of all rewards over all ads
    total_reward = total_reward + reward

print(total_reward)  # 2178

# Visualising
plt.hist(ads_selected)
plt.title('Histogram of Ad Selections')
plt.xlabel('Ads')
plt.ylabel('Selections')
plt.show()
```

Thompson Sampling

Without delving deep into the math, this requires Bayesian Inference.

Again, we have no prior knowledge of the current situation. What Thompson Sampling will end up doing is actually creating a distribution based on returns.

These distributions are representing where we think the actual expected value might lie. Using this, we can generate our own "bandit" configuration.

**Steps:**

At each round $n$, we consider two numbers for each ad $i$

$N_i^1(n)$ - the number of times the ad $i$ got rewards 1 up to round $n$

$N_i^0(n)$ - the number of times the ad $i$ got rewards 0 up to round $n$

For each ad $i$ we take a random draw from the distribution below:

- $\theta_i(n)=\beta(N_i^1(n)+1,N_i^0(n)+1)$

We select the ad that has the highest $\theta_i(n)$

UCB and Thompson Sampling both solve the same problem, but there are pros and cons for both.

UCB:

- Deterministic
- Requires an update every round

Thompson Sampling:

- Probabilistic
- Can accommodate delayed feedback
- Better empirical evidence

```python
# Thompson Sampling

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# Note this dataset is just for simulation
# Observing if 10000 users click on an ad (0 or 1)
# We are not showing the ads at random - the algorithm chooses them
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

# Implementing Thompson Sampling
import random
N = 10000
d = 10
ads_selected = []
numbers_of_rewards_1 = [0] * d
numbers_of_rewards_0 = [0] * d
total_reward = 0

for n in range(0, N):
    ad = 0
    max_random = 0
    for i in range(0, d):
        # random draw from the beta distribution of ad i
        random_beta = random.betavariate(
            numbers_of_rewards_1[i] + 1, numbers_of_rewards_0[i] + 1)
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    reward = dataset.values[n, ad]
    if reward == 1:
        numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1
    else:
        numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1
    total_reward = total_reward + reward

# the total varies from run to run because the draws are random
print(total_reward)

# Visualising
plt.hist(ads_selected)
plt.title('Histogram of Ad Selections')
plt.xlabel('Ads')
plt.ylabel('Selections')
plt.show()
```

You need data and a lot of data + a bunch of processing power.

But what is deep learning? A lot of it is based on mimicking the mind. A lot of the terminology is based on terms like "neurons".

Thinking of the network, we can think of the input layer which consists of input values, an output layer at the end with the output value and a hidden layer between. So all the input nodes are connected to the hidden layer, and the hidden layer is connected to the output layer.

Deep learning occurs when we have lots and lots of hidden layers.

Geoffrey Hinton is a good reference of someone leading the field.

What will we learn?

- The neuron
- The activation functions and examples
- How neural networks work
- How neural networks learn
- Gradient Descent
- Stochastic Gradient Descent
- Backpropagation

How can we recreate the neuron in the machine? We mimic how neurons and their networks look in the brain. The neuron itself consists of the main body, the dendrites and the axon.

Neuron diagram

Conceptually, the dendrites of a neuron are connected to the axons of other neurons. The connection across which an impulse is passed is the synapse. Synapse is an important term.

The neuron in our case gets a number of input signals and gives an output signal. For the sake of understanding, input values can be represented in yellow and the neuron in green. The join between an input and the neuron is the **synapse**.

Input values can themselves just be standardized **independent variables**.

Additional reading: *Efficient BackProp* by Yann LeCun.

What can the output value be? Can be continuous (price), binary (will exit yes/no), categorical.

An important thing to note is that both the inputs and the output are for a **single observation**.

On each synapse, there are also **weights**. These values are important for the process: it is the weights that are adjusted during the learning process.

Basic diagram

A simple function that defines that if a value is less than 0, pass a 0, else pass 1.

Threshold

$\phi(x)=\frac{1}{1+e^{-x}}$

What is good about this function is that it is smooth. A gradual progression. It is very useful for the final layer - especially for things like probability.

Sigmoid

$\phi(x)=max(x,0)$

Stays at 0 for all negative inputs, then grows linearly with the input once it is positive.

Rectifier

Rectifier function is one of the most used.

$\phi(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$
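The four activation functions can be written out as a short sketch. The function names are assumptions chosen here, and the last one is the hyperbolic tangent, which the standard library also provides as `math.tanh`:

```python
import math

def threshold(x):
    """Pass 0 if the value is less than 0, else pass 1."""
    return 0 if x < 0 else 1

def sigmoid(x):
    """Smooth curve between 0 and 1, handy for probabilities."""
    return 1 / (1 + math.exp(-x))

def rectifier(x):
    """0 for negative inputs, the input itself otherwise."""
    return max(x, 0)

def hyperbolic_tangent(x):
    """Smooth curve between -1 and 1."""
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))

inputs = [-2.0, 0.0, 2.0]
outputs = {
    'threshold': [threshold(x) for x in inputs],
    'sigmoid': [sigmoid(x) for x in inputs],
    'rectifier': [rectifier(x) for x in inputs],
    'tanh': [hyperbolic_tangent(x) for x in inputs],
}
```

Comparing the lists side by side shows the main difference: threshold and sigmoid stay within 0 and 1, the rectifier is unbounded above, and the hyperbolic tangent also goes below 0.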

An important part of NNs are training them up, but for now let's just focus on how they work (pretend it is already trained up) so we can see the application we are working towards.

Say we have 4 input variables: Area (feet), Bedrooms, Distance to city (Miles) and Age.

The output variable we want is the price. So what we want are the weights that give a good price calculation.

The power here comes from that hidden layer.

Hidden layer

If we walk through this, we assume that some weight will have a zero value and some others will have a non-zero value.

Each neuron itself could pick up all or some of the input values after being trained.

If you think of a perceptron (single-layer feed forward), think of $y$ as the actual value and $\hat{y}$ as the output value from the output layer. To learn, we need to compare this output value to the actual value. We can plot both on a graph and calculate the cost function $C = \frac{1}{2}(\hat{y}-y)^2$. We then feed this information back into the neural network to adjust the weights. We want the cost function to tend towards zero.

One **epoch** is when we go through all the data and train based on the rows.

One epoch

Again, optimal weights are found when you reduce the cost function as much as possible. The whole process is known as **back propagation**.

Gradient descent is a way of finding the lowest **C** value over the iterations. If we map the cost function on a graph, it can look like a parabola. We look at the slope of the cost function at our current point and step downhill.

What happens if our cost function is not convex, i.e. not just a parabola? How do we find the minimum of the cost function?

Normal gradient descent is when we take all of our rows and adjust the weights after calculating the cost over all of them. This is also known as **Batch Gradient Descent**. With **Stochastic Gradient Descent**, we take the rows one by one: run the network on a row, look at the cost function and then adjust the weights before moving on to the next row.
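The difference between the two can be sketched with a single weight and the cost function $C = \frac{1}{2}(\hat{y}-y)^2$, where $\hat{y} = w x$. The toy rows, learning rate and function names are assumptions for illustration, and the stochastic version here visits the rows in a fixed order instead of shuffling them:

```python
# Toy rows (x, y) generated from y = 2x, so the ideal weight is 2
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_gradient_descent(rows, w=0.0, learning_rate=0.05, epochs=200):
    for _ in range(epochs):
        # dC/dw summed over ALL rows, then one weight update per epoch
        gradient = sum((w * x - y) * x for x, y in rows)
        w -= learning_rate * gradient
    return w

def stochastic_gradient_descent(rows, w=0.0, learning_rate=0.05, epochs=200):
    for _ in range(epochs):
        for x, y in rows:
            # dC/dw for ONE row, followed immediately by a weight update
            w -= learning_rate * (w * x - y) * x
    return w

w_batch = batch_gradient_descent(rows)
w_stochastic = stochastic_gradient_descent(rows)
```

Both versions end up near the same weight on this convex toy problem; the stochastic one simply updates three times per epoch instead of once.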

We know there is a process called forward propagation to get the values for our $\hat{y}$'s. Backpropagation is an advanced algorithm that allows us to adjust all the weights simultaneously.

- Randomly initialise the weights to small numbers close to 0 (but not 0)
- Input the first observation of your dataset in the input layer, each feature in one input node
- Forward propagation (left to right): neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y.
- Compare the predicted result to the actual result. Measure the generated error.
- Back-propagation: from right to left, the error is back propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.

The Churn_Modelling.csv file contains bank data. The bank is trying to figure out why customers are leaving at high rates (churn), and they want you to assess and address the problem.

They have been watching a sample of 10,000 customers. If a customer left, the *exited* field is 1, else 0.

We need to create a model of customers that are at high risk of leaving.

```shell
pip install --upgrade git+https://github.com/Theano/Theano.git#egg=Theano
pip install --upgrade tensorflow
pip install --upgrade keras

# if conda installing tensorflow
conda install -c conda-forge tensorflow
conda install -c conda-forge keras
```

- **Theano** - a library to help make use of the graphics card
- **Tensorflow** - an open source numerical computations library
- **Keras** - a library that wraps both Theano and Tensorflow, abstracting them so you can build powerful deep learning models with short amounts of code

For install issues, check Stack Overflow

Given the answer we are looking for (will the customer leave or not), this is a classification problem.

As we look through all the data, we need to decide which independent variables may impact the dependent variable.

For this, we first start by setting our dataframes for our IV and DV.

```python
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
# Columns 3 to 12 are the independent variables, column 13 is the dependent variable
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
```

Since we have some string categories, we need to encode these categories.

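```python
# Clean the categorical variables
# Encoding categorical data
# We need to encode both gender and country
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode country
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
# Encode gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# One-hot encode country (column 1). The original notes used
# OneHotEncoder(categorical_features=[1]); that argument is no longer
# available in recent scikit-learn releases, so ColumnTransformer is used here.
# The one-hot columns come first in the output, the other columns follow.
onehotencoder = ColumnTransformer(
    [('country', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(onehotencoder.fit_transform(X), dtype=float)
# Remove one dummy column to avoid the dummy variable trap
X = X[:, 1:]
```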

A good link on the Dummy Variable Trap

Dummy Variable Trap

**Steps:**

- Randomly init weights to small numbers close to 0
- Input the first observation of your dataset in the input layer, each feature in one input node
- Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y.
- Compare predicted result to actual result. Measure generated error.
- Back-Propagation: from right to left, the error is back-propagated.
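- Repeat steps 1 to 5 and update the weights either after each observation (what the course calls Reinforcement Learning, i.e. stochastic updates) or only after a batch of observations (Batch Learning).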
- When the whole training set passed through the ANN, that makes an epoch. Redo more epochs.

```python
# Classification template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Clean the categorical variables
# Encoding categorical data
# We need to encode both gender and country
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Encode country
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
# Encode gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# One-hot encode country. The original notes used
# OneHotEncoder(categorical_features=[1]), which recent scikit-learn
# releases no longer accept, so ColumnTransformer is used here.
onehotencoder = ColumnTransformer(
    [('country', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(onehotencoder.fit_transform(X), dtype=float)
# Remove one dummy variable to avoid the dummy variable trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Building the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer (11 input features)
classifier.add(Dense(units=6, kernel_initializer='uniform',
                     activation='relu', input_shape=(11,)))

# Add second hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))

# Add output layer - use softmax for activation if you have more than 2 categories for the DV
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

# Compiling the ANN
classifier.compile(
    optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size=10, epochs=100)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Convert probabilities into True/False with a 0.5 threshold
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The example given deals with ambiguous images that could be read as two images in one.

Mixed image

The example above is the rabbit vs duck. The brain changes how it processes the image based on what it sees.

The other example given was the face with four eyes and two mouths to illustrate the point that the brain finds it hard to comprehend certain images.

Yann LeCun is regarded as the grandfather of CNNs. His big findings were made in the 80s and 90s.

How does it work? We have an input image that goes through the CNN to an output label (image class). For each pixel, the **computer sees a value between 0 and 255**. For a black and white image there is only one channel, so the computer sees a 2D array; for coloured images it is a 3D array with red, green and blue channels.
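As a quick sketch of what "the computer sees", the nested lists below stand in for a tiny 2x2 image. The pixel values and the `shape` helper are made up for the illustration.

```python
# A 2x2 black and white image: one channel, so a 2D array of values 0-255
gray_image = [
    [0, 255],
    [128, 64],
]

# A 2x2 colour image: every pixel holds [red, green, blue], so a 3D array
colour_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def shape(array):
    """Return the dimensions of a nested list, e.g. (2, 2, 3)."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)

# Scaling the values to the range 0-1 is a common preprocessing step
scaled = [[pixel / 255 for pixel in row] for row in gray_image]

print(shape(gray_image), shape(colour_image))
```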

**Steps**:

- Convolution
- Max Pooling
- Flattening
- Full Connection

The initial paper can be found here if you want to read

$(f*g)(t) \stackrel{\text{def}}{=} \int_{-\infty}^{\infty} f(\tau)\,g(t-\tau)\,d\tau$

More reading if you want a good intro can be found here

What is a convolution in intuitive terms? We have an **input image** and a **feature detector**. The feature detector is also known as a *kernel* or a *filter*.

Feature detector

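The step by which the filter moves across the image is called the **stride**.

The **feature map** is calculated by multiplying the feature detector element-wise with each patch of the image it covers and summing the result. It can also be called a *convolved map* or an *activation map*.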
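A minimal sketch of this operation in plain Python is shown below. The function name, the 5x5 binary image and the 3x3 feature detector are assumptions for the example; like CNN layers, it slides the detector without flipping it and only covers positions where the detector fits entirely inside the image.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image and sum the element-wise products."""
    k = len(kernel)
    feature_map = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 5x5 input image (1 = feature present in that pixel)
image = [
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
]

# Hypothetical 3x3 feature detector
feature_detector = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

print(convolve2d(image, feature_detector))            # [[2, 3, 3], [1, 2, 3], [3, 1, 2]]
print(convolve2d(image, feature_detector, stride=2))  # [[2, 3], [3, 2]]
```

A larger stride produces a smaller feature map, and higher values in the map mark the places where the image matches the feature detector most closely.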

Are we losing information when we apply the feature detector? Yes, but we are looking to detect parts of the image that are integral. In our personal life, we don't look at every pixel, we look at features.

To create our first **convolution layer**, we create many feature maps.

Many maps

Applying these filters is also what happens when we apply image filters. Things like *sharpen*, *blur*, *edge detect* etc. are basic applications of these filters. The output feature map is the filtered image.

During this small step, you apply a rectifier function (ReLU) to **increase non-linearity**. The example given was a filtered image whose shading goes linearly from white through grey to black. The rectifier function sets the negative values to zero, which breaks up this linearity.
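The rectifier itself is a one-liner. The sketch below applies it element-wise to a made-up feature map that contains negative values.

```python
def relu(x):
    """Rectifier: negative values become 0, positive values pass through."""
    return max(0, x)

# Hypothetical feature map with negative values
feature_map = [
    [-3, 0, 2],
    [5, -1, -4],
]

rectified = [[relu(value) for value in row] for row in feature_map]
print(rectified)  # [[0, 0, 2], [5, 0, 0]]
```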

A link to more information on ReLU vs the others can be found here

What is pooling and why do we need it? Think of a cheetah image where it is positioned properly, one where it is rotated, and another where it is squashed. We want the network to recognise the cheetah in every case, so it needs some tolerance to such distortions.

In our case, we are going to apply **max pooling**. Max pooling again looks at small sections of the feature map, and the **pooled feature map** outputs the max of each section. This reduction helps with a number of things: it lowers the processing power needed, reduces the number of parameters (which helps prevent overfitting) and preserves the features while accounting for spatial or feature distortions.
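Here is a small sketch of max pooling in plain Python. The function name and the 4x4 feature map are assumptions; by default the stride equals the pool size, and windows that do not fit entirely inside the map are skipped.

```python
def max_pool(feature_map, pool_size=2, stride=None):
    """Return the pooled feature map: the max of each pool_size x pool_size section."""
    stride = stride or pool_size
    pooled = []
    for r in range(0, len(feature_map) - pool_size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - pool_size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(pool_size) for j in range(pool_size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map
feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 16 values are reduced to 4, yet the strongest response in each section survives even if it shifts by a pixel inside its window.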

Here is a good read on max pooling (*downsampling*).

A great visualisation tool can be found here

Max pooling

With the pooled feature map, we flatten it into a column.

Flattening

After flattening, the column becomes the input layer of a fully connected ANN: the flattened values feed into a fully connected hidden layer (in a CNN the hidden layers need to be fully connected) and then into the output layer.

Full connection

In the above, we have an example of a classification output layer. The neurons in the last fully connected layer respond to particular features, and the weights between them and each output neuron determine how much each feature contributes to each class.

Link to The 9 Deep Learning Papers...

Softmax

The softmax function makes the values of our **output layer sum to 1**.

After applying the softmax function, we apply the cross-entropy function. It serves as the **loss** function, which we want to minimize.
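A minimal sketch of softmax in plain Python follows. The function name and the raw scores are made up; subtracting the maximum score first is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw scores from an output layer with three classes
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```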

$H(p,q) = -\sum_{x} p(x) \log q(x)$
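To see the formula at work, the sketch below compares two hypothetical predictions for the same image, where $p$ is the actual label (one-hot) and $q$ is the predicted probability. The function name and the numbers are assumptions for the example.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum over x of p(x) * log(q(x)); terms with p(x) = 0 contribute nothing."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

actual = [1.0, 0.0]      # the image really is a dog: [dog, cat]
confident = [0.9, 0.1]   # one network is 90% sure it is a dog
hesitant = [0.6, 0.4]    # another network is only 60% sure

loss_confident = cross_entropy(actual, confident)
loss_hesitant = cross_entropy(actual, hesitant)
print(loss_confident, loss_hesitant)
```

Both predictions pick the right class, so the classification error is the same, but cross-entropy gives the hesitant network a larger loss.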

Say we have two neural networks and a few images of dogs and cats, and we want to compare what the networks predict. After evaluating, we can check things like the *classification error* (not a great measure), *mean squared error* (which is more accurate) and *cross-entropy* (also more accurate). So why use cross-entropy over mean squared error? There are a few advantages. One is:

- At the start of back propagation, if the output value is far smaller than the actual value, the gradient under mean squared error can be very low, so the network is slow to start adjusting the weights. Cross-entropy helps with this as it has $\log$ in the calculation.

To get a better understanding, check out this YouTube video

For some reading on cross entropy, check out this reading

With these images, we cannot put the DV in the same array as the image data. We could instead write some code to extract the word "cat" or "dog" from the file name to create the DV. Another (better) solution is using **Keras**, which can take the labels from the folder structure.

The final code looks like so:

```python
# Part 1 - Building CNN

# Importing the dataset structure with Keras
# First structure pillar is to separate the images into test_set and training_set
# Second, each of these folders is split by the DV: cats and dogs.
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
# Dense is used for the fully connected layers
from keras.layers import Dense

# Initialising the CNN
classifier = Sequential()

# Step 1 - Convolution
# filters is the number of feature maps
# kernel_size is the feature detector matrix size
# input_shape is 3d: 64, 64 is the size of each 2d colour array
# (kept small because of CPU time) and 3 is the number of colour
# channels, given last (for the tensorflow backend)
classifier.add(Conv2D(
    filters=32, kernel_size=(3, 3), input_shape=(64, 64, 3), activation='relu'))

# Step 2 - Max pooling
# pool_size is how big we want our pooling matrix
# Since we don't set the strides tuple, strides will default
# to the pool_size
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Optional, create another Conv + Max Pooling layer
classifier.add(Conv2D(
    filters=32, kernel_size=(3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Step 3 - Flattening
# Turns the pooled feature maps into one long, flat vector
classifier.add(Flatten())

# Step 4 - Full connection
# Adding the fully connected hidden layer
classifier.add(Dense(units=128, activation='relu'))
# If it wasn't a binary output, we would use softmax
classifier.add(Dense(units=1, activation='sigmoid'))

classifier.compile(optimizer='adam',
                   loss='binary_crossentropy', metrics=['accuracy'])

# Part 2 - Fitting CNN to images
from keras.preprocessing.image import ImageDataGenerator

# Some of the args are for applying random transformations for training
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

# Only requires rescale since the test images don't need transforms
test_datagen = ImageDataGenerator(rescale=1./255)

# target_size matches the 64x64 input_shape above
training_set = train_datagen.flow_from_directory(
    'dataset/training_set',
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

test_set = test_datagen.flow_from_directory(
    'dataset/test_set',
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

# The notes set steps_per_epoch to the number of images in the training set.
# Note: in Keras 2 the value counts batches per epoch, not images, and newer
# releases deprecate fit_generator in favour of fit.
classifier.fit_generator(
    training_set,
    steps_per_epoch=8000,
    epochs=25,
    validation_data=test_set,
    validation_steps=2000)
```

After training the model, if you want to use it and make predictions, you can save and reload that model. Check here for more information

In Classification, we only worked with datasets comprised of **only two independent variables**. This is because:

- We needed two dimensions to visualise how ML models worked.
- Because whatever the original number of IVs is, we can often end up with two independent variables by applying an appropriate Dimensionality Reduction technique.

Feature selection techniques covered in Regression (Part 2) included **Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison and more**.

In this part, we will cover the following **Feature Extraction** techniques:

- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Kernel PCA
- Quadratic Discriminant Analysis (QDA)

PCA is one of the most used unsupervised algorithms. It is used for tasks such as:

- Noise filtering
- Visualization
- Feature extraction
- Stock market predictions
- Gene data analysis

It is used to:

- Identify patterns in data
- Detect the correlation between variables

PCA

The goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $k$-dimensional subspace (where $k<d$). We want to:

- Standardize the data.
- Obtain the **Eigenvectors** and **Eigenvalues** from the covariance matrix or correlation matrix, or perform **Singular Value Decomposition**.
- Sort eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the number of dimensions of the new feature subspace $(k\le{d})$.
- Construct the projection matrix **W** from the selected $k$ eigenvectors.
- Transform the original dataset **X** via **W** to obtain a $k$-dimensional feature subspace **Y**.
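The steps above can be sketched with NumPy as follows. This is an illustration on synthetic data, not the scikit-learn implementation; the function name `pca_project` and the generated dataset are assumptions.

```python
import numpy as np

def pca_project(X, k):
    # Standardize the data
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigenvectors and eigenvalues of the covariance matrix
    # (eigh is used because the covariance matrix is symmetric)
    covariance = np.cov(X_std, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # Sort eigenvalues in descending order and keep the k largest
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    # Projection matrix W from the selected k eigenvectors (one per column)
    W = eigenvectors[:, order[:k]]
    # Transform X via W to obtain the k-dimensional subspace Y
    Y = X_std @ W
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return Y, W, explained_ratio

# Synthetic data: the second column is almost a multiple of the first
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

Y, W, explained_ratio = pca_project(X, k=2)
print(Y.shape, explained_ratio)
```

The `explained_ratio` values play the same role as `explained_variance_ratio_` in scikit-learn: they show how much of the variance each kept component accounts for.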

A great link on the mathematics behind it can be found here

A great visual link for intuition can be found here

For 2D, we can see how the relationship works for dimensionality reduction. The real power can be seen for the 3 dimensional space.

PCA in summary helps us to learn about the relationship between the X and Y values and find the list of principal axes. Be careful though, PCA is **highly affected** by outliers.

If we have $m$ independent variables, PCA extracts $p\le{m}$ new independent variables that explain most of the variance in the dataset *regardless of the dependent variable*. The fact that the DV is not considered is what makes PCA an unsupervised model.

What we want to do with the Wine.csv file is take the data and make a classification model such as logistic regression. That will help us recommend a wine. The predictions cannot be visualised with all the independent variables, so we apply dimensionality reduction techniques to obtain two variables that we can plot instead.

The confusion matrix (CM) will be 3×3 in this case, since there are three classes. The diagonal will still contain the correct predictions, while the off-diagonal cells contain the incorrect ones.
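As a small sketch of how to read such a matrix, the snippet below computes the accuracy from the diagonal. The counts are hypothetical and are not the results of the Wine model.

```python
# Hypothetical 3x3 confusion matrix: rows are actual classes, columns are predictions
confusion = [
    [14, 0, 0],
    [1, 15, 0],
    [0, 0, 6],
]

# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(correct, total, accuracy)
```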

```python
# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import ListedColormap

# Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling - must be applied to PCA and LDA
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# ! Applying PCA
from sklearn.decomposition import PCA
# n_components is the number of principal components we want
# Note: use None at first to see how much variance each component explains
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
# Only transform the test set: PCA must be fitted on the training set alone
X_test = pca.transform(X_test)

# We want to find which components explain the variance
# Uncomment, check the print out and then set n_components as needed
# explained_variance = pca.explained_variance_ratio_
# print(explained_variance)

# Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression(random_state=0)
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

# Produce confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


def plot_decision_regions(X_set, y_set, title):
    """Plot the classifier's regions and the points of the given set."""
    colours = ListedColormap(('red', 'green', 'blue'))
    X1, X2 = np.meshgrid(
        np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
        np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(
        X1, X2,
        regressor.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
        alpha=0.75, cmap=colours)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color=colours(i), label=j)
    plt.title(title)
    plt.xlabel('pc1')
    plt.ylabel('pc2')
    plt.legend()
    plt.show()


# Visualising the Training set results
plot_decision_regions(X_train, y_train, 'Logistic Regression (Training set)')

# Visualising the Test set results
plot_decision_regions(X_test, y_test, 'Logistic Regression (Test set)')
```

Note that in this example we had to set a third colour for the third recommendation.

PCA Plot

While LDA may seem similar to PCA, there are important differences. LDA is used as a preprocessing step for pattern classification.

LDA differs because, in addition to finding the component axes, we are interested in the axes that maximise the separation between multiple classes.

LDA

The goal of LDA is to project a feature space onto a smaller subspace while maintaining the class-discriminatory information.

Here is a good intro into Linear Discriminant Analysis

- Compute the $d$-dimensional mean vectors for the different classes from the dataset.
- Compute the scatter matrices (between-class and within-class scatter matrix).
- Compute the eigenvectors $(e_1,e_2,...,e_d)$ and corresponding eigenvalues $(\lambda_1,\lambda_2,...,\lambda_d)$ for the scatter matrices.
- Sort the eigenvectors by decreasing eigenvalues and choose $k$ eigenvectors with the largest eigenvalues to form a $d\times{k}$ dimensional matrix $W$ (where every column represents an eigenvector).
- Use this $d\times{k}$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication: $Y=X\times{W}$ (where $X$ is an $n\times{d}$-dimensional matrix representing the $n$ samples, and $Y$ are the transformed $n\times{k}$-dimensional samples in the new subspace).
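The steps above can be sketched with NumPy as follows. This is an illustration on synthetic data under assumptions (the function name `lda_project` and the three generated classes are made up); it solves the eigenproblem of $S_W^{-1}S_B$, which is a common textbook formulation and not necessarily what scikit-learn does internally.

```python
import numpy as np

def lda_project(X, y, k):
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))  # within-class scatter matrix
    S_B = np.zeros((n_features, n_features))  # between-class scatter matrix
    for label in np.unique(y):
        X_c = X[y == label]
        # d-dimensional mean vector of this class
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    # Eigenvectors and eigenvalues for the scatter matrices
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    eigenvalues, eigenvectors = eigenvalues.real, eigenvectors.real
    # Sort by decreasing eigenvalue and keep k eigenvectors as the columns of W
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]
    # Y = X x W: the samples in the new subspace
    return X @ W, W, eigenvalues[order]

# Synthetic data: three classes of 30 samples around different centres
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([centre + rng.normal(size=(30, 3)) for centre in centres])
y = np.repeat([0, 1, 2], 30)

Y, W, eigenvalues = lda_project(X, y, k=2)
print(Y.shape, eigenvalues)
```

With three classes, at most two eigenvalues are meaningfully greater than zero, which is why two components are enough here. Unlike the PCA sketch, the labels `y` are needed to build the scatter matrices.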

As opposed to PCA, LDA is a **supervised** model since it takes the dependent variable into consideration.

```python
# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import ListedColormap

# Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling - must be applied to PCA and LDA
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# ! Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# n_components is the number of linear discriminants we want
# Note: for LDA, we include y_train when fitting on X_train as LDA is supervised
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression(random_state=0)
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

# Produce confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


def plot_decision_regions(X_set, y_set, title):
    """Plot the classifier's regions and the points of the given set."""
    colours = ListedColormap(('red', 'green', 'blue'))
    X1, X2 = np.meshgrid(
        np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
        np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(
        X1, X2,
        regressor.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
        alpha=0.75, cmap=colours)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color=colours(i), label=j)
    plt.title(title)
    plt.xlabel('ld1')
    plt.ylabel('ld2')
    plt.legend()
    plt.show()


# Visualising the Training set results
plot_decision_regions(X_train, y_train, 'Logistic Regression (Training set)')

# Visualising the Test set results
plot_decision_regions(X_test, y_test, 'Logistic Regression (Test set)')
```

Know when to apply Kernel PCA. It is useful when the data **is not linearly separable.**

Mappingfunc

Before and after
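The intuition behind mapping to a higher dimension can be sketched in plain Python. Two concentric circles cannot be split by a straight line in 2D, but after adding $x_1^2 + x_2^2$ as a third feature a single threshold separates them. The explicit `mapping` function, the radii and the threshold are assumptions for the illustration; Kernel PCA itself uses the kernel trick and never computes such a mapping explicitly.

```python
import math

def circle(radius, n_points=8):
    """Points evenly spread on a circle around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n_points),
             radius * math.sin(2 * math.pi * i / n_points))
            for i in range(n_points)]

inner = circle(1.0)  # class 0
outer = circle(3.0)  # class 1, surrounding class 0

def mapping(point):
    """Hypothetical mapping function: add the squared distance from the origin."""
    x1, x2 = point
    return (x1, x2, x1 ** 2 + x2 ** 2)

inner_mapped = [mapping(p) for p in inner]
outer_mapped = [mapping(p) for p in outer]

# In the new third dimension a single threshold (a flat plane) separates the classes
threshold = 5.0
separable = (all(p[2] < threshold for p in inner_mapped)
             and all(p[2] > threshold for p in outer_mapped))
print(separable)
```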

```python
# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import ListedColormap

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature Scaling - must be applied to PCA and LDA
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# ! Applying Kernel PCA
from sklearn.decomposition import KernelPCA
# n_components is the number of principal components we want
# Note: unlike PCA, KernelPCA has no explained_variance_ratio_ to inspect
kpca = KernelPCA(n_components=2, kernel='rbf', random_state=0)
X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)

# Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression(random_state=0)
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

# Produce confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


def plot_decision_regions(X_set, y_set, title):
    """Plot the classifier's regions and the points of the given set."""
    colours = ListedColormap(('red', 'green'))
    X1, X2 = np.meshgrid(
        np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
        np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(
        X1, X2,
        regressor.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
        alpha=0.75, cmap=colours)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color=colours(i), label=j)
    plt.title(title)
    plt.xlabel('pc1')
    plt.ylabel('pc2')
    plt.legend()
    plt.show()


# Visualising the Training set results
plot_decision_regions(X_train, y_train, 'Logistic Regression (Training set)')

# Visualising the Test set results
plot_decision_regions(X_test, y_test, 'Logistic Regression (Test set)')
```

- If you run into a `MKL` error, check here.
- Updates to sklearn mean that `train_test_split` comes from `sklearn.model_selection`.