
Classification: Logistic Regression

This is a basic look at logistic regression, implemented with an example from a CSV file. While the CSV file with the data itself is excluded, this overview shows how to interpret the CSV in a particular way to obtain your dependent and independent variables.

Measuring the performance of these independent variables and reducing them to improve the model are not covered in this basic overview.

Note

The original text below includes mathematical formulas that do not translate into their rendered mathematical expressions on the blog. Some familiarity with LaTeX will be required to interpret the expressions used.

Logistic Regression Intuition

This section can be quite difficult - there will be some math.

We know about linear regression, multiple linear regression etc. (dependent variable on the y axis, independent variables on the x axis).

What happens if we classify things along a graph? E.g. 0 and 1 on the y axis and age on the x axis. This example is very black and white, but at the same time we can intuitively see some correlation.
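As a rough illustration, a plot like the one described could be sketched as follows. The ages and outcomes here are invented for illustration only; they are not the dataset used later.

# A minimal sketch of the 0/1 classification picture described above,
# using hypothetical example data
import matplotlib.pyplot as plt

ages = [18, 22, 25, 30, 35, 40, 45, 50, 55, 60]
outcomes = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]  # hypothetical purchased yes/no

plt.scatter(ages, outcomes)
plt.xlabel('Age')
plt.ylabel('Purchased (0 or 1)')
plt.title('Binary outcomes against age')
plt.show()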

In the example given above, we wouldn't use a linear model (as you could imagine). How about instead we interpreted the line as probabilities between 0 and 1? Between the point where the line crosses 0 and the point where it crosses 1, its value could be read as a probability; anything above or below those crossings could be capped at 100% and 0% respectively. This would be a VERY basic but sensible attempt to describe the model.

The scientific approach

If we take the linear equation $y = b_0 + b_1 x$, substitute it into the sigmoid function $p = \frac{1}{1 + e^{-y}}$, and solve for $y$, we arrive at $\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$. This last equation is the one for logistic regression.
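Written out in LaTeX, the substitution is a short piece of algebra (a sketch of the steps, not rendered blog output):

\[
p = \frac{1}{1 + e^{-y}}
\quad\Longrightarrow\quad
\frac{p}{1 - p} = e^{y}
\quad\Longrightarrow\quad
\ln\!\left(\frac{p}{1 - p}\right) = y = b_0 + b_1 x
\]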

# MAIN FORMULA
ln(p / (1 - p)) = b_0 + b_1 * x

Based on the above formula and plugging in the example data, we will get the best fitting line.

If we now take particular ages along the x axis - 20, 30, 40, 50 etc. - we can find $\hat{y}$, the predicted probability that the result will be a 1 or a 0: the higher the probability, the higher the chance of a 1. Any probability less than 0.5 is projected down to 0, whereas anything else is projected up to 1.
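A minimal sketch of that projection step, assuming hypothetical coefficients b0 and b1 rather than values fitted from real data:

import math

# Hypothetical coefficients - a fitted model would supply these
b0, b1 = -8.0, 0.2

for age in [20, 30, 40, 50]:
    y = b0 + b1 * age
    p = 1 / (1 + math.exp(-y))    # sigmoid: predicted probability
    y_hat = 1 if p >= 0.5 else 0  # project down below 0.5, up otherwise
    print(f"age={age}: p={p:.3f} -> y_hat={y_hat}")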

After applying the model, we can start drawing conclusions.

Implementation in Python

Using our standard setup, we want to predict whether or not a user purchased something based on their age and salary.

For accurate predictions, we use feature scaling, and we also split the data into a training set and a test set.

# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data/Social_Network_Ads.csv')

# We just want to estimate purchase using the Age and Estimated Salary
X = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values

# If you wish to check the lists
# print(X.tolist())
# print(y.tolist())

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling - we use this here for accurate predictions
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# Note: transform (not fit_transform) so the test set reuses the training set's scaling
X_test = sc_X.transform(X_test)
print(X_train.tolist())

Fitting the logistic regression model to the Training Set

# Fitting Logistic Regression to the Training Set

# Create the classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

In order to make a prediction on the X_test:

# y_pred will be the vector of predictions
y_pred = classifier.predict(X_test)
print(y_pred.tolist())
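If you want the underlying probabilities from the intuition section rather than the projected 0/1 values, scikit-learn also exposes predict_proba (the second column is the probability of class 1):

# Probabilities for each class; predict() thresholds these at 0.5
probabilities = classifier.predict_proba(X_test)
print(probabilities[:5].tolist())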

Checking the fit predictions using the Confusion Matrix

We do this by making a confusion matrix, which counts how many predictions were correct and incorrect for each class.

# Create the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix")
print(cm.tolist())
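The diagonal of the matrix holds the correct predictions. As a quick sanity check, accuracy can be derived from the matrix directly, or with scikit-learn's accuracy_score helper:

# Accuracy = correct predictions / total predictions
accuracy = (cm[0][0] + cm[1][1]) / cm.sum()
print(accuracy)

# Equivalent helper from scikit-learn
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))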

Visualising the predictive power using a graph

There is a lot of code required to visualise this:

# Visualising the Training Set results
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

# Build a dense grid over the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

# Colour each grid point by the classifier's prediction to show the regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Plot the actual observations, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)

plt.title('Logistic Regression (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
# plt.savefig('logistic-regression.png')
plt.show()
plt.close()

How do we interpret the graph?

The red points are the training set observations where the dependent variable purchased = 0; the green points are where purchased = 1.

In our example, the red points are users who did not buy the SUV, and the green points are those who did.

Given the axes, those who are younger with lower salaries (the cluster of red points) are largely those who didn't buy the SUV. We can see those with the higher salaries are more likely to have bought the SUV.

Another observation is that users older than the average were more likely to buy the SUV even with a lower salary.

What is the point of the classifiers?

The goal is to classify the right users into the right categories. We do this by plotting the prediction regions: in the graph, the red region is where the classifier predicts the user does not buy the SUV, and the green region is where it predicts the user does buy it.

The data points are the actual results; the regions are the classifier's estimates.

When we have a linear classifier, the boundary will always be a straight line.
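For logistic regression with our two features, the boundary is the set of points where the predicted probability is exactly 0.5, which reduces to a linear equation (sketched here in LaTeX):

\[
p = 0.5
\;\Longleftrightarrow\;
\ln\!\left(\frac{0.5}{1 - 0.5}\right) = 0
\;\Longleftrightarrow\;
b_0 + b_1 \cdot \text{Age} + b_2 \cdot \text{Salary} = 0
\]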

Checking the results when applied to the Test Set

The results we see here come from the same predictions that we evaluated with the confusion matrix earlier.

# Visualising the Test Set results
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test

# Build a dense grid over the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

# Colour each grid point by the classifier's prediction to show the regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Plot the actual test observations, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)

plt.title('Logistic Regression (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
# plt.savefig('logistic-regression.png')
plt.legend()
plt.show()
plt.close()

Repository

https://github.com/okeeffed/developer-notes-nextjs/content/machine-learning/cl-logistic-regression
