Regression

In a regression problem, the target is a continuously varying variable, such as the price of a house.

Example

```python
import pandas as pd

boston = pd.read_csv('boston.csv')
print(boston.head())
# ... output
```

We want to create feature and target arrays so that we can split our data.

```python
# Drop the target variable to get the features
X = boston.drop(['MEDV'], axis=1).values

# Keep only the target variable
y = boston['MEDV'].values

# Use only the number-of-rooms column as a single feature
X_rooms = X[:, 5]
type(X_rooms), type(y)  # (numpy.ndarray, numpy.ndarray)

# Reshape to 2-D: keep the first dimension and add a second dimension of size 1
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)
```

Plotting the house value vs number of rooms:

```python
import matplotlib.pyplot as plt

plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)')
plt.xlabel('Number of rooms')
plt.show()
# ... output
```

As you might expect, the resulting scatter plot shows a strong, roughly linear relationship between the number of rooms and the house value.

Fitting a regression model

```python
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# Fit the regressor to the data
reg.fit(X_rooms, y)

# Evenly spaced values spanning the range of X_rooms, reshaped to 2-D for predict()
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
```

To plot that line:

```python
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space), color='black', linewidth=3)
plt.show()
```

Importing data challenge

```python
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of y and X before reshaping
print("Dimensions of y before reshaping: ", y.shape)
print("Dimensions of X before reshaping: ", X.shape)

# Reshape X and y
y_reshaped = y.reshape(-1, 1)
X_reshaped = X.reshape(-1, 1)

# Print the dimensions of y_reshaped and X_reshaped
print("Dimensions of y after reshaping: ", y_reshaped.shape)
print("Dimensions of X after reshaping: ", X_reshaped.shape)
```

The basics of Linear Regression

We want to fit a line to the data (where the line is y = ax + b).

| Variable/Statement | Description |
| ------------------ | ----------- |
| y = ax + b | Line equation |
| y | Target variable |
| x | Single feature |
| a, b | Parameters of the model |

How do we choose a and b?

We need to define an error function for any line and choose the line that minimizes the error function.

The error function is also known as a cost function or loss function.

We want to minimize the vertical distance between the fitted line and each data point.

This distance is known as the residual. Because positive and negative residuals would cancel each other out, we use the sum of the squares of the residuals instead.

This is our loss function, and minimizing it is known as ordinary least squares (OLS): minimize the sum of the squares of the residuals.

When you call fit on a linear regression model, it performs OLS under the hood.
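
To make that concrete, here is a minimal sketch on hypothetical toy data (not the Boston dataset) that computes the OLS slope and intercept in closed form and checks that LinearRegression recovers the same values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y is roughly 3x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 1 + rng.normal(scale=0.5, size=100)

# Closed-form OLS for one feature: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()

# scikit-learn's fit() minimizes the same sum of squared residuals
reg = LinearRegression().fit(x.reshape(-1, 1), y)
print(a, b)                          # closed-form estimates
print(reg.coef_[0], reg.intercept_)  # should match up to floating-point error
```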

In higher dimensions, the linear regression equation becomes y = a1x1 + a2x2 + ... + anxn + b. Fitting the model means estimating the parameters a1, a2, ..., an and b.

The scikit-learn API works the same way. We pass two arrays:

  1. With the features.
  2. With the target variable.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)
```

The default score method for linear regression is R squared. For more details, see the documentation.
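
As a quick sanity check, the value returned by score can be reproduced from the definition of R^2 (one minus the ratio of the residual sum of squares to the total sum of squares). A minimal sketch, reusing reg_all, X_test, and y_test from the block above:

```python
import numpy as np
from sklearn.metrics import r2_score

y_pred = reg_all.predict(X_test)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)            # manual R^2
print(r2_score(y_test, y_pred))       # same value via sklearn.metrics
print(reg_all.score(X_test, y_test))  # same value via the default score method
```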

Note: You will never use Linear Regression out of the box like this. You will almost always want to use regularization.

Fit and predict regressor challenge

```python
# Import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
```

Train/test split for regression challenge

```python
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
# @see https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
```

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. See Wikipedia.
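
In other words, RMSE is just the square root of the mean squared residual. A small sketch of that equivalence, reusing y_test and y_pred from the challenge above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

residuals = y_test - y_pred

# RMSE computed directly from the residuals ...
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# ... matches the square root of scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse_manual, rmse_sklearn)
```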

Cross-validation

What is a potential pitfall of what we have done?

If we calculate R^2 on the test set, the value depends on how we split up the data. The test set may have peculiarities that make it hard to generalize.

We use a technique called cross-validation to solve this problem.

We split the dataset into 5 groups (or 5 folds).

For split one, we hold out the first fold as a test set, and fit our model on the remaining 4 folds, predict on the test set and compute the metric of interest.

For split two, we hold out the second fold as the test set, fit our model on the remaining 4 folds, predict on the held-out fold, and compute the metric of interest.

We repeat this for all splits.

We end up with 5 values of R^2. From these, we can calculate the mean, the median, and a 95% confidence interval.

This is known as 5-fold cross-validation. In general, using k folds is known as k-fold cross-validation.

The trade-off is that more folds are more computationally expensive.

This method prevents us from becoming dependent on the train/test split for our metric of interest.
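
Under the hood, cross_val_score is roughly a loop over the folds. A sketch of what 5-fold CV does using KFold, assuming X and y are the NumPy feature and target arrays from earlier:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=5)
scores = []

for train_idx, test_idx in kf.split(X):
    # Fit on 4 folds, score (R^2) on the held-out fold
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])
    scores.append(reg.score(X[test_idx], y[test_idx]))

print(scores)           # one R^2 value per fold
print(np.mean(scores))  # average across the 5 folds
```

In practice, cross_val_score wraps this loop for you: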

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# cv defines the number of folds; the score reported is R^2
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
# Example output: [ 0.87878787  0.87878787  0.87878787  0.87878787  0.87878787]
np.mean(cv_results)  # mean of the fold scores
```

Cross-validation challenge

```python
# Import the necessary modules
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)
# [0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
# Average 5-Fold CV Score: 0.8599627722793232
```

K-Fold CV comparison challenge

This challenge times how long cross-validation takes with different numbers of folds.

```python
# Import necessary modules
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Perform and time 3-fold CV
%timeit cross_val_score(reg, X, y, cv=3)
# 100 loops, best of 3: 6.78 ms per loop
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))
# 0.8718712782622108

# Perform and time 10-fold CV
%timeit cross_val_score(reg, X, y, cv=10)
# 10 loops, best of 3: 21 ms per loop
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))
# 0.8436128620131201
```

Regularized regression

Linear regression minimizes a loss function.

It chooses a coefficient for each feature.

Large coefficients can lead to overfitting. This is harder to spot in a high-dimensional space.

It is common to regularize the coefficients.

We will look first at "Ridge regression".

With Ridge regression, we need to pick the value of alpha ourselves. This is similar to picking the value of k for k-Nearest Neighbors.

Finding the alpha which works best is known as hyperparameter tuning.

Alpha controls model complexity. With alpha = 0, we get back OLS (which can lead to overfitting). A very large alpha heavily penalizes large coefficients, which can lead to underfitting.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# normalize=True scales all the features before fitting
# (deprecated in newer scikit-learn versions; use StandardScaler in a pipeline instead)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
# Example output: 0.87878787
```
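
To see the effect of alpha described above, here is a small sketch (reusing X_train and y_train) that refits Ridge for a few alphas and prints the overall size of the coefficient vector, which shrinks as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    # Larger alpha -> stronger penalty -> smaller coefficients
    print(alpha, np.linalg.norm(ridge.coef_))
```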

There is another method for regularized regression called "Lasso regression". It is similar to Ridge, but the penalty uses the sum of the absolute values of the coefficients rather than the sum of their squares.

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# normalize=True scales all the features before fitting (deprecated in newer scikit-learn)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
# Example output: 0.87878787
```
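
The only difference between the two penalties is the norm being used: Lasso adds alpha times the L1 norm of the coefficients to the loss, while Ridge adds alpha times the sum of their squares. A tiny sketch of the distinction on a hypothetical coefficient vector:

```python
import numpy as np

coef = np.array([0.5, -2.0, 0.0, 3.0])  # hypothetical coefficients

l1_penalty = np.sum(np.abs(coef))  # Lasso penalty term (before multiplying by alpha)
l2_penalty = np.sum(coef ** 2)     # Ridge penalty term (before multiplying by alpha)
print(l1_penalty, l2_penalty)      # 5.5 13.25

# The same quantities via NumPy's norm function
print(np.linalg.norm(coef, ord=1), np.linalg.norm(coef, ord=2) ** 2)
```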

Lasso can be used to select important features of a dataset: it shrinks the coefficients of less important features to exactly 0.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

names = boston.drop('MEDV', axis=1).columns
lasso = Lasso(alpha=0.1)

# Fit and extract coef_
lasso_coef = lasso.fit(X_train, y_train).coef_

_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()
```

Using this, we can immediately see that the most important predictor for our target variable, housing price, is the number of rooms.

Lasso regularization challenge

```python
# Import Lasso
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha=0.4, normalize=True)

# Fit the regressor to the data
lasso_fit = lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso_fit.coef_
print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()
```

According to the lasso algorithm, it seems like 'child_mortality' is the most important feature when predicting life expectancy.

Ridge regularization challenge

Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha, as in Ridge regression, you would be computing the L2 norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas and plot cross-validated R^2 scores for each, using the function defined below, which plots the R^2 score as well as the standard error for each alpha.

```python
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)

    # Shade +/- one standard error around the mean CV score
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)

    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
```

The motivation behind this exercise is for you to see how the R2 score varies with different alphas, and to understand the importance of selecting the right value for alpha.

```python
# Import necessary modules
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Set up the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:
    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha

    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)

    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))

    # Append the standard deviation of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)
```

Notice how the cross-validation scores change with different alphas. Which alpha should you pick? How can you fine-tune your model?

The next section will reveal answers to those questions.