Regression Models and Evaluation Metrics in Machine Learning
A high-level overview of Linear Regression and common evaluation metrics.
What is Regression?
Regression is a statistical method used in machine learning to predict a continuous numeric label (output) based on one or more input features. Regression analysis aims to establish a mathematical relationship between the dependent variable (label) and the independent variables (features). This relationship helps in predicting the label for new, unseen data.
In simple linear regression, the relationship between the dependent variable Y and the independent variable X is modelled as a linear function. The formula is:
Y = β0 + β1X
Where:
Y: The dependent variable (the output we are trying to predict).
β0: The intercept (the value of Y when X is 0).
β1: The slope (the change in Y for a one-unit change in X).
X: The independent variable (the input feature used for prediction).
Let's consider a very simple example: We use shoe size (x) to predict height (y).
Using linear regression on this data, we might find the following relationship:
y = 150 + 5x
Here:
y: The height of someone being predicted.
150: The intercept. That is, if the shoe size were hypothetically 0, the height would be 150 cm.
5x: The slope of the relationship. For each additional unit increase in shoe size, the height increases by 5 cm.
To predict the height of someone with a shoe size of 8.5, we start with the base height of 150 cm and add 5 cm for each unit of shoe size: y = 150 + 5 × 8.5 = 150 + 42.5 = 192.5.
So, the predicted height of someone with a shoe size of 8.5 is 192.5 cm.
Linear regression involves fitting the “line of best fit” to the data. In this case, the line represents a perfect relationship: for every increase of 1 in shoe size, there is a 5 cm increase in height.
You can imagine that with less perfect data, not all of the data points would fall exactly on the line.
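As a minimal Python sketch of the toy relationship above (the function name predict_height is ours, purely for illustration):
def predict_height(shoe_size):
    # Toy model from above: height (cm) = 150 + 5 * shoe size
    return 150 + 5 * shoe_size
print(predict_height(8.5))  # 192.5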
Regression as a Type of Supervised Machine Learning
Regression falls under the category of supervised machine learning. In supervised learning, the model is trained on a labelled dataset, meaning each training example consists of input features and the corresponding known output label. The model learns to map inputs to outputs by identifying patterns in the training data.
Key Characteristics of Supervised Learning:
Labelled Data: The training dataset includes input-output pairs where the output is a known value.
Prediction Task: The goal is to predict the output label for new data based on the learned relationship from the training data.
Types of Regression
There are various types of regression algorithms, each suitable for different types of data and relationships:
Linear Regression: Models the relationship between the input features and output as a straight line.
Polynomial Regression: Models the relationship as a polynomial, suitable for more complex, non-linear data.
Ridge and Lasso Regression: Regularised versions of linear regression that add penalty terms to prevent overfitting.
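All three variants share the same estimator interface in scikit-learn. As an illustrative sketch (the degree and alpha values below are arbitrary, not tuned):
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Plain linear regression: fits a straight line
linear = LinearRegression()
# Polynomial regression: expand the features to degree 2, then fit a linear model
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
# Regularised linear regression: alpha controls the penalty strength
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)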
The Training Process for Regression Models
Data Splitting: Randomly split the training data to create a dataset for training the model while holding back a subset of the data to validate the trained model.
Model Training: Fit the training data to a model using an algorithm, such as linear regression.
Model Validation: Test the model using the validation data by predicting labels for the features.
Performance Evaluation: Compare the actual labels in the validation dataset to the predicted labels. Aggregate the differences between predicted and actual label values to calculate a metric indicating the model's accuracy.
Iterative Refinement: Adjust the algorithm and parameters and repeat the training and validation process until the model achieves an acceptable level of predictive accuracy.
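A minimal sketch of this workflow in Python, assuming scikit-learn and a small synthetic dataset (the split ratio, data, and metric are illustrative choices):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# Illustrative synthetic data: one feature with a linear trend plus noise
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + np.random.default_rng(0).normal(0, 2, 50)
# 1. Data splitting: hold back 30% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. Model training
model = LinearRegression().fit(X_train, y_train)
# 3. Model validation: predict labels for the held-back features
y_pred = model.predict(X_val)
# 4. Performance evaluation: aggregate the differences into a single metric
print("MAE:", mean_absolute_error(y_val, y_pred))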
Example: Predicting House Prices
Let's explore regression with an example. We have a dataset of house prices and their corresponding sizes (the full dataset is listed in the Python example later in this article).
We split the dataset to form a training set, which will be used to train a model to predict house prices (y) based on house size (x) in square meters. The held-back data will be used during the evaluation.
Applying Linear Regression
We can plot the relationship between house size and price on a graph and fit a linear regression line to understand the relationship between the two variables.
The function derived by the linear regression algorithm for this data can be represented as:
f(x) = 7595.42 + 3010.27x
Where:
f(x): denotes the function f evaluated at x. In this context, the function takes the independent variable (x = house size) as input and predicts the value of the dependent variable (y = house price).
7595.42: This is the intercept term, which is the predicted house price (y) when the house size (x) is 0 square metres.
+3010.27x: This term represents the slope and indicates that for every one-unit increase in x (house size), the value of the function f(x) will increase by $3,010.27.
How are the coefficients calculated? See the step-by-step computation at the end of this article.
In the context of predicting house prices based on house size:
House Size (x): The independent variable (input feature) represents the size of the house in square meters.
House Price function f(x): The dependent variable (output) represents the house's predicted price.
We can use this regression function to predict house prices for any given size. For example, if the house size is 85 square meters, the model predicts:
f(85) = 7595.42 + 3010.27 × 85 = 7595.42 + 255,872.95 = 263,468.37
So, the predicted price for a house size of 85 square meters is approximately $263,468.37.
Evaluating the Model
To validate the model, we predict values (ŷ) from the features of the held-back data and compare them to the corresponding actual values (y). We can then measure the model's performance using various metrics that aggregate these comparisons.
Mean Absolute Error (MAE)
Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions without considering their direction. It is the average of the absolute differences between prediction and actual observation over the test sample, where all individual differences have equal weight.
In this example, the MAE indicates how many dollars, on average, each prediction was off by. Importantly, it doesn't matter whether the prediction was over or under; only the magnitude of the error counts.
In the house price example, the mean (average) of absolute errors is $8,207.36.
The formula is:
MAE = (1/n) Σ |yi − ŷi|
where n is the number of validation predictions, yi is an actual value, and ŷi is the corresponding predicted value.
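A quick check in Python (the y_true and y_pred arrays are illustrative placeholders, not the article's validation data):
import numpy as np
from sklearn.metrics import mean_absolute_error
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
# Errors are 7000, 6000 and 8000 dollars; MAE is their average
print(mean_absolute_error(y_true, y_pred))  # 7000.0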
Mean Squared Error (MSE)
Mean Squared Error (MSE) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.
Unlike MAE, which treats all discrepancies between predicted and actual labels equally, MSE emphasizes the larger errors: squaring the individual errors before averaging gives disproportionate weight to big mistakes. This is useful when it is preferable to have a model that is slightly off all the time rather than one that makes fewer but more significant errors.
In the house price example, the MSE is 94,049,732.32.
The formula is:
MSE = (1/n) Σ (yi − ŷi)²
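Using the same illustrative arrays as in the MAE example:
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
# Squaring the errors (7000, 6000, 8000) before averaging emphasises the largest one
print(mean_squared_error(y_true, y_pred))  # ≈ 49666666.67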
Root Mean Squared Error (RMSE)
Root Mean Squared Error (RMSE) is the square root of the MSE. It is a frequently used measure that quantifies the differences between values predicted by a model and the observed values. Because the MSE squares the errors, it is expressed in squared units of the original label. Thus, stating that the MSE of our model is 94,049,732.32 does not provide a direct measure of the error in terms of the original units (dollars, in this case); the MSE is simply a numeric score indicating the overall error level in the validation predictions.
To express the error in terms of dollars, we take the square root of the MSE.
In the house price example, the RMSE is $9,699.99.
The formula is:
RMSE = √MSE = √( (1/n) Σ (yi − ŷi)² )
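Continuing the illustrative example, taking the square root returns the metric to dollar units:
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # ≈ 7047.46, i.e. an average error of roughly $7,047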
Coefficient of Determination (R²)
The Coefficient of Determination (R²) is a statistical measure that explains how much of the variability in a dependent variable can be explained by its relationship with an independent variable. In regression, the R² coefficient of determination measures how well the regression predictions approximate the actual data points. An R² of 1 indicates that the regression predictions perfectly fit the data.
This metric compares the sum of squared differences between the predicted and actual labels (residual sum of squares) with the sum of squared differences between the actual label values and the mean of the actual values (total sum of squares).
The resulting value typically falls between 0 and 1 (it can be negative for a model that fits worse than simply predicting the mean). The closer the value is to 1, the better the model fits the validation data.
In the house price example, the R² calculated from the validation data is 0.9996.
The formula is:
R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²
where ȳ is the mean of the actual values.
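In Python, scikit-learn provides r2_score (same illustrative arrays as before):
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
# 1 minus (residual sum of squares / total sum of squares)
print(r2_score(y_true, y_pred))  # ≈ 0.985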
Adjusted R²
Adjusted R² adjusts the R² statistic based on the number of independent variables in the model. Unlike R², it does not always increase when adding a new predictor. This is because Adjusted R² considers the number of predictors relative to the number of data points, penalizing the addition of predictors that do not significantly improve the model.
Why Adjusted R² is a Better Measure for Comparing Models
Penalises Overfitting: R² always increases or stays the same when more predictors are added to the model, regardless of whether the new predictors are actually useful. This can lead to overfitting, where the model fits the training data well but performs poorly on new, unseen data. Adjusted R², on the other hand, increases only if the new predictor improves the model more than would be expected by chance. If the new predictor does not provide a meaningful improvement, Adjusted R² can decrease.
Accounts for the Number of Predictors: Adjusted R² incorporates the number of predictors (p) and the number of observations (n) into its calculation. This means that models with more predictors are not unfairly favoured. The formula for Adjusted R² is:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)
where R² is the coefficient of determination, n is the number of observations, and p is the number of predictors.
Better Comparison: Because Adjusted R² penalizes models for having unnecessary predictors, it provides a more accurate measure of model performance when comparing models with different numbers of predictors. This makes it a better tool for model selection, especially when dealing with complex models.
Adjusted R² is a more reliable statistic for comparing models because it adjusts for the number of predictors. This helps to avoid overfitting and provides a clearer picture of model performance. It also ensures that only predictors that genuinely improve the model are favoured.
In the house price example, the Adjusted R² is 0.9996.
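scikit-learn has no built-in Adjusted R² function, but it is a one-line transformation of R². A sketch, using n = 31 observations and p = 1 predictor as in the house price dataset:
def adjusted_r2(r2, n, p):
    # Penalise R² for the number of predictors p given n observations
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adjusted_r2(0.9996, n=31, p=1))  # ≈ 0.9996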
Mean Bias Deviation (MBD)
Mean Bias Deviation (MBD) measures the average bias in the model predictions. It provides an indication of whether the model tends to overpredict or underpredict. Unlike other error metrics that focus on the magnitude of errors, MBD specifically evaluates the direction of the errors, giving insights into the systematic bias present in the model.
In the house price example, the MBD is $8,207.36.
The formula is:
MBD = (1/n) Σ (ŷi − yi)
With this convention, a positive MBD indicates a tendency to overpredict and a negative MBD a tendency to underpredict (some texts use yi − ŷi, which flips the sign).
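A sketch in Python (same illustrative arrays as before; note the sign convention in the comment):
import numpy as np
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
# Mean of signed errors: positive means the model tends to overpredict
mbd = np.mean(y_pred - y_true)
print(mbd)  # 3000.0, i.e. predictions are $3,000 too high on average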
Mean Absolute Percentage Error (MAPE)
Mean Absolute Percentage Error (MAPE) measures a forecasting method's accuracy in terms of percentage error. It is a commonly used metric in regression analysis to assess a model's prediction accuracy. The MAPE is expressed as a percentage, which makes it easier to interpret and compare across different datasets and models.
In the house price example, the MAPE is 2.02%.
The formula is:
MAPE = (100/n) Σ |(yi − ŷi) / yi|
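A sketch in Python (same illustrative arrays as above):
import numpy as np
y_true = np.array([158000, 230000, 297000])  # actual labels (illustrative)
y_pred = np.array([165000, 224000, 305000])  # predicted labels (illustrative)
# Average of absolute errors relative to the actual values, as a percentage
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(mape)  # ≈ 3.24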
Iterative Training
The training process is typically iterative. Data scientists repeatedly train and evaluate a model, varying:
Feature Selection and Preparation: Choosing which features to include and how to preprocess them.
Algorithm Selection: Exploring different regression algorithms.
Hyperparameters: Adjusting the numeric settings that control algorithm behaviour.
After multiple iterations, the model that yields the best evaluation metrics is selected for use.
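One common way to automate part of this loop is a grid search over hyperparameters. A sketch using scikit-learn (the Ridge model, the alpha grid, and the synthetic data are illustrative choices):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
X = np.arange(50, dtype=float).reshape(-1, 1)  # illustrative feature
y = 3 * X.ravel() + np.random.default_rng(0).normal(0, 2, 50)  # illustrative label
# Try several penalty strengths, scoring each by cross-validated MAE
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)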
Determining the Regression Equation
To determine the regression equation for the dataset in question, we use linear regression to fit the line that best describes the relationship between the house size (independent variable x) and the house price (dependent variable y).
Steps to Find the Linear Regression Model
Prepare the Data: List the house sizes and corresponding prices.
Compute the Regression Coefficients: Find the slope (β1) and intercept (β0) of the best-fit line.
Construct the Regression Equation: Use the calculated coefficients to form the regression equation:
ŷ = β0 + β1x
Using Python for Linear Regression
We can use Python's pandas and scikit-learn libraries to perform linear regression and find the coefficients β0 (intercept) and β1 (slope):
import pandas as pd
from sklearn.linear_model import LinearRegression
# Data
data = {
'House Size (x)': [
55, 65, 75, 80, 85, 95, 100, 110, 120, 135, 140, 50, 55, 70, 80, 85, 90, 95, 100, 105,
110, 115, 120, 125, 130, 135, 140, 145, 65, 75, 125
],
'House Price (y)': [
158000, 182000, 230000, 245000, 248000, 285000, 297000, 340000, 360000, 400000, 430000,
155000, 158000, 220000, 245000, 248000, 280000, 285000, 297000, 310000, 340000, 345000,
360000, 375000, 395000, 400000, 430000, 435000, 182000, 230000, 375000
]
}
df = pd.DataFrame(data)
# Features and Labels
X = df[['House Size (x)']]
y = df['House Price (y)']
# Model
model = LinearRegression()
model.fit(X, y)
# Coefficients
intercept = model.intercept_
slope = model.coef_[0]
print(f"Intercept (β₀): {intercept}")
print(f"Slope (β₁): {slope}")
# Regression Equation
print(f"Regression Equation: y = {intercept} + {slope}x")
Output
Intercept (β₀): 7595.42
Slope (β₁): 3010.27
Regression Equation
Based on the linear regression model, the regression equation is:
y = 7595.42 + 3010.27x
This equation means that:
The intercept (β0) is approximately 7595.42, which is the predicted house price when the house size is 0 square meters.
The slope (β1) is approximately 3010.27, indicating that for each additional square meter of house size, the house price increases by about $3,010.27.
which can be expressed as:
f(x) = 7595.42 + 3010.27x
How are the coefficients actually calculated, you ask?
Below are the step-by-step calculations for finding the coefficients (intercept and slope).
Computing coefficients
Definitions
xi: The i-th value of the independent variable (input feature).
yi: The i-th value of the dependent variable (output label).
x̄: The mean of the independent variable values.
ȳ: The mean of the dependent variable values.
n: The number of observations.
Formulas for the Coefficients
Slope (β1)
The slope β1 is calculated as:
β1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
Intercept (β0)
The intercept β0 is calculated as:
β0 = ȳ − β1x̄
Step-by-Step Calculation
1 - Calculate the means:
x̄ = (1/n) Σ xi and ȳ = (1/n) Σ yi
2 - Calculate the slope (β1):
Compute the numerator: Σ (xi − x̄)(yi − ȳ)
Compute the denominator: Σ (xi − x̄)²
Divide the numerator by the denominator to get β1.
3 - Calculate the intercept (β0):
Use the mean values and the slope to find β0 = ȳ − β1x̄.
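These formulas translate directly into a few lines of numpy. A sketch with a small illustrative dataset (not the article's house data):
import numpy as np
x = np.array([50, 75, 100, 125, 150], dtype=float)  # illustrative house sizes
y = np.array([160000, 230000, 300000, 370000, 440000], dtype=float)  # illustrative prices
x_bar, y_bar = x.mean(), y.mean()
# Slope: sum of co-deviations divided by sum of squared x-deviations
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the fitted line passes through the point of means (x̄, ȳ)
beta0 = y_bar - beta1 * x_bar
print(beta0, beta1)  # 20000.0 2800.0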