RAJAT PANCHOTIA · Aug 4, 2020
Regression analysis is a predictive modeling technique that estimates the relationship between two or more variables. Recall that correlation analysis makes no assumption about the causal relationship between two variables. Regression analysis, by contrast, focuses on the relationship between a dependent (target) variable and one or more independent variables (predictors); the dependent variable is assumed to be the effect of the independent variable(s). The values of the predictors are used to estimate, or predict, the likely value of the target variable.
For example, to describe the relationship between diesel consumption and industrial production: if we assume that diesel consumption is the effect of industrial production, we can run a regression analysis to predict the value of diesel consumption for a specific value of industrial production.
STEPS TO PERFORM LINEAR REGRESSION
STEP 1: Assume a mathematical relationship between the target and the predictor(s). The relationship can be a straight line (linear regression), a polynomial curve (polynomial regression), or a non-linear relationship (non-linear regression).
STEP 2: Create a scatter plot of the target variable and the predictor variable (the simplest and most popular way to visualize the relationship).
STEP 3: Find the most likely values of the coefficients in the mathematical formula.
Regression analysis comprises the entire process of identifying the target and predictors, finding the relationship, estimating the coefficients, computing the predicted values of the target, and finally evaluating the accuracy of the fitted relationship.
More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.
For example, suppose we want to estimate each customer's credit card spend in the next quarter. For each customer, we have demographic and transaction data indicating that credit card spend is driven by age, credit limit, and the total outstanding balance on their loans. Using this insight, we can predict future spend based on current and past information.
1. Regression explores significant relationships between the dependent variable and the independent variables
2. It indicates the strength of the impact of multiple independent variables on the dependent variable
3. It allows us to compare the effects of variables measured on different scales, and can consider nominal, interval, or categorical variables for analysis
The equation with one dependent and one independent variable is defined by the formula:
y = c + bx
where y = estimated dependent score, c = constant, b = regression coefficient, and x = independent variable.
For prediction, many regression techniques are available. The choice of technique is mostly driven by three factors:
1. Number of independent variables
2. Type of dependent variables
3. Shape of regression line
Linear Regression
Linear regression is one of the most commonly used predictive modelling techniques. It is represented by the equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑒, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).
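As a minimal sketch of fitting this equation, assuming Python with scikit-learn and synthetic data (the numbers and variable names below are illustrative, not from the article):

```python
# Minimal sketch: fit Y = a + bX + e on synthetic data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))     # predictor values
e = rng.normal(0, 1.0, size=100)          # error term
y = 2.0 + 3.0 * X.ravel() + e             # true intercept a = 2, slope b = 3

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # estimates close to 2 and 3
print(model.predict([[5.0]]))             # predicted Y at X = 5
```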
Logistic Regression
Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
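A hedged sketch of the idea on synthetic data (the binary outcome and predictors are made up for illustration):

```python
# Sketch: logistic regression for a binary dependent variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two continuous predictors
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0])))     # true probability of the event
y = rng.binomial(1, p)                           # binary (0/1) outcome

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)                 # coefficients on the log-odds scale
print(clf.predict_proba(X[:3]))                  # predicted class probabilities
```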
Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below is a polynomial equation: 𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑋². In this regression technique, the best-fit line is not a straight line but a curve that fits the data points.
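A small sketch of such a degree-2 fit, assuming numpy and made-up data:

```python
# Sketch: fit Y = a + bX + cX^2 with numpy.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.5, size=x.size)

c2, c1, c0 = np.polyfit(x, y, deg=2)   # returns coefficients, highest power first
print(c0, c1, c2)                      # estimates close to 1.0, 2.0, 0.5
y_hat = np.polyval([c2, c1, c0], x)    # fitted curve, not a straight line
```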
Ridge Regression
Ridge regression is suited to multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are so large that they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors, in the hope that the net effect is more reliable estimates.
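A minimal sketch with deliberately collinear synthetic predictors (the penalty strength alpha is a free choice here, not a value from the article):

```python
# Sketch: ridge regression when two predictors are nearly identical.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, size=100)   # near-duplicate of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)        # alpha adds the stabilizing bias
print(ridge.coef_)                        # shrunken, more stable estimates
```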
Suppose we have a random sample of 20 students with their height (x) and weight (y), and we need to establish a relationship between the two. One of the first and most basic approaches to fit a line through the data points is to create a scatter plot of (x, y) and draw a straight line that fits the experimental data.
Since there can be multiple lines that fit the data, the challenge lies in choosing the one that fits best. As we already know, the best-fit line can be represented as
ŷᵢ = b₀ + b₁xᵢ
where
- yᵢ denotes the observed response for experimental unit i
- xᵢ denotes the predictor value for experimental unit i
- ŷᵢ is the predicted response (or fitted value) for experimental unit i
When we predict weight using the above equation, the prediction won't be perfectly accurate; it carries some prediction error (or residual error). This can be represented as
eᵢ = yᵢ − ŷᵢ
A line that best fits the data will be one for which the n prediction errors (one for each observed data point, i = 1 to n) are as small as possible in some overall sense.
One way to achieve this goal is to invoke the least squares criterion: minimize the sum of the squared prediction errors.
The equation of the best-fitting line is ŷᵢ = b₀ + b₁xᵢ. We need to find the values of b₀ and b₁ that make the sum of the squared prediction errors the smallest, i.e. that minimize
Q = Σ (yᵢ − (b₀ + b₁xᵢ))², summed over i = 1 to n.
Each coefficient in this equation has a physical interpretation, and it is very important to understand what the regression equation means.
The coefficient b₀, or the intercept, is the expected value of Y when X = 0.
The coefficient b₁, or the slope, is the expected change in Y when X is increased by one unit.
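A worked sketch of these formulas with made-up height/weight numbers (assuming numpy; the standard closed-form least-squares solutions are used):

```python
# Sketch: closed-form least-squares estimates for the height (x) / weight (y) example.
#   b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(150, 190, size=20)                 # 20 hypothetical heights (cm)
y = -100 + 0.95 * x + rng.normal(0, 4, size=20)    # 20 hypothetical weights (kg)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
errors = y - (b0 + b1 * x)                         # prediction errors e_i
print(b0, b1, np.sum(errors ** 2))                 # minimized sum of squared errors
```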
An analyst wants to understand what factors (or independent variables) affect credit card sales. Here, the dependent variable is credit card sales for each customer, and the independent variables are income, age, current balance, socio-economic status, current spend, last month’s spend, loan outstanding balance, revolving credit balance, number of existing credit cards and credit limit. In order to understand what factors affect credit card sales, the analyst needs to build a linear regression model.
The trainee is exposed to a sample dataset comprising telecom customer accounts with their annual income and age, along with their average monthly revenue (the dependent variable). The trainee is expected to fit a linear regression model using annual income as the single predictor variable.
Once we fit a linear regression model, we need to evaluate the accuracy of the model. In the following sections, we will discuss the various methods used to evaluate the accuracy of the model with respect to its predictive power.
The F-test indicates whether a linear regression model provides a better fit to the data than a model that contains no independent variables. It consists of a null and an alternate hypothesis, and the test statistic helps us decide whether to reject the null hypothesis.
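One way to see this in practice is the statsmodels OLS fit, sketched here on synthetic data:

```python
# Sketch: reading the overall F-test from an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.fvalue, res.f_pvalue)   # a small p-value favors the model with predictors
```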
The R-squared value of the model is also called the coefficient of determination. This statistic measures the percentage of variation in the target variable that is explained by the model. R-squared is calculated using the following formula:
R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
R-squared is always between 0 and 100%. As a guideline, the higher the R-squared, the better the model fits. However, the objective is not simply to maximize R-squared, since the stability and applicability of the model are equally important.
Next, check the adjusted R-squared value. Ideally, the R-squared and adjusted R-squared values should be in close proximity to each other. If they are not, the analyst may have overfitted the model and may need to remove the insignificant variables from it.
The trainee is exposed to a sample dataset capturing telecom customer accounts with their annual income and age, along with their average monthly revenue (the dependent variable). The dataset also contains predicted values of average monthly revenue from a regression model. The trainee is expected to apply the calculation of the coefficient of determination.
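A minimal sketch of that calculation (the actual/predicted arrays below are illustrative stand-ins for the average monthly revenue data):

```python
# Sketch: coefficient of determination from actual vs. predicted values.
import numpy as np

y_actual = np.array([52.0, 61.0, 45.0, 70.0, 58.0])
y_pred = np.array([50.0, 63.0, 47.0, 68.0, 57.0])

ss_res = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
print(1 - ss_res / ss_tot)                          # R-squared
```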
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (<0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value can be included in the model because changes in the predictor’s value are related to changes in the response variable.
The trainee is exposed to a sample dataset capturing the flight status of flights, with their delay in arrival along with various possible predictor variables such as departure delay, distance, and air time. The trainee is expected to build a multiple regression model in which all the variables are significant.
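One simple, common approach is to fit the model, inspect the p-values, and drop insignificant predictors; a sketch with hypothetical flight data and column names:

```python
# Sketch: p-value-based variable screening with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "dep_delay": rng.normal(10, 5, 200),     # hypothetical predictors
    "distance": rng.normal(800, 200, 200),
    "air_time": rng.normal(120, 30, 200),
})
df["arr_delay"] = 5 + 0.9 * df["dep_delay"] + rng.normal(0, 3, 200)

X = sm.add_constant(df[["dep_delay", "distance", "air_time"]])
res = sm.OLS(df["arr_delay"], X).fit()
print(res.pvalues)   # keep predictors with p < 0.05 and refit without the rest
```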
We can also evaluate a regression model based on various summary statistics on error or residuals.
Some of them are:
Root Mean Square Error (RMSE): the square root of the average of the squared residuals, as per the formula:
RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
Mean Absolute Percentage Error (MAPE): the average percentage deviation, as per the formula:
MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|
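A quick sketch of both metrics on illustrative arrays:

```python
# Sketch: RMSE and MAPE from actual (y) and predicted (y_hat) values.
import numpy as np

y = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

rmse = np.sqrt(np.mean((y - y_hat) ** 2))       # penalizes large errors more heavily
mape = 100 * np.mean(np.abs((y - y_hat) / y))   # requires nonzero actual values
print(rmse, mape)
```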
Observations are grouped based on predicted values of the target variable. The average of the actual vs. predicted values of the target variable, across the groups, is observed to see if they move in the same direction across the groups (increase or decrease). This is called the rank ordering check.
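A sketch of a rank ordering check with pandas, using hypothetical actual and predicted columns and decile groups:

```python
# Sketch: rank ordering check across deciles of the predicted values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
pred = rng.uniform(0, 100, 500)                 # hypothetical predictions
actual = pred + rng.normal(0, 10, 500)          # hypothetical actuals
df = pd.DataFrame({"actual": actual, "predicted": pred})

df["group"] = pd.qcut(df["predicted"], q=10, labels=False)   # decile groups
print(df.groupby("group")[["actual", "predicted"]].mean())
# Both averages should move in the same direction across the groups.
```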
There are some basic but strong assumptions underlying linear regression model estimation. After fitting a regression model, we should also test the validity of each of these assumptions; a sketch of such checks follows the list below.
- There must be a causal relationship between the dependent and the independent variable(s) which can be expressed as a linear function. A scatter plot of target variable vs. predictor variable can help us validate this.
- The error term of one observation is independent of that of the others; otherwise we say the data has an autocorrelation problem.
- The mean (or expected value) of errors is zero.
- The variance of errors does not depend on the value of any predictor variable. This means, errors have a constant variance along the regression line.
- Errors follow a normal distribution. We can use a normality test on the errors to validate this.
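A sketch of some of these checks on the residuals of a synthetic fit (Durbin-Watson for autocorrelation, Shapiro-Wilk for normality; both are standard choices, not prescribed by the article):

```python
# Sketch: basic residual diagnostics after an OLS fit.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=100)
resid = sm.OLS(y, X).fit().resid

print(resid.mean())             # mean of errors should be ~0
print(durbin_watson(resid))     # ~2 suggests no autocorrelation
print(stats.shapiro(resid))     # normality test on the errors
# For constant variance, plot residuals vs. fitted values and look for even spread.
```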
The trainee is expected to first select the significant variables for the model and then check whether there is any problem of overfitting. If found, the trainee should remove the requisite variable(s) and iterate through the variable selection process.