Predictive Modelling Using Linear Regression (2024)


Regression analysis is a predictive modeling technique that estimates the relationship between two or more variables. Recall that a correlation analysis makes no assumption about the causal relationship between two variables. Regression analysis focuses on the relationship between a dependent (target) variable and one or more independent variables (predictors). Here, the dependent variable is assumed to be the effect of the independent variable(s). The value of the predictors is used to estimate or predict the likely value of the target variable.

For example, to describe the relationship between diesel consumption and industrial production: if it is assumed that “diesel consumption” is the effect of “industrial production”, we can do a regression analysis to predict the value of “diesel consumption” for some specific value of “industrial production”.

STEPS TO PERFORM LINEAR REGRESSION

STEP 1: Assume a mathematical relationship between the target and the predictor(s). “The relationship can be a straight line (linear regression) or a polynomial curve (polynomial regression) or a non-linear relationship (non-linear regression)”

STEP 2: Create a scatter plot of the target variable and predictor variable (the simplest and most popular way).


STEP 3: Find the most likely values of the coefficients in the mathematical formula.

Regression analysis comprises the entire process of identifying the target and predictors, finding the relationship, estimating the coefficients, finding the predicted values of the target, and finally evaluating the accuracy of the fitted relationship.
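These steps can be sketched in Python; the sample data below is hypothetical, and numpy's polyfit carries out the least-squares estimation of Step 3:

```python
import numpy as np

# Step 1: assume a straight-line relationship y = b0 + b1 * x.
# Hypothetical sample of predictor (x) and target (y) values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Step 2 would be a scatter plot of (x, y), e.g. with matplotlib.

# Step 3: estimate the most likely coefficients by least squares.
b1, b0 = np.polyfit(x, y, deg=1)  # returns slope first, then intercept

# The fitted line predicts the target for a new predictor value.
y_new = b0 + b1 * 6.0
print(f"intercept={b0:.2f}, slope={b1:.2f}, prediction at x=6: {y_new:.2f}")
```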

Regression analysis estimates the relationship between two or more variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

For example, we want to estimate the credit card spend of customers in the next quarter. For each customer, we have demographic and transaction-related data which indicate that credit card spend is a function of age, credit limit and total outstanding balance on their loans. Using this insight, we can predict each customer's future spend based on current and past information.

1. Regression explores significant relationships between the dependent variable and the independent variables

2. Indicates the strength of impact of multiple independent variables on a dependent variable

3. Allows us to compare the effects of variables measured on different scales, and can consider nominal, interval, or categorical variables for analysis.

The equation with one dependent and one independent variable is defined by the formula:

y = c + bx

where y = estimated dependent score

c = constant

b = regression coefficient

x = independent variable.

For predictions, there are many regression techniques available. The type of regression technique to be used is mostly driven by three metrics:

1. Number of independent variables

2. Type of dependent variables

3. Shape of regression line

Linear Regression

Linear regression is one of the most commonly used predictive modelling techniques. It is represented by an equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑒, where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of a target variable based on given predictor variable(s).

Logistic Regression

Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
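A minimal sketch of what the fitted model computes, assuming hypothetical coefficients b0 and b1 (in practice they are estimated by maximum likelihood, not least squares):

```python
import math

def logistic_predict(x, b0=-3.0, b1=1.5):
    """Probability that the binary target equals 1 for predictor value x.

    b0 and b1 are hypothetical coefficients, for illustration only.
    """
    z = b0 + b1 * x  # the same linear predictor used in linear regression
    return 1.0 / (1.0 + math.exp(-z))  # the sigmoid maps z into (0, 1)

# At x = 2 the linear predictor z is exactly 0, so the probability is 0.5.
print(logistic_predict(2.0))
```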

Polynomial Regression

A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation. 𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑋². In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into the data points.
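Fitting such a curve is still a least-squares problem; here is a numpy sketch with hypothetical data generated from a known quadratic:

```python
import numpy as np

# Hypothetical data lying exactly on Y = 1 + 2X + 0.5X^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x**2

# Fit a degree-2 polynomial; polyfit returns the highest power first.
c, b, a = np.polyfit(x, y, deg=2)
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}")  # recovers the generating curve
```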

Ridge Regression

Ridge regression is suitable for analyzing multiple regression data that suffers from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.
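The ridge estimate has a closed form, (XᵀX + λI)⁻¹Xᵀy. A numpy sketch on a small hypothetical dataset with two nearly collinear predictors (for simplicity the penalty is also applied to the intercept column, which real implementations usually exclude):

```python
import numpy as np

# Hypothetical design matrix: intercept column plus two nearly collinear predictors.
X = np.array([[1.0, 1.0, 1.01],
              [1.0, 2.0, 2.02],
              [1.0, 3.0, 2.98],
              [1.0, 4.0, 4.01]])
y = np.array([2.0, 4.1, 6.0, 8.1])

lam = 1.0  # ridge penalty: adds a little bias, shrinks the variance

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ordinary least squares for comparison; its coefficients are larger and
# unstable because the two predictor columns are nearly collinear.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ridge, beta_ols)
```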

Suppose we have a random sample of 20 students with their height (x) and weight (y) and we need to establish a relationship between the two. One of the first and most basic approaches to fitting a line through the data points is to create a scatter plot of (x, y) and draw a straight line that fits the experimental data.


Since there can be multiple lines that fit the data, the challenge arises in choosing the one that best fits. As we already know, the best fit line can be represented as

𝑦̂𝑖 = 𝑏0 + 𝑏1𝑥𝑖

  • 𝑦𝑖 denotes the observed response for experimental unit i
  • 𝑥𝑖 denotes the predictor value for experimental unit i
  • 𝑦̂𝑖 is the predicted response (or fitted value) for experimental unit i

When we predict weight using the above equation, the predicted value wouldn’t be perfectly accurate. It has some “prediction error” (or “residual error”). This can be represented as

𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖

A line that fits the data best will be one for which the n (i = 1 to n) prediction errors, one for each observed data point, are as small as possible in some overall sense.

One way to achieve this goal is to invoke the “least squares criterion,” which says to “minimize the sum of the squared prediction errors.”

The equation of the best fitting line is:

𝑦̂𝑖 = 𝑏0 + 𝑏1𝑥𝑖

We need to find the values of b0 and b1 that make the sum of the squared prediction errors the smallest, i.e. minimize

Q = Σ (𝑦𝑖 − 𝑦̂𝑖)², summed over i = 1 to n

Each of the coefficients in the equation above has a physical interpretation, and hence it is very important to understand what the regression equation means.

The coefficient 𝑏0, or the intercept, is the expected value of Y when X = 0.

The coefficient 𝑏1, or the slope, is the expected change in Y when X is increased by one unit.

The following figure explains the interpretations clearly.

[Figure: interpretation of the intercept 𝑏0 and the slope 𝑏1]
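The least-squares estimates themselves have a well-known closed form, b1 = Σ(𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) / Σ(𝑥𝑖 − x̄)² and b0 = ȳ − b1·x̄, sketched here in plain Python on hypothetical data:

```python
# Hypothetical predictor and target values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)

print(f"b0={b0:.2f}, b1={b1:.2f}")
```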

An analyst wants to understand what factors (or independent variables) affect credit card sales. Here, the dependent variable is credit card sales for each customer, and the independent variables are income, age, current balance, socio-economic status, current spend, last month’s spend, loan outstanding balance, revolving credit balance, number of existing credit cards and credit limit. In order to understand what factors affect credit card sales, the analyst needs to build a linear regression model.

The trainee is exposed to a sample dataset comprising telecom customer accounts and their annual income and age, along with their average monthly revenue (dependent variable). The trainee is expected to apply the linear regression model using annual income as the single predictor variable.

Once we fit a linear regression model, we need to evaluate the accuracy of the model. In the following sections, we will discuss the various methods used to evaluate the accuracy of the model with respect to its predictive power.

The F-Test indicates whether a linear regression model provides a better fit to the data than a model that contains no independent variables. It consists of the null and alternate hypothesis and the test statistic helps to prove or disprove the null hypothesis.

The R-squared value of the model is also called the “Coefficient of Determination”. This statistic calculates the percentage of variation in the target variable explained by the model.


R-squared is calculated using the following formula:

R² = 1 − SSE/SST = SSR/SST, where SSE is the error (residual) sum of squares, SSR the regression sum of squares, and SST the total sum of squares.

R-squared is always between 0 and 100%. As a guideline, the higher the R-squared, the better the model. The objective, however, is not simply to maximize the R-squared, since the stability and applicability of the model are equally important.
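R-squared follows directly from its definition, R² = 1 − SSE/SST; a plain-Python sketch with hypothetical actual and predicted values:

```python
# Hypothetical actual and model-predicted values of the target.
actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.0]

mean_y = sum(actual) / len(actual)
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # error sum of squares
sst = sum((a - mean_y) ** 2 for a in actual)                # total sum of squares

r_squared = 1.0 - sse / sst
print(f"R-squared = {r_squared:.3f}")  # close to 1: the model explains most variation
```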

Next, check the Adjusted R-squared value. Ideally, the R-squared and adjusted R-squared values need to be in close proximity of each other. If this is not the case, then the analyst may have overfitted the model and may need to remove the insignificant variables from the model.

The trainee is exposed to a sample dataset capturing telecom customer accounts and their annual income and age, along with their average monthly revenue (dependent variable). The dataset also contains predicted values of “average monthly revenue” from a regression model. The trainee is expected to apply the concept of calculation of the coefficient of determination.

The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (<0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value can be included in the model because changes in the predictor’s value are related to changes in the response variable.
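Obtaining the exact p-value requires a t-distribution table or a statistics library, but the test statistic itself, the slope divided by its standard error, can be sketched in plain Python. The data below is hypothetical, and 3.182 is the two-sided 5% critical value of the t-distribution with n − 2 = 3 degrees of freedom:

```python
import math

# Hypothetical sample of predictor and target values (n = 5).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope, using n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2) / sxx)

# t-statistic for H0: the slope coefficient equals zero.
t_stat = b1 / se_b1
print(f"t = {t_stat:.1f}, significant at 5%: {abs(t_stat) > 3.182}")
```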

The trainee is exposed to a sample dataset capturing the flight status of flights with their delay in arrival, along with various possible predictor variables like departure delay, distance, air time, etc. The learner is expected to build a multiple regression model where all the variables are significant.

We can also evaluate a regression model based on various summary statistics on error or residuals.

Some of them are:

Root Mean Square Error (RMSE): where we take the square root of the average of the squared residuals, as per the given formula:

RMSE = √( (1/n) Σ (𝑦𝑖 − 𝑦̂𝑖)² )

Mean Absolute Percentage Error (MAPE): where we find the average absolute percentage deviation, as per the given formula:

MAPE = (100/n) Σ |(𝑦𝑖 − 𝑦̂𝑖) / 𝑦𝑖|

Observations are grouped based on predicted values of the target variable. The average of the actual vs. predicted values of the target variable, across the groups, is observed to see if they move in the same direction across the groups (increase or decrease). This is called the rank ordering check.
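RMSE and MAPE follow directly from their formulas; a plain-Python sketch on hypothetical actuals and predictions:

```python
import math

# Hypothetical actual and predicted values of the target.
actual    = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]

n = len(actual)

# RMSE: square root of the average squared residual.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# MAPE: average absolute percentage deviation from the actuals.
mape = 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / n

print(f"RMSE = {rmse:.2f}, MAPE = {mape:.1f}%")
```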

There are some basic but strong underlying assumptions behind the linear regression model estimation. After fitting a regression model, we should also test the validation of each of these assumptions.

  • There must be a causal relationship between the dependent and the independent variable(s) which can be expressed as a linear function. A scatter plot of the target variable vs. each predictor variable can help us validate this.
  • The error term of one observation is independent of that of the others. Otherwise, we say the data has an autocorrelation problem.
  • The mean (or expected value) of the errors is zero.
  • The variance of the errors does not depend on the value of any predictor variable. This means the errors have a constant variance along the regression line.
  • The errors follow a normal distribution. We can use a normality test on the errors here.
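Two of these checks can be sketched on a hypothetical series of residuals: the mean of the errors should be near zero, and the Durbin-Watson statistic, a standard check for first-order autocorrelation, should be near 2:

```python
# Hypothetical residuals from a fitted regression model.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]

# Check: the mean (expected value) of the errors should be close to zero.
mean_resid = sum(residuals) / len(residuals)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation;
# values well below 2 suggest positive, well above 2 negative, autocorrelation.
num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
dw = num / sum(r * r for r in residuals)

print(f"mean of residuals = {mean_resid:.3f}, Durbin-Watson = {dw:.2f}")
```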

The trainee is expected to select the significant variables for the model first and then check if there is any problem of overfitting. If found, the trainee should remove the requisite variable(s) and iterate through the variable selection process.


FAQs

Can you use linear regression for prediction? ›

You can use linear regression for causal research, result prediction, or trend prognosis. The linear regression algorithm solves the least squares problem X * B = Y, where X is the input n * p matrix, B is a p * 1 vector, and Y is the n * 1 vector of target values.

How can you determine if a regression model is good enough? ›

The best way to look at regression data is by plotting the predicted values against the real values in the holdout set. Ideally, we expect the points to lie on the 45-degree line passing through the origin (y = x is the equation). The nearer the points to this line, the better the regression.

What is predictive modeling using linear regression? ›

Linear regression is one of the most commonly used predictive modelling techniques. It is represented by an equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑒, where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of a target variable based on given predictor variable(s).

Is linear regression not used for predictive analysis? ›

Linear regression is the simplest statistical regression method used for predictive analysis in machine learning.

Why we Cannot use linear regression to make probability predictions? ›

Probability ranges between 0 and 1, where the probability of something certain to happen is 1, and of something that cannot happen is 0. But in linear regression, we predict an unbounded number, which can fall outside the range 0 to 1.
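A small sketch of the problem, using hypothetical coefficients of a straight line fitted to a 0/1 target; extrapolating the line yields “probabilities” outside the valid range:

```python
# Hypothetical coefficients of a straight line fitted to a binary (0/1) target.
b0, b1 = -0.25, 0.15

def predict(x):
    """Linear prediction b0 + b1 * x; unbounded, unlike a probability."""
    return b0 + b1 * x

print(predict(1.0))   # below 0: not a valid probability
print(predict(10.0))  # above 1: not a valid probability
```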

Is predictive modeling the same as linear regression? ›

Linear regression is a statistical modeling tool that we can use to predict one variable using another. This is a particularly useful tool for predictive modeling and forecasting, providing excellent insight on present data and predicting data in the future.

How do I determine the best predictor in a linear regression model? ›

There are multiple ways to determine the best predictor. One of the easiest ways is to first look at the correlation matrix, even before you perform the regression. Generally, the variable with the highest correlation is a good predictor.

How do you know if a linear regression model is appropriate? ›

If a linear model is appropriate, the histogram should look approximately normal and the scatterplot of residuals should show random scatter. If we see a curved relationship in the residual plot, the linear model is not appropriate. Another type of residual plot shows the residuals versus the explanatory variable.

How do you know if a linear regression is good? ›

Linear Regression
  1. The residuals must follow a normal distribution.
  2. The residuals are homogeneous; there's homoscedasticity.
  3. There are no outliers in the errors.
  4. There's no autocorrelation in the errors.
  5. There's no multicollinearity between the independent variables.

Why is linear regression good for prediction? ›

Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data.

How do you use regression to predict values? ›

We can use the regression line to predict values of Y given values of X. For any given value of X, we go straight up to the line, and then move horizontally to the left to find the value of Y. This value is called the predicted value of Y and is denoted Y'.

What is the predicted value of a linear regression? ›

The predicted value of y (ŷ) is sometimes referred to as the "fitted value" and is computed as 𝑦̂𝑖 = 𝑏0 + 𝑏1𝑥𝑖.

When should you not use linear regression? ›

[1] To recapitulate, first, the relationship between x and y should be linear. Second, all the observations in a sample must be independent of each other; thus, this method should not be used if the data include more than one observation on any individual.

When can linear regression not be used effectively? ›

Obviously, if the relationship between the variables is not linear, then linear regression is not going to be terribly useful. There are lots of non-linear relationships.

What do you mean by predictive modeling? ›

Predictive modeling is a mathematical process used to predict future events or outcomes by analyzing patterns in a given set of input data. It is a crucial component of predictive analytics, a type of data analytics which uses current and historical data to forecast activity, behavior and trends.

What is the concept of predictive modeling? ›

Predictive modeling is a commonly used statistical technique to predict future behavior. Predictive modeling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.

What is linear predictive model? ›

Linear prediction is a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples. In digital signal processing, linear prediction is often called linear predictive coding (LPC) and can thus be viewed as a subset of filter theory.

What is predictive modeling explained simply? ›

Predictive modeling is a statistical analysis of data done by computers and software with input from operators. It is used to generate possible future scenarios for entities the data used is collected from. It can be used in any industry, enterprise, or endeavor in which data is collected.


Author: Rueben Jacobs
