RAJAT PANCHOTIA · Aug 4, 2020
Regression analysis is a predictive modeling technique that estimates the relationship between two or more variables. Recall that correlation analysis makes no assumption about the causal relationship between two variables. Regression analysis, by contrast, focuses on the relationship between a dependent (target) variable and one or more independent variables (predictors); the dependent variable is assumed to be the effect of the independent variable(s). The values of the predictors are used to estimate, or predict, the likely value of the target variable.
For example, to describe the relationship between diesel consumption and industrial production: if we assume that diesel consumption is the effect of industrial production, we can run a regression analysis to predict the value of diesel consumption for a specific value of industrial production.
STEPS TO PERFORM LINEAR REGRESSION
STEP 1: Assume a mathematical relationship between the target and the predictor(s). The relationship can be a straight line (linear regression), a polynomial curve (polynomial regression), or a non-linear relationship (non-linear regression).
STEP 2: Create a scatter plot of the target variable and the predictor variable (the simplest and most popular way to visualize the relationship).
STEP 3: Find the most likely values of the coefficients in the mathematical formula.
Regression analysis comprises the entire process of identifying the target and predictors, finding the relationship, estimating the coefficients, computing the predicted values of the target, and finally evaluating the accuracy of the fitted relationship.
More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.
For example, suppose we want to estimate each customer's credit card spend in the next quarter. For each customer, we have demographic and transaction data indicating that credit card spend is driven by age, credit limit, and the total outstanding balance on their loans. Using this insight, we can predict future spend based on current and past information.
1. Regression explores significant relationships between the dependent variable and the independent variables
2. It indicates the strength of the impact of multiple independent variables on the dependent variable
3. It allows us to compare the effects of variables measured on different scales, and can consider nominal, interval, or categorical variables for analysis
The equation with one dependent and one independent variable is defined by the formula:
y = c + bx
where y = estimated dependent score, c = constant, b = regression coefficient, and x = independent variable.
For prediction, many regression techniques are available. The choice of technique is mostly driven by three factors:
1. Number of independent variables
2. Type of dependent variables
3. Shape of regression line
Linear Regression
Linear regression is one of the most commonly used predictive modelling techniques. It is represented by the equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑒, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).
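As a minimal sketch of fitting this equation, assuming Python with scikit-learn and synthetic data (the numbers and variable names below are illustrative, not from the article):

```python
# Minimal sketch: fit Y = a + bX + e on synthetic data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))     # predictor values
e = rng.normal(0, 1.0, size=100)          # error term
y = 2.0 + 3.0 * X.ravel() + e             # true intercept a = 2, slope b = 3

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # estimates close to 2 and 3
print(model.predict([[5.0]]))             # predicted Y at X = 5
```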
Logistic Regression
Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
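A hedged sketch of the idea on synthetic data (the binary outcome and predictors are made up for illustration):

```python
# Sketch: logistic regression for a binary dependent variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two continuous predictors
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0])))     # true probability of the event
y = rng.binomial(1, p)                           # binary (0/1) outcome

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)                 # coefficients on the log-odds scale
print(clf.predict_proba(X[:3]))                  # predicted class probabilities
```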
Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below is a polynomial equation: 𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑋². In this regression technique, the best-fit line is not a straight line but a curve that fits the data points.
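A small sketch of such a degree-2 fit, assuming numpy and made-up data:

```python
# Sketch: fit Y = a + bX + cX^2 with numpy.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.5, size=x.size)

c2, c1, c0 = np.polyfit(x, y, deg=2)   # returns coefficients, highest power first
print(c0, c1, c2)                      # estimates close to 1.0, 2.0, 0.5
y_hat = np.polyval([c2, c1, c0], x)    # fitted curve, not a straight line
```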
Ridge Regression
Ridge regression is suited to multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are so large that they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors, in the hope that the net effect is more reliable estimates.
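A minimal sketch with deliberately collinear synthetic predictors (the penalty strength alpha is a free choice here, not a value from the article):

```python
# Sketch: ridge regression when two predictors are nearly identical.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, size=100)   # near-duplicate of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)        # alpha adds the stabilizing bias
print(ridge.coef_)                        # shrunken, more stable estimates
```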
Suppose we have a random sample of 20 students with their height (x) and weight (y), and we need to establish a relationship between the two. One of the first and most basic approaches to fit a line through the data points is to create a scatter plot of (x, y) and draw a straight line that fits the experimental data.
Since there can be multiple lines that fit the data, the challenge lies in choosing the one that fits best. As we already know, the best-fit line can be represented as
ŷᵢ = b₀ + b₁xᵢ
where
- yᵢ denotes the observed response for experimental unit i
- xᵢ denotes the predictor value for experimental unit i
- ŷᵢ is the predicted response (or fitted value) for experimental unit i
When we predict weight using the above equation, the prediction won't be perfectly accurate; it carries some prediction error (or residual error). This can be represented as
eᵢ = yᵢ − ŷᵢ
A line that best fits the data will be one for which the n prediction errors (one for each observed data point, i = 1 to n) are as small as possible in some overall sense.
One way to achieve this goal is to invoke the least squares criterion: minimize the sum of the squared prediction errors.
The equation of the best-fitting line is ŷᵢ = b₀ + b₁xᵢ. We need to find the values of b₀ and b₁ that make the sum of the squared prediction errors the smallest, i.e. that minimize
Q = Σ (yᵢ − (b₀ + b₁xᵢ))², summed over i = 1 to n.
Each coefficient in this equation has a physical interpretation, and it is very important to understand what the regression equation means.
The coefficient b₀, or the intercept, is the expected value of Y when X = 0.
The coefficient b₁, or the slope, is the expected change in Y when X is increased by one unit.
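A worked sketch of these formulas with made-up height/weight numbers (assuming numpy; the standard closed-form least-squares solutions are used):

```python
# Sketch: closed-form least-squares estimates for the height (x) / weight (y) example.
#   b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(150, 190, size=20)                 # 20 hypothetical heights (cm)
y = -100 + 0.95 * x + rng.normal(0, 4, size=20)    # 20 hypothetical weights (kg)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
errors = y - (b0 + b1 * x)                         # prediction errors e_i
print(b0, b1, np.sum(errors ** 2))                 # minimized sum of squared errors
```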
An analyst wants to understand what factors (or independent variables) affect credit card sales. Here, the dependent variable is credit card sales for each customer, and the independent variables are income, age, current balance, socio-economic status, current spend, last month’s spend, loan outstanding balance, revolving credit balance, number of existing credit cards and credit limit. In order to understand what factors affect credit card sales, the analyst needs to build a linear regression model.
The trainee is exposed to a sample dataset comprising telecom customer accounts with their annual income and age, along with their average monthly revenue (the dependent variable). The trainee is expected to fit a linear regression model using annual income as the single predictor variable.
Once we fit a linear regression model, we need to evaluate the accuracy of the model. In the following sections, we will discuss the various methods used to evaluate the accuracy of the model with respect to its predictive power.
The F-test indicates whether a linear regression model provides a better fit to the data than a model that contains no independent variables. It consists of a null and an alternate hypothesis, and the test statistic helps us decide whether to reject the null hypothesis.
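One way to see this in practice is the statsmodels OLS fit, sketched here on synthetic data:

```python
# Sketch: reading the overall F-test from an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.fvalue, res.f_pvalue)   # a small p-value favors the model with predictors
```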
The R-squared value of the model is also called the coefficient of determination. This statistic measures the percentage of variation in the target variable that is explained by the model. R-squared is calculated using the following formula:
R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
R-squared is always between 0 and 100%. As a guideline, the higher the R-squared, the better the model fits. However, the objective is not simply to maximize R-squared, since the stability and applicability of the model are equally important.
Next, check the adjusted R-squared value. Ideally, the R-squared and adjusted R-squared values should be in close proximity to each other. If they are not, the analyst may have overfitted the model and may need to remove the insignificant variables from it.
The trainee is exposed to a sample dataset capturing telecom customer accounts with their annual income and age, along with their average monthly revenue (the dependent variable). The dataset also contains predicted values of average monthly revenue from a regression model. The trainee is expected to apply the calculation of the coefficient of determination.
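A minimal sketch of that calculation (the actual/predicted arrays below are illustrative stand-ins for the average monthly revenue data):

```python
# Sketch: coefficient of determination from actual vs. predicted values.
import numpy as np

y_actual = np.array([52.0, 61.0, 45.0, 70.0, 58.0])
y_pred = np.array([50.0, 63.0, 47.0, 68.0, 57.0])

ss_res = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
print(1 - ss_res / ss_tot)                          # R-squared
```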
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (<0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value can be included in the model because changes in the predictor’s value are related to changes in the response variable.
The trainee is exposed to a sample dataset capturing the flight status of flights, with their delay in arrival along with various possible predictor variables such as departure delay, distance, and air time. The trainee is expected to build a multiple regression model in which all the variables are significant.
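One simple, common approach is to fit the model, inspect the p-values, and drop insignificant predictors; a sketch with hypothetical flight data and column names:

```python
# Sketch: p-value-based variable screening with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "dep_delay": rng.normal(10, 5, 200),     # hypothetical predictors
    "distance": rng.normal(800, 200, 200),
    "air_time": rng.normal(120, 30, 200),
})
df["arr_delay"] = 5 + 0.9 * df["dep_delay"] + rng.normal(0, 3, 200)

X = sm.add_constant(df[["dep_delay", "distance", "air_time"]])
res = sm.OLS(df["arr_delay"], X).fit()
print(res.pvalues)   # keep predictors with p < 0.05 and refit without the rest
```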
We can also evaluate a regression model based on various summary statistics on error or residuals.
Some of them are:
Root Mean Square Error (RMSE): the square root of the average of the squared residuals, as per the formula:
RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
Mean Absolute Percentage Error (MAPE): the average percentage deviation, as per the formula:
MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|
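A quick sketch of both metrics on illustrative arrays:

```python
# Sketch: RMSE and MAPE from actual (y) and predicted (y_hat) values.
import numpy as np

y = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

rmse = np.sqrt(np.mean((y - y_hat) ** 2))       # penalizes large errors more heavily
mape = 100 * np.mean(np.abs((y - y_hat) / y))   # requires nonzero actual values
print(rmse, mape)
```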
Observations are grouped based on predicted values of the target variable. The average of the actual vs. predicted values of the target variable, across the groups, is observed to see if they move in the same direction across the groups (increase or decrease). This is called the rank ordering check.
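A sketch of a rank ordering check with pandas, using hypothetical actual and predicted columns and decile groups:

```python
# Sketch: rank ordering check across deciles of the predicted values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
pred = rng.uniform(0, 100, 500)                 # hypothetical predictions
actual = pred + rng.normal(0, 10, 500)          # hypothetical actuals
df = pd.DataFrame({"actual": actual, "predicted": pred})

df["group"] = pd.qcut(df["predicted"], q=10, labels=False)   # decile groups
print(df.groupby("group")[["actual", "predicted"]].mean())
# Both averages should move in the same direction across the groups.
```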
There are some basic but strong assumptions underlying linear regression model estimation. After fitting a regression model, we should also test the validity of each of these assumptions; a sketch of such checks follows the list below.
- There must be a causal relationship between the dependent and the independent variable(s) which can be expressed as a linear function. A scatter plot of target variable vs. predictor variable can help us validate this.
- The error term of one observation is independent of that of the others; otherwise we say the data has an autocorrelation problem.
- The mean (or expected value) of errors is zero.
- The variance of errors does not depend on the value of any predictor variable. This means, errors have a constant variance along the regression line.
- Errors follow a normal distribution. We can use a normality test on the errors to validate this.
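A sketch of some of these checks on the residuals of a synthetic fit (Durbin-Watson for autocorrelation, Shapiro-Wilk for normality; both are standard choices, not prescribed by the article):

```python
# Sketch: basic residual diagnostics after an OLS fit.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=100)
resid = sm.OLS(y, X).fit().resid

print(resid.mean())             # mean of errors should be ~0
print(durbin_watson(resid))     # ~2 suggests no autocorrelation
print(stats.shapiro(resid))     # normality test on the errors
# For constant variance, plot residuals vs. fitted values and look for even spread.
```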
The trainee is expected to first select the significant variables for the model and then check whether there is any problem of overfitting. If found, the trainee should remove the requisite variable(s) and iterate through the variable selection process.