Linear Regression vs Logistic Regression for Classification Tasks
Published May 7, 2019 · 6 min read
This article explains why logistic regression performs better than linear regression for classification problems, and gives two reasons why linear regression is unsuitable:
- linear regression predicts a continuous value, which is not a probability
- linear regression is sensitive to class imbalance when used for classification
Supervised learning is an essential part of machine learning. It is the task of learning a mapping from input variables to outcome labels from the examples in a training dataset; the learned mapping can then be used to predict the outcome of a new observation. Examples of supervised learning classification tasks are:
- Given a list of passengers who survived and did not survive the sinking of the Titanic, predict if someone might survive the disaster (from Kaggle)
- Given a set of images of cats and dogs, identify if the next image contains a dog or a cat (from Kaggle)
- Given a set of movie reviews with sentiment labels, identify a new review’s sentiment (from Kaggle)
- Given images of hand-drawn digits from 0 to 9, identify the number in a hand-drawn digit image (from Kaggle)
Examples 1 and 2 are binary classification problems, where there are only two possible outcomes (or classes). Examples 3 and 4 are multiclass classification problems, where there are more than two.
Let’s say we create a perfectly balanced dataset (as all things should be) containing a list of customers and a label indicating whether each customer made a purchase. The dataset has 20 customers: 10 customers aged 10 to 19 who purchased, and 10 customers aged 20 to 29 who did not. “Purchased” is a binary label denoted by 0 and 1, where 0 means “customer did not make a purchase” and 1 means “customer made a purchase”.