In-Depth Overview of Logistic Regression Modelling

A Simplified and Detailed Explanation of Everything a Data Scientist Should Know about Logistic Regression Modelling

Samuel Ozechi
13 min read · Jul 23, 2024

This blog post aims to cover the most essential concepts of logistic regression that every data scientist should know, complemented with practical examples. While logistic regression is a vast topic, and a comprehensive exploration could require a book series, our focus here is to provide a clear and concise overview of the fundamental principles and applications.

So let’s go!

Introduction

Logistic regression is a supervised learning algorithm that aims to predict discrete outcomes for observations based on the information provided by their features. This is typically a classification problem.

Discrete classes can be simply explained as categories or groups that are distinct and separate from one another, with no overlap between them. For instance, an animal can be classified into classes such as mammals, reptiles, or birds, groups such as carnivores, herbivores, or omnivores, or types such as domesticated and wild. With relevant features, enough data, and sufficient computing resources, there is practically no limit to the number of classes we can predict.

Classification problems can be binary, multiclass, or multilabel. In binary classification, we predict a single class out of two possible classes for each observation (for example, each email is classified as either spam or not spam). In multiclass classification, the prediction involves selecting from more than two possible outcomes (for example, handwritten digits 0 through 9 are classified into one of ten categories). When we predict more than one class for each observation, the problem is known as multilabel classification (for instance, a movie can be assigned multiple genre labels based on its content and themes, so a romantic comedy might be labelled with both “romance” and “comedy”).

So, whether it’s predicting customer churn or detecting spam emails, logistic regression provides a reliable algorithm for predicting the classes of observations based on their features.

How Logistic Regression Approaches Classification Problems

To explain how logistic regression approaches classification, let’s start with a simple review of Linear regression modelling.

In linear regression modeling, we find the best-fitting straight line (regression line) through the sample points which can be used in estimating a continuous target output (y) based on input features (X).

Example of a simple Linear regression with its best fit line and a sample prediction

Rather than model the relationship between features and continuous targets, imagine that we are instead modelling the relationship between features and discrete targets, for example the features of certain flowers (X) and their specific type (y). Using two features and two classes from the Iris dataset, we can see the class distinctions in a plot of the sepal width against the sepal length of the flowers.

Visualizing features and targets of observations

You can see from the plot that the observations are clearly separated into distinct classes, with the Setosa flowers (occupying the top left) typically having shorter but broader sepals compared to the Versicolor flowers (in the bottom right corner). This is an example of a binary classification problem (the complete Iris dataset has three classes, which makes it a multiclass problem). A straight regression line, however, can neither differentiate the classes in our data nor identify the boundaries for each class. What we need is a function that can separate each class in our dataset, set the class boundaries, and determine the class for new observations based on the segment that their respective features fall into. This is exactly what logistic regression does using the Sigmoid function. Now let’s look at how exactly it achieves this.

The Sigmoid function and its role in logistic regression

The Sigmoid function, also known as the logistic function, maps any real-valued number to a value between 0 and 1. It therefore takes the continuous output of a linear model and transforms it into a probability (a value between 0 and 1) that helps us determine whether an observation falls within one class or the other. Higher probabilities (the standard threshold is >= 0.5) mean that the observation belongs to that class (known as the positive class), while lower probabilities (< 0.5) mean otherwise (i.e. it belongs to the negative class). This is typically done in a two-class (binary) setting where we label observations as belonging to the positive or negative class. We will consider how this works in a multiclass setting later on.

Let’s look at the mathematical formula for the sigmoid function:
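In its simplest form:

sigmoid(z) = 1 / (1 + e^(-z))

where z is any real-valued input and e is Euler’s number.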

To better understand the sigmoid function, let’s plot real-valued numbers (X) between -10 and 10 against their Sigmoid outputs (sigmoid(x)).
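Here is a minimal sketch of how such a plot can be produced with NumPy and Matplotlib (the exact styling of the figure below may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Map any real-valued number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Real-valued inputs between -10 and 10 and their Sigmoid outputs
z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", color="grey")  # the 0.5 decision threshold
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid curve for values between -10 and 10")
plt.show()
```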

Visualization of the Sigmoid curve for numbers between -10 and 10

From the chart above, we can infer the following properties of the sigmoid function:

  1. S-shaped curve: The Sigmoid curve approaches 0 for very large negative values of z, equals 0.5 at z = 0, and approaches 1 for very large positive values of z.
  2. Bounded output: The output of the logistic function is bounded between 0 and 1, which makes it ideal for representing probabilities.

Thus, we can use the probability outputs of the Sigmoid function to determine the probability that a given input belongs to a particular class (0 or 1). The closer the probability is to 1, the more confident we are that the observation belongs to the positive class; the closer it is to 0, the more confident we are that it belongs to the negative class.

Therefore, we can express our logistic regression model as:
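P(y = 1 | x) = sigmoid(z) = 1 / (1 + e^(-z)), where z = w·x + b

Here x is the feature vector, w holds the feature coefficients, b is the intercept, and P(y = 1 | x) is the probability that the observation belongs to the positive class.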

Implementation for Logistic Regression

Before we build a logistic regression model from scratch in Python, let’s write a pseudocode of the logistic regression approach to classification problems:

  1. Select a class as the positive class and the other as the negative class.
  2. Obtain the value of z as a linear combination of the features (x) and the parameters (feature coefficients and intercept).
  3. Apply the Sigmoid function to z to transform the output into the range [0, 1], representing probabilities.
  4. Use the output of the logistic function (Sigmoid(z)) as the probability that an observation belongs to the positive class.
  5. Predict the observation as belonging to the positive class if the probability meets a set threshold (e.g. 0.5), else predict it as belonging to the negative class.

Now, let’s implement the above pseudocode in Python on the Iris dataset. We will use the gradient descent method to optimize the parameters of the linear combination, as described in Linear Regression Modelling.
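The following is a minimal from-scratch sketch of that procedure using NumPy, with Setosa as the negative class and Versicolor as the positive class; the learning rate and number of iterations are illustrative choices rather than the exact values used for the figures below:

```python
import numpy as np
from sklearn.datasets import load_iris

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Two features (sepal length, sepal width) and two classes (Setosa = 0, Versicolor = 1)
iris = load_iris()
mask = iris.target < 2
X = iris.data[mask, :2]
y = iris.target[mask]

# Parameters: one coefficient per feature plus an intercept
weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.1

# Gradient descent on the log-loss
for _ in range(1000):
    z = X @ weights + bias              # linear combination of features and parameters
    probabilities = sigmoid(z)          # probability of the positive class
    error = probabilities - y
    weights -= learning_rate * (X.T @ error) / len(y)
    bias -= learning_rate * error.mean()

# Predict the positive class when the probability meets the 0.5 threshold
predictions = (sigmoid(X @ weights + bias) >= 0.5).astype(int)
print("Training accuracy:", (predictions == y).mean())
```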

Visualizing the different classes with their respective decision boundaries

We can see that the model is able to differentiate between flowers that are of the Setosa type and those that are not. We also see observations that lie between the decision boundaries; such observations would have predicted probabilities close to 0.5 with a decent model.

Now that we understand how logistic regression approaches binary classification, how would this differ for multiclass classification?

Multiclass Classification

For multiclass classification, we can use either the One-vs-Rest (OvR) or One-vs-One (OvO) strategies, or the Softmax function.

One-vs-Rest (OvR) Classification involves training a separate binary logistic regression classifier for each class, treating it as the positive class while the rest forms the negative class, and then selecting the class that is predicted with the highest probability as the final predicted class for that observation.

For a practical example, suppose there are three distinct classes (A, B, and C) in our dataset. We would train three binary classifiers for:

  1. A as the positive class vs. {B, C} as the negative class,
  2. B as the positive class vs. {A, C} as the negative class, and…
  3. C as the positive class vs. {A, B} as the negative class.

Therefore, we would have 3 probabilities per observation. Say we get the probabilities (0.35, 0.78, 0.22) from models A, B, and C respectively; we would have a final prediction of class B for our observation, as it has the highest probability of the three models.

While the OvR approach is fairly useful for multiclass classification, it suffers from potential issues in real-world applications. For example, because the approach trains a separate binary classifier for each class, treating one class as the positive class and the rest as the negative class, it can result in overlapping decision boundaries where multiple classifiers predict an observation as belonging to their respective positive classes, leading to ambiguity in the final class assignment. In our case, imagine we got predictions of (0.78, 0.78, 0.22) from our three models; how then do we ascertain the most probable class for our observation?

Also, each binary classifier in OvR is trained on an imbalanced dataset where one class is the positive class and all other classes combined form the negative class. This imbalance can lead to biased classifiers that may perform well on the dominant negative class but poorly on the positive class. To mitigate the issues related to the OvR approach, we can instead use the One-vs-One approach.

The One-vs-One (OvO) classification approach involves training a separate binary classifier for each pair of classes in a multiclass classification problem. Following our earlier example, if we have three classes (A, B, and C), we would train classifiers for the pairs (A vs B), (A vs C), and (B vs C), using only the data for the respective classes in each pair. Each classifier is responsible for distinguishing between its pair of classes, ignoring the other class(es) during training.

When classifying a new observation, each classifier makes a prediction, effectively voting for one of the two classes it was trained on. The final predicted class is the one that receives the majority of votes across all the binary classifiers. A quick sketch of both strategies follows below.
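Scikit-Learn ships OneVsRestClassifier and OneVsOneClassifier wrappers that can turn any binary classifier into a multiclass one; this sketch applies them to the full three-class Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# OvR: one binary classifier per class (3 classifiers for 3 classes)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# OvO: one binary classifier per pair of classes (also 3 classifiers for 3 classes)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("OvR accuracy:", ovr.score(X_test, y_test))
print("OvO accuracy:", ovo.score(X_test, y_test))
```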

While the OvO approach is more precise and provides well-defined decision boundaries compared to the OvR approach, it comes with increased computational complexity and prediction time due to the need to train and evaluate a larger number of classifiers (k(k − 1)/2 classifiers for k classes). This is where Softmax classification comes in.

Softmax classification is usually the most desired approach for multiclass classification as it directly generalizes logistic regression to multiple classes by providing consistent probability estimates and a unified decision algorithm, ensuring that the probabilities sum up to 1.

Softmax is a generalization of logistic regression that maps a k-dimensional vector of arbitrary real values (z) to a k-dimensional vector of values bounded in the range (0, 1), which serve as the probabilities of an observation belonging to each of the multiple classes. In simpler terms, it allows us to achieve the same probability bounds of (0, 1) that we had for binary classes for multiple classes as well.

Remember that using the sigmoid function, we get a probability output that an observation belongs to the positive class, with its complement (1 − sigmoid(z)) being the probability that it does not belong to the positive class (i.e. it belongs to the negative class). For softmax classification, we get k probabilities that sum up to 1, with k being the number of distinct classes in our dataset. Our prediction then becomes the class with the highest probability.

Following our earlier example, imagine we get an output of (0.16, 0.72, 0.12) from our softmax classifier for the classes A, B, and C respectively. We can then predict the observation as belonging to class B, as it has the highest probability of the three. Unlike the output of the OvR approach, notice that our probabilities sum up to 1, as the softmax function normalizes its output.

The formula for the softmax function is given as:
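softmax(z_i) = e^(z_i) / (e^(z_1) + e^(z_2) + ... + e^(z_k)), for i = 1, ..., k

where z_i is the linear score for class i and k is the number of classes. Dividing by the sum of the exponentiated scores is what normalizes the k outputs so that they sum to 1.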

Here is an implementation of multiclass classification using the softmax function; notice the differences from the sigmoid-based implementation.
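Below is a minimal from-scratch sketch of softmax (multinomial) logistic regression on the full Iris dataset; as before, the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris

def softmax(scores):
    """Normalize each row of scores into probabilities that sum to 1."""
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))  # subtract the max for numerical stability
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

iris = load_iris()
X, y = iris.data, iris.target
n_samples, n_features = X.shape
n_classes = len(np.unique(y))

# One column of weights (and one intercept) per class instead of a single vector
weights = np.zeros((n_features, n_classes))
bias = np.zeros(n_classes)
y_one_hot = np.eye(n_classes)[y]
learning_rate = 0.1

for _ in range(2000):
    scores = X @ weights + bias          # one linear score per class
    probabilities = softmax(scores)      # k probabilities per observation, summing to 1
    error = probabilities - y_one_hot
    weights -= learning_rate * (X.T @ error) / n_samples
    bias -= learning_rate * error.mean(axis=0)

# The predicted class is the one with the highest probability
predictions = softmax(X @ weights + bias).argmax(axis=1)
print("Training accuracy:", (predictions == y).mean())
```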

3D Visualization of Multiclass Classification

The Scikit-Learn package provides a simple, yet robust and efficient implementation of logistic regression that includes support for both binary and multiclass classification, along with various options for regularization and performance evaluation, making it a versatile tool for practical machine learning applications. Let’s apply it to our sample problem. This time we will use two principal components of our data so that we can visualize the decision boundaries in 2D.
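A sketch of that workflow might look like the following; the grid resolution and plot styling are arbitrary choices:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reduce the four Iris features to two principal components for a 2D view
iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(X_2d, iris.target, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Evaluate every point of a fine grid to draw the decision regions
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target, edgecolor="k")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Decision boundaries for the Iris dataset using logistic regression")
plt.show()
```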

Decision Boundaries for the Iris Dataset using Logistic Regression

From the plot, we can clearly see that the observations are separated into the different classes they belong to with decision boundaries that show the separation between each class.

While the decision boundary plot gives us a good visual sense of how well our model is able to identify the class that each flower belongs to based on its features, there are a number of classification metrics that provide a more quantitative and detailed assessment of the model’s performance. As a last step, let’s look at the evaluation metrics that are generally useful for evaluating classification models.

Evaluation Metrics For Classification Models

  1. Accuracy Score: Accuracy measures the proportion of correctly classified instances out of the total instances. It is calculated as the number of correctly predicted instances divided by the total number of instances.
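A minimal sketch of this on the binary (Setosa vs. Versicolor) problem, with an illustrative train/test split, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Binary subset of Iris: Setosa (negative class) vs. Versicolor (positive class)
iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```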

We achieve an accuracy score of 100%, which indicates that we correctly predict the classes of all observations in the test set.

2. Confusion Matrix: A confusion matrix is a table that summarizes the performance of a classification model on a set of test data. It displays the counts of true positives, true negatives, false positives, and false negatives.
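Reusing y_test and y_pred from the accuracy sketch above, the matrix can be computed and plotted as follows:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows correspond to the actual classes, columns to the predicted classes
print(confusion_matrix(y_test, y_pred))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["Setosa", "Versicolor"])
plt.show()
```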

Confusion Matrix showing the Positive and Negative Predictions

If we regard the Versicolor flower type as our positive class and the Setosa type as the negative class, we can interpret our confusion matrix thus:

a. True Negative: Flowers of Setosa type that were rightly predicted as such: 10

b. True Positives: Flowers of Versicolor type that were rightly predicted as such: 9

c. False Positives: Flowers of Versicolor type that were wrongly predicted as Setosa: 0

d. False Negative: Flowers of Setosa type that were wrongly predicted as Versicolor: 0

3. Precision, Recall and F1 Score: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It is calculated as the number of true positives divided by the sum of true positives and false positives. Precision is important when the cost of false positives is high.

Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset. It is calculated as the number of true positives divided by the sum of true positives and false negatives. Recall is important when the cost of false negatives is high.

Precision and recall often have a trade-off relationship influenced by the set decision threshold value. A lower decision threshold increases recall but may decrease precision, as more false positives are included. Conversely, a higher decision threshold increases precision but may decrease recall, as more false negatives are included.

The F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is useful when the classes are imbalanced. The F1 score is calculated as:

F1 score = 2 * (precision * recall) / (precision + recall)

We can easily visualize the precision, recall and F1 score of our model using the classification report in Scikit-Learn.
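Again reusing y_test and y_pred from the accuracy sketch:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=["Setosa", "Versicolor"]))
```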

Classification Report showing the precision, recall and F1 Score for the logistic model

4. Receiver Operating Characteristic (ROC) Curve and Area Under the ROC Curve (AUC): The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values. It is useful for understanding how the model’s trade-off between true positives and false positives changes with the decision threshold. A desirable model maintains a high true positive rate and a low false positive rate across a wide range of thresholds. Let’s visualize the ROC curve for our model.
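One way to do this, reusing the fitted model and test split from the accuracy sketch, is with Scikit-Learn’s RocCurveDisplay and roc_auc_score:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Probability of the positive (Versicolor) class for each test observation
y_scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_scores))
RocCurveDisplay.from_predictions(y_test, y_scores)
plt.show()
```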

Visualizing the ROC Curve and AUC value for our model

We see from the orange line that our model is perfect regardless of the threshold value.

The AUC represents the area under the ROC curve and provides a measure of the model’s ability to discriminate between positive and negative classes. A higher AUC indicates better performance and an AUC of 1 represents a perfect classifier.

Conclusion

In this article, we provided an in-depth overview of logistic regression modelling, covering key concepts essential for data scientists. We began by understanding the problem logistic regression aims to solve, specifically classification tasks. We looked into the theoretical foundations of logistic regression, exploring the logistic function and how it helps predict probabilities for binary and multiclass classification problems.

We walked through a simple example and implemented logistic regression from scratch using Python, highlighting the underlying mathematical principles. Additionally, we discussed various evaluation metrics such as accuracy, precision, recall, F1 score, and the confusion matrix, which are crucial for assessing the performance of classification models. Finally, we demonstrated a practical application using the Iris dataset and the Scikit-Learn library, visualizing decision boundaries and interpreting the results.

I encourage you to further experiment with different datasets, adjust threshold values to balance precision and recall, and explore advanced topics such as regularization and feature selection to enhance model performance. The journey of mastering logistic regression opens doors to more complex machine learning algorithms and a deeper understanding of predictive modelling.
