Logistic Regression
From historical intuition to mathematical mastery. A beginner-friendly, visually interactive deep dive into the most fundamental classification algorithm.
Understanding the journey from step functions to smooth probability curves.
Imagine you are a doctor in the 1800s. You have patient data - age, blood pressure, symptoms - and you need to predict a simple outcome: will the patient survive or not? This is a binary classification problem. The answer is either 0 (no) or 1 (yes).
Early statisticians faced exactly this challenge. They needed a mathematical function that could take continuous inputs and produce a binary output. Their first attempt was brilliantly simple.
Named after Oliver Heaviside (1850-1925), the step function was the natural first choice. The idea is straightforward:
You compute a weighted sum of your inputs $z = w^Tx + b$, and if it is positive, you predict class 1. If negative, class 0. Simple, clean, intuitive.
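As a minimal sketch (the weights, bias, and patient values below are made up purely for illustration), the rule is just a sign check on the weighted sum:

```python
import numpy as np

def step_predict(x, w, b):
    """Heaviside-style classifier: class 1 if the weighted sum is positive."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Two hypothetical patients with nearly identical measurements
w, b = np.array([0.5, -0.2]), -1.0
print(step_predict(np.array([2.1, 0.3]), w, b))  # 0 (z = -0.01, just below threshold)
print(step_predict(np.array([2.2, 0.3]), w, b))  # 1 (z = +0.04, just above threshold)
```

Note how two nearly identical inputs land on opposite sides of the threshold and receive opposite predictions, which is exactly the problem discussed next.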
Think about it this way: if a patient has values just slightly on one side of the threshold versus the other, the step function gives completely different predictions with no nuance. Worse, its derivative is zero everywhere (and undefined at the jump), so gradient-based optimization has nothing to work with. What we really need is:
Outputs between 0 and 1, representing confidence levels
Smooth everywhere so calculus-based optimization works
Higher inputs should give higher probability
The solution came from the world of statistics and population growth models. The logistic function (sigmoid) was first studied by Pierre-François Verhulst in the 1830s-1840s while modeling population growth in Belgium.
This elegant function solved all three problems simultaneously:
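A quick numerical sketch (sample inputs chosen arbitrarily) shows all three properties at once:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
# p stays strictly between 0 and 1, increases monotonically with z,
# and sigmoid(0) is exactly 0.5 -- the point of maximum uncertainty
print(p)
```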
Why linear regression fails for classification and how logistic regression fixes it.
Linear regression predicts a continuous value:
This works great for predicting house prices, temperatures, or stock values. But what happens when we try to use it for classification?
For classification, we need the output to be a valid probability: a number between 0 and 1. Instead of predicting the class directly, we predict the probability of belonging to class 1:
The key insight of logistic regression is to pass the linear combination through the sigmoid function:
The mathematical beauty of logistic regression comes from the concept of odds:
Ranges from 0 to 1. Example: 0.8 probability of passing an exam.
Ratio of success to failure. Ranges from 0 to infinity. Same example: 4:1 odds of passing.
Take the logarithm of odds. Ranges from negative infinity to positive infinity. Now we can model it linearly!
Logistic regression models the log-odds as a linear function of the features. The logit transformation converts a bounded probability (0,1) into an unbounded real number (-∞, +∞), which can be naturally expressed as a linear combination of features.
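The three quantities above can be computed directly, using the exam example from the text:

```python
import math

p = 0.8                    # probability of passing the exam
odds = p / (1 - p)         # 4.0 -> the "4:1 odds" of passing
log_odds = math.log(odds)  # ~1.386: unbounded, so it can be modeled linearly
print(odds, log_odds)
```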
The decision boundary is the line (or hyperplane in higher dimensions) where the model is equally uncertain - where $P(y=1|x) = 0.5$.
Since $\sigma(z) = 0.5$ when $z = 0$, the decision boundary is defined by:
This is a linear equation - which is why logistic regression produces linear decision boundaries despite being a nonlinear model.
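A tiny check (the weight vector and point here are invented for illustration) confirms that any point satisfying $w^Tx + b = 0$ receives probability exactly 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([2.0, -1.0]), 0.5
# This point lies exactly on the line 2*x1 - x2 + 0.5 = 0
x_on_boundary = np.array([1.0, 2.5])
print(sigmoid(w @ x_on_boundary + b))  # 0.5 -- the model is maximally uncertain
```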
Watch how the decision boundary separates two classes. Adjust the weight and bias sliders below.
The complete mathematical framework, derived step by step.
We start with a linear combination of features:
Where $w = [w_1, w_2, \ldots, w_n]^T$ is the weight vector, $x = [x_1, x_2, \ldots, x_n]^T$ is the feature vector, and $b$ is the bias term.
We pass the linear combination through the sigmoid (logistic) function:
Adjust the slope parameter to see how the sigmoid changes shape.
Midpoint is always at 0.5
Approaches 1 for large positive inputs
Approaches 0 for large negative inputs
Beautiful derivative expressed in terms of itself: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
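The derivative identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be verified numerically against a central finite difference (the test point 0.7 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1 - sigmoid(z))                   # sigma'(z) via the identity
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
print(abs(analytic - numeric))  # ~0: the two agree
```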
Combining the linear model with the sigmoid gives us the logistic regression model:
And consequently:
The odds of an event is the ratio of the probability of the event occurring to the probability of it not occurring:
If we substitute our logistic model, something beautiful happens:
Taking the natural logarithm of the odds:
The logit function is the inverse of the sigmoid function. By applying it to the probability, we transform the nonlinear relationship back into a linear one. This is why logistic regression is called a generalized linear model - the log-odds are a linear function of the features, even though the probabilities themselves are not.
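The inverse relationship is easy to confirm in a couple of lines (the value 1.7 is an arbitrary test input):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of p: the inverse of the sigmoid."""
    return math.log(p / (1 - p))

z = 1.7
print(logit(sigmoid(z)))  # recovers 1.7, up to floating-point error
```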
Deriving the Binary Cross-Entropy loss from Maximum Likelihood Estimation.
In linear regression, we minimize the Mean Squared Error (MSE):
If we try using MSE with the sigmoid output, we get a non-convex loss surface with many local minima. Gradient descent would get stuck and fail to find the optimal solution.
Since our output is binary (0 or 1), each data point follows a Bernoulli distribution:
Where $p = \sigma(w^Tx + b)$ is the predicted probability. When $y = 1$, this gives $p$. When $y = 0$, this gives $1-p$.
We want to find the weights that make our observed data most probable. The likelihood of the entire dataset (assuming independent samples) is:
Products are numerically unstable and hard to optimize. Taking the logarithm converts the product to a sum:
Since we want to minimize a loss (convention in optimization), we negate the log-likelihood:
This is the Binary Cross-Entropy (BCE) Loss, also known as the Log Loss.
When the true label is $y = 1$: predicting $p = 0.99$ (very confident and correct) gives loss $-\ln(0.99) \approx 0.01$ (small). Predicting $p = 0.01$ (very confident but wrong) gives loss $-\ln(0.01) \approx 4.6$ (huge penalty!).
When the true label is $y = 0$: predicting $p = 0.01$ (correctly leaning toward class 0) gives a small loss, while predicting $p = 0.99$ (wrongly leaning toward class 1) gives a huge one.
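The asymmetry of the penalty is easy to see by evaluating the BCE formula on the example values from the text:

```python
import math

def bce(y, p):
    """Binary cross-entropy for a single sample with true label y and predicted p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(1, 0.99), 2))  # 0.01 -- confident and correct: tiny loss
print(round(bce(1, 0.01), 2))  # 4.61 -- confident and wrong: huge loss
```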
The Binary Cross-Entropy loss is convex with respect to the weights. This means there is a single global minimum, and gradient descent is guaranteed to find it (with proper learning rate).
Drag to rotate. The bowl shape confirms convexity - there is only one minimum.
Why logistic regression requires iterative optimization.
In linear regression, we can solve directly:
But in logistic regression, the sigmoid function makes the equation nonlinear in the weights. When we try to set the gradient to zero:
We cannot isolate w analytically because the sigmoid introduces exponential terms that cannot be algebraically inverted when combined with the sum over all data points.
Instead, we use iterative methods that take small steps toward the optimal solution:
Simple, works well for large datasets
Uses one sample at a time, faster per iteration
Uses second-order info (Hessian), converges faster
Reformulates as weighted least squares at each step
Step-by-step gradient derivation and convergence analysis.
Let us derive the gradient of the Binary Cross-Entropy loss with respect to the weights. This is where the beautiful derivative of the sigmoid pays off.
The gradient has an incredibly simple form: it is the prediction error $(p_i - y_i)$ multiplied by the input features $x_i$, averaged over all samples. Larger errors lead to larger weight updates, and the update direction is determined by the features.
Where $\eta$ is the learning rate.
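The gradient $(p_i - y_i)x_i$ and the update rule with learning rate $\eta$ can be sketched end-to-end. The dataset below is synthetic (labels come from the sign of $x_1 + x_2$), chosen only so the result is easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data: class 1 iff x1 + x2 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)  # mean of (p_i - y_i) * x_i
    grad_b = np.mean(p - y)
    w -= eta * grad_w                # step against the gradient
    b -= eta * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(acc)  # should be near 1.0 on this easily separable dataset
```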
The BCE loss is convex because the Hessian matrix (matrix of second derivatives) is positive semi-definite:
Where $S$ is a diagonal matrix with $S_{ii} = p_i(1-p_i) > 0$. For any vector $v$, $v^T X^T S X v = (Xv)^T S (Xv) \geq 0$ because every diagonal entry of $S$ is positive, so the Hessian is positive semi-definite.
This guarantees a unique global minimum that gradient descent will find.
Watch gradient descent optimize the loss. Adjust the learning rate to see how it affects convergence.
Very slow convergence. May take thousands of iterations to reach the minimum.
Smooth, efficient convergence. The sweet spot for most problems.
Overshoots the minimum, may oscillate or diverge entirely.
Hands-on interactive components to build deep intuition.
Click on the canvas to add data points (left-click for class 0, right-click for class 1). The decision boundary updates in real-time as the model trains.
Visualize the logistic regression surface in 3D. The sigmoid creates a smooth probability surface over the 2D feature space.
Watch how the loss decreases during training iterations.
When to use logistic regression and when to look for alternatives.
Unlike many classifiers, logistic regression directly outputs calibrated probabilities. You know not just the prediction but how confident the model is.
The loss function is convex, guaranteeing that gradient descent finds the global optimum. No worrying about local minima.
Each weight directly tells you how much that feature contributes to the log-odds. A one-unit increase in feature $x_j$ changes the log-odds by $w_j$.
When the classes can be separated by a hyperplane, logistic regression performs excellently and efficiently.
Without manual feature engineering (polynomial features, interactions), logistic regression can only create linear decision boundaries.
Extreme values in the features can significantly influence the learned weights and decision boundary placement.
The fundamental assumption that log-odds are linear in features may not hold for complex real-world relationships.
While it can handle correlated features, multicollinearity can make coefficient estimates unstable and hard to interpret.
Extending logistic regression to more powerful frameworks.
To prevent overfitting and improve generalization, we add a penalty term to the loss function:
Shrinks all weights toward zero but never exactly to zero. Good when all features are potentially relevant.
Can drive weights exactly to zero, performing automatic feature selection. Good for sparse models.
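Both penalties are one-line additions to the BCE loss. A minimal sketch (the regularization strength `lam` and the tiny dataset are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

lam = 0.1  # regularization strength (a hyperparameter, chosen arbitrarily here)

def l2_loss(w, b, X, y):
    return bce(w, b, X, y) + lam * np.sum(w ** 2)    # Ridge: shrinks weights toward zero

def l1_loss(w, b, X, y):
    return bce(w, b, X, y) + lam * np.sum(np.abs(w)) # Lasso: can zero weights out

X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([1.0, 0.0])
w, b = np.array([0.3, -0.2]), 0.1
print(l2_loss(w, b, X, y) > bce(w, b, X, y))  # the penalty only adds to the loss
```

Note that the bias $b$ is conventionally left unpenalized, as above.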
When we have more than two classes, we extend logistic regression using the softmax function:
This is also called Multinomial Logistic Regression. The loss function becomes the Categorical Cross-Entropy:
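A compact sketch of both pieces (the three class scores are invented for illustration):

```python
import numpy as np

def softmax(z):
    """Generalizes the sigmoid to K classes; outputs are positive and sum to 1."""
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def categorical_ce(y_onehot, p):
    """Categorical cross-entropy: -log probability assigned to the true class."""
    return -np.sum(y_onehot * np.log(p))

z = np.array([2.0, 1.0, 0.1])  # model scores for 3 classes
p = softmax(z)
print(p.sum())                                  # 1.0: a valid distribution
print(categorical_ce(np.array([1, 0, 0]), p))   # loss when the true class is class 0
```

For $K = 2$ the softmax with scores $[z, 0]$ reduces exactly to the sigmoid, recovering binary logistic regression.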
Logistic regression is literally a single-layer neural network with a sigmoid activation:
When we stack multiple layers of logistic units, we get a deep neural network. The building blocks are the same - logistic regression is the foundation upon which all of deep learning is built.
Logistic regression can also be derived from the principle of maximum entropy. Among all probability distributions that satisfy the constraints imposed by the data, logistic regression chooses the one with the highest entropy (the least additional assumptions).
This is why logistic regression is sometimes called a Maximum Entropy Classifier (MaxEnt) in natural language processing.