Logistic Regression from theory to code implementation

Imagine you are building a document filter that takes documents as input and decides whether they are fraudulent or not. You need a model that doesn't just predict yes or no, but gives you a probability, like "this document is 40% likely to be fraudulent".

Logistic Regression is perfect for this kind of problem.
In this post, we'll break down the math behind *logistic regression* step by step. No scary equations, just clear, intuitive explanations — with a little help from Python code along the way!

What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for categorical classification based on a threshold value between 0 and 1 (call it a threshold probability if you wish). This could involve predicting whether something belongs to one of two categories (binary classification) or to one of many discrete categories (multiclass classification). Examples include classifying emails as spam or not spam, or classifying people as sick or healthy, infected or not infected, to name a few.
While it has "regression" in the name, logistic regression is actually about classification, not predicting a continuous number like standard linear regression does.

But regression is the basis for this classification, since the decision is made using continuous values between 0 and 1.

Mathematical background

1. Probability: A probability is just a number between 0 and 1 that tells us the likelihood of an event occurring. 0 means impossible, 1 means certain, and 0.6 means a 60% chance of occurring.

2. Odds: Odds are just another way of expressing probability. The ratio of the probability of success (p) to the probability of failure (1 - p) is called the odds for; the reverse, the probability of failure (1 - p) divided by the probability of success (p), is called the odds against.

From this point on, I will only be talking about the odds for.

\text{Odds} = \frac{p}{1 - p}

Odds range from 0 to infinity.

odds > 1 means success is more likely
odds < 1 means failure is more likely
odds = 1 means 50-50 chance for failure and success

Odds can only take positive values, and we need a way to map these numbers onto a set that ranges from -infinity to +infinity.
Logarithms are perfect for solving this problem.

3. log(Odds), or the logit function:
So we take the logarithm of odds, called the log-odds (or logit).
Formula:

\log(\text{Odds}) = \text{logit}(p) = \log\left(\frac{p}{1 - p}\right)
if p > 0.5 logit(p) is positive
if p < 0.5 logit(p) is negative
if p = 0.5 logit(p) = 0
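
To make the probability → odds → log-odds chain concrete, here is a small NumPy sketch (the probability values are just illustrative):

import numpy as np

p = np.array([0.1, 0.5, 0.6, 0.9])   # some example probabilities

odds = p / (1 - p)                   # odds for: ranges from 0 to infinity
log_odds = np.log(odds)              # logit: ranges from -infinity to +infinity

print(odds)      # roughly [0.11, 1.0, 1.5, 9.0]
print(log_odds)  # negative below p = 0.5, zero at p = 0.5, positive above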

Why are we doing all this?
The linear part of a model like logistic regression produces outputs from -infinity to +infinity,
but probabilities are restricted between 0 and 1.
So we model the log-odds (which stretch across the whole real number line) as a linear function of the inputs.

That is:

\log\left(\frac{p}{1 - p}\right) = w \cdot x + b



where w (the weights) and x (the input features) are vectors, and b is the bias.

Here, log is the natural logarithm (base e, where e ≈ 2.718...).

If we know that z = w · x + b gives the log-odds, how do we get the probability p back?

The Sigmoid function:

This function takes any number between -infinity and +infinity and maps it to a number between 0 and 1 (a probability).
It is obtained by solving for p in the logit equation.

Take z = w · x + b; then

\sigma(z) = \frac{1}{1 + e^{-z}}

If z is large and positive, the output will be close to 1.
If z is large and negative, the output will be close to 0.
If z = 0, the output will be 0.5.
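
For completeness, here is how the sigmoid falls out of the logit equation (a short derivation, solving the log-odds model for p):

\log\left(\frac{p}{1 - p}\right) = z
\;\Rightarrow\; \frac{p}{1 - p} = e^{z}
\;\Rightarrow\; p = e^{z}(1 - p)
\;\Rightarrow\; p(1 + e^{z}) = e^{z}
\;\Rightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \sigma(z)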

To see what it looks like:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

#create 100 numbers from -10 to 10
z = np.linspace(-10, 10, 100)

#Plot in a graph
plt.plot(z, sigmoid(z))
plt.title("Sigmoid Function")
plt.xlabel("z")
plt.ylabel("σ(z)")
plt.grid()
plt.show()

The Decision Boundary

Once we have the output probability ŷ = σ(z), how do we decide the class?

Easy:

  • If ŷ ≥ 0.5, predict class 1 (positive)
  • If ŷ < 0.5, predict class 0 (negative)

The decision boundary is where σ(z) = 0.5, which happens when z = 0. So the equation w · x + b = 0 defines the decision boundary — a straight line (or a hyperplane in higher dimensions).
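
In code, the decision rule is just a threshold on the sigmoid output. Here is a minimal sketch, reusing the sigmoid function from above; the weights, bias, and input are made-up values for illustration only:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model (illustration only)
w = np.array([1.5, -2.0])
b = 0.5

x = np.array([0.8, 0.3])              # one example input
y_hat = sigmoid(np.dot(w, x) + b)     # predicted probability

prediction = 1 if y_hat >= 0.5 else 0
print(y_hat, prediction)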

Cost Function: Measuring How Bad Our Predictions Are.

We need a cost function to measure how bad our model's predictions are.
In linear regression, we used Mean Squared Error (MSE).
But in logistic regression, MSE doesn’t work well because of the sigmoid's non-linear nature — it causes messy, non-convex optimization.

Instead, we use Log-Loss (aka Cross-Entropy Loss)

The cross-entropy loss tells us how good our prediction ŷ is compared to the true label y.

The formula is:

L(\hat{y}, y) = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)
  • y is a label, a class, a category (0 or 1)

  • If the true label y = 1, we want ŷ to be close to 1.

  • If y = 0, we want ŷ to be close to 0.

  • The closer the prediction is to the truth, the smaller the loss!

For multiple examples, we just average the losses:

\text{Average loss} = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)

where m is the number of examples.

Goal: Make the loss as small as possible by adjusting the model!
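
Here is a minimal NumPy sketch of this average loss; the small clipping value is an assumption added to avoid taking log(0):

import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Clip predictions so we never take log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])      # example predicted probabilities
print(binary_cross_entropy(y_pred, y_true))  # small when predictions match the labels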


Optimization: Finding the Best Weights That Minimize the Loss

We want to minimize the total loss over all data points.
We'll do this using gradient descent.
The gradient is defined using partial derivatives.
The gradient at any point in space tells you the direction you should follow if you want to reach the highest point in that space as fast as possible.
Gradient descent does the opposite: it moves in the fastest direction towards the bottom.

How is this achieved?

1. Compute the gradient of the loss with respect to each parameter (w and b).
2. Update the parameters a little bit in the direction opposite to the gradient (downhill!).

Update Rules (Gradient Descent)

After computing the gradients, we update the weights and bias like this:

  • Weight update:

    w := w - \alpha \frac{\partial L}{\partial w}
  • Bias update:

    b := b - \alpha \frac{\partial L}{\partial b}

where:

  • L is the loss
  • α (alpha) is the learning rate — it controls how big the update steps are.
# Assume we have loss gradient dw, db
w = w - learning_rate * dw
b = b - learning_rate * db
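
Putting the pieces together, here is a minimal from-scratch training loop. It is a sketch that assumes a NumPy feature matrix X of shape (m, n) and a 0/1 label vector y of length m; the gradient expressions come from differentiating the average cross-entropy loss:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=1000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        z = X @ w + b                    # linear part: w.x + b for every example
        y_hat = sigmoid(z)               # predicted probabilities
        dw = (X.T @ (y_hat - y)) / m     # gradient of the average loss w.r.t. w
        db = np.mean(y_hat - y)          # gradient of the average loss w.r.t. b
        w -= learning_rate * dw          # step opposite to the gradient
        b -= learning_rate * db
    return w, b

In practice you would also track the average loss every few epochs to check that it is actually decreasing.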

If you have more than two categories (e.g., cat, dog, rabbit), you extend logistic regression into Softmax Regression (a.k.a. Multinomial Logistic Regression).

Softmax generalizes sigmoid to handle multi-class classification.
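
As a small taste of that extension, here is a minimal NumPy sketch of the softmax function (it is not used in the rest of this post):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for cat, dog, rabbit
print(softmax(scores))               # three probabilities that sum to 1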

Now that we fully understand the math behind logistic regression — from the linear model (w · x + b) to the sigmoid function σ(z) and the log-loss — it's time to bring it to life with real-world data.

We'll start by implementing logistic regression using Scikit-learn, a popular machine learning library that makes applying models incredibly easy.
After that, we'll also build the same model using TensorFlow/Keras to show how logistic regression fits naturally into deep learning workflows.

Even though libraries like Scikit-learn and TensorFlow handle all the math for us under the hood — like computing the log-odds, applying the sigmoid function, and minimizing the cross-entropy loss — understanding the math gives us intuition about what the model is doing behind the scenes.

We'll use the Pima Indians Diabetes Dataset, a famous dataset where the goal is to predict whether a patient has diabetes based on medical information like glucose level, BMI, age, and more.

I) Implementation using Scikit-learn

Step 1: Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Load the Dataset

# Load dataset
data = pd.read_csv('diabetes.csv')

# View the first few rows
print(data.head())

Step 3: Prepare the Data

# Split into features (X) and labels (y)
X = data.drop('Outcome', axis=1)  # 'Outcome' is the target
y = data['Outcome']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 4: Train the Model

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

Step 5: Evaluate the Model

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

II) Implementation using TensorFlow/Keras

Step 1: Import Libraries

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Step 2: Build the Model

model = Sequential([
    Dense(1, activation='sigmoid', input_shape=(X_train.shape[1],))
])

Explanation:
Only 1 neuron because it’s binary classification.
Sigmoid activation because we want output probabilities between 0 and 1.

Step 3: Compile the Model

model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

Explanation:
Adam optimizer for efficient gradient descent.
Binary Crossentropy because it’s binary classification.

Step 4: Train the Model

history = model.fit(
                 X_train, 
                 y_train, 
                 epochs=100,
                 batch_size=32, 
                 validation_split=0.2, 
                 verbose=1
                 )

Epochs = 100: Train 100 passes through the dataset.
Batch size = 32: Process 32 samples at a time.

Step 5: Evaluate the Model

# Evaluate on the test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Test Accuracy:", accuracy)

As we can see, both Scikit-learn and TensorFlow make it incredibly easy to build a logistic regression model.
While Scikit-learn is perfect for quick classical machine learning models, TensorFlow shines when you want to extend logistic regression into deep learning architectures later on.
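
To connect the library models back to the math, both expose the learned parameters w and b. Here is a quick optional check on the Keras model trained above (scikit-learn stores the same quantities in coef_ and intercept_):

# The single Dense layer holds the learned weight vector w and bias b
w_learned, b_learned = model.layers[0].get_weights()
print("w shape:", w_learned.shape)   # one weight per input feature
print("b:", b_learned)               # a single bias value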

Visualizing model performance

To better understand how our logistic regression model performed, let's visualize the confusion matrix and also take a look at the model's learning curves over the training process.

1. Plotting the Confusion Matrix (Scikit-learn)
Confusion matrices help you see not just accuracy, but where the model is making mistakes.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Scikit-learn Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

2. Plotting Training History (TensorFlow/Keras)
TensorFlow gives you the history object, which tracks loss and accuracy during training.
Let’s plot how the model learned over time:

# Plot training & validation accuracy values
plt.figure(figsize=(12,5))

# Accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy over Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')

# Loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss over Epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')

plt.tight_layout()
plt.show()