Getting a Grip on Linear Algebra for Data Science: It's Not as Scary as It Sounds!

Alright, so you're diving into data science, and everyone keeps talking about linear algebra. Don't sweat it! It's a super fundamental part of this field, and honestly, once you get a handle on a few key ideas, a lot of the data science stuff starts making more sense. Usually, a full linear algebra course is pretty long, like 36 hours! But we're just going to grab the most important bits, the ones you'll actually use in data science, especially for what we're covering here. We'll keep it simple, explain the ideas without getting too formal, but definitely no hand-waving either – we'll explain things properly, just in a way that's easy to digest.

When we talk data science, a big part of it is representing your data and then figuring out what that data is really telling you. Like, how many different things are actually important in your data, and are some things related to each other? Linear algebra gives us the tools to answer these questions, which is super handy before you even get to the fancy machine learning algorithms.

First Off: How Do We Even Organize Our Data? Meet the Matrix!

When you're dealing with data in data science, figuring out how to arrange it is a big deal. And guess what? Data is usually represented in this thing called a matrix. Think of a matrix as just a neat way to put your data into rows and columns. It's basically a rectangular grid.

Imagine you're an engineer checking on a factory reactor. You're getting readings from sensors – pressure, temperature, how thick something is (density), maybe viscosity too. And you're taking these readings over and over, maybe a thousand times. How do you keep all this info straight so you can use it later? A matrix is perfect for this. You can make each column a different measurement (pressure, temperature, density, viscosity) and each row one of your measurement times, one of your "samples." So, if you took 1000 sets of readings for 4 different things, you'd have a 1000-row, 4-column matrix. The number in the first row, first column? That's the pressure at your first measurement. The number in the 500th row, second column? That's the temperature at your 500th measurement. Easy, right? We'll usually stick to this way: rows for samples, columns for what you measured (the variables or attributes).

Let's look at a simple example of this kind of data matrix from our reactor scenario, but with just 3 samples:


A = [[2.5, 120, 1.2, 3.7],   # Sample 1: Pressure, Temperature, Density, Viscosity
     [5.0, 240, 2.4, 7.4],   # Sample 2: Pressure, Temperature, Density, Viscosity  
     [3.0, 180, 1.8, 5.5]]   # Sample 3: Pressure, Temperature, Density, Viscosity

Matrices aren't just for data. Sometimes, they can also represent equations. If you have a bunch of linear equations, you can put the coefficients of the variables into a matrix. This lets you use linear algebra tools to work with and solve those equations.
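For example, here's a minimal sketch of that idea (the system and numbers below are made up for illustration, not taken from the reactor data): the coefficients of two equations go into a matrix, and NumPy solves the system.

# Hypothetical system for illustration: 2x + 3y = 5 and x - y = 1.
# The coefficients go into a matrix, the right-hand sides into a vector.
import numpy as np

coefficients = np.array([[2.0, 3.0],
                         [1.0, -1.0]])
constants = np.array([5.0, 1.0])

# Solve the linear system; the solution is x = 1.6, y = 0.6
print(np.linalg.solve(coefficients, constants))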

They're also used to represent pictures! Ever wonder how computers "see" pictures? They often turn them into matrices! A picture is broken down into tiny dots called pixels. Each pixel gets a number based on its color or brightness. So, a photo becomes a huge matrix of numbers. If it's a black and white picture, a white spot might be a large number, a black spot a small one. This lets the computer do calculations on the matrix – using linear algebra! – to figure out if two pictures are similar, or to spot things inside a picture. It's all about turning the visual into numbers the computer can work with.

[Figure: how an image is converted into a matrix of pixel values]
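As a rough sketch of the idea (the tiny 3x3 "image" below is made up for illustration; a real photo would come from an image library), a grayscale picture really is just a matrix of brightness numbers you can compute with:

# A made-up 3x3 grayscale "image": 255 is a white pixel, 0 is a black pixel.
import numpy as np

image = np.array([[255, 200, 255],
                  [ 50,   0,  50],
                  [255, 200, 255]])

print(image.shape)   # (3, 3): 3 rows and 3 columns of pixels
print(image.mean())  # average brightness, one simple calculation on the matrix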

Basically, whether it's sensor data, pictures, or coefficients from equations, the matrix is our go-to structure. Rows are usually your individual data points or samples, and columns are the different characteristics or variables you measured.

Digging Into the Data: Are All My Measurements Really Different?

Okay, we've got our data in a matrix. Now, what do we do with it? One of the first things you might wonder is, "Are all these measurements I took actually telling me something new? Or are some of them just kind of repeating information I already have from the others?"

Think back to that reactor data with pressure, temperature, density, and viscosity. You might already know that density sort of depends on pressure and temperature. If that link is a simple linear one, then knowing pressure and temperature is enough to figure out density. You don't really need density as a separate, independent piece of information. Knowing this is super important for understanding how much actual, unique information is in your dataset and maybe even making your data smaller by getting rid of redundant stuff.

This is where a cool concept called the rank of a matrix comes in handy. The rank is the maximum number of columns (or rows, it works out to the same number) that are linearly independent, meaning they aren't just combinations of the other columns or rows. The rank tells you the real number of distinct variables or samples you're dealing with, in terms of linear relationships.

Let's look at our reactor data matrix again:

A = [[2.5, 120, 1.2, 3.7],   # Sample 1
     [5.0, 240, 2.4, 7.4],   # Sample 2
     [3.0, 180, 1.8, 5.5]]   # Sample 3

Let's check the rows for independence. Is Row 2 just a scaled version of Row 1?
$5.0/2.5=2$
$240/120=2$
$2.4/1.2=2$
$7.4/3.7=2$
Yes! Row 2 is exactly 2 times Row 1. This means Row 1 and Row 2 are linearly dependent.
Is Row 3 a scaled version of Row 1?
$3.0/2.5=1.2$
$180/120=1.5$
No, the scaling factor isn't constant. So, Row 3 is independent of Row 1.

Since Row 2 depends on Row 1 (and Row 3 doesn't), a maximal set of linearly independent rows is {Row 1, Row 3}. That's 2 independent rows, so the rank of the matrix is 2.
So, even though we have 4 variables and 3 samples, the rank is 2. This tells us that in terms of linear combinations, the data essentially lives in a 2-dimensional space. There are only 2 independent sources of linear information here.
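If you want to double-check that arithmetic numerically, here's a quick sketch (assuming NumPy) of the same ratio test:

# Quick numerical check of the row relationships described above.
import numpy as np

A = np.array([[2.5, 120, 1.2, 3.7],   # Sample 1
              [5.0, 240, 2.4, 7.4],   # Sample 2
              [3.0, 180, 1.8, 5.5]])  # Sample 3

print(A[1] / A[0])   # [2. 2. 2. 2.]  -> Row 2 is exactly 2 times Row 1
print(A[2] / A[0])   # ratios are not all equal -> Row 3 is not a multiple of Row 1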

You can easily find the rank using software. In Python, for instance, you'd just use a command like np.linalg.matrix_rank(A) after setting up your matrix $A$.

# Let's make a matrix similar to our reactor example
import numpy as np

# Create the matrix
A = np.array([
    [2.5, 120, 1.2, 3.7],   # Sample 1
    [5.0, 240, 2.4, 7.4],   # Sample 2
    [3.0, 180, 1.8, 5.5]    # Sample 3
])

# Number of columns
print("Number of columns:")
print(A.shape[1])

# Calculate rank of the matrix
print("Rank of the matrix:")
rank = np.linalg.matrix_rank(A)
print(rank)

# Calculate nullity
print("Nullity (Number of Relationships):")
nullity = A.shape[1] - rank
print(nullity)

You can find more linear algebra code examples in my Math-in-AI GitHub repository.

For this matrix, the rank is 2 and the nullity is $4-2=2$. This tells us there are 2 independent linear relationships among the 4 variables.

Finding the Connections: What Are the Actual Relationships?

Alright, we know if there are relationships (if the rank is less than the number of variables), but what are they? How do we find the actual equations that link these variables?

This is where we look at something called the null space and its size, the nullity. Imagine we have our data matrix, let's call it $A$. If we can find a non-zero vector (just a list of numbers in a column) $\beta = [\beta_1, \beta_2, \ldots, \beta_n]^T$ such that when you multiply $A$ by $\beta$ you get a vector of all zeros ($A\beta = \mathbf{0}$), that $\beta$ vector is in the null space of $A$.
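To make that concrete with a toy example (the 2x2 matrix below is made up, not the reactor data), here's what "in the null space" looks like numerically:

# Toy example: beta = [-2, 1] is in the null space of this matrix,
# because multiplying the matrix by beta gives the zero vector.
import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 4.0]])
beta = np.array([-2.0, 1.0])

print(M @ beta)   # [0. 0.]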

Setting up $A\beta = \mathbf{0}$ for our reactor matrix gives us a system of equations, one for each row (sample):

Sample 1: $2.5\beta_1 + 120\beta_2 + 1.2\beta_3 + 3.7\beta_4 = 0$

Sample 2: $5.0\beta_1 + 240\beta_2 + 2.4\beta_3 + 7.4\beta_4 = 0$

Sample 3: $3.0\beta_1 + 180\beta_2 + 1.8\beta_3 + 5.5\beta_4 = 0$

Because Sample 2 is just 2 times Sample 1, the second equation is also just 2 times the first equation ($2 \times (2.5\beta_1 + 120\beta_2 + 1.2\beta_3 + 3.7\beta_4) = 5.0\beta_1 + 240\beta_2 + 2.4\beta_3 + 7.4\beta_4$). So the second equation doesn't give us any new, independent information about the $\beta$ values. We effectively have two independent equations (from rows 1 and 3) for our four unknowns ($\beta_1, \beta_2, \beta_3, \beta_4$):

$$2.5\beta_1 + 120\beta_2 + 1.2\beta_3 + 3.7\beta_4 = 0$$

$$3.0\beta_1 + 180\beta_2 + 1.8\beta_3 + 5.5\beta_4 = 0$$

See what's happening? The same $\beta_1, \beta_2, \beta_3, \beta_4$ values must work for all your samples for the product to be the zero vector. This means you've found a general linear equation that connects the variables themselves, no matter which sample you look at:

$$\beta_1 \times (\text{Variable 1}) + \beta_2 \times (\text{Variable 2}) + \beta_3 \times (\text{Variable 3}) + \beta_4 \times (\text{Variable 4}) = 0$$

This $\beta$ vector gives you the coefficients of that linear relationship!

The nullity of matrix $A$ is simply the number of linearly independent $\beta$ vectors in the null space, i.e. its dimension. Each independent null space vector means there's another distinct linear relationship hiding in your data.

The Rank-Nullity Theorem ties these together:
$$\text{Nullity of } A + \text{Rank of } A = \text{Total number of variables (columns in } A)$$

For our matrix, Nullity + 2 = 4, so Nullity = 2. There are 2 independent linear relationships.

To find these relationships, we need to solve the system of equations for the vectors $\beta$ that satisfy $A\beta = \mathbf{0}$. When you solve the system for our reactor matrix:
$$2.5\beta_1 + 120\beta_2 + 1.2\beta_3 + 3.7\beta_4 = 0$$
$$3.0\beta_1 + 180\beta_2 + 1.8\beta_3 + 5.5\beta_4 = 0$$

you find two independent solution vectors that form a basis for the null space (you can see the step-by-step solution here):

Vector 1: $[0, -1/100, 1, 0]^T \approx [0, -0.01, 1, 0]^T$

Vector 2: $[-1/15, -53/1800, 0, 1]^T \approx [-0.067, -0.029, 0, 1]^T$
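If you'd rather not do the row reduction by hand, here's a minimal sketch (assuming SymPy is installed; exact rational entries avoid floating-point round-off) that recovers the same basis:

# Compute an exact basis for the null space of the reactor matrix with SymPy.
from sympy import Matrix, Rational

A_exact = Matrix([[Rational('2.5'), 120, Rational('1.2'), Rational('3.7')],
                  [Rational('5.0'), 240, Rational('2.4'), Rational('7.4')],
                  [Rational('3.0'), 180, Rational('1.8'), Rational('5.5')]])

for beta in A_exact.nullspace():
    print(beta.T)   # [0, -1/100, 1, 0] and [-1/15, -53/1800, 0, 1]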

Let's see what these vectors mean in terms of relationships between our variables (Pressure, Temperature, Density, Viscosity).

From Vector 1: $[0, -1/100, 1, 0]^T$

The coefficients are $0, -1/100, 1,$ and $0$. The relationship is:
$$0 \times (\text{Pressure}) + (-1/100) \times (\text{Temperature}) + 1 \times (\text{Density}) + 0 \times (\text{Viscosity}) = 0$$
This simplifies to: $-\frac{1}{100} \times \text{Temperature} + \text{Density} = 0$, or $\text{Density} = \frac{1}{100} \times \text{Temperature}$.
This relationship tells us that in this dataset, the Density reading is always 1/100th of the Temperature reading.

From Vector 2: $[-1/15, -53/1800, 0, 1]^T$

The coefficients are $-1/15, -53/1800, 0,$ and $1$. The relationship is:
$$(-1/15) \times (\text{Pressure}) + (-53/1800) \times (\text{Temperature}) + 0 \times (\text{Density}) + 1 \times (\text{Viscosity}) = 0$$
This simplifies to: $-\frac{1}{15} \times \text{Pressure} - \frac{53}{1800} \times \text{Temperature} + \text{Viscosity} = 0$, or $\text{Viscosity} = \frac{1}{15} \times \text{Pressure} + \frac{53}{1800} \times \text{Temperature}$.
This is the second independent linear relationship in your data, showing how Viscosity depends on both Pressure and Temperature.

So, by setting up and solving the system $A\beta = \mathbf{0}$, we found the null space vectors. These vectors provide the exact coefficients for the linear equations that describe the relationships between the variables that hold true for all your samples.
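As a final sanity check, here's a short sketch (re-using the NumPy array A from the rank example above) confirming that both relationships hold for every sample:

# Each column of A is one variable: Pressure, Temperature, Density, Viscosity.
import numpy as np

A = np.array([[2.5, 120, 1.2, 3.7],
              [5.0, 240, 2.4, 7.4],
              [3.0, 180, 1.8, 5.5]])
pressure, temperature, density, viscosity = A.T

print(np.allclose(density, temperature / 100))                          # True
print(np.allclose(viscosity, pressure / 15 + 53 / 1800 * temperature))  # True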

Why Does This Matter for Machine Learning?

Okay, so we can represent data in matrices, find out how many variables are truly independent (rank), and even get the exact equations linking dependent variables (null space/nullity). Why is this a big deal for machine learning?

Well, a lot of machine learning algorithms work by doing calculations on these data matrices. If you want to reduce the number of variables in your dataset to make things simpler or faster (that's called dimensionality reduction), you absolutely need to understand the concepts of independence and rank. Algorithms like Principal Component Analysis (PCA) rely on finding the most important, independent directions (or components) in your data, which is directly related to the rank. Knowing the relationships (from the null space) can also be useful; sometimes, algorithms need to be built in a way that respects these inherent links in the data.
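To connect this back to the matrix we've been working with, here's a small sketch (plain SVD on the raw matrix, skipping the mean-centering that a full PCA would do) showing that the singular values expose the same rank-2 structure:

# The singular values of A: only two are nonzero, matching the rank of 2.
import numpy as np

A = np.array([[2.5, 120, 1.2, 3.7],
              [5.0, 240, 2.4, 7.4],
              [3.0, 180, 1.8, 5.5]])

singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values)   # two nonzero singular values; the third is numerically zero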

So, getting a solid grasp of these matrix ideas, rank, and null space is really your entry ticket to understanding how many machine learning techniques actually work under the hood.

Wrapping It Up

To sum it all up, linear algebra, especially working with matrices, is super important in data science. Matrices are our standard way to store data, with rows for each sample and columns for each variable. The rank of a matrix quantifies the number of independent variables, telling us how much unique information is present. And if there are dependencies, the null space and its size, the nullity, help us find the actual linear equations that describe the relationships between those variables. These aren't just abstract math ideas; they are practical tools for understanding your data better and are absolutely essential for getting into machine learning.