Practical Eigenvectors
Learn 80% about what they are and their applications with 20% effort.

Motivation
Eigenvectors are a central concept of linear algebra with a wide range of exciting applications. However, they may be non-intuitive and intimidating, and linear algebra defines these concepts in very rigorous and generic terms spanning hundreds of pages. Moreover, the information on what they are and how they’re used in various applications is scattered in different places.
This article makes eigenvectors friendlier with simple visualizations and exciting applications.
Scope
We assume that the reader is familiar with the basic matrix addition and multiplication operations. We only discuss finite-dimensional vector spaces over ℝ (real numbers)[1].
Vectors and bases
In the N-dimensional space, a vector \(v\) is a list of N scalars: \[v=\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_N\end{bmatrix}\]
The standard basis (S) is a special group of N vectors \(s_1, s_2, \dots, s_N\) such that \(s_k\) has 1 on kth position and 0 otherwise.
By default, every vector is defined with respect to the standard basis. In other words, the meaning of \(v\) above is that \(v = x_1 \cdot s_1 + x_2 \cdot s_2 + \dots + x_N \cdot s_N\). To make the basis explicit, we subscript it: \(v=v_S\).
Geometrically, a vector is an arrow starting from the fixed origin of the N-dimensional space and ending in the point identified by its components.
The image below depicts in 2D space the standard basis \(s_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), \(s_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) and two other vectors \(v_S = \begin{bmatrix} 3 \\ -1 \end{bmatrix}\), \(w_S = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\):
A group of vectors are independent if none of them can be written as a weighted sum of the others. Vectors \(\begin{bmatrix} 3 \\ -1 \end{bmatrix}\) and \(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\) are independent, but \(v = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) and \(u = \begin{bmatrix} 2 \\ 2 \end{bmatrix}\) are not because \(u = 2 \cdot v\).
In an N-dimensional space, a basis is any group of N vectors that are independent. The standard basis is not the only basis. Given a basis, every vector of the space can be written uniquely as a weighted sum of those basis vectors.
Therefore, the same vector can be defined with respect to different bases. In each case the value and meaning of its components may change, but the vector remains the same. In the example above we chose the standard basis and defined vectors \(v\) and \(w\) with respect to \(s_1\) and \(s_2\). Now let’s choose vectors \(v\) and \(w\) as the basis, and instead write \(s_1\) and \(s_2\) with respect to this new basis.
How to change a vector’s basis
Say \(v_S\) is defined with respect to the standard basis and we want to redefine it as \(v_B\) with respect to another basis B (\(b_1, b_2, \dots, b_N\)).
First, define the N-by-N matrix \(B\) such that its jth column is \(b_{jS}\).
Then \(v_B = B^{-1} \cdot v_S\) and \(v_S = B \cdot v_B\).
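A minimal NumPy sketch of this change of basis (the basis is built from the vectors \(v\) and \(w\) above; the sample vector is an arbitrary choice for illustration):

import numpy as np

# Basis B given by its vectors expressed in the standard basis (as columns).
B = np.array([[3.0, 1.0],
              [-1.0, 1.0]])       # b1 = (3, -1), b2 = (1, 1)
v_S = np.array([4.0, 0.0])        # a vector in standard coordinates
v_B = np.linalg.inv(B) @ v_S      # the same vector expressed in basis B
assert np.allclose(B @ v_B, v_S)  # converting back recovers the original
print(v_B)                        # [1. 1.], i.e. v = 1*b1 + 1*b2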
Operators
An operator is an N-by-N matrix \(O_S\) describing how it maps a vector (\(v_S\)) to another vector (\(u_S\)): \(u_S=O_S \cdot v_S\).
Think of vectors as “data”, and of operators as “transformation[3]” of the data.
In the 2D space we find some familiar classes of operators.
Scale operator
\(O_1 = \begin{bmatrix} k_x & 0 \\ 0 & k_y \end{bmatrix}\), for example \(O_1 = \begin{bmatrix} 1.5 & 0 \\ 0 & 2 \end{bmatrix}\).
Below, the left image shows the original 2D space, the middle shows the space after being transformed by the operator \(O_1\), and the right image shows scaled gradients of how the points get moved.
Shear operator
\(O_2 = \begin{bmatrix} 1 & s_x \\ s_y & 1 \end{bmatrix}\), for example \(O_2 = \begin{bmatrix} 1 & 0.25 \\ 0.5 & 1 \end{bmatrix}\).
Rotate operator
\(O_3 = \begin{bmatrix} \cos \phi & -\sin \phi \\ \sin \phi & \cos \phi \end{bmatrix}\) rotates the vectors counter-clockwise by \(\phi\).
For example \(O_3 = \begin{bmatrix} 0.866 & -0.5 \\ 0.5 & 0.866 \end{bmatrix}\) rotates by \(30^{\circ} \).
Composite operators
If operator \(O\) is the composition of two operators \(O_1\) and \(O_2\), such that first we transform vectors with \(O_1\) and subsequently with \(O_2\), then \(O = O_2 \cdot O_1\).
For example, to compose the operators above (rotate, then shear, then scale): \(O_4 = O_1 \cdot O_2 \cdot O_3 = \begin{bmatrix} 1.5 & 0 \\ 0 & 2\end{bmatrix} \cdot \begin{bmatrix} 1 & 0.25 \\ 0.5 & 1 \end{bmatrix} \cdot \begin{bmatrix} 0.866 & -0.5 \\ 0.5 & 0.866 \end{bmatrix} \), hence \(O_4 = \begin{bmatrix} 1.4865 & -0.42525 \\ 1.866 & 1.232 \end{bmatrix} \).
No translate?
Perhaps surprisingly, translation is not an operator (it is not a linear transformation). It can, however, be implemented as an operator by adding a temporary dimension to the space.
Example: in 2D, to translate vectors \(v = \begin{bmatrix} v_x \\ v_y \end{bmatrix} \) horizontally by \(t_x\) and vertically by \(t_y\) to \(u = \begin{bmatrix} v_x + t_x \\ v_y + t_y \end{bmatrix}\), first add a third dimension to it with a component of 1: \(v = \begin{bmatrix} v_x \\ v_y \\ 1 \end{bmatrix} \). Now we can use that extra component with an operator like \(O=\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \\ \end{bmatrix}\). Then \(u = O \cdot v = \begin{bmatrix} v_x + t_x \\ v_y + t_y \\ 1 \end{bmatrix} \). Finally, drop the temporary dimension.
Note: affine transformations in homogeneous coordinates work the same way.
Note: SVG implements 2D translation this way.
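A minimal NumPy sketch of this trick (the vector and offsets below are arbitrary illustrative values):

import numpy as np

def translate_2d(v, tx, ty):
    # Translate a 2D vector by (tx, ty) via a temporary homogeneous coordinate.
    O = np.array([[1.0, 0.0, tx],
                  [0.0, 1.0, ty],
                  [0.0, 0.0, 1.0]])
    v_h = np.append(v, 1.0)   # add the temporary third component
    u_h = O @ v_h             # apply the 3x3 operator
    return u_h[:2]            # drop the temporary dimension

print(translate_2d(np.array([3.0, -1.0]), 2.0, 5.0))  # [5. 4.]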
How to change an operator’s basis
If vector definitions change with respect to different bases, so do operators.
Say \(O_S\) is defined with respect to the standard basis, and we want to redefine it as \(O_B\) with respect to another basis B (\(b_1, b_2, \dots, b_N\)).
Once again, define the N-by-N matrix \(B\) such that its jth column is \(b_{jS}\).
Then \(O_B = B^{-1} \cdot O_S \cdot B \) and \(O_S = B \cdot O_B \cdot B^{-1} \).
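A quick NumPy check of this identity (the operator and basis below are my own illustrative choices; the basis happens to foreshadow the eigenbasis used later in the article):

import numpy as np

O_S = np.array([[1.0, 2.0],
                [2.0, 1.0]])       # an operator in the standard basis
B = np.array([[1.0, 1.0],
              [1.0, -1.0]])        # new basis: b1 = (1, 1), b2 = (1, -1)

O_B = np.linalg.inv(B) @ O_S @ B   # the same operator in basis B
print(O_B)                         # [[3, 0], [0, -1]]: diagonal in this basis

# Transforming then converting equals converting then transforming:
v_S = np.array([2.0, 0.0])
assert np.allclose(np.linalg.inv(B) @ (O_S @ v_S), O_B @ (np.linalg.inv(B) @ v_S))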
Eigenvalues and eigenvectors
Given operator \(O\), an eigenvector[2] is any non-zero vector which remains on the same axis (i.e., maintains or reverses direction) when transformed by \(O\). The eigenvector may change its length. Eigenvectors characterize the transformation (not the data).
Thus, if there is a vector \(e \neq 0\) and a scalar \(\lambda \) such that \(O \cdot e = \lambda \cdot e\), then \(e\) is an eigenvector and \(\lambda \) is its eigenvalue.
If \(e\) is an eigenvector, then so is every multiple of \(e\) (but they’re not independent). Therefore, we are generally interested in the axis of an eigenvector rather than the particular eigenvector on it.
The operator may have up to N independent eigenvectors. Any list of N independent eigenvectors is a basis (eigenbasis).
Repeatedly applying the operator to any vector \(v \neq 0 \) will make it converge towards the axis of the eigenvector with the largest absolute eigenvalue (unless \(v\) has no component along that axis). This is depicted intuitively in the gradient-images below, and will become more obvious once we discover the diagonal form of an operator (in application #1). Some operators converge slowly, but those with sparse matrices converge quickly.
Examples in 2D
\(O=\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}\) has two eigenvector axes \(e_1=\begin{bmatrix} t \\ t \end{bmatrix} \), \(e_2=\begin{bmatrix} t \\ -t \end{bmatrix}, \forall t \neq 0 \) with \(\lambda_1=3\), \(\lambda_2=-1\) respectively.
The images below depict this transformation and the two eigenvector axes shown as red lines in the rightmost image.
\(O=\begin{bmatrix} 1 & 0.5 \\ 0 & 1 \end{bmatrix}\) has single eigenvector axis \(e=\begin{bmatrix} t \\ 0 \end{bmatrix}, \forall t \neq 0 \), \(\lambda=1\).
\(O=\begin{bmatrix} 0.866 & -0.5 \\ 0.5 & 0.866 \end{bmatrix}\) has no eigenvectors.
Note: in 2D, rotations (other than by 0° or 180°) have no eigenvectors; in 3D every rotation has one eigenvector axis (its rotation axis).
\(O=\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\) has all non-zero vectors as eigenvectors with \(\lambda=2\).
Note: for identity or uniform scale operators (where \(k_x = k_y\)) all vectors are eigenvectors. Although all axes are eigenvector axes, you can only choose 2 (N in N-dimensions) such that they are independent.
Determining the eigenvalues
Recall that an eigenvalue is the scalar \(\lambda\) in the equation \(O \cdot e = \lambda \cdot e\).
So we want to find \(\lambda\) such that \((O-\lambda \cdot I) \cdot e=0, e \neq 0\).
Thus find \(\lambda\) such that \(det(O-\lambda \cdot I)=0\).
In 2D, if \(O=\begin{bmatrix} a & b \\ c & d \end{bmatrix} \) then \(\lambda_{1,2}=\frac{a+d \pm \sqrt{(a+d)^2 - 4 (a d - b c)} }{2} \) (a small code check of this formula follows the list below):
- if the \(\sqrt{\cdot}\) term is undefined in ℝ, then the operator has no eigenvalues (and no eigenvectors). Note: it would always be defined if our space were over ℂ (complex numbers), in which case even rotations would have eigenvalues
- if \(\lambda_1 \neq \lambda_2\), then the operator has exactly two eigenvector axes
- if \(\lambda_1 = \lambda_2\), then the operator has either a single eigenvector axis or all axes are eigenvector axes
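Here is that check: a small sketch applying the quadratic formula above and comparing it against NumPy (the example operator is the one used throughout this article):

import numpy as np

def eigenvalues_2x2(a, b, c, d):
    # Real eigenvalues of [[a, b], [c, d]] via the quadratic formula.
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4 * det
    if disc < 0:
        return []   # no real eigenvalues (e.g. a rotation)
    return [(tr + disc ** 0.5) / 2, (tr - disc ** 0.5) / 2]

print(eigenvalues_2x2(1, 2, 2, 1))   # [3.0, -1.0]
print(np.linalg.eigvals(np.array([[1.0, 2.0],
                                  [2.0, 1.0]])))   # same values (order may differ)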
Determining the eigenvectors
First determine the eigenvalues. Then for each eigenvalue \(\lambda_k\), solve for \(e_k\) in the system of equations \((O-\lambda_k \cdot I) \cdot e_k=0, e_k \neq 0\). Recall that \(det(O-\lambda_k \cdot I)=0\), so these equations are not independent. Expect to find not a unique solution but a family of solutions in which at least one variable remains free.
In 2D, if \(O=\begin{bmatrix} a=1 & b=2 \\ c=2 & d=1 \end{bmatrix} \) then \(\lambda_1=3\) and \(\lambda_2=-1\).
From \((O-\lambda_1 \cdot I) \cdot e_1=0\) then \(\begin{bmatrix} 1-3 & 2 \\ 2 & 1-3 \end{bmatrix} \cdot e_1=0\).
Then \(-2 \cdot e_{1x} + 2 \cdot e_{1y} = 0\), so \(e_{1x}=e_{1y}=t\) and \(e_1=\begin{bmatrix} t \\ t \end{bmatrix}, \forall t \neq 0 \).
Similarly, from \((O-\lambda_2 \cdot I) \cdot e_2=0\) we get \(e_2=\begin{bmatrix} t \\ -t \end{bmatrix}, \forall t \neq 0 \).
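NumPy automates both steps; a minimal sketch for the same operator (np.linalg.eig returns one unit-length representative per eigenvector axis, i.e. a particular choice of t):

import numpy as np

O = np.array([[1.0, 2.0],
              [2.0, 1.0]])
lam, vecs = np.linalg.eig(O)   # eigenvalues and eigenvectors (as columns)
print(lam)                     # [ 3. -1.]
print(vecs)                    # roughly [[0.707, -0.707], [0.707, 0.707]]
for i in range(2):             # verify O*e = lambda*e for each pair
    assert np.allclose(O @ vecs[:, i], lam[i] * vecs[:, i])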
A few properties
- A square matrix \(A\) and its transpose \(A^T\) have the same eigenvalues[18].
- A stochastic[4] matrix has only non-negative values and each row sums up to 1. A stochastic matrix always has \(\lambda=1\) as an eigenvalue, which is also its largest[17].
- All independent eigenvectors of a symmetric matrix are orthogonal to each other[20]. In other words the projection of one onto another is \(0=\sum_{k}{e_{ik} \cdot e_{jk}}\).
Applications
It may seem that the eigenvectors are so narrowly specified that they can’t be very consequential. They are! Let’s look at some exciting applications.
1. Matrix diagonalization and exponentiation
Given matrix \(A\), what is \(A^k (k \in ℕ, k \gg 1)\)?
To solve this, consider \(A\) as the definition of an operator \(O_S\) (relative to the standard basis). Choose an eigenbasis E and redefine \(O_S\) as \(O_E\) (relative to the eigenbasis). Relative to E, \(O_E\) looks like a simple scaling operator. In other words, \(O_E\) is a diagonal matrix, whose diagonal is the eigenvalues.
So \(A=O_S=E \cdot O_E \cdot E^{-1} \) where \(E=\begin{bmatrix} \overrightarrow{e_1} & | & \overrightarrow{e_2} & | & \dots & | & \overrightarrow{e_N} \end{bmatrix}\) (eigenvectors as columns) and \(O_E=\begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_N \end{bmatrix} \) (eigenvalues as diagonal).
Because \(A^k\) means applying the transformation k times, then \(A^k=E \cdot O_E^k \cdot E^{-1} \). Finally, because \(O_E\) is a diagonal matrix, its kth power is trivial: \(O_E^k=\begin{bmatrix} \lambda_1^k & 0 & \dots & 0 \\ 0 & \lambda_2^k & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_N^k \end{bmatrix} \).
Once we determine the matrices \(O_E\), \(E\), and \(E^{-1}\), computing \(A^k\) takes \(O(N^3)\) operations (down from \(O(k \cdot N^3) \) for the naive approach). This makes it possible to compute large (sometimes infinite) powers of a matrix.
Problem: let \(A=\begin{bmatrix} -2 & 1 \\ -4 & 3 \end{bmatrix} \), what is \(A^{1000}\)?
First, determine the eigenvalues as \(\lambda_1=-1\) and \(\lambda_2=2\).
Next, find the eigenbasis as \(e_1=\begin{bmatrix} 1 \\ 1 \end{bmatrix} \) and \(e_2=\begin{bmatrix} 1 \\ 4 \end{bmatrix} \).
Thus \(E=\begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix} \) and \(E^{-1}=\begin{bmatrix} 4 & -1 \\ -1 & 1 \end{bmatrix} \cdot \frac{1}{3} \) and \(O_E=\begin{bmatrix} -1 & 0 \\ 0 & 2 \end{bmatrix} \).
Then \(A^n=E \cdot O_E^n \cdot E^{-1}=\begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix} \cdot \begin{bmatrix} (-1)^n & 0 \\ 0 & 2^n \end{bmatrix} \cdot \begin{bmatrix} 4 & -1 \\ -1 & 1 \end{bmatrix} \cdot \frac{1}{3} \).
Then \(A^n=\begin{bmatrix} 4 \cdot (-1)^n-2^n & (-1)^{n+1}+2^n \\ 4 \cdot (-1)^n-2^{n+2} & (-1)^{n+1}+2^{n+2} \end{bmatrix} \cdot \frac{1}{3} \).
Finally \(A^{1000}=\begin{bmatrix} 4-2^{1000} & 2^{1000}-1 \\ 4-2^{1002} & 2^{1002}-1 \end{bmatrix} \cdot \frac{1}{3} \).
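A numerical sanity check of the diagonalization identity, assuming NumPy (I use a modest exponent because the check relies on floating-point arithmetic; the exact \(2^{1000}\) entries above would need integer or symbolic math):

import numpy as np

A = np.array([[-2.0, 1.0],
              [-4.0, 3.0]])
lam, E = np.linalg.eig(A)     # eigenvalues and the eigenbasis (as columns)
k = 20
A_k = E @ np.diag(lam ** k) @ np.linalg.inv(E)         # E * O_E^k * E^-1
assert np.allclose(A_k, np.linalg.matrix_power(A, k))  # matches repeated multiplication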
2. Recursive series
Problem: what is the direct formula for the nth Fibonacci term?
Because each \(f_k\) is the sum of the previous two, we need a system with two units of memory – a 2D space.
Let \(v_{kS}=\begin{bmatrix} f_{k-1} \\ f_k \end{bmatrix} \) and \(v_{1S}=\begin{bmatrix} f_0 \\ f_1 \end{bmatrix}=\begin{bmatrix} 0 \\ 1 \end{bmatrix} \). See the first few vectors:
Let operator \(O_S=\begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}\) so that \(v_{k+1} = O_S \cdot v_k = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}\ \cdot \begin{bmatrix} f_{k-1} \\ f_k \end{bmatrix} = \begin{bmatrix} f_k \\ f_{k+1} \end{bmatrix}\).
Therefore \(v_{nS}=O_S^{n-1} \cdot v_{1S}\).
Next, find that \(\lambda_1=\frac{1+\sqrt{5}}{2}\), \(\lambda_2=\frac{1-\sqrt{5}}{2}\), and \(e_1=\begin{bmatrix} 1 \\ \lambda_1 \end{bmatrix} \), \(e_2=\begin{bmatrix} 1 \\ \lambda_2 \end{bmatrix} \).
Hence \(O_E=\begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}\), \(E=\begin{bmatrix} 1 & 1 \\ \lambda_1 & \lambda_2 \end{bmatrix}\), \(E^{-1}=\begin{bmatrix} -\lambda_2 & 1 \\ \lambda_1 & -1 \end{bmatrix} \cdot \frac{1}{\sqrt{5}} \).
So \(v_{nS}=O_S^{n-1} \cdot v_{1S} = E \cdot O_E^{n-1} \cdot E^{-1} \cdot v_{1S}\).
\(v_{nS}=\begin{bmatrix} \lambda_1^{n-1}-\lambda_2^{n-1} \\ \lambda_1^n - \lambda_2^n \end{bmatrix} \cdot \frac{1}{\sqrt{5}} = \begin{bmatrix} f_{n-1} \\ f_n \end{bmatrix}\).
Finally, \(f_n=\frac{1}{\sqrt{5}}\cdot(\frac{1+\sqrt{5}}{2})^n - \frac{1}{\sqrt{5}}\cdot(\frac{1-\sqrt{5}}{2})^n\).
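A quick check of the closed form against the straightforward iteration (plain Python; exact only up to moderate n because of floating-point rounding):

import math

def fib_closed_form(n):
    sqrt5 = math.sqrt(5)
    phi, psi = (1 + sqrt5) / 2, (1 - sqrt5) / 2
    return round((phi ** n - psi ** n) / sqrt5)

def fib_iterative(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_closed_form(n) == fib_iterative(n) for n in range(1, 40))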
Problem: what is the formula for geometric series?
Let the geometric series be \(g_n=c + c \cdot t^1 + c \cdot t^2 + \dots + c \cdot t^n \).
Rewrite it in a recursive fashion: \(g_{n+1}=g_n + t \cdot a_n \), with \(a_n=c \cdot t^n\) (so that \(a_{n+1}=t \cdot a_n\)). Once again we need a system with two memory units.
Let \(v_{kS}=\begin{bmatrix} a_k \\ g_k \end{bmatrix} \) and \(v_{0S}=\begin{bmatrix} c \\ c \end{bmatrix} \).
Let operator \(O_S=\begin{bmatrix} t & 0 \\ t & 1 \end{bmatrix}\) so that \(v_{k+1} = O_S \cdot v_k = \begin{bmatrix} t & 0 \\ t & 1 \end{bmatrix}\ \cdot \begin{bmatrix} a_k \\ g_k \end{bmatrix} = \begin{bmatrix} t \cdot a_k \\ t \cdot a_k + g_k \end{bmatrix} = \begin{bmatrix} a_{k+1} \\ g_{k+1} \end{bmatrix}\).
Next, find that \(\lambda_1=1\), \(\lambda_2=t\), and \(e_1=\begin{bmatrix} 0 \\ 1 \end{bmatrix} \), \(e_2=\begin{bmatrix} \frac{t-1}{t} \\ 1 \end{bmatrix} \).
Hence \(O_E=\begin{bmatrix} 1 & 0 \\ 0 & t \end{bmatrix}\), \(E=\begin{bmatrix} 0 & \frac{t-1}{t} \\ 1 & 1 \end{bmatrix}\), \(E^{-1}=\begin{bmatrix} \frac{t}{1-t} & 1 \\ -\frac{t}{1-t} & 0 \end{bmatrix} \).
So \(v_{nS}=O_S^n \cdot v_{0S} = E \cdot O_E^n \cdot E^{-1} \cdot v_{0S}\).
\(v_{nS}= c \cdot \begin{bmatrix} t^n \\ \frac{1-t^{n+1}}{1-t} \end{bmatrix} = \begin{bmatrix} a_n \\ g_n \end{bmatrix}\).
Finally, \(g_n=c \cdot \frac{1-t^{n+1}}{1-t}, \forall t \neq 1\).
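And a tiny sanity check of the formula (plain Python, with arbitrary illustrative values of c and t):

c, t = 2.0, 3.0
for n in range(6):
    direct = sum(c * t ** k for k in range(n + 1))    # c + c*t + ... + c*t^n
    formula = c * (1 - t ** (n + 1)) / (1 - t)
    assert abs(direct - formula) < 1e-9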
3. Markov chains
A Markov Chain is a weighted, directed graph such that, for each node, the weights of its outgoing edges sum to 1. Self-edges are allowed, and each node may hold a value.
One interpretation is that each node represents a certain state (with a certain initial probability), and at each iteration the next state is one of the adjacent nodes with a probability equal to the weight of the edge.
Another interpretation is that each node begins with a certain amount of energy, and in each iteration it passes it to each adjacent node proportionally to the edge weights.
Either way, the information on the nodes makes up a piece of data (a vector) and the edges make up a transformation (an operator). N nodes means an N-dimensional space.
The operator \(O_S\) defining the transition with each iteration is a column-stochastic matrix – so it has values between 0 and 1, and each column sums up to 1. Specifically, its kth column is the probability vector associated with the outgoing edges of node k.
Stochastic matrices always have \(\lambda_1=1\) as their largest eigenvalue. The corresponding eigenvector \(e_1\) (such that \(A \cdot e_1 = e_1 \) and \(sum(e_1)=1\)) represents the steady-state of the system: the probability that a random walker ends in each of the nodes after \(\infty\) steps (or the energy each node has after \(\infty\) iterations). The only exception is when the initial system state is already an eigenvector, in which case the system remains locked in that state.
Problem: find the steady-state of a simple Markov chain
For this Markov chain, the adjacency matrix is \(A=\begin{bmatrix} 0.4 & 0.3 \\ 0.6 & 0.7 \end{bmatrix} \).
\(\lambda_1=1\) (stochastic matrix) and \(e_1=\begin{bmatrix} t \\ 2t \end{bmatrix}\). But enforcing \(sum(e_1)=1\) means \(e_1=\begin{bmatrix} \frac{1}{3} \\ \frac{2}{3} \end{bmatrix}\). Validate that \(A \cdot e_1 = e_1 \) as expected.
Finally, after \(\infty\) iterations, a random walker is in n1 with probability ⅓ and in n2 with ⅔. Or, the energy in n1 is ⅓ and in n2 is ⅔ of the global energy.
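A minimal NumPy sketch that finds this steady-state both ways: from the eigenvector of \(\lambda=1\) and by power iteration.

import numpy as np

A = np.array([[0.4, 0.3],
              [0.6, 0.7]])    # column-stochastic transition matrix

# Option 1: eigenvector for the largest eigenvalue (1), normalized to sum to 1.
lam, vecs = np.linalg.eig(A)
e1 = np.real(vecs[:, np.argmax(np.real(lam))])
steady = e1 / e1.sum()
print(steady)                 # [0.333... 0.666...]

# Option 2: power iteration (repeatedly apply the operator to any start state).
p = np.array([1.0, 0.0])
for _ in range(50):
    p = A @ p
assert np.allclose(p, steady)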
Problem: compute Google Page-Rank for all web pages
Part of the calculation of the rank (importance) of each web page is determining which other pages link to it and how important those pages are themselves.
Therefore, a giant Markov Chain is created such that each node is a web page, and each edge represents that the source node links to the target node. For each node, the weights on the outgoing edges are all equal: \(weight=\frac{1}{outdegree(source\ node)}\).
Note: additional tricks are employed to ensure no node becomes a dead end[5] (no energy sinks).
Then the eigenvector \(e_1\) (the steady-state of the graph) is the page-rank of each node.
Note: for practical reasons, considering the enormous size of this system, \(e_1\) is not calculated directly. Instead, it is approximated by applying the transform operator on an initial vector a number of times. Given that the operator matrix is sparse, it will quickly converge to \(e_1\).
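A minimal sketch of that power-iteration idea with a damping factor (a toy dense matrix here; the real computation uses sparse representations and handles dangling nodes differently):

import numpy as np

def pagerank(A, damping=0.85, iters=100):
    # A is column-stochastic: A[i, j] = 1/outdegree(j) if page j links to page i.
    n = A.shape[0]
    r = np.full(n, 1.0 / n)    # start from a uniform rank vector
    for _ in range(iters):
        r = damping * (A @ r) + (1 - damping) / n   # random jump avoids energy sinks
    return r

# Toy web of 3 pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(A))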
4. Spectral clustering of graphs
Given a connected undirected graph (or network), find K clusters (or communities) such that nodes in each cluster are more connected to each other than to nodes outside the cluster.
Note: the quality of a clustering is hard to measure, and the problem statement is more intuitive than rigorous. Many measures have been proposed; the most popular is modularity[7] (by Newman), defined as “the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random”.
First, define the following matrices:
- \(A\) is the adjacency matrix
- \(D\) is a diagonal matrix, such that \(D_{ii}=degree(node_i)=\sum_j A_{ij}\)
- \(L=D-A\) called the Laplacian matrix[14].
Intuition: L is akin to a differential operator, therefore eigenvectors with large eigenvalues correspond to cuts with “high vibrations” (maximal cuts). But a good clustering is similar to a minimal cut, so in this case we’re interested in eigenvectors with smallest eigenvalues[22].
Let’s agree that \(\lambda_1\) through \(\lambda_N\) are in ascending order – in fact \(\lambda_1=0\) is the smallest[19].
Note: if the graph is unconnected, 0 will appear multiple times as an eigenvalue. In fact, if there are C connected components in the graph, 0 will appear as an eigenvalue with multiplicity C. However, in this section we’re assuming the graph is connected (since that’s the interesting part), so 0 has multiplicity 1.
Note that L is symmetric and each row (and column) sums up to 0: \(\sum_j L_{ij}=0, \forall i\). So L has \(\lambda_1=0\) and \(e_1=\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \).
Proof: \((L-0 \cdot I) \cdot e_1 = L \cdot e_1 = 0 = 0 \cdot e_1\).
Also, because L is symmetric, all eigenvectors of its eigenbasis are orthogonal to each other: \(\sum_k {e_{ik} \cdot e_{jk}} = 0, \forall i \neq j\). Thus, if we choose \(j=1\) in the previous equation we find that each eigenvector (other than \(e_1\)) sums up to 0: \(\sum_k{e_{ik}}=0, \forall i \ge 2\).
Finally, if we want to find:
- \(K=2\) clusters, simply look at \(e_2\) to group the nodes into:
- those corresponding to a positive component of \(e_2\), and
- those corresponding to a negative component of \(e_2\)
- \(K \ge 3\) clusters, then for each node define \(v_i=\begin{bmatrix} e_{2i} \\ e_{3i} \\ \vdots \\ e_{Ki} \end{bmatrix} \) as the projection of \(s_i\) onto each of the eigenvectors \(e_2\) through \(e_K\). Finally, cluster the nodes using the K-means algorithm on the newly defined \(v_i\) vectors (a code sketch of this procedure follows below).
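Here is that sketch: a compact version of the procedure above using the unnormalized Laplacian (the toy graph, two triangles joined by one edge, is my own example; production implementations such as scikit-learn's use normalized variants):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(A, K):
    # Cluster a connected undirected graph (adjacency matrix A) into K groups.
    D = np.diag(A.sum(axis=1))           # degree matrix
    L = D - A                            # Laplacian
    lam, vecs = np.linalg.eigh(L)        # L is symmetric; eigenvalues ascending
    V = vecs[:, 1:K]                     # skip e1 (all ones); keep e2 .. eK
    if K == 2:
        return (V[:, 0] > 0).astype(int) # the sign of e2 splits the nodes
    return KMeans(n_clusters=K, n_init=10).fit_predict(V)

# Toy graph: two triangles joined by a single edge (2-3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_clusters(A, 2))           # e.g. [0 0 0 1 1 1]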
Problem: find the 2 communities in Zachary’s Karate Club network
Zachary’s Karate Club[8] network is a famous network that represents the 78 social connections (members interacting outside the club) between the 34 members of a karate club that Wayne Zachary studied from 1970 to 1972.
Later, due to a leadership disagreement, the 34 members split: half of the members formed a new club around the old instructor, and the others found a new instructor or gave up karate.
Based on collected data Zachary correctly assigned all but one member (#9) of the club to the groups they actually joined after the split.
In the Python code below, variable “a” is the 34-by-34 adjacency matrix. Scikit-learn runs a similar algorithm as a one-liner:
import sklearn.cluster
labels = sklearn.cluster.SpectralClustering(n_clusters=2).fit(a).labels_
The resulting partition correctly matches the real world split described in Zachary’s original paper, except for a single node (the same #9) which is described in the original paper[9] as having low-affinity to either group.
Note: optimal network clustering (e.g. maximizing modularity) is NP-hard, and while “spectral clustering” is known to perform very well it doesn’t guarantee optimal results.
5. Dimensionality reduction with PCA
If the samples in a dataset are N-dimensional vectors, we want to reduce them to K-dimensional vectors while retaining most of the information. This is valuable for several reasons:
- compress the data
- visualize (in 2D or 3D) data that cannot be visualized otherwise
- improve model efficiency – train faster and with less resources
- mitigate overfitting – remove redundancy to reduce learning “the noise” in the data
There are many algorithms for Dimensionality Reduction[13], including PCA, LDA, T-SNE, UMAP and Autoencoders. However, in this section we’ll focus on PCA only.
PCA[12] stands for Principal Component Analysis, and we’ll soon see that eigenvectors are the principal components[15].
First, normalize the dataset so that it is centered at 0 with a standard deviation of 1 by:
- subtracting from each vector the average vector across the dataset
- then for each dimension divide by its standard-deviation across the dataset
Next, compute the N-by-N covariance matrix[11] \(V\) for the N dimensions of the dataset and find its eigenvectors. With the eigenvalues sorted in descending order, we choose \(\lambda_1\) through \(\lambda_K\) and \(e_1\) through \(e_K\) (normalized to unit length).
Interpretation: \(e_1\) through \(e_K\) are the K principal components, or the axes along which the dataset has the highest variance. The variance along the axis of \(e_i\) is \(\lambda_i\). Intuitively, matrix \(V\) defines the operator that is the inverse of a whitening operator, such that \(V^{-1}\) would transform our dataset into white noise[10] (uncorrelated data with variance 1 whose covariance is the identity matrix).
Now define the K-by-N matrix \(E\) such that its ith row is \(e_i\). Hence, its transpose is \(E^T=\begin{bmatrix} \overrightarrow{e_1} & | & \overrightarrow{e_2} & | & \dots & | & \overrightarrow{e_K} \end{bmatrix}\) (eigenvectors as columns).
Finally, to reduce any sample \(d_N\) from N to K dimensions, simply compute \(d_K=E \cdot d_N\).
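Putting the whole recipe together, a minimal NumPy sketch (random data here just to show the shapes; the function and variable names are my own):

import numpy as np

def pca_reduce(X, K):
    # Project an (n_samples, N) dataset onto its K principal components.
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize each dimension
    V = np.cov(X, rowvar=False)                # N x N covariance matrix
    lam, vecs = np.linalg.eigh(V)              # symmetric, so eigh; ascending order
    E = vecs[:, ::-1][:, :K].T                 # K x N: top-K eigenvectors as rows
    return X @ E.T, E                          # reduced samples and the projection matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 200 samples in 5 dimensions
X_reduced, E = pca_reduce(X, 2)
print(X_reduced.shape)                         # (200, 2)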
Problem: compress a dataset of photos of human faces
Eigenfaces is a way to convert an image of N pixels of a human face to K numbers such that \(K \ll N\). It computes the first K principal components of the dataset (the K eigenfaces), then it expresses the original image in terms of these K eigenvectors. This method is generally used in face recognition problems. In this example, we use it to compress a dataset of human faces.
For this experiment I used Python (code here) and the “Biggest Gender/Face Recognition” dataset[21] (licensed CC0) which includes over 10,000 images of 250 x 250 = 62500 color pixels. I transform them into grayscale (so each pixel is one number, not three), because computing eigenvectors for such a large matrix can be slow.
Each picture becomes a PyTorch tensor of N=62500 dimensions. Normalize (subtract the average and divide by the standard deviation) and find the K=1500 eigenvectors with the largest eigenvalues using SciPy’s eigh function (use eigh rather than eig since a covariance matrix is symmetric). Now we have the matrix \(E\).
To demo it, I take three images and convert each to an N-dimensional \(v_N\), same as above. Then the compressed face is \(v_K=E \cdot v_N\), and the reconstruction is \(v_N \approx E^T \cdot v_K\). Finally, un-normalize it to visualize. The next collage shows 3 original photos (top) and their reconstructions (bottom):
The next image shows the top 25 eigenfaces:
Finally, the next image shows the average face (left), the standard deviation face (middle), and their ratio (right):
Problem: visualize high-dimensional data in 2D
We can’t visualize spaces with four or more dimensions, but we can visualize 2D and 3D spaces. This is useful when we want to understand the data better, debug a classifier that doesn’t quite work, or see whether the data naturally forms clear groups.
To demo this, I use the IRIS dataset[23] (license CC BY 4.0) and project each sample (originally 4D) onto the top two eigenvectors.
Then plot the results in a 2D plane colored by iris species. The Python code is here.
The results are promising, as the three iris species segregate quite well along the dominant two eigenvectors. These two eigenvectors would be effective in an iris species classifier.
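The published code is linked above; an equivalent minimal sketch with scikit-learn (which performs the normalization and eigendecomposition internally) might look like this:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)    # normalize the 4 dimensions
X_2d = PCA(n_components=2).fit_transform(X)      # project onto the top 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)   # color by species
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.show()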
Thank you
Thank you for reading this far!
I hope this article made eigenvectors intuitive and offered an exciting overview of their wide applicability.
References
The main inspiration for this article is Sheldon Axler’s book Linear Algebra Done Right[1].
- Sheldon Axler, Linear Algebra Done Right (2024), Springer
- Eigenvalues and eigenvectors, Wikipedia
- Transformation matrix, Wikipedia
- Stochastic matrix, Wikipedia
- PageRank Algorithm – The Mathematics of Google Search (2009), Cornell
- Perron–Frobenius theorem, Wikipedia
- Modularity (networks), Wikipedia
- Zachary’s karate club, Wikipedia
- Wayne W. Zachary, An Information Flow Model for Conflict and Fission in Small Groups (1977), Journal of Anthropological Research: Vol 33, No 4
- Whitening transformation, Wikipedia
- Covariance matrix, Wikipedia
- Principal component analysis, Wikipedia
- Stephen Oladele, Top 12 Dimensionality Reduction Techniques for Machine Learning (2024)
- MIT OpenCourseWare, Finding Clusters in Graphs (2019), YouTube
- Victor Lavrenko, PCA 4: principal components = eigenvectors (2014), YouTube
- 3Blue1Brown, Eigenvectors and eigenvalues | Chapter 14, Essence of linear algebra (2016), YouTube
- Proof that the largest eigenvalue of a stochastic matrix is 1 (2011), Mathematics Stack Exchange
- A matrix and its transpose have the same set of eigenvalues/other version (2012), Mathematics Stack Exchange
- Laplacian matrix, Wikipedia
- Orthogonality of eigenvectors of laplacian (2013), Mathematics Stack Exchange
- Biggest gender/face recognition dataset (2021), Kaggle
- Shriphani Palakodety, The Smallest Eigenvalues of a Graph Laplacian (2015)
- R. A. Fisher, Iris dataset (2021), UCI Machine Learning Repository