# A Beginner’s Guide to Singular Value Decomposition (SVD)

## No prior knowledge required — just common sense and a brain

--

“Any fool can make something complicated.”

As a 16 year old new to machine learning, I came across a ton of acronyms which had absolutely no meaning: PCA, SVD, MLP, etc. It’s almost like FDR’s alphabet soup from WWII (CCC, WPA, PWA (for all you history buffs ;)). It just blew my mind how tons of sources and in professional academia papers outlined the same exact definition but didn’t explain what this core machine learning concept ultimately meant:

The SVD of an mxn matrix A with real values is a factorization of A as U∑V^T, where U is an mxm orthogonal matrix, V is an nxn orthogonal matrix, and ∑ is a diagonal matrix with nonnegative real entries on the diagonal.

SVD is a core concept in machine learning and it’s used to make sense of data with tons of noise. However, it’s normally explained with math — but that doesn’t help students understand the topic conceptually.

Here’s my attempt at explaining this seemingly unintuitive and complex topic without any of the random linear algebra jargon thrown at you from nowhere after a straight 8 hour sprint :) Let’s get right to it — you’ll see how intuitive this concept actually is.

But…before we jump in, we do have to understand a few terms, namely:

**span**— basically gives us the entire range of possible values that you can get by multiplying each vector in a set of vectors by any real coefficient and then adding these values up (each of these combinations is called a**linear combination**)- orthogonal — essentially the same thing as being perpendicular but in multi-dimensional space

Imagine that you are a four year old and you have a set of m&ms and you have to sort them. What’s the first characteristic, or “feature” of the m&m dataset that you would look at? Here’s our little dataset of m&m’s below to help you visualize.

You’d go by color right? It’s obvious to a 4 year old but apparently not to a computer — and here’s why.

Color isn’t the only feature of m&ms right? There are tons of other features that I could list below, including…

- The amount of chocolate on the inside
- The curvature of the m&ms
- The size in mms of the m inscribed onto the m&m
- microscopic dents and their position on the m&m surface
- The shade of the particular color of the m&m

Now imagine if you went through the m&ms trying to sort them based on all of these minute characteristics + 1000s more. You’d probably take years — all while the color is starting **right at you**.

All the features except for color create **noise** in the data. In fact, an infinitesimal shift in the the position of the m on one m&m would cost you so much if you were to take all factors into consideration.

So how does a computer sort all of the m&ms without spending millions of years? And how does it know what class a new m&m falls? How does it define classes based on color rather than based off of irrelevant things like dents in the m&ms? It does all of this through a process called Singular Value Decomposition or SVD.

What SVD is doing is trying to identify a relevant sub-space of all of the dimensions created by the noise in the data. In other words — it’s trying to simply down the data so that it knows that color… and maybe one or two other factors are important in determining where them m&m falls on a scale of tone and shade.

You can almost think about this of a line fitting problem. The SVD is trying to find a line in multi-dimensional space that allows it to classify the data in the most simple way.

Let n = the number of dimensions present in the data. At first, the SVD algorithm will to try to find the best linear approximations. Then it will try to solve for the best fitting line in n-1 dimensions. Then n-2, n-3 etc all the way until the last dimension: a single line. As the value of n goes down in dimension, SVD goes from weaker approximations to stronger approximations of the data. But how “strong” an approximation is needs to be taken in perspective along with how important the features left in a particular dimension are.

These “approximations” are calculated by the SVD algorithm to form what are known as “singular vectors” and “singular values.”

Okay, let’s go back to some high school math. Remember the pythagorean theorem

Given a one dimensional subspace, the goal is to find the vector of all the other features which best approximates the data.

Now imagine that we’re in one dimension. If the line growing through b is the span of the one dimensional vector that we’re given, then our goal is to minimize the distance between c and b by either elongating or shortening the vector v so that it is closest to the actual data vector x. In other words, we want to optimize *x(x*v)* (the length of *b*) to be the largest number possible in order to minimize the distance (*a*) between the two vectors.

Essentially, by maximizing B in A² + B² = C², we can minimize the value of A (the distance between the two vectors)

The vector *v* in *x(x*v)* that optimizes to minimize the distance between the data and the projection is known as the **singular vector**. The value of the data matrix multiplied by the vector is known as the singular value. It tells us how much of the data is approximated by the vector. The larger the singular value, the greater the data is approximated by the line.

This is a strong approximation, but it likely doesn’t encapsulate all of the variance in the data. It’s kinda like dividing our set of m&m’s into the three primary colors (red, yellow, and blue). Those are still valid categories, but it doesn’t capture all of the variation in color.

But this is only one dimension. Now we must look to approximating additional dimensions and looking to add additional dimensions to capture more variation. Therefore, we do the same exact process — except for vectors orthogonal to the first vector that we looked at (in essence, we’re repeating this one dimensional process except in a new dimension). We do this process again, optimizing to find the vector in the second dimension.

Then, we take the span of *{v1, v2}* to create the 2 dimension approximation.

We can keep continuing this sequential process until *Av = 0. *(A is the entire data matrix) This could happen when k=n or when k<n (in which case not all of the rows of the data are independent of each other or add a new dimension).

That’s pretty much it! Let’s look back at the definition from the beginning.

The SVD of an mxn matrix A with real values is a factorization of A as U∑V^T, where U is an mxm orthogonal matrix, V is an nxn orthogonal matrix, and ∑ is a diagonal matrix with nonnegative real entries on the diagonal.

V^T is the orthonormal matrix containing all of the vectors v all orthogonal to each and which best approximate the data in the given dimension.

∑ is the diagonal matrix that contains all of the singular values. Finally, U is the matrix containing all of the standardized values for Av divided by their singular values.

The singular values go down with each dimension, which tells us that each dimension is adding less and less value. Our goal is to stop adding dimensions to our approximation when the singular values become relatively negligible.

There we have it! Singular value decomposition is nothing but decomposing a given matrix of data to understand how to best approximate the data to eliminate noise!