
A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS
Derivation, Discussion and Singular Value Decomposition

Jon Shlens | jonshlens@ucsd.edu
25 March 2003 | Version 1

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but poorly understood. The goal of this paper is to dispel the magic behind this black box. This tutorial focuses on building a solid intuition for how and why principal component analysis works; furthermore, it crystallizes this knowledge by deriving from first principles the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of the power of PCA as well as the when, the how and the why of applying this technique.

1 Overview

Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis - from neuroscience to computer graphics - because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified dynamics that often underlie it.

The goal of this tutorial is to provide both an intuitive feel for PCA and a thorough discussion of this topic. We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra and explicitly solve this problem. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world. We will discuss both the assumptions behind this technique as well as possible extensions to overcome these limitations.

The discussion and explanations in this paper are informal in the spirit of a tutorial. The goal of this paper is to educate. Occasionally, rigorous mathematical proofs are necessary although relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. The only assumption is that the reader has a working knowledge of linear algebra. Nothing more. Please feel free to contact me with any suggestions, corrections or comments.

2 Motivation: A Toy Example

Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we can not figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle to experimental science. Examples abound from complex systems such as neuroscience, photoscience, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying dynamics can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass m attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal," it oscillates indefinitely along the x-axis about its equilibrium at a set frequency.

Figure 1: A diagram of the toy example.

This is a standard problem in physics in which the motion along the x direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable x.

However, being ignorant experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in a three-dimensional space (since we live in a three-dimensional world). Specifically, we place three movie cameras around our system of interest. At 200 Hz each movie camera records an image indicating a two-dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes {a, b, c} at some arbitrary angles with respect to the system. The angles between our measurements might not even be 90°! Now, we record with the cameras for 2 minutes. The big question remains: how do we get from this data set to a simple equation of x?

We know a priori that if we were smart experimenters, we would have just measured the position along the x-axis with one camera. But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need!

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set, only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. We will refer to this example as we delve further into abstract concepts. Hopefully, by the end of this paper we will have a good understanding of how to systematically extract x using principal component analysis.

3 Framework: Change of Basis

The Goal: Principal component analysis computes the most meaningful basis to re-express a noisy, garbled data set. The hope is that this new basis will filter out the noise and reveal hidden dynamics. In the example of the spring, the explicit goal of PCA is to determine: "the dynamics are along the x-axis." In other words, the goal of PCA is to determine that x̂ - the unit basis vector along the x-axis - is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important and which are just redundant.

3.1 A Naive Basis

With a more precise definition of our goal, we need a more precise definition of our data as well. For each time sample (or experimental trial), an experimenter records a set of data consisting of multiple measurements (e.g. voltage, position, etc.). The number of measurement types is the dimension of the data set. In the case of the spring, this data set has 12,000 6-dimensional vectors, where each camera contributes a 2-dimensional projection of the ball's position.

In general, each data sample is a vector in m-dimensional space, where m is the number of measurement types. Equivalently, every time sample is a vector that lies in an m-dimensional vector space spanned by an orthonormal basis. All measurement vectors in this space are a linear combination of this set of unit length basis vectors. A naive and simple choice of a basis B is the identity matrix I.

$$\mathbf{B} = \begin{bmatrix} \mathbf{b_1} \\ \mathbf{b_2} \\ \vdots \\ \mathbf{b_m} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I}$$

where each row is a basis vector b_i with m components.

To summarize, at one point in time, camera A records a corresponding position (x_A(t), y_A(t)). Each trial can be expressed as a six-dimensional column vector X,

$$\vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix}$$

where each camera contributes two points and the entire vector X is the set of coefficients in the naive basis B.

3.2 Change of Basis

With this rigor we may now state more precisely what PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?

A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by (1) restricting the set of potential bases, and (2) formalizing the implicit assumption of continuity in a data set. A subtle point it is, but we have already assumed linearity by implicitly stating that the data set even characterizes the dynamics of the system! In other words, we are already relying on the superposition principle of linearity to believe that the data characterizes or provides an ability to interpolate between the individual data points.¹

¹ To be sure, complex systems are almost always nonlinear and often their main qualitative features are a direct result of their nonlinearity. However, locally linear approximations usually provide a good approximation because non-linear, higher order terms vanish at the limit of small perturbations.

With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors. Let X and Y be m × n matrices related by a linear transformation P. X is the original recorded data set and Y is a re-representation of that data set.

$$\mathbf{PX} = \mathbf{Y} \qquad (1)$$

Also let us define the following quantities.²

• p_i are the rows of P
• x_i are the columns of X
• y_i are the columns of Y

² In this section x_i and y_i are column vectors, but be forewarned: in all other sections x_i and y_i are row vectors.

Equation 1 represents a change of basis and thus can have many interpretations.

1. P is a matrix that transforms X into Y.
2. Geometrically, P is a rotation and a stretch which again transforms X into Y.
3. The rows of P, {p_1, ..., p_m}, are a set of new basis vectors for expressing the columns of X.

The latter interpretation is not obvious but can be seen by writing out the explicit dot products of PX.

$$\mathbf{PX} = \begin{bmatrix} \mathbf{p_1} \\ \vdots \\ \mathbf{p_m} \end{bmatrix} \begin{bmatrix} \mathbf{x_1} & \cdots & \mathbf{x_n} \end{bmatrix}$$

$$\mathbf{Y} = \begin{bmatrix} \mathbf{p_1}\cdot\mathbf{x_1} & \cdots & \mathbf{p_1}\cdot\mathbf{x_n} \\ \vdots & \ddots & \vdots \\ \mathbf{p_m}\cdot\mathbf{x_1} & \cdots & \mathbf{p_m}\cdot\mathbf{x_n} \end{bmatrix}$$

We can note the form of each column of Y.

$$\mathbf{y_i} = \begin{bmatrix} \mathbf{p_1}\cdot\mathbf{x_i} \\ \vdots \\ \mathbf{p_m}\cdot\mathbf{x_i} \end{bmatrix}$$

We recognize that each coefficient of y_i is a dot product of x_i with the corresponding row in P. In other words, the j-th coefficient of y_i is a projection onto the j-th row of P. This is in fact the very form of an equation where y_i is a projection onto the basis of {p_1, ..., p_m}. Therefore, the rows of P are indeed a new set of basis vectors for representing the columns of X.
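As a concrete illustration of Equation 1 - a minimal sketch of my own, not taken from the paper - the following Matlab fragment applies a 2 × 2 rotation P to a small data matrix X whose columns are samples, and checks that each coefficient of a column of Y is the dot product of that sample with the corresponding row of P:

    theta = pi/4;                                         % hypothetical rotation angle
    P = [cos(theta) sin(theta); -sin(theta) cos(theta)];  % rows of P are the new basis vectors
    X = [1 2 0; 0 1 3];                                   % columns of X are individual samples
    Y = P * X;                                            % Equation 1: re-express the data
    Y(2,1) - P(2,:) * X(:,1)                              % prints 0 (up to round-off): coefficients are projections onto rows of P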

3.3 Questions Remaining

By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors {p_1, ..., p_m} in this transformation will become the principal components of X. Several questions now arise.

• What is the best way to "re-express" X?
• What is a good choice of basis P?

These questions must be answered by next asking ourselves what features we would like Y to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.

4 Variance and the Goal

Now comes the most important question: what does "best express" the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions. This section will conclude with a mathematical goal for deciphering "garbled" data.

In a linear system "garbled" can refer to only two potential confounds: noise and redundancy. Let us deal with each situation individually.

4.1 Noise

Noise in any data set must be low or - no matter the analysis technique - no information about a system can be extracted. There exists no absolute scale for noise; rather, all noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), or a ratio of variances σ².

$$SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}} \qquad (2)$$

A high SNR (≫ 1) indicates high precision data, while a low SNR indicates noise contaminated data.

Figure 2: A simulated plot of (x_A, y_A) for camera A. The signal and noise variances σ²_signal and σ²_noise are graphically represented.

Pretend we plotted all data from camera A from the spring in Figure 2. Any individual camera should record motion in a straight line. Therefore, any spread deviating from straight-line motion must be noise. The variance due to the signal and noise are indicated in the diagram graphically. The ratio of the two, the SNR, thus measures how "fat" the oval is - a range of possibilities includes a thin line (SNR ≫ 1), a perfect circle (SNR = 1) or even worse. In summary, we must assume that our measurement devices are reasonably good. Quantitatively, this corresponds to a high SNR.

4.2 Redundancy

Redundancy is a more tricky issue. This issue is particularly evident in the example of the spring. In this case multiple sensors record the same dynamic information. Consider Figure 3 as a range of possible plots between two arbitrary measurement types r_1 and r_2. Panel (a) depicts two recordings with no redundancy. In other words, r_1 is entirely uncorrelated with r_2. This situation could occur by plotting two variables such as (x_A, humidity).

Figure 3: A spectrum of possible redundancies in data from the two separate recordings r_1 and r_2 (e.g. x_A, y_B). The best-fit line r_2 = kr_1 is indicated by the dashed line.

However, in panel (c) both recordings appear to be strongly related. This extremity might be achieved by several means.

• A plot of (x_A, x̃_A) where x_A is in meters and x̃_A is in inches.
• A plot of (x_A, x_B) if cameras A and B are very nearby.

Clearly in panel (c) it would be more meaningful to just have recorded a single variable, the linear combination r_2 − kr_1, instead of two variables r_1 and r_2 separately. Recording solely the linear combination r_2 − kr_1 would both express the data more concisely and reduce the number of sensor recordings (2 → 1 variables). Indeed, this is the very idea behind dimensional reduction.

4.3 Covariance Matrix

The SNR is solely determined by calculating variances. A corresponding but simple way to quantify the redundancy between individual recordings is to calculate something like the variance. I say "something like" because the variance is the spread due to one variable, but we are concerned with the spread between variables.

Consider two sets of simultaneous measurements with zero mean.³

A = {a_1, a_2, ..., a_n},  B = {b_1, b_2, ..., b_n}

³ These data sets are in mean deviation form because the means have been subtracted off or are zero.

The variance of A and B are individually defined as

$$\sigma^2_A = \langle a_i a_i \rangle_i, \qquad \sigma^2_B = \langle b_i b_i \rangle_i$$

where the expectation⁴ is the average over n variables. The covariance between A and B is a straightforward generalization.

$$\text{covariance of } A \text{ and } B \equiv \sigma^2_{AB} = \langle a_i b_i \rangle_i$$

⁴ ⟨·⟩_i denotes the average over values indexed by i.

Two important facts about the covariance:

• σ²_AB = 0 if and only if A and B are entirely uncorrelated.
• σ²_AB = σ²_A if A = B.

We can equivalently convert the sets A and B into corresponding row vectors

a = [a_1 a_2 ... a_n]
b = [b_1 b_2 ... b_n]

so that we may now express the covariance as a dot product matrix computation,

$$\sigma^2_{ab} \equiv \frac{1}{n-1}\,\mathbf{a}\mathbf{b}^T \qquad (3)$$

where the beginning term is a constant for normalization.⁵

⁵ The simplest possible normalization is 1/n. However, this provides a biased estimation of variance, particularly for small n. It is beyond the scope of this paper to show that the proper normalization for an unbiased estimator is 1/(n−1).

Finally, we can generalize from two vectors to an arbitrary number. We can rename the row vectors x_1 ≡ a, x_2 ≡ b and consider additional indexed row vectors x_3, ..., x_m. Now we can define a new m × n matrix X.

$$\mathbf{X} = \begin{bmatrix} \mathbf{x_1} \\ \vdots \\ \mathbf{x_m} \end{bmatrix} \qquad (4)$$

One interpretation of X is the following. Each row of X corresponds to all measurements of a particular type (x_i). Each column of X corresponds to a set of measurements from one particular trial (this is X from section 3.1). We now arrive at a definition for the covariance matrix S_X.

$$\mathbf{S_X} \equiv \frac{1}{n-1}\mathbf{XX}^T \qquad (5)$$

Consider how the matrix form XX^T, in that order, computes the desired value for the ij-th element of S_X. Specifically, the ij-th element of S_X is the dot product between the vector of the i-th measurement type and the vector of the j-th measurement type.

• The ij-th value of XX^T is equivalent to substituting x_i and x_j into Equation 3.
• S_X is a square symmetric m × m matrix.
• The diagonal terms of S_X are the variance of particular measurement types.
• The off-diagonal terms of S_X are the covariance between measurement types.

Computing S_X quantifies the correlations between all possible pairs of measurements. Between one pair of measurements, a large covariance corresponds to a situation like panel (c) in Figure 3, while zero covariance corresponds to entirely uncorrelated data as in panel (a).
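To make Equation 5 concrete, here is a minimal Matlab sketch of my own (not the paper's Appendix B listing) that computes S_X for a small, hypothetical data matrix whose rows are measurement types and whose columns are trials; the mean of each row is removed first so the data are in mean deviation form:

    X = [1 3 5 7; 2 2 8 4];                  % m = 2 measurement types, n = 4 trials (made-up numbers)
    [m, n] = size(X);
    X = X - repmat(mean(X, 2), 1, n);        % subtract the mean of each measurement type (row)
    Sx = (1 / (n - 1)) * (X * X');           % covariance matrix, Equation 5
    % diag(Sx) holds the variances; the off-diagonal entries hold the covariances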

S_X is special. The covariance matrix describes all relationships between pairs of measurements in our data set. Pretend we have the option of manipulating S_X. We will suggestively define our manipulated covariance matrix S_Y. What features do we want to optimize in S_Y?

4.4 Diagonalize the Covariance Matrix

If our goal is to reduce redundancy, then we would like each variable to co-vary as little as possible with other variables. More precisely, to remove redundancy we would like the covariances between separate measurements to be zero. What would the optimized covariance matrix S_Y look like? Evidently, in an "optimized" matrix all off-diagonal terms in S_Y are zero. Therefore, removing redundancy diagonalizes S_Y.

There are many methods for diagonalizing S_Y. It is curious to note that PCA arguably selects the easiest method, perhaps accounting for its widespread application.

PCA assumes that all basis vectors {p_1, ..., p_m} are orthonormal (i.e. p_i · p_j = δ_ij). In the language of linear algebra, PCA assumes P is an orthonormal matrix. Secondly, PCA assumes the directions with the largest variances are the most "important" or, in other words, most principal. Why are these assumptions easiest?

Envision how PCA might work. By the variance assumption PCA first selects a normalized direction in m-dimensional space along which the variance in X is maximized - it saves this as p_1. Again it finds another direction along which variance is maximized; however, because of the orthonormality condition, it restricts its search to all directions perpendicular to all previously selected directions. This could continue until m directions are selected. The resulting ordered set of p's are the principal components.
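A minimal code sketch of this greedy procedure - my own illustration using simple power iteration and deflation, not an algorithm given in the paper - assuming Sx is the m × m covariance matrix of Equation 5:

    P = zeros(m, m);
    C = Sx;                                  % working copy of the covariance matrix
    for i = 1:m
        p = randn(m, 1);                     % start from a random direction
        for iter = 1:200
            p = C * p;                       % power iteration pulls p toward the
            p = p / norm(p);                 % direction of maximum remaining variance
        end
        P(i, :) = p';                        % save the i-th principal direction
        C = C - (p' * C * p) * (p * p');     % deflate: remove the variance already explained,
    end                                      % so the next direction found is orthogonal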

In principle this simple pseudo-algorithm works; however, that would belie the true reason why the orthonormality assumption is particularly judicious. Namely, the true benefit of this assumption is that it makes the solution amenable to linear algebra. There exist decompositions that can provide efficient, explicit algebraic solutions.

Notice what we also gained with the second assumption. We have a method for judging the "importance" of each principal direction. Namely, the variances associated with each direction p_i quantify how principal each direction is. We could thus rank-order each basis vector p_i according to the corresponding variances.

We will now pause to review the implications of all the assumptions made to arrive at this mathematical goal.

4.5 Summary of Assumptions and Limits

This section provides an important context for understanding when PCA might perform poorly as well as a road map for understanding new extensions to PCA. The final section provides a more lengthy discussion about recent extensions.

I. Linearity

Linearity frames the problem as a change of basis. Several areas of research have explored how applying a nonlinearity prior to performing PCA could extend this algorithm - this has been termed kernel PCA.

II. Mean and variance are sufficient statistics.

The formalism of sufficient statistics captures the notion that the mean and the variance entirely describe a probability distribution. The only zero-mean probability distribution that is fully described by the variance is the Gaussian distribution.

In order for this assumption to hold, the probability distribution of x_i must be Gaussian. Deviations from a Gaussian could invalidate this assumption.⁶ On the flip side, this assumption formally guarantees that the SNR and the covariance matrix fully characterize the noise and redundancies.

⁶ A sidebar question: What if x_i are not Gaussian distributed? Diagonalizing a covariance matrix might not produce satisfactory results. The most rigorous form of removing redundancy is statistical independence: P(y_1, y_2) = P(y_1)P(y_2), where P(·) denotes the probability density. The class of algorithms that attempt to find the y_1, ..., y_m that satisfy this statistical constraint are termed independent component analysis (ICA). In practice, though, quite a lot of real world data are Gaussian distributed - thanks to the Central Limit Theorem - and PCA is usually a robust solution to slight deviations from this assumption.

III. Large variances have important dynamics.

This assumption also encompasses the belief that the data has a high SNR. Hence, principal components with larger associated variances represent interesting dynamics, while those with lower variances represent noise.

IV. The principal components are orthogonal.

This assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques. These techniques are highlighted in the two following sections.

We have discussed all aspects of deriving PCA - what remains is the linear algebra solutions. The first solution is somewhat straightforward while the second solution involves understanding an important decomposition.

5 Solving PCA: Eigenvectors of Covariance

We will derive our first algebraic solution to PCA using linear algebra. This solution is based on an important property of eigenvector decomposition. Once again, the data set is X, an m × n matrix, where m is the number of measurement types and n is the number of data trials. The goal is summarized as follows.

Find some orthonormal matrix P where Y = PX such that S_Y ≡ (1/(n−1)) YY^T is diagonalized. The rows of P are the principal components of X.

We begin by rewriting S_Y in terms of our variable of choice P.

$$\begin{aligned}
\mathbf{S_Y} &= \frac{1}{n-1}\mathbf{YY}^T \\
&= \frac{1}{n-1}(\mathbf{PX})(\mathbf{PX})^T \\
&= \frac{1}{n-1}\mathbf{PXX}^T\mathbf{P}^T \\
&= \frac{1}{n-1}\mathbf{P}(\mathbf{XX}^T)\mathbf{P}^T \\
\mathbf{S_Y} &= \frac{1}{n-1}\mathbf{PAP}^T
\end{aligned}$$

Note that we defined a new matrix A ≡ XX^T, where A is symmetric (by Theorem 2 of Appendix A).

Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by an orthogonal matrix of its eigenvectors (by Theorems 3 and 4 from Appendix A). For a symmetric matrix A, Theorem 4 provides

$$\mathbf{A} = \mathbf{EDE}^T \qquad (6)$$

where D is a diagonal matrix and E is a matrix of eigenvectors of A arranged as columns.

The matrix A has r ≤ m orthonormal eigenvectors, where r is the rank of the matrix. The rank of A is less than m when A is degenerate or all data occupy a subspace of dimension r ≤ m. Maintaining the constraint of orthogonality, we can remedy this situation by selecting (m − r) additional orthonormal vectors to "fill up" the matrix E. These additional vectors do not affect the final solution because the variances associated with these directions are zero.

Now comes the trick. We select the matrix P to be a matrix where each row p_i is an eigenvector of XX^T. By this selection, P ≡ E^T. Substituting into Equation 6, we find A = P^T DP. With this relation and Theorem 1 of Appendix A (P^{-1} = P^T) we can finish evaluating S_Y.

$$\begin{aligned}
\mathbf{S_Y} &= \frac{1}{n-1}\mathbf{PAP}^T \\
&= \frac{1}{n-1}\mathbf{P}(\mathbf{P}^T\mathbf{DP})\mathbf{P}^T \\
&= \frac{1}{n-1}(\mathbf{PP}^T)\mathbf{D}(\mathbf{PP}^T) \\
&= \frac{1}{n-1}(\mathbf{PP}^{-1})\mathbf{D}(\mathbf{PP}^{-1}) \\
\mathbf{S_Y} &= \frac{1}{n-1}\mathbf{D}
\end{aligned}$$

It is evident that the choice of P diagonalizes S_Y. This was the goal for PCA. We can summarize the results of PCA in the matrices P and S_Y.

• The principal components of X are the eigenvectors of XX^T, or the rows of P.
• The i-th diagonal value of S_Y is the variance of X along p_i.

In practice computing PCA of a data set X entails (1) subtracting off the mean of each measurement type and (2) computing the eigenvectors of XX^T. This solution is encapsulated in demonstration Matlab code included in Appendix B.
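The following Matlab fragment is a minimal sketch of that recipe, written for this summary rather than copied from Appendix B, and it assumes data is an m × n matrix with one measurement type per row and one trial per column:

    [m, n] = size(data);
    data = data - repmat(mean(data, 2), 1, n);       % (1) subtract off the mean of each measurement type
    Sx = (1 / (n - 1)) * (data * data');             % covariance matrix of Equation 5
    [E, D] = eig(Sx);                                % (2) eigenvectors and eigenvalues of the covariance
    [variances, order] = sort(diag(D), 'descend');   % rank-order the directions by variance
    P = E(:, order)';                                % rows of P are the principal components
    signals = P * data;                              % Y = PX, the data expressed in the new basis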

6 A More General Solution: Singular Value Decomposition

This section is the most mathematically involved and can be skipped without much loss of continuity. It is presented solely for completeness. We derive another algebraic solution for PCA and in the process find that PCA is closely related to singular value decomposition (SVD). In fact, the two are so intimately related that the names are often used interchangeably. What we will see, though, is that SVD is a more general method of understanding change of basis.

We begin by quickly deriving the decomposition. In the following section we interpret the decomposition and in the last section we relate these results to PCA.

6.1 Singular Value Decomposition

Let X be an arbitrary n × m matrix⁷ and X^T X be a rank r, square, symmetric m × m matrix. In a seemingly unmotivated fashion, let us define all of the quantities of interest.

⁷ Notice that in this section only we are reversing convention from m × n to n × m. The reason for this derivation will become clear in section 6.3.

• {v_1, v_2, ..., v_r} is the set of orthonormal m × 1 eigenvectors with associated eigenvalues {λ_1, λ_2, ..., λ_r} for the symmetric matrix X^T X:

$$(\mathbf{X}^T\mathbf{X})\mathbf{v_i} = \lambda_i \mathbf{v_i}$$

• σ_i ≡ √λ_i are positive real and termed the singular values.

• {u_1, u_2, ..., u_r} is the set of orthonormal n × 1 vectors defined by u_i ≡ (1/σ_i) X v_i.

We obtain the final definition by referring to Theorem 5 of Appendix A. The final definition includes two new and unexpected properties.

• u_i · u_j = δ_ij
• ‖X v_i‖ = σ_i

These properties are both proven in Theorem 5. We now have all of the pieces to construct the decomposition. The "value" version of singular value decomposition is just a restatement of the third definition.

$$\mathbf{X}\mathbf{v_i} = \sigma_i \mathbf{u_i} \qquad (7)$$

This result says quite a bit. X multiplied by an eigenvector of X^T X is equal to a scalar times another vector. The set of eigenvectors {v_1, v_2, ..., v_r} and the set of vectors {u_1, u_2, ..., u_r} are both orthonormal sets or bases in r-dimensional space.

We can summarize this result for all vectors in one matrix multiplication by following the prescribed construction in Figure 4. We start by constructing a new diagonal matrix Σ,

$$\boldsymbol{\Sigma} \equiv \begin{bmatrix} \tilde{\sigma}_1 & & & \\ & \ddots & & \mathbf{0} \\ & & \tilde{\sigma}_r & \\ & \mathbf{0} & & \mathbf{0} \end{bmatrix}$$

where σ̃_1 ≥ σ̃_2 ≥ ... ≥ σ̃_r are the rank-ordered set of singular values. Likewise we construct accompanying orthogonal matrices V and U,

$$\mathbf{V} = \begin{bmatrix} \mathbf{\tilde{v}_1} & \mathbf{\tilde{v}_2} & \cdots & \mathbf{\tilde{v}_m} \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} \mathbf{\tilde{u}_1} & \mathbf{\tilde{u}_2} & \cdots & \mathbf{\tilde{u}_n} \end{bmatrix}$$

where we have appended an additional (m − r) and (n − r) orthonormal vectors to "fill up" the matrices for V and U respectively.⁸ Figure 4 provides a graphical representation of how all of the pieces fit together to form the matrix version of SVD.

$$\mathbf{XV} = \mathbf{U\Sigma} \qquad (8)$$

where each column of V and U perform the "value" version of the decomposition (Equation 7). Because V is orthogonal, we can multiply both sides by V^{-1} = V^T to arrive at the final form of the decomposition.

$$\mathbf{X} = \mathbf{U\Sigma V}^T \qquad (9)$$

⁸ This is the same procedure used to fix the degeneracy in the previous section.

Although it was derived without motivation, this decomposition is quite powerful. Equation 9 states that any arbitrary matrix X can be converted to an orthogonal matrix, a diagonal matrix and another orthogonal matrix (or a rotation, a stretch and a second rotation). Making sense of Equation 9 is the subject of the next section.
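As a quick numerical check of Equation 9 - a throwaway sketch with an arbitrary matrix of my choosing - Matlab's built-in svd returns exactly these three factors:

    X = [3 1; 1 3; 0 2];        % an arbitrary n x m matrix (n = 3, m = 2)
    [U, S, V] = svd(X);         % U and V are orthogonal, S is diagonal
    norm(X - U * S * V')        % prints ~0, confirming X = U*Sigma*V'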

Figure 4: How to construct the matrix form of SVD from the "value" form. The "value" form of SVD is expressed in Equation 7, Xv_i = σ_i u_i. The mathematical intuition behind the construction of the matrix form is that we want to express all of the "value" equations in just one equation. All singular values are first rank-ordered, σ̃_1 ≥ σ̃_2 ≥ ... ≥ σ̃_r, and the corresponding vectors are indexed in the same rank order. Each pair of associated vectors v_i and u_i is stacked in the i-th column of its respective matrix, and the corresponding singular value σ_i is placed at the ii-th position along the diagonal of Σ. This generates the matrix equation XV = UΣ, in which V and U are m × m and n × n matrices respectively and Σ is a diagonal matrix with a few non-zero values along its diagonal. Solving this single matrix equation solves all of the "value" form equations.

6.2 Interpreting SVD

The final form of SVD (Equation 9) is a concise but thick statement to understand. Let us instead reinterpret Equation 7 as

$$\mathbf{Xa} = k\mathbf{b}$$

where a and b are column vectors and k is a scalar constant. The set {v_1, v_2, ..., v_m} is analogous to a and the set {u_1, u_2, ..., u_n} is analogous to b. What is unique though is that {v_1, v_2, ..., v_m} and {u_1, u_2, ..., u_n} are orthonormal sets of vectors which span an m or n dimensional space, respectively. In particular, loosely speaking these sets appear to span all possible "inputs" (a) and "outputs" (b). Can we formalize the view that {v_1, v_2, ..., v_m} and {u_1, u_2, ..., u_n} span all possible "inputs" and "outputs"?

We can manipulate Equation 9 to make this fuzzy hypothesis more precise.

$$\begin{aligned} \mathbf{X} &= \mathbf{U\Sigma V}^T \\ \mathbf{U}^T\mathbf{X} &= \mathbf{\Sigma V}^T \\ \mathbf{U}^T\mathbf{X} &= \mathbf{Z} \end{aligned}$$

where we have defined Z ≡ ΣV^T. Note that the previous columns {u_1, u_2, ..., u_n} are now rows in U^T. Comparing this equation to Equation 1, {u_1, u_2, ..., u_n} perform the same role as {p_1, p_2, ..., p_m}. Hence, U^T is a change of basis from X to Z. Just as before, where we were transforming column vectors, we can again infer that we are transforming column vectors. The fact that the orthonormal basis U^T (or P) transforms column vectors means that U^T is a basis that spans the columns of X. Bases that span the columns are termed the column space of X. The column space formalizes the notion of what are the possible "outputs" of any matrix.

There is a funny symmetry to SVD such that we can define a similar quantity - the row space.

$$\begin{aligned} \mathbf{XV} &= \mathbf{U\Sigma} \\ (\mathbf{XV})^T &= (\mathbf{U\Sigma})^T \\ \mathbf{V}^T\mathbf{X}^T &= \mathbf{\Sigma}^T\mathbf{U}^T \\ \mathbf{V}^T\mathbf{X}^T &= \mathbf{Z} \end{aligned}$$

where we have defined Z ≡ Σ^T U^T. Again the rows of V^T (or the columns of V) are an orthonormal basis for transforming X^T into Z. Because of the transpose on X, it follows that V is an orthonormal basis spanning the row space of X. The row space likewise formalizes the notion of what are possible "inputs" into an arbitrary matrix.

We are only scratching the surface for understanding the full implications of SVD. For the purposes of this tutorial, though, we have enough information to understand how PCA will fall within this framework.

6.3 SVD and PCA

With similar computations it is evident that the two methods are intimately related. Let us return to the original m × n data matrix X. We can define a new matrix Y as an n × m matrix,⁹

$$\mathbf{Y} \equiv \frac{1}{\sqrt{n-1}}\mathbf{X}^T$$

where each column of Y has zero mean. The definition of Y becomes clear by analyzing Y^T Y.

$$\begin{aligned}
\mathbf{Y}^T\mathbf{Y} &= \left(\frac{1}{\sqrt{n-1}}\mathbf{X}^T\right)^T\left(\frac{1}{\sqrt{n-1}}\mathbf{X}^T\right) \\
&= \frac{1}{n-1}\mathbf{X}^{TT}\mathbf{X}^T \\
&= \frac{1}{n-1}\mathbf{X}\mathbf{X}^T \\
\mathbf{Y}^T\mathbf{Y} &= \mathbf{S_X}
\end{aligned}$$

⁹ Y is of the appropriate n × m dimensions laid out in the derivation of section 6.1. This is the reason for the "flipping" of dimensions in 6.1 and Figure 4.

By construction Y^T Y equals the covariance matrix of X. From section 5 we know that the principal components of X are the eigenvectors of S_X. If we calculate the SVD of Y, the columns of matrix V contain the eigenvectors of Y^T Y = S_X. Therefore, the columns of V are the principal components of X. This second algorithm is encapsulated in Matlab code included in Appendix B.
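A minimal sketch of this SVD route (again my own illustration rather than the Appendix B listing), continuing with data as an m × n matrix whose row means have already been subtracted:

    [m, n] = size(data);
    Y = data' / sqrt(n - 1);    % the n x m matrix Y defined above
    [U, S, V] = svd(Y);         % columns of V are the eigenvectors of Y'*Y = Sx
    P = V';                     % rows of P are the principal components of the data
    signals = P * data;         % the original data projected onto the principal components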

What does this mean? V spans the row space of Y ≡ (1/√(n−1)) X^T. Therefore, V must also span the column space of (1/√(n−1)) X. We can conclude that finding the principal components¹⁰ amounts to finding an orthonormal basis that spans the column space of X.

¹⁰ If the final goal is to find an orthonormal basis for the column space of X, then we can calculate it directly without constructing Y. By symmetry the columns of U produced by the SVD of (1/√(n−1)) X must also be the principal components.

7 Discussion and Conclusions

7.1 Quick Summary

Performing PCA is quite simple in practice.

1. Organize a data set as an m × n matrix, where m is the number of measurement types and n is the number of trials.
2. Subtract off the mean for each measurement type or row x_i.
3. Calculate the SVD or the eigenvectors of the covariance.

In several fields of literature, many authors refer to the individual measurement types x_i as the sources. The data projected into the principal components Y = PX are termed the signals, because the projected data presumably represent the "true" underlying probability distributions.

7.2 Dimensional Reduction

One benefit of PCA is that we can examine the variances S_Y associated with the principal components. Often one finds that large variances are associated with the first k < m principal components, and then a precipitous drop-off. One can conclude that most interesting dynamics occur only in the first k dimensions. In the example of the spring, k = 1. Likewise, in Figure 3 panel (c), we recognize k = 1 along the principal component of r_2 = kr_1. This process of throwing out the less important axes can help reveal hidden, simplified dynamics in high dimensional data. This process is aptly named dimensional reduction.
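In code, a hypothetical (and commonly used, though not paper-specified) rule of thumb is to keep the smallest k whose variances capture some fixed fraction of the total variance; continuing the earlier sketch, where variances and signals come from the eigenvector (or SVD) step:

    explained = cumsum(variances) / sum(variances);   % cumulative fraction of total variance
    k = find(explained >= 0.90, 1);                   % smallest k capturing 90% of the variance
    reduced = signals(1:k, :);                        % keep only the first k rows of Y = PX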

7.3 Limits and Extensions of PCA

Both the strength and weakness of PCA is that it is a non-parametric analysis. One only needs to make the assumptions outlined in section 4.5 and then calculate the corresponding answer. There are no parameters to tweak and no coefficients to adjust based on user experience - the answer is unique¹¹ and independent of the user.

¹¹ To be absolutely precise, the principal components are not uniquely defined. One can always flip the direction by multiplying by −1. In addition, eigenvectors beyond the rank of a matrix (i.e. σ_i = 0 for i > rank) can be selected almost at whim. However, these degrees of freedom do not affect the qualitative features of the solution nor a dimensional reduction.

This same strength can also be viewed as a weakness. If one knows a priori some features of the dynamics of a system, then it makes sense to incorporate these assumptions into a parametric algorithm - or an algorithm with selected parameters.

Figure 5: Data points (black dots) tracking a person on a ferris wheel. The extracted principal components are (p_1, p_2) and the phase is θ.

Consider the recorded positions of a person on a ferris wheel over time in Figure 5. The probability distributions along the axes are approximately Gaussian and thus PCA finds (p_1, p_2); however, this answer might not be optimal. The most concise form of dimensional reduction is to recognize that the phase (or angle along the ferris wheel) contains all dynamic information. Thus, the appropriate parametric algorithm is to first convert the data to the appropriately centered polar coordinates and then compute PCA. This prior non-linear transformation is sometimes termed a kernel transformation and the entire parametric algorithm is termed kernel PCA. Other common kernel transformations include Fourier and Gaussian transformations. This procedure is parametric because the user must incorporate prior knowledge of the dynamics in the selection of the kernel, but it is also more optimal in the sense that the dynamics are more concisely described.

Figure 6: Non-Gaussian distributed data causes PCA to fail. In exponentially distributed data the axes with the largest variance do not correspond to the underlying basis.

Sometimes, though, the assumptions themselves are too stringent. One might envision situations where the principal components need not be orthogonal. Furthermore, the distributions along each dimension (x_i) need not be Gaussian. For instance, Figure 6 contains a 2-D exponentially distributed data set. The largest variances do not correspond to the meaningful axes, and thus PCA fails.

This less constrained set of problems is not trivial and only recently has been solved adequately via Independent Component Analysis (ICA). The formulation is equivalent:

Find a matrix P where Y = PX such that S_Y is diagonalized.

However, ICA abandons all assumptions except linearity and attempts to find axes that satisfy the most formal form of redundancy reduction - statistical independence. Mathematically ICA finds a basis such that the joint probability distribution can be factorized,¹²

$$P(y_i, y_j) = P(y_i)P(y_j)$$

for all i and j, i ≠ j. The downside of ICA is that it is a form of nonlinear optimization, making the solution difficult to calculate in practice and potentially not unique. However, ICA has been shown to be a very practical and powerful algorithm for solving a whole new class of problems.

¹² Equivalently, in the language of information theory the goal is to find a basis P such that the mutual information I(y_i, y_j) = 0 for i ≠ j.

Writing this paper has been an extremely instructional experience for me. I hope that this paper helps to demystify the motivation and results of PCA, and the underlying assumptions behind this important analysis technique.

8 Appendix A: Linear Algebra

This section proves a few unapparent theorems in linear algebra, which are crucial to this paper.

1. The inverse of an orthogonal matrix is its transpose.

The goal of this proof is to show that if A is an orthogonal matrix, then A^{-1} = A^T.

Let A be an m × n matrix

$$\mathbf{A} = \begin{bmatrix} \mathbf{a_1} & \mathbf{a_2} & \cdots & \mathbf{a_n} \end{bmatrix}$$

where a_i is the i-th column vector. We now show that A^T A = I where I is the identity matrix.

Let us examine the ij-th element of the matrix A^T A. The ij-th element of A^T A is (A^T A)_ij = a_i^T a_j.

Remember that the columns of an orthonormal matrix are orthonormal to each other. In other words, the dot product of any two columns is zero. The only exception is a dot product of one particular column with itself, which equals one.

$$(\mathbf{A}^T\mathbf{A})_{ij} = \mathbf{a_i}^T\mathbf{a_j} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

A^T A is the exact description of the identity matrix. The definition of A^{-1} is A^{-1}A = I. Therefore, because A^T A = I, it follows that A^{-1} = A^T.

2. If A is any matrix, the matrices A^T A and AA^T are both symmetric.

Let's examine the transpose of each in turn.

$$(\mathbf{AA}^T)^T = \mathbf{A}^{TT}\mathbf{A}^T = \mathbf{AA}^T$$

$$(\mathbf{A}^T\mathbf{A})^T = \mathbf{A}^T\mathbf{A}^{TT} = \mathbf{A}^T\mathbf{A}$$

The equality of each quantity with its transpose completes this proof.

3. A matrix is symmetric if and only if it is orthogonally diagonalizable.

Because this statement is bi-directional, it requires a two-part "if-and-only-if" proof. One needs to prove the forward and the backwards "if-then" cases.

Let us start with the forward case. If A is orthogonally diagonalizable, then A is a symmetric matrix. By hypothesis, orthogonally diagonalizable means that there exists some E such that A = EDE^T, where D is a diagonal matrix and E is some special matrix which diagonalizes A. Let us compute A^T.

$$\mathbf{A}^T = (\mathbf{EDE}^T)^T = \mathbf{E}^{TT}\mathbf{D}^T\mathbf{E}^T = \mathbf{EDE}^T = \mathbf{A}$$

Evidently, if A is orthogonally diagonalizable, it must also be symmetric.

The reverse case is more involved and less clean, so it will be left to the reader. In lieu of this, hopefully the "forward" case is suggestive if not somewhat convincing.

4. A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.

Restated in math, let A be a square n × n symmetric matrix with associated eigenvectors {e_1, e_2, ..., e_n}.