This example demonstrates how to create a function that can be used to perform Principal Component Analysis (PCA) of a large data set. It shows, step by step, how to use the toolbox for this task.
This example is not intended to explain how you can use PCA to analyse a data set. For that you should consult a book on statistics such as [14].
From a numerical point of view, a PCA is just an SVD of a centered data matrix, i.e. a data matrix where a common mean is removed. Typically, statisticians use the opposite convention of numerical analysts and arrange objects as rows rather than columns [14].
If X
is an m
-by-n
data-matrix with
m
objects arranged as row-vectors, then we may compute the centered
data matrix:
mu=mean(X); XX=X-repmat(mu,m,1);
or, equivalently:
mu=mean(X); XX=X-ones(m,1)*mu;
To complete the PCA we compute the SVD:
[U,S,V]=svd(XX,0);
One of the reasons of doing a PCA is to reduce the dimensionality of the large,
often sparse, data set. We may reduce the set to L
dimensions:
sig=diag(S); sig=sig(1:L); U=U(:,1:L); V=V(:,1:L);
In PCA jargon, the matrix U*diag(sig)
is called the factor scores
or component scores matrix. Similarly, V
is the factor loading
or component loading matrix. If we keep all components we may reconstruct the
centered data matrix as XX=U*S*V'
. We will return U
and
sig
separately in this example.
Note: some authors arrange the objects as columns rather than rows. Another
common variation is to divide each variable (column in our convention) by an estimate of
the variance. Finally, the elements in the vector sig
must be squared to
get the eigenvalues of the covariance matrix.
The procedure above can be costly for a large data-set, both in terms of memory and
computations. Let us instead derive a function, based on
bbsvds
, that accomplishes the same task.
We start by settling on a syntax:
[sig,mu,U,V]=bbpca(X,L) [sig,mu,U,V]=bbpca(X,L,opt) [sig,mu,U,V]=bbpca(X,L,arr) [sig,mu,U,V]=bbpca(X,L,arr,opt)
To make the program slightly more general, we add the optional argument
arr
, which is a string that specifies whether the objects are
arranged is rows (arr='row'
, the default) or columns
(arr='col'
). opt
can be used to give parameters to
bbsvds
for the computation.
This leads to the function definition:
function [sig,mu,U,V]=bbpca(X,L,varargin)
This is a matter of trivial, but tedious, Matlab programming:
opt=struct; arr='row'; if nargin==4 arr=lower(varargin{1}); opt=varargin{2}; elseif (nargin==3) && isa(varargin{1},'struct') opt=varargin{1}; elseif nargin==3 arr=lower(varargin{1}); end
We now create a black-box operator corresponding to the centered
data-matrix XX
. The reason XX
is expensive to produce is that
it is full, even when X
is extremely sparse.
Fortunately, XX
can be formulated in a much more friendly way:
[m,n]=size(X); if strcmp(arr,'row') XX=X-bbprod(bbconst(m,1),mean(X,1)); else XX=X-bbprod(mean(X,2),bbconst(1,n)); end
This code creates a matrix of rank-1 which is equivalent to the repmat
used above. bbconst
creates a vector of
ones, while
works the usual way. The product is formed using
bbprod
, which creates an operator instead of
evaluating the products.
This code produces a centered data-matrix for any matrix or black-box operator
X
. The matrix is full, but the matrix-vector products very fast and
a negligible amount of memory is required aside from storing X
.
The scheme above conserves memory even if X
is dense. Although it is
not shown by
,
Matlab does not allocate memory for a copy of a matrix unless it is modified. Although
XX
contains a copy of X
, it will not take up extra memory
unless X
is modified. Therefore, forming the centered matrix as above
may be beneficial even if X
is dense.
Normally, there will not be a need to tune the computation. Basically, we just need
to tell bbsdvs
whether we want the singular
triplets, or just the values. This is decided from the number of output arguments
above:
opt.trip=(nargout>=3); [U,S,V]=bbsvds(XX,L,opt);
Although trivial in this case, it may sometimes be necessary to adjust the result. In this case we want to return the singular values as a vector:
sig=diag(S);
The only thing left is to put in some error checking. Any quality implementation
of a function must do this, although is not strictly needed to make bbpca
work.
A complete, working, implementation of
bbpca
could look like this.
%BBPCA Principal component analysis % [SIG,MU,U,V]=BBPCA(X,L), or [SIG,MU,U,V]=BBPCA(X,L,'row'), computes % the L leading principal components of the data-matrix X, assuming the % objects are arranged as rows in X. X can be a matrix or black-box % operator. % % SIG is a vector of the singular values of the centered data-matrix. % The eigenvalues of the covariance matrix is given by LAMBDA=SIG.^2. % % MU is the empirical mean of the objects, which is subtracted each % object before the leading components are computed. % % U contains the eigenvectors of the empirical covariance matrix. % The factor-scores matrix is given by U*DIAG(sig), while the % factor loading matrix is given by V. % % The data-matrix can be written in terms of all principal components: % X = U*DIAG(SIG)*V' + REPMAT(MU,SIZE(X,1),1) % % [SIG,MU,U,V]=BBPCA(X,L,'col') computes the PCA when the columns are % arranged in columns. Assuming all principal components are computed, % the data-matrix is given by: % X = U*DIAG(SIG)*V' + REPMAT(MU,1,SIZE(X,2)) % % The factor-scores matrix is given by DIAG(sig)*V, and the % factor-loading matrix is U. % % BBPCA uses BBSVDS for the computation. As a last argument you may % pass a struct to control this computation. % % Example: % [sig,mu,U,V]=bbpca(X,100); % % See also BBSVDS, BBSVDSOPT. % Copyright (C) 2005-2006, Esben Hgh-Rasmussen. function [sig,mu,U,V]=bbpca(X,L,varargin) % Get arguments. if (nargin<2) || (nargin>4) error('Illegal number of arguments.'); end opt=struct; arr='row'; if nargin==4 arr=lower(varargin{1}); opt=varargin{2}; elseif (nargin==3) && isa(varargin{1},'struct') opt=varargin{1}; elseif nargin==3 arr=lower(varargin{1}); end % Check. if ~(numel(L)==1) || (round(L)~=L) || (L<1) error('Illegal number of components.'); end if (~isa(arr,'string')) || ~any(strcmp(arr,{'row','col'})) error('Data-arrangement must be either ''row'' or ''col''.'); end if ~isa(opt,'struct') error('Options for SVDS must be a struct.'); end % Create centered black-box. [m,n]=size(X); if strcmp(arr,'row') mu=mean(X,1); XX=X-bbprod(bbconst(m,1),mu); else mu=mean(X,2); XX=X-bbprod(mu,bbconst(1,n)); end % Compute SVD. opt.trip=(nargout>=3); [U,S,V]=bbsvds(XX,L,opt); sig=diag(S);
[14] Kanti V. Mardia, John T. Kent, and John M. Bibby. Multivariate Analysis. Academic Press, London, 1995. ISBN: 0-12-471252-5. 10th print. (Book)