Black-Box Toolbox: PCA Example

Principal Component Analysis

This example demonstrates how to create a function that can be used to perform Principal Component Analysis (PCA) of a large data set. It shows, step by step, how to use the toolbox for this task.

Introduction to PCA

This example is not intended to explain how you can use PCA to analyse a data set. For that you should consult a book on statistics such as [14].

From a numerical point of view, a PCA is just an SVD of a centered data matrix, i.e. a data matrix where a common mean is removed. Typically, statisticians use the opposite convention of numerical analysts and arrange objects as rows rather than columns [14].

If X is an m-by-n data-matrix with m objects arranged as row-vectors, then we may compute the centered data matrix:

mu=mean(X);
XX=X-repmat(mu,m,1);

or, equivalently:

mu=mean(X);
XX=X-ones(m,1)*mu;

To complete the PCA we compute the SVD:

[U,S,V]=svd(XX,0);

One of the reasons of doing a PCA is to reduce the dimensionality of the large, often sparse, data set. We may reduce the set to L dimensions:

sig=diag(S); sig=sig(1:L);
U=U(:,1:L); V=V(:,1:L);

In PCA jargon, the matrix U*diag(sig) is called the factor scores or component scores matrix. Similarly, V is the factor loading or component loading matrix. If we keep all components we may reconstruct the centered data matrix as XX=U*S*V'. We will return U and sig separately in this example.

Note: some authors arrange the objects as columns rather than rows. Another common variation is to divide each variable (column in our convention) by an estimate of the variance. Finally, the elements in the vector sig must be squared to get the eigenvalues of the covariance matrix.

Creating a Program

The procedure above can be costly for a large data-set, both in terms of memory and computations. Let us instead derive a function, based on bbsvds, that accomplishes the same task.

Step 1: Decide Function Syntax

We start by settling on a syntax:

[sig,mu,U,V]=bbpca(X,L)
[sig,mu,U,V]=bbpca(X,L,opt)
[sig,mu,U,V]=bbpca(X,L,arr)
[sig,mu,U,V]=bbpca(X,L,arr,opt)

To make the program slightly more general, we add the optional argument arr, which is a string that specifies whether the objects are arranged is rows (arr='row', the default) or columns (arr='col'). opt can be used to give parameters to bbsvds for the computation.

This leads to the function definition:

function [sig,mu,U,V]=bbpca(X,L,varargin)

Step 2: Get Input Arguments

This is a matter of trivial, but tedious, Matlab programming:

opt=struct; arr='row';
if nargin==4
  arr=lower(varargin{1});
  opt=varargin{2};
elseif (nargin==3) && isa(varargin{1},'struct')
  opt=varargin{1};
elseif nargin==3
  arr=lower(varargin{1});
end

Step 3: Create a Centered Black-Box

We now create a black-box operator corresponding to the centered data-matrix XX. The reason XX is expensive to produce is that it is full, even when X is extremely sparse.

Fortunately, XX can be formulated in a much more friendly way:

[m,n]=size(X);
if strcmp(arr,'row')
  XX=X-bbprod(bbconst(m,1),mean(X,1));
else
  XX=X-bbprod(mean(X,2),bbconst(1,n));
end

This code creates a matrix of rank-1 which is equivalent to the repmat used above. bbconst creates a vector of ones, while works the usual way. The product is formed using bbprod, which creates an operator instead of evaluating the products.

This code produces a centered data-matrix for any matrix or black-box operator X. The matrix is full, but the matrix-vector products very fast and a negligible amount of memory is required aside from storing X.

Memory Conservation

The scheme above conserves memory even if X is dense. Although it is not shown by , Matlab does not allocate memory for a copy of a matrix unless it is modified. Although XX contains a copy of X, it will not take up extra memory unless X is modified. Therefore, forming the centered matrix as above may be beneficial even if X is dense.

Step 4: Compute the SVD

Normally, there will not be a need to tune the computation. Basically, we just need to tell bbsdvs whether we want the singular triplets, or just the values. This is decided from the number of output arguments above:

opt.trip=(nargout>=3);
[U,S,V]=bbsvds(XX,L,opt);

Although trivial in this case, it may sometimes be necessary to adjust the result. In this case we want to return the singular values as a vector:

sig=diag(S);

Step 5: Putting it Together

The only thing left is to put in some error checking. Any quality implementation of a function must do this, although is not strictly needed to make bbpca work.

A complete, working, implementation of bbpca could look like this.

%BBPCA  Principal component analysis
%   [SIG,MU,U,V]=BBPCA(X,L), or [SIG,MU,U,V]=BBPCA(X,L,'row'), computes
%   the L leading principal components of the data-matrix X, assuming the
%   objects are arranged as rows in X. X can be a matrix or black-box
%   operator.
%
%   SIG is a vector of the singular values of the centered data-matrix.
%   The eigenvalues of the covariance matrix is given by LAMBDA=SIG.^2.
%
%   MU is the empirical mean of the objects, which is subtracted each
%   object before the leading components are computed.
%
%   U contains the eigenvectors of the empirical covariance matrix.
%   The factor-scores matrix is given by U*DIAG(sig), while the
%   factor loading matrix is given by V.
%
%   The data-matrix can be written in terms of all principal components:
%     X = U*DIAG(SIG)*V' + REPMAT(MU,SIZE(X,1),1)
%
%   [SIG,MU,U,V]=BBPCA(X,L,'col') computes the PCA when the columns are
%   arranged in columns. Assuming all principal components are computed,
%   the data-matrix is given by:
%     X = U*DIAG(SIG)*V' + REPMAT(MU,1,SIZE(X,2))
%
%   The factor-scores matrix is given by DIAG(sig)*V, and the
%   factor-loading matrix is U.
%
%   BBPCA uses BBSVDS for the computation. As a last argument you may
%   pass a struct to control this computation.
%
%   Example:
%     [sig,mu,U,V]=bbpca(X,100);
%
%   See also BBSVDS, BBSVDSOPT.

% Copyright (C) 2005-2006, Esben Hgh-Rasmussen.
function [sig,mu,U,V]=bbpca(X,L,varargin)
  % Get arguments.
  if (nargin<2) || (nargin>4)
    error('Illegal number of arguments.');
  end
  opt=struct; arr='row';
  if nargin==4
    arr=lower(varargin{1});
    opt=varargin{2};
  elseif (nargin==3) && isa(varargin{1},'struct')
    opt=varargin{1};
  elseif nargin==3
    arr=lower(varargin{1});
  end

  % Check.
  if ~(numel(L)==1) || (round(L)~=L) || (L<1)
    error('Illegal number of components.');
  end
  if (~isa(arr,'string')) || ~any(strcmp(arr,{'row','col'}))
    error('Data-arrangement must be either ''row'' or ''col''.');
  end
  if ~isa(opt,'struct')
    error('Options for SVDS must be a struct.');
  end

  % Create centered black-box.
  [m,n]=size(X);
  if strcmp(arr,'row')
    mu=mean(X,1);
    XX=X-bbprod(bbconst(m,1),mu);
  else
    mu=mean(X,2);
    XX=X-bbprod(mu,bbconst(1,n));
  end

  % Compute SVD.
  opt.trip=(nargout>=3);
  [U,S,V]=bbsvds(XX,L,opt);
  sig=diag(S);

References

[14] Kanti V. Mardia, John T. Kent, and John M. Bibby. Multivariate Analysis. Academic Press, London, 1995. ISBN: 0-12-471252-5. 10th print. (Book)