Principal Component Analysis (PCA) MATLAB project example “Fitting an Orthogonal Regression Using Principal Components Analysis”

Fitting an Orthogonal Regression Using Principal Components Analysis

1. Background analysis

This example shows how to use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as Orthogonal Regression or Total Least Squares, and is appropriate when there is no natural distinction between predictor and response variables, or when all variables are measured with error. This is in contrast to the usual regression assumption that predictor variables are measured exactly, and only the response variable has an error component.

For example, given two data vectors x and y, you can fit a line that minimizes the perpendicular distances from each of the points (x(i), y(i)) to the line. More generally, with p observed variables, you can fit an r-dimensional hyperplane in p-dimensional space (r < p). The choice of r is equivalent to choosing the number of components to retain in PCA. It may be based on prediction error, or it may simply be a pragmatic choice to reduce data to a manageable number of dimensions.

In this example, we fit a plane and a line through some data on three observed variables. It’s easy to do the same thing for any number of variables, and for any dimension of model, although visualizing a fit in higher dimensions would obviously not be straightforward.

2. Fitting a plane to three-dimensional data

First, we generate some trivariate normal data for the example. Two of the variables are fairly strongly correlated.

rng(5,'twister');
X = mvnrnd([0 0 0], [1 .2 .7; .2 1 0; .7 0 1],50);
plot3(X(:,1),X(:,2),X(:,3),'bo');
grid on;
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);

This code uses rng to set the random seed so that the same random matrix can be reproduced on every run. The mvnrnd call draws n = 50 samples from a trivariate normal population with mean [0 0 0]; of course, the sample mean of those 50 draws is not exactly [0 0 0]. The three columns of X are then treated as the three coordinate axes, and plot3 draws the points in that space. Finally, the axis limits, aspect ratio, and viewpoint are adjusted to make the plot easier to read.

Knowledge points:

1. rng: controls the random number generator

rng controls the seeding of MATLAB's random number generator. The usual calling forms are rng(seed) or rng(seed, generator), where seed is a nonnegative integer and generator names one of the supported generator algorithms (here 'twister'). With the same seed, every run of the code produces the same sequence of random numbers. Calling rng('default') restores the default generator settings. (A short sketch of rng and mvnrnd follows this list.)

2. mvnrnd: multivariate normal random numbers

The calling forms are R = mvnrnd(mu, sigma, n) and R = mvnrnd(mu, sigma); they differ only in how many samples are generated, i.e., in the number of rows of R. With a 1-by-p mean vector, the second form returns a single row, while the first returns n rows. In the returned matrix, rows are samples and columns are variables. In this example mu is a 1-by-3 vector and sigma is a 3-by-3 covariance matrix, so the generated X has size [50, 3]; the second calling form would return a [1, 3] result. Note that the covariance matrix sigma must be symmetric and positive semi-definite.

3. plot3: three-dimensional point or line plot

The general calling form of plot3 is plot3(X1, Y1, Z1, LineSpec1, ..., Xn, Yn, Zn, LineSpecn), where the triples after the first can be omitted. X, Y, and Z can be scalars, vectors, or matrices, and LineSpec specifies the marker and line style. In this example plot3 is given three 50-element column vectors holding the x, y, and z coordinates, and the points are drawn in row order: the first point is (X(1,1), X(1,2), X(1,3)) and the last is (X(50,1), X(50,2), X(50,3)).

4. axis: sets the axis limits and aspect ratio

axis(limits) specifies the axis limits and takes 4, 6, or 8 values; for a 3-D plot the form is axis([xmin xmax ymin ymax zmin zmax]), which is why the code above passes [-maxlim maxlim -maxlim maxlim -maxlim maxlim].

axis square uses the axis style calling form, which controls the axis scaling. square makes all axes the same length and adjusts the increments between data units accordingly; other styles include tight, normal, and so on.

5. view: camera line of sight

view sets the azimuth and elevation of the camera's line of sight for the current axes. The calling form is view(az, el), where az is the azimuth angle and el is the elevation angle, both in degrees.
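
A minimal sketch of the reproducibility and sampling behavior described in points 1 and 2 (the variable names a and b are only for illustration, not part of the original example):

rng(5,'twister');  a = mvnrnd([0 0 0], eye(3), 5);
rng(5,'twister');  b = mvnrnd([0 0 0], eye(3), 5);
isequal(a, b)                          % true: same seed, same draws

size(mvnrnd([0 0 0], eye(3), 50))      % [50 3]: n samples, 3 variables
size(mvnrnd([0 0 0], eye(3)))          % [1 3]: a single sample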

Next, we fit a plane to the data using PCA. The coefficients for the first two principal components define vectors that form a basis for the plane. The third PC is orthogonal to the first two, and its coefficients define the normal vector of the plane.

[coeff,score,roots] = pca(X);
basis = coeff(:,1:2)
normal = coeff(:,3)

This code calls the pca function; its three return values contain all the information we need. Specifically:

1. coeff, the principal component coefficients: the projection matrix returned by pca. Here it is a square orthonormal matrix satisfying coeff' * coeff = I, with as many rows and columns as there are variables. It can also be viewed as an eigenvector matrix, with each column being one eigenvector (one principal component direction).

2. score, the principal component scores: the coordinates of the data projected onto each component. The first column is the projection of the data onto the first principal component, the second column the projection onto the second, and so on. The scores are computed from the centered data: score = (X - mean(X)) * coeff.

3. roots, the latent roots (called latent in the MATLAB documentation), i.e., the eigenvalues: a p-by-1 vector, where p is the number of variables. Each value is the variance explained by the corresponding column of coeff, and the values are sorted in descending order.

Therefore, the first two columns of coeff are the direction vectors of the first two principal components, and when fitting a plane these two vectors form a basis for the plane. Because the principal components are mutually orthogonal, the third principal component is the normal vector of the fitted plane. (These properties are checked in the sketch below.)
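
A quick check of these properties, reusing the X, coeff, score, and roots computed above (a sketch; Xc is just an intermediate name, and the ~0 results hold only up to floating-point round-off):

max(max(abs(coeff'*coeff - eye(3))))          % ~0: the coefficient matrix is orthonormal
Xc = X - repmat(mean(X,1), size(X,1), 1);     % centered data
max(max(abs(score - Xc*coeff)))               % ~0: scores are the centered data times coeff
all(diff(roots) <= 0)                         % true: eigenvalues are in descending order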

That’s all there is to the fit. But let’s look closer at the results, and plot the fit along with the data.

Because the first two components explain as much of the variance in the data as is possible with two dimensions, the plane is the best 2-D linear approximation to the data. Equivalently, the third component explains the least amount of variation in the data, and it is the error term in the regression. The latent roots (or eigenvalues) from the PCA define the amount of explained variance for each component.

pctExplained = roots' ./ sum(roots)
pctExplained = 1×3

    0.6226 0.2976 0.0798

Eigenvalues and explained variance go hand in hand: a large eigenvalue means the corresponding principal component explains a large share of the variance (maximum separability). You can therefore use these proportions to see how much of the variance each component explains. In the example above, only about 8% of the variance is explained by the third principal component. (The sketch below computes the same numbers from the score variances.)
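
The latent roots are exactly the sample variances of the score columns, so the same percentages can be computed either way (a sketch reusing score and roots from above; values match up to round-off):

var(score)'                        % equals roots: each eigenvalue is the variance of a score column
var(score)' ./ sum(var(score))     % same proportions as pctExplained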

The first two coordinates of the principal component scores give the projection of each point onto the plane, in the coordinate system of the plane. To get the coordinates of the fitted points in terms of the original coordinate system, we multiply each PC coefficient vector by the corresponding score, and add back in the mean of the data. The residuals are simply the original data minus the fitted points.

[n,p] = size(X);
meanX = mean(X,1);
Xfit = repmat(meanX,n,1) + score(:,1:2)*coeff(:,1:2)';
residuals = X - Xfit;

The Xfit computed here consists of the fitted points expressed in the original data space, i.e., the PCA reconstruction using the first two components, not the 2-D coordinates of the projections within the plane (those are score(:,1:2)). As noted above, score = (X - meanX) * coeff, so multiplying the first two score columns by the transpose of the first two columns of coeff and adding back the mean reconstructs the fitted points. Finally, the residuals are the original points minus the fitted points.
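
A short sketch of the reconstruction idea, reusing the variables defined above (Xrecon and Xfit2 are illustrative names): keeping all three components reproduces X exactly, while keeping two gives the fitted points.

Xrecon = repmat(meanX,n,1) + score*coeff';               % full reconstruction with all 3 components
max(max(abs(Xrecon - X)))                                % ~0: recovers the original data exactly
Xfit2 = repmat(meanX,n,1) + score(:,1:2)*coeff(:,1:2)';  % rank-2 reconstruction
isequal(Xfit2, Xfit)                                     % true: identical to the Xfit above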

Knowledge points:

1. mean: mean of an array

mean(A, dim): with dim = 1 the array is compressed along rows, giving a row vector with the mean of each column; with dim = 2 it is compressed along columns, giving a column vector with the mean of each row.

2. repmat: repeat copies of an array

repmat(A, r1, ..., rn) tiles copies of A. Here repmat(meanX, n, 1) stacks the 1-by-3 row vector meanX into a 50-by-3 matrix, so that the mean can be added back to every row of the fitted points (recall that pca works with centered data, and the sample mean is not necessarily the population mean [0 0 0]). An equivalent form using implicit expansion is sketched below.
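
For reference only (not part of the original example): on MATLAB R2016b and later, implicit expansion lets you subtract the 1-by-3 mean directly, which is equivalent to the repmat form.

Xc1 = X - repmat(meanX, n, 1);   % explicit tiling of the mean
Xc2 = X - meanX;                 % implicit expansion (R2016b and later)
isequal(Xc1, Xc2)                % true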

The equation of the fitted plane, satisfied by each of the fitted points in Xfit, is ([x1 x2 x3] - meanX)*normal = 0. The plane passes through the point meanX, and its perpendicular distance to the origin is meanX*normal. The perpendicular distance from each point in X to the plane, i.e., the norm of the residuals, is the dot product of each centered point with the normal to the plane. The fitted plane minimizes the sum of the squared errors.

error = abs((X - repmat(meanX,n,1))*normal);
sse = sum(error.^2)
sse = 15.5142

The mean is subtracted from X to center the data, and the centered points are then multiplied by the plane's normal vector, giving the perpendicular distance from each point to the fitted plane; squaring and summing these distances gives the sum of squared errors.
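
Two consistency checks on this error (a sketch reusing residuals, roots, and n from above; errFromResiduals and sseFromRoots are illustrative names): the perpendicular distances equal the row-wise norms of the residuals, and the sum of squared errors equals (n-1) times the third eigenvalue.

errFromResiduals = sqrt(sum(residuals.^2, 2));   % norm of each residual row
max(abs(errFromResiduals - error))               % ~0: same perpendicular distances
sseFromRoots = (n-1)*roots(3)                    % equals sse = 15.5142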

To visualize the fit, we can plot the plane, the original data, and their projection to the plane.

[xgrid,ygrid] = meshgrid(linspace(min(X(:,1)),max(X(:,1)),5), ...
                         linspace(min(X(:,2)),max(X(:,2)),5));
zgrid = (1/normal(3)) .* (meanX*normal - (xgrid.*normal(1) + ygrid.*normal(2)));
h = mesh(xgrid,ygrid,zgrid,'EdgeColor',[0 0 0],'FaceAlpha',0);

hold on
above = (X-repmat(meanX,n,1))*normal < 0;
below = ~above;
nabove = sum(above);
X1 = [X(above,1) Xfit(above,1) nan*ones(nabove,1)];
X2 = [X(above,2) Xfit(above,2) nan*ones(nabove,1)];
X3 = [X(above,3) Xfit(above,3) nan*ones(nabove,1)];
plot3(X1',X2',X3','-', X(above,1),X(above,2),X(above,3),'o', 'Color',[0 .7 0] );
nbelow = sum(below);
X1 = [X(below,1) Xfit(below,1) nan*ones(nbelow,1)];
X2 = [X(below,2) Xfit(below,2) nan*ones(nbelow,1)];
X3 = [X(below,3) Xfit(below,3) nan*ones(nbelow,1)];
plot3(X1',X2',X3','-', X(below,1),X(below,2),X(below,3),'o', 'Color',[1 0 0]) ;

hold off
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);

Knowledge points:

1. meshgrid: 2-D and 3-D grids

meshgrid returns grid coordinates. For example, with X = [1, 2, 3] and Y = [1, 2, 3] there are nine grid points over those ranges: (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), ..., (3, 3). Here both inputs are 1-by-5 vectors, so the returned xgrid and ygrid are 5-by-5 matrices whose corresponding entries are the grid points later drawn with mesh: (xgrid(1,1), ygrid(1,1)), (xgrid(1,2), ygrid(1,2)), ..., (xgrid(5,5), ygrid(5,5)), 25 points in total.

2. linspace: generates linearly spaced vectors

In this example, linspace(min(X(:,1)), max(X(:,1)), 5) returns 5 equally spaced values running from the minimum to the maximum of the first column of X.

3. Computing zgrid

The general equation of a plane is Ax + By + Cz + D = 0, where [A, B, C] is the normal vector and D is the offset. In vector form the normal is normal, and since meanX is known to lie on the fitted plane, meanX*normal + D = 0, which gives D. With x and y known (stored in xgrid and ygrid), solving for z amounts to solving normal(1)*xgrid + normal(2)*ygrid + normal(3)*zgrid - meanX*normal = 0, which is exactly the expression used for zgrid above. (The grid points are checked against the plane equation in the sketch after this list.)

4. plot3 with matrix inputs

The call plot3(X1',X2',X3','-', X(above,1),X(above,2),X(above,3),'o', 'Color',[0 .7 0]) can be split into two parts: plot3(X1',X2',X3','-') and plot3(X(above,1),X(above,2),X(above,3),'o'), followed by setting the color. The second part simply marks the original points above the plane with circles. The first part is not complicated either: X1', X2', and X3' are 3-by-m matrices, and plot3 draws one line per column, so each column connects an original point (X1'(1,j), X2'(1,j), X3'(1,j)) to its fitted point (X1'(2,j), X2'(2,j), X3'(2,j)) with a solid line. More generally, the first row of X1' holds the x-coordinates of the segment start points, the second row the x-coordinates of the end points, and X2' and X3' hold the corresponding y and z coordinates. The nan*ones(nabove,1) column provides a terminating NaN for each segment: NaN values are never drawn, so the projection segments are guaranteed to stay disconnected from one another.
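
Two small checks related to points 3 and 4 (a sketch reusing xgrid, ygrid, zgrid, meanX, and normal from the plotting code; gridPts is an illustrative name):

% Every mesh vertex satisfies the plane equation ([x y z] - meanX)*normal = 0.
gridPts = [xgrid(:) ygrid(:) zgrid(:)];
max(abs((gridPts - repmat(meanX, numel(xgrid), 1))*normal))   % ~0 up to round-off

% NaN values are skipped when plotting, so the two segments below stay disconnected.
figure
plot3([0 1 NaN 2 3]', [0 0 NaN 1 1]', [0 0 NaN 0 0]', 'k-')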

3. Fitting a line to three-dimensional data

Fitting a straight line to the data is even simpler, and because of the nesting property of PCA, we can use the components that have already been computed. The direction vector that defines the line is given by the coefficients for the first principal component. The second and third PCs are orthogonal to the first, and their coefficients define directions that are perpendicular to the line. The simplest equation to describe the line is meanX + t*dirVect, where t parameterizes the position along the line.

dirVect = coeff(:,1)

The general equation of a straight line in space can be written r = r0 + t*d, where r is the position vector of a point on the line, r0 is a known point on the line, d is the direction vector of the line, and t is a parameter. Since meanX is known to lie on the line, t is the parameter, and dirVect is the first principal component, the line is r = meanX + t*dirVect. (A short sketch of this parametric form follows.)
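
A sketch of the parametric form (tvals and linePts are illustrative names, and the parameter values are arbitrary): points generated as meanX + t*dirVect have no component along the second and third principal components, confirming that they lie on the fitted line.

tvals = (-2:2)';                                             % arbitrary parameter values
linePts = repmat(meanX, numel(tvals), 1) + tvals*dirVect';   % points on the line
max(max(abs((linePts - repmat(meanX, numel(tvals), 1))*coeff(:,2:3))))   % ~0: no PC2/PC3 component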

The first coordinate of the principal component scores gives the projection of each point onto the line. As with the 2-D fit, the PC coefficient vectors multiplied by the scores gives the fitted points in the original coordinate system.

Xfit1 = repmat(meanX,n,1) + score(:,1)*coeff(:,1)';

Plot the line, the original data, and their projection to the line.

t = [min(score(:,1))-.2, max(score(:,1)) + .2];
endpts = [meanX + t(1)*dirVect'; meanX + t(2)*dirVect'];
plot3(endpts(:,1),endpts(:,2),endpts(:,3),'k-');

X1 = [X(:,1) Xfit1(:,1) nan*ones(n,1)];
X2 = [X(:,2) Xfit1(:,2) nan*ones(n,1)];
X3 = [X(:,3) Xfit1(:,3) nan*ones(n,1)];
hold on
plot3(X1',X2',X3','b-', X(:,1),X(:,2),X(:,3),'bo');
hold off
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);
grid on
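
As a final check (a sketch reusing X, Xfit1, n, and roots; residuals1 and sseLine are illustrative names), the sum of squared perpendicular distances to the fitted line equals (n-1) times the sum of the two smaller eigenvalues, mirroring the plane-fit error computed earlier.

residuals1 = X - Xfit1;              % perpendicular offsets from the line
sseLine = sum(sum(residuals1.^2))    % total squared distance to the fitted line
(n-1)*(roots(2) + roots(3))          % the same value, obtained from the eigenvalues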