What is principal component analysis (PCA)?

Principal component analysis (PCA) reduces the dimensionality of a dataset with a large number of interrelated variables while retaining as much of the variation in the dataset as possible.

Principal components: the linear combinations of the original variables.
Scree plot: a plot that visualizes the dimensionality of the data.
Biplot: a plot that simultaneously shows information on both the observations and the variables in a multidimensional dataset.
Monoplot: a plot that shows information on either the observations or the variables in a multidimensional dataset.
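To make the idea of dimensionality reduction concrete, here is a minimal sketch using scikit-learn on synthetic data. The dataset, random seed and choice of two components are purely illustrative and not taken from the example discussed later: five interrelated variables are reduced to two principal components, and the model reports how much of the variation those two components retain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # 100 observations, 5 variables
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)    # make some variables interrelated
X[:, 4] = X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2)                         # keep only the first two components
scores = pca.fit_transform(X)                     # coordinates of each observation

print(scores.shape)                               # (100, 2)
print(pca.explained_variance_ratio_)              # variance retained by each component
```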

The first principal component (PC1) is the line that best accounts for the shape of the point swarm; it represents the direction of maximum variance in the data. Each observation (a yellow dot in the original figure) may be projected onto this line in order to get a coordinate value along the PC line. This new coordinate value is known as the score.
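A small sketch of that projection step, using made-up two-variable data (the numbers are arbitrary): the data are centered so that the PC line passes through the average point, the direction of maximum variance is taken from the covariance matrix, and each observation's score is its coordinate along that direction.

```python
import numpy as np

# Small illustrative dataset: 6 observations of 2 variables.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)             # the PC line passes through the average point
cov = np.cov(X_centered, rowvar=False)      # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh is suited to symmetric matrices
pc1 = eigvecs[:, np.argmax(eigvals)]        # unit vector along the direction of max variance

t1 = X_centered @ pc1                       # scores: one coordinate per observation
print(t1)
```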

Usually, one summary index or principal component is insufficient to model the systematic variation of a dataset. Thus, a second summary index, the second principal component (PC2), is calculated. The second PC is also represented by a line in the K-dimensional variable space and is orthogonal to the first PC. This line also passes through the average point and improves the approximation of the X-data as much as possible.

The second principal component (PC2) is oriented such that it reflects the second largest source of variation in the data while being orthogonal to the first PC. PC2 also passes through the average point. When two principal components have been derived, they together define a plane, a window into the K-dimensional variable space.

By projecting all the observations onto this low-dimensional subspace and plotting the results, it is possible to visualize the structure of the investigated dataset. The coordinate values of the observations on this plane are called scores, and hence the plotting of such a projected configuration is known as a score plot. In short, two PCs form a plane, and this plane is a window into the multidimensional space that can be visualized graphically.

Each observation may be projected onto this plane, giving a score for each. The score plot of the first two principal components displays these scores, which are called t1 and t2, and forms a map of the 16 countries in the example. Countries close to each other have similar food consumption profiles, whereas those far from each other are dissimilar. The Nordic countries (Finland, Norway, Denmark and Sweden) are located together in the upper right-hand corner, representing a group of nations with some similarity in food consumption.

Belgium and Germany are close to the center (origin) of the plot, which indicates they have average properties. The score plot thus provides a map of how the countries relate to each other. (In the original figure the countries are colored by geographic location, i.e., the latitude of the respective capital city.) In a PCA model with two components, that is, a plane in K-space, which variables (food provisions) are responsible for the patterns seen among the observations (countries)?
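A sketch of how such a score plot could be produced with scikit-learn and matplotlib. The food-consumption data are not reproduced here, so the 16 x 20 data matrix and the country labels below are placeholders, not the values behind the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
countries = [f"Country {i}" for i in range(1, 17)]   # hypothetical labels
X = rng.normal(size=(16, 20))                        # 16 countries x 20 food variables

scores = PCA(n_components=2).fit_transform(X)        # t1 and t2 for each country

plt.scatter(scores[:, 0], scores[:, 1])
for name, (t1, t2) in zip(countries, scores):
    plt.annotate(name, (t1, t2), fontsize=8)
plt.xlabel("t1 (score on PC1)")
plt.ylabel("t2 (score on PC2)")
plt.title("Score plot: nearby points have similar profiles")
plt.show()
```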

We would like to know which variables are influential, and also how the variables are correlated. Such knowledge is given by the principal component loadings (the loading vectors p1 and p2), usually displayed in a loading plot. The aim of this step is to understand how the variables of the input dataset vary from the mean with respect to one another, or in other words, to see whether there is any relationship between them, because variables are sometimes so highly correlated that they contain redundant information.
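A small sketch of how the loading vectors p1 and p2 can be read from a fitted scikit-learn PCA model; the variable names and data are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
variables = ["garlic", "olive_oil", "frozen_fish", "crisp_bread"]   # illustrative names
X = rng.normal(size=(50, len(variables)))

pca = PCA(n_components=2).fit(X)
p1, p2 = pca.components_                    # one row of loadings per component

for name, w1, w2 in zip(variables, p1, p2):
    print(f"{name:12s}  p1 = {w1:+.2f}  p2 = {w2:+.2f}")
```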

So, in order to identify these correlations, we compute the covariance matrix. What do the covariances that appear as entries of this matrix tell us about the correlations between the variables? It is their sign that matters: a positive covariance means the two variables increase or decrease together (they are correlated), while a negative covariance means that one increases when the other decreases (they are inversely correlated).
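A brief sketch of that computation with NumPy, using three made-up variables whose relationships are built in on purpose so that the signs of the covariances are easy to check.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=200)
b = a + 0.3 * rng.normal(size=200)             # moves together with a (positive covariance)
c = -a + 0.3 * rng.normal(size=200)            # moves opposite to a (negative covariance)

X = np.column_stack([a, b, c])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before PCA

cov = np.cov(X_std, rowvar=False)              # 3 x 3 covariance matrix
print(np.round(cov, 2))                        # signs show which variables move together
```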

Eigenvectors and eigenvalues are the linear algebra concepts we need to compute from the covariance matrix in order to determine the principal components of the data. Principal components are new variables that are constructed as linear combinations, or mixtures, of the initial variables. These combinations are done in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information within the initial variables is squeezed into the first components.
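That uncorrelatedness can be checked numerically. In this sketch, with arbitrary synthetic data, the covariance matrix of the component scores comes out essentially diagonal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # four correlated variables

scores = PCA().fit_transform(X)                   # one column of scores per component
print(np.round(np.cov(scores, rowvar=False), 6))  # off-diagonal entries are ~0
```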

The idea, then, is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum of the remaining information in the second, and so on, until the pattern looks like the one a scree plot shows. Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components carrying little information and treating the remaining components as your new variables.
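A sketch of such a scree plot; the 10-dimensional dataset is synthetic, with per-column variances chosen so that the explained-variance curve visibly decays.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# 10-dimensional data with decreasing variance per column, so the scree plot decays.
X = rng.normal(size=(200, 10)) * np.linspace(3.0, 0.5, 10)

ratios = PCA().fit(X).explained_variance_ratio_

plt.bar(range(1, 11), ratios, label="per component")
plt.plot(range(1, 11), np.cumsum(ratios), marker="o", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Fraction of variance explained")
plt.legend()
plt.title("Scree plot")
plt.show()
```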

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries.

To put all this simply, just think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more clearly visible.

There are as many principal components as there are variables in the data, and they are constructed in such a manner that the first principal component accounts for the largest possible variance in the dataset.

The second principal component is calculated in the same way, with the conditions that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance. This continues until a total of p principal components have been calculated, equal to the original number of variables.

Returning to eigenvectors and eigenvalues: the first thing to know about them is that they always come in pairs, so that every eigenvector has an eigenvalue, and their number is equal to the number of dimensions of the data.

For example, for a 3-dimensional dataset there are 3 variables, and therefore 3 eigenvectors with 3 corresponding eigenvalues. It is eigenvectors and eigenvalues that are behind all the magic explained above, because the eigenvectors of the covariance matrix are actually the directions of the axes along which there is the most variance (the most information), and these directions are what we call the principal components.

Eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.
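Putting the last few paragraphs together, here is a sketch of PCA done by hand from the covariance matrix, using synthetic three-variable data: the eigenvectors give the principal component directions, the eigenvalues give the variance each one carries, and projecting the centered data onto the eigenvectors gives the scores.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))   # three correlated variables

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)       # eigenpairs of the symmetric covariance matrix
order = np.argsort(eigvals)[::-1]            # sort by eigenvalue, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance carried by each PC:", np.round(eigvals, 3))
scores = X_centered @ eigvecs                # project the data onto the new axes
print("score matrix shape:", scores.shape)   # (100, 3)
```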
