Principal Components Analysis

What is Principal Components Analysis?

Principal component analysis (PCA) is one of a family of techniques for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. It is one of the simplest and most robust ways of achieving dimensionality reduction.
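As a minimal sketch of what this looks like in practice (assuming Python with NumPy and scikit-learn, which the original does not specify), here is a synthetic 5-feature dataset reduced to its 2 leading components:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples whose 5 features are driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

# Keep the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of total variance retained

Because the 5 features are driven by only 2 latent factors, the first two components retain almost all of the variance here.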

Why is Principal Components Analysis required?

If we keep increasing the number of features in the data, a phenomenon known as the curse of dimensionality begins to occur. The more features we have, the more complicated our numerical computations become, which affects the performance of machine learning algorithms, since they are mathematical computations on the data. Moreover, the amount of data needed to maintain a given predictive power grows exponentially with the number of dimensions; with a fixed number of samples, each added variable spreads the data more thinly and can reduce predictive power.

Random Data

Consider a random dataset consisting of 3 different classes of 30 sample points each, where each sample point has 3 features; the dataset therefore consists of 90 points with 3 features each. The points are drawn at random. When you run PCA, the first component points in the direction that accounts for the most variation in the 3-variable feature space. The second component finds the next direction, orthogonal to the first, with the second most variation, and so on.
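A sketch of this experiment (the class means below are arbitrary choices for illustration, since the original only says the points are random):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Three classes of 30 points each, 3 features per point (90 points total)
X = np.vstack([rng.normal(loc=mean, size=(30, 3)) for mean in (0.0, 3.0, 6.0)])

pca = PCA(n_components=3).fit(X)
# Rows of components_ are unit vectors, mutually orthogonal, ordered by
# the amount of variance they account for
print(pca.components_)
print(pca.explained_variance_ratio_)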



Adding an Outlier

Because PCA minimizes the L2 norm (i.e., the quadratic norm), it is sensitive to outliers in the same way as least-squares fitting or fitting a Gaussian. Because the large deviations contributed by outliers are squared, they dominate the total norm and therefore drive the PCA components.
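One way to see this effect (a hypothetical experiment, not from the original) is to refit PCA after appending a single extreme point:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 3))

first_clean = PCA(n_components=1).fit(X).components_[0]

# One extreme point: its squared deviation dominates the total L2 norm
X_out = np.vstack([X, [50.0, 50.0, 50.0]])
first_out = PCA(n_components=1).fit(X_out).components_[0]

print(first_clean)  # direction set by the bulk of the data
print(first_out)    # roughly aligned with the outlier, ~(1, 1, 1)/sqrt(3)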





Using MOON dataset

The “classic” PCA approach described above is a linear projection technique that works well if the data is linearly separable. In the case of linearly inseparable data, however, a nonlinear technique is required to reduce the dimensionality of the dataset. The basic idea for dealing with linearly inseparable data is to project it onto a higher-dimensional space where it becomes linearly separable. Here we apply a Gaussian/RBF kernel, k(x, x′) = exp(−γ‖x − x′‖²), where γ = 1/(2σ²) is a free parameter that is to be optimized.
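A sketch with scikit-learn's KernelPCA on the two-moons data (gamma=15 is an assumed value for illustration, not one the original tunes):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, y = make_moons(n_samples=100, noise=0.05, random_state=0)

# Linear PCA cannot untangle the two interleaved half-moons
X_linear = PCA(n_components=2).fit_transform(X)

# RBF kernel PCA projects implicitly into a higher-dimensional space;
# gamma = 1 / (2 * sigma**2) sets the kernel width
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15.0)
X_kpca = kpca.fit_transform(X)
# In the kernel projection, the first component alone largely separates
# the two classes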




Using IRIS dataset

The Iris dataset is a multivariate dataset consisting of 50 samples from each of three species of Iris:
  • Iris setosa
  • Iris virginica
  • Iris versicolor
To identify features that would help separate and classify the species, four features were measured for each sample (a short PCA sketch follows the list below):
  1. Sepal length
  2. Sepal width
  3. Petal length
  4. Petal width
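
A minimal sketch of reducing the four Iris features to two principal components (standardizing first is our added step, common practice but not stated in the original):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 species

# Standardize so each feature contributes comparably to the variance
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # first two components capture most variance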