Multicollinearity Check Using PCA

Nitesh Jindal
3 min read · Apr 27, 2019


The study below demonstrates how multicollinearity is curtailed through the use of PCA. For this demonstration, we take two random samples (one multicollinear dataset and one non-multicollinear dataset), apply Principal Component Analysis to each of them, and then compare the correlation plots. Please note that all the code in this document is written in Python.

Highly correlated variables cause problems when running Multiple Linear Regression and other regression techniques. PCA is one methodology that addresses this: it reduces dimensionality and produces principal components, which we can then regress onto the dependent variable during regression analysis.
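As a minimal sketch of this idea (the variable names and data here are our own illustration, not from the original study), two nearly collinear predictors can be replaced by their first principal component before fitting an ordinary least-squares regression:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

# Replace the correlated predictors with an uncorrelated principal
# component, then regress y onto that component.
components = PCA(n_components=1).fit_transform(X)
model = LinearRegression().fit(components, y)
print(model.score(components, y))  # R^2 close to 1
```

Because the first component captures almost all of the shared variation in x1 and x2, very little predictive information is lost.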

Steps followed are as follows:

1. Import the necessary Python libraries.

2. Create a random sample of 100 points and store it in a variable df (we call it a data frame).

3. Create a multicollinear dataset and a non-multicollinear dataset.

4. Draw correlation plots.

5. Scale the data.

6. Draw correlation plots for the principal components.

Note: Why scale before applying PCA?

If the variables in our data are on different scales, it is recommended to scale the data and center it at zero to bring the variables onto a common scale. This is because PCA projects the data points onto the direction that maximizes the variance. In simple words, consider the plot below from the referred gene example:

o The green point is a data point whose distance a from the origin is fixed and cannot be changed.

o The red dotted line is the best-fit line found by PCA. Since the distance a cannot be changed, we can either minimize b (the point's distance to the line) or maximize c (the distance from the origin to the point's projection onto the line) to obtain the best-fit line.

If the data were on different scales, the best-fit line would yield misleading variance values. To overcome this, it is advisable to scale the data and center it at the origin.
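A quick numeric sketch of this point (our own illustration, with made-up scales): when two features differ greatly in scale, the first principal component is dominated by the large-scale feature unless the data is standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
small = rng.normal(scale=1.0, size=200)     # feature on unit scale
large = rng.normal(scale=1000.0, size=200)  # feature on a much larger scale
X = np.column_stack([small, large])

# Without scaling, PC1 aligns almost entirely with the large-scale feature.
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.explained_variance_ratio_[0])  # close to 1.0

# After standardizing, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print(pca_scaled.explained_variance_ratio_[0])  # roughly 0.5
```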

The code for steps 1–6 above is as follows:

import pandas as pd
import random
from random import randint
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

random.seed(112)
rn = random.sample(range(1, 200), 100)

''' Create data frame '''
df = pd.DataFrame(rn, columns=['a'])

# Multicollinear data: d, e and g are exact linear functions of a
df_m = df.merge(
    df.a.apply(lambda s: pd.Series({'b': randint(0, 34), 'c': randint(0, 50),
                                    'd': s - 2, 'e': 2 * s,
                                    'f': randint(0, 20), 'g': s * 4})),
    left_index=True, right_index=True)

# Non-multicollinear data: every column is an independent random draw
df_nonm = df.merge(
    df.a.apply(lambda s: pd.Series({'b': randint(0, 50), 'c': randint(0, 40),
                                    'd': randint(0, 100), 'e': randint(0, 35),
                                    'f': randint(0, 120),
                                    'g': randint(0, 75)})),
    left_index=True, right_index=True)

''' Correlation matrices for the multicollinear and
non-multicollinear data before applying PCA '''
corr_m = df_m.corr()
sns.pairplot(corr_m)
corr_nonm = df_nonm.corr()
sns.pairplot(corr_nonm)

Correlation plots for the multicollinear data:

[Figure: correlation pair plot of the multicollinear data before applying PCA]

Scaling the data and applying PCA:

# Use StandardScaler() to standardize the features to unit scale
# (mean = 0 and variance = 1) before applying PCA
df_m_scaled = StandardScaler().fit_transform(df_m)
df_nonm_scaled = StandardScaler().fit_transform(df_nonm)

''' Principal Component Analysis (taking principal components = 4)
for both the multicollinear and non-multicollinear data '''

# PCA for the scaled multicollinear and scaled non-multicollinear data
# Multicollinear data -
pca_m = PCA(n_components=4)
principalcomponents_m = pca_m.fit_transform(df_m_scaled)
principaldf_m = pd.DataFrame(data = principalcomponents_m, columns = ['p1', 'p2', 'p3', "p4"])
# Non Multicollinear Data -
pca_nonm = PCA(n_components=4)
principalcomponents_nonm = pca_nonm.fit_transform(df_nonm_scaled)
principaldf_nonm = pd.DataFrame(data = principalcomponents_nonm, columns = ['p1', 'p2', 'p3', "p4"])

Correlation matrices of the principal components:

''' Correlation matrices for the multicollinear and
non-multicollinear data after applying PCA '''
# Multicollinear Data -
corr_m_pca = principaldf_m.corr()
sns.pairplot(corr_m_pca)
# Non Multicollinear Data -
corr_nonm_pca = principaldf_nonm.corr()
sns.pairplot(corr_nonm_pca)

Correlation pair plots after PCA:

[Figure: correlation plot of the multicollinear data after applying PCA]

Conclusion: The correlation plots drawn after applying PCA show that the principal components do not exhibit multicollinearity.
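This conclusion can also be quantified numerically rather than read off the plots (a hedged addition of ours, rebuilding a multicollinear frame of the same shape as df_m): the largest off-diagonal correlation is 1.0 before PCA and essentially zero after, because principal component scores are orthogonal by construction.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(112)
a = rng.uniform(1, 200, size=100)
df_m = pd.DataFrame({'a': a, 'b': rng.uniform(0, 34, 100),
                     'c': rng.uniform(0, 50, 100), 'd': a - 2,
                     'e': 2 * a, 'f': rng.uniform(0, 20, 100), 'g': 4 * a})

# Largest off-diagonal correlation before PCA: 1.0 (a, d, e, g are collinear)
before = df_m.corr().abs().values
np.fill_diagonal(before, 0)
print(before.max())

# After PCA the component scores are orthogonal, so correlations vanish
pcs = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(df_m))
after = pd.DataFrame(pcs).corr().abs().values
np.fill_diagonal(after, 0)
print(after.max())  # ~0 up to floating-point error
```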
