Principal Component Analysis and Loadings


Principal Component Analysis (PCA) is one of the most popular data analysis and dimensionality reduction techniques to this day. Coming from an ML background, I have mostly seen it used for dimensionality reduction, and somehow I never paid much attention to the "analysis" part of the name. Then, while reading a paper, I saw that the authors used PCA to justify their groupings and analyze their data, and they did so through something called loadings. After a bit of research, it turns out this is just a fancy name for the eigenvector components, but I wanted to write a post about it.

I will derive PCA from the maximum variance point of view, and then explain what loadings are and how they describe the relationship between features. Then, I will apply PCA to air quality measurements from Mersin, Turkey, and we will see whether we can identify similar stations from those loadings.

Deriving PCA

First, let’s derive PCA ourselves, from a maximum variance point of view. We are interested in the dataset $X \in \mathbb{R}^{n \times d}$, where each row represents a data point and each column represents a feature.

import numpy as np

n = 30  # number of data points
d = 5   # number of features
X = np.random.rand(n, d)

Before going further, let us normalize our data to have a mean of 0.

$$X_c = X - \bar{X}$$

where $\bar{X}$ represents the column-wise means (i.e., the mean of each feature).

X_c = X - X.mean(axis=0)   # center each feature (column) to zero mean

This is our data represented in the standard basis, i.e. $\begin{pmatrix}1&0&0&0&0\end{pmatrix}$, $\begin{pmatrix}0&1&0&0&0\end{pmatrix}$, and so on. We are interested in a different set of basis vectors, such that when the data is projected onto them, the variance is maximized. Let’s check the current variance first:

$$S = \frac{1}{N} X_c^T X_c$$
S = (X_c.T @ X_c) / n
S
array([[ 0.07165141,  0.01730077, -0.00888918, -0.00484992, -0.03748958],
       [ 0.01730077,  0.07553007, -0.02827698,  0.00382616, -0.00305851],
       [-0.00888918, -0.02827698,  0.0690027 ,  0.0068945 , -0.0025763 ],
       [-0.00484992,  0.00382616,  0.0068945 ,  0.0601852 ,  0.00653085],
       [-0.03748958, -0.00305851, -0.0025763 ,  0.00653085,  0.08807457]])

This is our current covariance matrix. When we project our data onto a new direction $u_1$, the variance of the projected data will be:

$$\frac{1}{N}(X_c u_1)^T(X_c u_1) = \frac{1}{N} u_1^T X_c^T X_c u_1 = u_1^T S u_1$$

We need to maximize this variance. Obviously, letting $\|u_1\| \to \infty$ would make it arbitrarily large, but we are only interested in finding a direction to project onto, so the magnitude of the vector is irrelevant. To make our life easier, we can fix its length to 1, i.e. $u_1^T u_1 = 1$, which turns our problem into a constrained optimization problem. We can introduce a Lagrange multiplier $\lambda_1$ and optimize the objective:

$$
\begin{aligned}
& u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1) \\
& 2 S u_1 - 2 \lambda_1 u_1 = 0 && \text{(take the derivative w.r.t. } u_1 \text{ and set it to 0)} \\
& S u_1 = \lambda_1 u_1
\end{aligned}
$$

From the final equation, we can see that $u_1$ needs to be an eigenvector of the covariance matrix $S$. Plugging this back in, the projected variance is $u_1^T S u_1 = \lambda_1 u_1^T u_1 = \lambda_1$, so maximizing the variance corresponds to selecting the eigenvector with the largest eigenvalue. In that case, $u_1$ is the first principal component (or, depending on your field, you might see it called a principal axis; in the ML community it is usually called the principal component). Repeating this argument for directions orthogonal to the ones already found gives us the remaining principal components.
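Before moving on to the SVD, here is a minimal sanity check of this result on the toy data above, using NumPy's eigendecomposition for symmetric matrices (np.linalg.eigh). The eigenvector of S with the largest eigenvalue should achieve a projected variance equal to that eigenvalue, and any other direction should do no better:

eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order, eigenvectors as columns
u1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
print(eigvals[-1])                     # the largest eigenvalue of S
print(u1 @ S @ u1)                     # projected variance along u1, equal to that eigenvalue
print((X_c @ u1).var())                # same value, computed directly from the projected data
r = np.random.rand(d)
r /= np.linalg.norm(r)                 # a random unit direction for comparison
print(r @ S @ r)                       # will not exceed the largest eigenvalue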

Computing PCA via SVD

But how do we actually compute the PCA? The answer is the Singular Value Decomposition (SVD), a closely related concept. Every matrix has an SVD, so our centered data matrix does too: $X_c = U \Sigma V^T$. Using this relationship, we can find the eigenvectors and eigenvalues of $S$:

$$
\begin{aligned}
X_c^T X_c &= V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T && \text{($U$ is a unitary matrix, i.e. $U^T U = I$)} \\
\implies S &= \frac{1}{N} V \Sigma^2 V^T
\end{aligned}
$$

And since the factor $\frac{1}{N}$ only rescales the eigenvalues (the eigenvectors, i.e. the directions, stay the same, and so does the ordering of the eigenvalues), we can just drop it and work with $V \Sigma^2 V^T$. From here, we see two things:

  • The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix.
  • The singular values of the centered data matrix are the square roots of the eigenvalues of $X_c^T X_c$ (which are $N$ times the eigenvalues of the covariance matrix).

Then, we can simply calculate the PCA by using SVD:

u, s, vt = np.linalg.svd(X_c)   # singular values come sorted in descending order
explained_variance = s ** 2     # eigenvalues of X_c^T X_c (N times those of S)
principal_comps = vt            # rows of V^T are the principal components
explained_variance
array([3.78865876, 2.92541508, 1.88357554, 1.21211713, 1.12355213])

Recovering the exact eigenvalues:

explained_variance / n
array([0.12628863, 0.09751384, 0.06278585, 0.0404039 , 0.03745174])
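As a quick cross-check (a small sketch reusing the S computed earlier), these values should match the eigenvalues of the covariance matrix, just sorted in descending order:

# np.linalg.eigvalsh returns eigenvalues in ascending order, so reverse before comparing
np.allclose(np.linalg.eigvalsh(S)[::-1], explained_variance / n)   # True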

Each row is an eigenvector:

principal_comps
array([[-0.6077345 , -0.38737752,  0.27080286,  0.11247385,  0.62818221],
       [ 0.12534338, -0.6098027 ,  0.59748364, -0.05631033, -0.50226701],
       [-0.08517668, -0.26555398, -0.3000831 , -0.91106   ,  0.04632308],
       [-0.29740359, -0.46054634, -0.66702274,  0.36398383, -0.34934883],
       [ 0.72058577, -0.44210572, -0.18630596,  0.14718774,  0.47846059]])

Now we have the corresponding eigenvalues and eigenvectors, i.e. the variances and the principal components. We can project our data onto each of these vectors:

(X_c @ principal_comps.T)

And for dimensionality reduction, we can decide to keep only the top 2 components:

X_c @ principal_comps[:2, :].T
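As a sanity check, assuming scikit-learn is available, the same projection can be obtained with sklearn.decomposition.PCA; individual components may differ from ours by a sign flip, which is an inherent ambiguity of PCA:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)   # sklearn centers the data internally
print(pca.components_)          # rows should match principal_comps[:2, :] up to sign
print(np.allclose(np.abs(X_proj), np.abs(X_c @ principal_comps[:2, :].T)))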

Loadings

In some texts, you will come across something called “loadings”, which are used to understand the relationship between the original features and the PCA transformation. It sounds fancy, but they are just the components of the eigenvectors, interpreted as the contribution of each original feature to that eigenvector. Here is the idea behind it:

Let’s think about what projecting actually does. Consider the first principal component:

principal_comps[0, :]
array([-0.6077345 , -0.38737752,  0.27080286,  0.11247385,  0.62818221])

And consider our first data point:

X_c[0]
array([ 0.26322816, -0.17611613,  0.20135411,  0.22866828,  0.24809388])

To project $x_1$ onto $u_1$, we take their dot product:

$$\langle u_1, x_1 \rangle = u_{11} x_{11} + u_{12} x_{12} + u_{13} x_{13} + u_{14} x_{14} + u_{15} x_{15}$$

Then we see that the contribution of the first feature is weighted by $u_{11}$, since that is what it gets multiplied by; the contribution of the second feature is weighted by $u_{12}$, and so on. So the signs and magnitudes of these coefficients tell us how the features relate: if two features have loadings with the same sign, they push the data point toward the same side of the principal component, and if they have opposite signs, they work against each other. The relative magnitudes of the loadings, in turn, give us an idea of how similarly the features behave.
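To make this concrete, here is a small sketch (reusing the toy X_c and principal_comps from above) that breaks the projection of the first data point onto the first principal component into its per-feature contributions:

contributions = principal_comps[0] * X_c[0]   # elementwise products u_1j * x_1j
print(contributions)                          # how much each feature contributes to the projection
print(contributions.sum())                    # equals X_c[0] @ principal_comps[0]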

Case Study: Air Pollution in Mersin, Turkey

Let’s see if we can actually use PCA to detect which air quality stations are similar. We have daily mean measurements from 7 different stations in the same city, covering roughly a year.

import pandas as pd

def typecast_float(value):
    # The raw file uses a comma as the decimal separator
    try:
        return float(value.replace(',', '.'))
    except (ValueError, AttributeError):
        return value

df = pd.read_excel("mersin_stations.xlsx", converters={
    'Mersin - Akdeniz': typecast_float,
    'Mersin - Huzurkent': typecast_float,
    'Mersin - İstiklal Cad.': typecast_float,
    'Mersin - Tarsus': typecast_float,
    'Mersin - Tasucu': typecast_float,
    'Mersin - Toroslar': typecast_float,
    'Mersin - Yenişehir': typecast_float
})

df = df.drop(index=0)                        # drop the first row of the sheet
cols = list(df.columns)
cols.remove("Tarih")                         # "Tarih" is the date column
df = df.set_index("Tarih")                   # use the dates as the index
for col in cols:
    df.loc[df[col] == "-", col] = np.nan     # "-" marks missing measurements
    df[col] = df[col].astype(np.float32)
df.describe()
Summary statistics per station (the "Mersin - " prefix of each column is omitted):

         Akdeniz   Huzurkent   İstiklal Cad.   Tarsus   Tasucu   Toroslar   Yenişehir
count     321       310         334            332      302      316        335
mean       81.93     50.69       74.17          60.95    30.68    40.51      54.07
std        36.37     14.37       38.32          22.14    11.96    17.44      19.10
min        26.87     18.76       28.08          15.35     6.19     7.31      11.25
25%        54.42     41.46       49.06          45.63    22.38    29.08      41.26
50%        71.62     49.61       61.99          59.06    29.53    37.40      51.17
75%       101.75     59.33       82.85          72.50    38.44    49.23      66.56
max       217.05    100.92      287.74         165.80    79.59   106.82     115.50

From this description, we have a rough idea of the stations. But let’s use PCA to actually analyze the data:

# 1. Center the data, then fill in the missing values with the mean (which is 0 after centering)
df_norm = df - df.mean()
df_norm = df_norm.fillna(0)

# 2. Use SVD to get PCA
u, s, vt = np.linalg.svd(df_norm)
explained_variance = s ** 2
principal_comps = vt

# 3. Print the results:
print(f"Explained Variance: \n{list(explained_variance)}")
print(f"Explained Variance Ratio: \n{list(explained_variance / explained_variance.sum())}")
print(f"Principal Components: \n{principal_comps}")
Explained Variance:
[970372.7, 161876.78, 99340.76, 73822.26, 44803.496, 25845.836, 22792.111]
Explained Variance Ratio:
[0.6936912, 0.115720995, 0.07101581, 0.052773383, 0.032028716, 0.018476436, 0.016293418]
Principal Components:
[[ 0.6225126   0.16223446  0.66419894  0.24866222 -0.00622524  0.11959113  0.26234362]
 [-0.02710083  0.2685964  -0.45266798  0.6661215   0.345759    0.05868038  0.3943383 ]
 [ 0.57136565 -0.05261079 -0.38114944 -0.1719964  -0.09751546 -0.68772197  0.11595257]
 [-0.3109046  -0.15181339  0.36948395  0.50480044  0.04580177 -0.6335116  -0.29242852]
 [ 0.37802657 -0.37931138 -0.2600934   0.41760823 -0.24467331  0.32418633 -0.55336404]
 [-0.09289647  0.69212395 -0.06681351  0.11723612 -0.6863987  -0.04926331 -0.14337459]
 [ 0.19258615  0.5028002  -0.0050598  -0.14799356  0.5812028  -0.02079634 -0.59156203]]

Okay, here we see that the stations are indeed somewhat similar, as 69% of the variance can be explained by a single principal component. Returning to our loadings to actually understand the relationships between the features: our first principal component is

$$\begin{pmatrix} 0.62 & 0.16 & 0.66 & 0.25 & -0.01 & 0.12 & 0.26 \end{pmatrix}$$

From here, we observe the following groupings:

  • Stations 1, 3 ($\sim 0.6$ to $0.7$)
  • Stations 4, 7 ($\sim 0.2$)
  • Stations 2, 6 ($\sim 0.1$)
  • Station 5 ($\sim 0$)

We can also see that Station 5 (Tasucu) is quite different from the other stations, as it has a near-zero (slightly negative) loading. Looking back at the descriptive statistics, Tasucu is the cleanest of the bunch and generally behaves differently from the rest. Stations 1 (Akdeniz) and 3 (İstiklal Cad.) are very similar, as we expected, and so on. We could have spotted these similarities by eye, but now we have a formal point to stand on. It also scales: we would not want to do this by eye if we were talking about hundreds of stations.
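To tie the loadings back to the station names (a small sketch reusing df and principal_comps from above; after setting the date index, the columns of df are exactly the station names, in the same order as the features fed to the SVD), we can sort the first principal component's loadings:

loadings_pc1 = pd.Series(principal_comps[0], index=df.columns)
print(loadings_pc1.sort_values(ascending=False))   # stations with similar loadings group together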

References

  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
  2. “Hava Kalitesi – İstasyon Veri İndirme.” T.C. Çevre, Şehircilik ve İklim Değişikliği Bakanlığı, accessed 16 Apr. 2023.