PCA Feature Portfolio

Optimizing investment portfolios using PCA (Principal Component Analysis)

Photo by Carlos Muza on Unsplash

Summary of key points of this article

Article Difficulty: ★★★★☆


“The essence of mathematics is not to complicate simple things but to simplify complex things.” – Stan Gudder

Principal Component Analysis (PCA), a crucial technique in unsupervised learning, is widely used in the fields of machine learning and statistics to analyze data and reduce data dimensionality. Its core idea is to break down the original data into representative principal components, achieving dimensionality reduction and providing a new description of the data.

The key points of this article

The main purpose of this study is to utilize daily stock return data, apply PCA to obtain principal components, and construct an investment portfolio. Readers of this article will see the following key points:

  1. Understanding the eigenvalues and eigenvectors of PCA and using them to design an investment portfolio.
  2. Methods for backtesting portfolio performance, applicable to various investment strategies.

Editing Environment and Module Requirements

This article uses Windows OS and employs Jupyter as the editor.

import tejapi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

tejapi.ApiConfig.api_key = "Your Key"

Database Usage

0050 Index Constituent Data Set — Listed OTC Index (TWN/EWISAMPLE)

0050 Stock price return (day) – rate of return (TWN/APRCD2)

0050 Adjustment of stock price (day) — ex-dividend adjustment (TWN/APRCD1)

Data Loading

Data period: 2013-01-01 to 2022-11-24. We first load the constituent list of the 0050 index and filter on the “end_date” column to keep only the stocks that are currently constituents.

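mdate = {'gte':'2000-01-01', 'lte':'2022-11-24'}
# Constituent history of index IX0002; any further arguments of this call were truncated in the source
data = tejapi.get('TWN/EWISAMPLE',
                  idx_id = "IX0002",
                  start_date = mdate)

# Rows whose membership ended before 2022-11-24 (former constituents)
data1 = data[data["end_date"] < "2022-11-24"]
# Set difference: drop every row that also appears in data1, leaving the current constituents
diff_data = pd.concat([data,data1,data1]).drop_duplicates(keep=False)
coid = list(diff_data["coid"])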
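The `pd.concat([...]).drop_duplicates(keep=False)` idiom acts as a set difference: every row of `data1` appears three times in the concatenation and is dropped, so only the rows unique to `data` survive. The following minimal sketch reproduces the idea on made-up rows (the tickers and dates are hypothetical, not TEJ data) and compares it with an equivalent boolean mask.

```python
import pandas as pd

# Hypothetical constituent table; NaT means the stock has not left the index.
data = pd.DataFrame({
    "coid": ["1101", "2330", "9999"],
    "end_date": pd.to_datetime(["2022-11-24", None, "2015-06-30"]),
})
cutoff = pd.Timestamp("2022-11-24")

# Rows whose membership ended before the cutoff date.
data1 = data[data["end_date"] < cutoff]

# Set difference: rows of data1 occur three times in the concat and are all dropped.
diff_data = pd.concat([data, data1, data1]).drop_duplicates(keep=False)

# Equivalent boolean mask that avoids the concatenation.
mask_data = data[~(data["end_date"] < cutoff)]

current = list(diff_data["coid"])
print(current)
```

The mask version is usually easier to read, and unlike the `drop_duplicates` idiom it does not also discard rows that happen to be duplicated within `data` itself.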

Loading 0050 Returns Data

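# Download the daily return (roia, in %) of each constituent and align the series by date.
# The trailing arguments of the original tejapi.get calls were truncated in the source;
# indexing by "mdate" is an assumption made here so that the merge aligns on trading dates.
for i in range(0,len(coid)):
    df = tejapi.get('TWN/EWPRCD2',
                    coid = coid[i],
                    mdate = {'gte':'2013-01-01', 'lte':'2022-11-24'})
    df = df.set_index("mdate")
    if i == 0:
        # The first stock defines the date index of the combined table
        Df = pd.DataFrame({coid[i]:df["roia"]})
    else:
        # Left-join every other stock onto the existing dates
        Df1 = pd.DataFrame({coid[i]:df["roia"]})
        Df = pd.merge(Df,Df1[coid[i]],how='left', left_index=True, right_index=True)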

Data Cleaning

Three stocks lack return data for the full sample period and are removed:
ASE Technology Holding (3711) has return data only from 2018/04/30 onward, so it was removed.
Shanghai Commercial & Savings Bank (5876) has return data only after 2014/09/25, so it was excluded.
Silergy-KY (6415) has return data only after 2013/12/12, so it was excluded as well.

del Df["3711"]
del Df["5876"]
del Df["6415"]
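The three tickers are hard-coded in the deletion step. As a hedged alternative, the columns with missing history can be found programmatically. The sketch below uses made-up tickers and returns (`AAA` and `BBB` are hypothetical) to show how a late-listed stock ends up with missing values once the series are aligned by date, and how such columns can be detected and dropped.

```python
import pandas as pd

dates = pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"])

# Hypothetical daily returns (%); BBB is listed one day later than AAA.
returns = {
    "AAA": pd.Series([0.5, -0.2, 0.1], index=dates),
    "BBB": pd.Series([0.3, 0.4], index=dates[1:]),
}

# Align the series by date; dates missing for a stock become NaN.
wide = pd.concat(returns, axis=1)

# Columns that contain at least one missing value.
late = wide.columns[wide.isna().any()].tolist()

# Keep only the stocks with a complete history.
clean = wide.dropna(axis=1)
print(late, clean.shape)
```

Whether to drop incomplete columns or shorten the sample period instead is a design choice; the article drops the stocks so that the full period is kept.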

Therefore, this article focuses on the 47 constituent stocks of the Taiwan 0050 Index up to 2022/11/24, excluding the three stocks mentioned above.

Data Visualization

First, we need a basic understanding of the dataset. The correlations between the returns of the constituent stocks show a significant positive correlation among daily returns. The data can therefore be represented in fewer than the current 47 dimensions.

cor = Df.corr()
plt.title("Correlation Matrix")
sns.heatmap(cor, vmax=1,square=True,annot=True,cmap="cubehelix")

Data Standardization

PCA is sensitive to the scale of the features: a stock with more volatile returns would dominate the leading components simply because of its larger variance, and the information carried by the other stocks would be lost. Since we do not know in advance how important each feature is, we standardize every feature to zero mean and unit variance before applying PCA.

# Fit the scaler once, then transform; the result keeps Df's dates and tickers
scale = StandardScaler().fit(Df)
rescale = pd.DataFrame(scale.transform(Df),columns=Df.columns,index=Df.index)
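To illustrate why standardization matters, here is a minimal sketch on simulated data (the column names and volatilities are assumptions, not values from the article). Without scaling, the first principal component points almost entirely along the more volatile column; after `StandardScaler`, every column has zero mean and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical returns (%): the second column is ten times more volatile.
toy = pd.DataFrame({
    "low_vol": rng.normal(0, 0.5, 500),
    "high_vol": rng.normal(0, 5.0, 500),
})

# PCA on the raw data: the first component is dominated by the volatile column.
raw_pc1 = PCA(n_components=1).fit(toy).components_[0]

# Standardize, keeping the original column labels and index.
scaled = pd.DataFrame(StandardScaler().fit_transform(toy),
                      columns=toy.columns, index=toy.index)
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)  # StandardScaler uses the population standard deviation

print(raw_pc1)
print(col_means.values, col_stds.values)
```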


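Model Setup

We aim to reduce the original 47-dimensional data to 10 dimensions, representing the original data using 10 principal components.

# X_train is not defined in the original listing; it is assumed here to be the standardized returns
X_train = rescale
n_components = 10
pca = PCA(n_components=n_components)
Pc = pca.fit(X_train)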

PCA Explained Variance

The first principal component represents the largest variance in the original data, the second principal component represents the second-largest variance, and so on in descending order of variance.

fig, axes = plt.subplots(ncols=2)
Series1 = pd.Series(Pc.explained_variance_ratio_[:n_components ]).sort_values()
Series2 = pd.Series(Pc.explained_variance_ratio_[:n_components ]).cumsum()

Series1.plot.barh(title="Explained Variance",ax=axes[0])
Series2.plot(ylim=(0,1),ax=axes[1],title="Cumulative Explained Variance")

Figure: Explained Variance (left) and Cumulative Explained Variance Ratio (right)

The left chart shows the share of variance explained by each of the first 10 principal components. The first principal component alone accounts for 35% of the variance in the original data, that is, 35% of the variation in the daily returns of the 47 constituent stocks. This dominant principal component is often referred to as the “market” factor.

From the right chart, it can be observed that the first 10 principal components collectively explain approximately 60% of the variance in the daily returns of these 47 stocks.
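The explained variance ratios reported by scikit-learn can be cross-checked by hand: for standardized data they equal the eigenvalues of the correlation matrix divided by their sum. The sketch below uses simulated data, assuming five "stocks" driven by one common factor plus noise (none of these numbers come from the article).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: one common "market" factor plus stock-specific noise.
market = rng.normal(size=(1000, 1))
X = market + 0.8 * rng.normal(size=(1000, 5))
Z = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(Z)

# Eigenvalues of the correlation matrix, sorted in descending order.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
manual_ratio = eigvals / eigvals.sum()

print(pca.explained_variance_ratio_)
print(manual_ratio[:3])
```

In this simulation the first component explains well over half of the variance because every column shares the same factor, which mirrors the role of the “market” component described above.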

Creating an Investment Portfolio using PCA

Setting Portfolio Weights

In the previous step, we examined the variance explained by the principal components. Next, we look at how each of the 47 stocks relates to these 10 principal components through its loading, that is, the corresponding entry of pca.components_ (referred to below as its correlation with the component). We will use this information to design the portfolio weights.

n_components = 10
weights = pd.DataFrame()
for i in range(n_components):
    # Normalize the loadings of component i so that the weights sum to 1
    weights["weights_{}".format(i)] = pca.components_[i] / sum(pca.components_[i])
weights = weights.values.T
# One row per portfolio (principal component), one column per stock
weight_port = pd.DataFrame(weights,columns=Df.columns)
weight_port.index = [f'Portfolio{i}' for i in range(weight_port.shape[0])]

Explaining the Portfolio Weighting Method

The first principal component explains 35% of the variance. Let’s examine the correlation of each variable (the 47 constituent stocks) with the first principal component, given by pca.components_[0].

In this array, the values for all 47 constituent stocks have the same sign (all negative) and differ little in magnitude. Because the sign of a principal component is arbitrary, what matters is that all stocks point in the same direction. This further supports our earlier explanation that the first principal component represents the “market” factor.

Next, we calculate the portfolio weights by dividing the value of each stock by the sum of the values of all stocks, so that the weights of each portfolio sum to 1.
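A small numerical sketch may help here; the loadings below are invented for illustration. When all loadings share a sign, dividing by their sum gives positive weights that add up to 1. When the signs are mixed and the sum is close to zero, the same normalization produces large long and short positions, which is worth keeping in mind for the higher-order components.

```python
import numpy as np

# Hypothetical first-component loadings for four stocks (all negative).
loadings = np.array([-0.52, -0.48, -0.50, -0.50])
weights = loadings / loadings.sum()

# Hypothetical higher-order component with mixed signs; the sum is only 0.1.
mixed = np.array([0.6, -0.5, 0.4, -0.4])
mixed_weights = mixed / mixed.sum()

print(weights)        # positive weights that sum to 1
print(mixed_weights)  # leveraged long/short weights that still sum to 1
```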

Plotting the Weights of the Top Five Principal Components in the Portfolio

# Any further arguments of this call were truncated in the source
weight_port[:5].T.plot.bar(subplots=True,layout = (5,1),figsize=(20,25))

Examining the Classification Logic of Other Principal Components

Portfolio 1

The top three are Nanya Technology (2408), Yageo (2327), and Airtac-KY (1590); the bottom three are Far EasTone (4904), Taiwan Mobile (3045), and Chunghwa Telecom (2412). In Portfolio 1, electronics stocks carry higher weights, while transmission and telecommunications stocks carry lower weights.

Portfolio 2

The top three stocks in Portfolio 2 are the three major telecommunications companies, while the lowest-ranked stocks are mostly from the financial sector. This suggests that Portfolio 2 is a non-financial investment portfolio.

Searching for the Optimal PCA Portfolio

We use the Sharpe Ratio as our metric, a key indicator of the risk-adjusted performance and stability of a portfolio in fund investment and asset allocation. It answers the question: “how much return is earned for each 1% of risk taken?”

In this article, the Sharpe Ratio is calculated as follows (the risk-free rate is not subtracted):

Sharpe Ratio = Annualized Return / Annualized Risk
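As a worked example of this arithmetic, the sketch below applies the formula to a made-up return series, following the article's convention of summing daily percentage returns and assuming 252 trading days per year. The series itself is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns in percent: two years alternating +1.05% and -0.95%.
daily = pd.Series([1.05, -0.95] * 252)

n_years = len(daily) / 252                  # 504 trading days -> 2 years
total_return = daily.sum() * 0.01           # summed return, converted from % to a fraction
annualized_return = (1 + total_return) ** (1 / n_years) - 1
annualized_vol = (daily * 0.01).std() * np.sqrt(252)
sharpe = annualized_return / annualized_vol

print(f"return={annualized_return:.4f}, vol={annualized_vol:.4f}, sharpe={sharpe:.4f}")
```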

def sharpe_ratio(ts_returns):
    # ts_returns: pandas Series of daily returns in percent
    days = ts_returns.shape[0]
    n_years = days/252
    # Cumulative (summed) return over the whole period, in percent
    total_return = ts_returns.cumsum().iloc[-1]
    if total_return < 0:
        # Annualize the magnitude, then restore the negative sign
        annualized_return = (np.power(1+abs(total_return)*0.01,1/n_years)-1)*(-1)
    else:
        annualized_return = np.power(1+abs(total_return)*0.01,1/n_years)-1
    annualized_vol = (ts_returns*0.01).std()*np.sqrt(252)
    annualized_sharpe = annualized_return / annualized_vol

    return annualized_return,annualized_vol,annualized_sharpe

Selecting the Top 5 Portfolios

n_components = 10
annualized_ret = np.array([0.]*n_components)
sharpe_metric = np.array([0.]*n_components)
annualized_vol = np.array([0.]*n_components)
coids = X_train.columns.values
n_coids = len(coids)

pca = PCA(n_components=n_components)
Pc = pca.fit(X_train)
pcs = pca.components_
for i in range(n_components):
    pc_w = pcs[i] / sum(pcs[i])
    eigen_port = pd.DataFrame(data={"weights":pc_w.squeeze()},index=coids)
    #The daily portfolio return is obtained by taking the dot product of the portfolio weights and the daily returns of each constituent stock.
    eigen_port_returns = np.dot(X_train.loc[:,eigen_port.index],eigen_port["weights"])
    eigen_port_returns = pd.Series(eigen_port_returns.squeeze(),
                                   index = X_train.index)
    ar,vol,sharpe = sharpe_ratio(eigen_port_returns)
    annualized_ret[i] = ar
    annualized_vol[i] = vol
    sharpe_metric[i] = sharpe

sharpe_metric = np.nan_to_num(sharpe_metric)

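result = pd.DataFrame({"Annual Return":annualized_ret,"Vol":annualized_vol,"Sharpe":sharpe_metric})
# N is not defined in the original listing; plotting all computed portfolios is an assumption
N = n_components
#Sharpe Ratio of PCA portfolio
ax = result[:N]["Sharpe"].plot(linewidth=3,xticks=range(0,N,1))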
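To actually pick the best portfolios from a table such as `result`, the rows can be sorted by Sharpe Ratio. The sketch below uses invented metrics for four portfolios (they are not results from this article) and keeps the top two.

```python
import pandas as pd

# Hypothetical metrics for four portfolios; the row index is the portfolio number.
result = pd.DataFrame({
    "Annual Return": [0.08, -0.02, 0.12, 0.05],
    "Vol": [0.16, 0.20, 0.15, 0.25],
    "Sharpe": [0.50, -0.10, 0.80, 0.20],
})

# Rank by Sharpe Ratio, highest first, and keep the best two.
top = result.sort_values("Sharpe", ascending=False).head(2)
best_portfolios = top.index.tolist()

print(top)
```

With the article's table, replacing `head(2)` by `head(5)` would give the five portfolios with the highest Sharpe Ratio.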

Drawing a Trend Chart of Portfolio Returns Over the Investment Period

def Backtest(i,data):
    # Fit PCA on the given returns and build the i-th eigen-portfolio
    pca = PCA()
    Pc = pca.fit(data)
    pcs = pca.components_
    pc_w = pcs[i] / sum(pcs[i])
    eigen_port = pd.DataFrame(data={"weights":pc_w.squeeze()},index=coids)
    # Daily portfolio return = dot product of the daily stock returns and the weights
    eigen_port_returns = np.dot(data.loc[:,eigen_port.index],eigen_port["weights"])
    eigen_port_returns = pd.Series(eigen_port_returns.squeeze(),
                                   index = data.index)

    ar,vol,sharpe = sharpe_ratio(eigen_port_returns)
    return eigen_port_returns,ar,vol,sharpe

Visualizing the Trend of Portfolio Returns

def Weight_plot(i):
    # Plot the weight of every stock in portfolio i
    top_port = weight_port.iloc[[i]].T
    port_name = top_port.columns.values.tolist()
    # Any further arguments of this plot call were truncated in the source
    ax = top_port.plot(title = port_name[0],xticks=range(0,len(coids),1))
    ax.set_ylabel("Portfolio Weight")

# Backtest the first portfolio and plot its accumulated return
portfolio = 0
train_returns,train_ar,train_vol,train_sharpe = Backtest(portfolio,X_train)
ax = train_returns.cumsum().plot(rot=45)
ax.set_ylabel("Accumulated Return(%)")
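The `Backtest` function fits PCA and evaluates the portfolio on the same data. As a hedged variation that is not part of the original study, the weights can be estimated on an earlier window and applied to a later one. The sketch below uses simulated, unstandardized returns; the 80/20 split, the tickers `S0` to `S5`, and all parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
dates = pd.bdate_range("2020-01-01", periods=400)

# Hypothetical daily returns (%): a common factor plus stock-specific noise.
market = rng.normal(0.03, 1.0, size=(400, 1))
returns = pd.DataFrame(market + rng.normal(0, 0.5, size=(400, 6)),
                       index=dates, columns=[f"S{i}" for i in range(6)])

# Chronological split: estimate weights on the first 80%, evaluate on the rest.
split = int(len(returns) * 0.8)
train, test = returns.iloc[:split], returns.iloc[split:]

# Weights from the first principal component of the training window.
pc1 = PCA().fit(train).components_[0]
w = pc1 / pc1.sum()

# Out-of-sample daily portfolio returns and their accumulated sum.
test_port = test.dot(w)
cum_return = test_port.cumsum()

print(w.round(3))
print(cum_return.iloc[-1])
```

Keeping the split chronological avoids using future information when estimating the weights.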


The backtest shows that the portfolio constructed with PCA does not perform well. This is not surprising: PCA groups stocks according to the correlation structure of their returns and does not, by itself, guarantee good returns.

This article provided an analysis of the Taiwan 50 Index (with three stocks excluded due to data availability) using PCA. It reduced the original 47 stocks to 10 principal components and constructed portfolio weights based on the correlation between the principal components and individual stocks. The article discussed each principal component, highlighting the presence of the dominant “market” factor and the logical classification ability of PCA. However, providing a detailed interpretation of the meaning of each principal component can be challenging.

In conclusion, it’s important to reiterate that the assets mentioned in this article are for illustrative purposes only and do not constitute any recommendations or advice on financial products. Therefore, readers interested in strategy development, performance testing, empirical research, and related topics are welcome to explore solutions available in the TEJ E Shop, which offers comprehensive databases for conducting various analyses and tests.
