Numpy, Pandas

Using NumPy and Pandas to start your first step of data analysis

After reading our previous articles, you may already know how to get data from the TEJ API, store it on your computer, and update it automatically! In this article, we show you how to analyze that data with two important packages: NumPy and Pandas.

Highlights of this article

  • Numpy Intro/Application
  • Pandas Intro/Application

Links related to this article

What is Numpy? How to use it?

NumPy is designed to process large, n-dimensional data arrays conveniently and efficiently. With its built-in functions, users can perform preliminary data processing quickly.

  • Basic Application-Single Dimension
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0]) #float ndarray -1-1.
b = np.array(['a', 'b', 'c']) #string ndarray        -1-2.
c = np.arange(0, 10, 2) #array([0, 2, 4, 6, 8])      -2.
c[2:] #array([4, 6, 8])                              -3.
c[:2] #array([0, 2])                                 -4.

Examples above:

  1. Create a float array and a string array.
  2. Use the np.arange() function to create an array that starts at 0, stops before 10, and has an interval of 2.
  3. In Python, "[]" selects elements and ":" means "to" (from one position to another). Note that the first element sits at position 0, not 1. Therefore, c[2:] selects the elements from position 2 to the end (the last element included).
  4. Same idea, but c[:2] selects the elements from the start up to position 1 (position 2 is not included)!
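The indexing rules in the list above can be checked with a short sketch. The variable names first, tail, and head are ours and only serve the illustration.

```python
import numpy as np

# start=0, stop=10 (the stop value itself is excluded), step=2
c = np.arange(0, 10, 2)

first = c[0]   # positions start at 0, so this is the first element
tail = c[2:]   # from position 2 to the end
head = c[:2]   # from the start up to, but not including, position 2

print(c)       # [0 2 4 6 8]
print(first, tail, head)
```

Together, c[:2] and c[2:] cover the whole array without overlapping, which is why the end position is excluded.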
  • Mathematical Tools
a = np.arange(0, 30, 2) #array([0, 2, 4, ..., 28])
a.sum() #210                                      -1-1.
a.mean() #14.0                                    -1-2.
a.std() #8.640987                                 -1-3.
a.cumsum() #array([0, 2, 6, 12, ...,210])         -1-4.
lst = [0, 2, 4]
lst*2 #[0, 2, 4, 0, 2, 4]                         -2-1.
a+a #array([0, 4, 8, ..., 56])                    -2-2.
a*a #array([0, 4, 16, ..., 784])                  -2-3.

Examples above:

  1. Sum of array a; average; standard deviation; cumulative sum
  2. Multiply a list by 2; add arrays element by element; multiply arrays element by element

The first example uses NumPy built-in functions for the calculation. The second example shows NumPy's vectorized computation. If we multiply a list by 2 (2-1), the list is repeated, so the number of elements doubles while the values stay the same. With NumPy arrays (2-2, 2-3), the mathematical operation is applied to the elements at corresponding positions.💪💪
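To make the difference concrete, here is a minimal sketch that puts a plain list and a NumPy array side by side. The names repeated, doubled, squared, and doubled_list are ours, chosen for illustration.

```python
import numpy as np

lst = [0, 2, 4]
arr = np.array(lst)

repeated = lst * 2   # list repetition: the list is concatenated with itself
doubled = arr * 2    # vectorized: every element is multiplied by 2
squared = arr * arr  # vectorized: elements at the same position are multiplied

# To double the values of a plain list, we would need an explicit loop
doubled_list = [x * 2 for x in lst]

print(repeated)      # [0, 2, 4, 0, 2, 4]
print(doubled)       # [0 4 8]
print(squared)       # [ 0  4 16]
```

The list comprehension gives the same values as arr * 2, but the array version needs no explicit loop.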

  • Basic Application-Multiple Dimensions
b = np.array([a, a*2]) #array([[0, 2, 4, ..., 28],
                       #       [0, 4, 8, ..., 56]])
b[0] #array([0, 2, 4, ..., 28])                   -1-1.
b[0][1] #2                                        -1-2.
b.sum(axis = 1) #array([210, 420])                -1-3.
b.shape #(2, 15)                                  -2-1.
b.reshape(5,6) #array([[0, 2, 4, 6, 8, 10],       -2-2.
               #       ...,
               #       [36, 40, ..., 56]])

Examples above:

  1. Select the first row of array b; select the second element of the first row of array b; row sums of array b
  2. Shape (2*15) of array b; change to a new shape (2*15 -> 5*6)

Next, let's take a look at how NumPy handles multi-dimensional arrays. As before, we use "[]" to select. The difference is that there are more dimensions to select from, so we can use two brackets "[][]" to select the row and then the position within it. If we want to do matrix operations, we can use the shape attribute and the reshape() function to check the current shape and convert it to the one the calculation needs.💪💪
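The two-dimensional operations described here can be sketched as follows. The comma form b[0, 1] and the column sum with axis=0 are not in the examples above; we add them for comparison, and the variable names are ours.

```python
import numpy as np

a = np.arange(0, 30, 2)
b = np.array([a, a * 2])     # 2 rows, 15 columns

row0 = b[0]                  # first row
elem = b[0][1]               # row 0, then position 1
same = b[0, 1]               # equivalent, written with a single pair of brackets

row_sums = b.sum(axis=1)     # one sum per row
col_sums = b.sum(axis=0)     # one sum per column, here a + 2a = 3a

reshaped = b.reshape(5, 6)   # same 30 elements, arranged as 5 rows of 6

print(b.shape, reshaped.shape)
print(row_sums)
```

reshape() only works when the total number of elements stays the same, which is why 2*15 can become 5*6.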

  • Other Applications-Boolean, Random Variables, Financial Functions
b > 15    #array([[False, False, ..., True],       -1-1.
          #       [False, False, ..., True]])
np.where(b>15, 1, 0) #array([[0, 0, ..., 1],       -1-2.
                     #       [0, 0, ..., 1]])
#random variable
np.random.normal(5, 2, 10) #mean 5, std 2, 10 elements   -2-1.
np.random.standard_normal(5) #5 elements                 -2-2.
#run in a terminal first: pip install numpy_financial
import numpy_financial as npf
npf.fv(0.03, 5, 0, -1000) #1159.27                 -3-1.
#fv(rate, nper, pmt, pv)
npf.irr([-95, 3, 3, 3, 103]) #0.0439               -3-2.

Examples above:

  1. Boolean:
    We can apply an inequality (greater than 15 in the example) directly to a NumPy array to get the corresponding True/False array, or use the np.where() function to map the result to values of our choice (1 for True, 0 for False in the example).
  2. Random Variables:
    We can generate random variables from different statistical distributions, such as the normal distribution in the example (mean 5, std 2, 10 elements), the standard normal distribution, and so on.
  3. Financial Functions:
    There is also a companion package, numpy_financial, with financial functions such as fv, pv, and irr, which are used when discounting. This package has to be installed separately. All functions included in it are listed in the numpy-financial documentation.
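The Boolean and random-variable steps above can be sketched as below. Two things here are our additions and not part of the examples: np.random.default_rng with an arbitrary seed, used only so that the draw is reproducible, and the future value computed by hand with the compound-interest formula, so that the sketch runs without numpy_financial installed.

```python
import numpy as np

a = np.arange(0, 30, 2)
b = np.array([a, a * 2])

mask = b > 15                  # True where the element is greater than 15
flags = np.where(mask, 1, 0)   # 1 where True, 0 where False
selected = b[mask]             # 1-D array holding only the elements > 15

# Assumption: a seeded generator, so repeated runs give the same numbers
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=5, scale=2, size=10)   # mean 5, std 2, 10 elements

# Future value of 1000 invested for 5 periods at 3%, no extra payments
fv = 1000 * (1 + 0.03) ** 5

print(flags.sum(), selected.min())
print(round(fv, 2))            # 1159.27
```

The hand-computed value agrees with the fv example above, where the present value is entered as -1000 because it is a cash outflow.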

NumPy has many applications in data processing, so it is very difficult for us to cover all of them in just one article😢. If you are interested in NumPy, you can go through the NumPy official website or leave a message below!💪💪

What is Pandas? How to use it?

Pandas is a package that specializes in analyzing tabular data. Much like Excel, it presents data in a table format called a DataFrame, which helps users analyze data more conveniently, especially financial time series data.

  • Basic Application
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4],
                  columns = ['Numbers'],
                  index = ['index_a','index_b','index_c','index_d'])

The code above creates a table with one column named "Numbers" and four rows named "index_a", "index_b", "index_c", and "index_d".

df.loc['index_a'] #Numbers 1                                  -1-1.
df.iloc[0:2] #rows index_a and index_b                        -1-2.
df * 2 #same as numpy                                         -1-3.
#add "Name" column
df['Name'] = ['Amy', 'Bob', 'Catherine', 'Duke'] #            -2-1.
#select whole column
df['Numbers'] #                                               -2-2.
#delete column (returns a new table without "Name")
df.drop('Name', axis=1) #                                     -2-3.
df['Numbers'].sum() #10                                       -3-1.
df['Numbers'].mean() #2.5                                     -3-2.
df['Numbers'].std() #1.291                                    -3-3.

Examples above:

  1. Use loc and iloc to find the corresponding value. Note that loc selects by the name (label) of the row/column, so we have to enter the name when selecting, while iloc selects by position. For example (1-2), iloc[0:2] selects the rows from the start to position 1 (position 2 is not included!).
  2. Add; select; delete a column
  3. Sum of the "Numbers" column; average; standard deviation

As with the NumPy arrays we mentioned earlier, in Pandas we also use brackets ["column name"] to select or add columns. To delete a column, however, we have to use the drop() function, which returns a new DataFrame and leaves the original unchanged unless we assign the result back. Pandas DataFrames can also perform basic statistical calculations on table data.💪💪
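The selection and column operations above fit into one small runnable sketch. The table is the same as in the example; the variable names by_label, by_pos, and without_name are ours.

```python
import pandas as pd

df = pd.DataFrame({'Numbers': [1, 2, 3, 4]},
                  index=['index_a', 'index_b', 'index_c', 'index_d'])

by_label = df.loc['index_a', 'Numbers']   # select by row name and column name
by_pos = df.iloc[0:2]                     # select by position: rows 0 and 1

df['Name'] = ['Amy', 'Bob', 'Catherine', 'Duke']   # add a column

# drop() returns a new DataFrame; df itself still has the "Name" column
without_name = df.drop('Name', axis=1)

total = df['Numbers'].sum()

print(by_label, total)                    # 1 10
print(list(by_pos.index))                 # ['index_a', 'index_b']
print(list(without_name.columns))         # ['Numbers']
```

To make the deletion permanent, assign the result back, for example df = df.drop('Name', axis=1).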

  • Basic Data Analysis
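import tejapi
tejapi.ApiConfig.api_key = 'your_api_key' #replace with your own API key
df = tejapi.get('TWN/EWPRCD',
                coid = ['2330'],
                mdate = {'gte':'2020-01-01', 'lte':'2020-12-31'},
                opts = {'columns': ['mdate','open_d','high_d','low_d','close_d']})
df = df.set_index('mdate') #the date column is assumed to be the index here
df.describe() #descriptive statistics table

Descriptive Statistics Table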

The sample data we use for the Pandas analysis is the daily stock price of 2330.TW, obtained from the TEJ API. Most of the statistics we may need later can be obtained through the describe() function (figure above👆). If we want to do further operations on these values, we can apply NumPy functions directly to the entire table!
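For readers without an API key, the same two steps can be tried on a tiny hand-made table. The prices below are made up for illustration and are not real TEJ data; only the column names are borrowed from the query above.

```python
import numpy as np
import pandas as pd

# Made-up prices, NOT real TEJ data
prices = pd.DataFrame({
    'open_d':  [100.0, 102.0, 101.0, 105.0],
    'close_d': [101.0, 101.5, 104.0, 106.0],
})

# count, mean, std, min, quartiles, and max for every numeric column
stats = prices.describe()

# A NumPy function applied to the whole table works element by element
# and returns a DataFrame with the same rows and columns
log_prices = np.log(prices)

print(stats.loc['mean', 'open_d'])   # 102.0
print(log_prices.shape)              # (4, 2)
```

Because the result of np.log is still a DataFrame, it can be passed straight on to other Pandas methods.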

Stock Price(Daily)

Last is data visualization. There are several ways to plot a graph in Python, and Pandas provides a very easy one! If the chart we want is not complicated, such as a simple daily stock price or daily return, we can select the column and use the plot() function to see the result directly! (figure above👆)

The only thing to note here is that the X axis of the chart is the index and the Y axis is the data you select. That's why we apply the set_index() function to our raw data first.
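A minimal sketch of why the index matters, again with made-up numbers instead of real TEJ data. The pct_change() call for the daily return is our addition, and the plot() call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Made-up data, NOT real TEJ data; column names follow the query above
raw = pd.DataFrame({
    'mdate': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']),
    'close_d': [100.0, 102.0, 99.96],
})

df = raw.set_index('mdate')          # the dates become the index (the X axis)
close = df['close_d']                # the selected column (the Y axis)
daily_return = close.pct_change()    # first value is NaN: no previous day

# close.plot()  # would draw dates on X and closing prices on Y

print(df.index.name)                 # mdate
print(daily_return.round(4).tolist())
```

Without set_index(), the X axis would be the default row numbers 0, 1, 2 instead of the dates.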


What we shared with you this time is how to use the NumPy and Pandas packages for data analysis. It is very difficult for us to explain all the functions included in these two packages, so if you have any questions or are interested in a particular topic, you can visit their websites or leave a message below ❗️❗️ We will go further into financial data analysis and applications in the next article, so please look forward to it ❗️❗️

Finally, if you like this topic, please click 👏 below to give us more support and encouragement. If you have any questions or suggestions, please leave a message or email us, and we will do our best to reply.👍👍
