Yudai Furukawa
May 14th, 2017
This project is about stock investing, and I focus on predicting the price of a stock market index. A stock index is an aggregate value produced by combining several stocks, and it helps investors measure and compare the values of stock markets such as those in the US and Japan. The Dow Jones Industrial Average (DJIA), the NASDAQ Composite index, and the S&P Composite are examples of stock indices. As a wealth of information such as price, earnings, dividends, and CPI is available, I am going to use that information to make the prediction. A dataset of S&P Composite published by Yale Department of Economics will be used in this project. For more information, please refer to the link below: https://www.quandl.com/data/YALE-Yale-Department-of-Economics
For this project, the task is to build a stock index price predictor. The 12 month forward price change of S&P Composite will be predicted by using regression, so the project is a supervised learning problem.
The following steps will be taken to build the predictor.
- Step 1: Choose the inputs that are necessary to predict the 12 month forward price change by using the correlation between each feature and the 12 month forward price changes. The inputs are decided on the grounds of statistical figures such as correlation and of common sense in the financial industry.
- Step 2: Decide the best regression model according to the metrics defined in the next section.
In step 1, I expect PE Ratio and earnings growth to be among the dominant inputs, as they are commonly used in the financial industry to justify investment. In step 2, I expect that r2 score will best serve the purpose, although there are other metrics such as mean absolute error, mean squared error, explained variance score, and median absolute error. This will be discussed in the next section.
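The correlation screening in step 1 can be sketched as follows. This is only a minimal illustration, not the code used in the project: the function names `pearson_correlation` and `rank_features` are hypothetical, and the toy numbers are made up rather than taken from the Yale dataset.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson_correlation(values, target)
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy values, NOT taken from the Yale dataset.
toy_features = {
    "pe_ratio": [10.0, 15.0, 20.0, 25.0, 30.0],
    "earnings_growth": [0.05, 0.01, 0.04, 0.02, 0.03],
}
toy_target = [0.20, 0.12, 0.08, 0.02, -0.05]  # 12 month forward price change

ranking = rank_features(toy_features, toy_target)
for name, corr in ranking:
    print(f"{name}: {corr:+.3f}")
```

Ranking by absolute value keeps strongly negative correlations, which are as informative for a regression as strongly positive ones.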
r2 score, mean absolute error, and mean squared error, as well as explained variance score and median absolute error, are all going to be used to validate the result. By using all the metrics, the risk of being biased by a single metric will be reduced. The expected result is that the best regression model scores best on all the metrics: the highest r2 and explained variance scores, and the lowest errors.
A dataset of S&P Composite published by Yale Department of Economics will be used in this project. The dataset (named snp in this project) is a monthly time series of S&P Composite Price, Dividend, Earnings, CPI, Long Interest Rate, Real S&P Composite Price, Real Dividend, Real Earnings, and Cyclically Adjusted PE Ratio from 1871-01-31 to date.
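To make the five metrics concrete, here is a minimal plain-Python sketch of their standard definitions. The metric names match those in scikit-learn's `sklearn.metrics` module, but the function `regression_scores` and the sample values are hypothetical and only for illustration.

```python
from statistics import mean, median

def regression_scores(y_true, y_pred):
    """Compute the five validation metrics from their textbook definitions.

    Raises ZeroDivisionError if y_true is constant (its variance is zero).
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(r) for r in residuals]
    mean_true = mean(y_true)
    mean_residual = mean(residuals)
    var_true = mean([(t - mean_true) ** 2 for t in y_true])
    mse = mean([r ** 2 for r in residuals])
    var_residual = mean([(r - mean_residual) ** 2 for r in residuals])
    return {
        # 1 - SS_res / SS_tot; both sums are divided by n here, which cancels.
        "r2": 1.0 - mse / var_true,
        # Like r2, but ignores a constant bias in the predictions.
        "explained_variance": 1.0 - var_residual / var_true,
        "mean_squared_error": mse,
        "mean_absolute_error": mean(abs_errors),
        "median_absolute_error": median(abs_errors),
    }

# Hypothetical 12 month forward returns and predictions (not project data).
scores = regression_scores([0.10, -0.05, 0.20, 0.00], [0.08, -0.02, 0.15, 0.03])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that r2 and explained variance are better when higher, while the three error metrics are better when lower.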
For more information, please refer to the link below: https://www.quandl.com/data/YALE-Yale-Department-of-Economics
As of 2017-04-08, the basic statistics of the snp dataset are as follows.
Table 1: Basic Statistics of the Original Dataset
Statistics | S&P Composite | Dividend | Earnings | CPI | Long Interest Rate | Real Price | Real Dividend | Real Earnings | Cyclically Adjusted PE Ratio |
---|---|---|---|---|---|---|---|---|---|
count | 1756.000000 | 1755.000000 | 1749.000000 | 1756.000000 | 1756.000000 | 1756.000000 | 1755.000000 | 1749.000000 | 1636.000000 |
mean | 242.537415 | 5.344903 | 12.046968 | 56.433670 | 4.584025 | 482.828989 | 14.590266 | 28.558423 | 16.748036 |
std | 478.579184 | 9.010165 | 22.474642 | 69.298402 | 2.290630 | 505.897801 | 7.846696 | 22.451132 | 6.649515 |
min | 2.730000 | 0.180000 | 0.160000 | 6.279613 | 1.500000 | 66.095494 | 4.870418 | 4.093124 | 4.784241 |
25% | 7.680000 | NaN | NaN | 10.100000 | 3.308333 | 165.900151 | NaN | NaN | NaN |
50% | 16.005000 | NaN | NaN | 18.100000 | 3.870000 | 246.207502 | NaN | NaN | NaN |
75% | 115.550000 | NaN | NaN | 84.200000 | 5.240000 | 588.255364 | NaN | NaN | NaN |
max | 2357.000000 | 46.380000 | 105.960000 | 244.176000 | 15.320000 | 2357.000000 | 46.416308 | 108.695460 | 44.197940 |
What can be concluded from Table 1 is that most of the factors deviate hugely over time. Remarkably, the maximum price of S&P Composite is 863.37 times larger than its minimum price, whereas the maximum of earnings is only 662.25 times larger than its minimum and the maximum of dividend only 257.67 times. This shows that the S&P Composite has historically advanced faster than earnings and dividends.
Also, as you can see, some of the cells are filled with NaN because some data are missing in the dataset. In addition, when dealing with economic data, inflation has to be carefully taken into account: CPI tends to grow over time, so nominal prices and earnings tend to have smaller values in the past. Therefore, only the real values, Long Interest Rate, and Cyclically Adjusted PE Ratio in the previous table can be taken seriously in statistical analysis without any modification.
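The max-to-min ratios quoted above follow directly from the min and max rows of Table 1. A short check, where only the variable names are my own:

```python
# (min, max) pairs copied from the min and max rows of Table 1.
min_max = {
    "S&P Composite": (2.73, 2357.00),
    "Earnings": (0.16, 105.96),
    "Dividend": (0.18, 46.38),
}

# How many times larger the maximum is than the minimum.
ratios = {name: high / low for name, (low, high) in min_max.items()}

for name, ratio in ratios.items():
    print(f"{name}: max is {ratio:.2f} times the min")
```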
Figure 1: Time Series for all factors
Figure 2: Time Series for Factors Except "Real Price" and "S&P Composite"
Figure 1 is a time series plot of the raw dataset. Figure 2 is also a time series plot, omitting "Real Price" and "S&P Composite" in order to visualize the other features clearly. A time series plot was chosen as the data itself is a time series. For both figures, the horizontal axis displays years and the vertical axis displays the value of each feature, where the unit is different for each feature.
In Figure 1, you can clearly see that both "Real Price" and "S&P Composite" have generally been increasing as time goes on. In Figure 2, you can see the same trend for every feature except "Cyclically Adjusted PE Ratio" and "Long Interest Rate". As the figures increase over time, it is hard to compare them as they are. For example, S&P Composite being 1000 right now and being 1000 twenty years ago could have completely different meanings. In order to avoid this kind of misinterpretation, normalizing the data is necessary.
In this project, the following regressions will be used to predict the 12 month forward price change in S&P Composite.
- Linear Regression
  - Linear regression is an approach for modeling the linear relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.
  - I chose linear regression as it is the most commonly used regression in the financial industry. However, its weaknesses, such as sensitivity to outliers and the assumption of data independence, should be treated carefully.
- K Nearest Neighbors
  - In K Nearest Neighbors regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
  - I chose KNeighbors as it is robust to noisy training data, and the dataset is noisy because a lot of other features that might affect the price are missing.
- SVR
  - Support Vector Regression is a very specific class of algorithms, characterized by the usage of kernels, the absence of local minima, the sparseness of the solution, and capacity control obtained by acting on the margin or on the number of support vectors. It can be used to avoid the difficulties of using linear functions in a high dimensional feature space, and the optimization problem is transformed into a dual convex quadratic programme.
  - I chose SVR as it can do both linear and non-linear regressions and it is less likely to overfit.
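As a small illustration of the K Nearest Neighbors description above (the prediction is the average target of the k closest training points), here is a minimal plain-Python sketch. It is not the implementation used in the project; the function name `knn_predict`, the Euclidean distance, and the toy data are assumptions for illustration.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def knn_predict(train_X, train_y, query, k=5):
    """Predict by averaging the targets of the k training points nearest to query."""
    if k < 1 or not train_X:
        raise ValueError("need at least one training point and k >= 1")
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:k]  # fewer than k points are used if the training set is small
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical one-feature toy data (not project data).
toy_X = [(0.0,), (1.0,), (2.0,), (10.0,)]
toy_y = [0.0, 0.1, 0.2, 1.0]

prediction = knn_predict(toy_X, toy_y, query=(1.1,), k=3)
print(f"prediction: {prediction:.3f}")
```

Because the distant point at 10.0 is not among the three nearest neighbors, it has no influence on the prediction, which is the sense in which the method tolerates isolated noisy points.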
An r2 score of 0.5 will be used as a benchmark. 0.5 is a reasonable benchmark as equity returns are often said to be unpredictable. The result will also be justified by the stability of r2 score, mean absolute error, and mean squared error, as well as explained variance score and median absolute error, through back testing.
As discussed, the original dataset needs to be normalized in order to predict the % change of S&P Composite price with a return horizon of 1 year (named snp_changes). I first removed the columns for real values, as this project aims to predict the change of the nominal S&P Composite price, and CPI in the dataset can be used to calculate the real values when needed. After this procedure, instead of the raw dataset, the 12 month changes of each feature and the original figures of "Cyclically Adjusted PE Ratio" and "Long Interest Rate" will be used. The target feature will be the 12 month forward change in S&P Composite price.
The following modifications were made to the dataset.
- Removed features with real values, as the project focuses on predicting the % change of the nominal S&P Composite price.
  - Features ['Real Price','Real Earnings','Real Dividend'] were removed.
- Generated a dataset called snp_changes, which shows the 1 year change of each feature, in order to see the relationship between the 1 year change of S&P Composite price and the other features.
- Added a target feature "y", which is the 12 month forward return of S&P Composite.
- Added the following features that seem to have an effect on the prediction.
  - The real value of PE Ratio, as the ratio is generally considered a good indicator for predicting return. As Investopedia says, "The P/E ratio is a much better indicator of the value of a stock than the market price alone, since it allows investors to make a better apples to apples comparison".
- Removed outliers.
  - Outliers are omitted to avoid a misleading r2 score, whose weakness is that it can be greatly affected by unusual data points.
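The two central transformations in the list above, the trailing 12 month % change of a feature and the 12 month forward return used as the target "y", can be sketched in plain Python as follows. The helper names `pct_change` and `forward_return` and the synthetic price series are assumptions of this sketch, not the project's actual code.

```python
def pct_change(series, periods=12):
    """Trailing % change: series[t] / series[t - periods] - 1 (None where undefined)."""
    return [None if t < periods else series[t] / series[t - periods] - 1.0
            for t in range(len(series))]

def forward_return(series, periods=12):
    """Forward % change: series[t + periods] / series[t] - 1 (None where undefined)."""
    n = len(series)
    return [series[t + periods] / series[t] - 1.0 if t + periods < n else None
            for t in range(n)]

# Synthetic monthly prices growing 1% per month (hypothetical, not S&P data).
prices = [100.0 * 1.01 ** t for t in range(36)]

trailing = pct_change(prices)    # feature: change over the past 12 months
target = forward_return(prices)  # target "y": change over the next 12 months

# Keep only the months where both the feature and the target are defined.
rows = [(x, y) for x, y in zip(trailing, target) if x is not None and y is not None]
print(len(rows), rows[0])
```

The first 12 months have no trailing change and the last 12 months have no forward return, so both ends of the series drop out of the usable sample.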
TimeSeries Split Validator was used to test the models. This validator provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. This cross-validation object is a variation of KFold. In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.
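The expanding-window behavior described above can be sketched in a few lines of plain Python. The function name `time_series_split` and the rule for sizing the folds (equal test folds of `n_samples // (n_splits + 1)` observations taken from the end of the series) are assumptions of this sketch; it is meant to illustrate the idea, not to reproduce any library exactly.

```python
def time_series_split(n_samples, n_splits=3):
    """Yield (train_indices, test_indices) where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))  # everything before the test fold
        test = list(range(test_start, test_start + test_size))
        yield train, test

splits = list(time_series_split(n_samples=8, n_splits=3))
for number, (train, test) in enumerate(splits, start=1):
    print(f"Split {number}: train={train} test={test}")
```

Unlike a shuffled KFold, no split ever trains on observations that come after its test fold, which matters for a target that is itself a forward-looking return.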
At the beginning I used algorithms with default settings to determine which algorithm best serves the purpose of this project. After determining the algorithm, I used grid search in order to optimize the algorithms in this project. I used the default number of splits which is 3.
Step 1: Choose Algorithm Step 2: Grid Search
The scores of each regressor for each split are as follows.
SVR Split 1
Regressors | Split # | R2 Score | Explained Variance Score | Mean Squared Error | Mean Absolute Error | Median Absolute Error |
---|---|---|---|---|---|---|
SVR | Split 1 | 0.46203561013012351 | 0.49076547652335589 | 0.027909821203550064 | 0.13322021924068453 | 0.11494323190727181 |
SVR | Split 2 | 0.83734051188973591 | 0.8507121009308416 | 0.0036388683253097179 | 0.048233671571316319 | 0.042793838457600555 |
SVR | Split 3 | 0.56601185038138779 | 0.57365252325183336 | 0.011315896786256328 | 0.072512443085194361 | 0.047410680757838497 |
K Nearest Neighbors | Split 1 | 0.11366052722544762 | 0.27650168616411541 | 0.045983668578453943 | 0.16797549806049597 | 0.14124764701469139 |
K Nearest Neighbors | Split 2 | 0.6262866037854673 | 0.63084140197008631 | 0.0083603720633077076 | 0.071785369871896237 | 0.062301103917071922 |
K Nearest Neighbors | Split 3 | -0.082982563354474292 | 0.10866035141914021 | 0.028237911378465357 | 0.12316307176270737 | 0.089980760066028259 |
Linear Regression | Split 1 | 0.88629761910370441 | 0.88728464199424062 | 0.0058989278491112474 | 0.051686380139549411 | 0.037683001046082007 |
Linear Regression | Split 2 | 0.89932003229322677 | 0.90247836633119305 | 0.0022523195525675855 | 0.036520785100067572 | 0.028205368886686434 |
Linear Regression | Split 3 | 0.9020251503591914 | 0.90603497761453722 | 0.0025546164962307679 | 0.037571621770448774 | 0.02698937431809878 |
As you can see, Linear Regression scored better on almost all the metrics. Linear Regression also shows more stable scores over time than the other regressions. For example, the r2 score of linear regression stays around 0.9, whereas obvious instability of the r2 score is observed for SVR and K Nearest Neighbors. Therefore, I concluded that linear regression best serves the purpose of this project.
After using grid search, not much improvement was observed. The scores are as follows.
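The stability argument can be quantified from the r2 column of the table above. In this sketch the values are copied from the table and rounded to four decimals; the choice of mean and max-minus-min spread as summary statistics is my own.

```python
from statistics import mean

# r2 scores for Split 1, 2, 3, rounded from the table above.
r2_by_model = {
    "SVR": [0.4620, 0.8373, 0.5660],
    "K Nearest Neighbors": [0.1137, 0.6263, -0.0830],
    "Linear Regression": [0.8863, 0.8993, 0.9020],
}

# Summarize each model by its average r2 and by the spread across splits.
stability = {
    name: {"mean": mean(values), "spread": max(values) - min(values)}
    for name, values in r2_by_model.items()
}

for name, stats in stability.items():
    print(f"{name}: mean r2 = {stats['mean']:.3f}, spread = {stats['spread']:.3f}")
```

Linear Regression has both the highest average r2 and by far the smallest spread across the three splits.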
Regressors | Split # | R2 Score | Explained Variance Score | Mean Squared Error | Mean Absolute Error | Median Absolute Error |
---|---|---|---|---|---|---|
Linear Regression | Split 1 | 0.89664579541310629 | 0.89691708583413254 | 0.0061842954759759924 | 0.051265662455921866 | 0.037261406664000601 |
Linear Regression | Split 2 | 0.89986981562675339 | 0.90286244646868152 | 0.0022638811791995807 | 0.036607046703997399 | 0.028266085435894095 |
Linear Regression | Split 3 | 0.90054123332737157 | 0.90377986316285341 | 0.0027188916770536373 | 0.038394690588562007 | 0.026809489549802112 |
An r2 score of about 0.9 with the selected linear regression is definitely higher than the benchmark r2 score of 0.5. The result is reasonable as the scores for linear regression were stable through Split 1, 2, and 3: r2 score, mean absolute error, and mean squared error, as well as explained variance score and median absolute error, were reasonably stable compared to the other regressions. This also suggests that the model is not overfitting, as it has similar scores over back-testing (Split 1, 2, and 3). The result can also be justified by the fact that linear regressions have historically worked well in economics and finance. This result is significant because the project shows that existing factors can be good predictors of the future S&P Composite return, although it is often said that it is nearly impossible to predict the future return of S&P Composite.
Figure 3: % change in each factor (24 month % change for 'price 24m change'. 12 months % change for other factors)
This graph is a box plot of the dataset. As you can see, the mean values of all the features are above 0. In particular, as the mean value of y is above 0, an investment in S&P Composite has made money on average.
I started this project by searching for available datasets in order to understand what kinds of datasets I could use. I checked Morningstar, Google Finance, and Yahoo Finance as well as Quandl, and decided to use the Quandl database as it seemed to have a wider variety of datasets than the other sources.
After searching for data, I defined the goal of this project, as the problem could be framed as either classification or regression. I considered using classification as well (for example, I could divide the returns into categories such as more than 10% increase, less than 10% change, etc.) but decided to use regression.
I analyzed the characteristics of the dataset in order to understand how I could use it to achieve the goal, and decided to normalize the data in order to make it usable for regressions. I decided to use parametric returns for normalizing the dataset.
After having the dataset ready, I chose which regressions and scoring metrics to use for the project, then compared the regressions according to the scores. I had to be careful about the scores, as the model could be overfitting and the scores could be biased. Finally, I tried to refine the regression with the best score, and concluded which model best serves the purpose.
I found it interesting that I could define the problem as any of supervised, unsupervised, or reinforcement learning even though the data being used is the same. I found it difficult to find the data that I needed for the project, as most datasets are not free.
The final result fit my expectations for the problem, as linear regression is the most commonly used regression in the financial industry.
I think I could have included more factors, such as fundamental data (book value, working capital, etc.) and economic data (GDP, NFP, etc.), in the dataset in order to make a better stock price predictor. With more factors in the dataset, I expect to obtain a better regression with a higher r2 score.