courtneyum/Trading

Setting up your Python environment

IMPORTANT: This procedure and our code have only been tested on Windows. If you try to set up or run our code on any other operating system, the code may require adjustment and your setup process may be more difficult.

  1. Install Miniconda Python 3.7 for Windows
  2. Install Visual Studio Code
  3. Open Visual Studio Code and install the Python extension from Microsoft. The button to open the extensions panel is found on the left-hand side. If VS Code shows a popup suggesting that you install a linter, one is helpful but not required: a linter will catch errors in code before you run it and will also colour-code keywords for improved readability.
  4. Open the "Trading" folder in Visual Studio Code. Double click "driver.py" to open this file in the editor.
  5. Look at the bottom right on the VS Code window and find where it says "Plain Text". Click here and select "Python" from the dropdown that appears.
  6. Next, VS Code should automatically ask you to select a Python interpreter. Select [Miniconda Path]\python.exe as your Python interpreter. Your selection should become apparent in the bottom left of the VS Code window.
  7. The following need to be added to your "Path" environment variable:
    • [Miniconda Path]\Scripts
    • [Miniconda Path]
    • [Miniconda Path]\Library\bin
  8. Restart your computer for the changes to your environment variables to take effect.
  9. Install the following packages:
    • pandas
    • matplotlib
    • numpy
    • theano
    • keras
    • sklearn (the conda package is named scikit-learn)
    • joblib
      with the command "conda install [package_name]" in the Visual Studio Code terminal.
  10. Find the file "scan_perform.c" in the project code. Place this file in [Miniconda Path]\Lib\site-packages\theano\scan_module\c_code. If the c_code folder does not exist, create it at the specified location. Note that this is not our original code; it was taken from GitHub as a fix for a Theano package bug. If the file is already present at that location, skip this step.
  11. Now you should be ready to run our program. The entry point is "driver.py".

Our Code

We built a neural network based on the LSTM (long short-term memory) recurrent architecture using the Keras library in Python. The problem was framed as a regression problem.

Training: We did a 2:1 training:validation split of the data during the training phase and trained a new model for each dataset given. Our features consisted of two timesteps each of “high”, “low”, “close”, and “open” data points, and our targets consisted of the following three steps of each quantity. This means the network predicts the next three “high”, “low”, “open”, and “close” values. In practice, we would need to retrain the model periodically, as the trends captured will not remain valid indefinitely. A toy example of this windowing is sketched below.
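As an illustration of the framing (our invention, not project code), here is how one column of a toy series splits into inputs and targets:

```python
import numpy as np

# A toy "close" series. One training sample pairs two past steps
# (the inputs) with the following three steps (the targets).
close = np.array([10.0, 10.2, 10.1, 10.4, 10.6, 10.5])

n_in, n_out = 2, 3
X = close[0:n_in]              # inputs:  [10.0, 10.2]       (t-2, t-1)
y = close[n_in:n_in + n_out]   # targets: [10.1, 10.4, 10.6] (t, t+1, t+2)
```

In the real data each sample carries all four columns, so one sample holds n_in * n_features inputs and n_out * n_targets outputs.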

driver.py

This script creates a model and then makes predictions for each dataset; it is the entry point of our code. If Param.remodel = False, it loads a precomputed model (precomputed models are included in the project code). If Param.compute_predictions = False, it loads precomputed predictions, which are also included. Graphs will be plotted and success metrics computed and reported either way. If either of these parameters is set to True, the run will take 1-2 hours. A hypothetical outline of this flow is sketched below.
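The sketch assumes Keras and joblib loaders and illustrative file suffixes; the real driver.py may differ in these details:

```python
# Hypothetical outline of driver.py's control flow.
from keras.models import load_model
from joblib import load

from Param import Param
import model, predict

for i, filename in enumerate(Param.filenames):
    if Param.remodel:
        net = model.model(filename)      # retrain from scratch (slow)
    else:
        net = load_model(Param.models_dir + Param.model_filename + str(i) + ".h5")
    if Param.compute_predictions:
        preds = predict.predict(net, filename)
    else:
        preds = load(Param.data_dir + Param.predictions_filename + str(i))
    predict.evaluate_predictions(preds)  # plots and success metrics
```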

model.py

  1. create_model() The network architecture is defined here, along with most of its hyperparameters. Two LSTM layers are interleaved with Dropout layers to prevent overfitting. We use an Adam optimizer with a learning rate of 0.0001. As a loss function, we use mean absolute percentage error so that we see the error as a percentage; we found that mean absolute error and mean squared error did not give an accurate representation of the scale of the error. However, mean absolute percentage error risks division by zero if a true target value equals zero, which would force us to use a different loss function. A sketch of this architecture appears after this list.
  2. model() This is the main function of this module. Here we load the data, perform a training/validation split, train the model with the call to model.fit(...), and then plot the results.
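A minimal sketch of that architecture, assuming 50-unit LSTM layers and 20% dropout (the real sizes live in model.py):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import Adam

def create_model(n_in=2, n_features=4, n_out=3, n_targets=4):
    # Two LSTM layers interleaved with Dropout to fight overfitting.
    net = Sequential()
    net.add(LSTM(50, input_shape=(n_in, n_features), return_sequences=True))
    net.add(Dropout(0.2))
    net.add(LSTM(50))
    net.add(Dropout(0.2))
    # One output per predicted timestep of each target quantity.
    net.add(Dense(n_out * n_targets))
    # MAPE reports error as a percentage, but it divides by the true
    # value, so it breaks if any true target is exactly zero.
    net.compile(optimizer=Adam(lr=0.0001),
                loss="mean_absolute_percentage_error")
    return net
```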

predict.py

  1. predict() Here we load the trained model and fetch the testing data. The past 10 days are used to predict the next three days.
  2. evaluate_predictions() Here we calculate the accuracy of the model for each target. The accuracy is the number of times the model guesses the sign of a price change correctly, divided by the total number of guesses; a sketch of this metric follows.
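The metric can be expressed compactly. This is a sketch of the idea, not necessarily our exact implementation:

```python
import numpy as np

def sign_accuracy(actual, predicted):
    # A direction guess is correct when the predicted price change has
    # the same sign as the actual price change at that timestep.
    actual_dir = np.sign(np.diff(actual))
    predicted_dir = np.sign(np.diff(predicted))
    return np.mean(actual_dir == predicted_dir)

# Actual moves [up, down], predicted moves [up, down] -> accuracy 1.0
print(sign_accuracy(np.array([10.0, 11.0, 10.5]),
                    np.array([10.2, 10.9, 10.7])))
```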

dataHelper.py

  1. get_data() This function fetches the wanted data from CSV, removes unwanted columns, and converts it to a format with columns [t-2, t-1, t, t+1, t+2] for each column in the loaded data, where t stands for timestep.
  2. get_training_input() This function fetches input data for training purposes. We fetch the training data and remove the target columns, then scale the result with a min-max scaler so that each data point lies between 0 and 1, saving the scaler for later use. We ensure that the length of the data is divisible by time_steps * batch_size, where time_steps is the number of timesteps used to make a prediction and batch_size is the number of samples used to train at a time. A sample consists of time_steps * n_features datapoints.
  3. get_training_output() This function fetches output data for training purposes. The differences are that we remove the feature columns after fetching the data and, at the end, keep only every 10th value so that there are as many output samples as input samples.
  4. get_testing_input() The difference between this function and the training version is that we do not reshape to 3 dimensions; that reshaping is done in the prediction phase.
  5. get_testing_output() The difference between this function and the training version is that we keep every value.
  6. series_to_supervised() Supervised learning regression problems require a particular data arrangement. This code was adapted from an external tutorial, which explains the arrangement in detail; a paraphrase of the standard shift-based implementation is sketched after this list.
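For reference, the widely used shift-based version of series_to_supervised() looks roughly like this (our copy may differ in details):

```python
import pandas as pd

def series_to_supervised(data, n_in=2, n_out=3):
    # Lag each column n_in times for the inputs and lead it n_out times
    # for the targets, so each row becomes one (inputs, targets) pair.
    df = pd.DataFrame(data)
    cols, names = [], []
    for i in range(n_in, 0, -1):            # input columns: t-n_in ... t-1
        cols.append(df.shift(i))
        names += [f"{c}(t-{i})" for c in df.columns]
    for i in range(n_out):                  # output columns: t ... t+n_out-1
        cols.append(df.shift(-i))
        names += [f"{c}(t)" if i == 0 else f"{c}(t+{i})" for c in df.columns]
    framed = pd.concat(cols, axis=1)
    framed.columns = names
    return framed.dropna()                  # drop rows with missing shifts
```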

prepare.py

  1. get_data() A simpler version of the function from dataHelper: just read the CSV and drop unwanted columns.
  2. ema() Smooth the data with an exponential moving average to reduce noise (see the sketch after this list).
  3. This is a script and should be run before the driver program. There is no need to run it on the given datasets, as smoothed versions have been provided.
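A minimal sketch of ema(), assuming it uses pandas' built-in exponentially weighted mean (the span value is illustrative):

```python
import pandas as pd

def ema(df, span=10):
    # Exponential moving average: recent observations get more weight,
    # which smooths high-frequency noise in the price series.
    return df.ewm(span=span, adjust=False).mean()

prices = pd.DataFrame({"close": [10.0, 10.4, 9.9, 10.6, 10.2]})
print(ema(prices))
```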

Param.py

This is a static class of parameters (a condensed sketch follows the list):

  • models_dir is the location the precomputed models are saved to
  • data_dir is the location that the data is stored. Precomputed predictions also get stored here.
  • filenames is an array containing each of the filenames of the data that we wish to process.
  • columns is an array of the relevant column names
  • time_steps is the number of timesteps to include in a training sample. A higher value gives the network more information for each prediction; a lower value gives it less.
  • batch_size the number of samples to train on at a time.
  • validation_split the fraction of our training data that the network may use for validation.
  • n_in the number of time steps of each feature to train on.
  • n_out the number of time steps of each target to predict.
  • n_features the total number of features
  • n_targets the total number of targets. Currently, each column is used as both a feature and a target, but at different times.
  • loss the loss function to use; this built-in function measures the error in our network. Other options are "mae" (mean absolute error) and "mse" (mean squared error).
  • learning_rate a hyperparameter required by the network's optimizer (we use Adam). A small learning rate helps us avoid skipping over the optimal configuration.
  • model_verbosity determines how much output you want to see from the network during training. For speed, use 0.
  • model_filename the name to give the precomputed model files. An index will be appended to each one to differentiate them.
  • best_model_filename same as previous, but the name for the best model. This is the model with the minimum validation loss.
  • input_scaler_filename is the filename of the input scaler that will be saved during training to be used during testing.
  • output_scaler_filename same as previous but for the output
  • predictions_filename the filename to give computed predictions when saving. An index will be appended to differentiate them.
  • price_fig_filename the filename to give figures that compare predicted prices and actual prices during testing. Again, an index will be appended to differentiate.
  • remodel a boolean that tells the program whether to train, or use a precomputed model.
  • compute_predictions a boolean that tells the program whether to compute predictions, or load precomputed ones.
  • use_best_model if True, we use the best model from best_model_filename. Otherwise, we use the final model that is created after all training has finished.
  • show_plots if True, the program will pause after each figure is created so that you can view it. To continue, you will have to close the plots.
  • col_to_trade_on the column whose price fluctuations decide whether we make a trade. Currently it is the closing price, but it can be changed to any of the other columns.
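Condensed, Param.py is just a class of attributes that the other modules read directly. The values below are illustrative; only the ones stated elsewhere in this README (learning rate, n_in, n_out, the column names) are known:

```python
class Param:
    models_dir = "models/"            # illustrative paths
    data_dir = "data/"
    columns = ["high", "low", "open", "close"]
    n_in, n_out = 2, 3                # 2 input steps -> 3 predicted steps
    n_features = n_targets = 4        # every column is a feature and a target
    loss = "mean_absolute_percentage_error"
    learning_rate = 0.0001
    remodel = False                   # True retrains from scratch (1-2 hours)
    compute_predictions = False       # True recomputes predictions
    use_best_model = True
    show_plots = True
    col_to_trade_on = "close"
```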

trade.py

  1. Signal The Signal class holds a value for each possible position and also tracks the current position.
  2. trade() This function keeps track of potential signals and acts once we have the maximum amount of information about a timestep. It also keeps track of our profit.
  3. add_signal() This function adds a potential signal that may trigger a real signal later on. When we have three (or Param.n_out) signals for a certain timestep, that timestep is "full" and we attempt to make a trade.
  4. check_position() This function checks the value of Signal.current_pos. If we are in a long or short position, we calculate the change in price and apply it to our profit: if the change corresponds with our position, profit increases; otherwise it decreases. A sketch of this bookkeeping follows the list.
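A minimal sketch of that bookkeeping, with positions encoded as +1/-1/0 so the profit update is a single line (our encoding; the real class may differ):

```python
class Signal:
    # Possible positions; current_pos holds the one we are in now.
    LONG, SHORT, NONE = 1, -1, 0
    current_pos = NONE

def check_position(price_change, profit):
    # While we hold a position, profit moves with the price: a long
    # position gains when the price rises, a short gains when it falls.
    if Signal.current_pos != Signal.NONE:
        profit += Signal.current_pos * price_change
    return profit
```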

Our Trading Strategy

Since we can predict the closing price three (Param.n_out) time steps into the future, at each time step we have three predictions for what will happen in the following time step. If most of these predictions say the price will go up, then we enter a long position (buy). If most say the price will go down, we enter a short position (sell). For each timestep that we are in a long or short position, we calculate our profit: if we were right, our profit increases by the change in price at that time step; if we were wrong, it decreases by that amount. A minimal sketch of the vote is shown below.
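The sketch uses the same +1/-1/0 position encoding as the trade.py sketch above:

```python
def decide_position(predicted_changes):
    # predicted_changes holds the Param.n_out (= 3) predicted price
    # changes accumulated for the upcoming timestep.
    ups = sum(1 for c in predicted_changes if c > 0)
    downs = sum(1 for c in predicted_changes if c < 0)
    if ups > downs:
        return 1    # long: most predictions say the price will rise
    if downs > ups:
        return -1   # short: most predictions say the price will fall
    return 0        # no majority: stay out of the market

print(decide_position([0.3, 0.1, -0.2]))  # -> 1, enter a long position
```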
This portion of our code is not as well tested as the rest due to time constraints. It appears that the strategy is not especially successful on three of the stocks, though it does see some success on the SXF stock. Profits are reported to the VS Code terminal when you run our program. Given more time, we would have liked to test our strategy more thoroughly and to search for one that is more widely successful.
We believe, however, that the neural network shows promise as an aid to a trading strategy, given the apparent accuracy of our predictions as evidenced by the "Price" plots. We also believe the model could be improved further, and customized for each stock, given more time.

About

U of M Data Science Competition 2019
