def main():
    st.image('logo.png', width=200)
    st.title('MiBand Wristband Data Analysis')
    st.markdown("""
Analysis of the data extracted from monitoring with the MiBand wristband. \n
To get your data, log in at this [url](https://user.huami.com/hm_account/2.0.0/index.html#/threeLogin) with your account and select the Export data option.
""")
    # Body of the first branch; the opening `if option == ...:` line is not part of this excerpt.
    st.image(
        'https://raw.githubusercontent.com/nilsoncunha/portfolioweb/master/assets/img/posts/enem.jpg',
        use_column_width=True)
    st.title('Predicting the math scores of the **2016** ENEM')
    st.markdown(
        '>*Analysis based on one of the activities proposed by the **Codenation** acceleration program that '
        'I took part in at the end of 2019, **Acelera Dev - Data Science**, in Belo Horizonte.*'
    )
    st.markdown(
        '>*I have started the acceleration program again, which is now online, at the invitation of '
        '**Codenation** itself, in order to help the participants with the challenges and the code, and to '
        'pass on the experience I had in the in-person edition. This time I will simply redo the analysis I did '
        'before (if you want to check it, just '
        '[click here](https://nilsoncunha.github.io/portfolioweb/prevendo-nota-de-matematica-do-enem-2016/)) '
        'using a tool introduced by [Túlio Vieira](https://www.linkedin.com/in/tuliovieira/) '
        '(the instructor of the program): [Streamlit](https://docs.streamlit.io/index.html)*.'
    )
    st.markdown(
        '>*Here is a brief introduction to this library, which deserves a lot of applause, before we start. '
        'In the words of Streamlit itself: '
        '"Streamlit is an open-source Python library that **makes it easy** (very easy, I would add) to '
        'create custom, beautiful web apps for machine learning and data science...".'
        ' With that ease we do not have to worry about HTML, CSS, JavaScript, etc. to build '
        'an interface, or resort to PowerPoint or something else to present our analysis to the business. '
        'Where we used to write all the documentation in the notebook itself, with '
        '[Streamlit](https://docs.streamlit.io/index.html) we can write the documentation and make it much more '
        'presentable to other people*')
    st.markdown(
        'So, summarizing the analysis and showing a bit of the tool, here we go! \n'
        'Datasets used for [training](https://dl.dropbox.com/s/7vexlohz7j3qem/train.csv?dl=0) and '
        '[testing](https://dl.dropbox.com/s/dsgzaemaau9g5z0/test.csv?dl=0).')
    st.markdown(
        'With this tool we can choose how many rows of our dataframe we want to see: '
        'we can define a *"slider"* and pass its value as the argument of "*head()*"'
    )
    st.markdown(
        "We can see that the state of São Paulo had the largest number of candidates, followed by Ceará and "
        "Minas Gerais. *We can also use plotly easily*")
    st.markdown(
        "Breaking it down by sex, we can see that women had a higher "
        "participation in the exam. *For charts, images, etc., "
        "[Streamlit](https://docs.streamlit.io/index.html) gives us the option to expand them: hovering the "
        "mouse pointer over one shows an arrow in the upper right corner*"
    )
    st.markdown(
        "Below we see the age distribution of the participants. *We simply add 'st.pyplot()' "
        "after building the chart*")
    st.markdown(
        'For the essay, some situations are recorded, such as going off topic or being annulled, among others. '
        '*Table generated with "st.table()"*')
    st.markdown(
        'Now looking at the exam scores by state. *Using plotly again, which makes it much '
        'easier to read off the values.*')
    # The chart title corresponds to the questions asked.
    st.markdown("Now describing the socioeconomic questionnaire. "
                "*We put 'plt.figure(figsize=(x, y))' before starting to build the chart, which lets us "
                "change the size of the image*")
    st.subheader("Preparing the data and making the prediction")
    st.markdown(
        "After these analyses, it is time to prepare the data for the prediction. "
        "First I imputed the value 0 (zero) as the exam score of those "
        "candidates whose status was anything other than '1 = Present at the exam'. *To display the code "
        "I wrote, I used st.echo(), which shows the code and executes it at the same time. \n Quite "
        "simple, right!? (Showing only a few lines)*")
    st.markdown(
        'In the challenge we had to submit a csv file with the answers of the trained model. '
        'Here I created my own validation dataset from the training data '
        'and checked how the model performs. *I used st.echo() to display the code '
        'and run it right away*')
    st.subheader(
        "Using *Linear Regression* and *Random Forest Regressor*.")
    st.markdown("___Linear Regression___")
    st.markdown("___Random Forest Regressor___")
elif option == 'Predição':
    st.header('Predicting the score:')
    st.subheader(
        'Since the best model was the *Random Forest*, I will use it to make the prediction'
    )
    estados = st.selectbox('State',
                           options=list(estados_op.keys()),
                           format_func=lambda x: estados_op[x])
    idade = st.number_input('Age', min_value=13, max_value=80, step=1, value=18, key='idade')
    escola_op = {
        1: 'Did not answer',
        2: 'Public',
        3: 'Private',
        4: 'Abroad'
    }
    escola = st.selectbox('School', index=1, key='escola',
                          options=list(escola_op.keys()),
                          format_func=lambda x: escola_op[x])
    treino_op = ('No', 'Yes')
    treino = st.selectbox('Took the exam for practice only?', index=0, key='treino',
                          options=list(range(len(treino_op))),
                          format_func=lambda x: treino_op[x])
    st.subheader('Attendance and score in the exams')
    st.markdown('Natural Sciences')
    pr_prova_cn_op = ('No', 'Yes')
    pr_prova_cn = st.selectbox('Attendance', index=1, key='pr_prova_cn',
                               options=list(range(len(pr_prova_cn_op))),
                               format_func=lambda x: pr_prova_cn_op[x])
    # An absent candidate can only have a score of zero.
    if pr_prova_cn == 0:
        nt_cn = st.number_input('Score', value=0.0, min_value=0.0, max_value=0.0, key='nt_cn')
    else:
        nt_cn = st.number_input('Score', min_value=0.0, max_value=1000.0, value=0.0, step=5.0, key='nt_cn')
else:
    st.subheader('About me:')
    st.markdown(
        'Someone who enjoyed working with data and saw that a lot of value can be generated from it. '
        'With that in mind, I started taking courses and began a postgraduate degree in Data Science and Big Data.'
    )
    st.markdown(
        '* Postgraduate student in Data Science and Big Data at PUC Minas. _(10/2020)_ \n'
        '* Acelera Dev Data Science - Codenation _(12/2019)_ \n'
        '* Data Science de A a Z - Udemy _(07/2019)_ \n'
        '* Undergraduate degree in Information Systems at Faculdades Promove. _(12/2016)_ *on hold* \n'
        '* Technology degree in Computer Networks at Faculdades Promove. _(06/2014)_'
    )
df.drop(['URL'], axis=1, inplace=True)
# Reshape the correlation matrix into long format: one row per (Y, X) pair.
df_corr = df.corr().stack().reset_index().rename(columns={
    0: 'correlation',
    'level_0': 'Y',
    'level_1': 'X'
})
df_corr['correlation_label'] = df_corr['correlation'].map('{:.3f}'.format)
'''
The pairwise correlation of all attributes in the data set.
'''
'''
Visualization of the correlation of features using a heat map. The magnitude of the correlation between the attributes is strong.
'''
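# To make the long-format correlation table above concrete, here is a minimal
# sketch on a toy frame. The frame `toy` and its column names are hypothetical
# and are not part of the original data set.

```python
import pandas as pd

# Toy data (hypothetical): `b` moves with `a`, `c` moves against it.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Same chain as above: square matrix -> one row per (Y, X) pair.
corr_long = (
    toy.corr()
    .stack()
    .reset_index()
    .rename(columns={0: "correlation", "level_0": "Y", "level_1": "X"})
)
corr_long["correlation_label"] = corr_long["correlation"].map("{:.3f}".format)
```

# Each of the 3 x 3 cells of the matrix becomes one row, which is the shape
# most charting libraries expect when drawing a labelled heat map.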
# Imports used throughout this script. The `ya` alias is assumed to refer to
# the yahoo_fin module named in the text below.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import streamlit as st
import yahoo_fin.stock_info as ya

movers = ya.get_day_most_active()
movers = movers[movers['% Change'] >= 0]

# We have successfully scraped the data using the yahoo_fin Python module. It is often a good idea to check whether those stocks are also generating attention, and what kind of attention it is, to avoid getting into false rallies. We will scrape some sentiment data courtesy of [sentdex](http://www.sentdex.com/financial-analysis/). Sentiment can sometimes lag because of the source (e.g. a news article published an hour after the event), so we will also use [tradefollowers](https://www.tradefollowers.com/strength/twitter_strongest.jsp?tf=1d) for their Twitter sentiment data. We will process both lists independently and combine them. For both the sentdex and tradefollowers data we use a 30-day time period. Using a single day might be great for day trading, but it increases the probability of jumping on false rallies.
#
# NOTE: Sentdex only has stocks which belong to the S&P 500.

# We then combine these results with our results from the biggest movers on a given day. This is done using a left join of this data frame with the original movers data frame.

# A couple of stocks pop up with both very good sentiment and an upward trend in favourability. ZNGA, TWTR and AES, for instance, stood out as potentially good picks. Note that the mentions here refer to the number of times the stock was referenced according to the internal metrics used by [sentdex](sentdex.com). Let's attempt to supplement this information with some data based on Twitter. We get the stocks that showed the strongest Twitter sentiment over a time period of 1 month.

# Twit_Bull_score refers to the internal scoring used at [tradefollowers](tradefollowers.com) to rank stocks based on Twitter sentiment, and can range from 1 to as high as 10,000 or greater. With the Twitter sentiment obtained, we combine it with our sentiment data to get an overall idea of the data.

# Finally, we include a Twitter momentum score.

# We again combine this dataframe with the earlier concatenated dataframes.
# This will form our recommender list.

# Our list now contains even more information to help us with our trades. Stocks which it suggests might generate positive returns include TSLA, ZNGA and TWTR. There is also the possibility that no stock falls in all of our generated lists, so using, for instance, only the price information and the Twitter data could still give us a good idea of what to expect in terms of performance. As an added measure, we can also obtain information on the sectors to see how they have performed. Again, we will use a one-month time period for comparison.

# The aforementioned stocks belong to the technology and consumer staples sectors. The industrials sector appears to be the best performing in this time period. Consumer staples appears to be doing better than IT, but overall they are up, which bodes well for potential investors.

# Please note that this analysis is only a guide to finding stocks that may generate positive returns. It is still up to the investor to do the research.

# ## Part 2: Forecasting using an LSTM
#
# In this section, we will attempt to apply deep learning to a stock of our choosing to predict future prices. At the time this project was conceived, AMD was selected because it was experiencing very high gains.

# First we obtain stock data for our chosen stock. Data from 2014 up to August 2020 was obtained for our analysis. Our data will be obtained from Yahoo.

# ### Feature selection/engineering
#
# We add additional data that might potentially increase prediction accuracy. Here we use technical indicators.

# To learn more about technical indicators and how they are useful in stock analysis, I welcome you to explore [investopedia](https://www.investopedia.com/).
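# As an illustration of what a technical indicator is, here is a minimal sketch
# that derives two common ones from a closing-price series with pandas alone.
# The helper name `add_indicators`, the 14-day window and the choice of
# indicators (a simple moving average and a simple-average RSI) are
# assumptions for illustration; the text above does not say which indicators
# or which provider were actually used.

```python
import pandas as pd


def add_indicators(close: pd.Series, window: int = 14) -> pd.DataFrame:
    """Return the close price with a moving average and an RSI column."""
    out = pd.DataFrame({"close": close})

    # Simple moving average over the last `window` observations.
    out["sma"] = close.rolling(window).mean()

    # Relative strength index, simple-average variant:
    # average gain divided by average loss over the window.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = delta.clip(upper=0).abs().rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out


# Hypothetical price series that rises by 1 every day.
prices = pd.Series(range(1, 31), dtype=float)
indicators = add_indicators(prices)
```

# The first rows are NaN until a full window of history is available, which is
# one reason the earliest rows are usually dropped before modelling.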
# Let's combine these indicators into a dataframe.

# We then extract the values for the time interval of choice.

# Now we combine them with our original dataframe containing price and volume information.

# Before we begin, it is often a good idea to visually inspect the stock data to get an idea of the price trend and volume information.

# In the month of July, AMD experienced a massive price surge. Let's have a look at the data with the indicators included.

# Indicators give us an idea of the direction of future prices. We will take the difference between two consecutive days in this case.

# In[81]:

# Day-over-day differences (value at time t minus value at time t-1)
df_updated['Diff_Open'] = df_updated['Open'] - df_updated['Open'].shift(1)
df_updated['Diff_Close'] = df_updated['Close'] - df_updated['Close'].shift(1)
df_updated['Diff-Volume'] = df_updated['Volume'] - df_updated['Volume'].shift(1)
df_updated['Diff-High'] = df_updated['High'] - df_updated['High'].shift(1)
df_updated['Diff-Low'] = df_updated['Low'] - df_updated['Low'].shift(1)
# Direction of the next day's close: 1 if it is higher than today's, else -1
df_updated['Diff-Close (forward)'] = np.where(
    df_updated['Close'].shift(-1) > df_updated['Close'], 1, -1)

# Differences against the previous day's low and close
df_updated['High-Low'] = df_updated['High'] - df_updated['Low'].shift(1)
df_updated['Open-Close'] = df_updated['Open'] - df_updated['Close'].shift(1)
df_updated['Returns'] = df_updated['Open'].pct_change(1)

# In[82]:

st.table(df_updated.head())

# The next step is to visualize how the features relate to each other. We employ a correlation matrix for this purpose.

# In[83]:

df_updated.drop(['date', 'Real Middle Band', 'Adj Close'], axis=1, inplace=True)

# In[84]:

plt.figure(figsize=(12, 8))
sns.heatmap(df_updated.corr())

# The closing price has very strong correlations with some of the other price information, such as the opening price, highs and lows.
# On the other hand, the differential prices aren't as correlated. We want to limit the amount of collinearity in our system before running any machine learning routine, so feature selection is a must.
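# One way to make the collinearity concern above concrete is to list the
# feature pairs whose absolute correlation exceeds a threshold. This is a
# minimal sketch: the helper name `correlated_pairs`, the 0.9 threshold and the
# toy columns are assumptions for illustration, not part of the original
# analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs with |correlation| above `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    # and the diagonal (always 1.0) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical toy data: `high` tracks `open` closely, `noise` does not.
toy = pd.DataFrame({
    "open": [1.0, 2.0, 3.0, 4.0, 5.0],
    "high": [1.1, 2.1, 3.2, 4.1, 5.3],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})
flagged = correlated_pairs(toy)
```

# Pairs flagged this way are candidates for dropping one of the two features,
# or for passing through the feature selection methods that follow.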
# ### Feature Selection
#
# We use two means of feature selection in this section: random forests and mutual information gain. Random forests are very popular due to their relatively good accuracy, robustness and ease of use. They can directly measure the impact of each feature on the accuracy of the model and, in essence, rank the features. Information gain, on the other hand, calculates the reduction in entropy from transforming a dataset in some way. Mutual information gain essentially evaluates the gain of each variable in the context of the target variable.

# In[85]:

# Imports assumed for the names used in the cells below
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# ### Random forest regressor

# In[88]:

# Separate the target variable from the features
y = df_updated['Close'].iloc[1:].dropna()
X = df_updated.drop(['Close'], axis=1).iloc[1:].dropna()
#print("y-Band: ",y.count)
#print("x-band: ",X.count)

# In[89]:

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)

# In[90]:

X_train.shape, y_train.shape

# In[92]:

feat = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1))
feat.fit(X_train, y_train)
feat.get_support()

# In[93]:

X_train.columns[feat.get_support()]

# The regressor essentially selected the features that displayed good correlation with the Close price. However, although it selected the most important features, we would also like to know the information gain from each variable. An issue with using random forests is that they tend to diminish the importance of other correlated variables, which may lead to incorrect interpretation. However, they do help reduce overfitting.

# ### Mutual information gain

# In[94]:

# Imports assumed for the names used in the cells below
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# In[96]:

mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False, inplace=True)

# In[97]:

st.table(mi.head(50))

# The results validate those obtained with the random forest regressor, but it appears some of the other variables also contribute a decent amount of information.
# We will select features with values greater than 2 for our analysis; the cell below does this by keeping the `k=8` highest-scoring features.

# In[98]:

sel = SelectKBest(mutual_info_regression, k=8).fit(X_train, y_train)

#
Features = X_train.columns[sel.get_support()]
Features.values

# ### Preprocessing
#
# In order to construct a long short-term memory neural network (LSTM), we need to understand its structure. Below is the design of a typical LSTM unit. Data source: [Researchgate](https://www.researchgate.net/publication/334268507_Application_of_Long_Short-Term_Memory_LSTM_Neural_Network_for_Flood_Forecasting)

# ![LSTM_structure.jpg](LSTM_structure.jpg)

# As mentioned earlier, LSTMs are a special type of recurrent neural network (RNN). RNNs are a special type of neural network in which the output of a layer is fed back to the input layer multiple times in order to learn from past data. Basically, the neural network is trying to learn data that follows a sequence. However, because plain RNNs must carry information from all past steps, they struggle to retain it over long sequences and become expensive to train. The LSTM mitigates this issue using gates. It has a cell state and 3 gates: the forget, input and output gates.
#
# The cell state is essentially the memory of the network. It carries information throughout the processing of the data sequence. Information is added to or removed from this cell state using gates. Information from the previous hidden state and the current input are combined and passed through a sigmoid function at the forget gate. The sigmoid function determines which data to keep or forget. The transformed values are then multiplied by the current cell state.
#
# Next, the information from the previous hidden state combined with the input is passed through a sigmoid function, to again determine which information is important, and also through a tanh function, which transforms the data to values between -1 and 1. This transformation helps with the stability of the network and helps deal with the vanishing/exploding gradient problem.
# These 2 outputs are multiplied together, and the result is added to the cell state that was already scaled by the forget gate, giving us the new cell state for the next time step.
#
# Finally, the information from the hidden state is combined with the current input and a sigmoid function is applied to it. The new cell state is passed through a tanh function to transform the values, and both outputs are multiplied to determine the new hidden state for the next time step.
#
# Now that we have an idea of how the LSTM works, let's construct one. First we split our data into training and test sets.

# In[99]:

df_updated.reset_index(drop=True, inplace=True)

train_size = int(len(df_updated) * 0.8)
test_size = len(df_updated) - train_size

# Make sure to omit the first row, which contains NaNs
train = df_updated.iloc[1:train_size]
test = df_updated.iloc[train_size:]

# In[100]:

train.shape, test.shape

# In[102]:

# Extract the features
total_features = list(Features.values)

total_features.append('Close')
total_features

# Copy the slices so that the scaling step can assign to them safely
train = train[total_features].copy()
test = test[total_features].copy()
train.shape, test.shape

# Before we proceed, it is important to scale the data. Scaling is done to ensure that one set of features does not have more importance relative to the others. In addition, having values between 0 and 1 will help the neural network converge faster, if it converges at all. We fit the scalers on the training data only and reuse them on the test data, to avoid leakage into our model.
# In[103]:

from sklearn.preprocessing import MinMaxScaler  # import assumed for the scaler used below

# Scale both features and target variables
f_transformer = MinMaxScaler()  # Feature scaler
targ_transformer = MinMaxScaler()  # Target scaler

# Fit on the training data only
f_transformer = f_transformer.fit(train[Features].to_numpy())
targ_transformer = targ_transformer.fit(train[['Close']].to_numpy())

train.loc[:, Features] = f_transformer.transform(train[Features].to_numpy())
train['Close'] = targ_transformer.transform(train[['Close']].to_numpy())

test.loc[:, Features] = f_transformer.transform(test[Features].to_numpy())
test['Close'] = targ_transformer.transform(test[['Close']].to_numpy())

# In[104]:

train.shape, test.shape

# The figure below shows how the sequential data for an LSTM is constructed to be fed into the network. Data source: [Althelaya et al, 2018](https://ieeexplore.ieee.org/document/8355458)

# ![LSTM_data_arrangement.PNG](attachment:LSTM_data_arrangement.PNG)

# Basically, for data at time t with a window size of N, the target will be the data point at time t, and the features will be the data points in [t-N, t-1]. We then sequentially move forward in time using this approach. We therefore need to format our data that way.

# In[105]:

# In[106]:

time_steps = 10
X_train_lstm, y_train_lstm = create_dataset(train.drop(['Close'], axis=1),
                                            train['Close'], time_steps)
X_test_lstm, y_test_lstm = create_dataset(test.drop(['Close'], axis=1),
                                          test['Close'], time_steps)

# In[108]:

X_train_lstm.shape, y_train_lstm.shape

# In[109]:

X_test_lstm.shape, y_test_lstm.shape

# ### Building LSTM model
#
# The new release of TensorFlow (TensorFlow 2.0), via Keras, has made the implementation of deep learning models much easier than in previous releases. We will apply a bidirectional LSTM, as they have been shown to be more effective in certain applications (see [Althelaya et al, 2018](https://ieeexplore.ieee.org/document/8355458)). This is due to the fact that the network learns using both past and future data in 2 layers. Each layer performs the operations using time steps reversed relative to the other.
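# The `create_dataset` helper called above is not defined in this excerpt.
# A minimal sketch of what such a windowing helper might look like is given
# here; the body is an assumption that follows the [t-N, t-1] -> t arrangement
# described above, not necessarily the author's implementation.

```python
import numpy as np
import pandas as pd


def create_dataset(X: pd.DataFrame, y: pd.Series, time_steps: int = 1):
    """Build (samples, time_steps, features) windows and their targets.

    Assumed behaviour: the window covering rows i .. i+time_steps-1 is paired
    with the target at row i+time_steps.
    """
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X.iloc[i:i + time_steps].to_numpy())
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)


# Hypothetical toy data: 20 rows, 3 features.
toy_X = pd.DataFrame(np.arange(60, dtype=float).reshape(20, 3))
toy_y = pd.Series(np.arange(20, dtype=float))
windows, targets = create_dataset(toy_X, toy_y, time_steps=10)
```

# With 20 rows and a window of 10 this yields 10 samples, each of shape
# (10, 3), which is the 3-D input layout a Keras LSTM layer expects.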
# The loss function in this case will be the mean squared error, and the Adam optimizer with the default learning rate is applied.

# In[110]:

from tensorflow import keras  # import assumed for the `keras` name used below

# In[111]:

model = keras.Sequential()
model.add(
    keras.layers.Bidirectional(
        keras.layers.LSTM(units=32,
                          input_shape=(X_train_lstm.shape[1],
                                       X_train_lstm.shape[2]))))

model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.Dense(units=1))

# In[112]:

model.compile(optimizer='adam', loss='mean_squared_error')

# In[114]:

history = model.fit(X_train_lstm,
                    y_train_lstm,
                    epochs=90,
                    batch_size=40,
                    validation_split=0.2,
                    shuffle=False,
                    verbose=1)

# In[115]:

test_loss = model.evaluate(X_test_lstm, y_test_lstm)

# In[116]:

# In[117]:

plot_learningCurve(history, 90)

# With each epoch, the validation loss decreases, but in a somewhat stochastic manner. The training loss is fairly consistent throughout. There may be some overfitting in there, but you can always tune the model parameters and explore the data more. Let's make some predictions on the test data just to see what's happening.

# In[118]:

y_pred = model.predict(X_test_lstm)

# We need to apply some inverse scaling to get back our original results.

# In[119]:

# The scaler was fitted on a single column, so it expects shape (n_samples, 1)
y_train_inv = targ_transformer.inverse_transform(y_train_lstm.reshape(-1, 1))
y_test_inv = targ_transformer.inverse_transform(y_test_lstm.reshape(-1, 1))
y_pred_inv = targ_transformer.inverse_transform(y_pred)

# In[120]:

plt.figure(figsize=(10, 10))
plt.plot(np.arange(0, len(y_train_lstm)),
         y_train_inv.flatten(),
         'g',
         label="history")
plt.plot(np.arange(len(y_train_lstm),
                   len(y_train_lstm) + len(y_test_lstm)),
         y_test_inv.flatten(),
         marker='.',
         label="true")
plt.plot(np.arange(len(y_train_lstm),
                   len(y_train_lstm) + len(y_test_lstm)),
         y_pred_inv.flatten(),
         'r',
         label="prediction")
plt.ylabel('Close Price')
plt.xlabel('Time step')
plt.legend()
st.pyplot(plt, use_container_width=True)
#plt.show();

# At first glance we can see that our predictions are not very good; we could adjust our model parameters some more.
# However, they appear to be following the trends pretty well. Let's take a closer look.

# In[121]:

plt.figure(figsize=(10, 10))
plt.plot(np.arange(len(y_train_lstm[0:500]),
                   len(y_train_lstm[0:500]) + len(y_test_lstm[0:500])),
         y_test_inv.flatten()[0:500],
         label="true")
plt.plot(np.arange(len(y_train_lstm[0:500]),
                   len(y_train_lstm[0:500]) + len(y_test_lstm[0:500])),
         y_pred_inv.flatten()[0:500],
         'r',
         label="prediction")
plt.ylabel('Close Price')
plt.xlabel('Time Step')
plt.legend()
st.pyplot(plt, use_container_width=True)
#plt.show();

# Now it will become apparent why I did not use a large number of epochs to train my model. At first glance, we notice the LSTM has some implicit autocorrelation in its results, since its predictions for a given day are very similar to those of the previous day. It essentially lags. It is basically showing that the best guess of the model is very similar to previous results. This should not be a surprising result: the stock market is influenced by a number of factors such as news, earnings reports, mergers, etc. Therefore, it is a bit too chaotic and stochastic to be accurately modelled, because it depends on so many factors, some of which can be sporadic, e.g. positive or negative news. In my opinion, therefore, this may not be the best way to predict stock prices. Of course, with major advances in AI there might actually be a way, but I don't think the hedge funds will be sharing their methods anytime soon.

# ## Part 3: Regression analysis

# Of course, we could still attempt to get an idea of what the possible price movements might be. In this case I will use the differential prices, as there is less volatility compared to using absolute prices.
# Let's explore these relationships.

# In[122]:

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))

ax[0, 0].scatter(df_updated['Open-Close'], df_updated['Diff_Close'], c='k')
ax[0, 0].legend(['Open-Close'])
ax[0, 0].set_ylabel('Diff-Close')

ax[0, 1].scatter(df_updated['High-Low'], df_updated['Diff_Close'], c='k')
ax[0, 1].legend(['High-Low'])
ax[0, 1].set_ylabel('Diff-Close')

ax[1, 0].scatter(df_updated['Diff_Open'], df_updated['Diff_Close'], c='k')
ax[1, 0].legend(['Diff-Open'])
ax[1, 0].set_ylabel('Diff-Close')

ax[1, 1].scatter(df_updated['Diff-Low'], df_updated['Diff_Close'], c='k')
ax[1, 1].legend(['Diff-Low'])
ax[1, 1].set_ylabel('Diff-Close')

ax[2, 0].scatter(df_updated['Diff-High'], df_updated['Diff_Close'], c='k')
ax[2, 0].legend(['Diff-High'])
ax[2, 0].set_ylabel('Diff-Close')

ax[2, 1].scatter(df_updated['Open'], df_updated['Diff_Close'], c='k')
ax[2, 1].legend(['Open'])
ax[2, 1].set_ylabel('Diff-Close')

st.pyplot(fig)

# Above is a series of plots that show the relationship between different differential price measurements and the differential close. In this study, the difference is the difference between a value at time t and the previous day's value at time t-1. The differential high, differential low, differential high-low and differential open-close appear to have a linear relationship with the differential close. However, only the differential open-close would be useful in an analysis. This is because on a given day (time t), we cannot know the highs or lows until the day ends. However, we do know the open value at the start of the trading period.

# Let's separate the data features and target variables.
# We will use Ridge regression in this case to make our model more generalizable.

# In[123]:

# Imports assumed for the names used in the cells below
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# In[124]:

X_reg = df_updated[['Open-Close']]
y_reg = df_updated['Diff_Close']

# In[125]:

# Drop the first row, which contains NaNs from the shift
X_reg = X_reg.loc[1:, :]
y_reg = y_reg.iloc[1:]

# In[126]:

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=0)

# We will perform a grid search with cross validation to determine the optimal parameters for our regression model.

# In[127]:

ridge = Ridge()
alphas = [
    1e-15, 1e-8, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0, 1, 5, 10, 20, 30, 40,
    45, 50, 55, 100
]
params = {'alpha': alphas}

# In[129]:

ridge_regressor = GridSearchCV(ridge,
                               params,
                               scoring='neg_mean_squared_error',
                               cv=10)
ridge_regressor.fit(X_reg, y_reg)

# In[130]:

st.text(ridge_regressor.best_score_)
st.text(ridge_regressor.best_params_)

# Finally, let's produce a plot and see how it fits.

# In[131]:

np.shape(X_test_reg)

# In[133]:

regr = Ridge(alpha=1e-15)
regr.fit(X_train_reg, y_train_reg)

y_pred = regr.predict(X_test_reg)
y_pred_train = regr.predict(X_train_reg)

st.text(f'R^2 value for test set is {regr.score(X_test_reg,y_test_reg)}')
st.text(f'Mean squared error is {mean_squared_error(y_test_reg,y_pred)}')

# Start a new figure so the plot is not drawn onto the previous subplot grid
plt.figure()
plt.scatter(df_updated['Open-Close'][1:], df_updated['Diff_Close'][1:], c='k')
plt.plot(df_updated['Open-Close'][1:],
         (regr.coef_[0] * df_updated['Open-Close'][1:] + regr.intercept_),
         c='r')
plt.xlabel('Open-Close')
plt.ylabel('Diff-Close')
st.pyplot(plt, use_container_width=True)

# We obtained a mean squared error of 0.58, which is fairly moderate. Our R^2 value basically says that 54% of the variance in the differential close price is explained by the differential open-close price. Not so bad so far. But to be truly effective, we need to make use of statistics. Specifically, let's define a confidence interval around our predictions, i.e. prediction intervals.
#
# Prediction intervals give you a range for the prediction that accounts for any threshold of modeling error.
# Prediction intervals are most commonly used when making predictions or forecasts with a regression model, where a quantity is being predicted. We select the 95% interval in this example, such that the actual values are expected to fall into this range 95% of the time. For an in-depth overview and explanation please explore [machinelearningmastery](https://machinelearningmastery.com/prediction-intervals-for-machine-learning/)

# In[135]:

# In[136]:

lower, upper, interval = predict_range(X_reg, y_reg, regr)

# In[138]:

# Start a new figure and label each artist explicitly so the legend matches
plt.figure()
plt.scatter(X_reg, df_updated['Diff_Close'][1:], c='k')
plt.plot(X_reg, lower, c='b', label='Lower bound')
plt.plot(X_reg,
         (regr.coef_[0] * df_updated['Open-Close'][1:] + regr.intercept_),
         c='r',
         label='Model')
plt.plot(X_reg, upper, c='g', label='Upper bound')

#plt.errorbar(X_reg , (regr.coef_[0] * df_updated['Open-Close'][1:] + regr.intercept_),yerr=interval)
#
plt.xlabel('Open-Close')
plt.ylabel('Diff-Close')
plt.legend()
st.pyplot(plt, use_container_width=True)
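# The `predict_range` helper called above is not defined in this excerpt.
# A minimal sketch of one way to write it is given here, assuming roughly
# normal residuals and a fixed-width band of 1.96 residual standard deviations
# around the prediction. The body, the `z` parameter and the toy `_LineModel`
# stand-in are assumptions for illustration, not the author's code.

```python
import numpy as np


def predict_range(X, y, model, z: float = 1.96):
    """Return (lower, upper, interval) for the model's predictions on X.

    Assumed method: the half-width is z times the standard deviation of the
    residuals, estimated with n - 2 degrees of freedom (simple linear fit).
    """
    y_hat = np.asarray(model.predict(X)).ravel()
    residuals = np.asarray(y).ravel() - y_hat
    stdev = np.sqrt(np.sum(residuals ** 2) / (len(residuals) - 2))
    interval = z * stdev
    return y_hat - interval, y_hat + interval, interval


class _LineModel:
    """Hypothetical stand-in for a fitted regressor: predicts y = 2x."""

    def predict(self, X):
        return 2.0 * np.asarray(X).ravel()


# Toy data: the true line plus alternating +1 / -1 noise.
toy_X = np.arange(10.0).reshape(-1, 1)
toy_y = 2.0 * toy_X.ravel() + np.array([1.0, -1.0] * 5)
toy_lower, toy_upper, toy_interval = predict_range(toy_X, toy_y, _LineModel())
```

# Because the half-width is a single number, the band is parallel to the
# fitted line; a band that widens away from the mean of X would need the full
# prediction-interval formula instead.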
