I have been trying to use a Keras neural network to predict, as a Boolean, whether tomorrow's closing price will be higher than today's.
I am using financial data from Yahoo Finance, and I load the data using this code:
df = pd.DataFrame(yf.download(tickers='QQQ', interval ='1d'))
This is the code I am using to create my target variable:
close_tm = df['Close'].shift(-1)
df['inc_tm'] = close_tm > df['Close']
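Note that shift(-1) leaves a NaN for the most recent row, and comparing NaN with > silently yields False, so the last row gets a meaningless label; it has to be dropped before training. A minimal sketch of that step:
df = df.iloc[:-1]             # drop the last row: its next-day close does not exist yet
y = df['inc_tm'].astype(int)  # Boolean target as 0/1 for the network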
When I execute my neural network using some additional variables, the accuracy rate caps out at about 54%, which, perplexingly, is equal to the model's accuracy without any additional variables.
Interestingly, when I change the target variable to be a Boolean of yesterday's close, using this code:
close_tm = df['Close'].shift(1)
df['inc_tm'] = close_tm > df['Close']
the accuracy rate jumps up to about 63%. This is what I expect the actual accuracy rate to be, as it reflects what happens when I use the exact same data and target variable in R. What's more, when I shift the target variable to five days ago, the accuracy rate increases to a whopping 87%, which is completely illogical.
I should mention that the accuracy rate for the yesterday or five-days-ago targets does not increase without the addition of my variables. If anyone has any ideas as to why this could be happening, please share them; I have been stumped on this for days now. To be clear, the exact same process (the correct target variable and time-series data) returns a somewhat functional neural network in R. For some reason Python is doing something different.
Thank you for taking the time to read this; any help is greatly appreciated, even if you don't have a definite answer.
Related
I was recently trying to predict stock prices using an LSTM (it seems really overdone, I know), but when I predict on data that is outside the dataset, I get a really weird graph that I do not think is correct. Am I doing something wrong?
Prediction: https://github.com/Alpheron/StockPred/blob/master/predictions/MSFT-5-Year-LSTM.ipynb
Training:
https://github.com/Alpheron/StockPred/blob/master/MSFT-5-Year-LSTM.ipynb
In order to process the data, I used a lookback window of 60 points, so when trying to predict on data that is outside of the dataset, I would need the last 60 points as well. Am I doing something wrong with the way I am predicting?
I have time-series data from 2016 to 2021; how could I backcast to get the data from 2010 to 2015 using ARIMA in Python?
Could you give me some sample Python code?
Thank you very much
The only possibility I see here is to simply inverse your time series: the last observation becomes the first, the second-to-last becomes the second, and so on. You then have a series from 2021 to 2016.
You can do that by:
df = df.reindex(index=df.index[::-1])
You can then train an ARIMA model on this data and predict the "next" five years from 2015 to 2010. Remember that the first prediction will be for 2015-12-31, so you need to inverse this again to have the series from 2010 to 2015.
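A rough sketch of that workflow, assuming a pandas Series y holding the six yearly observations (the ARIMA order here is a placeholder, not a recommendation):
from statsmodels.tsa.arima.model import ARIMA

y_rev = y.iloc[::-1].reset_index(drop=True)  # 2021 ... 2016, with a plain positional index
fit = ARIMA(y_rev, order=(1, 0, 0)).fit()    # placeholder order; only the AR part matters here
backcast_rev = fit.forecast(steps=6)         # the "next" six values: 2015 back to 2010
backcast = backcast_rev.iloc[::-1]           # flip again to get 2010 ... 2015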
Keep in mind that the ARIMA predictions will be very, very bad, since your forecasts will be based on forecasts, and so on. ARIMA is not made for predictions over such long time frames, so the results will probably be useless anyway. It is very likely that the predictions will become a straight line after 30 or 40 steps. And you can only use the autoregressive part in such a case, since the order of the moving-average component limits the number of steps you can forecast into the future.
Forecasting from an inversed timeseries would be the solution if you had more data.
However, only having 6 observations is problematic. Creating a forecasting (or backcasting) model requires using some of the observations to train the model and others to validate it. If you train with 4 observations then you only have 2 observations for validation. Is a model good if it forecasts those two well or did you just get lucky? Is it bad if it forecasts one observation well and the other poorly? If you increase the validation set to 3 observations, you get more confidence on whether the model is good or bad but creating the model (with only 3 observations) gets even harder than before.
Like others have stated, regardless of what machine learning model you choose, the results are likely to be poor with so little data. If you had the monthly data it might be more fruitful.
If you can't get the monthly data, then since you are backcasting into the past, it might be better to estimate the values manually based on some related variables for which you do have data (if any). E.g. if your time series is a company's sales, then maybe you could estimate them from the company's annual revenue (or company size, or something else) if you can get the historical data for that variable. This is not precise, but it can still be more precise than what ARIMA or similar methods would give with the data you have.
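A hedged sketch of that idea with a simple linear fit, where sales (known for 2016-2021) and revenue (known for 2010-2021) are hypothetical series indexed by year:
import numpy as np

# Fit sales ~ revenue on the overlapping years 2016-2021.
slope, intercept = np.polyfit(revenue.loc[2016:2021], sales.loc[2016:2021], deg=1)

# Apply the same relationship to the years you want to backcast.
sales_backcast = intercept + slope * revenue.loc[2010:2015]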
First of all, I must say I'm a beginner at these AI things. I followed most of the tutorials about stock market prediction, and all of them are pretty much the same. These tutorials use a data set and split it into two sets: the first is the training set and the second is the test set. They use the closing price of the stocks to train and build a model. Into that model, they feed the test data set, which also contains the closing price, and show two graphs. Then they say the actual and the predicted graphs are pretty much the same.
The GitHub repo of the tutorial:
https://github.com/surajr/Stock-Predictor-using-LSTM/blob/master/Stock-Predictor-using-LSTM.ipynb
These are my questions:
1. Why do all those tutorials put the closing price in the testing set as well? They are only supposed to insert dates, right? Because we are predicting the closing price. This is confusing; please explain.
2. No one explains how to predict the next 7 days' values. So if we have a model, how do we get the next 7 days' closing values?
Please help me clarify this. Thanks a lot.
Take a look at this link. I think it will get you going in the right direction.
https://www.datacamp.com/community/tutorials/lstm-python-stock-market
Why do all those tutorials put the closing price in the testing set as well?
The ultimate goal is to predict the movement (growth), which is the closing price minus the opening price. The best model is the one whose calculated growth on the test data set is very close to the actual growth. The growth is the main quantity the model is trying to predict and is the point of reference when you calculate the accuracy of the trained model.
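As a rough illustration (assuming a DataFrame with 'Open' and 'Close' columns, e.g. from Yahoo Finance), that growth would be computed as:
df['growth'] = df['Close'] - df['Open']            # the per-day move described above
df['direction'] = (df['growth'] > 0).astype(int)   # 1 = up day, 0 = down day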
They are only supposed to insert dates, right? Because we are predicting the closing price.
The model predicts the growth based on the given factors. For a company, you have many factors that are quantified per day. I suspect the tutorial you followed uses a testing set extracted for one particular day across different stocks, like extracting all parameters for all companies but only for the 10th of January, and then checking how accurate the trained model is. The training set, on the other hand, usually contains the stock data for more than one day.
No one explains how to predict the next 7 days' values. So if we have a model, how do we get the next 7 days' closing values?
To predict the stock price relatively accurately, you need a well-trained model, and to get one you need to train it on many, many factors. The same model cannot predict stocks in different countries. One model might be suitable for predicting technology stocks (AAPL) but not other sectors.
Overall, this is a complicated subject. Financial advisers pay massive amounts of money just to use reliable models, and most of them use multiple models depending on their clients' portfolios. These tutorials introduce the subject and teach you the main concepts. IMHO, the next step would be learning more and then competing on Kaggle.
In the training set, the closing value is included as an input because it is relevant to the "next day's" price, or the "price in X days" (for models that predict price movement over more than one day).
Note that in the training data, the future price (today + 1 day) is typically the target value (train_Y).
In the testing data, the closing price is included because the testing data is likewise used to predict the "future price."
In determining the accuracy of the model, the price prediction for (today + X days) is compared against the actual future value (test_Y) to determine the effectiveness of the prediction. Just like a human stock trader, if you are guessing/predicting whether the FUTURE price will be Y (i.e. up/down), you would have access to the current day's closing price, which is why it is a relevant input. Obviously, in a real-world model, the accuracy of the prediction would only be known AFTER X days pass. When training and then testing a model, the data is typically historical, so out-of-sample values (like the price at today + X days) are used for the accuracy determination, though the FUTURE value should definitely not be an input.
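A minimal sketch of that arrangement, assuming a pandas DataFrame df with a 'Close' column (X_DAYS and the variable names are illustrative, not taken from the tutorial):
X_DAYS = 1                           # prediction horizon in days
features = df[['Close']].copy()      # today's close is a legitimate input
target = df['Close'].shift(-X_DAYS)  # the future close is the label (train_Y / test_Y)
features = features.iloc[:-X_DAYS]   # drop the tail rows whose
target = target.iloc[:-X_DAYS]       # future value is not known yet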
Why do all those tutorials put the closing price in the testing set as well?
-> The closing price is a kind of input variable that is required to calculate the future stock price.
Looking at the code, it seems to predict the stock price from a 22-day history:
X_train (1173, 22, 3)
y_train (1173,)
X_test (130, 22, 3)
y_test (130,)
I think you should re-train with inputs shaped (~~~, 7, 3) to predict the price 7 days after today.
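For reference, arrays shaped like (samples, 22, 3) are usually built with a sliding window; here is a rough sketch, where data is assumed to be a NumPy array of shape (n_days, 3), close a 1-D array of closing prices, and the names are illustrative:
import numpy as np

def make_windows(data, close, lookback=22, horizon=1):
    X, y = [], []
    for i in range(lookback, len(data) - horizon + 1):
        X.append(data[i - lookback:i])     # the last `lookback` days of features
        y.append(close[i + horizon - 1])   # the close `horizon` days after the window ends
    return np.array(X), np.array(y)

# e.g. make_windows(data, close, lookback=22, horizon=7) for a 7-day-ahead target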
I am struggling with a standard ML problem.
I'm trying to build a service that predicts the next time a user will send a message on a platform. For this I'm using a historical dataset of the user's messages, which is structured as an array of timestamps. For example:
[2019-05-23 18:28:34.741413, 2019-05-23 18:45:39.643218, 2019-05-23 23:26:44.767524]
What is the best way to predict the next timestamp in this series, i.e. when the user will next be online?
Currently I am creating a DataFrame in Python to feed into a Keras Sequential() model, but I need a y value to do that.
Thanks for your ideas on how to handle this.
As a first attempt, I would predict the time duration until the next timestamp (regression, not classification). Probably even better would be to predict the logarithm of that duration instead, because it is more important to get 2 min vs. 3 min right than to focus on 500 min vs. 510 min.
As inputs you could use the logarithm of the time since the last timestamp, maybe a couple of the previous gaps, the logarithm of the last message length, or some general user stats.
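A rough sketch of that setup, assuming timestamps is a sorted list of datetime objects (N_PREV is an arbitrary choice): the target is the logarithm of the gap to the next message, and the features are the logarithms of the previous gaps.
import numpy as np

gaps = np.diff([t.timestamp() for t in timestamps])  # seconds between consecutive messages
log_gaps = np.log(gaps + 1.0)                        # +1 avoids log(0) for near-simultaneous messages

N_PREV = 3                                           # how many previous gaps to use as inputs
X = np.array([log_gaps[i - N_PREV:i] for i in range(N_PREV, len(log_gaps))])
y = log_gaps[N_PREV:]                                # log duration until the next message
These (X, y) pairs can then be fed into a small Keras Sequential regression model with a single linear output and an MSE or MAE loss.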
But ideally, you'd have the neural network predict the parameters of a probability distribution, such that it can give you an answer like "probably within the next 30 minutes, certainly not after midnight, but possibly after 7am", and then you can measure this prediction against the empirical distribution (e.g. cross-entropy loss). But this is probably a bit too involved for getting started.
If you only want to predict a single timestamp (and not a distribution) then in theory you'd have to define an appropriate loss, and make a decision about which errors are how bad for your application, and then train a model that optimizes this loss.
I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both approaches are able to learn and solve this problem sometimes, but not always.
My network is simply a feed-forward network (I've tried using 1 and 2 hidden layers). I created one network for DQN and two networks for DDQN: a target network to evaluate the Q values and a primary network to choose the best action; I train the primary network and copy it to the target network after some number of episodes.
The problem in DQN is:
Sometimes it can achieve a perfect score of 200 within 100 episodes, but sometimes it gets stuck and only achieves a score of 10 no matter how long it is trained.
Also, in cases of successful learning, the learning speed differs.
The problem in DDQN is:
It can learn to achieve a score of 200, but then it seems to forget what it has learned and the score drops dramatically.
I've tried tuning the batch size, learning rate, number of neurons in the hidden layer, number of hidden layers, and exploration rate, but the instability persists.
Are there any rules of thumb for the size of the network and the batch size? I think a reasonably larger network and a larger batch size would increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!
These kinds of problems happen pretty often, and you shouldn't give up. First, of course, you should do another check or two that the code is all right: try comparing your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is fine (and since you say the model can learn the task from time to time, it probably is), you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters like the exploration technique, the learning rate, the way you update the target network, and the experience replay memory. I would not play around with the hidden layer sizes; find values for which the model learned once and keep them fixed.
Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too. Starting with a low epsilon value may be the cause of the different learning speeds and success rates: if you do not go fully random at first, you populate your memory with similar kinds of transitions at the beginning, and with a lower epsilon at the start there is a bigger chance that your model will not explore enough before the exploitation phase begins.
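A minimal sketch of such a schedule (the decay rate and epsilon_min values are illustrative, not tuned):
import numpy as np

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
rng = np.random.default_rng()

def choose_action(q_values):
    if rng.random() < epsilon:                   # explore with probability epsilon
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))              # otherwise exploit the current Q estimates

# after each episode (or each step), decay epsilon but never below epsilon_min
epsilon = max(epsilon_min, epsilon * epsilon_decay)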
Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model avoid escaping from a good minimum back into some worse local one. Also, adaptive learning rates such as those computed by Adam might help you. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
Target network update (rate and value): This is an important one as well. You have to experiment a bit, not only with how often you perform the update but also with how much of the primary network's weights you copy into the target network. People often do a hard update every episode or so, but try doing soft updates instead if the first technique does not work.
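A sketch of such a soft update (Polyak averaging), assuming two Keras models primary and target with identical architectures; tau is illustrative:
TAU = 0.005   # fraction of the primary weights blended into the target per update

def soft_update(primary, target, tau=TAU):
    blended = [tau * w_p + (1.0 - tau) * w_t
               for w_p, w_t in zip(primary.get_weights(), target.get_weights())]
    target.set_weights(blended)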
Experience replay: Do you use it? You should. How big is your memory? This is a very important factor, and the memory size can influence the stability and the success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if it affects your learning curve a lot, try the technique proposed in the paper mentioned above.
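A minimal replay-memory sketch; the capacity is exactly the kind of value worth experimenting with, per the paper above:
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)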
Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity
I was also thinking that the problem was an unstable (D)DQN, or that "CartPole" is bugged or "not stably solvable"!
After searching for a few weeks, I checked my code several times and changed every setting but one...
The discount factor: setting it to 1.0 (really) stabilized my training on CartPole-v1 (500 max steps) considerably.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
Its creator set gamma to 1.0 (I read about it on Reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step and was "stable" in 7 out of 10 runs.
I spent a whole day solving this problem. The return climbs to above 400, and suddenly falls to 9.x.
In my case I think it's due to unstable gradients: the L2 norm of the gradients varies from 1 or 2 up to several thousand.
I finally solved it; see whether this helps.
Clip the gradients before applying them, and use a learning-rate decay schedule:
import math
import tensorflow as tf

variables = model.trainable_variables
grads = tape.gradient(loss, variables)                   # gradients from the GradientTape
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)  # clip the global norm to 30
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)       # decay the step size over time
for g, var in zip(grads, variables):
    var.assign_sub(g * learning_rate)                    # manual SGD update
Use an exploration-rate decay schedule:
epsilon = 0.85 ** math.log(total_steps + 1, 2)  # epsilon shrinks as total_steps grows