How to predict future Stock using LSTM Keras

How to predict future Stock using LSTM Keras - python

First of all, I must say, I'm a beginner to this AI things. I followed most of the tutorials about stock market predictions and all of them are pretty much same. These tutorials using a data set and split in to two sets. First one is Training set and the 2nd one is Test set. They are using Closing price of the stocks to train and make a model. From that model, they insert test data set which contain the closing price and showing two graphs. Then they say the actual and the predicted graphs are pretty much same.
The github repo of the tutorial. -
https://github.com/surajr/Stock-Predictor-using-LSTM/blob/master/Stock-Predictor-using-LSTM.ipynb
This is my question,
1. Why all those tutorials are putting closing price in the testing set also? They are only suppose to insert dates right? Because we are predicting the closing price. This is confusing. Please explain me.
2. No one is telling me how to predict next 7 days values. So if we have a model, how to get next 7 days closing value?
Please help me to clarify this. Thanks a lot.

Take a look at this link. I think it will get you going in the right direction.
https://www.datacamp.com/community/tutorials/lstm-python-stock-market

Why all those tutorials are putting closing price in the testing set also?
The ultimate goal is to predict the movement (growth), Which is closing minus- opening price. The ultimate model is the model that calculates the growth in test data set very close to what the actual growth is. The growth is the main problem that the model is trying solve and is the point of reference when you calculate the accuracy of the trained model.
They are only suppose to insert dates right? Because we are predicting the closing price
The model is predicting the growth based on given factors. For a company, you have many factors that are quantified, per day. I suspect the tutorial you did uses a testing set extracted for one particular day and different stocks. Like extracting all parameters for all companies but only in 10th of January and then check how accurate the trained model is. The training set on the other hand contains the stock for more than one day most of the time.
No one is telling me how to predict next 7 days values. So if we have a model, how to get next 7 days closing value?
To predict the stock price relatively accurate, you need a well-trained model. To do this you need to train your model based on many many factors. Same model cannot predict stock in different countries. One model might be suitable to predict technology stocks (AAPL) but not other fields.
Overall, this is a complicated subject. Financial advisers pay a massive amount of money just to use reliable models. Most of them use multiple models based on their client's portfolio. These tutorials introduce the subject to you and teach you the main concept. IMHO, I would say the next step would be learning and then competing in Kaggle.

In the training set, closing value is included as an input because it is relevant to the "next day's" price, or "price in X days" (for models that predict price movement over more than 1 day).
Note, in the training data, typically the future price (today + 1 day) is the target value (train_Y).
In the testing data, the closing data is included because the testing data is predicting "future price."
In determining the accuracy of the model, the price prediction of (today + X days) is compared against the future value (test_Y) to determine the effectiveness of the prediction. Just like a human stock trader, if you are guessing/predicting if the FUTURE price will be Y (i.e. up/down), then you would have access to the current day's end of day closing price...which is why it is a relevant input. Obviously, in a real-world model, the accuracy of the prediction would only be known AFTER X days pass. When training and then testing a model, typically the data is historical, so out of sample values (like the price of today + X days) is used for accuracy determination, though the FUTURE value should definitely not be an input.

Why all those tutorials are putting closing price in the testing set also?
-> It is easy to understand that closing price is a kind of input variable which is required to calculate stock price.
As I see the code, it seems predict stock price with 22days history
X_train (1173, 22, 3)
y_train (1173,)
X_test (130, 22, 3)
y_test (130,)
I think you should re-train with (~~~, 7, 3) to predict price of 7 days after today.

Related

Why can my neural network only predict the previous day?

I have been trying to use a Keras Neural Network to predict tomorrows closing price in comparison to today as a Boolean.
I am using financial data from Yahoo Finance, and I load in the data using this code.
df = pd.DataFrame(yf.download(tickers='QQQ', interval ='1d'))
This is the code I am using to create my target variable:
close_tm = df['Close'].shift(-1)
df['inc_tm'] = close_tm > df['Close']
When I execute my neural network using some additional variables, the accuracy rate caps out at about 54%, which, perplexingly, is equal to the model's accuracy without any additional variables.
Interestingly, when I change the target variable to be a Boolean of yesterday's close, using this code:
close_tm = df['Close'].shift(1)
df['inc_tm'] = close_tm > df['Close']
the accuracy rate jumps up to about 63%, this is what I expect to have as an actual accuracy rate, as it reflects what happens using the exact same data and target variable in R. What's more, when I shift the target variable to be 5 days ago, the accuracy rate increases to a whopping 87%, which is completely illogical.
I should mention that the accuracy rate of yesterday or 5 days ago does not increase without the addition of my variables. If anyone has any ideas as to why this could be happening, I have been stumped on this for days now. To be clear, the exact same process (accurate target variable and timeseries data) returns a somewhat functional neural network in R. For some reason Python is doing something different.
Thank you for your taking the time to read this, any help is greatly appreciated even if you don't have a definite answer.

Backcast time series using Python

I have a time-series data from 2016 to 2021, how could I backcast to get the data from 2010 to 2015 using ARIMA in Python?
COuld you guys give me some sample Python code?
Thank you very much

The only possibiliy I see here is to simply inverse your time series. That means the last observations becomes the first, the second last becomes the second and so on. You then have a series from 2021 to 2016.
You can do that by:
df = df.reindex(index=df.index[::-1])
You can then train an ARIMA model on this data and predict the "next" five years from 2015 to 2010. Remember that the first prediction will be for 2015-12-31, so you need to inverse this again to have the series from 2010 to 2015.
Keep in mind that ARIMA the predictions will be very, very bad, since your forecasts will be based on forecasts and so on. ARIMA is not made for predictions on such long time frames, so the results will be useless anyway, I guess. It is very likely that the predicitons will become a straight line after 30 or 40 predicions.And you can only use the autoregression part in such a case, since the order of the moving average model will limit the amount of steps you can forecast into the future.

Forecasting from an inversed timeseries would be the solution if you had more data.
However, only having 6 observations is problematic. Creating a forecasting (or backcasting) model requires using some of the observations to train the model and others to validate it. If you train with 4 observations then you only have 2 observations for validation. Is a model good if it forecasts those two well or did you just get lucky? Is it bad if it forecasts one observation well and the other poorly? If you increase the validation set to 3 observations, you get more confidence on whether the model is good or bad but creating the model (with only 3 observations) gets even harder than before.
Like others have stated, regardless of what machine learning model you choose, the results are likely to be poor with so little data. If you had the monthly data it might be more fruitful.
If you can't get the monthly data, since you are backcasting into the past, it might be better to estimate the values manually based on some related variables that you have data of (if any). E.g. if your timeseries is about a company's sales then maybe you could estimate based on the company's annual revenue (or company size, or something else) if you can get the historical data of that variable. This is not precise but can still be more precise than what ARIMA or similar methods would give with the data you have.

What is the best approach to implement Time series forecasting to predict future customer orders?

I have 2 years of historical data of customers, items ordered and the numbers of orders. Based on this data, I am trying to predict the future sales at customer - item level. I tried ARIMA model which didn't give me the expected results. Any suggestions or references of implementation. I am interested to try LSTM and looking for good retail references.

ARIMA models can be challenging to work with as there are a lot of parameters which need to be set in a reasonable manner, especially if you're trying to model seasonality.
Here's a decent starter discussion (with code) on using an LSTM for retail sales problem: https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
Also, consider working to identify some dependent variables which may impact sales to add as auxiliary inputs. For example - some items could sell more during certain seasons, around national holidays, etc.
Forecasting at this detailed of a level is always challenging so the more variables you can find which explain the observed phenomena, the better you may be able to do.

Can a single model return either a continuous or categorical result?

I'm sorry for the poorly worded title, I'm just not sure how to phrase my question. Explaining my problem is probably easier. Also, I'm quite new to this, having only taken a few Udemy courses in Python and no professional experience in the field.
TL;DR: I want to create a model that will predict how many a minutes a flight will be delayed, or if it'll be canceled.
I'm working with the 2015 Flight Delays dataset on Kaggle, and I'm combining this data with a site I found that gives historical weather data for airports. I'm trying to combine those two sets of data to built a better delay predictor.
My hypothesis is that bad weather is the primary cause of delays, but really bad weather is the primary cause of cancellations.
My first thought at attacking this is that I could create a linear model that takes ceiling, visibility, wind as continuous independent variables, plus origin airport, departure airport, and precipitation intensity as categorical independent variables, and output the arrival delay as the dependent variable. However, this doesn't account for flight cancellations.
In the dataset, a cancelled flight has a NaN for its departure and arrival times, and then a one-hot encoded column for Cancelled. I could just edit the dataset and do something like making the arrival delay 999 if the Cancelled column is 1, but that screws with the linear model.
Is it possible for my model to output either a numeric (continuous) value for how many minutes a flight will be delayed, or if it'll be cancelled altogether (categorical)?
If not, can I somehow stack models? Ie, make something like a random forest model that only predicts "Canceled' or "Not Canceled", and feed the "Not Canceled" flights into a linear model that predicts a number of minutes a flight will be delayed?
Slightly unrelated question: I don't think there is a linear relationship between ceiling/visibility/winds, but I do think there is maybe a logarithmic relationship. For example, there is a massive difference between a visibility of 0.5, 1.0, and 1.5 miles, but there is zero difference between a visibility of 8 and 10 miles. Same for winds and ceiling. What sort of model is available to me in scikit-learn that accounts for this?

There are a lot of questions here but let me answer the one related with the stacked machine learning models.
The answer is yes, you train a neural network which solves both tasks. That is called multitask learning. I only know implementations of multitask learning with deep neural networks and its not hard to implement. The most important part is the output layer, in this case you would have an output layer with two activation units. One sigmoid output unit and one relu output unit (if you are planning to predict only possitive delays).
Another solution for your problem is the one you proposed. Just set the delay to a big number and set a condition in your code like "If the number of delayed hours is longer than n days, then the flight was cancelled".
And the last solution I see for it is implementing two neural networks. One which determines if the flight was cancelled and one which determines the delay time. If the first one returns true, then don't use the second one.
Please feel free to ask

Learn to predict future customer's behavior from history

I have a dataset which contains visiting history from customers.
It has three columns in dataset including customer ID, AM/PM (visit at AM or PM) and Weekday/Weekend (visit on weekday or weekend).
I want to learn from this dataset and select the top 50 customers who have the biggest chance to visit in specified input (like AM / Weekday).
For now, I create model for each customer by using one-class SVM (I only have positive (visit) data). Since the one-class SVM only has binary output, I can only tell the certain customer will visit or not in specified input, rather than selecting the top 50 customers.
I was wondering if there is an algorithm that can learn from a positive-only dataset and give a score or probability like output?

That's a subcategory problem within machine learning. You can learn a lot reading this survey: "One-Class Classification: Taxonomy of Study and Review of
Techniques" (http://arxiv.org/pdf/1312.0049.pdf). Hope it helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.