I have a dataframe with the hourly production of a plant. It contains many 0 values (sometimes the plant doesn't work for external reasons).
If anyone wants to have a glance at the dataframe, here is the link (https://www.mediafire.com/file/55915xs2acdl7h4/Dataframe.zip/file).
I would like to make a prediction for the next 24 hours, but I'm having a lot of trouble with xgboost. After splitting the data (X_train/y_train; X_test/y_test), the algorithm gets stuck.
1) Should I change the algorithm?
2) Am I missing some parameter tuning steps?
I am pretty new to ML and completely new to creating my own models. I have gone through TensorFlow's time-series forecasting tutorial and other LSTM time-series examples on how to predict with multi-variate inputs. After trying multiple examples, I think I realized that this is not what I want to achieve.
My problem involves a dataset in hourly intervals that includes 4 different variables, with the possibility of more in the future; one column is the datetime. I want to train a model on data that can range from one month to many years. I will then create an input set that includes some of the variables used during training, with at least one of them missing, which is the one I want the model to predict.
For example, if I had 2 years' worth of solar panel data and weather conditions from Jan-12-2016 to Jan-24-2018, I would like to train the model on that and then have it predict the solar panel data from May-05-2021 to May-09-2021, given the weather conditions during that date range. So in my case I am not necessarily forecasting, but using existing data to point at a certain day of any year given the conditions of that day at each hour. This would mean I should be able to go back in time as well.
Is this possible to achieve using TensorFlow, and if so, are there any resources or tutorials I should be looking at?
See Best practice for encoding datetime in machine learning.
Relevant variables for solar power seem to be hour-of-day and day-of-year. (I don't think separating into month and day-of-month is useful, as they are part of the same natural cycle; if you had data on the spending habits of people who get paid monthly, it would make sense to model a monthly cycle, but the Sun doesn't do anything particular on a monthly basis.) Divide them by hours-in-day and days-in-year respectively to get them into [0,1), multiply by 2*PI to map them onto a circle, and create a sine column and a cosine column for each of them. This gives you four features that capture the cyclic nature of day and year.
CyclicalFeatures discussed in the link is a nice convenience, but you should still pre-process your day-of-year into the [0,1) interval manually, as otherwise you can't get it to handle leap years correctly: some years have 365 days and some 366, and CyclicalFeatures only accepts a single max value per column. A minimal sketch of the manual encoding is below.
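This is one way the encoding could look, assuming an hourly DatetimeIndex; the function and column names (hour_sin, doy_cos, etc.) are just illustrative, not anything the tutorial prescribes:

import numpy as np
import pandas as pd

def add_cyclic_features(df):
    # Assumes df has a DatetimeIndex at hourly resolution.
    idx = df.index
    hour_frac = idx.hour / 24.0                      # fraction of the day in [0, 1)
    days_in_year = np.where(idx.is_leap_year, 366, 365)
    year_frac = (idx.dayofyear - 1) / days_in_year   # fraction of the year in [0, 1)
    out = df.copy()
    out['hour_sin'] = np.sin(2 * np.pi * hour_frac)
    out['hour_cos'] = np.cos(2 * np.pi * hour_frac)
    out['doy_sin'] = np.sin(2 * np.pi * year_frac)
    out['doy_cos'] = np.cos(2 * np.pi * year_frac)
    return out

Using each year's actual length (365 or 366) is what keeps the leap-year handling correct.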
I am trying to forecast a value 30 days ahead. I have time-series data with some parameters; an example of the data is attached at the bottom.
The main idea is that Y is the target variable I want to predict 30 days from today. The f1-f5 variables are values which influence the Y value. So I need to predict Y using the Date and f1-f5 columns. New data comes in every day.
Could you recommend some ML and DL approaches to predict the "Y" value?
My thoughts:
As I understand it, this is time-series data and the task is regression. But I am a bit worried, because time-series approaches, as I understand them, predict a value based only on the date, using seasonality and so on. I'm afraid that if I use XGBoost or linear regression approaches, I will lose the time-series effect in this data.
Date,f1,f2,f3,f4,f5,Y
2015-01-01,183,34,15,1166,50,3251
2015-01-02,364,173,5,739,32,8132
2015-01-03,83,72,38,551,49,6271
2015-01-04,183,81,7,937,32,3334
2015-01-05,324,61,73,554,71,3742
2015-01-06,183,97,15,337,17,5543
2015-01-07,38,152,83,883,32,9143
2015-01-08,78,72,5,551,11,6435
2015-01-09,183,30,21,443,92,4353
...,...,...,...,...,...,...
2018-06-08,924,9,53,897,88,7446
Time series are traditionally modeled with AR (auto-regression) and MA (moving average). Trend and seasonality should also be accounted for. So why not use ARIMA or Prophet? Here's some theory on the subject - https://otexts.com/fpp2/
There are some ML/DL implementations based on RNNs/LSTMs, but they are quite complex, often hard to explain, and tend to suffer from the vanishing gradient problem. If you must use ML/DL, you may want to have a look at LSTNet.
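To make the ARIMA suggestion concrete, here is a minimal sketch with statsmodels, assuming the example rows above are saved as data.csv; the (1, 1, 1) order is a placeholder you would normally choose from ACF/PACF plots or an information criterion:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv('data.csv', parse_dates=['Date'], index_col='Date')

# Fit a univariate ARIMA on the target and forecast 30 daily steps ahead.
model = ARIMA(df['Y'], order=(1, 1, 1)).fit()
forecast = model.forecast(steps=30)
print(forecast)

If the future values of f1-f5 are known (or can themselves be forecast), they could be passed as exogenous regressors via SARIMAX instead.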
I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv
I'm trying to predict the next values of "LandAverageTemperature".
I asked another question about this topic earlier (it's here: How to predict correctly in sklearn RandomForestRegressor?) but couldn't get any answer to it. After getting nothing from my first question and then failing for another day, I've decided to start from scratch.
Right now, I want to know which value in my dataset is "x" so I can make the prediction correctly. I read that y is the dependent variable, the one I want to predict, and x is the independent variable that I should use as the "predictor" to help the prediction process. In that case my y variable is "LandAverageTemperature". I don't know what the x value is. I was using date values for x at first, but I'm not sure that is right.
And if I shouldn't use RandomForestRegressor or sklearn for this dataset (I've started with Spark for this project), please let me know. Thanks in advance.
You only have one variable (LandAverageTemperature), so obviously that's what you're going to use. What you're looking for is the df.shift() function, which shifts your values. With this function, you'll be able to add columns of past values to your dataframe. You will then be able to use the temperature 1 month/day ago, 2 months/days ago, etc., as predictors of another day's/month's temperature.
You can use it like this:
for i in range(1, 15):
    df.loc[:, 'T-%s' % i] = df.loc[:, 'LandAverageTemperature'].shift(i)
Your columns will then be the temperature and the temperature at T-1, T-2, and so on, for up to 14 time periods back.
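If it helps, here is a hedged sketch of how those lagged columns could then feed a regressor (RandomForestRegressor, to match your earlier question); the dropna step and the 14-lag window are assumptions that follow from the loop above:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

lag_cols = ['T-%s' % i for i in range(1, 15)]

# Drop the rows made incomplete by shifting (and any missing target values).
data = df.dropna(subset=lag_cols + ['LandAverageTemperature'])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(data[lag_cols], data['LandAverageTemperature'])

# Predict the next value from the 14 most recent observations,
# ordered most-recent-first so they line up with T-1, T-2, ...
recent = df['LandAverageTemperature'].iloc[-14:][::-1].to_numpy()
next_value = model.predict(pd.DataFrame([recent], columns=lag_cols))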
As for what a proper model for time-series forecasting would be, that is off-topic for this site, but many resources exist on https://stats.stackexchange.com.
In the general case you can use all data columns except your target column as the X feature matrix. But in your case there are several complications:
You have missing (empty) data in most of the columns for many years. You can exclude such rows/years from the training data, or exclude the columns with missing data (which would be almost all of your columns, so that's not a good option).
A regression model can't use date fields directly; you should translate the date field into some numerical field(s), for example "months past the first observation": something like (year - 1750) * 12 + month. Or/and you can keep year and month in separate columns (which is better if your data has some seasonality). A sketch of this translation follows the list.
You have sequential time data here, so maybe you should not use simple regression at all. Consider ARIMA/SARIMA/SARIMAX and the other so-called time-series models, which predict the target sequentially, one value after another - month after month in your case. It's a hard topic to learn, but you should definitely take a look at time series, because you will need them some day if not today.
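Here is a minimal sketch of that date translation, assuming the datetime column is called 'dt' as in the linked CSV (rename as needed):

import pandas as pd

df['dt'] = pd.to_datetime(df['dt'])

# Single numeric feature: months elapsed since January 1750.
df['months_since_1750'] = (df['dt'].dt.year - 1750) * 12 + df['dt'].dt.month

# Or keep year and month as separate columns so a model can pick up seasonality.
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month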
I am trying to fit an ARIMA model. I have 3 months of data showing a count (float) for every minute. Which order should I pass to arima.fit()?
I need to predict for every minute.
A basic ARIMA(p,d,q) model isn't going to work with your data.
Your data violates the assumptions of ARIMA, one of which is that the parameters have to be consistent over time.
I see clusters of 5 spikes, so I'm presuming that you have 5 busy workdays and quiet weekends. Basic ARIMA isn't going to know the difference between a weekday and weekend, so it probably won't give you useful results.
There is such a thing as SARIMA (Seasonal Autoregressive Integrated Moving Average). That would be useful if you were dealing with daily data points, but it isn't really suitable for minute-level data either.
I would suggest that you try to filter your data so that it excludes evenings and weekends. Then you might be dealing with a dataset that has consistent parameters. If you're using Python, you could try using pyramid's auto_arima() function on your filtered time series (s) to have it try to auto-find the best parameters p, d, q.
It also does a lot of the statistical tests you might want to look into for this type of analysis. I actually don't always agree with the auto_arima's parameter choices, but it's a start.
model = pyramid.arima.auto_arima(s)
print(model.summary())
http://pyramid-arima.readthedocs.io/en/latest/_submodules/arima.html
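A hedged sketch of that filtering step, assuming s is a pandas Series with a minute-level DatetimeIndex and that "evenings" means anything outside 08:00-18:00 (adjust the hours to your actual operating schedule):

import pandas as pd
from pyramid.arima import auto_arima  # the package is now maintained as pmdarima

# Keep weekdays (Mon-Fri) during assumed business hours only.
mask = (s.index.dayofweek < 5) & (s.index.hour >= 8) & (s.index.hour < 18)
s_filtered = s[mask]

# Let auto_arima search for the (p, d, q) order on the filtered series.
model = auto_arima(s_filtered)
print(model.summary())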
1) Is your data even suitable for a Box-Jenkins model (ARIMA)?
2) I see a higher mean towards the end, and there is clear seasonality in the data, so plain ARIMA will fail. Please try seasonality-aware models: SARIMA, as rightly suggested above, or Prophet, another nice algorithm by Facebook for seasonal data (R implementation):
https://machinelearningstories.blogspot.com/2017/05/facebooks-phophet-model-for-forecasting.html
3) Don't rely on ARIMA alone. Try other time-series algorithms such as STL, structural time series (e.g. BSTS), TBATS, and hybrid models. Let me know if you would like some package recommendations in R.
I have a dataset covering two months, with a reading every 2 minutes. The statsmodels.tsa.seasonal_decompose method asks for a numeric value for the frequency. What would that numeric value be for such data, and what is the proper way to calculate the frequency for this kind of time series?
You need to identify the frequency of the seasonality yourself. Often this is done using knowledge of the dataset or by visually inspecting a partial autocorrelation plot, which the statsmodels library provides:
statsmodels - partial autocorrelation
If the data has an hourly seasonality to it, you might see a significant partial autocorrelation at lag 30 (as there are 30 data points between this hour's first 2 minutes and last hour's first 2 minutes). I believe that is the value statsmodels would expect; similarly, if you had monthly data it would expect 12, or if you had daily data it would expect 7 for the weekly pattern, etc.
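A quick sketch of that inspection, assuming your 2-minute readings are in a pandas Series called series (the name is just for illustration):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# With 2-minute data there are 30 points per hour, so a spike near lag 30
# would suggest an hourly cycle; extend lags (e.g. to 720) to look for a daily one.
plot_pacf(series, lags=60)
plt.show()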
It sounds like you have multiple seasonalities to consider, judging by your other post. You might see a significant lag that corresponds to the same 2 minutes in the previous hours, previous days, and/or previous weeks. This approach to seasonal decomposition is considered naive and only addresses one seasonality, as described in its documentation:
Seasonal Decomposition
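For reference, a minimal call might look like the following; the period of 720 assumes you are after the daily cycle (24 hours x 30 readings per hour), and older statsmodels versions name the argument freq rather than period:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

result = seasonal_decompose(series, period=720)  # one day of 2-minute readings
result.plot()
plt.show()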
If you would like to continue down the seasonal decomposition path, you could try the relatively recently released double-seasonal model from Facebook. It is specifically designed to work well with daily data, where it models intra-year and intra-week seasonalities. Perhaps it can be adapted to your problem.
fbprophet
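If you want to experiment with it, a minimal sketch could look like this; feeding it 2-minute data and the daily/weekly seasonality switches are my assumptions about how it might be adapted, not something its documentation promises for this resolution:

from fbprophet import Prophet

# Prophet expects a two-column dataframe named 'ds' (timestamps) and 'y' (values).
train = series.reset_index()
train.columns = ['ds', 'y']

m = Prophet(daily_seasonality=True, weekly_seasonality=True)
m.fit(train)

future = m.make_future_dataframe(periods=30, freq='2min')  # 30 steps of 2 minutes
forecast = m.predict(future)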
The shortcoming of seasonal decomposition models is that they do not capture how a season might change through time. For example, the characteristics of a week of electricity demand in summer are very different from those in winter. This method determines an average seasonal pattern and leaves the remaining information in the residuals. So given that your characteristics vary by day of week (mentioned in your other post), this won't be captured.
If you'd like to send me your data, I'd be interested in taking a look. In my experience, you've jumped into the deep end of time-series forecasting, which doesn't necessarily have an easy-to-use off-the-shelf solution. If you do provide it, please also clarify what your objective is:
Are you attempting to forecast ahead, and if so, how many 2 minute intervals?
Do you require confidence intervals, monte carlo results, or neither?
How do you measure accuracy of the model's performance? How 'good' does it need to be?