XGBoost scaling weights for time series data - python

I'm working on a binary classification problem using time series data and I've been having some trouble adjusting the scale_pos_weight parameter.
Since it's time series data, most of my features are things like the last-30-days mean, the number of days since event X, or the accumulated days event X has happened, so to avoid data leakage I split the data chronologically: the first 80% for training and the last 20% for testing.
This works fine in most cases, but for a few of them the target's distribution changes a lot from the training data to the test data: the training data has roughly 100:1 negative-to-positive instances while the test data is closer to 30:1.
I've tried different training-set sizes to get similar target distributions, but I end up with odd splits like 50% or 95%.
I also considered using the test data's distribution to adjust the weights, but that would be data leakage.
Any ideas on how I could sort this out?
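One leak-free starting point is to derive scale_pos_weight from the training window only and accept that the test-period base rate may differ, leaning on a precision-recall-oriented metric. A minimal sketch of that idea, using synthetic stand-in data (the column names and values are made up):

import numpy as np
import pandas as pd
import xgboost as xgb

# Synthetic stand-in for the real data: a chronologically ordered frame
# with one feature column and a rare positive class (hypothetical names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rolling_mean_30d": rng.normal(size=5000),
    "target": (rng.random(5000) < 0.01).astype(int),
})

# Chronological split: first 80% train, last 20% test (no shuffling).
cut = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cut], df.iloc[cut:]
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]

# Derive the class weight from the training window only,
# so nothing about the test period leaks into the model.
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    scale_pos_weight=neg / pos,   # roughly 100 for a 100:1 imbalance
    eval_metric="aucpr",          # PR-based metric copes better with shifting base rates
)
model.fit(X_train, y_train)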

Related

Is it a good idea to re-scale data for each new data point when training a neural network?

I am training a neural network model for time series forecasting. I need to scale my data because the model should be able to receive different time series with different value ranges as input. So far, I have only tried using a MinMaxScaler. I have a limited number of data points to fit my scaler on, as the individual time series are quite small, and I need the model to make predictions on most of the data. My time series are quite volatile, so as new data points are transformed by the scaler they often exceed its range (above 1, below 0). This is a problem, specifically when values fall below 0. I know I can adjust the range of the MinMaxScaler, but that doesn't seem like a good solution.
Is it a good idea to scale the entire dataframe every time a new data point arrives? Or maybe just fit and transform on the number of data points the model uses to predict (the window size)?
If not, how do you solve the issue of having little data to fit the scaler on? Clipping values is not an option, as it loses a key component, namely the difference in value between the data points. I am not convinced that StandardScaler or RobustScaler will do the trick either, for the same reason.
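If the aim is simply to make inputs comparable regardless of each series' range, one option people often try is to fit the scaler on each input window separately and keep that fitted scaler around for the inverse transform, rather than fitting once on the whole dataframe. A minimal sketch, where the series, window size, and model are stand-ins:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale_window(window):
    # Fit a fresh MinMaxScaler on this window only and return it as well,
    # so the model's output can be inverse-transformed with the same statistics.
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(window.reshape(-1, 1))
    return scaled, scaler

# Hypothetical usage: "series" stands in for one of the small, volatile series,
# and window_size is the number of points the model sees per prediction.
series = np.cumsum(np.random.randn(200))
window_size = 30
scaled_window, scaler = scale_window(series[-window_size:])
# model_output = model.predict(scaled_window[np.newaxis, ...])   # model is assumed
# forecast = scaler.inverse_transform(model_output)

Whether this helps depends on whether the absolute level of the series matters for your target; per-window scaling deliberately throws that information away.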

Predicting future of LSTM resulting in weird answers

I was recently trying to predict stock prices using an LSTM (overdone, I know), but when I predict on data outside the dataset I get a really weird graph that I do not think is correct. Am I doing something wrong?
Prediction: https://github.com/Alpheron/StockPred/blob/master/predictions/MSFT-5-Year-LSTM.ipynb
Training:
https://github.com/Alpheron/StockPred/blob/master/MSFT-5-Year-LSTM.ipynb
To process the data I used a lookback window of 60 points, so when predicting on data outside the dataset I would need the last 60 points as well. Am I doing something wrong with the way I am predicting?
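Without running the notebooks, one common cause of a "weird" out-of-sample graph is recursive forecasting: each predicted point is fed back in as an input for the next one, so errors compound and the curve flattens or drifts. A minimal sketch of that loop, assuming a trained Keras model and an already-scaled 1-D series (both stand-ins):

import numpy as np

LOOKBACK = 60

def recursive_forecast(model, scaled_series, n_steps):
    # Predict n_steps past the end of scaled_series by feeding each
    # prediction back in as the newest observation of the 60-point window.
    history = list(scaled_series[-LOOKBACK:])
    preds = []
    for _ in range(n_steps):
        x = np.array(history[-LOOKBACK:], dtype="float32").reshape(1, LOOKBACK, 1)
        next_val = float(model.predict(x, verbose=0)[0, 0])
        preds.append(next_val)
        history.append(next_val)
    return np.array(preds)

If the predictions come from a loop like this, the strange shape may be expected behaviour of recursive forecasts rather than a bug in the code.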

LSTM prediction normalization

I am currently trying Ethereum price prediction with an LSTM network in Keras and I have a little problem.
I am trying to predict the 10 following prices from the 40 previous prices. The time span between prices is 15 seconds, so I am predicting 2.5 minutes into the future.
My input data are: closing price, open price, lowest price and highest price. I normalize the data between 0 and 1, since my activation function in the fully connected layer is linear. I normalize each sequence independently based on the minimum and maximum value of the whole sequence.
This model works fine when I test it with past data, since I have both input prices and the prices I am trying to predict, so when I denormalize the data based on minimum and maximum of the whole sequence, the prediction is pretty accurate.
However when I try to predict live data, I do not have the last 10 prices from the sequence, so I have to denormalize my output based on only 40 input values, which is where the problem is. The live result will never be as good as the testing result.
I tried to normalize my data based on 40 input prices from the sequence at the training, but it kind of threw off my whole prediction.
I also tried to normalize between -1 and 1 based on the 40 input values and change my output activation function to tanh, but it still did not work correctly. My prediction mostly showed a straight line with no increase or decrease in value.
My training data consists of around 61,000 rows of prices; I took 85% as the training set, 20% of the training set as the validation set, and 15% as the testing set.
Do you have any idea where the problem could be? I can also provide some data about my architecture and training if needed.
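One way to remove the train/live mismatch is to compute the normalization statistics from the 40 input prices only, in both training and inference, and reuse exactly those statistics to denormalize the 10 outputs. A minimal sketch, where the window shape and the model are stand-ins:

import numpy as np

def normalize_window(window):
    # Scale the 40-step input window to [0, 1] using only its own min and max,
    # so exactly the same statistics are available at live-prediction time.
    lo, hi = window.min(), window.max()
    return (window - lo) / (hi - lo), lo, hi

def denormalize(preds, lo, hi):
    return preds * (hi - lo) + lo

# Hypothetical usage: a 40x4 window of [close, open, low, high] prices.
window = np.random.rand(40, 4) * 1800 + 1500
scaled, lo, hi = normalize_window(window)
# predicted = model.predict(scaled[np.newaxis, ...])   # model is assumed
# prices = denormalize(predicted, lo, hi)

If training on input-only statistics throws the fit off, predicting price differences or returns instead of absolute prices is another commonly used workaround.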

xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:
train is a pandas dataframe with 100k rows and 50 features as columns.
target is a pandas series
import xgboost as xgb

# train and target are the pandas objects described above; test holds the rows to score.
xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)
predictions = xgb_classifier.predict(test)
However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?
Data clarification:
~50 numerical features with a numerical target
I've also tried RandomForestRegressor from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?
This question has received several responses, both in this thread and elsewhere.
I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.
I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7000 columns) because I did not have enough local memory for the algorithm. It turned out that, for me, the array of predicted values was just an array of the average values of the target variable. This suggests the model may have been underfitting. One solution to an underfitting model is to train it on more data, so I tried my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at was predicted from rows with very little information (e.g. 0s and NaNs). For rows with very little information, it seems reasonable to predict the average value of the target feature.
None of the other suggested solutions I came across were helpful for me. To summarize, some of the suggested solutions included:
1) check if gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small.
One possible reason is that you're applying a heavy penalty through the gamma parameter. Compare the mean of your training response variable and check whether the predictions are close to it. If they are, the model is restricting its predictions too much in order to keep train-rmse and val-rmse as close as possible. The higher the value of gamma, the simpler the model, so you end up with something close to a naive prediction such as the mean of the training set.
Isn't max_depth=3 too small? Try increasing it; the default is 6, if I remember correctly. Also keep silent at 0 so you can monitor the error at each boosting round.
You need to post a reproducible example for any real investigation. It's entirely likely that your response target is highly unbalanced and that your training data is not super predictive, thus you always (or almost always) get one class predicted. Have you looked at the predicted probabilities at all to see if there is any variance? Is it just an issue of not using the proper cut-off for classification labels?
Since you said that a RF gave reasonable predictions, it would be useful to see your training parameters for that. At a glance, it's curious why you're using a regression objective function in your xgboost call though -- that could easily be why you are seeing such poor performance. Try changing your objective to 'binary:logistic'.
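A sketch of that change, reusing the train, target and test objects from the question and inspecting the probabilities for variance:

import xgboost as xgb

# Classification objective instead of 'reg:linear'; everything else as in the question.
xgb_classifier = xgb.XGBClassifier(n_jobs=-1, max_depth=3,
                                   objective='binary:logistic', n_estimators=100)
xgb_classifier.fit(train, target)

proba = xgb_classifier.predict_proba(test)[:, 1]   # look for variance here
labels = (proba >= 0.5).astype(int)                # adjust the cut-off for imbalanced targets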
You should check there are no inf values in your target.
Try to increase (significantly) min_child_weight in XGBoost or min_data_in_leaf in LightGBM:
min_data_in_leaf oof_rmse
20000 0.052998
2000 0.053001
200 0.053002
20 0.053015
2 0.054261
Actually, it may be a case of overfitting masquerading as underfitting. It happens, for instance, with zero-inflated targets such as insurance claims frequency models. One solution is to increase the representation/coverage of rare target levels (e.g. non-zero insurance claims) in each tree leaf by raising the hyperparameter that controls minimum leaf size to a rather large value, such as those in the example above; see the sketch below.
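A sketch of what that looks like in both libraries; the values are illustrative only and need tuning to the data:

import xgboost as xgb
import lightgbm as lgb

# Force each leaf to cover many rows so rare target levels (e.g. non-zero claims)
# cannot end up isolated in tiny, noisy leaves.
xgb_model = xgb.XGBRegressor(min_child_weight=200, n_estimators=500)
lgb_model = lgb.LGBMRegressor(min_child_samples=2000, n_estimators=500)  # sklearn name for min_data_in_leaf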
I just had this problem and managed to fix it. The problem was that I was training with tree_method='gpu_hist', which gave all the same predictions. If I set tree_method='auto' it works properly, but with much longer runtimes. If I set tree_method='gpu_hist' along with base_score=0 it also worked. I think base_score should be about the mean of your target variable.
I have tried all solutions on this page, but none worked.
As I was grouping time series, certain frequencies created gaps in the data.
I solved this issue by filling in all the NaNs.
The hyperparameters you use may also be the cause. Try the default values; in my case the problem was solved by removing the subsample and min_child_weight hyperparameters from params.

how to analyse and predict (machine learning) a time series data set using scikit-learn for Python

I've got a data set like this.
I need to analyse and predict the status column. These are just 2 entries from the training data set. The data set contains a heart-rate pattern (collected at 1-second intervals, 10 numbers altogether), so it's a time series array (correct me if I'm wrong). I just need to know the best way to analyse this data and get a prediction from it. I'm using scikit-learn for my data mining and machine learning.
What I want to know is: what is the best way to analyse this time series data? Should I use a vector-based approach or something else? Example code would really help me understand it.
Feed each point of the heart-rate time series in as a separate column, along with a separate column (feature) for each of the other data points. Do feature normalization (subtract the mean, divide by the standard deviation) for each column over the entire dataset, and feed the result into a classifier.
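A minimal sketch of that layout, with made-up column names (hr_0 … hr_9 for the ten one-second readings, age as an extra feature, status as the label) and a RandomForestClassifier standing in for whichever scikit-learn classifier you pick:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-row stand-in for the training data: one column per
# heart-rate reading, one per extra measurement, and the status label.
df = pd.DataFrame({
    **{f"hr_{i}": [72 + i, 95 + i] for i in range(10)},
    "age": [34, 51],
    "status": [0, 1],
})
X, y = df.drop(columns=["status"]), df["status"]

# StandardScaler performs the (x - mean) / std normalization per column.
clf = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))
clf.fit(X, y)
print(clf.predict(X))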
