SciKit-learn for data driven regression of oscillating data

SciKit-learn for data driven regression of oscillating data - python

Long time lurker first time poster.
I have data that roughly follows a y=sin(time) distribution, but also depends on other variables than time. In terms of correlations, since the target y-variable oscillates there is almost zero statistical correlation with time, but y obviously depends very strongly on time.
The goal is to predict the future values of the target variable. I want to avoid using an explicit assumption of the model, and instead rely on data driven models and machine learning, so I have tried using regression methods from sklearn.
I have tried the following methods (the parameters were blindly copied from examples and other threads):
LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
param_grid={"C": [1e0, 1e1, 1e2, 1e3],
"gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
"gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
n_estimators=250, max_depth=3,
learning_rate=.1, min_samples_leaf=9,
min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
The results fall into two different categories of failure:
The time field is having no effect, probably due to the absence of correlation from the oscillatory behaviour of the target variable. However, secondary effects from other variables allow a modest predictive capability for future time ranges (these other variables have a simple correlation with the target variable)
The when applying predict() to the training time range the prediction is near perfect with respect to the observations, but when given the future time range (for which data was not used in training) the predicted value stays constant.
Below is how I performed the training and testing:
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df['Days'] = (weather_df.index-datetime.datetime(2005,1,1)).days
ts = pd.DataFrame({'Temperature':weather_df['Mean TemperatureC'].ix[:'2015-1-1'],
'Humidity':weather_df[' Mean Humidity'].ix[:'2015-1-1'],
'Visibility':weather_df[' Mean VisibilityKm'].ix[:'2015-1-1'],
'Wind':weather_df[' Mean Wind SpeedKm/h'].ix[:'2015-1-1'],
'Time':weather_df['Days'].ix[:'2015-1-1']
})
start_test = datetime.datetime(2012,1,1)
ts_train = ts[ts.index < start_test]
ts_test = ts
data_train = np.array(ts_train.Humidity, ts_test.Time)[np.newaxis]
data_target = np.array(ts_train.Temperature)[np.newaxis].ravel()
model.fit(data_train.T, data_target.T)
data_test = np.array(ts_test.Humidity, ts_test.Time)[np.newaxis]
pred = model.predict(data_test.T)
ts_test['Pred'] = pred
Is there a regression model I could/should use for this problem, and if so what would be appropriate options and parameters?
(Also, my treatment of the time objects in sklearn is far from elegant, so I am gladly taking advice there.)

Here is my guess about what is happening in your two types of results:
.days does not convert your index into a form that repeats itself between your train and test samples. So it becomes a unique value for every date in your dataset.
As a consequence your models either ignore days (1st result), or your model overfits on the days feature (2nd result) causing the model to perform badly on your test data.
Suggestion:
If your dataset is large enough (it looks like it goes from 2005), try using dayofyear or weekofyear instead, so that your model will have something generalizable from the date information.

Agree with #zemekeneng that time should be module by the corresponding periods like 24hours, 12 months etc.
Beyond that, I'd like to remind using prior knowledge when selecting features or models. Since you already knew that your data is highly likely to follow sin(x), it should be used even in data driven approach.
We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7! then these should be used as features. None of the models that you used may have included these features. One way to do it would be to create these high order features by yourself and concatenate to your other features. Then a linear model with regulation may give you reasonable results.

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multiclass="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus use the built-in predict_proba().
mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
pd.get_dummies(df[["Education","Gender"]]),
preprocessing.LabelEncoder().fit_transform(df["statement"])
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., probability that Women with Bachelor's Agree with the statement). When I try this, however, I get much different results than I receive from .predict_proba() and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, in place of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?

I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
=
= 0.737007424626824
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function

How to get k-fold cross validation final model with sklearn

Once I iterated on each training combination, given the k-fold split, I can estimate mean and standard deviation of models performance but I actually get k different models (with their own fitted parameters). How do I get the final, whole model? Is a matter of averaging the parameters?
Not showing code because is a general question so I'll write down the logic only:
dataset
splitting dataset according to the k-fold theory (let's say k = 5)
iterations: training from the first to the fifth model
getting 5 different models with, let's say, the following parameters:
model_1 = [p10, p11, p12] \
model_2 = [p20, p21, p22] |
model_3 = [p30, p31, p32] > param_matrix
model_4 = [p40, p41, p42] |
model_5 = [p50, p51, p52] /
What about model_final: [pf0, pf1, pf2]?
Too trivial solution 1:
model_final = mean(param_matrix, axis=0)
Too trivial solution 2:
model_final = the one of the fives that reach the highest performance
(could be a overfit rather than the optimal one)

First of all, the purpose of cross-validation (K-fold) is model checking, not model building.
In your question, you said that every fold of your program has different parameters, maybe this is not the best way to work.
One possibility to proceed is evaluate every model (each one with different parameters) using K-fold inside (using GridSearchCV); if you see that you obtain similar values of accuracy or other metrics in each split, then you are not overfitting.
Make this methodology for every model you have, and chose the one you obtain better results. Of course, always there is possibility to overfit, but with K-fold, you reduce it.
Finally, once you have checked with cross-validation that you obtain similar metrics for every split and you have chosed the model parameters, you have to train your model with all your training data; and you will finally obtain one unique model.

what is Gridsearch.cv_results_ , could any explain all the things in that i.e mean_test_score etc .?

I am doing hyperparameter tuning with GridSearchCV for Decision Trees. I have fit the model and I am trying to find what does exactly Gridsearch.cv_results_ gives. I have read the documentation but still its not clear. Could anyone explain this attribute?
my code is below :
depth={"max_depth":[1,5,10,50,100,500,1000],
"min_samples_split":[5,10,100,500]}
DTC=DecisionTreeClassifier(class_weight="balanced")
DTC_Grid=GridSearchCV(DTC,param_grid=depth , cv=3, scoring='roc_auc')
DTC_Bow=DTC_Grid.fit(xtrain_bow,ytrain_bow)

DTC_Bow.cv_results_ returns a dictionary of all the evaluation metrics from the gridsearch. To visualize it properly, you can do
pd.DataFrame(DTC_Bow.cv_results_)
In your case, this should return a dataframe with 28 rows (7 choices for max_depth times 4 choices for min_samples_split). Each row of this dataframe gives the gridsearch metrics for one combination of these two parameters. Remember the goal of a gridsearch is to select which combination of parameters will have the best performance metrics. This is the purpose of cv_results_.
You should have one column called param_max_depth and another called param_min_samples_leaf referencing the value of the parameter for each row. The combination of two is summarized as a dictionary in the column params.
Now to the metrics. Default value for return_train_score was True up until now but they will change it to False in version 0.21. If you want the train metrics, set it to True. But usually, what you are interested in are the test metrics.
The main column is mean_test_score. This is the average of columns split_0_test_score, split_1_test_score, split_2_test_score (because you are doing a 3 fold split in your gridsearch). If you do DTC_Bow.best_score_ this will return the max value of the column mean_test_score. Column rank_test_score ranks all parameter combinations by the values of mean_test_score.
You might also want to look at std_test_score which is the standard deviation of split_0_test_score, split_1_test_score, split_2_test_score. This might be of interest if you want to see how consistently your set of parameters is performing on your hold-out data.
As mentioned, you can have the metrics on the train set as well provided you set return_train_score = True.
Finally, there are also time columns, that tell you how much time it took for each row. It measures how much time it took to train the model (mean_fit_time, std_fit_time) and to evaluate it (mean_score_time, std_score_time). This is just a FYI, usually, unless time is a bottleneck, you would not really look at these metrics.

How to handle Shift in Forecasted value

I implemented a forecasting model using LSTM in Keras. The dataset is 15mints seperated and I am forecasting for 12 future steps.
The model performs good for the problem. But there is a small problem with the forecast made. It is showing a small shift effect. To get a more clear picture see the below attached figure.
How to handle this problem.? How the data must be transformed to handle this kind of issue.?
The model I used is given below
init_lstm = RandomUniform(minval=-.05, maxval=.05)
init_dense_1 = RandomUniform(minval=-.03, maxval=.06)
model = Sequential()
model.add(LSTM(15, input_shape=(X.shape[1], X.shape[2]), kernel_initializer=init_lstm, recurrent_dropout=0.33))
model.add(Dense(1, kernel_initializer=init_dense_1, activation='linear'))
model.compile(loss='mae', optimizer=Adam(lr=1e-4))
history = model.fit(X, y, epochs=1000, batch_size=16, validation_data=(X_valid, y_valid), verbose=1, shuffle=False)
I made the forecasts like this
my_forecasts = model.predict(X_valid, batch_size=16)
Time series data is transformed to supervised to feed the LSTM using this function
# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
n_vars = 1 if type(data) is list else data.shape[1]
df = DataFrame(data)
cols, names = list(), list()
# input sequence (t-n, ... t-1)
for i in range(n_in, 0, -1):
cols.append(df.shift(i))
names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
# forecast sequence (t, t+1, ... t+n)
for i in range(0, n_out):
cols.append(df.shift(-i))
if i == 0:
names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
else:
names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
# put it all together
agg = concat(cols, axis=1)
agg.columns = names
# drop rows with NaN values
if dropnan:
agg.dropna(inplace=True)
return agg
super_data = series_to_supervised(data, 12, 1)
My timeseries is a multi-variate one. var2 is the one that I need to forecast. I dropped the future var1 like
del super_data['var1(t)']
Seperated train and valid like this
features = super_data[feat_names]
values = super_data[val_name]
ntest = 3444
train_feats, test_feats = features[0:-n_test], features[-n_test:]
train_vals, test_vals = values [0:-n_test], values [-n_test:]
X, y = train_feats.values, train_vals.values
X = X.reshape(X.shape[0], 1, X.shape[1])
X_valid, y_valid = test_feats .values, test_vals .values
X_valid = X_valid.reshape(X_valid.shape[0], 1, X_valid.shape[1])
I haven't made the data stationary for this forecast. I also tried taking difference and making the model as stationary as I can, but the issue remains the same.
I have also tried different scaling ranges for the min-max scaler, hoping it may help the model. But the forecasts are getting worsened.
Other Things I have tried
=> Tried other optimizers
=> Tried mse loss and custom log-mae loss functions
=> Tried varying batch_size
=> Tried adding more past timesteps
=> Tried training with sliding window and TimeSeriesSplit
I understand that the model is replicating the last known value to it, thereby minimizing the loss as good as it can
The validation and training loss remains low enough through out the training process. This makes me think whether I need to come up with a new loss function for this purpose.
Is that necessary.? If so what loss function should I go for.?
I have tried all the methods that I stumbled upon. I can't find any resource at all that points to this kind of issue. Is this the problem of data.? Is this because the problem is very hard to be learned by a LSTM .?

you asked for my help at:
stock prediction : GRU model predicting same given values instead of future stock price
Hope not late. What you can try is that you can divert the numerical explicitness of your features. Let me explain:
Similar to my answer in the previous topic; the regression algorithm will use the value from the time-window you give as a sample, to minimize the error. Let's assume you are trying to predict the closing price of BTC at time t. One of your features consists of previous closing prices and you are giving a time-series window of last 20 inputs from t-20 to t-1. A regressor probably will learn to choose the closing value at time step t-1 or t-2 or a close value in this case, cheating. Think like that: if closing price was $6340 at t-1, predicting $6340 or something close at t+1 would minimize the error at strongest. But actually the algorithm did not learn any patterns; it just replicates, so it basically does nothing but accomplishing its optimization duty.
Think analogously from my example: By diverting the explicitness, what I mean is that: do not give the closing prices directly, but scale them or do not use explicit ones at all. Do not use any features explicitly showing the closing prices to the algorithm, do not use open, high, low etc for every time step. You will need to be creative here, engineer the features to get rid of explicit ones; you can give squared close differences (regressor can still steal from past with linear differences, with experience), its ratio to volume. Or, can make the features categorical by digitizing them in a manner that would make sense to use. The point is do not give direct intuition to what it should predict, only provide patterns for algorithm to work on.
A faster approach may be suggested depending on your task. You can do multi-class classification if predicting how much percent of change that your labels is enough for you, just be careful about class imbalance situations. If even just the up/down fluctuations are enough for you, you can directly go for the binary classification. Replication or shifting problems are only seen at the regression tasks, if you are not leaking data from training to the test set. If possible, get rid out of regression for time-series windowed applications.
If anything misunderstood or missing, I will be around. Hope I could help. Good Luck.

Most likely your LSTM is learning to guess roughly what its previous input value was (modulated a bit). That's why you see a "shift".
So let's say your data looks like:
x = [1, 1, 1, 4, 5, 4, 1, 1]
And your LSTM learned to just output the previous input for the current timestep. Then your output would look like:
y = [?, 1, 1, 1, 4, 5, 4, 1]
Because your network has some complicated machinery it is not quite this straightforward but in principle the "shift" you see is caused by this phenomenon.

Drastically different feature importance between very same data and very similar model for catboost

Let me first explain about data set that I am using.
I have three set.
train with shape of (1277, 927), target is present about 12% of time
Eval set with shape of (174, 927), target is present about 11.5% of time
Hold out set with shape of (414, 927), target is present about 10% of time
This set is also building using time slices. Train set is oldest data. Hold out set is newest data. and Eval set is in middle set.
Now I am building two models.
Model1:
# Initialize CatBoostClassifier
model = CatBoostClassifier(
# custom_loss=['Accuracy'],
depth=9,
random_seed=42,
l2_leaf_reg=1,
# has_time= True,
iterations=300,
learning_rate=0.05,
loss_function='Logloss',
logging_level='Verbose',
)
## Fitting catboost model
model.fit(
train_set.values, Y_train.values,
cat_features=categorical_features_indices,
eval_set=(test_set.values, Y_test)
# logging_level='Verbose' # you can uncomment this for text output
)
predicting on hold out set.
Model2:
model = CatBoostClassifier(
# custom_loss=['Accuracy'],
depth=9,
random_seed=42,
l2_leaf_reg=1,
# has_time= True,
iterations= 'bestIteration from model1',
learning_rate=0.05,
loss_function='Logloss',
logging_level='Verbose',
)
## Fitting catboost model
model.fit(
train.values, Y.values,
cat_features=categorical_features_indices,
# logging_level='Verbose' # you can uncomment this for text output
)
Both model is identical except iterations. First model has fix 300 round, but it will Shrink model to bestIteration. Where second model uses that bestIteration from model1.
However, When I compare feature importance. It looks drastically difference.
Feature Score_m1 Score_m2 delta
0 x0 3.612309 2.013193 -1.399116
1 x1 3.390630 3.121273 -0.269357
2 x2 2.762750 1.822564 -0.940186
3 x3 2.553052 NaN NaN
4 x4 2.400786 0.329625 -2.071161
As you can see one of feature x3 which was on top3 in first model, dropped off in second model. Not only that but there is large shift in weights between models for given feature. There are about 60 features that are present in model1 are not present in model2. And there about 60 features that present in model2 are not present in model1. delta is difference between Score_m1 and Score_m2. I have seen where model changes score little bit not this drastic. AUC and LogLoss doesn't change that much when I use model1 or model2.
Now I have following questions regarding this situation.
Is this models are instable due to small number of sample and large number of features. If this is case, how to check for this?
Are there feature in this model are just not giving that much information regarding model outcome and there is random change that it is creating split. If this case how to check for this situation?
This catboost is right model for this situation ?
Any help regarding this issue will be appreciated

Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times but with different random seeds. Use the results to determine which feature seems to be the least important. (How many features do you have?)
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.