Here: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/ under the paragraph "LSTM Network for Regression", the author inverts the predictions inside the LSTM-RNN code. If I remove those lines of code, the resulting predictions are useless; the model does not predict anything meaningful. So my question is: what does the code that inverts the predictions actually do? Why does he use it?
In the field of time series forecasting, the raw data generally have large values. For example, in load forecasting, the load value at each moment is on the order of tens of thousands. To speed up the convergence of the model, we generally need to normalize the original data, for example by using MinMaxScaler to rescale all the data to the range [0, 1].
It is worth noting that after normalizing the data, the values predicted by the model will also lie in the range [0, 1] (if the model converges well). At that point the model's predictions cannot be used directly (a real load value cannot lie in [0, 1]), so we need to inverse-normalize the predictions, that is, apply inverse_transform.
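For illustration, here is a minimal sketch of the scale -> predict -> inverse_transform round trip with scikit-learn's MinMaxScaler (the load values and the "prediction" are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

load = np.array([[23000.0], [25000.0], [21000.0], [27000.0]])   # raw load values

scaler = MinMaxScaler(feature_range=(0, 1))
load_scaled = scaler.fit_transform(load)    # all values now lie in [0, 1]

# ... train the model on load_scaled; suppose it then predicts this scaled value
pred_scaled = np.array([[0.42]])

pred = scaler.inverse_transform(pred_scaled)   # back on the original load scale
print(pred)   # about 23520, i.e. 21000 + 0.42 * (27000 - 21000)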
I'm running n trials on a Keras model with k features, after which I apply SHAP's DeepExplainer to the model on each trial. All the data is the same, but it is randomly split between the training and testing sets. I'm trying to figure out the best way to combine the model outputs: either directly, by adding the Shapley values for each trial, feature by feature, and then averaging, or by first scaling the Shapley values output by each trial and then adding and averaging them.
My initial thought was that, since the "baseline is always relative based on the average of all predictions" (from here), the overall average would be skewed and there might be a better way of combining the data. On the other hand, I wonder whether, despite the different samples in each train/test split and a different relative "baseline" for each model, averaging over many models would give a final averaged model with as much interpretive value as a single model. Should this be the case?
However, would scaling the features per model offer any benefit? Again from here, I can (caveats aside) scale a feature's Shapley values for a single observation in a model. It seems, then, that I should be able to scale each feature's Shapley values after summing over all observations, over each bin, so that all Shapley values for each feature sum to 1 (roughly sketched in code further below). If I can scale by feature within the model, can I average the models this way? I am thinking one benefit would be that all models then have equal weight, since the features are scaled within each. Is this a valid approach, and if so, does it offer any benefit over adding all the Shapley values, feature by feature, over all models?
To be clear on what I mean concerning the bins: they are the lists returned from the explainer, equal in number to the number of classifications:
explainer = shap.DeepExplainer(model, X_train)
ShapleyBinVals = explainer.shap_values(X_test)
Bin = ShapleyBinVals[n]
where n indexes the output classifications. Here's a bar plot of the scaled output:
Notice that for each feature, e.g. PSWQ_2, the y-value is a percentage and the sum of the percentages over all bins is 1.
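Roughly, the per-feature scaling I have in mind looks like this (a sketch only; the function name is illustrative, and I sum absolute Shapley values over the observations):

import numpy as np

def scale_shap_per_feature(shapley_bin_vals):
    # shapley_bin_vals: the list returned by explainer.shap_values(X_test),
    # one (n_samples, n_features) array per bin/output classification
    totals = np.abs(np.stack(shapley_bin_vals)).sum(axis=1)   # (n_bins, n_features)
    # normalise each feature so its values across the bins sum to 1
    return totals / totals.sum(axis=0, keepdims=True)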
I want to use an autoencoder for dimension reduction in Keras. The input is a table with discrete values 0, 1, 2, 3, 4 in the columns (each of these numbers represents a category). Each subject has a label 0/1 to indicate sick/healthy. Now I have two questions:
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
I don't know if this kind of input variable needs normalization (and if the answer is yes, how?)
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
The activation in the last layer should be sigmoid, and you should use the binary_crossentropy loss function for training.
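A minimal sketch of what that looks like (the layer sizes here are placeholders, not a recommendation):

from tensorflow import keras
from tensorflow.keras import layers

input_dim = 20   # number of input columns (placeholder)

autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(8, activation='relu'),             # encoder / bottleneck
    layers.Dense(input_dim, activation='sigmoid'),  # decoder reconstructs inputs in [0, 1]
])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')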
I don't know if this kind of input variable needs normalization (and if the answer is yes, how?)
It depends on the nature of the discrete values you mentioned. As you know, the inputs to a neural network represent the "intensity" of each neuron; higher values mean the neuron is more active. So categorical values as input to a NN only make sense if they map to a continuous range. For example, if excellent=3, good=2, bad=1, terrible=0, it is okay to feed these values to a NN, because it makes sense to compute f(wx+b) (the intensity of the neuron): a value of 1.5 means somewhere between bad and good.
However, if the categorical values are purely nominal, with no relationship between them (for example apple=1, orange=2, banana=3), it really doesn't make sense to compute f(wx+b). In this case, what would a value of 1.5 mean? For this type of data you should convert the inputs to a binary (one-hot) encoding. For example, if you have only 3 fruits you can encode them this way:
apple = [1, 0, 0]
orange = [0, 1, 0]
banana = [0, 0, 1]
For this binary conversion, Keras has a utility function: to_categorical.
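A short usage sketch (note that to_categorical expects integer codes starting at 0, so the fruit codes are shifted here):

from tensorflow.keras.utils import to_categorical

codes = [0, 1, 2, 1]                          # e.g. apple=0, orange=1, banana=2
onehot = to_categorical(codes, num_classes=3)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]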
I am currently using the following loss function:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
However, my loss quickly approaches zero since there are ~1000 classes and only a handful of ones for any example (see attached image) and the algorithm is simply learning to predict almost entirely zeroes. I'm worried that this is preventing learning even though the loss continues to creep slightly towards zero. Are there any alternative loss functions that I should consider?
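For a rough sense of scale (numbers made up here to match the ~1000-class, handful-of-positives situation), this is why the mean loss is already tiny when the model just pushes every probability toward zero:

import numpy as np

n_classes, n_pos = 1000, 5
labels = np.zeros(n_classes)
labels[:n_pos] = 1
p = np.full(n_classes, 0.01)   # "predict almost entirely zeroes"

bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
print(bce)   # ~0.03 even though none of the positives are actually found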
Have you tried projecting the multi-label target vector into multiple one-hot vectors?
Bear with me for a moment. For brevity I will build the loss function in numpy.
Apply sigmoids to your model outputs; let's call the result y. These will be the probabilities for each class. Here, for simplicity, I will sample them from a unit uniform distribution.
import numpy as np

y = np.random.uniform(0, 1, size=[5])     # inferred probabilities (after the sigmoids)
y_true = np.array([1, 0, 0, 1, 0])        # multi-label target vector
projection = y_true * np.identity(5)      # expand each label into a separate one-hot vector
cross_entropy = -projection * np.log(y)   # cross entropy for each label
loss = np.sum(cross_entropy)              # sum the cross entropies of the different labels
I believe that now the biggest weight in calculating the gradients will fall on the nonzero elements (the labels), and the gradients will point in the direction that satisfies all the labels.
Am I missing something?
Long-time lurker, first-time poster.
I have data that roughly follows a y = sin(time) distribution, but it also depends on variables other than time. In terms of correlations, since the target y-variable oscillates, there is almost zero statistical correlation with time, but y obviously depends very strongly on time.
The goal is to predict future values of the target variable. I want to avoid an explicit model assumption and instead rely on data-driven models and machine learning, so I have tried using regression methods from sklearn.
I have tried the following methods (the parameters were blindly copied from examples and other threads):
LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                         "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                         "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                          n_estimators=250, max_depth=3,
                          learning_rate=.1, min_samples_leaf=9,
                          min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                  n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
The results fall into two different categories of failure:
The time field has no effect, probably due to the absence of correlation caused by the oscillatory behaviour of the target variable. However, secondary effects from the other variables allow a modest predictive capability for future time ranges (these other variables have a simple correlation with the target variable).
When applying predict() to the training time range, the prediction is near perfect with respect to the observations, but when given the future time range (for which data was not used in training), the predicted value stays constant.
Below is how I performed the training and testing:
import datetime
import numpy as np
import pandas as pd

weather_df.index = pd.to_datetime(weather_df.index, unit='D')
weather_df['Days'] = (weather_df.index - datetime.datetime(2005, 1, 1)).days

ts = pd.DataFrame({'Temperature': weather_df['Mean TemperatureC'].loc[:'2015-1-1'],
                   'Humidity': weather_df[' Mean Humidity'].loc[:'2015-1-1'],
                   'Visibility': weather_df[' Mean VisibilityKm'].loc[:'2015-1-1'],
                   'Wind': weather_df[' Mean Wind SpeedKm/h'].loc[:'2015-1-1'],
                   'Time': weather_df['Days'].loc[:'2015-1-1']})

start_test = datetime.datetime(2012, 1, 1)
ts_train = ts[ts.index < start_test]   # train on everything before 2012
ts_test = ts                           # predict over the whole range

# features: humidity and the day counter; target: temperature
# (model is one of the regressors listed above)
data_train = np.array([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)
model.fit(data_train.T, data_target)

data_test = np.array([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test.T)
ts_test['Pred'] = pred
Is there a regression model I could/should use for this problem, and if so what would be appropriate options and parameters?
(Also, my treatment of the time objects in sklearn is far from elegant, so I am gladly taking advice there.)
Here is my guess about what is happening in your two types of results:
.days does not convert your index into a form that repeats between your train and test samples, so it becomes a unique value for every date in your dataset.
As a consequence, your models either ignore Days (first result) or overfit on the Days feature (second result), causing them to perform badly on your test data.
Suggestion:
If your dataset is large enough (it looks like it goes back to 2005), try using dayofyear or weekofyear instead, so that your model has something generalizable from the date information.
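For example (a sketch, assuming weather_df keeps the DatetimeIndex built with pd.to_datetime above; the new column names are just illustrative):

weather_df['DayOfYear'] = weather_df.index.dayofyear   # repeats every year
weather_df['Month'] = weather_df.index.month           # a coarser seasonal feature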
I agree with #zemekeneng that time should be taken modulo the corresponding period, e.g. 24 hours or 12 months.
Beyond that, I'd like to suggest using prior knowledge when selecting features or models. Since you already know that your data is highly likely to follow sin(x), this should be exploited even in a data-driven approach.
We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7!, so these powers of x should be used as features. None of the models you used are likely to have included such features. One way to do this is to create these higher-order features yourself and concatenate them to your other features; a linear model with regularization may then give you reasonable results.
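A minimal sketch of that idea (synthetic data and hyperparameters; in practice you would concatenate the time powers with your other features):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

t = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)    # "time" feature
y = np.sin(t).ravel() + 0.1 * np.random.randn(500)    # noisy sin(time) target

# powers of t up to degree 7 (the odd ones approximate sin); Ridge adds the regularization
model = make_pipeline(PolynomialFeatures(degree=7), Ridge(alpha=1e-3))
model.fit(t, y)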
I have a dataset that I split in two for training and testing a random forest classifier with scikit-learn.
I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array (87, 344, 2) (it's actually a list of 87 numpy.ndarrays of (344, 2) elements).
Sometimes, when I pick a different subset of samples for training and testing, I only get a 2-dimensional array (87, 344) (though I can't work out in which cases).
My two questions are:
what do these dimensions represent? I worked out that to get a ROC AUC score, I have to take one half of the output (that is, (87, 344, 2)[:,:,1]), transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:,:,1].T)). But I don't understand what it really means.
why does the output change with different subsets of the data? I can't understand in which cases it returns a 3D array and in which cases a 2D one.
classifier.predict_proba() returns the class probabilities. The n dimension of the array will vary depending on how many classes there are in the subset you train on.
Are you sure the arrays you're using to fit the RF have the right shape? They should be (n_samples, n_features) for the data and (n_samples,) for the target classes.
You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for the sample X[r,:]. Note that sum(Y_pred[r,:]) = 1.
However, I think that if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one in the column corresponding to the sample's class, then sklearn takes it as a multi-output prediction problem (each class is considered independently), which I don't think is what you want. In that case, for each class and each sample, it predicts the probability of belonging to that class or not.
Finally, the output indeed depends on the training set, because it depends on the number of classes in the training set. You can get that number with the attribute n_classes_ (and you may also be able to force the number of classes by setting it manually), and you can get the classes' values with the attribute classes_. See the documentation.
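A small synthetic sketch (not your data; 3 classes and 60 samples for brevity, default RandomForestClassifier settings) of how the target encoding changes the shape of predict_proba:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y_labels = rng.randint(0, 3, size=60)   # 1-D vector of class labels (3 classes)
y_onehot = np.eye(3)[y_labels]          # (60, 3) one-hot / indicator matrix

# 1-D labels -> a single (n_samples, n_classes) array
p1 = RandomForestClassifier(random_state=0).fit(X, y_labels).predict_proba(X)
print(np.shape(p1))             # (60, 3)

# one-hot target -> treated as multi-output: a list of n_classes arrays,
# each of shape (n_samples, 2) holding P(not class) and P(class)
p2 = RandomForestClassifier(random_state=0).fit(X, y_onehot).predict_proba(X)
print(len(p2), p2[0].shape)     # 3 (60, 2)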
Hope it helps!