Understanding the input data format for an LSTM model - python

I'm facing a problem I can't solve. I'm trying to create an LSTM model with Keras, but I don't understand what the input data format should be.
My training and test data look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-60: i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that, how do I predict my feature without modifying the test data set?
I have a hard time understanding why we do this. Moreover, I would like to use the other values to predict the target in the last column. With this method I feel like I have to change the format of the test data, and it's important that I can test the model on different test data without having to change it.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That's what I did, but the shape still remains 2D. So I used:
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
But when I fit the model with the generator:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
I really don't understand.

I'm not sure I fully understand your question, but I will try my best.
I think the code you provided is problem-specific, meaning it may not be suitable for your implementation.
For an LSTM (and for pretty much any neural network) you generally want to scale your data before feeding it to the model. This helps avoid having completely different data ranges across your features. The MinMaxScaler scales your features to the range provided. For an explanation of why you need scaling, you can have a look at this article.
Usually, you want to first split your dataset into training and testing sets, using for example the train_test_split function of sklearn, then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict",axis=1)
y = data["feature_I_want_to_predict"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represents your training data and y_train represents the labels for the training data (and similarly for the test data).
Here I used the StandardScaler instead of the MinMaxScaler. The standard scaler subtracts the mean of each feature, then divides by its standard deviation.
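On the input format itself: a Keras LSTM layer expects 3D input of shape (samples, timesteps, features), so the scaled 2D table still has to be cut into windows. The following is only a minimal sketch using NumPy; make_windows, n_steps=10 and the random arrays are hypothetical stand-ins for your own scaled features and target, not part of the original code.

```python
import numpy as np

def make_windows(features, target, n_steps):
    """Slice a 2D feature array into overlapping windows for an LSTM.

    Hypothetical helper: each sample holds the n_steps rows that come
    before the row whose target value we want to predict.
    """
    X, y = [], []
    for end in range(n_steps, len(features)):
        X.append(features[end - n_steps:end])  # rows end-n_steps .. end-1
        y.append(target[end])                  # target at row `end`
    return np.array(X), np.array(y)

# Stand-in data: 100 time steps, 4 scaled feature columns, 1 target column.
rng = np.random.default_rng(0)
features = rng.random((100, 4))
target = rng.random(100)

X_windows, y_windows = make_windows(features, target, n_steps=10)
print(X_windows.shape)  # (samples, timesteps, features)
print(y_windows.shape)  # (samples,)
```

With windows like these, the first layer would take input_shape=(n_steps, n_features), here (10, 4), rather than the number of rows in the data set. The same function can be applied unchanged to any new test table that has the same columns.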

Related

Strange behaviour of 'inverse_transform' function in sklearn.preprocessing.MinMaxScaler

I used the MinMaxScaler class in sklearn.preprocessing to normalize the attributes of some of my variables (arrays) for use in a model (linear regression). After creating and training the model,
I tested it with x_test (split using train_test_split) and stored the result in a variable (say predicted). To evaluate my predictions against the original dataset, I used the "MinMaxScaler.inverse_transform" function. That function works well when my code is in the order below:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
When I changed the order as in the code below, it throws this error:
non-broadcastable output operand with shape (379,1) doesn't match the broadcast shape (379,13)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
x_test=sc.fit_transform(x_train)
Please compare the two photos for a better understanding of my query.
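It can be seen from the linked printscreen figure that you use the same MinMaxScaler to fit and transform both the train and test x-data, and also the training y-data (which does not make sense). Each call to fit_transform() refits the scaler, so it only remembers the data it was fitted on last. In your second ordering that is the 13-column x-data, so calling inverse_transform() on a one-column prediction fails with the broadcast error.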
The correct process would be
Fit the scaler with train x-data. The fit_transform() also transforms (scales) the x_train.
sc = MinMaxScaler(feature_range=(0,1))
x_train = sc.fit_transform(x_train)
Scale also the test x-data with the same scaler. Do not fit here; just scale/transform.
x_test = sc.transform(x_test)
If you think scaling is needed also for y-data, you will have to fit another scaler for that purpose. It could also be that there is no need for scaling the y-data.
# Option A: Do not scale y-data
# (do nothing)
# Option B: Scale y-data
sc_y = MinMaxScaler(feature_range=(0,1))
y_train = sc_y.fit_transform(y_train)
After you have trained your model (lr), you can make predictions with the scaled x_test and the model:
# Option A:
predicted = lr.predict(x_test)
# Option B:
y_test_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_test_scaled)

ANN: keras/sklearn doesn't scale well

I don't know why the results I have obtained aren't scaled well.
As you can see in the pictures below, there is a problem with scaling.
There are two issues:
There are no negative values
There is problem with maximum values prediction
I have no idea why I have these problems.
Do you have any ideas how I can fix this issue?
I would be very grateful for your help.
CODE:
# Read inputs
X = dataset.iloc[0:20000, [1, 4, 10]].values
# Read output
y = dataset.iloc[0:20000, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Output matrix conversion
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
y_test = sc.transform(y_test)
# Import the Keras libraries and package
from keras.models import Sequential
from keras.layers import Dense
# building model
classifier = Sequential()
classifier.add(Dense(activation="sigmoid", input_dim=3, units=64, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=32, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=16, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
classifier.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
# Fitting the ANN to the training set
results = classifier.fit(X_train, y_train, batch_size=16, epochs=25)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
If your blue curves show your initial output y and your orange ones the output of your model (you have not cared to clarify this...), then there is nothing strange here...
There is problem with maximum values prediction
Looking more carefully at your code, you will realize that you don't actually feed your initial y into your network, but its scaled version, i.e. the result of sc.transform(); hence, your output is also scaled, and you should use the inverse_transform method to get it back to the initial scale:
y_final = sc.inverse_transform(y_pred)
BTW, this will happen to work now, but in general it is not a good idea to use the same scaler (sc here) for two different datasets (i.e. your X's and y's) - you should define two different scalers instead, say sc_X and sc_y.
There are no negative values
That is because the sigmoid function you have used as the activation in your output layer only produces values in (0, 1), so you may want to change it to something else that can give the required value range (linear is a candidate), and possibly also change your other sigmoids to tanh.
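The range argument can be checked numerically. A small NumPy sketch, with a hand-written sigmoid and a made-up target vector y (both are illustrative assumptions, not taken from the question's data):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, written out by hand for illustration."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 201)
sig = sigmoid(z)
tanh = np.tanh(z)
print(sig.min(), sig.max())    # stays strictly inside (0, 1)
print(tanh.min(), tanh.max())  # stays strictly inside (-1, 1)

# A standardized target has mean 0, so part of it is negative
# and part of it can exceed 1: a sigmoid output cannot reach either.
y = np.array([3.0, 5.0, 9.0, 15.0])
y_standardized = (y - y.mean()) / y.std()
print(y_standardized)
```

A linear output layer has no such bound, which is why it suits a regression target that was scaled with StandardScaler.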

Add data to MNIST dataset

I am doing a machine learning project to recognize handwritten digits. Actually, I just want to add a few more samples to MNIST, but I am unable to do so.
I have done the following:
n_samples = len(mnist.data)
x = mnist.data.reshape((n_samples, -1))# array of feature of 64 pixel
y = mnist.target # Class label from 0-9 as there are digits
img_temp_train=cv2.imread('C:/Users/amuly/Desktop/Soap/crop/2.jpg',0)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
#Now I want to add the img_temp_train to my dataset for training.
X_train=np.append(X_train,img_temp_train.reshape(-1))
y_train=np.append(y_train,[4.0])
The length after training is:
43904784 (X_train)
56001(y_train)
But it should be 56001 for both.
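Try this:
X_train = np.append(X_train, img_temp_train.reshape(1, -1), axis=0)
Without the axis argument, np.append flattens both arrays before joining them, which is why X_train ended up with 43904784 elements (56000 * 784 + 784). The new image has to be added as a single row of shape (1, 784) so that it matches the columns of X_train.
You shouldn't be reshaping things willy-nilly without thinking about what you're doing first!
Also, it's usually a better idea to use concatenate:
X_train = np.concatenate((X_train, img_temp_train.reshape(1, -1)), axis=0)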
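The difference is easy to see on a small stand-in. In this sketch the shapes are assumptions chosen to mirror the question: X_train has 784 columns and the new image is 28x28; the arrays themselves are dummies.

```python
import numpy as np

X_train = np.zeros((5, 784))  # stand-in for the real training matrix
y_train = np.zeros(5)
img = np.ones((28, 28))       # stand-in for the image read with cv2

# Without axis, np.append flattens everything into one long 1D array.
flattened = np.append(X_train, img.reshape(-1))
print(flattened.shape)

# With a (1, 784) row and axis=0, the image becomes one extra sample.
stacked = np.concatenate((X_train, img.reshape(1, -1)), axis=0)
labels = np.append(y_train, [4.0])
print(stacked.shape, labels.shape)
```

After the second version, the number of rows in the feature matrix and the number of labels agree again.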

Found input variables with inconsistent numbers of samples when fitting LogisticRegression

I am creating LogisticRegression classifier with the following code:
regressor = LogisticRegression()
regressor.fit(x_train, y_train)
Both x_train and y_train shapes are
<class 'tuple'>: (32383,)
x_train contains values around range [0..1], and y_train contains only 0s and 1s.
Unfortunately, fit fails with error
ValueError: Found input variables with inconsistent numbers of samples: [1, 32383]
Adding transpose to arguments doesn't help.
To continue the solution that I proposed in my comment:
The problem is the shape of the x_train. So we need to reshape it:
From the documentation:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
Example using scikit-learn and numpy:
from sklearn.linear_model import LogisticRegression
import numpy as np
# create example tuple data: values in [0, 1) for x, binary labels for y
x_train = tuple(i / 32383 for i in range(32383))
y_train = tuple(i % 2 for i in range(32383))
# convert the tuples to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
# reshape x_train to (n_samples, n_features) = (32383, 1)
x_train = x_train.reshape(32383, 1)
# y_train keeps its 1D shape (32383,)
y_train.shape
# create the model
lg = LogisticRegression()
# fit the model
lg.fit(x_train, y_train)
This should work fine.
Hope it helps
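The same fix can be written without hard-coding the number of samples. A minimal sketch, assuming a single feature stored as a tuple; the random data, the 0.5 threshold used to make labels, and the names X, y and clf are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one feature in [0, 1) and binary labels, both as tuples.
rng = np.random.default_rng(0)
x_train = tuple(rng.random(1000))
y_train = tuple(int(v > 0.5) for v in x_train)

# reshape(-1, 1) turns the 1D feature into (n_samples, 1);
# the -1 lets NumPy work out the number of samples itself.
X = np.asarray(x_train).reshape(-1, 1)
y = np.asarray(y_train)  # the target stays 1D: (n_samples,)
print(X.shape, y.shape)

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```

Passing the 1D array directly is what produces the "[1, 32383]" in the error message: scikit-learn ends up treating the input as one sample with 32383 values instead of 32383 samples with one feature.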
I guess a little reshaping is necessary. I tried it like this :
from sklearn.linear_model import LogisticRegression
import numpy as np
#x_train = np.random.randn(10,1)
x_train = np.asarray(x_train).reshape(32383,1)
con = np.ones_like(x_train)
x_train = np.concatenate((con,x_train), axis =1)
#y = np.random.randn(10,1)
#y_train = np.where(y<0.5,1,0)
y_train = np.asarray(y_train).ravel()  # 1D target, shape (32383,)
regressor = LogisticRegression()
regressor.fit(x_train,y_train)
The commented-out lines are just what I used to create some test data. Note that sklearn's LogisticRegression fits an intercept by default (fit_intercept=True), so the constant column is not required for it; it is needed for statsmodels, which does not add one automatically. Statsmodels can also be helpful if you are interested in statistical tests and a nicely printed summary of the results:
from statsmodels.api import Logit
logit =Logit(y_train, x_train)
fit= logit.fit()
fit.summary()
That will give you a little more statistical intel without much effort.

Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve R2 score as 0.98 (3 trials on average) while Program 2 only achieve R2 score as 0.84 (3 trials on average.) Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. That is the wrong usage.
You should call fit_transform() on X_train and then call only transform() on X_test, because that's what the Pipeline does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
To elaborate on my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on the test data wrongly scales it to a different scale than the one used on the training data, and that is the source of the change in predictions.
Edit: To answer the question in comment:
See, fit_transform() is just a shortcut for doing fit() and then transform(). For StandardScaler, fit() doesn't return any transformed data (it returns the fitted scaler itself); it just learns the mean and standard deviation of the data. Then transform() applies what was learned to the data and returns the new scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (Basically equal to Scenario 1, reversing the scaling and splitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try either scenario, and maybe it will increase the performance of your model.
But there is one very important thing missing from them. When you scale the whole data and then divide it into train and test, it is assumed that you know the test (unseen) data, which will not be true in real-world cases. This will give you results that do not match real-world results, because in the real world the whole of the data will be your training data. It may also lead to over-fitting, because the model already has some information about the test data.
So when evaluating the performance of machine learning models, it is recommended that you set aside the test data before performing any operations on the data. Because it is our unseen data, we know nothing about it. So the ideal path of operations would be the one I answered, i.e.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
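The four steps can be sketched as follows, together with the wrong variant for comparison. The random data and the name X_test_refit are assumptions made for illustration; only the fit_transform()/transform() pattern comes from the answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 500 samples, 2 features with mean 10 and SD 3.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(500, 2))

# 1) split first
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# 2) learn the mean and SD from the training data only
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)

# 3) reuse the training mean and SD for the test data
X_test_scaled = scale.transform(X_test)

# Wrong variant: refitting on the test data uses the test mean and SD.
X_test_refit = StandardScaler().fit_transform(X_test)

print(X_test_scaled.mean(axis=0))  # close to 0, but not exactly 0
print(X_test_refit.mean(axis=0))   # forced to 0 by the refit
```

The two versions of the test set differ, so a model trained on X_train_scaled sees inputs on a slightly different scale when it is given X_test_refit, which is the effect described for Program 2.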
Hope it makes sense to you.
