fit and transform error on Cross validation and test data

I need help with the code below. I am trying to fit and transform the train data and then transform the cross-validation and test data, but when I do that I get this error: ValueError: X has 24155 features, but Normalizer is expecting 49041 features as input.
Can someone please help me solve this issue?
My code snippet:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))
print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)

The transform function expects a 2D array of shape (samples, features).
The error indicates that the second dimension differs between the arrays: after reshape(1,-1), X_train['price'] becomes a single sample with 49041 features, while X_cv['price'] becomes a single sample with 24155 features.
As the code shows, you have one feature (price) and many samples, so your input shape should be (n_samples, 1). Change the reshape to (-1,1) instead of (1,-1):
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(-1,1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(-1,1))
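
A quick way to see why the shapes matter is to try both reshapes on small dummy arrays. The sketch below uses made-up prices (not the asker's data) and assumes a reasonably recent scikit-learn. It also shows a side effect worth knowing: Normalizer rescales each row to unit norm, so with one feature per row every positive value becomes 1.0. If the goal is to scale the price column as a whole, a column-wise scaler such as MinMaxScaler or StandardScaler may be closer to what is wanted; that is a suggestion, not part of the answer above.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up prices standing in for X_train['price'] and X_cv['price'].
train_price = np.array([10.0, 250.0, 3.5, 99.0])
cv_price = np.array([42.0, 7.0])

# reshape(1, -1): one sample whose feature count equals the number of rows,
# so train and CV disagree on the second dimension.
print(train_price.reshape(1, -1).shape)  # (1, 4)
print(cv_price.reshape(1, -1).shape)     # (1, 2)

# reshape(-1, 1): many samples, one feature, so train and CV agree on axis 1.
normalizer = Normalizer()
train_norm = normalizer.fit_transform(train_price.reshape(-1, 1))
cv_norm = normalizer.transform(cv_price.reshape(-1, 1))
print(train_norm.shape, cv_norm.shape)   # (4, 1) (2, 1)

# Each row is divided by its own L2 norm, so positive prices all map to 1.0.
print(train_norm.ravel())
```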

Related

How can I solve inverse_transform with shape problem?

Here is my code:
scaler = MinMaxScaler() #default set 0~1
dataset= scaler.fit_transform(dataset)
...
make model
...
predicted = model.predict(X_test) #shape : (5, 1)
When I run predict = scaler.inverse_transform(predicted), a ValueError occurs:
ValueError: non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,2)
My model has 2 features as input.
I tried scaler.inverse_transform(predict)[:, [0]] and reshaping in several directions, but the same ValueError occurs.
How can I solve this problem? Any advice would be very much appreciated.

You are using inverse_transform incorrectly: you applied fit_transform to your features, but you are applying inverse_transform to your predictions, which have a different shape, hence the error.
This is not the intended usage of inverse_transform; have a look at the docs for more:
inverse_transform(self, X)
Undo the scaling of X according to feature_range.
Parameters: X : array-like, shape [n_samples, n_features]
Input data that will be transformed. It cannot be sparse.
It is not clear from your post why you attempt to "transform back" your predictions. This only makes sense if you have already transformed your labels (your post does not say whether you have) and you want, say, to report measures like MSE in the original scale of the labels. In that case, you should use a separate scaler for your labels; see my own answer in How to interpret MSE in Keras Regressor for details (the example there uses StandardScaler, but the rationale is the same).
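
The "separate scaler for the labels" idea can be sketched as follows. This is a minimal illustration with random dummy data, not the asker's pipeline: the names x_scaler and y_scaler are hypothetical, and a slice of the scaled labels stands in for the output of model.predict.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 2))       # two input features
y = rng.uniform(1000, 5000, size=(20, 1))   # one target column

# One scaler per role: each remembers the number of columns it was fit on.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y)

# Stand-in for model.predict(X_test): five scaled predictions, shape (5, 1).
predicted_scaled = y_scaled[:5]

# y_scaler was fit on a single column, so a (5, 1) array is accepted.
predicted = y_scaler.inverse_transform(predicted_scaled)
print(predicted.shape)
```

Calling x_scaler.inverse_transform on the same (5, 1) array would fail, because x_scaler was fit on two columns.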

Why does AdaBoost not work with DecisionTree?

I'm using sklearn 0.19.1 with DecisionTree and AdaBoost.
I have a DecisionTree classifier that works fine:
clf = tree.DecisionTreeClassifier()
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
clf.fit(train_pdf_x, train_pdf_y)
pred2 = clf.predict(test_pdf_x)
But when trying to add AdaBoost, it throws an error on the predict function:
treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = AdaBoostClassifier(base_estimator=treeclf, n_estimators=500, learning_rate=0.5)
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)
Specifically the error says:
ValueError: bad input shape (236821, 6)
The dataset that it seems to be pointing to is train_pdf_y because it has a shape of (236821, 6) and I don't understand why.
Even from the description of AdaBoostClassifier in the docs, I understand that the classifier that actually uses the data is the DecisionTree:
An AdaBoost [1] classifier is a meta-estimator that begins by fitting
a classifier on the original dataset and then fits additional copies
of the classifier on the same dataset but where the weights of
incorrectly classified instances are adjusted such that subsequent
classifiers focus more on difficult cases
But I'm still getting this error.
I have looked at the code examples I could find, including those on sklearn's website showing how to use AdaBoost, and I can't understand what I'm doing wrong.
Any help is appreciated.

It looks like you are trying to solve a multi-output classification problem, given the shape of y; otherwise it would not make sense to feed a 2D y to adaclf.fit(train_pdf_x, train_pdf_y).
Assuming that is the case, the problem is that although Scikit-Learn's DecisionTreeClassifier does support multi-output problems, that is, y inputs with shape [n_samples, n_outputs], AdaBoostClassifier does not. According to its documentation, the labels must be:
y : array-like of shape = [n_samples]
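
The difference can be reproduced on dummy data. The sketch below assumes a recent scikit-learn and uses random inputs with three label columns. The last part, wrapping AdaBoost in MultiOutputClassifier so that one booster is trained per label column, is one possible workaround and is not something the answer above proposes.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = (X[:, :3] > 0).astype(int)   # three binary label columns, shape (60, 3)

# A decision tree accepts a 2D y directly.
tree_pred = DecisionTreeClassifier(max_depth=3).fit(X, Y).predict(X)
print(tree_pred.shape)

# AdaBoostClassifier expects a 1D y and rejects the same array.
try:
    AdaBoostClassifier(n_estimators=10).fit(X, Y)
    ada_rejected = False
except ValueError as err:
    ada_rejected = True
    print("AdaBoost raised:", err)

# Possible workaround (an assumption, not from the answer): one AdaBoost per column.
multi = MultiOutputClassifier(AdaBoostClassifier(n_estimators=10)).fit(X, Y)
multi_pred = multi.predict(X)
print(multi_pred.shape)
```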

Keras Input Shape Issue

I can find many questions and answers related to mine, but somehow none of them solved my problem. I have data with shape (10584, 56) and specified input_shape=(10584,56) in the code, but I get the following error:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (10584, 56).
I have some idea that I need to reshape my input data frame, but I am not sure how. My code follows:
y = df['Target']
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
model = keras.models.Sequential()
model.add(keras.layers.Dense(64,input_shape=(10584,56),activation='relu'))
Any help/suggestion will be much appreciated.

There is always an additional dimension for the batch size that you need to add, even if you want to use a batch size of 1.
Another possibility: if your samples are in fact not 2D arrays but 1D vectors of size 56, and 10584 is the number of samples you have, then the number of samples is not part of the input shape. You only provide the shape of a single sample. Keras will take care of splitting your data into batches and setting the network up the right way.
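
The shape reasoning can be checked with NumPy alone. The sketch below uses an all-zeros array with the shape from the question and spells out what a 64-unit dense layer does to one batch; the weights are placeholders, not a trained model. Under the second reading above, the per-sample shape it prints is what the Keras layer would be given, i.e. input_shape=(56,).

```python
import numpy as np

# Dummy data with the shape from the question: 10584 samples, 56 features each.
x_train = np.zeros((10584, 56))

# The per-sample shape drops the first (sample/batch) axis.
per_sample_shape = x_train.shape[1:]
print(per_sample_shape)  # (56,)

# What a 64-unit dense layer does to a batch, written out with NumPy:
# weights of shape (features, units) are applied to every sample in the batch.
weights = np.zeros((56, 64))
bias = np.zeros(64)
batch = x_train[:32]                                  # shape (32, 56)
activations = np.maximum(batch @ weights + bias, 0)   # ReLU
print(activations.shape)  # (32, 64)
```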

LSTM: Understand timesteps, samples and features and especially the use in reshape and input_shape

I'm trying to learn LSTM. I have taken web courses, read this book (https://machinelearningmastery.com/lstms-with-python/) and a lot of blogs, but I'm completely stuck. My interest is in multivariate LSTMs, and I have read all I can find but still can't get it. I don't know if I'm stupid or what it is...
If this exact question and a good answer already exist, then I am sorry for double posting, but I have looked and haven't found them.
As I really want to understand the basics, I created a dummy dataset in Excel where every "y" depends on the sum of the inputs x1 and x2, but also over time. As I understand it, this is a many-to-one scenario.
Pseudo code:
x1(t) = sin(A(t))
x2(t) = cos(A(t))
tmp(t) = x1(t) + x2(t) (dummy variable)
y(t) = tmp(t) + tmp(t-1) + tmp(t-2) (i.e. sum over the last three steps)
(Basically I want to predict y(t) given x1 and x2 over three time steps)
This is then exported to a csv file with columns x1, x2, y
I have tried to code it up below but obviously it won't work.
I read the data and split it 80/20 into train and test sets as X_train, y_train, X_test, y_test with dimensions (217,2), (217,1), (54,2), (54,1).
What I really haven't got a grip on yet is what exactly are timesteps and samples and the use in reshape and input_shape. In many examples of code I have looked at they simply use numbers rather than defined variables which makes it very difficult to understand what is happening, especially if you want to change something. As an example, in one of the courses I took the reshaping was coded like this...
X_train = np.reshape(X_train, (1257, 1, 1))
This doesn't provide much info...
Anyway, when I run the code below it says
ValueError: cannot reshape array of size 434 into shape (217,3,2)
So I know what causes the error, but not what I need to do to fix it. If I set look_back=1 it works, but that's not what I want.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
# Load data
data_set = pd.read_csv('../Data/LSTM_test.csv', sep=';')
"""
data loaded have three columns:
col 0, col 1: features (x)
col 2: y
"""
# Train/test and variable split
split = 0.8 # 80% train, 20% test
split_idx = int(data_set.shape[0]*split)
# ...train
X_train = data_set.values[0:split_idx,0:2]
y_train = data_set.values[0:split_idx,2]
# ...test
X_test = data_set.values[split_idx:-1,0:2]
y_test = data_set.values[split_idx:-1,2]
# Model setup
look_back = 3 # as that is how y was generated (i.e. sum last three steps)
num_features = 2 # in this case: 2 features x1, x2
output_dim = 1 # want to predict 1 y value
nb_hidden_neurons = 32 # assume something to start with
nb_epoch = 2 # assume something to start with
# Reshaping
nb_samples = len(X_train) # in this case 217 samples in the training set
X_train_reshaped = np.reshape(X_train,(nb_samples, look_back, num_features))
# Create model
model = Sequential()
model.add(LSTM(nb_hidden_neurons, input_shape=(look_back,num_features)))
model.add(Dense(units=output_dim))
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_train_reshaped, y_train, batch_size = 32, epochs = nb_epoch)
print(model.summary())
Can anyone please explain what I have done wrong?
As I said, I have read a lot of blogs, questions, tutorials etc but if someone has a particularly good source of info I'd love to check that one up too.

I also had this question before. At a high level, in (samples, time steps, features):
samples is the number of data points, that is, how many sequences there are in your data set
time steps is the number of steps in each sequence that are fed to the LSTM one after another
features is the number of columns (values) at each time step of a sample
I think a better way to understand it is an NLP example. Suppose you have one sentence to process. Then samples is 1, meaning one sentence to read. Time steps is the number of words in that sentence: you feed in the sentence word by word until the model has read all the words and has the whole context of the sentence. Features is the dimension of each word vector, because with word embeddings like word2vec or GloVe, each word is represented by a vector with multiple dimensions.
The input_shape parameter in Keras is only (time_steps, num_features); the number of samples is left out.
Your problem is that when you reshape data, the product of the new dimensions must equal the product of the dimensions of the original array, and 434 does not equal 217*3*2.
When you implement an LSTM, you should be very clear about what the features are and what elements you want the model to read at each time step. For example, if you are trying to predict the value at time t using t-1 and t-2, you can either feed in the two values as one element to predict t, where (time_step, num_features)=(1, 2), or you can feed each value in its own time step, where (time_step, num_features)=(2, 1).
That's basically how I understand it; I hope this makes it clear for you.
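
The two framings from the last example can be written out with NumPy. This is only a sketch on a toy univariate series (the numbers 0 to 9); the variable names are made up for illustration.

```python
import numpy as np

series = np.arange(10, dtype=float)   # a univariate toy series: 0, 1, ..., 9

# Pairs (t-2, t-1) used to predict the value at t.
pairs = np.stack([series[i:i + 2] for i in range(len(series) - 2)])  # (8, 2)
targets = series[2:]                                                  # (8,)

# Option A: one time step carrying two features per sample.
as_one_step = pairs.reshape(-1, 1, 2)
# Option B: two time steps carrying one feature each.
as_two_steps = pairs.reshape(-1, 2, 1)

print(as_one_step.shape, as_two_steps.shape, targets.shape)
```

In both cases the first axis is the number of samples, and the remaining two axes are what would be passed as input_shape.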

You seem to have a decent grasp of what LSTM expects and are just struggling with getting your data into the correct format. You start with an X_train of shape (217, 2) and you want to reshape it into the shape (nb_samples, look_back, num_features). You have already defined look_back and num_features, so all the work that's left is generating nb_samples chunks of length look_back from your original X_train. Numpy's reshape isn't really the tool for this; instead you'll have to write some code.
import numpy as np
nb_samples = X_train.shape[0] - look_back
x_train_reshaped = np.zeros((nb_samples, look_back, num_features))
y_train_reshaped = np.zeros((nb_samples))
for i in range(nb_samples):
    y_position = i + look_back
    x_train_reshaped[i] = X_train[i:y_position]
    y_train_reshaped[i] = y_train[y_position]
model.fit(x_train_reshaped, y_train_reshaped, ...)
The shapes are now:
x_train_reshaped.shape
# (214, 3, 2)
y_train_reshaped.shape
# (214,)
You'll have to do the same thing with X_test and y_test.
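
To avoid repeating the loop for the test set, the windowing can be wrapped in a small helper. The function name make_windows is made up here, and the arrays are placeholders with the shapes from the question (with y as a 1D array), so treat this as a sketch rather than a drop-in replacement.

```python
import numpy as np

def make_windows(X, y, look_back):
    """Slice X into overlapping windows of length look_back.

    Window i covers rows i .. i+look_back-1 of X and is paired with
    y[i + look_back], the value one step after the window, as in the loop above.
    """
    nb_samples = X.shape[0] - look_back
    X_win = np.stack([X[i:i + look_back] for i in range(nb_samples)])
    y_win = y[look_back:look_back + nb_samples]
    return X_win, y_win

# Placeholder data with the shapes from the question.
X_train = np.arange(217 * 2, dtype=float).reshape(217, 2)
y_train = np.arange(217, dtype=float)
X_test = np.arange(54 * 2, dtype=float).reshape(54, 2)
y_test = np.arange(54, dtype=float)

X_train_win, y_train_win = make_windows(X_train, y_train, look_back=3)
X_test_win, y_test_win = make_windows(X_test, y_test, look_back=3)
print(X_train_win.shape, y_train_win.shape)
print(X_test_win.shape, y_test_win.shape)
```

Since the question builds y(t) from the inputs at t, t-1 and t-2, it may be worth checking whether the target should instead be the y value at the last row of each window (index i + look_back - 1); that is a modelling choice the loop above does not make.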

This helped me: https://github.com/fchollet/keras/issues/2045
In short, the answer to your question:
You want to reshape an array with 434 elements into shape (217,3,2), but that is impossible. Let me show you why:
The new shape has 217*3*2 = 1302 elements, but the original array has only 434 elements. Therefore, the solution is to change the dimensions you reshape to.
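
The arithmetic is easy to confirm with NumPy on an array of zeros with the same shape as X_train:

```python
import numpy as np

X_train = np.zeros((217, 2))
look_back, num_features = 3, 2

print(X_train.size)                                  # 434 elements available
print(X_train.shape[0] * look_back * num_features)   # 1302 elements requested

# The mismatch makes reshape raise the error from the question.
try:
    X_train.reshape(X_train.shape[0], look_back, num_features)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)

# A target shape whose product is 434 is accepted, e.g. with look_back = 1.
ok = X_train.reshape(217, 1, 2)
print(ok.shape)
```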

Python, ValueError, BroadCast Error with SKLearn Preproccesing

I am trying to run the SKLearn preprocessing StandardScaler and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets are different lengths, so how would I transform the data?

Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects every dataset it transforms to have the same number of columns as the data it was fitted on. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc.) is the source of your problem. Whatever transformations you make to t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
The (119,) operand is self.mean_, which was computed from t_train, and (40000,59) is the array being transformed, so I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59.
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you remove any columns from the test set, or add any to the training set, before attempting to use StandardScaler?
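
The check can be reproduced on dummy data. The sketch below assumes, as the shapes in the error message suggest, that the scaler was fitted on 119 columns and then asked to transform 59; the row counts are arbitrary and the values are random.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
t_train = rng.normal(size=(50, 119))   # dummy data: 119 columns
t_test = rng.normal(size=(40, 59))     # dummy data: 59 columns

scaler = StandardScaler().fit(t_train)
print(scaler.mean_.shape)  # one mean per training column

# Compare the column counts before transforming.
columns_match = t_train.shape[1] == t_test.shape[1]
print(t_train.shape[1], t_test.shape[1], columns_match)

# With mismatched columns, transform raises a ValueError.
try:
    scaler.transform(t_test)
    transform_failed = False
except ValueError as err:
    transform_failed = True
    print(err)
```

The wording of the error depends on the scikit-learn version: older releases surface the NumPy broadcasting error shown in the question, while newer ones report the mismatched feature counts directly.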

What do you mean by "train and test data set are different lengths"? How did you obtain your training data?
If your testing data have more features than your training data, then to reduce the dimensionality of your testing data correctly you need to know how your training data were produced, for example with a dimensionality reduction technique (PCA, SVD, etc.). If that is the case, you have to multiply each testing vector by the same matrix that was used to reduce the dimensionality of your training data.
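
If a reduction such as PCA was indeed involved (the question does not say so), applying "the same matrix" to the test data amounts to fitting on the training set and reusing the fitted object. A minimal sketch with random dummy data and a hypothetical choice of 59 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 119))
test = rng.normal(size=(40, 119))

pca = PCA(n_components=59).fit(train)   # projection learned from training data only
train_reduced = pca.transform(train)
test_reduced = pca.transform(test)      # same projection applied to the test data
print(train_reduced.shape, test_reduced.shape)
```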

The time series was in the format with time as the columns and data in the rows. I did the following before the original posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck' for some reason.
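
One thing to watch with this approach: transpose() returns a new object instead of modifying the data in place, for both NumPy arrays and pandas DataFrames, so the result has to be assigned to be kept. That may be related to the change not "sticking", though this is a guess. A small demonstration on a dummy array:

```python
import numpy as np

t_train = np.zeros((3, 5))       # dummy array: 3 rows, 5 columns

t_train.transpose()              # result is discarded; t_train is unchanged
shape_without_assignment = t_train.shape
print(shape_without_assignment)  # (3, 5)

t_train = t_train.transpose()    # assign the result to keep it
print(t_train.shape)             # (5, 3)
```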

t_train has shape (x, 119), whereas t_test has shape (40000,59).
If you want to use the same scaler object for both transformations, your data must always have the same number of columns.
Since you fit the scaler on t_train, you get this error when you try to transform t_test.
