How can I solve an inverse_transform shape problem? - python

Here is my code:
scaler = MinMaxScaler()  # default feature_range is (0, 1)
dataset = scaler.fit_transform(dataset)
...
# build and train the model
...
predicted = model.predict(X_test)  # shape: (5, 1)
When I run predict = scaler.inverse_transform(predicted), a ValueError occurs:
ValueError: non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,2)
My model has 2 features as input.
I tried scaler.inverse_transform(predict)[:, [0]] and reshaping in several directions,
but the same ValueError occurs.
How can I solve this problem? Any advice would be very much appreciated.

You are using inverse_transform the wrong way: you applied fit_transform to your features, but you are applying inverse_transform to your predictions, which have a different shape, hence the error.
This is not the intended usage of inverse_transform; have a look at the docs for more:
inverse_transform(self, X)
Undo the scaling of X according to feature_range.
Parameters: X : array-like, shape [n_samples, n_features]
Input data that will be transformed. It cannot be sparse.
It is not clear from your post why you attempt to "transform back" your predictions; this only makes sense if you have already transformed your labels (your post does not say whether you have), and you want, say, to scale back measures like MSE to the original scale of the labels. In such a case, you should use a separate scaler for your labels - see my own answer in How to interpret MSE in Keras Regressor for details (the example there uses StandardScaler, but the rationale is the same).
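A minimal sketch of that two-scaler approach (the data and names here are illustrative, assuming X has 2 feature columns and y is a single column):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: 100 training samples with 2 features, 1-column target.
X_train = np.random.rand(100, 2) * 50
y_train = np.random.rand(100) * 1000

X_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

X_train_scaled = X_scaler.fit_transform(X_train)                 # (100, 2)
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))  # (100, 1)

# ... build and train the model on X_train_scaled / y_train_scaled ...

# Predictions live in the scaled y space with shape (n, 1), so they are
# inverted with the 1-column label scaler, never the 2-feature scaler:
# predicted_orig = y_scaler.inverse_transform(model.predict(X_test_scaled))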

Related

fit and transform error on Cross validation and test data

I need help with the code here. I am trying to fit and transform the training data and then transform the cross-validation and test data, but when I do that I get this error: ValueError: X has 24155 features, but Normalizer is expecting 49041 features as input.
Can someone please help me solve this issue?
My code snippet:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))
print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)
The transform function expects a 2D array of shape (samples, features).
The error indicates that the second dimension of X_train['price'] does not match that of X_cv['price'] or X_test['price']: with reshape(1,-1), each set becomes a single row whose length is its number of samples, so the "feature" counts differ between train and test.
As the code reflects, you have 1 feature (price) and many samples, so following the (samples, features) convention your input shape should be (n_samples, 1). Change the reshape to (-1,1) instead of (1,-1):
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(-1,1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(-1,1))
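As a quick sanity check with synthetic data (illustrative only): with reshape(-1, 1), train and test both carry a single feature column, so transform() no longer complains about a feature-count mismatch:
import numpy as np
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
train_prices = np.array([10.0, 20.0, 30.0]).reshape(-1, 1)  # shape (3, 1)
test_prices = np.array([15.0, 25.0]).reshape(-1, 1)         # shape (2, 1)
print(normalizer.fit_transform(train_prices).shape)         # (3, 1)
print(normalizer.transform(test_prices).shape)              # (2, 1)
One caveat worth noting: Normalizer rescales each sample (row) to unit norm, so with a single feature every value becomes 1.0; if per-feature scaling is what you are after, MinMaxScaler or StandardScaler may be a better fit.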

Third parameter (kwargs) sklearn fit() function

I'm quite new to scikit-learn and I have a question about the fit() function. I tried to look for information on the internet but couldn't find much.
In an assignment I have to create a dict of parameters passed to the fit function of a classifier, which means the function will take 3 arguments (X, y, kwargs). What parameters is this dictionary supposed to have? Apparently these are hyperparameters for the fit function. Online I only found information for XGBoost, but I'm not supposed to use that, only classifiers from sklearn.
I also found online that fit can take keyword arguments via **fit_params, but there is nothing about which parameters the function might accept.
I hope my question is clear, thanks a lot in advance!
The model hyperparameters are not arguments to the fit function, but to the model class object that you create beforehand.
If you have a dictionary with parameters that you want to pass to your model, you need to do things this way (here with a Logistic Regression):
from sklearn.linear_model import LogisticRegression
params = {"C":10, "max_iter":200}
LR = LogisticRegression(**params)
Now that you have created the model specifying the hyperparameters, you can proceed and fit it with your data.
LR.fit(X, y)
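If the assignment really does expect a dict unpacked into fit itself, note that for most sklearn classifiers the only common fit-time keyword argument is sample_weight. A hedged sketch, reusing LR, X, and y from above (w is an illustrative per-sample weight array):
import numpy as np

w = np.ones(len(y))                # illustrative: uniform per-sample weights
fit_params = {"sample_weight": w}
LR.fit(X, y, **fit_params)         # same as LR.fit(X, y, sample_weight=w)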
I haven't used scikit-learn before, but you can get the docs of a function that you are unsure about from its __doc__ attribute. For example, the fit() method of an SVM estimator has this as its __doc__:
Fit the SVM model according to the given training data.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training vectors, where n_samples is the number of samples
and n_features is the number of features.
For kernel="precomputed", the expected shape of X is
(n_samples, n_samples).
y : array-like of shape (n_samples,)
Target values (class labels in classification, real numbers in
regression)
sample_weight : array-like of shape (n_samples,), default=None
Per-sample weights. Rescale C per sample. Higher weights
force the classifier to put more emphasis on these points.
Returns
-------
self : object
Notes
-----
If X and y are not C-ordered and contiguous arrays of np.float64 and
X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse
matrices as input.
I ran this to get that output:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
print(clf.fit.__doc__)

Why does AdaBoost not work with DecisionTree?

I'm using sklearn 0.19.1 with DecisionTree and AdaBoost.
I have a DecisionTree classifier that works fine:
clf = tree.DecisionTreeClassifier()
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
clf.fit(train_pdf_x, train_pdf_y)
pred2 = clf.predict(test_pdf_x)
But when trying to add AdaBoost, it throws an error on the predict function:
treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = AdaBoostClassifier(base_estimator=treeclf, n_estimators=500, learning_rate=0.5)
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)
Specifically the error says:
ValueError: bad input shape (236821, 6)
The dataset that it seems to be pointing to is train_pdf_y because it has a shape of (236821, 6) and I don't understand why.
Even from the description of the AdaBoostClassifier in the docs, I understand that the actual classifier that uses the data is the DecisionTree:
An AdaBoost [1] classifier is a meta-estimator that begins by fitting
a classifier on the original dataset and then fits additional copies
of the classifier on the same dataset but where the weights of
incorrectly classified instances are adjusted such that subsequent
classifiers focus more on difficult cases
But I'm still getting this error.
I've followed the code examples I've found, even those on sklearn's website showing how to use AdaBoost, and I can't understand what I'm doing wrong.
Any help is appreciated.
It looks like you are trying to perform a multi-output classification problem, given the shape of y; otherwise it does not make sense that you are feeding an n-dimensional y to adaclf.fit(train_pdf_x, train_pdf_y).
So assuming that is the case, the problem is that Scikit-Learn's DecisionTreeClassifier does indeed support multi-output problems, that is, y inputs with shape [n_samples, n_outputs]. However, that is not the case for the AdaBoostClassifier, given that, from the documentation, the labels must be:
y : array-like of shape = [n_samples]
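If a 2-D y really is what you want, one possible workaround (a sketch, not part of the original answer) is to wrap the boosted tree in MultiOutputClassifier, which fits one independent AdaBoost ensemble per output column (base_estimator matches the sklearn 0.19 API used in the question):
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier

treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = MultiOutputClassifier(
    AdaBoostClassifier(base_estimator=treeclf,
                       n_estimators=500, learning_rate=0.5)
)
# A y of shape (n_samples, 6) is now accepted: one ensemble per column.
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)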

Keras Input Shape Issue

I can find many questions and answers related to my question, but somehow they did not solve my problem. I have data with shape (10584, 56) and specified input_shape=(10584,56) in the code, but I am getting the following error:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (10584, 56).
I have some idea that I have to reshape my input data frame, but I am not sure how. Following is my code:
y = df['Target']
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_shape=(10584,56), activation='relu'))
Any help/suggestion will be much appreciated.
There is always an additional dimension for the batch size that you need to add, even if you want to use a batch size of 1.
Another possibility: if your samples are in fact not 2-D matrices but 1-D vectors of size 56, and 10584 is the number of samples you have, then the number of samples is not part of the input shape. You only provide the size of a single sample; Keras will take care of splitting your data into batches and setting the network up the right way, as in the sketch below.
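A minimal sketch under that assumption: each sample is a 1-D vector of 56 features, so the batch dimension is left out of input_shape:
import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_shape=(56,), activation='relu'))
model.summary()  # the model now expects batches of shape (batch_size, 56)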

Scikit-learn RandomForestClassifier output of predict_proba

I have a dataset that I split in two for training and testing a random forest classifier with scikit learn.
I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array (87, 344, 2) (it's actually a list of 87 numpy.ndarrays of (344, 2) elements).
Sometimes, when I pick a different subset of samples for training and testing, I only get a 2-dimensional array (87, 344) (though I can't work out in which cases).
My two questions are:
what do these dimensions represent? I worked out that to get a ROC AUC score, I have to take one half of the output (that is, (87, 344, 2)[:,:,1]), transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:,:,1].T)). But I don't understand what it really means.
why does the output change with different subsets of the data? I can't understand in which cases it returns a 3D array and in which cases a 2D one.
classifier.predict_proba() returns the class probabilities. The second dimension of the array varies with the number of classes present in the subset you train on.
Are you sure the arrays you're using to fit the RF have the right shape? (n_samples, n_features) for the data and (n_samples,) for the target classes.
You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for the sample X[r,:]. Note that sum(Y_pred[r,:]) = 1.
However, if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one in the column corresponding to the sample's class, then sklearn takes it as a multi-output prediction problem (each class is considered independently), but I don't think that's what you'd like to do. In that case, for each class and each sample, you would predict the probability of belonging to that class or not.
Finally, the output indeed depends on the training set, because it depends on the number of classes in the training set. You can get that number with the attribute n_classes_ (and you may also be able to force the number of classes by setting it manually), and you can also get the classes' values with the attribute classes_. See the documentation.
Hope it helps!
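As a short illustration (synthetic data, purely illustrative): a 1-D y gives a single (n_samples, n_classes) array, while a one-hot 2-D y is treated as multi-output and yields a list of per-class arrays:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(344, 10)
y_1d = np.arange(344) % 87                 # plain labels; every class present

clf = RandomForestClassifier(n_estimators=10).fit(X, y_1d)
print(clf.predict_proba(X).shape)          # (344, 87)

y_2d = np.eye(87)[y_1d]                    # one-hot labels, shape (344, 87)
clf2 = RandomForestClassifier(n_estimators=10).fit(X, y_2d)
probs = clf2.predict_proba(X)              # list of 87 arrays, one per column
print(len(probs), probs[0].shape)          # 87 (344, 2)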
