Python, ValueError, Broadcast Error with SKLearn Preprocessing - python

I am trying to run the scikit-learn preprocessing StandardScaler and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets are different lengths, so how would I transform the data?

Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects the data it transforms to have the same number of columns as the data it was fitted on. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc.) is the source of your problem. Whatever transformations you make to t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59 (the scaler's mean_ has 119 entries, while the array being transformed has 59 columns).
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you drop any columns from the test set, or add columns (e.g. dummies) to the training set, prior to attempting to use StandardScaler?

What do you mean by "train and test data set are different lengths"? How did you obtain your training data?
If your test data has a different number of features than your training data, then in order to reduce the dimensionality of your test data you need to know how your training data was produced, for example whether a dimensionality reduction technique (PCA, SVD, etc.) was applied. If that is the case, you have to project each test vector with the same matrix that was used to reduce the dimensionality of your training data.
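For illustration, here is a minimal sketch of that point, assuming the training data was reduced with scikit-learn's PCA (the variable names and number of components below are hypothetical):
from sklearn.decomposition import PCA

# Fit the projection on the training data only
pca = PCA(n_components=59)                        # illustrative value
t_train_reduced = pca.fit_transform(t_train_raw)
# Reuse the same fitted projection on the test data; do not fit a second PCA
t_test_reduced = pca.transform(t_test_raw)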

The time series was in a format with time as the columns and data in the rows. I did the following before the originally posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck', for some reason...
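One likely explanation, assuming t_train and t_test are pandas DataFrames (or NumPy arrays): transpose() returns a new transposed object rather than modifying the original in place, so the result has to be assigned back, roughly like this:
# transpose() does not modify in place; assign the result back
t_train = t_train.transpose()
t_test = t_test.transpose()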

t_train's shape is (x, 119), whereas t_test's shape is (40000, 59).
If you want to use the same scaler object for the transformation, then your data must always have the same number of columns.
Since you fit the scaler on t_train, that is why you get the error when you try to transform t_test.
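As a small sanity check, here is a sketch reusing the variable names from the question: assert that the column counts match before calling transform.
# Both sets must have the same number of columns before transform
assert t_train.shape[1] == t_test.shape[1], (
    "column mismatch: train has %d columns, test has %d"
    % (t_train.shape[1], t_test.shape[1]))
t_test_scale = scaler.transform(t_test)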

Related

Dimensionality Reduction ValueError

I'm new to this subject. I was trying to use PCA from sklearn to reduce the dimensionality of my data. As I don't know another method, I am trying to use PCA to guess how many dimensions should be used.
My data is an ndarray with shape (51, 2928). With the following code I try to fit the data:
from sklearn.decomposition import PCA

pca = PCA(n_components='mle', svd_solver='full')
pca.fit(data)
But I get the following error when trying to fit the data:
ValueError: n_components='mle' is only supported if n_samples >= n_features
What am I doing wrong?
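Not part of the original question, but as a hedged sketch: with 51 samples and 2928 features the 'mle' option is unavailable (it requires n_samples >= n_features), so n_components has to be chosen another way, for example as an explicit integer no larger than min(51, 2928) = 51, or as a target fraction of explained variance:
from sklearn.decomposition import PCA

# Keep enough components to explain 95% of the variance instead of using 'mle'
pca = PCA(n_components=0.95, svd_solver='full')
pca.fit(data)
print(pca.n_components_)   # number of components actually retained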

fit and transform error on Cross validation and test data

I need help with the code here. I am trying to fit and transform the train data and then transform the cross-validation and the test data, but when I do that I get the error: ValueError: X has 24155 features, but Normalizer is expecting 49041 features as input.
Can someone please help me solve this issue?
My code snippet:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))
print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)
The transform function expects a 2D array shaped (samples, features).
The error indicates that the second dimension of X_train['price'] and of X_cv['price'] / X_test['price'] are not the same: with reshape(1, -1) the whole column becomes a single row, so every price value is treated as a separate feature, and the fitted Normalizer then expects the cross-validation and test data to have exactly as many "features" as the training column has values.
As the code reflects, you have one feature (price) and many samples. So, following the (samples, features) convention above, your input shape should be (n_samples, 1), since you have one feature. Consider changing the reshape to (-1, 1) instead of (1, -1):
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(-1,1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(-1,1))
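To see why the original reshape triggers the error, here is a small illustrative sketch with toy arrays (not the original data); on recent scikit-learn versions the commented-out line raises the same "expecting N features" ValueError:
import numpy as np
from sklearn.preprocessing import Normalizer

train_prices = np.arange(1.0, 6.0)   # 5 toy training values
cv_prices = np.arange(1.0, 4.0)      # 3 toy cross-validation values

norm = Normalizer()
norm.fit_transform(train_prices.reshape(1, -1))   # one sample with 5 "features"
# norm.transform(cv_prices.reshape(1, -1))        # mismatch: 3 features vs 5 expected

norm = Normalizer()
norm.fit_transform(train_prices.reshape(-1, 1))   # 5 samples, 1 feature
norm.transform(cv_prices.reshape(-1, 1))          # 3 samples, 1 feature: shapes agree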

ValueError: X has 29 features per sample; expecting 84

I am working on a script using the Lending Club API to predict whether a loan will "pay in full" or "charge off". To do this I am using scikit-learn to build the model, which is persisted using joblib. I run into a ValueError due to a difference between the number of columns in the persisted model and the number of columns in the new raw data. The ValueError is caused by creating dummy variables for categorical variables. The number of columns used in the model is 84, and in this example the number of columns using the new data is 29.
The number of columns needs to be 84 for the new data when making dummy variables, but I am not sure how to proceed, since only a subset of all possible values for the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present when obtaining new data from the API.
Here's the code I am testing at the moment starting at the point where the categorical variables are transformed into dummy variables and stopping at model implementation.
#......continued
df['mthsSinceLastDelinq'] = df['mthsSinceLastDelinq'].notnull().astype('int')  # 0/1 indicator of whether a value is present
df['mthsSinceLastRecord'] = df['mthsSinceLastRecord'].notnull().astype('int')
df['grade_num'] = df['grade'].map({'A':0,'B':1,'C':2,'D':3})
df['emp_length_num'] = df['empLength']
df = pd.get_dummies(df,columns=['homeOwnership','addrState','purpose'])
# df = pd.get_dummies(df,columns=['home_ownership','addr_state','verification_status','purpose'])
# step 3.5 transform data before making predictions
df.drop(['id','grade','empLength','isIncV'],axis=1,inplace=True)
dfbcd = df[df['grade_num'] != 0]
scaler = StandardScaler()
x_scbcd = scaler.fit_transform(dfbcd)
# step 4 predicting
lrbcd_test = load('lrbcd_test.joblib')
ypredbcdfinal = lrbcd_test.predict(x_scbcd)
Here's the error message
ValueError Traceback (most recent call last)
<ipython-input-239-c99611b2e48a> in <module>
11 # change name of model and file name
12 lrbcd_test = load('lrbcd_test.joblib')
---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
14
15 #add model
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
287 Predicted class label per sample.
288 """
--> 289 scores = self.decision_function(X)
290 if len(scores.shape) == 1:
291 indices = (scores > 0).astype(np.int)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
268 if X.shape[1] != n_features:
269 raise ValueError("X has %d features per sample; expecting %d"
--> 270 % (X.shape[1], n_features))
271
272 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 29 features per sample; expecting 84
Your new data should have exactly the same columns as the data that you used to train and persist your original model. If the number of unique values of the categorical variables is smaller in the newer data, manually add columns for those missing values after doing pd.get_dummies() and set them to zero for all the data points, as sketched below.
The model will only work when it gets the required number of columns. If pd.get_dummies fails to create all those columns on the newer data, you should do it manually.
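Here is a minimal sketch of that manual approach; the expected_columns list is a hypothetical stand-in for the 84 columns seen at training time:
# Columns the persisted model was trained on (hypothetical examples)
expected_columns = ['grade_num', 'emp_length_num',
                    'homeOwnership_RENT', 'homeOwnership_OWN', 'addrState_CA']

for col in expected_columns:
    if col not in df.columns:
        df[col] = 0            # add any dummy column missing from the new data

df = df[expected_columns]      # keep the columns in the training order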
Edit
If you want to automatically insert the missing columns after the pd.get_dummies() step, you can use the following approach.
Assuming that df_newdata is the dataframe you get after applying pd.get_dummies() to the new dataset and df_olddata is the dataframe you got when you applied pd.get_dummies() to the older dataset (which was used for training), you can simply do this:
df_newdata = df_newdata.reindex(columns=df_olddata.columns, fill_value=0)
This will automatically create the missing columns in df_newdata (in comparison to df_olddata) and fill them with 0 for all rows. So now your new dataframe has the same exact columns as the original dataframe had.
Hope this helps
Use just transform instead of fit_transform on the new data; this should do the trick, assuming the scaler that was fitted on the training data is the one being reused. Hope it helps.
x_scbcd = scaler.transform(dfbcd)
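This assumes the scaler fitted on the training data was also persisted; here is a hedged sketch of that workflow with joblib (the file name scaler.joblib is hypothetical):
from joblib import dump, load

# At training time, save the fitted scaler next to the model:
# dump(scaler, 'scaler.joblib')

# At prediction time, load both and only call transform on the new data
scaler = load('scaler.joblib')
lrbcd_test = load('lrbcd_test.joblib')
x_scbcd = scaler.transform(dfbcd)
ypredbcdfinal = lrbcd_test.predict(x_scbcd)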
Could you try using the transform method of the fitted StandardScaler on your data frame dfbcd before passing the result to lrbcd_test.predict()? This will create a feature representation of your data consistent with what was used at training time.
ypredbcdfinal = lrbcd_test.predict(scaler.transform(dfbcd))
If predict gives an error, the same pattern runs error-free once the features match what the model was trained on, for example (the variable names below come from a different example):
pred_1 = model_1.predict(tfidf_train)
cr1 = accuracy_score(y_train, pred_1)

Using numpy.ndarray type (multilabel) for labels in Sagemaker RecordIO format?

I am trying to write a numpy.ndarray as the labels for Amazon SageMaker's conversion tool write_numpy_to_dense_tensor(). It converts a numpy array of features and labels to RecordIO for better use by SageMaker algorithms.
However, if I try to pass a multilabel output for the labels, I get an error stating it can only be a vector (i.e. a scalar for every feature row).
Is there any way of having multiple values in the label? This is useful for multidimensional regressions which can be achieved with XGBoost, Random Forests, Neural Networks, etc.
Code
import io
import sagemaker.amazon.common as smac
print("Types: {}, {}".format(type(X_train), type(y_train)))
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
Output:
Types: <class 'numpy.ndarray'>, <class 'numpy.ndarray'>
X_train shape: (9919, 2684)
y_train shape: (9919, 20)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-fc1033b7e309> in <module>()
3 print("y_train shape: {}".format(y_train.shape))
4 f = io.BytesIO()
----> 5 smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
94 if labels is not None:
95 if not len(labels.shape) == 1:
---> 96 raise ValueError("Labels must be a Vector")
97 if labels.shape[0] not in array.shape:
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
ValueError: Labels must be a Vector
Tom, XGBoost does not support the RecordIO format; it only supports CSV and LibSVM. Also, the algorithm itself doesn't natively support multi-label output, but there are a couple of ways around it: Xg boost for multilabel classification?
Random Cut Forest does not support multiple labels either; if more than one label is provided, it picks up only the first.
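One possible workaround, sketched under the assumption that you train one model per label column (with an algorithm that does accept RecordIO): write a separate file per label, since the converter only accepts a 1-D label vector:
import io
import sagemaker.amazon.common as smac

label_buffers = []
for i in range(y_train.shape[1]):
    f = io.BytesIO()
    # Each label column is a 1-D vector, which satisfies the converter's check
    smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'),
                                     y_train[:, i].astype('float32'))
    f.seek(0)
    label_buffers.append(f)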

sklearn dimensionality issues "Found array with dim 3. Estimator expected <= 2"

I am trying to use KNN to correctly classify .wav files into two groups, group 0 and group 1.
I extracted the data, created the model, and fit the model; however, when I try to use the .predict() method I get the following error:
Traceback (most recent call last):
File "/..../....../KNN.py", line 20, in <module>
classifier.fit(X_train, y_train)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 761, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 405, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
I have found these two stackoverflow posts which describe similar issues:
sklearn Logistic Regression "ValueError: Found array with dim 3. Estimator expected <= 2."
Error: Found array with dim 3. Estimator expected <= 2
And, correct me if I'm wrong, but it appears that scikit-learn can only accept 2-dimensional data.
My training data has shape (3240, 20, 5255), which consists of:
3240 .wav files in this dataset (this is index 0 of the training data)
For each .wav file there is a (20, 5255) numpy array which represents the MFCC coefficients (MFCC coefficients try to represent the sound in a numeric way).
My testing data has shape (3240,) #category is 0 or 1
What code can I use to manipulate my training and testing data to convert it into a form that is usable by scikit-learn? Also, how can I ensure that data is not lost when I go down from 3 dimensions to 2 dimensions?
It is true, sklearn works only with 2D data.
What you can try to do:
Just use np.reshape on the training data to convert it to shape (3240, 20*5255). It will preserve all the original information, but sklearn will not be able to exploit the implicit structure in this data (e.g. that features 1, 21, 41, etc. are different versions of the same variable). A short sketch of this option appears after this list.
Build a convolutional neural network on your original data (e.g. with tensorflow+Keras stack). CNNs were designed specially to handle such multidimensional data and exploit its structure. But they have lots of hyperparameters to tune.
Use dimensionality reduction (e.g. PCA) on the data reshaped to (3240, 20*5255). It will try to preserve as much information as possible while keeping the number of features low.
Use manual feature engineering to extract specific information from the data structure (e.g. descriptive statistics along each dimension), and train your model on such features.
If you had more data (e.g. 100K examples), the first approach might work best. In your case (3K examples and over 100K features after flattening) you need to regularize your model heavily to avoid overfitting.
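A minimal sketch of the first option, assuming X has shape (3240, 20, 5255) and y has shape (3240,) as described in the question:
from sklearn.neighbors import KNeighborsClassifier

X_2d = X.reshape(X.shape[0], -1)          # (3240, 20*5255); no values are lost
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_2d, y)
# Any test data must be flattened the same way before predicting:
# predictions = clf.predict(X_test.reshape(X_test.shape[0], -1))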
