I am getting the following error while trying to load a scaler to make a prediction on a trained neural network: ValueError: operands could not be broadcast together with shapes (317,257) (269,) (317,257)
Some context: the first shape, (317, 257), is the shape of the data set I am trying to make a prediction on, while (269,) is the number of columns in the training set. I pickled the scaler and loaded it back in, but will get to that later. Before I apply OneHotEncoder, the data set only has 9 columns. After LabelEncoder and OneHotEncoder are applied, it expands to 269 columns (there are a lot of large categorical variables within the data).
My code:
Applying OneHotEncoder after applying LabelEncoder:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_4 = LabelEncoder()
X[:, 3] = labelencoder_X_4.fit_transform(X[:, 3].astype(str))
labelencoder_X_5 = LabelEncoder()
X[:, 4] = labelencoder_X_5.fit_transform(X[:, 4].astype(str))
labelencoder_X_6 = LabelEncoder()
X[:, 5] = labelencoder_X_6.fit_transform(X[:, 5].astype(str))
labelencoder_X_7 = LabelEncoder()
X[:, 6] = labelencoder_X_7.fit_transform(X[:, 6].astype(str))
labelencoder_X_8 = LabelEncoder()
X[:, 7] = labelencoder_X_8.fit_transform(X[:, 7].astype(str))
labelencoder_X_9 = LabelEncoder()
X[:, 8] = labelencoder_X_9.fit_transform(X[:, 8].astype(str))
onehotencoder_X = OneHotEncoder(categorical_features = [1,3,4,5,6,7,8])
X = onehotencoder_X.fit_transform(X).toarray()
Fitting to the scaler during training:
from sklearn.preprocessing import MinMaxScaler
from sklearn.externals import joblib
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
joblib.dump(sc, 'scaler.pkl')
X_test = sc.transform(X_test)
Then retrieving the scaler and attempting to transform test data:
from sklearn.externals import joblib
sc = joblib.load('scaler.pkl')
dataset = sc.transform(dataset) #Error happens here
The test data set is pretty large since it does expand to 257 of 269 of the variables - but ideally I want to be able to predict on just a single row of data. In order to get the correct shape, do I have to append that data to a set that contains ALL of the different categories within the categorical data? What happens if a row of data has a value that was not present in the dataset? This just seems inefficient, so there must be a simple fix to this, right?
Thank you for any help you can provide. If you need any additional details, please let me know.
Update: I am pickling all the encoders and loading them back in when trying to encode the test set. I am then using .transform instead of .fit_transform on the test set. I feel like this is a step in the right direction, but here is the error I am getting now: ValueError: y contains new labels: ['0' '1' '2']
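One possible way around both errors, sketched under assumptions: fit a single preprocessing pipeline (OneHotEncoder with handle_unknown="ignore", followed by MinMaxScaler) on the training data, persist that one object, and reuse it for any new rows. The column names and values below are hypothetical and only stand in for the real data set; an in-memory buffer stands in for 'scaler.pkl'.

```python
import io

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical training data: one numeric and two categorical columns.
train = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "kind": ["a", "b", "a", "c"],
})

preprocess = Pipeline(steps=[
    ("encode", ColumnTransformer(
        transformers=[
            # Categories never seen during fit become all-zero columns.
            ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city", "kind"]),
        ],
        remainder="passthrough",  # keep the numeric column as-is
        sparse_threshold=0.0,     # always return a dense array
    )),
    ("scale", MinMaxScaler()),
])
X_train = preprocess.fit_transform(train)

# Persist the fitted pipeline (buffer used here instead of a file on disk).
buffer = io.BytesIO()
joblib.dump(preprocess, buffer)
buffer.seek(0)
loaded = joblib.load(buffer)

# A single new row, including a city that was not in the training data.
new_row = pd.DataFrame({"amount": [25.0], "city": ["Lima"], "kind": ["b"]})
X_new = loaded.transform(new_row)
print(X_train.shape, X_new.shape)
```

Because the encoder's column layout is fixed at fit time, a single row comes out with the same number of columns as the training matrix, so the scaler no longer sees a shape mismatch.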
Related
I'm having a problem with sklearn.
When I train it with ".fit()" it shows me the ValueError "ValueError: could not convert string to float: 'Casado'"
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is an .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a Yes or No (Si or No)
I appreciate your help. I'm obviously new at this and I don't understand what I am doing wrong here.
I would use OneHotEncoder to, like Lavin mentioned, turn the categorical values into numerical ones. A model such as this can't process categorical data given as strings.
OneHotEncoder is meant for categorical feature columns, whether they are binary (yes/no, male/female) or have more than 2 values (e.g., country names), while LabelEncoder is intended for encoding the target column.
It will look something like this; however, you'll have to do this with all categorical columns (for example, EstadoCivil and Profesion), and the y column can be handled separately with LabelEncoder if needed.
Also, I would suggest removing any independent variables that don't contribute to your model; for instance, the client ID sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Replace the placeholder list with the indices of the categorical columns in your df
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [Insert column number for your df])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
For the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
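As a rough, self-contained sketch of the approach described above, using only the three sample rows shown in the question (read from a string instead of the CSV file). Dropping IdCliente and one-hot encoding every remaining column are assumptions made for illustration, not the only reasonable choice.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# The sample rows from the question, read from a string instead of a file.
csv_text = """IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Assumption: the ID carries no predictive information, so it is dropped.
X_train = dataset.drop(["IdCliente", "Compra"], axis=1)
y_train = dataset["Compra"]

# One-hot encode every remaining (categorical) column.
ct = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"), list(X_train.columns)),
    ],
    sparse_threshold=0.0,  # GaussianNB needs a dense array
)
X_encoded = ct.fit_transform(X_train)

gaunb = GaussianNB().fit(X_encoded, y_train)

# New data must go through the same fitted transformer (transform, not fit_transform).
prediction = gaunb.predict(ct.transform(X_train))
print(prediction)
```

The target column can stay as strings here, since scikit-learn classifiers accept string class labels for y.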
I'm facing a problem I can't solve. I am trying to create an LSTM model with Keras, but I don't understand what the input data format should be.
My training data and my test data look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-60: i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that, how do I predict from my features without modifying the test data set?
I have a hard time understanding why we do this. Moreover, I would like to use the other values to predict the target in the last column. With this method I feel like I have to change the format of the test data, and it's important that I can test the model on different test data without having to change it.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That's what I did, but the shape still remains 2D, so I used:
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
But when I fit the model with the generator, I get:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
I really don't understand.
I'm not sure I fully understand your question, but I will try my best.
I think the code you provided is problem specific, meaning it may not be suitable for your implementation.
For an LSTM (and for pretty much any neural network) you generally want to scale your data before feeding it to the model. This helps avoid having completely different data ranges across your features. The MinMaxScaler scales your features to the range provided. For an explanation of why you need scaling, you can have a look at this article.
Usually, you want to first split your dataset into training and testing sets, using for example the train_test_split function of sklearn, then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict",axis=1)
y = data["feature_I_want_to_predict"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represents your training data, and y_train represents your labels for the training data (and similarly for the test data).
Here I used the StandardScaler instead of the MinMaxScaler. The standard scaler subtracts the mean of the feature, then divides by the standard deviation.
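To make the last point concrete, here is a small check (on made-up numbers) that StandardScaler gives the same result as subtracting each column's mean and dividing by its standard deviation. The array X below is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: per-column mean and (population) standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```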
The training dataset has object columns called shops and others. For the machine learning model, I converted these columns into integer labels for training purposes, using the code below:
from sklearn.ensemble import RandomForestRegressor
X = df_all_4.copy()
y = df_all_4.item_price
X = X.drop(['item_price','date'], axis=1)
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
    X[c] = X[c].factorize()[0]
rf = RandomForestRegressor()
rf.fit(X,y)
Now the testing dataset also has those categorical columns, but with some columns missing, including the target column (not relevant here, I think). But if I label the testing dataset again (unordered), the labels would be different from the ones used while training, so the model would not work properly. How can I solve this problem and get the same encodings for training and testing?
The important thing here is that you can use the LabelEncoder or OneHotEncoder classes present in the sklearn package, which make this task pretty simple.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
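# Loop over the object columns that are still in X ('date' was dropped earlier)
for c in X.columns[X.dtypes == 'object']:
    le = LabelEncoder()
    X[c] = le.fit_transform(X[c])  # learn the mapping on the training data
    test[c] = le.transform(test[c])  # reuse the same mapping; raises ValueError on unseen values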
That's it: you have encoded the labels into numbers for both train and test data.
You can also use OneHotEncoder, which applies one-hot encoding to categorical data.
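If the test set may contain categories that never appeared in training, one alternative (a sketch, assuming scikit-learn 0.24 or newer) is OrdinalEncoder, which encodes several columns at once and can map unseen values to a sentinel. The tiny frames and column names below are hypothetical stand-ins for the real train and test data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-ins for the real train and test frames.
train = pd.DataFrame({"shop": ["A", "B", "A", "C"], "item": ["x", "y", "y", "x"]})
test = pd.DataFrame({"shop": ["B", "D"], "item": ["x", "y"]})  # "D" is unseen

cat_cols = ["shop", "item"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the same mapping for the test data.
train_encoded = encoder.fit_transform(train[cat_cols])
test_encoded = encoder.transform(test[cat_cols])

print(train_encoded)
print(test_encoded)
```

The fitted encoder can be pickled alongside the model so the same mapping is available at prediction time.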
I have used scikit-learn in Python for prediction. When I import the following package,
from sklearn import datasets, and store the result in iris = datasets.load_iris(), it works fine to train the model.
iris = pandas.read_csv(r"E:\scikit\sampleTestingCSVInput.csv")  # raw string so the backslashes are not treated as escapes
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model Algorithm :
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But when importing a CSV file to train the model, and creating a new array for target_names as well, I am facing an error like
ValueError: Found input variables with inconsistent numbers of samples: [150, 4]
My CSV file has 5 columns, of which 4 columns are inputs and 1 column is the output. I need to fit the model for that output column.
How do I provide the arguments to fit the model?
Could anyone share a code sample to import a CSV file and fit an SVM model in sklearn?
Since the question was not very clear to begin with and attempts to explain it were going in vain, I decided to download the dataset and do it myself. Just to make sure we are working with the same dataset: iris.head() should show something similar to mine. A few names and a few values might differ, but the overall structure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays, to do that use
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris['Target'].values
Now since Y is categorical data, you will need to encode it as integer labels using sklearn's LabelEncoder and scale the input X. To do that, use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
Edit: Just in case you don't know where the functions are, here are the import statements:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
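Putting the steps together, here is a minimal end-to-end sketch. Since the original CSV is not available, it writes scikit-learn's built-in iris data to an in-memory CSV as a stand-in (an assumption; load_iris(as_frame=True) needs scikit-learn 0.23 or newer) and selects columns by position, so the exact header names don't matter.

```python
import io

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the CSV file on disk: 4 feature columns followed by 1 target column.
csv_buffer = io.StringIO()
load_iris(as_frame=True).frame.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

iris = pd.read_csv(csv_buffer)

# X must be (n_samples, n_features); y must be (n_samples,).
X = iris.iloc[:, :4].values
y = iris.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
model = SVC(gamma="scale").fit(scaler.transform(X_train), y_train)

accuracy = model.score(scaler.transform(X_test), y_test)
print(X.shape, y.shape, accuracy)
```

The "inconsistent numbers of samples: [150, 4]" error typically means the second argument to fit did not have one entry per row of X, which is why y is taken here as a single column with 150 values.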
I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use Pandas and ASSIGN and pd.DataFrame but I have no clue how to write this after reading all the documentation (sorry I'm new to all this and just started learning coding recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Feature scaling all the independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
You can just do dataset['prediction'] = y_pred to add a new column.
Pandas supports a simple syntax for adding new columns. Here it will add a new column, and it will probably take a view on the NumPy array returned from sklearn, so it should be nice and fast.
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does. It splits your original dataset, which has 400 rows, into 3/4 and 1/4 parts: your X train data contains 300 rows and the test data 100 rows. You're then trying to assign back to your original dataset, which has 400 rows. Firstly, the number of rows doesn't match; secondly, what is returned from predict_proba is a matrix of predicted class probabilities, with one column per class. So what you want to do after training is to predict on the original dataset and assign this back as 2 columns by sub-selecting each column:
y_pred = classifier.predict_proba(sc.transform(X))  # scale X the same way as the training data
now assign this back :
dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
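For reference, a self-contained sketch of the whole flow. Because Social_Network_Ads.csv isn't available here, it uses a synthetic 400-row dataset from make_classification; the column names feature_1, feature_2, label, proba_class_0 and proba_class_1 are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset: 400 rows, 2 features, binary label.
features, target = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
dataset = pd.DataFrame(features, columns=["feature_1", "feature_2"])
dataset["label"] = target

X = dataset[["feature_1", "feature_2"]].values
y = dataset["label"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler and classifier on the training split only.
sc = StandardScaler().fit(X_train)
classifier = GaussianNB().fit(sc.transform(X_train), y_train)

# Predict on ALL rows (scaled the same way), so the result lines up with dataset.
proba = classifier.predict_proba(sc.transform(X))
dataset["proba_class_0"] = proba[:, 0]
dataset["proba_class_1"] = proba[:, 1]

print(dataset.head())
```

The key point is that the prediction array has one row per row of dataset, so the column assignment lines up without any index juggling.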
There are several solutions. The answer by EdChurm already mentioned one.
As far as I know, pandas has two other methods for this:
df.insert()
df.assign()
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**3)
df.insert(1,'square_raw',df['raw']**2)
df
        raw  square_raw    cube_raw
0  1.624345    2.638498    4.285832
1 -0.611756    0.374246   -0.228947
2 -0.528172    0.278965   -0.147342
3 -1.072969    1.151262   -1.235268
4  0.865408    0.748930    0.648130
5 -2.301539    5.297080  -12.191435
6  1.744812    3.044368    5.311849
7 -0.761207    0.579436   -0.441071
8  0.319039    0.101786    0.032474
9 -0.249370    0.062186   -0.015507
Just keep in mind that df.assign() doesn't work in place, so you should reassign the result to your previous variable.
Personally, I prefer df.insert() the most, because its loc parameter lets you choose the position at which the column is inserted.