sklearn throws ValueError: bad input shape (600000, 24) - python

Below is a piece of my code for label encoding. When I apply the LabelEncoder to one column of the DataFrame at a time, it works fine. But when I try to apply it to all the categorical features at once, sklearn throws
ValueError: bad input shape (600000, 24). I'm not able to find any specific reason for that.
df = pd.read_csv("../inputs/cat-in-the-dat-train-folds.csv")

# extracting the categorical features
cat_features = [x for x in df.columns if x not in ("id", "target", "kfold")]

for col in cat_features:
    df.loc[:, col] = df[col].astype(str).fillna("NONE")

df_train = df[df["kfold"] != fold].reset_index(drop=True)
df_valid = df[df["kfold"] == fold].reset_index(drop=True)

lbl_enc = preprocessing.LabelEncoder()

full_cat_data = pd.concat(
    [df_train[cat_features], df_valid[cat_features]],
    axis=0)

lbl_enc.fit(full_cat_data)

x_train = lbl_enc.transform(df_train[cat_features])
x_valid = lbl_enc.transform(df_valid[cat_features])

sklearn.preprocessing.LabelEncoder.fit only accepts a 1D array as a parameter; it is intended for encoding the target, one column at a time.
To fit multiple columns at once, use sklearn.preprocessing.OrdinalEncoder.fit, which accepts 2D data of shape [n_samples, n_features] (as per the documentation).
In your example, replace lbl_enc = preprocessing.LabelEncoder() with lbl_enc = preprocessing.OrdinalEncoder() and the code should work.
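For reference, a minimal sketch of the fit/transform portion of your snippet with the encoder swapped in (it assumes full_cat_data, df_train, df_valid and cat_features exactly as you built them):
from sklearn import preprocessing

# OrdinalEncoder accepts 2D input of shape (n_samples, n_features),
# so all categorical columns can be encoded in one pass
lbl_enc = preprocessing.OrdinalEncoder()

# fit on the combined train + validation categorical data, exactly as before
lbl_enc.fit(full_cat_data)

# transform each split with the same fitted encoder
x_train = lbl_enc.transform(df_train[cat_features])
x_valid = lbl_enc.transform(df_valid[cat_features])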
See this answer here for more information on the difference between LabelEncoder and OrdinalEncoder
Additional resources:
Label encoding across multiple columns in scikit-learn


Fit a Normalizer with an array, then transform another in python with sklearn

I'm not sure if I'm doing something wrong, or if this is simply not the correct way to do this.
I'm encoding variables in a dataset for a model. I'm using a Normalizer() from sklearn.preprocessing to normalize one of my variables, which is numerical.
My dataset is split in two, one part for training and one for inference. My goal is to normalize this numerical variable (let's call it column x) in the training subset, and then use the same normalization parameters to normalize that variable in the inference dataset. The two subsets don't have the same number of entries, so what I'm doing is:
nr = Normalizer()
nr.fit([df1.x])
new_col = nr.transform(df1.x)
The problem is that when I try to use the same normalizer parameters on column x in the inference subset, which has a different number of rows:
new_col1 = nr.transform(df2.x)
I get:
X has 10 features, but Normalizer is expecting 697 features as input.
I'm not sure if it's some reshape problem or if Normalizer() shouldn't be used this way, so any advice would be more than welcome.
Normalizer is used to normalize rows (samples), whereas StandardScaler is used to normalize columns (features). From your question, it seems that you want to scale columns, so you should use StandardScaler.
scikit-learn transformers expect a 2D array as input, of shape (n_samples, n_features), but a pandas.Series is a one-dimensional ndarray with axis labels.
You can fix that by passing a pandas.DataFrame to the transformer.
As follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
df1 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=1000)})
df2 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=850)})
scaler = StandardScaler()
new_col = scaler.fit_transform(df1[['x']])
new_col1 = scaler.transform(df2[['x']])
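To make the row-vs-column distinction concrete, here is a small illustrative sketch (the array and its values are made up for demonstration):
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 2.0, 2.0],
              [4.0, 0.0, 3.0]])

# Normalizer rescales each ROW to unit norm; fit learns no parameters,
# which is why it treated every value of df1.x as a separate "feature"
print(Normalizer().fit_transform(X))

# StandardScaler centers and scales each COLUMN using the mean/std learned in fit,
# which is the behaviour wanted here
print(StandardScaler().fit_transform(X))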

How to safely resolve Setting With Copy Warning on assigning over a Pandas DataFrame [duplicate]

This question already has an answer here:
Creating a new column for predicted cluster: SettingWithCopyWarning
(1 answer)
Closed 1 year ago.
I am getting a SettingWithCopyWarning from Pandas when performing the operation below. I understand what the warning means and I know I can turn it off, but I am curious whether I am performing this type of standardization incorrectly using a pandas DataFrame (I have mixed data with categorical and numeric columns). My numbers look fine after checking, but I would like to clean up my syntax to make sure I am using Pandas correctly.
I am also curious whether there is a better workflow for this type of operation when dealing with datasets that have mixed data types like this.
My process is as follows with some toy data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List
# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0', 100, 'A', 10],
                                 ['1', 125, 'A', 15],
                                 ['2', 134, 'A', 20],
                                 ['3', 112, 'A', 25],
                                 ['4', 107, 'B', 35],
                                 ['5', 68, 'B', 50],
                                 ['6', 321, 'B', 10],
                                 ['7', 26, 'B', 27],
                                 ['8', 115, 'C', 64],
                                 ['9', 100, 'C', 72],
                                 ['10', 74, 'C', 18],
                                 ['11', 63, 'C', 18]],
                                columns=['id', 'weight', 'type', 'age'])
df.dtypes
id object
weight int64
type object
age int64
dtype: object
# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()
# prepare data for modeling by splitting into train and test
# use only standardization means/standard deviations from the TRAINING SET only
# and apply them to the testing set as to avoid information leakage from training set into testing set
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Your X_train and X_test are still slices of the original dataframe. Modifying a slice triggers the warning and often doesn't work.
You can either transform before train_test_split, or do X_train = X_train.copy() after the split and then transform.
The second approach also prevents the information leak mentioned in your comments. So something like this:
# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy() # don't you drop the label?
# y: pd.Series = df.pop('type') # y = df['type']
# pass them directly instead
features = [c for c in df if c != 'type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'],
                                                    test_size=0.2,
                                                    random_state=0)

# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()

## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need to copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = X_train_scaler.transform(X_test[numeric_cols])
I'll also explain both pd.get_dummies() and OneHotEncoder() for transforming categorical data into dummy columns. I recommend using the OneHotEncoder() transformer, because it is a sklearn transformer that you can later use in a Pipeline if you want.
First, OneHotEncoder(): it does the same job as pandas' pd.get_dummies function, but the return value of this class is a NumPy ndarray or a sparse array. You can read more about this class here:
from sklearn.preprocessing import OneHotEncoder

X_train_cat = X_train[["type"]]
cat_encoder = OneHotEncoder(sparse=False)
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat)  # This is a numpy ndarray!

# If you want to make a DataFrame again, you can do so like below:
# X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
# You can also concatenate this transformed dataframe with your numerical transformed one.
Second method, pd.get_dummies():
df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)
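Since OneHotEncoder plays nicely with pipelines, here is a minimal sketch (the column lists and the preprocess name are my own placeholders, not from your code) of doing the scaling and the one-hot encoding together in a ColumnTransformer, which also sidesteps the copy/slice issue entirely:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# placeholder column lists: numeric features and categorical features, label excluded
num_cols_model = ['weight', 'age']
cat_cols_model = ['id']

# scale numeric columns and one-hot encode categorical ones in a single transformer;
# fit on the training split only, then reuse it on the test split
preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols_model),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_model),
])

X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)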

How to deal with changing cardinality in sci-kit learn model

I am trying to use a high cardinality feature (siteid) in a sci-kit learn model and am using get_dummies to one-hot encode this feature. I get around 800 new binary columns which returns a decent accuracy using logistic regression. My problem is that when I pass a new dataset through my model I have a different cardinality on this feature with say 300 unique values and the model rightly asks, where are the other 500 columns you trained me on? How can I resolve this?
I don't want to have to retrain the model every time the cardinality changes, nor do I want to hard-code these columns in my SQL data load.
cat_columns = ["siteid"]
df = pd.get_dummies(df, prefix_sep="__",
columns=cat_columns)
My recommendation would be to pad the remaining columns with zeros. So if your new data has, for example, 10 unique values and the model expects 50 columns (the total_cols below), then create 40 columns of zeros on the right to "fill out" the rest of the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"siteid": range(10)})
cat_columns = ["siteid"]
df1 = pd.get_dummies(df, columns=cat_columns)
# df1 has shape (10, 10)
total_cols = 50 # Number of columns that model expects
zero_padding = pd.DataFrame(np.zeros((df1.shape[0], total_cols - df1.shape[1])))
df = pd.concat([df1, zero_padding], axis=1)
df.columns = ["siteid__" + str(i) for i in range(df.shape[1])]
# df now has shape (10, 50)
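A variation on the same padding idea, if you keep the list of dummy column names seen during training (train_columns and df_new below are my own names, not from the question), is to align by column name rather than by count, so each category lands in the right position and columns the model never saw are dropped:
import pandas as pd

# train_columns is assumed to hold the dummy column names produced at training time
new_dummies = pd.get_dummies(df_new, prefix_sep="__", columns=["siteid"])

# reindex adds any missing training columns as zeros and drops columns
# the model has never seen
new_dummies = new_dummies.reindex(columns=train_columns, fill_value=0)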
I suggest using scikit-learn's OneHotEncoder.
documentation here
In your case, the usage would look something like
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)  # dense output so it can be assigned back to DataFrame columns
enc.fit(df[['cat_columns']])
categories = [cat for cats in enc.categories_ for cat in cats]
df[categories] = enc.transform(df[['cat_columns']])
The handle_unknown parameter is key, and the fitted enc object is what you reuse to get repeatable encoding on new data.
On new dataframes you would run
df_new[categories] = enc.transform(df_new[['cat_columns']])
This will hot-encode the same categories and ignore any new ones that your model is not accustomed to.
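As a small illustration of what handle_unknown='ignore' does (toy values of my own): a category that was not present during fit is encoded as an all-zero row instead of raising an error:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"siteid": ["a", "b", "c"]})
new = pd.DataFrame({"siteid": ["b", "z"]})   # "z" was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(train[["siteid"]])

print(enc.transform(new[["siteid"]]))
# [[0. 1. 0.]
#  [0. 0. 0.]]   <- the unseen category "z" maps to all zeros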

Is there a way to fit a month-wise date column into a multivariate linear regression model with categorical data?

I tried multivariate linear regression with categorical variables.
I used the one-hot encoder technique to solve the problem but got the error below.
I had also tried converting the date string to a timestamp using the pd.to_datetime() function, but that gave another error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
So I removed that and got back to solving the actual error below in some alternative way.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dfle = df  # df contains the columns 'Section', 'BRAND', 'RSP', 'Monthstartdate' and 'Sales' (to be predicted)

dfle.Section = le.fit_transform(dfle.Section)  # categorical, 2 unique values
dfle.BRAND = le.fit_transform(dfle.BRAND)      # categorical, 390 unique values

X = dfle[['Section', 'BRAND', 'RSP', 'Monthstartdate']].values
y = dfle.Sales

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categorical_features=[0])
X = ohe.fit_transform(X).toarray()
The expected result was that the array would be fit properly, but I am getting this error:
Error -
----> X = ohe.fit_transform(X).toarray()
ValueError: could not convert string to float: '01/06/2016'
('01/06/2016' is a string in this case, not a timestamp. It would have been great if it were a timestamp and worked with the regression problem.)
You need to apply a label encoder (or a vectorizer/tokenizer) to all values of the column before passing it to the one-hot encoder.
The code below works fine on my side:
import pandas as pd
df = pd.read_csv('yourfile.txt',delimiter=',')
df.head()
dfle = df.monthstartdate
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dfle = le.fit_transform(dfle)
feature = dfle
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categorical_features = [0])
feature = feature.reshape(feature.shape[0], 1)
feature = ohe.fit_transform(feature)
print (feature)
I think you have missed
dfle.monthstartdate = le.fit_transform(dfle.monthstartdate) in your code?
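If you specifically want the month as a model input rather than label-encoding the raw date string, one alternative sketch (assuming the dates are day/month/year strings like '01/06/2016', which is my reading of your example) is to parse the column with pd.to_datetime and keep only the numeric month, which avoids the Timestamp error entirely:
import pandas as pd

# parse the date strings; dayfirst=True assumes '01/06/2016' means day/month/year
df['Monthstartdate'] = pd.to_datetime(df['Monthstartdate'], dayfirst=True)

# replace the Timestamp column with a plain integer month (1-12) that regression can use
df['Month'] = df['Monthstartdate'].dt.month

X = df[['Section', 'BRAND', 'RSP', 'Month']].values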

How can we predict target values for new data, based on a different dataset? scikit learn / gaussianNB

I am struggling to understand how training our algorithms connects with making predictions on new data.
My situation: I have an algorithm that I use on a labeled dataset. After importing it, encoding it, fit_transforming it and fitting the model to make predictions on the data_test from the train_test_split function, I get a really nice prediction using the labeled dataset.
I am stumped as to how I should feed a new dataset (unlabeled this time) to the trained model, which has learned from the labeled dataset. I know that technically the training data had its labels withheld during prediction, but I don't understand how to give the GaussianNB algorithm new feature data so it can predict the unknown labels.
My code for the training:
df = pd.read_csv(chosen_file, sep=',')
cat_cols = df.select_dtypes(include=['object'])
cat_cols_filled = cat_cols.fillna('0')
le = LabelEncoder()
cat_cols_fitted = cat_cols_filled.apply(lambda col: le.fit_transform(col))
non_cat_cols = df.select_dtypes(exclude=['object'])
non_cat_cols_filled = non_cat_cols.fillna('0')
non_cat_cols_fitted = non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
target_prep = df.iloc[:,-1]
target = le.fit_transform(target_prep.astype(str))
data = pd.concat([cat_cols_fitted, non_cat_cols_fitted], axis=1)
data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)
alg = GaussianNB()
pred = alg.fit(data_train, target_train).predict(***data_test***)
This is all fine and dandy. But I cannot understand how I have to give something in place of data_test. Do I need to provide the new dataset with some placeholder values for the label column? My label column from the beginning dataframe is the last one.
My attempt:
new_df = pd.read_csv(new_chosen_file, sep=',')
new_cat_cols = new_df.select_dtypes(include=['object'])
new_cat_cols_filled = new_cat_cols.fillna('0')
new_cat_cols_fitted = new_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_non_cat_cols = new_df.select_dtypes(exclude=['object'])
new_non_cat_cols_filled = new_non_cat_cols.fillna('0')
new_non_cat_cols_fitted = new_non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_data = pd.concat([new_cat_cols_fitted, new_non_cat_cols_fitted], axis=1)
print(new_data)
new_pred = alg.predict(new_data)
new_prediction = pd.DataFrame({'NEW ML prediction':new_pred})
print(new_pred)
print(new_prediction)
Notice I do not provide the target column in the new dataset. However, the program errors out on me if my column count does not match, so I am forced to add at least the label column header for it not to do that:
Am I way off in my understanding of how this works? Please let me know.
EDIT:
I found my major screw-up in the code. I had not isolated my target column out of the data DataFrame. This was why data had a 10-column shape.
I can finally appreciate the simplicity of the code.
You are instantiating an empty model as alg and assigning only the prediction from the fitted model to a variable named pred, so you never keep the fitted model as its own explicit step.
Chaining multiple methods such as
alg.fit(data_train, target_train).predict(***data_test***) is known as method chaining and can cause confusion.
A cleaner and more readable alternative is:
alg = GaussianNB() # initiating model
alg = alg.fit(data_train, target_train) # fitting model with train data
pred = alg.predict(***data_test***) # testing with test data
new_pred = alg.predict(new_data) # test with new data
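To make the workflow concrete end to end, here is a minimal sketch that reuses target_prep, data and new_data from your code; target_le is my own name for a dedicated target encoder, kept separate so the numeric predictions can be mapped back to the original label names:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# dedicated encoder for the target, so it is not overwritten when feature columns are encoded
target_le = LabelEncoder()
target = target_le.fit_transform(target_prep.astype(str))

data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)

alg = GaussianNB()
alg = alg.fit(data_train, target_train)   # fit once on the labeled training data

# new_data holds the same feature columns as data_train, with no label column
new_pred = alg.predict(new_data)

# map the numeric class predictions back to the original label names
new_prediction = pd.DataFrame({'NEW ML prediction': target_le.inverse_transform(new_pred)})
print(new_prediction)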
