How to use LabelEncoder in sklearn pipeline?
NOTE
The following code works for "OneHotEncoder" but fails for "LabelEncoder". How can LabelEncoder be used in this circumstance?
MWE
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
import sklearn
print(sklearn.__version__) # 0.22.2.post1
df = sns.load_dataset('titanic').head()
le = OneHotEncoder()   # this succeeds
# le = LabelEncoder()  # this fails
ct = make_column_transformer(
    (le, ['sex', 'adult_male', 'alone']),
    remainder='drop')
ct.fit_transform(df)
From the docs, OneHotEncoder can take a dataframe and convert the categorical columns into the vectors you see. LabelEncoder takes a Series (your y / dependent variable) and generates new labels.
OneHotEncoder's usage: fit_transform(X[, y])
LabelEncoder's usage: fit_transform(y)
That's why it'll tell you: "fit_transform() takes 2 positional arguments but 3 were given"
Just call LabelEncoder's fit_transform on y directly if you really want to use it. Here is a similar question: How to use sklearn Column Transformer?
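For example, a minimal sketch encoding a target column directly (assuming the seaborn titanic 'alive' column, which holds yes/no strings, as the target):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-D array (the target), not a 2-D feature matrix
le = LabelEncoder()
y = le.fit_transform(df['alive'])  # e.g. array([0, 1, 1, 1, 0]) for no/yes labels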
Here are the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
LabelEncoder was specifically designed for encoding the target variable, y. That's why you can't use it to transform multiple columns at once, as you can with OneHotEncoder.
Sklearn provides OrdinalEncoder for such circumstances. It can encode multiple columns at once when encoding features.
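A minimal sketch of the MWE above with OrdinalEncoder swapped in for LabelEncoder (everything else unchanged):

import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

df = sns.load_dataset('titanic').head()
ct = make_column_transformer(
    (OrdinalEncoder(), ['sex', 'adult_male', 'alone']),  # encodes all three columns at once
    remainder='drop')
ct.fit_transform(df)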
Related
I export an instance of sklearn.preprocessing.StandardScaler to a PMML file. The problem is that the names of the fields do not appear in the PMML file. For example, when using the iris dataset, the original field names ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)'] do not appear; instead, only names like x1, x2, etc. appear. Is there a way to get the original field names into the PMML file?
The following code should be runnable:
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
ssModel.fit(dfIris)
pipe = PMMLPipeline([("StandardScaler", ssModel)])
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
In the ssIris.pmml I see only generic field names like x1, x2 instead of the originals.
First, I believe you want to fit the PMMLPipeline after initialization, so use pipe.fit(dfIris) instead of fitting ssModel beforehand. To preserve the column names, add a no-op preprocessing step before the scaler that uses DataFrameMapper to map the pandas DataFrame columns to sklearn transformations; the pipeline expects a preprocessing step like this in order to keep the column names. I am not sure whether this is the best way, but I checked it and it preserved the column names.
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
pipe = PMMLPipeline([("df_mapper",
                      DataFrameMapper([(d, None) for d in data.feature_names],
                                      df_out=True)),
                     ("StandardScaler", ssModel)])
pipe.fit(dfIris)
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
The only component that comes into contact with the dfIris data frame (which holds the feature name information) is the StandardScaler.fit(X) method. This method does not collect or store incoming feature names in any way.
The SkLearn2PMML package gets feature names from the value of the PMMLPipeline.active_fields attribute. Right now it's missing, which is why SkLearn2PMML falls back to the default feature names "x1", "x2", ..., "xn".
This attribute is automatically set during the PMMLPipeline.fit(X, y) method invocation. Alternatively, you may set/reset this attribute manually at any time.
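For example, a sketch of setting it manually on the pipeline from the question (assuming active_fields accepts an array of column-name strings, as PMMLPipeline.fit would set):

import numpy as np

# manually populate the attribute that PMMLPipeline.fit(X, y) would normally set
pipe.active_fields = np.array(data.feature_names)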
If you're constructing a PMMLPipeline object using the sklearn2pmml.make_pmml_pipeline utility method, then this method also takes active_fields and target_fields arguments. Please note that in your example code you have a manually constructed PMMLPipeline object, which you then wrap into a new PMMLPipeline object using this utility function call. This is redundant, and actually masks any feature/target names that were possibly set there.
A much better example:
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml, PMMLPipeline
data = load_iris()
iris_X = DataFrame(data = data.data, columns = data.feature_names)
iris_y = None
pipeline = PMMLPipeline([
("ss", StandardScaler())
])
pipeline.fit(iris_X, iris_y)
sklearn2pmml(pipeline, "ssIris.pmml")
I am building a model to train for binary classification. While processing the data before feeding it to the model, I came across this warning:
FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Here is my code
import torch
import torch.nn as nn
import matplotlib.pyplot as pyp
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.metrics import precision_score,recall_score,roc_curve,auc,roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
#loading the dataset
path='G:/My Drive/datasets/bank.csv'
df=pd.read_csv(path)
print(df.head(5))
print(df.shape)
# distribution of the target values
print("Distribution of Target Values in Dataset -")
df.deposit.value_counts()
# check if we have NA values in the dataset
df.isna().sum()
# extract the columns with string values
categorical_columns = df.select_dtypes(include='object').columns
print('categorical columns:', list(categorical_columns))
# for each categorical column, if the values are yes/no, convert them to a 1/0 flag
for col in categorical_columns:
    if df[col].nunique() == 2:
        df[col] = np.where(df[col] == 'yes', 1, 0)
print(df.head(5))
# for the remaining categorical columns that are not binary,
# create a one-hot encoded version of the dataset
new_df = pd.get_dummies(df)
#define the target and predictors for the model
target='deposit'
predictors=set(new_df.columns) - set([target])
print('new_df shape:',new_df.shape)
print(new_df[predictors].head())
The specific warning:
FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
print(new_df[predictors].head())
What could be raising this warning in my code, and how can I solve it?
You are trying to index new_df with predictors, which is a set.
Convert it to a list.
Example:
print(new_df[list(predictors)].head())
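Alternatively, build predictors as a list from the start so the warning never appears; a small sketch with the question's variable names:

# a list comprehension keeps the indexer a list, which pandas accepts
predictors = [col for col in new_df.columns if col != target]
print(new_df[predictors].head())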
I am creating a decision tree, and the dataset has 21 columns, a mix of numeric and categorical variables. I understand sklearn does not support categorical variables. I converted the categorical columns to numeric using label encoding while keeping the numeric variables separate. I would then think I'd have to add both groups back together so I can split into testing and training data. However, when I tried to combine the two (the originally numeric variables with the categorical variables converted to numeric), I received a ValueError.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
credit = pd.read_csv('german_credit_risk.csv')
credit.head(10)
image of output
credit.info()
image of output
credit.describe(include='all')
image of output
col_names = ['Duration', 'Credit.Amount', 'Disposable.Income', 'Present.Residence', 'Age', 'Existing.Credits', 'Number.Liable', 'Cost.Matrix']
obj_cols = list(credit.select_dtypes(include='O').columns)
obj_cols
image of output
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_obj_df = pd.DataFrame(columns=obj_cols)
for col in obj_cols:
    encoded_obj_df[col] = le.fit_transform(credit[col])
encoded_obj_df.head(10)
image of output
credit.columns = col_names + encoded_obj_df
ValueError
Do I have the right idea and I'm just not adding the two together properly?
The error occurs because you are adding a list of strings to a DataFrame and trying to assign the result of this operation to the column names of another DataFrame.
You need to concatenate the two data frames (the numeric-only one and the label-encoded one) along axis 1 with the pd.concat function.
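For example, a sketch using the variable names from the question (assuming both frames share the same row index):

import pandas as pd

# column-wise concatenation of the numeric columns and the label-encoded columns
credit_encoded = pd.concat([credit[col_names], encoded_obj_df], axis=1)
print(credit_encoded.head())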
However, as you are using scikit-learn, I would advise you to use it to the full extent. There are Pipeline and ColumnTransformer classes that can help you with the task of preprocessing and classification.
The Pipeline combines a sequence of sklearn transformers, so you don't need to pass the data to each component yourself.
The ColumnTransformer is used to select the data and apply the given transformers to the given data slices. It then automatically combines the processed (and remaining) data into a single np.array.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
clf = make_pipeline(
    ColumnTransformer(
        # OrdinalEncoder (rather than LabelEncoder, which only accepts y) can
        # encode several feature columns at once inside a ColumnTransformer
        [('categorical', OrdinalEncoder(), credit.select_dtypes(include='O').columns)],
        remainder='passthrough'
    ),
    DecisionTreeClassifier()
)
You can then use the standard clf.fit and clf.predict on the resulting pipeline, and all of the data processing and prediction will happen at once.
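For example, a usage sketch (here 'Risk' is a hypothetical name for the label column in your CSV, and it is assumed the object-dtype column list used in the ColumnTransformer covers only feature columns):

from sklearn.model_selection import train_test_split

# 'Risk' is a placeholder; substitute the actual label column of german_credit_risk.csv
X = credit.drop(columns=['Risk'])
y = credit['Risk']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)          # preprocessing + tree training in one call
print(clf.score(X_test, y_test))   # preprocessing + prediction on the test split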
Here are some examples of my dataset:
price,city,color,model,function,name
Mitsubishi Outlander,1396,200000,white,Mirdamad,665000000
BMW,1386,120000,pearl white,Tehran,820000000
Toyota Land Cruiser,1386,14000,white,Shahid Beheshti Street,1950000000
MVM,1385,0,black,Tehran,1290000000
Peugeot,1399,0,white,Tehran,310000000
I want to use the decision tree algorithm with name, function, model, color, and city as inputs, and predict the price as the result.
How do I use one-hot encoding to convert name, color, and city, and then use them together with model and function as input to DecisionTreeClassifier().fit() to predict the price?
If you have a better and faster method, I would be happy to hear it.
You can do this with scikit-learn. I've added links to the scikit-learn API documentation and a one-hot encoding sample.
from sklearn.preprocessing import OneHotEncoder
One hot encoding sample
Scikit Learn Website
import pandas as pd
df=pd.read_clipboard(sep=',')
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['price_cat'] = labelencoder.fit_transform(df['price'])
df['model_cat'] = labelencoder.fit_transform(df['model'])
df['function-cat'] = labelencoder.fit_transform(df['function'])
enc = OneHotEncoder(handle_unknown='ignore')  # one-hot encode model_cat (label-encoded values of model)
enc_df = pd.DataFrame(enc.fit_transform(df[['model_cat']]).toarray())
enc_df.columns = enc.get_feature_names(['model'])
df = df.join(enc_df)
df
Note: in older scikit-learn versions you needed to run LabelEncoder before OneHotEncoder, because OneHotEncoder could not handle string categorical variables; since scikit-learn 0.20, OneHotEncoder works on string columns directly.
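A sketch of the direct approach on recent versions, skipping the LabelEncoder step (using the column names from the question's dataset):

from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 0.20: OneHotEncoder accepts string categories directly
enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(df[['name', 'color', 'city']])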
I have a problem where I am trying to apply transformations to my categorical feature 'country' and to the rest of my numerical columns. How can I do this? Here is what I am trying:
preprocess = make_column_transformer(
    (numeric_cols, make_pipeline(MinMaxScaler())),
    (categorical_cols, OneHotEncoder()))
model = make_pipeline(preprocess,XGBClassifier())
model.fit(X_train, y_train)
Note that numeric_cols is passed as a list, and so is categorical_cols.
However, I get this error: TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers, followed by a list of all my numerical columns and (type <class 'list'>) doesn't.
What am I doing wrong? Also, how can I deal with unseen categories in the country column?
You need to put the transformer first, then the columns as subsequent arguments. If you check out the help page, it reads:
sklearn.compose.make_column_transformer(*transformers, **kwargs)
Something like below will work:
from sklearn.preprocessing import StandardScaler, OneHotEncoder,MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
X = pd.DataFrame({'x1':np.random.uniform(0,1,5),
                  'x2':np.random.choice(['A','B'],5)})
y = pd.Series(np.random.choice(['0','1'],5))
numeric_cols = X.select_dtypes('number').columns.to_list()
categorical_cols = X.select_dtypes('object').columns.to_list()
preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), categorical_cols)
)
model = make_pipeline(preprocess,XGBClassifier())
model.fit(X,y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('minmaxscaler',
                                                  MinMaxScaler(), ['x1']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['x2'])])),
                ('xgbclassifier', XGBClassifier())])
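As for the second part of the question, unseen values in country at prediction time can be handled with OneHotEncoder's handle_unknown parameter, which encodes unknown categories as all-zero rows instead of raising an error. A sketch of the same preprocessor with that option:

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    # categories never seen during fit are encoded as all zeros at transform time
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols)
)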