sklearn2pmml omits field names - python

I export an instance of sklearn.preprocessing.StandardScaler into a PMML file. The problem is that the field names do not appear in the PMML file: e.g., when using the iris dataset, the original field names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] do not appear. Instead, only names like x1, x2, etc. appear. Is there a way to get the original field names into the PMML file?
The following code should be runnable:
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
ssModel.fit(dfIris)
pipe = PMMLPipeline([("StandardScaler", ssModel)])
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
In ssIris.pmml I see only the generic field names x1, x2, and so on.

First, I believe you want to fit the PMMLPipeline after initialization, i.e. call pipe.fit(dfIris) instead of fitting ssModel beforehand. To preserve the column names, add a pass-through preprocessing step before the scaler that uses DataFrameMapper to map each pandas DataFrame column to itself (None as the transformation); the pipeline needs such a preprocessing step in order to keep the column names. I am not sure whether this is the best way, but I checked it and it preserved the column names.
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd

data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
# map each column to itself (None = no transformation), keeping the names
pipe = PMMLPipeline([("df_mapper",
                      DataFrameMapper([(d, None) for d in data.feature_names],
                                      df_out=True)),
                     ("StandardScaler", ssModel)])
pipe.fit(dfIris)
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")

The only component that comes into contact with the dfIris data frame (which holds the feature name information) is the StandardScaler.fit(X) method, and this method does not collect or store incoming feature names in any way.
The SkLearn2PMML package gets feature names from the value of the PMMLPipeline.active_fields attribute. Right now it is missing, which is why SkLearn2PMML falls back to the default feature names "x1", "x2", .., "xn".
This attribute is set automatically during the PMMLPipeline.fit(X, y) method invocation. Alternatively, you may set/reset this attribute manually at any time.
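For instance, a minimal sketch of the manual route, assuming the pipe object from the question's code:

import numpy
# Hedged sketch: PMMLPipeline.fit stores the column names as a numpy array,
# so we mirror that shape when setting the attribute by hand
pipe.active_fields = numpy.asarray(dfIris.columns)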
If you're constructing a PMMLPipeline object using the sklearn2pmml.make_pmml_pipeline utility method, then this method also takes active_fields and target_fields arguments. Note that in your example code you construct a PMMLPipeline manually and then wrap it into a new PMMLPipeline object via this utility call. This is redundant, and it actually masks any feature/target names that might have been set there.
A much better example:
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml, PMMLPipeline
data = load_iris()
iris_X = DataFrame(data = data.data, columns = data.feature_names)
iris_y = None
pipeline = PMMLPipeline([
    ("ss", StandardScaler())
])
pipeline.fit(iris_X, iris_y)
sklearn2pmml(pipeline, "ssIris.pmml")
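Alternatively, a sketch using the make_pmml_pipeline utility with explicit names; here the bare transformer is wrapped (not an already-constructed PMMLPipeline, which would be redundant as noted above):

from sklearn2pmml import make_pmml_pipeline

ss = StandardScaler()
ss.fit(iris_X)
# make_pmml_pipeline records the field names on the wrapper it creates
pipeline = make_pmml_pipeline(ss, active_fields = data.feature_names)
sklearn2pmml(pipeline, "ssIris.pmml")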

Related

Creating a Decision Tree in Python, Numerical and Categorical Variables: "Unable to coerce to Series"

I'm creating a Decision Tree, and the dataset has 21 columns, a mix of numeric and categorical variables. I understand sklearn does not support categorical variables directly, so I converted the categorical columns to numeric using label encoding while keeping the numeric variables separate. I would then think I'd have to add both groups back together so I can split into testing and training data. However, when I tried to add the two together (the originally numeric variables and the categorical variables converted to numeric), I received a ValueError.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
credit = pd.read_csv('german_credit_risk.csv')
credit.head(10)
credit.info()
credit.describe(include='all')
col_names = ['Duration', 'Credit.Amount', 'Disposable.Income', 'Present.Residence', 'Age', 'Existing.Credits', 'Number.Liable', 'Cost.Matrix']
obj_cols = list(credit.select_dtypes(include='O').columns)
obj_cols
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_obj_df = pd.DataFrame(columns=obj_cols)
for col in obj_cols:
    encoded_obj_df[col] = le.fit_transform(credit[col])
encoded_obj_df.head(10)
credit.columns = col_names + encoded_obj_df
The last line raises a ValueError.
Do I have the right idea and I'm just not adding the two together properly?
The error occurs because you are adding a list of strings to a DataFrame and trying to assign the result of that operation to the column names of another DataFrame.
You would need to concatenate the two data frames (the one with only numerical columns and the one with label-encoded values) along axis 1 with the pd.concat function.
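A minimal sketch of that approach, reusing the names from the question (this assumes col_names matches the numeric columns of credit and that both frames share the same row index):

import pandas as pd
# put the numeric columns and the label-encoded columns side by side
combined = pd.concat([credit[col_names], encoded_obj_df], axis=1)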
However, as you are using scikit-learn, I would advise you to use it to the full extent. There are Pipeline and ColumnTransformer classes that can help you with the task of preprocessing and classification.
A Pipeline combines a sequence of scikit-learn transformers, so you don't need to pass the data through each component yourself.
A ColumnTransformer selects the data and applies the given transformers to the given column slices, then automatically combines the processed (and remaining) columns into a single numpy array.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

# LabelEncoder only accepts a single 1-D target array, so it cannot be used
# inside a ColumnTransformer; OrdinalEncoder is its multi-column counterpart
# for feature matrices
clf = make_pipeline(
    ColumnTransformer(
        [('categorical', OrdinalEncoder(), credit.select_dtypes(include='O').columns)],
        remainder='passthrough'
    ),
    DecisionTreeClassifier()
)
You can then call the standard clf.fit and clf.predict on the resulting pipeline, and all of the data processing and prediction happens in one step.
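A hypothetical usage sketch; 'Risk' is a stand-in for whatever the target column in german_credit_risk.csv is actually called, and it is assumed not to be among the object columns selected by the ColumnTransformer:

y = credit['Risk']  # placeholder target column name, not confirmed by the question
X = credit.drop(columns='Risk')
clf.fit(X, y)
predictions = clf.predict(X)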

Creating a pipeline for one-hot encoded variables not working

I have a problem where I am trying to apply transformations to my categorical feature 'country' and the rest of my numerical columns. How can I do this? Here is what I am trying:
preprocess = make_column_transformer(
    (numeric_cols, make_pipeline(MinMaxScaler())),
    (categorical_cols, OneHotEncoder()))
model = make_pipeline(preprocess, XGBClassifier())
model.fit(X_train, y_train)
Note that numeric_cols is passed as a list, and so is categorical_cols.
However, I get this error: TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers, followed by the list of all my numerical columns and "(type <class 'list'>) doesn't".
What am I doing wrong? Also, how can I deal with unseen categories in the column 'country'?
You need to put the transformer first and the columns after it as subsequent arguments; if you check the help page, the signature is:
sklearn.compose.make_column_transformer(*transformers, **kwargs)
Something like the below will work:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

X = pd.DataFrame({'x1': np.random.uniform(0, 1, 5),
                  'x2': np.random.choice(['A', 'B'], 5)})
y = pd.Series(np.random.choice(['0', '1'], 5))

numeric_cols = X.select_dtypes('number').columns.to_list()
categorical_cols = X.select_dtypes('object').columns.to_list()

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), categorical_cols)
)
model = make_pipeline(preprocess, XGBClassifier())
model.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('minmaxscaler',
                                                  MinMaxScaler(), ['x1']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['x2'])])),
                ('xgbclassifier', XGBClassifier())])
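As for the second part of the question, OneHotEncoder accepts handle_unknown='ignore', which encodes categories never seen during fit as all-zero vectors instead of raising an error. A sketch of the same preprocessor with that option:

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    # unseen categories at predict time get an all-zero encoding
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols)
)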

How to use LabelEncoder in sklearn make_column_transformer?

How do I use LabelEncoder in a sklearn pipeline?
NOTE
The following code works for OneHotEncoder but fails for LabelEncoder. How can LabelEncoder be used in this circumstance?
MWE
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
import sklearn
print(sklearn.__version__) # 0.22.2.post1
df = sns.load_dataset('titanic').head()
le = OneHotEncoder()  # this succeeds
# le = LabelEncoder()  # this fails
ct = make_column_transformer(
    (le, ['sex', 'adult_male', 'alone']),
    remainder='drop')
ct.fit_transform(df)
From the docs, OneHotEncoder can take a DataFrame and convert the categorical columns into the vectors you see. LabelEncoder takes a Series (your y / dependent variable) and generates new labels.
OneHotEncoder's usage: fit_transform(X[, y])
LabelEncoder's usage: fit_transform(y)
That's why it tells you: "fit_transform() takes 2 positional arguments but 3 were given".
Just call LabelEncoder's fit_transform on y directly if you really want to use it. Here is a similar question: How to use sklearn Column Transformer?
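A minimal sketch of that, using the titanic frame from the MWE:

# encode a single target column directly, outside the ColumnTransformer
y = LabelEncoder().fit_transform(df['survived'])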
Here are the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
LabelEncoder was specifically designed for encoding the target variable y, which is why you can't use it to transform multiple columns at once the way OneHotEncoder does.
Sklearn provides OrdinalEncoder for such circumstances: it can encode multiple feature columns at once.
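So the MWE works if OrdinalEncoder is swapped in for LabelEncoder; a sketch:

from sklearn.preprocessing import OrdinalEncoder
# OrdinalEncoder accepts a 2-D selection of feature columns
ct = make_column_transformer(
    (OrdinalEncoder(), ['sex', 'adult_male', 'alone']),
    remainder='drop')
ct.fit_transform(df)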

Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I'm trying to run RFECV on transformed data using scikit-learn.
For that, I create a pipeline and pass the pipeline to the RFECV. It works fine unless I have ColumnTransformer as a pipeline step. It gives me the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have checked the answer for this Question, but I'm not sure if they are applicable here. The code is as follows:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.DataFrame({'out': [i * 3 for i in range(100)]})

ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])

pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])

rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
)

# pipe.fit(X, y)  # the pipe itself can fit, no problems
rfecv.fit(X, y)  # this raises the ValueError above
Obviously, I can do this transformation step outside the pipeline and then use the transformed X, but I was wondering if there is any workaround.
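For reference, a sketch of that transform-outside-the-pipeline approach; remainder='passthrough' is added here so that 'col2' survives and RFECV has more than one feature to rank:

ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])],
                       remainder='passthrough')
X_transformed = ct.fit_transform(X)
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_transformed, y.values.ravel())  # plain array, so no column-name issue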
I'd also like to raise this as an RFECV design issue: it converts X to a numpy array as the very first thing it does, while other approaches with built-in cross-validation, e.g. GridSearchCV, do not.

Dealing with empty data points in Titanic Machine Learning train.csv

In the train.csv data of the Titanic Machine Learning project, some passengers have their age data missing, so the pandas module fills it in as NaN, and a sklearn algorithm will not accept it. I tried dataset.fillna('') but that turns it into an empty string, not a float. Please send help.
https://www.kaggle.com/c/titanic/data
import pandas as pd
# sklearn.cross_validation was removed in newer scikit-learn versions;
# use sklearn.model_selection instead
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('train.csv')
# dataset = dataset.fillna()

def preprocess(df):
    from sklearn.preprocessing import LabelEncoder
    processed_df = df.copy()
    le = LabelEncoder()
    done = le.fit_transform(processed_df)
    return done

survival = preprocess(dataset.Survived)
data = dataset.drop('Survived', axis=1)
data = data.drop('PassengerId', axis=1)
data = data.drop('Embarked', axis=1)
data = data.drop('Cabin', axis=1)
data = data.drop('Fare', axis=1)
data = data.drop('Ticket', axis=1)
data = data.drop('Name', axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    data, survival, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
pipeline.fit(x_train, y_train)
print(accuracy_score(pipeline.predict(x_test), y_test))
fillna replaces the NaN values with whatever you write, so if you write '', it will be an empty string. Just write:
dataset = dataset.fillna(0)
(fillna returns a new DataFrame, so assign the result back). If you need to distinguish between 0 and NaN, you can try replacing it with -1; that's what we do.
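For example, filling only the column that is actually missing here:

# Age is a float column, so -1 works as an explicit missing-value marker
dataset['Age'] = dataset['Age'].fillna(-1)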
There are many methods you can use to deal with missing values in a machine learning project:
drop every column with missing values
drop every row containing missing values
set the missing values to some value (zero, the mean, the median, etc.)
For the third option:
Scikit-Learn provides a handy class to take care of missing values: the imputer (sklearn.preprocessing.Imputer in older versions; removed in newer releases in favor of sklearn.impute.SimpleImputer). Here is how to use it. First, you need to create an imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.impute import SimpleImputer  # Imputer in very old scikit-learn versions
imputer = SimpleImputer(strategy="median")  # or "mean" as you want
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)  # transform only, to reuse the training medians
The result is a plain numpy array containing the transformed features. If you want to put it back into a pandas DataFrame, it's simple.
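A sketch of that, reusing the column labels from the question's code:

import pandas as pd
# wrap the imputed numpy array back into a labeled DataFrame
x_train = pd.DataFrame(x_train, columns=data.columns)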
NB: You could also add the imputer to the pipeline, just before the scaler:
pipeline = make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
