Mlxtend: SequentialFeatureSelector with ColumnTransformer gives AttributeError - python

I'm trying to use mlxtend's SequentialFeatureSelector() in combination with a pipeline that uses ColumnTransformer(). I use the ColumnTransformer() to apply power transformations (via PowerTransformer()) only to the numeric variables, not to the binary ones. My problem is that I either get the error AttributeError: 'numpy.ndarray' object has no attribute 'columns', or the results make no sense.
If I define the numeric features by name, numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], then I get the AttributeError: 'numpy.ndarray' object has no attribute 'columns':
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5']
# numeric_features = [0,1,2,3,4,]
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop'))
pipe = make_pipeline((preprocessor), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
If I define numeric_features by position instead, numeric_features = [0, 1, 2, 3, 4], then I get incorrect results (the same value for every iteration), which also include NaNs. See the screenshots below.
[Screenshot: Results of the SequentialFeatureSelector() - Part 1]
[Screenshot: Results of the SequentialFeatureSelector() - Part 2]
The problem lies only in the line preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop')). If, for example, I remove the column-specific transformation and apply the power transformation to the whole dataset, the code works and the results look correct:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
pipe = make_pipeline((numeric_transformer), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
Results:
[Screenshot: Working results of SFS - Part 1]
[Screenshot: Working results of SFS - Part 2]
Does anyone have an idea how this could be solved? Any help would be much appreciated.
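One possible workaround (a sketch of mine, not from the original post): judging by the AttributeError, mlxtend's SFS hands each candidate feature subset to the pipeline as a plain numpy array, so string column names no longer exist, and fixed integer positions stop lining up with the intended columns once a subset has been selected. A custom transformer that re-detects the non-binary columns at every fit sidesteps both problems (note that, unlike remainder='drop' above, it passes binary columns through unchanged):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class PowerTransformNonBinary(BaseEstimator, TransformerMixin):
    # Hypothetical helper: apply Yeo-Johnson only to columns with more than
    # two distinct values, re-detected at each fit, so it survives SFS
    # slicing X down to an arbitrary subset of columns.
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.numeric_idx_ = [i for i in range(X.shape[1])
                             if len(np.unique(X[:, i])) > 2]
        self.pt_ = PowerTransformer(method='yeo-johnson', standardize=False)
        if self.numeric_idx_:
            self.pt_.fit(X[:, self.numeric_idx_])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        if self.numeric_idx_:
            X[:, self.numeric_idx_] = self.pt_.transform(X[:, self.numeric_idx_])
        return X

# drop-in replacement for the preprocessor above:
# pipe = make_pipeline(PowerTransformNonBinary(),
#                      LogisticRegression(max_iter=10000, penalty='none'))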

Related

rfecv.fit() in python not accepting my x and y arguments

I am really new to Python so I don't know a lot of basics, but I have a college report that has to be done in Python and I'm struggling with figuring out how to resolve the issues in my code.
First I created my training data for X and y, then transformed it into a pandas DataFrame so I can call ols from statsmodels on it for my initial model. Now I want to use RFE to reduce my model, and I'm starting with RFECV so I can determine how many features I want RFE to select. But every time I run the code I have an issue with rfecv.fit().
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe = make_pipeline(StandardScaler())
pipe.fit(X_train, y_train)
#recombine training dataset in order to call ols
X_train_df = pd.DataFrame(X_train)
y_train_df = pd.DataFrame(y_train)
traindata = pd.concat([X_train_df.reset_index(drop=True), y_train_df.reset_index(drop=True)], axis=1)
#create first linear model
from statsmodels.formula.api import ols
model1 = ols('Tenure ~ Population + Children + Age + Income + Outage_sec_perweek + Email + Contacts + Yearly_equip_failure + MonthlyCharge + Bandwidth_GB_Year + Area_Suburban + Area_Urban + Marital_Married + Marital_Never_Married + Marital_Separated + Marital_Widowed + Gender_Male + Gender_Nonbinary + Churn_Yes + Contract_One_Year + Contract_Two_Year', data=traindata).fit()
print(model1.params)
#RFECV to determine number of variables to include for the optimal model
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
svc = SVC(kernel="linear")
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold, scoring = "accuracy")
rfecv.fit(X_train_df, y_train_df)
The output error looks like this:
TypeError: Singleton array array(None, dtype=object) cannot be considered a valid collection.
Any help or resources would be really appreciated! Thanks
You need to pass cv = StratifiedKFold() instead of cv = StratifiedKFold, so the below will work:
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(), scoring = "accuracy")
Or if you want 10 folds (default is 5):
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(n_splits=10), scoring = "accuracy")
You can check out the difference between having / not having the parentheses in posts like this or this.
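To see the difference at a glance, here is a minimal illustration (my addition, not from the original answer): the bare name is the class itself, which is not a CV splitter, while calling it constructs a splitter instance that RFECV can consume.

from sklearn.model_selection import StratifiedKFold

print(StratifiedKFold)    # <class 'sklearn.model_selection._split.StratifiedKFold'> - just the type
print(StratifiedKFold())  # StratifiedKFold(n_splits=5, random_state=None, shuffle=False) - a usable splitter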

Scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')" even with a pipeline

I'm trying to use SequentialFeatureSelector, and for the estimator parameter I'm passing it a pipeline that includes a step that imputes the missing values:
model = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('pipeline-1', Pipeline(steps=[
            ('imputing', SimpleImputer(fill_value=-1, strategy='constant')),
            ('preprocessing', StandardScaler())]),
         <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
        ('pipeline-2', Pipeline(steps=[
            ('imputing', SimpleImputer(fill_value='missing', strategy='constant')),
            ('encoding', OrdinalEncoder(handle_unknown='ignore'))]),
         <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
    ('model', LGBMClassifier(class_weight='balanced', random_state=1, reg_lambda=0.1))])
Nonetheless, when I pass this to the selector it raises an error, which makes no sense to me since I have already fit and evaluated this exact model and it runs fine:
fselector = SequentialFeatureSelector(estimator = model, scoring= "roc_auc", cv = 3, n_jobs= -1, ).fit(X, target)
The traceback ends in sklearn's _assert_all_finite(X, allow_nan, msg_dtype) check with:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
EDIT:
Reproducible example:
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
X, y = load_iris(return_X_y = True)
X[:10,0] = np.NaN
clf = Pipeline([("preprocessing", SimpleImputer(missing_values= np.NaN)),("model",LogisticRegression(random_state = 1))])
SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)
It raises the same error, even though clf on its own fits without problems; presumably the selector validates X before the data ever reaches the pipeline.
Scikit-learn's documentation does not state that SequentialFeatureSelector works with pipeline objects; it only states that the class accepts an unfitted estimator. In view of this, you could remove the classifier from your pipeline, preprocess X, and then pass it along with an unfitted classifier for feature selection, as shown in the example below.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler
X, y = load_iris(return_X_y = True)
X[:10,0] = np.NaN
pipe = Pipeline([("preprocessing", SimpleImputer(missing_values= np.NaN)),
('scaler', MaxAbsScaler())])
# Preprocess your data
X = pipe.fit_transform(X)
# Run the SequentialFeatureSelector
sfs = SequentialFeatureSelector(estimator = LogisticRegression(),
scoring= "accuracy",
cv = 3).fit(X, y)
# Check which features are important and transform X
sfs.get_support()
X = sfs.transform(X)
Alternatively, you can use SequentialFeatureSelector from the mlxtend package, which does accept a pipeline as the estimator:
https://rasbt.github.io/mlxtend/
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np
X, y = load_iris(return_X_y = True)
X[:10,0] = np.NaN
clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.NaN)),
    ("model", LogisticRegression(random_state=1))
])
sfs = SequentialFeatureSelector(estimator=clf,
                                forward=True,
                                k_features='best',
                                scoring="accuracy",
                                cv=3, n_jobs=-1).fit(X, y)
sfs.k_feature_idx_
>>> (0, 1, 2, 3)

How can I get the param names for TargetEncoder for a grid search?

I have the below scenario:
preprocess = make_column_transformer(
(SimpleImputer(strategy='constant',fill_value = 0),numeric_cols),
(ce.TargetEncoder(),['country'])
)
pipeline = make_pipeline(preprocess,XGBClassifier())
pipeline[0].get_params().keys()
dict_keys(['n_jobs', 'remainder', 'sparse_threshold', 'transformer_weights', 'transformers', 'verbose', 'simpleimputer', 'targetencoder', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'targetencoder__cols', 'targetencoder__drop_invariant', 'targetencoder__handle_missing', 'targetencoder__handle_unknown', 'targetencoder__min_samples_leaf', 'targetencoder__return_df', 'targetencoder__smoothing', 'targetencoder__verbose'])
I then wish to do a grid search on the smoothing factor:
param_grid = {
    'xgbclassifier__learning_rate': [0.01, 0.005, 0.001],
    'targetencoder__smoothing': [1, 10, 30, 50]
}
pipeline = make_pipeline(preprocess,XGBClassifier())
# Initialize grid search model
clf = GridSearchCV(pipeline, param_grid=param_grid, scoring='neg_mean_squared_error',
                   verbose=1, iid=True,
                   refit=True, cv=3)
clf.fit(X_train,y_train)
However, I get this error:
ValueError: Invalid parameter transformer_targetencoder for estimator Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers...
How can I access the smoothing parameter?
Using your example, it will be columntransformer__targetencoder__smoothing . To reproduce the pipeline, first I use an example dataset and define the columns:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
X_train = pd.DataFrame({'x1': np.random.normal(0, 1, 50),
                        'x2': np.random.normal(0, 1, 50),
                        'country': np.random.choice(['A', 'B', 'C'], 50)})
y_train = np.random.binomial(1,0.5,50)
numeric_cols = ['x1','x2']
preprocess = make_column_transformer(
(SimpleImputer(strategy='constant',fill_value = 0),numeric_cols),
(ce.TargetEncoder(),['country'])
)
pipeline = make_pipeline(preprocess,XGBClassifier())
You should look at the keys at a higher level:
pipeline.get_params().keys()
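For instance, to pull just the TargetEncoder entries out of that rather long key list, a small filter helps (my addition, not from the original answer):

# fully-qualified TargetEncoder parameter names, ready for the param_grid
print([k for k in pipeline.get_params() if 'targetencoder' in k])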
Then set up the grid, making sure that the smoothing values are floats (see this issue):
param_grid = {'columntransformer__targetencoder__smoothing': [1.0, 10.0],
              'xgbclassifier__learning_rate': [0.01, 0.001]}
pipeline = make_pipeline(preprocess,XGBClassifier())
clf = GridSearchCV(pipeline, param_grid=param_grid, scoring='neg_mean_squared_error',
                   verbose=1, refit=True, cv=3)
clf.fit(X_train,y_train)
And it should work.
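Once the fit finishes, the winning combination is available through the standard GridSearchCV attributes:

print(clf.best_params_)  # e.g. {'columntransformer__targetencoder__smoothing': 1.0, ...}
print(clf.best_score_)   # mean cross-validated score of the best candidate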

What can I do to change a dot into a comma?

Good morning! I'm new to Python; I use Spyder 4.0 to build a neural network.
In the script below I use a random forest to compute feature importances, so the values in importances tell me how important each feature is. Unfortunately I can't upload the dataset, but I can tell you that there are 18 features and 1 label, all physical quantities, and it's a regression problem.
I want to export the variable importances to an Excel file, but when I do (simply copying the vector) the numbers keep the dot as decimal separator (e.g. 0.012, 0.015, ...). To use them in the Excel file I would prefer a comma instead of the dot.
I tried to use .replace('.',',') but it doesn't work; the error is:
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I think it happens because the vector importances is a numpy array of float64 with shape (18,).
What can I do?
Thanks.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
dataset = pd.read_csv('Dataset.csv', decimal=',', delimiter = ";")
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.20, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
sel = RandomForestRegressor(n_estimators = 200,max_depth = 9, max_features = 5, min_samples_leaf = 1, min_samples_split = 2,bootstrap = False)
sel.fit(X_train, y_train)
importances = sel.feature_importances_
# sel.fit(X_train, y_train)
# a = []
# for feature_list_index in sel.get_support(indices=True):
#     a.append(feat_labels[feature_list_index])
#     print(feat_labels[feature_list_index])
# X_important_train = sel.transform(X_train1)
# X_important_test = sel.transform(X_test1)
I will try to show you an example of what you should do, using some random values. I ran this in the Python shell, which is why you also see the ">>>".
>>> import numpy as np # first I import numpy as "np"
# I generate 10 random values and I store them in "importance"
>>> importance=np.random.rand(10)
# here I just want to see the content of "importance"
>>> importance
array([0.77609076, 0.97746829, 0.56946118, 0.23986983, 0.93655692,
0.22003531, 0.7711095 , 0.36083248, 0.58277805, 0.57865248])
# here there is your error that I reproduce for teaching purpose
>>>importance.replace(".", ",")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
What you need to do is convert the elements of "importance" to a list of strings:
>>> imp_astr=[str(i) for i in importance]
>>> imp_astr
['0.7760907642658763', '0.9774682868805988', '0.569461184647781', '0.23986982589422634', '0.9365569207431337', '0.22003531170279356', '0.7711094966708247', '0.3608324767276052', '0.5827780487688116', '0.5786524781334242']
# at the end, for each string, you can use the "replace" function
>>> imp_astr=[i.replace(".", ",") for i in imp_astr]
>>> imp_astr
['0,7760907642658763', '0,9774682868805988', '0,569461184647781', '0,23986982589422634', '0,9365569207431337', '0,22003531170279356', '0,7711094966708247', '0,3608324767276052', '0,5827780487688116', '0,5786524781334242']
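A shorter alternative (my suggestion, not part of the original answer) is to let pandas write the comma decimal directly when exporting, since to_csv accepts a decimal argument:

import numpy as np
import pandas as pd

importance = np.random.rand(10)
# ';' as the field separator and ',' as the decimal mark, which is what
# Excel expects in many European locales
pd.Series(importance, name='importance').to_csv('importances.csv',
                                                sep=';', decimal=',')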

Getting Error About Arrays Must Be the Same Length

I'm pulling data from a CSV into a DataFrame and running the code below, and I'm getting an error about the arrays needing to be the same length:
# Import the necessary packages
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Define a normalizer
normalizer = Normalizer()
# Fit and transform
norm_movements = normalizer.fit_transform(dfMod)
# Create Kmeans model
kmeans = KMeans(n_clusters = 10,max_iter = 1000)
# Make a pipeline chaining normalizer and kmeans
pipeline = make_pipeline(normalizer,kmeans)
# Fit pipeline to daily stock movements
pipeline.fit(dfMod)
labels = pipeline.predict(dfMod)
print(len(labels), len(dfMod))
df1 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod)}).sort_values(by=['labels'],axis = 0)
# now...with PCA reduction
# Define a normalizer
normalizer = Normalizer()
# Reduce the data
reduced_data = PCA(n_components = 2)
# Create Kmeans model
kmeans = KMeans(n_clusters = 10,max_iter = 1000)
# Make a pipeline chaining normalizer, pca and kmeans
pipeline = make_pipeline(normalizer,reduced_data,kmeans)
# Fit pipeline to daily stock movements
pipeline.fit(dfMod)
# Prediction
labels = pipeline.predict(dfMod)
# Create dataframe to store companies and predicted labels
df2 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod.keys())}).sort_values(by=['labels'],axis = 0)
This line throws the error.
df1 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod)}).sort_values(by=['labels'],axis = 0)
The weird thing is that this prints 50k and 50k:
print(len(labels), len(dfMod))
50000 50000
Am I missing something here? How can I make this work? Thanks!!
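One observation that may explain the mismatch (my note; the question has no answer here): iterating a DataFrame yields its column labels, not its rows, so list(dfMod) has length equal to the number of columns, even though len(dfMod) is the row count. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': range(50000), 'b': range(50000)})
print(len(df))        # 50000 -> number of rows
print(len(list(df)))  # 2     -> column names, not rows
# pd.DataFrame({'labels': labels, 'dfMod': list(df)}) therefore mixes a
# 50000-long array with a 2-element list and raises the length error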
