Scikit-Learn: how to perform best subset GLM Poisson Regression? - python

Can someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function performs best subset selection, or how to embed it in a Pipeline and GridSearchCV.
In addition, how do I include interaction terms among the features, to be selected by the best subset algorithm? And how do I embed this in the Pipeline and GridSearchCV?
from sklearn.linear_model import PoissonRegressor

continuous_transformer = Pipeline(steps=[('std_scaler', StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(transformers=[('continuous', continuous_transformer, continuous_col),
                                               ('discrete', discrete_transformer, discrete_col)],
                                 remainder='passthrough')
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('glm_model', PoissonRegressor(alpha=0, fit_intercept=True))])
param_grid = { ??? different combinations of features ????}
gs_en_cv = GridSearchCV(pipeline, param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1, return_train_score=True)

Currently, as far as I understand, sklearn has no "brute force" / exhaustive feature search for best subset selection. However, there are various classes:
https://scikit-learn.org/0.15/modules/classes.html#module-sklearn.feature_selection
including some hidden interesting ones --- https://scikit-learn.org/0.15/auto_examples/feature_stacker.html#example-feature-stacker-py.
Now, pipelining for this can be tricky. When you stack classes/methods in a pipeline and call .fit(), every step before the final one has to expose .transform(); the output of each step's .transform() is used as the input to the next step, and so on. The last step can be any valid model, but all previous steps must expose .transform() in order to chain one to another. So your code will differ depending on which feature selection approach you pick. See below.
Pablo Picasso is widely quoted as having said that "good artists borrow, great artists steal." So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix and expand a bit further.
Imports
### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor
### if working in Jupyter notebooks, this allows multiple prints per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Data
X, y = load_diabetes(as_frame=True, return_X_y=True)
Supplemental function
### make parameter grid for your GridSearchCV
### code borrowed and adjusted to work with Python 3+ from the answer mentioned above
def make_param_grids(steps, param_grids):
    final_params = []
    # itertools.product builds the Cartesian product such that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        # Step name and estimator name should correspond,
        # i.e. preprocessor must be from pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set the actual estimator in the pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to the above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to the final params
        final_params.append(current_grid)
    return final_params
#1 Example using the RFE feature selection class
(i.e. where the feature selection class does not return a transform but is of the wrapper type)
### pipelines work from one step to another as long as the previous step returns transform
### adjust the next steps to fit your problem space
### below in all_params_grid
### RFE is a wrapper that wraps your model; another similar feature selection algorithm that's a wrapper is
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
pipeline_steps = {'transform': ['ss'],  ### if you wanted to try different steps here you could put them in the list ['ss', 'xx', etc.] and would have to add 'xx' to your all_params_grid as well, or the preprocessor mentioned in your question
                  'classifier': ['rf']}
# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),  ### here you could instead put your feature preprocessing code; this is just an example
                          'with_mean': [True, False]
                          },
                   'rf': {'object': RFE(estimator=PoissonRegressor(),
                                        step=1,
                                        verbose=0),
                          'n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  ### change this parameter to 1, for example, to see how it influences the accuracy of your grid search
                          'estimator__fit_intercept': [True, False],  ### tuning your model's hyperparams
                          'estimator__alpha': [0.1, 0.5, 0.7, 1]  ### tuning your model's hyperparams
                          }
                   }
# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
### put your pipe together, with xyz() classes as placeholders to initialize your pipeline
### in case you, for example, use StandardScaler() AND another transform from the steps above
### at .fit() all parameters passed from the param grid will be passed and evaluated
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', RFE(estimator=PoissonRegressor()))])
pipe
### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to a higher number for more printouts
                        ### about fitting info, which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)
gs_en_cv.fit(X, y)
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"
#2 Example using the SelectKBest feature selection class
(an example where the feature selection class exposes .transform())
pipeline_steps = {'transform': ['ss'],
                  'select': ['kbest'],  ### if you have another feature selector that exposes .transform() you could put it in this list and add it to all_params_grid; that would produce a grid for all variations transform -> select[1] -> classifier and another transform -> select[2] -> classifier
                  'classifier': ['pr']}
# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },
                   'kbest': {'object': SelectKBest(),
                             'k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  ### change this parameter to 1 to see how it influences accuracy during grid search and to validate that it influences your next step
                             },
                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],  ### tuning your model's hyperparams
                          }
                   }
# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('select', SelectKBest()),  ### again, if you used two selectors in your param grid there is no need to put them both here; SelectKBest() is only an initializer for the pipeline
                       ('classifier', PoissonRegressor())])
pipe
### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to a higher number for more printouts
                        ### about fitting info, which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)
gs_en_cv.fit(X, y)
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"
#3 Brute force / looping over all possible combinations with pipeline
pipeline_steps = {'transform': ['ss'],
                  'classifier': ['pr']}
# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },
                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],  ### tuning your model's hyperparams
                          }
                   }
# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', PoissonRegressor())])
pipe
feature_combo = []  ### record feature combinations
score = []          ### record GridSearchCV best score
params = []         ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff) + 1)):  ### tqdm lets you see the overall progress bar here
    for subset in itertools.combinations(stuff, L):  ### create all possible combinations of features
        ### run it
        gs_en_cv = GridSearchCV(pipe,
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle=True,
                                         random_state=123),
                                scoring='neg_root_mean_squared_error',
                                return_train_score=True,
                                ### change verbose to a higher number for more printouts
                                ### about fitting info, which can also verify that
                                ### all parameters you specify are getting fit
                                verbose=0)
        fitted = gs_en_cv.fit(X[list(subset)], y)
        score.append(fitted.best_score_)       ### append results
        params.append(fitted.best_params_)     ### append results
        feature_combo.append(list(subset))     ### append results

### assemble your dataframe, sort, and print out the top feature combo and model params
df = pd.DataFrame({'feature_combo': feature_combo,
                   'score': score,
                   'params': params})
df.sort_values(by='score', ascending=False, inplace=True)
df.head(1)
df.head(1).params.iloc[0]
PS
For interactions (I guess you mean creating new features by combining the original ones?) I would just create those interaction features beforehand and include them at .fit(), because otherwise how would you know whether you got the best interaction features, given that you would be creating the interactions AFTER selecting a subset? Why not create the interactions from the start and let the feature selection portion of the grid search tell you what's best?
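For instance, a minimal sketch (assuming the diabetes DataFrame X from above, a reasonably recent scikit-learn that provides get_feature_names_out, and PolynomialFeatures with interaction_only=True, which only adds pairwise products):
from sklearn.preprocessing import PolynomialFeatures

### build the original features plus all pairwise interaction terms;
### the resulting frame can then be fed to the brute-force loop or the SelectKBest step above
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = pd.DataFrame(poly.fit_transform(X),
                              columns=poly.get_feature_names_out(X.columns),
                              index=X.index)
X_interactions.head()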

Related

How to automatically choose num_features_to_select with the best result in select_features from CatBoostClassifier?

I'm writing a class in Python where I'm trying to automatically pick a value of num_features_to_select in CatBoostClassifier().select_features(). Right now the function simply enumerates num_features_to_select values.
Code:
def CatBoost(X_var=df.drop(columns=['status']), y_var=df[['creation_date', 'status']]):
    from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
    from sklearn.model_selection import train_test_split
    from datetime import datetime, timedelta  # datetime library for working with dates
    import numpy as np
    import os
    os.environ['OPENBLAS_NUM_THREADS'] = '10'

    valid_time_border = X_var['creation_date'].max() - timedelta(days=7)
    X_train, X_test, y_train, y_test = train_test_split(X_var[X_var['creation_date'] <= valid_time_border]
                                                        .drop(columns=['creation_date']),
                                                        y_var[y_var['creation_date'] <= valid_time_border]['status'],
                                                        test_size=0.3)
    X_valid = X_var[X_var['creation_date'] > valid_time_border].drop(columns=['creation_date'])
    y_valid = y_var[y_var['creation_date'] > valid_time_border]['status']

    best_accurancy = 0
    mas_num_features_to_select = [10, 20, 30, 40, 50, 60]
    for i in mas_num_features_to_select:
        # define all the variables
        predict_columns = X_train.columns.to_list()
        # determine the categorical variables
        cat_features_num = np.where(np.isin(X_train[X_train.columns].dtypes, ['bool', 'object']))[0]
        train_pool = Pool(X_train, y_train, cat_features=cat_features_num, feature_names=list(predict_columns))
        test_pool = Pool(X_test, y_test, cat_features=cat_features_num, feature_names=list(predict_columns))
        model = CatBoostClassifier(iterations=round(200), eval_metric='AUC', thread_count=10)
        summary = model.select_features(
            train_pool,
            eval_set=test_pool,
            features_for_select=predict_columns,
            num_features_to_select=i,
            steps=15,
            algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
            shap_calc_type=EShapCalcType.Regular,
            train_final_model=False,
            logging_level='Silent',
            plot=False
        )
        predict_columns = summary['selected_features_names']
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)          # predict on the new data
        mislabel = np.sum((y_valid != y_pred))   # count the incorrectly predicted values
        accurancy = 1 - mislabel / len(y_pred)
        print(accurancy)
        if accurancy > best_accurancy:
            best_accurancy = accurancy
            best_predict_columns = predict_columns

    print('Best prediction accuracy: ' + str(best_accurancy))
    print('Best features:')
    print(best_predict_columns)
    return best_predict_columns
I can't find any information about methods that allow fully automatic feature selection with a built-in function. Is it even possible using CatBoost?
Use the summary dictionary output to find your best point. If you want an interactive plot to define it, you can use the following:
import matplotlib.pyplot as plt
line = plt.plot(summary["loss_graph"]["removed_features_count"], summary["loss_graph"]["loss_values"], picker=True)
x = plt.ginput(n=1, timeout=30, show_clicks=True)
print(x)
If I understand your question correctly, you're looking for a way of using select_features to determine how many and which features to include in the model such that performance is maintained/improved while eliminating the maximum number of features. Sadly, your approach seems to be the best for an automated function. CatBoost does not return the features from the iteration with the best performance, only the features remaining after pruning down to the number of features specified in num_features_to_select by iterating steps number of times.
If you can compromise and add a manual step, you can set plot=True and see at which number of features the loss value is minimized, as in the example plot in CatBoost's documentation.
If you set steps to the number of features, features will be removed one by one, and you can see the loss for the removal of each feature. You could then manually select the number of features to match that iteration. It would be nice if CatBoost had a "train_best_model" parameter instead of just a "train_final_model" parameter! I don't know if there's a way to capture what this function logs to stdout or outputs in the plot, but that contains the loss value and would allow you to set the value.
Edit: I thought of one more approach that is still a form of iterating over the num_features_to_select parameter, but may be interesting:
Set train_final_model=True, steps=1, and num_features_to_select to the width of your dataset
Iteratively subtract 1 from num_features_to_select
At the end of each loop, test the performance of the model
Stop if negative performance change exceeds a threshold (e.g., -5% or -2%)
This may take a while, depending on how long the training takes, but would automatically pick the num_features_to_select as you desire.
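A rough sketch of that loop (reusing X_train, X_test, X_valid, the y_* splits and cat_features_num as prepared in the question's code; the 2% relative stopping threshold and the retrain-to-score step are illustrative assumptions, not CatBoost defaults):
from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
import numpy as np

threshold = -0.02        # stop once accuracy drops by more than 2% relative to the first pass
baseline_accuracy = None
best_predict_columns = list(X_train.columns)

for n in range(X_train.shape[1], 0, -1):
    train_pool = Pool(X_train, y_train, cat_features=cat_features_num)
    test_pool = Pool(X_test, y_test, cat_features=cat_features_num)
    model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
    summary = model.select_features(
        train_pool,
        eval_set=test_pool,
        features_for_select=list(X_train.columns),
        num_features_to_select=n,
        steps=1,
        algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
        shap_calc_type=EShapCalcType.Regular,
        train_final_model=True,
        logging_level='Silent')
    selected = summary['selected_features_names']
    # score a fresh model trained on the selected columns only
    # (retraining keeps the comparison simple instead of relying on the final model above)
    cat_sel = np.where(np.isin(X_train[selected].dtypes, ['bool', 'object']))[0]
    scorer = CatBoostClassifier(iterations=200, thread_count=10, logging_level='Silent')
    scorer.fit(X_train[selected], y_train, cat_features=cat_sel)
    accuracy = np.mean(scorer.predict(X_valid[selected]).ravel() == y_valid.values)
    if baseline_accuracy is None:
        baseline_accuracy = accuracy
    elif (accuracy - baseline_accuracy) / baseline_accuracy < threshold:
        break
    best_predict_columns = selected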

Using a greedy feature selection algorithm for linear regression in Python

This is a homework problem for a machine learning course I'm taking. I'll be as descriptive as I can regarding the approaches I took, what worked, and what didn't.
We are given four types of data sets: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We first load the data set as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("./test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)
The problem we need to solve is to implement a "greedy feature selection" algorithm until the best 100 of the 126 features are selected. Basically, we train models with one feature each, select the best one and store it, then train 125 models with each remaining feature paired with the selected one, choose the next best one and store it, and continue until we have reached 100.
Here is the code:
# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code

# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)
selected_train_error = []
selected_valid_error = []

# For greedy selection
for sel in range(sel_num) :
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0
    for i in range(X_dev.shape[1]) :
        train_error_ith = []
        valid_error_ith = []
        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature
        # Your code here
        X_dev_fs = X_dev[:, i]
        if (i in selected_feature):
            continue
        else:
            pass
        # End your code
        # For cross validation
        for train_index, test_index in cv.split(X_dev) : # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]
            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()
            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))
            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))
            # End your code
    # Select best performance feature set on each features
    # You should choose the feature which has minimum mean cross validation error
    # Your code here
    min_train_error = train_error_ith[np.argmin(train_error_ith)]
    min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
    min_feature = np.argmin(valid_error_ith)
    # End your code
    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)
The algorithm that I had in mind when filling in the #Your code sections is that X_dev_fs would hold the feature of the current iteration along with the previously selected features. We would then use cross validation to derive training and CV errors.
The current output that I get after running this program is
==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================
and so on, with the # of selected feature(s) going on until 100.
The problem is that Selected feature of this iteration : should not output the same number more than once. I'm also having trouble figuring out how to store the best feature and use it with the subsequent iterations.
The questions that I have are:
Why is my selected_feature list containing the same duplicate features, and how do I prevent that?
How do I store the best feature in selected_feature, then use that paired up with each subsequent remaining feature?
Any feedback is appreciated. Thank you.
EDIT
Here are the links to the files that I am loading into the variables, in case anybody needs them.
dev_sample.npy
dev_label.npy
test_sample.npy
test_label.npy

Make RandomForestClassifier pick a variable for sure during training

This is a bit of a newbie question.
I'd like to train a Random Forest using a RandomForestClassifier from sklearn. I have a few variables, but out of these variables, I'd like the algorithm to pick a variable (let's call it SourceID) for sure in every single tree that it trains.
How do I do that? I don't see any parameters in the classifier that would help in this case.
Any help would be appreciated!
TIA.
EDIT
So here's the scenario I have..
If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept would be heavily dependent on Concept A, which has already been assigned. For example, after assigning "Newton's first law of motion", there's a great possibility that "Newton's second law of motion" may be assigned. Quite often, the choice of concepts to be assigned after, say, Concept A is limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.
If I let the random forest do its job of picking variables at random, then there will be a few trees which will not have the variable for Concept A, in which case, the prediction may not make much sense, which is why I'd like to force this variable into selection. Better yet, it'd be great if this variable is chosen as the first variable in each tree to split on.
Does this make things clear? Is random forest not a candidate at all for this job?
There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.
So, it isn't too difficult to create this ourselves manually for trees that are forced to use a specific set of features. I've written a class to do this below. This does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. This is meant to give you a flavor of how to build it yourself:
FixedFeatureRFC.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FixedFeatureRFC:
    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        if random_state is None:
            self.random_state = np.random.RandomState()
        else:
            self.random_state = np.random.RandomState(random_state)

    def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
        """
        feats_fixed: indices of features (columns of X) to be
                     always used to train each estimator

        max_features: number of features that each estimator will use,
                      including the fixed features.

        bootstrap_frac: size of bootstrap sample that each estimator will use.
        """
        self.estimators = []
        self.feats_used = []
        self.n_classes = np.unique(y).shape[0]
        if feats_fixed is None:
            feats_fixed = []
        if max_features is None:
            max_features = X.shape[1]

        n_samples = X.shape[0]
        n_bs = int(bootstrap_frac * n_samples)

        feats_fixed = list(feats_fixed)
        feats_all = range(X.shape[1])

        random_choice_size = max_features - len(feats_fixed)

        feats_choosable = set(feats_all).difference(set(feats_fixed))
        feats_choosable = np.array(list(feats_choosable))

        for i in range(self.n_estimators):
            chosen = self.random_state.choice(feats_choosable,
                                              size=random_choice_size,
                                              replace=False)
            feats = feats_fixed + list(chosen)
            self.feats_used.append(feats)

            bs_sample = self.random_state.choice(n_samples,
                                                 size=n_bs,
                                                 replace=True)

            dtc = DecisionTreeClassifier(random_state=self.random_state)
            dtc.fit(X[bs_sample][:, feats], y[bs_sample])
            self.estimators.append(dtc)

    def predict_proba(self, X):
        out = np.zeros((X.shape[0], self.n_classes))
        for i in range(self.n_estimators):
            out += self.estimators[i].predict_proba(X[:, self.feats_used[i]])
        return out / self.n_estimators

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score(self, X, y):
        return (self.predict(X) == y).mean()
Here is a test script to see if the class above works as intended:
test.py
import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC

rs = np.random.RandomState(1234)

BC = load_breast_cancer()
X, y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8

print("n_features =", X.shape[1])

fixed = [0, 4, 21]
maxf = 10

ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)

for feats in ffrfc.feats_used:
    assert len(feats) == maxf
    for f in fixed:
        assert f in feats

print(ffrfc.score(X[~train], y[~train]))
The output is:
n_features = 30
0.983739837398
None of the assertions failed, indicating that the features we have chosen to be fixed were used in each random feature subsample and that the size of each feature subsample was the required max_features size. The high accuracy on the held-out data indicates that the classifier is working properly.
I do not believe there is a way in scikit-learn right now. You could use max_features=None, which removes all randomness of feature selection.
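For instance, a one-line sketch of that workaround (n_estimators is arbitrary here):
from sklearn.ensemble import RandomForestClassifier

# with max_features=None every tree evaluates all features at each split,
# so SourceID is always available as a split candidate
rf = RandomForestClassifier(n_estimators=100, max_features=None)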
If you can switch packages, R's Ranger (https://cran.r-project.org/web/packages/ranger/ranger.pdf) has options split.select.weights and always.split.variables which may be what you are looking for. Define the probability for the random choices or always include these features in addition to the random choices.
This works against the overall design of random forest, reducing the randomness which may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem to choose this option. As #Michal alluded to, proceed carefully here.

Pyspark - Get all parameters of models created with ParamGridBuilder

I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows you to specify different values for a single parameter, and then perform (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:
rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = (ParamGridBuilder()
             .addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
OK so now I'm able to retrieves simple information with a handmade function:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I want several things:
What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
What are the parameters of all models?
What are the results (aka recall, accuracy, etc.) of each model? I only found print(model.validationMetrics), which displays (it seems) a list containing the accuracy of each model, but I can't tell which model it refers to.
If I can retrieve all this information, I should be able to display graphs, bar charts, and work as I do with Pandas and sklearn.
Spark 2.4+
SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.
By default this behavior is disabled, but can be controlled using CollectSubModels Param (setCollectSubModels).
valid = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    collectSubModels=True)

model = valid.fit(df)

model.subModels
Spark < 2.4
Long story short you simply cannot get parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection not exploration or experiments.
What are the parameters of all models?
While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you should be able to simply zip the two:
from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel

EvalParam = List[Tuple[float, Dict[Param, Any]]]

def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))
to get some insight into the relationship between metrics and parameters.
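For example (a small sketch, assuming model is the fitted TrainValidationSplitModel from above):
metrics_and_params = get_metrics_and_params(model)
for metric, param_map in metrics_and_params:
    # param_map is a dict of Param -> value; print only the names and values
    print(metric, {p.name: v for p, v in param_map.items()})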
If you need more information you should use Pipeline Params. It will preserve all models, which can be used for further processing:
models = pipeline.fit(df, params=paramGrid)
It will generate a list of the PipelineModels corresponding to the params argument:
zip(models, params)
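For instance, a sketch of iterating over the result (reusing models, paramGrid, df and the evaluator from the question; evaluating on the training DataFrame is only for illustration):
for fitted_model, param_map in zip(models, paramGrid):
    result = fitted_model.transform(df)
    # param_map is a dict of Param -> value; print the parameter names next to the metric
    print({p.name: v for p, v in param_map.items()}, evaluator.evaluate(result))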
I think I've found a way to do this. I wrote a function that specifically pulls out hyperparameters for a logistic regression that has two parameters, created with a CrossValidator:
import pandas as pd

def hyperparameter_getter(model_obj, cv_fold=5.0):

    enet_list = []
    reg_list = []

    ## Get metrics
    metrics = model_obj.avgMetrics
    assert type(metrics) is list
    assert len(metrics) > 0

    ## Get the paramMap element
    for key in model_obj._paramMap.keys():
        if key.name == 'estimatorParamMaps':
            param_map_key = key

    params = model_obj._paramMap[param_map_key]

    for i in range(len(params)):
        for k in params[i].keys():
            if k.name == 'elasticNetParam':
                enet_list.append(params[i][k])
            if k.name == 'regParam':
                reg_list.append(params[i][k])

    results_df = pd.DataFrame({'metrics': metrics,
                               'elasticNetParam': enet_list,
                               'regParam': reg_list})

    # Because of [SPARK-16831][PYTHON]
    # it only sums across folds, doesn't average
    spark_version = [int(x) for x in sc.version.split('.')]
    if spark_version[0] <= 2:
        if spark_version[1] < 1:
            results_df.metrics = 1.0 * results_df['metrics'] / cv_fold

    return results_df
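A quick usage sketch (cv_model here stands for a hypothetical fitted CrossValidatorModel built with 5 folds):
results_df = hyperparameter_getter(cv_model, cv_fold=5.0)
results_df.sort_values('metrics', ascending=False).head()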

Training a sklearn LogisticRegression classifier without all possible labels

I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifier on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly usable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
produces a list of (cls, prob) pairs for each row.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)

new_prob = []
for row in prob:
    prob_per_class = list(zip(clf.classes_, row)) + list(zip(classes_not_trained, repeat(0.)))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    new_prob.append([i[1] for i in prob_per_class])

new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.
