Pyspark - Get all parameters of models created with ParamGridBuilder - python

I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows you to specify different values for a single parameter, and then performs (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:
rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = ParamGridBuilder() \
    .addGrid(rdc.maxDepth, [3, 10, 20]) \
    .addGrid(rdc.minInfoGain, [0.01, 0.001]) \
    .addGrid(rdc.numTrees, [5, 10, 20, 30]) \
    .build()
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
OK, so now I'm able to retrieve simple information with a handmade function:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I want several things:
What are the parameters of the best model? This post partially answers the question: How to extract model hyper-parameters from spark.ml in PySpark?
What are the parameters of all models?
What are the results (i.e. recall, accuracy, etc.) of each model? I only found print(model.validationMetrics), which displays (it seems) a list containing the accuracy of each model, but I can't tell which model each value refers to.
If I can retrieve all this information, I should be able to display graphs, bar charts, and work as I do with Pandas and sklearn.

Spark 2.4+
SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.
By default this behavior is disabled, but it can be enabled using the collectSubModels Param (setCollectSubModels).
valid = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    collectSubModels=True)

model = valid.fit(df)

model.subModels
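With the submodels collected you can line them up with the parameter maps and the validation metrics; a minimal sketch, assuming the paramGrid and the fitted model from above:
for params, metric, sub in zip(model.getEstimatorParamMaps(),
                               model.validationMetrics,
                               model.subModels):
    # each param map is a {Param: value} dict; convert to readable names
    print({p.name: v for p, v in params.items()}, metric)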
Spark < 2.4
Long story short, you simply cannot get the parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection, not exploration or experiments.
What are the parameters of all models?
While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you should be able to simply zip both:
from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel

EvalParam = List[Tuple[float, Dict[Param, Any]]]

def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))
to get some insight into the relationship between metrics and parameters.
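If you want to chart this the way you would with Pandas, one possible sketch (assuming the helper above, and that every Param in the grid appears in every map) is to flatten it into a DataFrame:
import pandas as pd

rows = []
for metric, params in get_metrics_and_params(model):
    row = {p.name: v for p, v in params.items()}  # readable param names
    row["metric"] = metric
    rows.append(row)

metrics_df = pd.DataFrame(rows)
# e.g. metrics_df.plot.bar(x="numTrees", y="metric")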
If you need more information you should use Pipeline Params. It will preserve all models, which can be used for further processing:
models = pipeline.fit(df, params=paramGrid)
It will generate a list of PipelineModels corresponding to the params argument, which you can zip back with the grid:
zip(models, paramGrid)
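From there you can evaluate every fitted model yourself; a minimal sketch, assuming the evaluator and paramGrid from above plus a held-out validation DataFrame called valid_df (not defined in the question):
for params, fitted in zip(paramGrid, models):
    score = evaluator.evaluate(fitted.transform(valid_df))
    print({p.name: v for p, v in params.items()}, score)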

I think I've found a way to do this. I wrote a function that specifically pulls out hyperparameters for a logistic regression that has two parameters, created with a CrossValidator:
import pandas as pd

def hyperparameter_getter(model_obj, cv_fold=5.0):

    enet_list = []
    reg_list = []

    ## Get metrics
    metrics = model_obj.avgMetrics
    assert type(metrics) is list
    assert len(metrics) > 0

    ## Get the paramMap element
    param_map_keys = list(model_obj._paramMap.keys())
    for x in range(len(param_map_keys)):
        if param_map_keys[x].name == 'estimatorParamMaps':
            param_map_key = param_map_keys[x]

    params = model_obj._paramMap[param_map_key]

    for i in range(len(params)):
        for k in params[i].keys():
            if k.name == 'elasticNetParam':
                enet_list.append(params[i][k])
            if k.name == 'regParam':
                reg_list.append(params[i][k])

    results_df = pd.DataFrame({'metrics': metrics,
                               'elasticNetParam': enet_list,
                               'regParam': reg_list})

    # Because of [SPARK-16831][PYTHON]
    # it only sums across folds, doesn't average
    # (assumes an active SparkContext available as `sc`)
    spark_version = [int(x) for x in sc.version.split('.')]
    if spark_version[0] <= 2:
        if spark_version[1] < 1:
            results_df.metrics = 1.0 * results_df['metrics'] / cv_fold

    return results_df
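For example, on a fitted CrossValidatorModel (here hypothetically called cv_model, trained with 5 folds):
results = hyperparameter_getter(cv_model, cv_fold=5.0)
results.sort_values('metrics', ascending=False).head()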

Related

Scikit-Learn: how to perform best subset GLM Poisson Regression?

Can someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function does best subset selection and how to embed it into a pipeline and GridSearchCV.
In addition, how do I include interaction terms within the features, to be selected by the best subset algorithm? And how do I embed this in the pipeline and GridSearchCV?
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
continuous_transformer = Pipeline(steps=[('std_scaler', StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(
    transformers=[('continuous', continuous_transformer, continuous_col),
                  ('discrete', discrete_transformer, discrete_col)],
    remainder='passthrough')
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('glm_model', PoissonRegressor(alpha=0, fit_intercept=True))])
param_grid = { ??? different combinations of features ???? }
gs_en_cv = GridSearchCV(pipeline, param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1, return_train_score=True)
Currently, as far as I understand, sklearn has no "brute force" / exhaustive feature search for best subset. However there are various classes:
https://scikit-learn.org/0.15/modules/classes.html#module-sklearn.feature_selection
including some hidden interesting ones such as https://scikit-learn.org/0.15/auto_examples/feature_stacker.html#example-feature-stacker-py.
Now, pipelining for this can be tricky. When you stack classes/methods in a pipeline and call .fit(), all steps before the final one have to expose .transform(). If a step exposes .transform(), then its output is used as the input to the next step, and so on. In your last step you can have any valid model as the final object, but all previous steps must expose .transform() in order to chain one to another. So depending on which feature selection approach you pick, your code will differ. See below.
Pablo Picasso is widely quoted as having said that “good artists borrow, great artists steal”... So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix and expand a bit further.
Imports
### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor
### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Data
X, y = load_diabetes(as_frame=True, return_X_y=True)
Supplemental function
### make parameter grid for your GridSearchCV
### code borrowed and adjusted to work with Python 3.++ from answer mentioned above
def make_param_grids(steps, param_grids):

    final_params = []

    # Itertools.product will do a permutation such that
    # (pca OR svd) AND (svm OR rf) will become ->
    # (pca, svm) , (pca, rf) , (svd, svm) , (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # Step_name and estimator_name should correspond
        # i.e preprocessor must be from pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to final params
        final_params.append(current_grid)

    return final_params
#1 Example using RFE feature selection class
(i.e. where the feature selection class does not expose .transform(), but is of wrapper type)
### pipelines work from one step to another as long as the previous step returns transform
### adjust next steps to fit your problem space
### below in all_params_grid

### RFE is a wrapper that wraps your model; another similar feature selection algorithm that's a wrapper is
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

pipeline_steps = {'transform': ['ss'],  ### if you wanted to try different steps here you could put them in the list ['ss', 'xx', etc.] and would have to add 'xx' in your all_params_grid as well, or your preprocessor mentioned in your question
                  'classifier': ['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),  ### here instead you could put your feature preprocessing code, this is just an example
                          'with_mean': [True, False]
                          },

                   'rf': {'object': RFE(estimator=PoissonRegressor(),
                                        step=1,
                                        verbose=0),
                          'n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  ### change this parameter to 1 for example to see how it influences accuracy of your grid search
                          'estimator__fit_intercept': [True, False],  ### tuning your model's hyperparams
                          'estimator__alpha': [0.1, 0.5, 0.7, 1]      ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
### put your pipe together and put xyz() classes as placeholders to initialize your pipeline in case you for example use StandardScaler() AND another transform from steps above
### at .fit() all parameters passed from the param grid will be passed and evaluated
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', RFE(estimator=PoissonRegressor()))])
pipe

### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)

gs_en_cv.fit(X, y)
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"
#2 Example using KBest feature selection class
(an example where the feature selection class exposes .transform())
pipeline_steps = {'transform': ['ss'],
                  'select': ['kbest'],  ### if you have another feature selector that exposes .transform() you could put it in the list and add it to all_params_grid, and that would produce a grid for all variations transform -> select[1] -> classifier and another transform -> select[2] -> classifier
                  'classifier': ['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },

                   'kbest': {'object': SelectKBest(),
                             'k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  ### change this parameter to 1 to see how it influences accuracy during grid search and to validate that it influences your next step
                             },

                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],      ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list
pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('select', SelectKBest()),  ### again, if you used two steps here in your param grid, no need to put them here, only putting SelectKBest() as an initializer for the pipeline
                       ('classifier', PoissonRegressor())])
pipe

### run it
gs_en_cv = GridSearchCV(pipe,
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle=True,
                                 random_state=123),
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True,
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that
                        ### all parameters you specify are getting fit
                        verbose=1)

gs_en_cv.fit(X, y)
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"
#3 Brute force / looping over all possible combinations with pipeline
pipeline_steps = {'transform': ['ss'],
                  'classifier': ['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss': {'object': StandardScaler(),
                          'with_mean': [True, False]
                          },

                   'pr': {'object': PoissonRegressor(verbose=2),
                          'alpha': [0.1, 0.25, 0.5, 0.75, 1],  ### tuning your model's hyperparams
                          'fit_intercept': [True, False],      ### tuning your model's hyperparams
                          }
                   }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list

pipe = Pipeline(steps=[('transform', StandardScaler()),
                       ('classifier', PoissonRegressor())])
pipe

feature_combo = []  ### record feature combination
score = []          ### record GridSearchCV best score
params = []         ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff) + 1)):  ### tqdm lets you see overall progress bar here
    for subset in itertools.combinations(stuff, L):  ### create all possible combinations of features

        ### run it
        gs_en_cv = GridSearchCV(pipe,
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle=True,
                                         random_state=123),
                                scoring='neg_root_mean_squared_error',
                                return_train_score=True,
                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that
                                ### all parameters you specify are getting fit
                                verbose=0)

        fitted = gs_en_cv.fit(X[list(subset)], y)

        score.append(fitted.best_score_)    ### append results
        params.append(fitted.best_params_)  ### append results
        feature_combo.append(list(subset))  ### append results

### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo': feature_combo,
                   'score': score,
                   'params': params})
df.sort_values(by='score', ascending=False, inplace=True)
df.head(1)
df.head(1).params.iloc[0]
PS
For interactions (I guess you mean creating new features by combining the originals?), I would just create those feature interactions beforehand and include them at your .fit(), because otherwise how do you know if, for example, you got the best interaction features, since you are creating the interactions AFTER you selected a subset of features? Why not interact them from the start and let the grid search's feature selection portion tell you what's best?
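For instance, one possible way (an assumption on my part, not the only option) to build those interaction columns up front is sklearn's PolynomialFeatures; a small sketch on the diabetes X used above (get_feature_names_out assumes a recent scikit-learn):
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True adds pairwise products of the original columns,
# include_bias=False drops the constant column
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = pd.DataFrame(interactions.fit_transform(X),
                     columns=interactions.get_feature_names_out(X.columns))
X_int.head()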

Ensembling Technique

I want to assign weights to multiple models and make a single ensemble model.
I want to use my outputs as the input to a new machine learning algorithm, and the algorithm will learn the correct weights.
But how do I give the output of multiple models as input to a new ML algorithm? I am getting output like this:
preds1=model1.predict_prob(xx)
[[0.28054154 0.35648097 0.32954868 0.03342881]
[0.20625692 0.30749627 0.37018309 0.11606372]
[0.28362306 0.33325501 0.34658685 0.03653508]
...
preds2=model2.predict_prob(xx)
[[0.22153498 0.30271243 0.26420254 0.21155006]
[0.32327647 0.39197589 0.23899729 0.04575035]
[0.18440374 0.32447016 0.4736297 0.0174964 ]
...
How do I make a single DataFrame from the output of these 2 or more models?
The simplest way of doing this is given below, but I want to give the output to a different ML algorithm to learn the weights.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(xx_train, yy_train)
preds1 = model.predict_proba(xx_test)

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(xx_train, yy_train)
preds2 = model.predict_proba(xx_test)

# Each weight (from a predefined grid of candidate weights) is evaluated
# by calculating the corresponding score
for i in range(len(weights)):
    final_inner_preds = np.argmax(preds1 * weights[i] + preds2 * (1 - weights[i]), axis=1)
    scores_corr_wts[i] += accuracy_score(yy_test, final_inner_preds)
In sklearn you can use the StackingClassifier. This should fit your needs.
Create a list of the definitions of your base models:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegressionCV

base_models = [('SVC', LinearSVC(C=1)), ('RF', RandomForestClassifier(n_estimators=500))]
Instantiate your meta-learner:
meta_model = LogisticRegressionCV()
Instantiate the stacking model:
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, passthrough=True, cv=3)
Fit and predict:
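A minimal sketch of that last step, assuming the xx_train, yy_train and xx_test arrays from the question:
stacking_model.fit(xx_train, yy_train)
# the meta-learner (LogisticRegressionCV) learns how to weight the base models,
# which replaces the hand-tuned weight search from the question
stacked_preds = stacking_model.predict_proba(xx_test)
stacked_preds[:3]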

How do we predict on new unseen groups in a hierarchical model in PyMC3?

If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy, Bernoulli
import theano.tensor as tt

with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))
    y_hat = a + tt.dot(X_shared, betas[:, site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500

x = tt.matrix('X', dtype='float64')
new_site = tt.vector('new_site', dtype='int32')
n_samples = tt.iscalar('n_samples')

x.tag.test_value = np.empty(shape=(1, X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1, 1))

_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
                                   size=n_samples,
                                   more_replacements={X_shared: x, site_shared: new_site})

sample_proba = theano.function([x, new_site, n_samples], _sample_proba)

pred_test = sample_proba(test_X.reshape(1, -1), np.array(site_to_predict).reshape(-1), samples)
But what is the correct way to sample from the posterior predictive distribution if we have a new, unseen site?
I'm just copying my answer from the PyMC Discourse thread, in case someone by chance runs into this question or another one like it here.
First of all, beware of the centered hierarchical parametrization you are using; it may lead to divergences and difficulties while fitting.
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like:
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, we will aim to get that.
The way in which we always recommend doing out-of-sample posterior predictive checks is to use theano.shared's. Here, however, I'll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences I won't go into between pymc3 and the skeleton of pymc4, but one thing that I started to use more were factory functions to get the Model instances. Instead of trying to define things inside the model with theano.shared's, I just create a new model with the new data and draw posterior predictive samples from it. I just recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part which is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site, and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a part of the training trace. Here's a self-contained example:
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt


def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model


# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)

# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]

with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]

# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()

# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))

plt.figure()
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
The posterior predictive I get is the histogram produced by the last three lines, with the dashed red line marking new_site_Y.
Of course, I don't know what kind of data to concretely put as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code; you should replace that with your actual data.

Confusion Matrix on H2O

Final Edit: this problem ended up occurring because the target array contained integers that were supposed to represent categories, so it was doing a regression. Once I converted them into factors using .asfactor(), the confusion matrix method detailed in the answer below worked.
I am trying to run a confusion matrix on my Random Forest Model (my_model), but the documentation has been less than helpful. From here it says the command is h2o.confusionMatrix(my_model) but there is no such thing in 3.0.
Here are the steps to fit the model:
from h2o.estimators.random_forest import H2ORandomForestEstimator

data_h = h2o.H2OFrame(data)
train, valid = data_h.split_frame(ratios=[.7], seed=1234)

my_model = H2ORandomForestEstimator(model_id="rf_h", ntrees=400,
                                    max_depth=30, nfolds=8, seed=25)
my_model.train(x=features, y=target, training_frame=train)
pred = rf_h.predict(valid)
I have tried the following:
my_model.confusion_matrix()
AttributeError: type object 'H2ORandomForestEstimator' has no attribute
'confusion_matrix'
Gotten from this example.
I have attempted to use tab completion to find out what it might be and have tried:
h2o.model.confusion_matrix(my_model)
TypeError: 'module' object is not callable
and
h2o.model.ConfusionMatrix(my_model)
which outputs simply all the model diagnostics and then the error:
H2OTypeError: Argument `cm` should be a list, got H2ORandomForestEstimator
Finally,
h2o.model.ConfusionMatrix(pred)
Which gives the same error as above.
Not sure what to do here, how can I view the results of the confusion matrix of the model?
Edit: Added more code to the beginning of the question for Context
Please see the documentation for the full parameter list. For your convenience, here is the signature: confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False).
Here is a working example of how to use the method:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the binomial_double_trees (boolean parameter):
# Initialize and train a DRF
cars_drf = H2ORandomForestEstimator(binomial_double_trees = False, seed = 1234)
cars_drf.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
cars_drf.confusion_matrix()
# or specify the validation frame
cars_drf.confusion_matrix(valid=True)
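The model in the original question was trained with nfolds=8, so after converting the target with .asfactor() you should also be able to pull the cross-validated matrix, or pick the threshold criterion via the metrics argument; a small sketch based on the signature above (not verified against that exact model):
# confusion matrix at the threshold that maximizes F1
my_model.confusion_matrix(metrics="f1")
# cross-validated confusion matrix (available because nfolds was set)
my_model.confusion_matrix(xval=True)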

Wrong classification outputs with sklearn GMM classifier

I'm building a basic speaker recognizer with the GMM toolkit from sklearn. I have 3 classes, for each class I have a classifier. In the testing stage, the GMM for the speaker with the highest probability should be selected and the program should return the predicted class for each test sample. I want to vary the number of mixture components and set n_components=4 in this example code.
If I use 4 mixture components the output of my classifier will either be 0, 1, 2 or 3. If I use 3 mixture components, it will be 0, 1 or 2. I have the feeling that the classifier returns the predicted mixture component instead of the whole GMM. But I want it to predict the class: 1, 2 or 3.
Here is my code:
import numpy as np
from sklearn.mixture import GMM

# set path
path = "path"

class_names = [1, 2, 3]

covs = ['spherical', 'diag', 'tied', 'full']

training_data = {1: np.loadtxt(path + "/01_train_debug.data"),
                 2: np.loadtxt(path + "/02_train_debug.data"),
                 3: np.loadtxt(path + "/03_train_debug.data")}

print "Training models"
models = {}
for c in class_names:
    # make a GMM for each of the classes in class_names
    models[c] = dict((covar_type, GMM(n_components=4,
                                      covariance_type=covar_type,
                                      init_params='wmc', n_init=1, n_iter=20))
                     for covar_type in covs)

for cov in covs:
    for c in class_names:
        models[c][cov].fit(training_data[c])

# define test set
test01 = np.loadtxt(path + "/01_test_debug.data")
test02 = np.loadtxt(path + "/02_test_debug.data")
test03 = np.loadtxt(path + "/03_test_debug.data")

testing_data = {1: test01, 2: test02, 3: test03}

probs = {}
print "Calculating Probabilities"
for c in class_names:
    probs[c] = {}
    for cov in covs:
        probs[c][cov] = {}
        for p in class_names:
            probs[c][cov] = models[p][cov].predict(testing_data[c])

for c in class_names:
    print c
    for cov in covs:
        print "  ", cov,
        for p in class_names:
            print p, probs,
        print
Is my assumption from above correct or do I have a logical error in my code?
Is there a way to solve this in sklearn?
Thanks in advance for your help!
In your code, the first time you use the models dict the keys are covariance types and the second time the keys are class names. I misread your code, sorry.
Edit: if you want the per-sample likelihood of the data under a fitted GMM model, you should use the score_samples method. The predict method does not return probabilities but component assignments instead.
Also, GMM by default is a non-supervised model. If you want to build a supervised model out of a bunch of GMM models, you should probably wrap them in an estimator class that implements the fit / predict API, to be able to estimate its accuracy via cross-validation and adjust the hyperparameter values by grid search. Pull request #2468 is implementing something like this. If it's merged in time it might get included in the next scikit-learn release (0.15, which should come out early 2014).
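To make the score-based class selection concrete, a minimal sketch of picking the class with the highest total log-likelihood for one covariance type (it assumes the models, testing_data and class_names dicts from the question, and the old GMM API where score_samples returns a (logprob, responsibilities) tuple):
cov = 'full'
for c in class_names:
    X_test = testing_data[c]
    # total log-likelihood of this test set under each class model
    loglik = {p: models[p][cov].score_samples(X_test)[0].sum() for p in class_names}
    predicted = max(loglik, key=loglik.get)
    print "true class:", c, "predicted class:", predicted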
