I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine.
My problem here is that I don't need to use the cross-validation aspect of the GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own but the ParameterSampler and ParameterGrid objects are very useful.
My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?
from sklearn import metrics

def silhouette_score(estimator, X):
    # score the fitted clustering against the precomputed distance matrix
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score
ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}
# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv=  # can I pass something here to only use a single fold?
)
search.fit(distance_matrix)
The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.
pip install clusteval
Depending on your data, the evaluation method can be chosen.
# Import library
from clusteval import clusteval
# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')
# Fit to find optimal number of clusters using dbscan
results = ce.fit(X)
# Make plot of the cluster evaluation
ce.plot()
# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)
# results is a dict with various output statistics; one of them is the labels.
cluster_labels = results['labx']
Ok, this might be an old question but I use this kind of code:
First, we want to generate all the possible combinations of parameters:
def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
Then create a loop out of this:
# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": range(2, 11)}
for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate your clustering labels here and
    # decide whether to keep or discard them!
Of course, this can be wrapped up in a neat function; this solution is mostly an example.
Hope it helps someone!
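For what it's worth, scikit-learn's own ParameterGrid performs the same enumeration, so a sketch along these lines could replace the helper (assuming a recent sklearn, where it lives in sklearn.model_selection):

from sklearn.cluster import KMeans
from sklearn.model_selection import ParameterGrid

fixed_params = {"max_iter": 300}
for params in ParameterGrid({"n_clusters": range(2, 11)}):
    params.update(fixed_params)
    ca = KMeans(**params)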
Recently I ran into a similar problem. I defined a custom iterable cv_custom which defines the splitting strategy and is the input for the cross-validation parameter cv. This iterable should contain one couple for each fold, with the samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one couple for one fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])
N = len(distance_matrix)
cv_custom = [(range(0,N), range(0,N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
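Putting this together with the question's setup, a sketch might look like this (assuming the custom silhouette_score scorer defined in the question, and a recent scikit-learn where GridSearchCV lives in sklearn.model_selection):

from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

N = len(distance_matrix)
cv_single_fold = [(list(range(N)), list(range(N)))]  # one "fold": train == test == all samples

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_score,  # the custom scorer from the question
    cv=cv_single_fold,
)
search.fit(distance_matrix)
print(search.best_params_)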
I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Fit a kernel density estimate for both the poor hyperparameter group (g(x)) and the good hyperparameter group (l(x))
Good candidates will have low probability under g(x) and high probability under l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate an (x, y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde

def objective_func(x):
    return x**2

def measure(x):
    # noise is multiplied by 0 here, i.e. noiseless measurements
    noise = np.random.randn(len(x))*0
    return x**2 + noise

def split_measures(x_obs, y_obs, gamma=1/2):
    # split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x': x_obs[:size], 'y': y_obs[:size]}
    g = {'x': x_obs[size:], 'y': y_obs[size:]}
    y_star = (l['y'][-1] + g['y'][0])/2
    return l, g, y_star
# sample objective function values for illustration
x_obj = np.linspace(-5, 5, 10000)
y_obj = objective_func(x_obj)

# start by sampling a parameter search history
x_obs = np.linspace(-5, 5, 10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):
    # sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs, y_obs = x_obs[sort_idx], y_obs[sort_idx]

    # split sorted observations into two groups (l and g)
    l, g, y_star = split_measures(x_obs, y_obs)

    # approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)

    # define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l

    if i % 10 == 0:
        plt.figure()
        plt.subplot(2, 2, 1)
        plt.plot(x_obj, y_obj, label='Objective')
        plt.plot(x_obs, y_obs, '*', label='Observations')
        plt.plot([-5, 5], [y_star, y_star], 'k')
        plt.subplot(2, 2, 2)
        plt.plot(x_obj, kde_l)
        plt.subplot(2, 2, 3)
        plt.plot(x_obj, kde_g)
        plt.subplot(2, 2, 4)
        plt.semilogy(x_obj, eval_measure)
        plt.draw()

    # find the point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs, [best_search])
    y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])

plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are two problems in this code:
This code will only evaluate a few unique points, because the next location is always the single best point recommended by the kernel density estimates; there is no mechanism for the code to explore the search space, which is what acquisition functions normally provide.
Because the code simply appends new observations to the lists of x and y, it accumulates a lot of duplicates. The duplicates skew the set of observations, which leads to a very strange split; you can easily see this in the later plots. The eval_measure starts out resembling the objective function but diverges later on.
Removing the duplicates in x_obs and y_obs fixes problem 2. The first problem, however, can only be fixed by adding some way of exploring the search space, as in the sketch below.
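A minimal sketch of both fixes, reusing names from the question's loop (x_obs, y_obs, l, g and measure are assumed to exist); note that actual TPE samples candidate points from l(x) rather than taking an argmin over a fixed grid:

kde_l = gaussian_kde(l['x'])
kde_g = gaussian_kde(g['x'])

# exploration: draw stochastic candidates from l(x) instead of scanning a fixed grid
candidates = kde_l.resample(64).ravel()
ratio = kde_g.evaluate(candidates) / kde_l.evaluate(candidates)
best_search = candidates[np.argmin(ratio)]

# deduplication: only add the observation if it is not (nearly) already present
if not np.any(np.isclose(x_obs, best_search)):
    x_obs = np.append(x_obs, best_search)
    y_obs = np.append(y_obs, measure(np.asarray([best_search])))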
I am solving a linear model with bounds on the parameters. The simple statsmodels OLS method doesn't allow for bounds on the fitted parameters, so to do this, I maximize a likelihood function using scipy.optimize.minimize. From this, I have my set of parameters for a linear model. All good so far.
All I need to achieve now is to be able to access statistics for my model, such as R^2, F-stat, etc. For an OLS, these things all come with the object returned by model.fit(), along with other nice features.
I'm wondering if it is possible to create this object, manually assign my parameters from the bounded fit, and have it compute the data fields on the fit result object? Obviously, I could just manually compute these things but I want it such that whether I am calling for a bounded or unbounded fit, I get the same object type returned and life is easy downstream.
Pseudo code:
bounded_params = fitBoundedLinear(x, y) # solution to bounded problem - a list of floats
model = statsmodels.api.OLS(y, x)
unbounded_fitResult = model.fit() # solution to unbounded problem - a regression results object
I want to do something like:
aFitResult.params = bounded_params # manually set the parameters
aFitResult.calculate() # force it to compute data fields based on these params
rsq = aFitResult.rsquared # etc...
I have something that works - but it is probably not an ideal solution:
aFitResult = statsmodels.regression.linear_model.RegressionResultsWrapper(
    statsmodels.regression.linear_model.OLSResults(model, bounded_params)
)
You can add upper_bound and lower_bound parameters to fit_elasticnet in statsmodels' elastic_net.py:
def fit_elasticnet(model, method="coord_descent", maxiter=100,
                   alpha=0., L1_wt=1., start_params=None, cnvrg_tol=1e-7,
                   zero_tol=1e-8, refit=False, check_step=True,
                   loglike_kwds=None, score_kwds=None, hess_kwds=None,
                   upper_bound=None, lower_bound=None):
then inside that function after the following line:
params[k] = _opt_1d(func, grad, hess, model_1var, params[k], alpha[k]*L1_wt,
                    tol=btol, check_step=check_step)
add:
# clamp each coordinate-descent update to the user-supplied bounds
if upper_bound is not None:
    params[k] = min(params[k], upper_bound[k])
if lower_bound is not None:
    params[k] = max(params[k], lower_bound[k])
then call the function like so:
model = lm.OLS(y, x)
results_fu = model.fit()
#results_fu.summary()
results_fr = model.fit_regularized(alpha=0.001,
                                   start_params=results_fu.params,
                                   upper_bound=(.60, 0, 0, 1, 1, 1, 1, 1),
                                   lower_bound=(-1, 0, 0, 0, -1, 1, 1, 1, -10))
Set the model's initial parameters to the desired values via start_params= and then fit with maxiter=0, i.e. run zero optimization steps but still go through all the initialization and metric computation.
result = model.fit(start_params=your_parameters_here, maxiter=0)
result.rsquared # or any other fit index
I am trying to learn the parameters of a simple discrete HMM using PyMC. I am modeling the rainy-sunny model from the Wiki page on HMM. The model looks as follows:
I am using the following priors.
theta_start_state ~ Beta(20, 10)
theta_transition_rainy ~ Beta(8, 2)
theta_transition_sunny ~ Beta(2, 8)
theta_emission_rainy ~ Dirichlet(3, 4, 3)
theta_emission_sunny ~ Dirichlet(10, 6, 4)
Initially, I use this setup to create a training set as follows.
## Some not so informative priors!
# Prior on start state
theta_start_state = pm.Beta('theta_start_state', 12, 8)

# Prior on transition from rainy
theta_transition_rainy = pm.Beta('transition_rainy', 8, 2)

# Prior on transition from sunny
theta_transition_sunny = pm.Beta('transition_sunny', 2, 8)

# Prior on emission from rainy
theta_emission_rainy = pm.Dirichlet('emission_rainy', [3, 4, 3])

# Prior on emission from sunny
theta_emission_sunny = pm.Dirichlet('emission_sunny', [10, 6, 4])

# Start state
x_train_0 = pm.Categorical('x_0', [theta_start_state, 1 - theta_start_state])

N = 100

# Create a train set for hidden states
x_train = np.empty(N, dtype=object)

# Creating a train set of observations
y_train = np.empty(N, dtype=object)

x_train[0] = x_train_0

for i in xrange(1, N):
    if x_train[i-1].value == 0:
        x_train[i] = pm.Categorical('x_train_%d' % i, [theta_transition_rainy, 1 - theta_transition_rainy])
    else:
        x_train[i] = pm.Categorical('x_train_%d' % i, [theta_transition_sunny, 1 - theta_transition_sunny])

for i in xrange(0, N):
    if x_train[i].value == 0:
        # Rain
        y_train[i] = pm.Categorical('y_train_%d' % i, theta_emission_rainy)
    else:
        y_train[i] = pm.Categorical('y_train_%d' % i, theta_emission_sunny)
However, I am not able to understand how to learn these parameters using PyMC. I made a start as follows.
@pm.observed
def y(x=x_train, value=y_train):
    N = len(x)
    out = np.empty(N, dtype=object)
    for i in xrange(0, N):
        if x[i].value == 0:
            # Rain
            out[i] = pm.Categorical('y_%d' % i, theta_emission_rainy)
        else:
            out[i] = pm.Categorical('y_%d' % i, theta_emission_sunny)
    return out
The complete notebook containing this code can be found here.
Aside: The gist containing HMM code for a Gaussian is really hard to understand! (not documented)
Update
Based on the answers below, I tried changing my code as follows:
@pm.stochastic(observed=True)
def y(value=y_train, hidden_states=x_train):
    def logp(value, hidden_states):
        logprob = 0
        for i in xrange(0, len(hidden_states)):
            if hidden_states[i].value == 0:
                # Rain
                logprob = logprob + pm.categorical_like(value[i], theta_emission_rainy)
            else:
                # Sunny
                logprob = logprob + pm.categorical_like(value[i], theta_emission_sunny)
        return logprob
The next step would be to create a model and then run the MCMC algorithm. However, the edited code above does not work either: it gives a ZeroProbability error. I am not sure if I have interpreted the answers correctly.
Just some thoughts on this:
Sunny and Rainy are mutually exclusive and exhaustive hidden states. Why don't you encode them as a single categorical weather variable which can take one of two values (coding for sunny, rainy)?
In your likelihood function, you seem to observe Rainy / Sunny. The way I see it in your model graph, these are the hidden variables, not the observed ones (those would be "walk", "shop" and "clean").
In your likelihood function, you need to sum (for all time steps t) the log-probability of the observed values (of walk, shop and clean respectively) given the current (sampled) values of rainy/sunny (i.e., Weather) at the same time step t.
Learning
If you want to learn parameters for the model, you might want to consider switching to PyMC3, which would be better suited for automatically computing gradients for your logp function. But in this case (since you chose conjugate priors) this is not really necessary. If you don't know what conjugate priors are, or are in need of an overview, look up Wikipedia's List of Conjugate Priors; it has a great article on that.
Depending on what you want to do, you have a few choices here. If you want to sample from the posterior distribution of all parameters, just specify your MCMC model as you did and press the inference button; after that, plot and summarize the marginal distributions of the parameters you're interested in, and you are done.
If you are not interested in marginal posterior distributions, but rather in finding the joint MAP parameters, you might consider using Expectation Maximization (EM) learning or Simulated Annealing. Both should work reasonably well within the MCMC framework.
For EM Learning simply repeat these steps until convergence:
E (Expectation) Step: Run the MCMC chain for a while and collect samples
M (Maximization) Step: Update the hyperparameters for your Beta and Dirichlet priors as if these samples had been actual observations. Look up the Beta and Dirichlet distributions if you don't know how to do that.
I would use a small learning rate factor so you don't fall into the first local optimum (now we're approaching Simulated Annealing): instead of treating the N samples you generated from the MCMC chain as actual observations, treat them as K observations (for a value K << N) by scaling the updates to the hyperparameters down by a learning rate factor of K/N, as in the sketch below.
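For instance, a damped M-step update for the Beta prior might look like this sketch (state_samples is a hypothetical 0/1 array of rainy/sunny draws collected during the E step):

import numpy as np

K_over_N = 0.1   # learning-rate factor K/N
a, b = 8.0, 2.0  # current Beta hyperparameters

# count how often each state occurred in the collected samples
n_rainy = np.sum(state_samples == 0)
n_sunny = np.sum(state_samples == 1)

# conjugate Beta update, damped: treat the N samples as only K pseudo-observations
a += K_over_N * n_rainy
b += K_over_N * n_sunny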
The first thing that pops out at me is the return value of your likelihood. PyMC expects a scalar return value, not a list/array. You need to sum the array before returning it.
Also, when you use a Dirichlet as a prior for the Categorical, PyMC detects this and fills in the last probability. Here's how I would code your x_train/y_train loops:
p = []
for i in xrange(1, N):
    # This will return the first variable if prev == 0, and the second otherwise
    p.append(pm.Lambda('p_%i' % i,
                       lambda prev=x_train[i-1]: (theta_transition_rainy,
                                                  theta_transition_sunny)[bool(prev)]))
    x_train[i] = pm.Categorical('x_train_%i' % i, p[-1])
So, you grab the appropriate probabilities with a Lambda, and use it as the argument for the Categorical.
Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data without re-fitting all the data.
Thank you
scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.
If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
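A minimal sketch of the online setting, assuming data and new_data are 2-D arrays of points:

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=5)
mbk.partial_fit(data)       # fit on the initial batch
mbk.partial_fit(new_data)   # centroids are updated incrementally
labels = mbk.predict(new_data)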
You can pass in initial values for the centroids with the init parameter to sklearn.cluster.k_means. So then you can just do:
centroids, labels, inertia = k_means(data, k)
new_data = np.append(data, extra_pts, axis=0)  # axis=0 keeps the 2-D shape
new_centroids, new_labels, new_inertia = k_means(new_data, k, init=centroids)
assuming you're just adding data points and not changing k.
I think this will sometimes mean you get a suboptimal result, but it should usually be faster. You might want to occasionally redo the fit with, say, 10 random seeds and take the best one.
It's also relatively easy to write your own function that finds out which centroid is closest to a point that you are considering. Assuming you have some matrix X that is ready for kmeans:
centroids, labels, inertia = cluster.k_means(X, 5)
def pred(arr):
    # index of the centroid closest to arr (Euclidean distance)
    return np.argmin([np.linalg.norm(arr - b) for b in centroids])
You can confirm that this works via:
[pred(X[i]) == labels[i] for i in range(len(X))]
I'm new to SVMs, and I'm trying to use the Python interface to libsvm to classify a sample containing a mean and stddev. However, I'm getting nonsensical results.
Is this task inappropriate for SVMs or is there an error in my use of libsvm? Below is the simple Python script I'm using to test:
#!/usr/bin/env python
# Simple classifier test.
# Adapted from the svm_test.py file included in the standard libsvm distribution.
from collections import defaultdict
from svm import *
# Define our sparse data formatted training and testing sets.
labels = [1,2,3,4]
train = [ # key: 0=mean, 1=stddev
{0:2.5,1:3.5},
{0:5,1:1.2},
{0:7,1:3.3},
{0:10.3,1:0.3},
]
problem = svm_problem(labels, train)
test = [
({0:3, 1:3.11},1),
({0:7.3,1:3.1},3),
({0:7,1:3.3},3),
({0:9.8,1:0.5},4),
]
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    param = svm_parameter(kernel_type = kt, C=10, probability = 1)
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        pred_label, pred_probability = model.predict_probability(test_sample)
        correct[kn] += pred_label == correct_label
# Show results.
print '-'*80
print 'Accuracy:'
for kn,correct_count in correct.iteritems():
    print '\t',kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))
The domain seems fairly simple. I'd expect that if it's trained to know a mean of 2.5 means label 1, then when it sees a mean of 2.4, it should return label 1 as the most likely classification. However, each kernel has an accuracy of 0%. Why is this?
A couple of side notes: is there a way to hide all the verbose training output dumped by libsvm in the terminal? I've searched libsvm's docs and code, but I can't find any way to turn it off.
Also, I had wanted to use simple strings as the keys in my sparse dataset (e.g. {'mean':2.5,'stddev':3.5}). Unfortunately, libsvm only supports integers. I tried using the long integer representation of the string (e.g. 'mean' == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that'll be a pain since I'll have to maintain and persist a second file along with the serialized classifier. Does anyone see an easier way?
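That workaround would look something like this; key_map is the hypothetical mapping I'd have to persist alongside the serialized classifier:

key_map = {'mean': 0, 'stddev': 1}
sample = {'mean': 2.5, 'stddev': 3.5}
encoded = {key_map[k]: v for k, v in sample.items()}  # -> {0: 2.5, 1: 3.5}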
The problem seems to be coming from combining multiclass prediction with probability estimates.
If you configure your code not to make probability estimates, it actually works, e.g.:
<snip>
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    param = svm_parameter(kernel_type = kt, C=10) # Here -> rm probability = 1
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        # Here -> change predict_probability to just predict
        pred_label = model.predict(test_sample)
        correct[kn] += pred_label == correct_label
</snip>
With this change, I get:
--------------------------------------------------------------------------------
Accuracy:
polynomial 1.000000 (4 of 4)
rbf 1.000000 (4 of 4)
linear 1.000000 (4 of 4)
Prediction with probability estimates does work if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.
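The doubling workaround amounts to something like this change to the question's setup (reusing its labels and train lists):

problem = svm_problem(labels * 2, train * 2)  # every training point included twice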
If you are interested in a different way of doing this, you could do the following. This way is theoretically more sound, though not as straightforward.
By mentioning mean and std, it seems as if you refer to data that you assume to be distributed in some way, e.g. that the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and use something like k-nearest neighbour to classify.
For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric, so in order to have a proper distance measure, you can use
distance(p1, p2) = KL(p1, p2) + KL(p2, p1)
For Gaussians, KL(p1, p2) = { (μ1 - μ2)^2 + σ1^2 - σ2^2 } / (2σ2^2) + ln(σ2/σ1). (I stole that from here, where you can also find a derivation :)
Long story short:
Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find the q in D for which distance(q, p) is minimal and return its class.
To me that feels better than the SVM approach with several kernels, since the way of classifying is not so arbitrary.
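A minimal sketch of that approach, assuming 1-D Gaussians and a hypothetical training set D of (mean, std, class) tuples built from the question's data:

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # KL(p1 || p2) for two 1-D Gaussians
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def sym_kl(p, q):
    # symmetrised KL: KL(p, q) + KL(q, p)
    return kl_gauss(p[0], p[1], q[0], q[1]) + kl_gauss(q[0], q[1], p[0], p[1])

# training set borrowed from the question's data
D = [(2.5, 3.5, 1), (5.0, 1.2, 2), (7.0, 3.3, 3), (10.3, 0.3, 4)]

def classify(mean, std):
    # 1-nearest neighbour under the symmetrised KL distance
    nearest = min(D, key=lambda d: sym_kl((mean, std), (d[0], d[1])))
    return nearest[2]

print(classify(2.4, 3.0))  # -> 1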