Calculating Nearest Match to Mean/Stddev Pair With LibSVM

Calculating Nearest Match to Mean/Stddev Pair With LibSVM - python

I'm new to SVMs, and I'm trying to use the Python interface to libsvm to classify a sample containing a mean and stddev. However, I'm getting nonsensical results.
Is this task inappropriate for SVMs or is there an error in my use of libsvm? Below is the simple Python script I'm using to test:
#!/usr/bin/env python
# Simple classifier test.
# Adapted from the svm_test.py file included in the standard libsvm distribution.
from collections import defaultdict
from svm import *
# Define our sparse data formatted training and testing sets.
labels = [1,2,3,4]
train = [ # key: 0=mean, 1=stddev
{0:2.5,1:3.5},
{0:5,1:1.2},
{0:7,1:3.3},
{0:10.3,1:0.3},
]
problem = svm_problem(labels, train)
test = [
({0:3, 1:3.11},1),
({0:7.3,1:3.1},3),
({0:7,1:3.3},3),
({0:9.8,1:0.5},4),
]
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
print kt
param = svm_parameter(kernel_type = kt, C=10, probability = 1)
model = svm_model(problem, param)
for test_sample,correct_label in test:
pred_label, pred_probability = model.predict_probability(test_sample)
correct[kn] += pred_label == correct_label
# Show results.
print '-'*80
print 'Accuracy:'
for kn,correct_count in correct.iteritems():
print '\t',kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))
The domain seems fairly simple. I'd expect that if it's trained to know a mean of 2.5 means label 1, then when it sees a mean of 2.4, it should return label 1 as the most likely classification. However, each kernel has an accuracy of 0%. Why is this?
A couple of side notes, is there a way to hide all the verbose training output dumped by libsvm in the terminal? I've searched libsvm's docs and code, but I can't find any way to turn this off.
Also, I had wanted to use simple strings as the keys in my sparse dataset (e.g. {'mean':2.5,'stddev':3.5}). Unfortunately, libsvm only supports integers. I tried using the long integer representation of the string (e.g. 'mean' == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that'll be a pain since I'll have to maintain and persist a second file along with the serialized classifier. Does anyone see an easier way?

The problem seems to be coming from combining multiclass prediction with probability estimates.
If you configure your code not to make probability estimates, it actually works, e.g.:
<snip>
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
print kt
param = svm_parameter(kernel_type = kt, C=10) # Here -> rm probability = 1
model = svm_model(problem, param)
for test_sample,correct_label in test:
# Here -> change predict_probability to just predict
pred_label = model.predict(test_sample)
correct[kn] += pred_label == correct_label
</snip>
With this change, I get:
--------------------------------------------------------------------------------
Accuracy:
polynomial 1.000000 (4 of 4)
rbf 1.000000 (4 of 4)
linear 1.000000 (4 of 4)
Prediction with probability estimates does work, if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find anyway to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.

If you are interested in a different way of doing this, you could do the following. This way is theoretically more sound, however not as straightforward.
By mentioning mean and std, it seems as if you refer to data that you assume to be distributed in some way. E.g., the data you observer is Gaussian distributed. You can then use the Symmetrised Kullback-Leibler_divergence as a distance measure between those distributions. You can then use something like k-nearest neighbour to classify.
For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric - so in order to have a proper distance measure, you can use
distance(p1, p2) = KL(p1, p2) + KL(p1, p2)
For Gaussians, KL(p1, p2) = { (μ1 - μ2)^2 + σ1^2 - σ2^2 } / (2.σ2^2) + ln(σ2/σ1). (I stole that from here, where you can also find a deviation :)
Long story short:
Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find that q in D for which distance(d, p) is minimal and return that class.
To me that feels better as the SVM approach with several kernels, since the way of classifying is not so arbitrary.

Related

Avoid getting nonsensical values (eg. -0.0) when classifying very long texts with Naive Bayes

I've implemented a fairly efficient implementation of a multinomial Naive Bayes classifier and it works like a charm. That is until the classifier encounters very long messages (on the order of 10k words) where the predictions results are nonsensical (e.g. -0.0) and I get a math domain error when using Python's math.log function. The reason why I'm using log is that when working with very small floats, what you get if you multiply them is very small floats and the log helps avoiding infinitely small numbers that would fail the predictions.
Some context
I'm using the Bag of words approach without any sort of vectorization (like TF-IDF, because I can't figure out how to implement it properly and balance 0-occurring words. A snippet for that would be appreciated too ;) ) and I'm using frequency count and Laplace additive smoothing (basically adding 1 to each frequency count so it's never 0).
I could just take the log out, but that would mean that in the case of such long messages the engine would fail to detect them properly anyway, so it's not the point.

There is no multiplication in Naive Bayes if you apply log-sum-exp, only additions, so the underflow is unlikely. And if you use smoothing (as you say yo do), you would never get undefined behavior for log.
This stats stackexchange answer describes the underlying math. For reference implementation I have a snippet of mine lying around for MultinomialNaiveBayes (analogous to sklearn's sklearn.naive_bayes.MultinomialNB and with similar API):
import numpy as np
import scipy
class MultinomialNaiveBayes:
def __init__(self, alpha: float = 1.0):
self.alpha = alpha
def fit(self, X, y):
# Calculate priors from data
self.log_priors_ = np.log(np.bincount(y) / y.shape[0])
# Get indices where data belongs to separate class, creating a slicing mask.
class_indices = np.array(
np.ma.make_mask([y == current for current in range(len(self.log_priors_))])
)
# Divide dataset based on class indices
class_datasets = np.array([X[indices] for indices in class_indices])
# Calculate how often each class occurs and add alpha smoothing.
# Reshape into [n_classes, features]
classes_metrics = (
np.array([dataset.sum(axis=0) for dataset in class_datasets]).reshape(
len(self.log_priors_), -1
)
+ self.alpha
)
# Calculate log likelihoods
self.log_likelihoods_ = np.log(
classes_metrics / classes_metrics.sum(axis=1)[:, np.newaxis]
)
return self
def predict(self, X):
# Return most likely class
return np.argmax(
scipy.sparse.csr_matrix.dot(X, self.log_likelihoods_.T) + self.log_priors_,
axis=1,
)
BTW. -0.0 is exactly the same as 0.0 and is sensical value.

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built a XGBoostRegressor model using around 200 categorical features predicting a countinous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I both want and P(Y|X) as output. Any idea how to do this?

There is no probability in regression, In regression the only output you will get is a predicted value thats why it is called regression, so for any regressor probability of a prediction is not possible. Its only there in classification.

As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, normed=True)
plt.hist(outputs, bins=n, normed=True)
x = x[:-1] + (x[ 1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolate method is not great for outliers. if a predicted data is extremely far (more than 3 times the std) from your distribution, it wont work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.

I suggest you to look into Ngboost (essentially a wrapper of Xgboost which provides eventually a probabilistic model.
Here you can find slides on the Ngboost functioning and the seminal Ngboost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default is the Gaussian distribution) and fit an Xgboost model to estimate the best parameters of the distribution (for the Gaussian $\mu$ and $\sigma$. The model will split the variables' space into different regions with different distributions, i.e. same family (eg. Gaussian) but different parameters.
After training the model, you're provided with the method '''pred_dist''' which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But mypca function always returns the same value i.e. Xpca irrespective of value of comp.
The value it returns is correct for first value of comp I send from the loop i.e. Xpca value which it sends each time is correct for comp = 10 in my case.
What should I do in order to find best value of comp?

You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2 dimensional array, the minimum n_comp is 10, so the PCA try to find the 10 best dimension for you. Since 10 > 2, you will always get the same answer. :)

It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine.
My problem here is that I don't need to use the cross-validation aspect of the GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own but the ParameterSampler and ParameterGrid objects are very useful.
My next step will be to subclass BaseSearchCV and implement my own _fit() method, but thought it was worth asking is there a simpler way to do this, for example by passing something to the cv parameter?
def silhouette_score(estimator, X):
clusters = estimator.fit_predict(X)
score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
return score
ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}
# run randomized search
search = GridSearchCV(
ca,
param_distributions=param_dist,
n_iter=n_iter_search,
scoring=silhouette_score,
cv= # can I pass something here to only use a single fold?
)
search.fit(distance_matrix)

The clusteval library will help you to evaluate the data and find the optimal number of clusters. This library contains five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.
pip install clusteval
Depending on your data, the evaluation method can be chosen.
# Import library
from clusteval import clusteval
# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')
# Fit to find optimal number of clusters using dbscan
results= ce.fit(X)
# Make plot of the cluster evaluation
ce.plot()
# Make scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)
# results is a dict with various output statistics. One of them are the labels.
cluster_labels = results['labx']

Ok, this might be an old question but I use this kind of code:
First, we want to generate all the possible combinations of parameters:
def make_generator(parameters):
if not parameters:
yield dict()
else:
key_to_iterate = list(parameters.keys())[0]
next_round_parameters = {p : parameters[p]
for p in parameters if p != key_to_iterate}
for val in parameters[key_to_iterate]:
for pars in make_generator(next_round_parameters):
temp_res = pars
temp_res[key_to_iterate] = val
yield temp_res
Then create a loop out of this:
# add fix parameters - here - it's just a random one
fixed_params = {"max_iter":300 }
param_grid = {"n_clusters": range(2, 11)}
for params in make_generator(param_grid):
params.update(fixed_params)
ca = KMeans( **params )
ca.fit(_data)
labels = ca.labels_
# Estimate your clustering labels and
# make decision to save or discard it!
Of course, it can be combined in a pretty function. So this solution is mostly an example.
Hope it helps someone!

Recently I ran into similar problem. I defined custom iterable cv_custom which defines splitting strategy and is an input for cross validation parameter cv. This iterable should contain one couple for each fold with samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one couple for one fold with indices of all examples in the train and also in the test part ([train_ids], [test_ids])
N = len(distance_matrix)
cv_custom = [(range(0,N), range(0,N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)

Learning Discrete HMM parameters in PyMC

I am trying to learn the parameters of a simple discrete HMM using PyMC. I am modeling the rainy-sunny model from the Wiki page on HMM. The model looks as follows:
I am using the following priors.
theta_start_state ~ beta(20,10)
theta_transition_rainy ~beta(8,2)
theta_transition_sunny ~beta(2,8)
theta_emission_rainy ~ Dirichlet(3,4,3)
theta_emission_sunny ~ Dirichlet(10,6,4)
Initially, I use this setup to create a training set as follows.
## Some not so informative priors!
# Prior on start state
theta_start_state = pm.Beta('theta_start_state',12,8)
# Prior on transition from rainy
theta_transition_rainy = pm.Beta('transition_rainy',8,2)
# Prior on transition from sunny
theta_transition_sunny = pm.Beta('transition_sunny',2,8)
# Prior on emission from rainy
theta_emission_rainy = pm.Dirichlet('emission_rainy',[3,4,3])
# Prior on emission from sunny
theta_emission_sunny = pm.Dirichlet('emission_sunny',[10,6,4])
# Start state
x_train_0 = pm.Categorical('x_0',[theta_start_state, 1-theta_start_state])
N = 100
# Create a train set for hidden states
x_train = np.empty(N, dtype=object)
# Creating a train set of observations
y_train = np.empty(N, dtype=object)
x_train[0] = x_train_0
for i in xrange(1, N):
if x_train[i-1].value==0:
x_train[i] = pm.Categorical('x_train_%d'%i,[theta_transition_rainy, 1- theta_transition_rainy])
else:
x_train[i] = pm.Categorical('x_train_%d'%i,[theta_transition_sunny, 1- theta_transition_sunny])
for i in xrange(0,N):
if x_train[i].value == 0:
# Rain
y_train[i] = pm.Categorical('y_train_%d' %i, theta_emission_rainy)
else:
y_train[i] = pm.Categorical('y_train_%d' %i, theta_emission_sunny)
However, I am not able to understand how to learn these parameters using PyMC. I made a start as follows.
#pm.observed
def y(x=x_train, value =y_train):
N = len(x)
out = np.empty(N, dtype=object)
for i in xrange(0,N):
if x[i].value == 0:
# Rain
out[i] = pm.Categorical('y_%d' %i, theta_emission_rainy)
else:
out[i] = pm.Categorical('y_%d' %i, theta_emission_sunny)
return out
The complete notebook containing this code can be found here.
Aside: The gist containing HMM code for a Gaussian is really hard to understand! (not documented)
Update
Based on the answers below, I tried changing my code as follows:
#pm.stochastic(observed=True)
def y(value=y_train, hidden_states = x_train):
def logp(value, hidden_states):
logprob = 0
for i in xrange(0,len(hidden_states)):
if hidden_states[i].value == 0:
# Rain
logprob = logprob + pm.categorical_like(value[i], theta_emission_rainy)
else:
# Sunny
logprob = logprob + pm.categorical_like(value[i], theta_emission_sunny)
return logprob
The next step would be to create a model and then run the MCMC algorithm. However, the above
edited code would also not work. It gives a ZeroProbability error.
I am not sure if I have interpreted the answers correctly.

Just some thoughts on this:
Sunny and Rainy are mutually exclusive and exhaustive hidden states. Why don't you encode them as a single categorical weather variable which can take one of two values (coding for sunny, rainy) ?
In your likelihood function, you seem to observe Rainy / Sunny. The way I see it in your model graph, these seem to be the hidden, not the observed variables (that would be "walk", "shop" and "clean")
In your likelihood function, you need to sum (for all time steps t) the log-probability of the observed values (of walk, shop and clean respectively) given the current (sampled) values of rainy/sunny (i.e., Weather) at the same time step t.
Learning
If you want to learn parameters for the model, you might want to consider switching to PyMC3 which would be better suited for automatically computing gradients for your logp function. But in this case (since you chose conjugate priors) this is not really neccessary. If you don't know what Conjugate Priors are, or are in need of an overview, ask Wikipedia for List of Conjugate Priors, it has a great article on that.
Depending on what you want to do, you have a few choices here. If you want to sample from the posterior distribution of all parameters, just specify your MCMC model as you did, and press the inference button, after that just plot and summarize the marginal distributions of the parameters you're interested in, and you are done.
If you are not interested in marginal posterior distributions, but rather in finding the joint MAP paramters, you might consider using Expectation Maximization (EM) learning or Simulated Annealing. Both should work reasonably well within the MCMC Framework.
For EM Learning simply repeat these steps until convergence:
E (Expectation) Step: Run the MCMC chain for a while and collect samples
M (Maximization) Step: Update the hyperparamters for your Beta and Dirichlet Priors as if these samples had been actual observations. Look up the Beta and Dirichlet Distributions if you don't know how to do that.
I would use a small learning rate factor so you don't fall into the first local optimum (now we're approaching Simulated Annealing): Instead of treating the N samples you generated from the MCMC chain as actual observations, treat them as K observations (for a value K << N) by scaling the updates to the hyperparameters down by a learning rate factor of K/N.

The first thing that pops out at me is the return value of your likelihood. PyMC expects a scalar return value, not a list/array. You need to sum the array before returning it.
Also, when you use a Dirichlet as a prior for the Categorical, PyMC detects this and fills in the last probability. Here's how I would code your x_train/y_train loops:
p = []
for i in xrange(1, N):
# This will return the first variable if prev=0, and the second otherwise
p.append(pm.Lambda('p_%i' % i, lambda prev=x_train[i-1]: (theta_transition_rainy, theta_transition_sunny)[bool(prev)]))
x_train[i] = pm.Categorical('x_train_%i' % i, p[-1])
So, you grab the appropriate probabilities with a Lambda, and use it as the argument for the Categorical.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.