I have functions that implement a Naive Bayes classifier (for my dataset) without using any ML library.
I would like to know how to address the numerical underflow problem in this code. I know I need to use logs to calculate the probabilities in the classifier, but I am unable to get it to work. When I print p1 and p0, I currently get 0 as output for both. How do I change the function to calculate the probabilities p0 and p1 with logs?
# build a naive bayes classifier
def classifyNB0(vec2Classify, p0Vec, p1Vec, pAbusive):
    p1 = np.prod(np.power(p1Vec, vec2Classify)) * pAbusive
    print('p1 =', p1)
    # element-wise power computation
    p0 = np.prod(np.power(p0Vec, vec2Classify)) * (1.0 - pAbusive)
    print('p0 =', p0)
    if p1 > p0:
        return 1
    else:
        return 0
Values in p1Vec:
p1Vec = [0.05263158 0.15789474 0.05263158 0. 0. 0.05263158
0. 0.05263158 0. 0.10526316 0. 0.
0. 0. 0.05263158 0.05263158 0.05263158 0.05263158
0.10526316 0.05263158 0. 0. 0.05263158 0.
0.05263158 0.05263158 0. 0. 0. 0.
0. 0. ]
Values in vec2Classify:
vec2Classify = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
I would propose that this is actually a math problem, and that your post might be better suited for Mathematics Stack Exchange.
I agree with @simon that this is best solved with logarithms, but first I suggest doing some work with pen and paper to simplify the code.
I know nothing about Naive Bayes classification, but as far as I can see from your code, you essentially need to evaluate the inequality p1 > p0. Let's do some math.
Since the logarithm is monotonically increasing, we can equivalently evaluate log(p1) > log(p0). So let's rewrite the two expressions for p1 and p0. Writing U = p1Vec, V = vec2Classify and pA = pAbusive, taking the log of p1 = prod(U[i]**V[i]) * pA turns the product into a sum:
log_p1 = log(p1) = V[0]*log(U[0]) + ... + V[n]*log(U[n]) + log(pA)
and similarly for log_p0, with U = p0Vec and log(1 - pA) as the final term. Depending on your numerical values, I would hope that these sums are not subject to underflow, so you can evaluate log_p1 > log_p0 directly. In code, you will need to iterate over your lists/vectors to get the sums:
import numpy as np

log_p1 = np.log(pAbusive)
log_p0 = np.log(1 - pAbusive)
for i in range(len(p1Vec)):
    log_p1 += vec2Classify[i] * np.log(p1Vec[i])
    log_p0 += vec2Classify[i] * np.log(p0Vec[i])
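One caveat (my note, anticipating the EDIT below): your p1Vec contains zeros, and at any index where vec2Classify[i] == 0 and p1Vec[i] == 0 the term above is 0 * np.log(0), i.e. 0 * -inf, which evaluates to nan. A guarded variant of the same sums avoids this:
log_p1 = np.log(pAbusive)
log_p0 = np.log(1 - pAbusive)
for i in range(len(p1Vec)):
    # only take logs where the word actually occurs
    if vec2Classify[i]:
        log_p1 += np.log(p1Vec[i])
        log_p0 += np.log(p0Vec[i])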
And then just evaluate,
log_p1 > log_p0
EDIT:
When I look at the data you added to the post in a later edit, your math becomes trivial. You don't need power or log; you can avoid them altogether. Note that, always,
power(x,0) = 1,
power(x,1) = x,
log(1) = 0.
You could simply write,
p1 = pAbusive
for x, y in zip(p1Vec, vec2Classify):
    if y:  # == 1
        p1 *= x
Or, as a one-liner with a list comprehension:
p1 = pAbusive * np.prod([x if y else 1 for x, y in zip(p1Vec, vec2Classify)])
If you get underflow from this, try again with log,
log_p1 = np.log(pAbusive) + sum([np.log(x) if y else 0 for x, y in zip(p1Vec, vec2Classify)])
# ...
# and evaluate,
log_p1 > log_p0
EDIT2:
You do not really have an underflow problem. I tried your data and, quite frankly, p1 correctly evaluates to 0.0. If you take a closer look at vec2Classify, you'll see that it only holds 1 at three indices, and that p1Vec is 0 at exactly those same indices.
If p1Vec is zero at even one of the indices where vec2Classify is 1, the whole p1 = prod( ... ) will always be zero, because you'll multiply by power(0,1) = 0.
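A quick check with the arrays from the question makes this visible (a sketch, assuming both are numpy arrays):
import numpy as np

idx = np.nonzero(vec2Classify)[0]
print(idx)         # [10 28 30] -> the indices where vec2Classify == 1
print(p1Vec[idx])  # [0. 0. 0.] -> p1Vec is zero at exactly those indices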
Maybe your input data (p1Vec, vec2Classify) is incorrectly typed?
I am having trouble understanding why my code below is not producing the optimal result. I am attempting to create a step function, but this clearly isn't working, as the solution value for model.Z isn't one of the specified range points.
Any help in understanding/correcting this is greatly appreciated.
What I am trying to do
Maximize Z * X, subject to:
       / 20.5 ,  X <= 5
Z(X) = |  10  ,  5 <= X <= 10
       \   9  , 10 <= X <= 11
Even better, I'd like to solve under the following conditions (disjoint at breakpoints):
       / 20.5 ,  X <= 5
Z(X) = |  10  ,  5 < X <= 10
       \   9  , 10 < X <= 11
where X and Z are floating point numbers.
I would expect X to be 5 and Z to be 20.5; however, the model results are 7.37 and 15.53.
Code
from pyomo.core import *
# Break points for step-function
DOMAIN_PTS = [5., 10., 11.]
RANGE_PTS = [20.5, 10., 9.]
# Define model and variables
model = ConcreteModel()
model.X = Var(bounds=(5,11))
model.Z = Var()
# Set piecewise constraint
model.con = Piecewise(model.Z, model.X,
                      pw_pts=DOMAIN_PTS,
                      pw_constr_type='EQ',
                      f_rule=RANGE_PTS,
                      force_pw=True,
                      pw_repn='SOS2')
model.obj = Objective(expr= model.Z * model.X, sense=maximize)
opt = SolverFactory('gurobi')
opt.options['NonConvex'] = 2
obj_val = opt.solve(model)
print(value(model.X))
print(value(model.Z))
print(model.obj())
I would never piecewise-linearize z, but always z*x. If you have a piecewise linear expression for z only, then z*x is nonlinear (and in a nasty way). If you instead write down a piecewise linear expression for z*x, the whole thing becomes linear. Note that discontinuities in the piecewise functions require attention.
It is important to understand mathematically what you are writing down before you pass it on to a solver.
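Here is a minimal sketch of that idea, reusing the data from the question (the ZX variable and the hand-computed breakpoint values are my additions, not part of the original post):
from pyomo.environ import *

# Breakpoints for the product z(x)*x, computed by hand from the step
# function: at x = 5 the two branches give 20.5*5 = 102.5 and 10*5 = 50;
# on (5, 10] z*x = 10x, and on (10, 11] z*x = 9x.
DOMAIN_PTS = [5., 5., 10., 10., 11.]
RANGE_PTS = [102.5, 50., 100., 90., 99.]

model = ConcreteModel()
model.X = Var(bounds=(5, 11))
model.ZX = Var()

model.con = Piecewise(model.ZX, model.X,
                      pw_pts=DOMAIN_PTS,
                      pw_constr_type='EQ',
                      f_rule=RANGE_PTS,
                      force_pw=True,
                      pw_repn='SOS2')

# The objective is now linear, so Gurobi's NonConvex option is not needed.
model.obj = Objective(expr=model.ZX, sense=maximize)
Maximizing should then pick the upper value at the jump: X = 5 with ZX = 102.5, i.e. Z = 20.5.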
Piecewise in Pyomo is intended to interpolate linearly between breakpoints given for another variable. This means that with the breakpoints you're giving, you are interpolating: between x=5 and x=10 you place the line Z = 31 - 2.1X, and another line between 10 and 11 (the original answer illustrated this with a hand-drawn graph). In fact, Gurobi is computing the optimal result for this model: since X is continuous, on the line Z = 31 - 2.1X the objective X*Z = X*(31 - 2.1X) is maximized at X = 31/4.2 ≈ 7.37, giving Z ≈ 15.53.
Now, I understand that you want a step function rather than an interpolation (again illustrated with a graph in the original answer). Then you need to change your DOMAIN_PTS and RANGE_PTS to correctly model what you want:
# Break points for step-function
DOMAIN_PTS = [5.,5., 10., 10.,11.]
RANGE_PTS = [20.5,10., 10., 9., 9.]
In this way, the duplicated domain points create vertical segments at the breakpoints, so you interpolate along the flat pieces (f(x)=10 on 5<=x<=10, f(x)=9 on 10<=x<=11) while Z can still reach 20.5 at x=5, and so on.
I have a sparse csc_matrix named eventPropMatrix, with dtype=float64 and shape=(13000, 7), to which I am applying the following distance-calculating function. Here, for example,
eventPropMatrix.getrow(i).todense() == [[0. 0. 0. 0. 0. 0. 0.]]
eventPropMatrix.getrow(j).todense() == [[0. 0. 0. 0. 0. 0. 0.]]
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    epsim = scipy.spatial.distance.correlation(eventPropMatrix.getrow(i).todense(),
                                               eventPropMatrix.getrow(j).todense())
Here scipy.spatial.distance.correlation is the following:
def correlation(u, v, w=None, centered=True):
    """
    Compute the correlation distance between two 1-D arrays.

    The correlation distance between `u` and `v`, is
    defined as

    .. math::

        1 - \\frac{(u - \\bar{u}) \\cdot (v - \\bar{v})}
                  {{||(u - \\bar{u})||}_2 {||(v - \\bar{v})||}_2}

    where :math:`\\bar{u}` is the mean of the elements of `u`
    and :math:`x \\cdot y` is the dot product of :math:`x` and :math:`y`.

    Parameters
    ----------
    u : (N,) array_like
        Input array.
    v : (N,) array_like
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0

    Returns
    -------
    correlation : double
        The correlation distance between 1-D array `u` and `v`.
    """
    u = _validate_vector(u)
    v = _validate_vector(v)
    if w is not None:
        w = _validate_weights(w)
    if centered:
        umu = np.average(u, weights=w)
        vmu = np.average(v, weights=w)
        u = u - umu
        v = v - vmu
    uv = np.average(u * v, weights=w)
    uu = np.average(np.square(u), weights=w)
    vv = np.average(np.square(v), weights=w)
    dist = 1.0 - uv / np.sqrt(uu * vv)
    return dist
Here I get "nan" as the return value most of the time, because uu == 0.0 and vv == 0.0.
My problem is that for the 13000 rows this pairwise calculation takes far too much time: it has been running for the last 15+ hours (i5 8th gen, 4-core processor, 12 GB RAM, Ubuntu).
Is there any way around this humongous calculation? I am contemplating Cythonizing the code, then compiling and running it. Will this help, and if it does, how do I go about it?
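For reference, a vectorized formulation of the same all-pairs computation is sketched below (my addition, assuming the condensed result fits in memory); it lets SciPy run the loop in C instead of calling distance.correlation() roughly 84.5 million times from Python:
import numpy as np
from scipy.spatial import distance

# Densify once; 13000 x 7 float64 is well under 1 MB
dense = np.asarray(eventPropMatrix.todense())

# Condensed vector of all pairwise correlation distances; for n = 13000
# that is n*(n-1)/2 ~ 84.5M float64 values (~676 MB), so check your RAM
epsim = distance.pdist(dense, metric='correlation')

# All-zero rows still produce nan, exactly as in the row-by-row version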
I have a large dictionary of "pairwise similarity matrices" that look like the following:
similarities['group1']:
array([[1. , 0. , 0. , 0. , 0. ],
[0. , 1. , 0.09 , 0.09 , 0. ],
[0. , 0.09 , 1. , 0.94535157, 0. ],
[0. , 0.09 , 0.94535157, 1. , 0. ],
[0. , 0. , 0. , 0. , 1. ]])
In short, every element of the matrix above is the probability that record_i and record_j are similar (values between 0 and 1 inclusive), 1 being exactly similar and 0 being completely different.
I then feed each similarity matrix into an AffinityPropagation algorithm in order to group / cluster similar records:
sim = similarities['group1']

clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here

affinity = clusterer.fit(sim)
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_
However, since I run the above on multiple similarity matrices, I need a generalised preference parameter, which I can't seem to tune.
The docs say it defaults to the median of the similarity matrix; however, with that setup I get lots of false positives, while the mean sometimes works and sometimes gives too many clusters, etc.
For example, when playing with the preference parameter, these are the results I get from the similarity matrix:
preference = default, i.e. the median (value 0.2) of the similarity matrix: incorrect results. We see that record 18 shouldn't be in the cluster, because its similarity to the other records is very low:
# Indexes of the elements in Cluster n°5: [15, 18, 22, 27]
{'15_18': 0.08,
'15_22': 0.964546229533378,
'15_27': 0.6909703138051403,
'18_22': 0.12, # Not Ok, the similarity is too low
'18_27': 0.19, # Not Ok, the similarity is too low
'22_27': 0.6909703138051403}
preference = 0.2 (in fact anything from 0.11 to 0.26): correct results, as the records are similar:
# Indexes of the elements in Cluster n°5: [15, 22, 27]
{'15_22': 0.964546229533378,
'15_27': 0.6909703138051403,
'22_27': 0.6909703138051403}
My question is: How should I choose this preference parameter in a way that would generalise?
A naive, brute-force grid-search solution can be implemented as follows: if any connection within a cluster scores less than a certain threshold (0.5, for example), we re-run the clustering with an adjusted value of the preference parameter.
A naive implementation looks like the following.
First, a function to test whether a clustering needs tuning, the threshold being 0.5 in this example:
def is_tuning_required(similarity_matrix, rows_of_cluster):
    rows = similarity_matrix[rows_of_cluster]
    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]
            if score > 0.5:
                continue
            return True
    return False
Build a range of preference values against which the clustering will run:
def get_pref_range(similarity):
    starting_point = np.median(similarity)
    if starting_point == 0:
        starting_point = np.mean(similarity)

    # Let's try to accelerate the pace of values picking
    step = 1.25 if starting_point >= 0.05 else 2

    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        preference_tuning_range.append(max_val)

    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)

    return preference_tuning_range
A normal AffinityPropagation run, with a preference parameter passed:
def run_clustering(similarity, preference):
    clusterer = AffinityPropagation(damping=0.9,
                                    affinity='precomputed',
                                    max_iter=5000,
                                    convergence_iter=2500,
                                    verbose=False,
                                    preference=preference)
    affinity = clusterer.fit(similarity)
    labels = affinity.labels_
    return labels, len(set(labels)), affinity.cluster_centers_indices_
The method we would actually call with a similarity (1 - distance) matrix as an argument:
def run_ideal_clustering(similarity):
    preference_tuning_range = get_pref_range(similarity)

    best_tested_preference = None
    for preference in preference_tuning_range:
        labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)

        needs_tuning = False
        wrong_clusters = 0
        for label_index in range(labels_count):
            cluster_elements_indexes = np.where(labels == label_index)[0]
            tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
            if tuning_required:
                wrong_clusters += 1
                if not needs_tuning:
                    needs_tuning = True

        if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
            best_tested_preference = (preference, wrong_clusters)

        if not needs_tuning:
            return labels, labels_count, cluster_centers_indices

    # The clustering has not been tuned enough during the iterations;
    # fall back to the preference that produced the fewest wrong clusters
    return run_clustering(similarity, best_tested_preference[0])
Obviously, this is a brute-force solution which will not be performant on large datasets / similarity matrices.
If a simpler and better solution gets posted I'll accept it.
I'm getting different results when calculating the negative log likelihood of a simple two-layer neural net in theano and numpy.
This is the numpy code:
W1,b1,W2,b2 = model['W1'], model['b1'], model['W2'], model['b2']
N, D = X.shape
where model holds the initial parameters and X is the input array.
z_1 = X
z_2 = np.dot(X,W1) + b1
z_3 = np.maximum(0, z_2)
z_4 = np.dot(z_3,W2)+b2
scores = z_4
exp_scores = np.exp(scores)
exp_sum = np.sum(exp_scores, axis = 1)
exp_sum.shape = (exp_scores.shape[0],1)
y_hat = exp_scores / exp_sum
loss = np.sum(np.log(y_hat[np.arange(y.shape[0]),y]))
loss = -1/float(y_hat.shape[0])*loss + reg/2.0*np.sum(np.multiply(W1,W1))+ reg/2.0*np.sum(np.multiply(W2,W2))
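(As an aside, not part of the original code: a common guard against overflow in the softmax above is to shift the scores by their row-wise maximum before exponentiating; softmax is invariant to this shift.)
# numerically safer softmax numerator
exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))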
I'm getting a result of 1.3819194609246772, which is the correct value for the loss function. However, my Theano code yields a value of 1.3715655944645178.
t_z_1 = T.dmatrix('z_1')
t_W1 = theano.shared(value = W1, name = 'W1', borrow = True)
t_b1 = theano.shared(value = b1, name = 'b1',borrow = True)
t_W2 = theano.shared(value = W2, name = 'W2')
t_b2 = theano.shared(value = b2, name = 'b2')
t_y = T.lvector('y')
t_reg = T.dscalar('reg')
first_layer = T.dot(t_z_1,W1) + t_b1
t_hidden = T.switch(first_layer > 0, 0, first_layer)
t_out = T.nnet.softmax(T.dot(t_hidden, W2)+t_b2)
t_cost = -T.mean(T.log(t_out)[T.arange(t_y.shape[0]),t_y],dtype = theano.config.floatX, acc_dtype = theano.config.floatX)+t_reg/2.0*T.sum(T.sqr(t_W1))+t_reg/2.0*T.sum(T.sqr(t_W2))
cost_func = theano.function([t_z_1,t_y,t_reg],t_cost)
loss = cost_func(z_1,y,reg)
I'm already getting wrong results when calculating the values in the output layer. I'm not really sure what the problem could be. Does the shared function keep the dtype of the numpy array that is used as the value argument, or is it converted to float32? Can anybody tell me what I'm doing wrong in the theano code?
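(One quick way to check the dtype question directly; this snippet is my addition, not from the original post:)
# a shared variable exposes both the stored value and the symbolic dtype
print(t_W1.get_value().dtype)
print(t_W1.dtype)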
EDIT: The problem seems to occur in the hidden layer after applying the ReLU function. Here's the comparison between the theano and numpy results in each layer:
theano results of first layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[-0.26107962 -0.14975259 -0.03842555 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[-0.19759784 -0.07417904 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[-0.13411606 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[-0.07063428 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
numpy results of first layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[-0.26107962 -0.14975259 -0.03842555 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[-0.19759784 -0.07417904 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[-0.13411606 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[-0.07063428 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
theano results of hidden layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0. 0. 0.
0. 0. 0. ]
[-0.26107962 -0.14975259 -0.03842555 0. 0. 0. 0.
0. 0. 0. ]
[-0.19759784 -0.07417904 0. 0. 0. 0. 0.
0. 0. 0. ]
[-0.13411606 0. 0. 0. 0. 0. 0.
0. 0. 0. ]
[-0.07063428 0. 0. 0. 0. 0. 0.
0. 0. 0. ]]
numpy results of hidden layer
[[ 0. 0. 0. 0. 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[ 0. 0. 0. 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[ 0. 0. 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[ 0. 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[ 0. 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
theano results of output
[[ 0.14393463 0.2863576 0.56970777]
[ 0.14303947 0.28582359 0.57113693]
[ 0.1424154 0.28544871 0.57213589]
[ 0.14193274 0.28515729 0.57290997]
[ 0.14171057 0.28502272 0.57326671]]
numpy results of output
[[-0.5328368 0.20031504 0.93346689]
[-0.59412164 0.15498488 0.9040914 ]
[-0.67658362 0.08978957 0.85616275]
[-0.77092643 0.01339997 0.79772637]
[-0.89110401 -0.08754544 0.71601312]]
I got the idea of using the switch() function for the ReLU layer from this post: Theano HiddenLayer Activation Function, and I don't really see how that function is different from the equivalent numpy code z_3 = np.maximum(0, z_2).
Solution to the first problem: T.switch(first_layer > 0, 0, first_layer) sets all the values greater than 0 to 0; it should be T.switch(first_layer < 0, 0, first_layer).
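To spell out the difference (my note): T.switch(cond, a, b) picks a where cond is true and b elsewhere, so the condition has to select the values you want zeroed:
import theano.tensor as T

# Wrong: zeroes out the positive part and keeps the negatives
t_hidden = T.switch(first_layer > 0, 0, first_layer)

# Right: zeroes out the negative part, i.e. ReLU
t_hidden = T.switch(first_layer < 0, 0, first_layer)

# Equivalent, and closest to np.maximum(0, z_2)
t_hidden = T.maximum(0, first_layer)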
EDIT2: The gradients that theano calculates differ significantly from the numerical gradients I was given. This is my implementation:
g_w1, g_b1, g_w2, g_b2 = T.grad(t_cost,[t_W1,t_b1,t_W2,t_b2])
grads = {}
grads['W1'] = g_w1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b1'] = g_b1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['W2'] = g_w2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b2'] = g_b2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
This is an assignment for the Convolutional Neural Networks class that was offered by Stanford earlier this year and I think it's safe to say that their numerical gradients are probably correct. I could post the code to their numerical implementation though if required.
Using a relative error the following way:
def relative_error(num, ana):
    numerator = np.sum(np.abs(num - ana))
    denom = np.sum(np.abs(num)) + np.sum(np.abs(ana))
    return numerator / denom
Calculating the numerical gradients using the eval_numerical_gradient method that was provided by the course gives the following relative errors for the gradients:
param_grad_num = {}
rel_error = {}
for param_name in grads:
    param_grad_num[param_name] = eval_numerical_gradient(
        lambda W: two_layer_net(X, model, y, reg)[0],
        model[param_name], verbose=False)
    rel_error[param_name] = relative_error(param_grad_num[param_name],
                                           grads[param_name])
{'W1': 0.010069468997284833,
'W2': 0.6628490408291472,
'b1': 1.9498867941113963e-09,
'b2': 1.7223972753120753e-11}
These are too large for W1 and W2; the relative error should be less than 1e-8. Can anybody explain this or help in any way?
I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:
X = 0 0 1 0 1   and   Y = 1
    0 1 0 0 0             1
    1 1 1 0 1             0
I want to use the Mann-Whitney U test to compute feature importance (feature selection):
from scipy.stats import mannwhitneyu

results = np.zeros((X.shape[1], 2))
for i in xrange(X.shape[1]):
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob
I'm not sure whether this is correct. I obtained large values for a large table, e.g. u = 990 for some columns.
I don't think that using the Mann-Whitney U test is a good way to do feature selection. Mann-Whitney tests whether the distributions of the two variables are the same; it tells you nothing about how correlated the variables are. For example:
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a,b)
array([[ 1. , -0.07155116],
[-0.07155116, 1. ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112) # result for almost not correlated
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112) # result for perfectly correlated
Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical, regardless of how correlated the variables actually are.
And since in feature selection you are trying to find the features that best explain Y, the Mann-Whitney U test does not help you with that.