I implemented BPSO as a feature selection approach using the pyswarms library. I followed this tutorial.
Is there a way to limit the maximum number of features? If not, are there other particle swarm (or genetic/simulated annealing) Python implementations that have this functionality?
An easy way is to introduce a penalty for the number of features used. In the following code an objective is defined:
# (inside the per-particle objective from the tutorial; X_subset, classifier,
#  y, alpha and total_features come from that context)
# Perform classification and store performance in P
classifier.fit(X_subset, y)
P = (classifier.predict(X_subset) == y).mean()
# Compute the objective function
j = (alpha * (1.0 - P)
     + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
return j
What you could do is add a penalty if the number of features is above max_num_features, e.g.:
features_count = np.count_nonzero(m)
features_overflow = np.clip(features_count - max_num_features, 0, 10)
feature_overflow_penalty = features_overflow / 10
and define a new objective with (the penalty is added, since the objective is minimized):
j = (alpha * (1.0 - P)
     + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features))
     + feature_overflow_penalty)
This is not tested, and there is work to do to find the right penalty weight. An alternative is to never suggest/try feature sets above a certain size in the first place.
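For reference, here is how these pieces might fit together into a single per-particle objective in the style of the linked tutorial. This is only a sketch: classifier, X, y, total_features and max_num_features are assumed to be set up as in that tutorial, and the penalty weighting is a free choice.

import numpy as np

def f_per_particle(m, alpha):
    # hypothetical sketch; classifier, X, y, total_features and max_num_features
    # are assumed to be defined as in the pyswarms tutorial setup
    if np.count_nonzero(m) == 0:
        X_subset = X
    else:
        X_subset = X[:, m == 1]

    # performance of the classifier on the selected feature subset
    classifier.fit(X_subset, y)
    P = (classifier.predict(X_subset) == y).mean()

    # penalty that grows once more than max_num_features are selected
    features_overflow = np.clip(np.count_nonzero(m) - max_num_features, 0, 10)
    feature_overflow_penalty = features_overflow / 10

    # pyswarms minimizes this value
    return (alpha * (1.0 - P)
            + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features))
            + feature_overflow_penalty)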
It doesn't appear to be a regular log-normal pdf as described at https://en.wikipedia.org/wiki/Log-normal_distribution. The function in question comes from https://www.tensorflow.org/tutorials/generative/cvae:
import numpy as np
import tensorflow as tf

def log_normal_pdf(sample, mean, logvar, raxis=1):
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=raxis)
This is the logarithm of the probability density according to a normal distribution, i.e. log(p(x)) where p is a normal/Gaussian density with mean mean and variance exp(logvar). The naming is a little confusing, though.
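To see this, write out the log-density of a Gaussian with mean $\mu$ and variance $\sigma^2 = \exp(\text{logvar})$; it matches the expression in the code term by term, and the tf.reduce_sum over raxis simply sums the per-dimension log-densities of a diagonal Gaussian:

$$
\log \mathcal{N}(x;\,\mu,\sigma^2)
= \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right)
= -\frac{1}{2}\left((x-\mu)^2\, e^{-\log\sigma^2} + \log\sigma^2 + \log 2\pi\right)
$$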
In case anyone else wanders down this rabbit hole, the previous answer checks out: it is "the logarithm of the pdf according to a normal distribution". Here's a simple check you can run, based on the Gaussian density definition from the Wikipedia page:
import numpy as np

def log_normal_pdf(sample, mean, std):
    """Function from the TensorFlow VAE example."""
    logvar = np.log(std ** 2)
    log2pi = np.log(2 * np.pi)
    return -.5 * ((sample - mean) ** 2. * np.exp(-logvar) + logvar + log2pi)

def test(sample, mean, std):
    """Alternate calc taking the log of the wiki Gaussian function."""
    out = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (sample - mean) ** 2 / std ** 2)
    return np.log(out)

print(log_normal_pdf(9, 10, 1))
print(test(9, 10, 1))
I asked a similar question in January that @Miłosz Wieczór was kind enough to answer. Now I am faced with a similar but different challenge, since I need to fit two parameters (fc and alpha) simultaneously on two datasets (e_exp and iq_exp). I basically need to find the values of fc and alpha that best fit both datasets e_exp and iq_exp.
import numpy as np
import math
from scipy.optimize import curve_fit, least_squares, minimize
f_exp = np.array([1, 1.6, 2.7, 4.4, 7.3, 12, 20, 32, 56, 88, 144, 250000])
e_exp = np.array([7.15, 7.30, 7.20, 7.25, 7.26, 7.28, 7.32, 7.25, 7.35, 7.34, 7.37, 11.55])
iq_exp = np.array([0.010, 0.009, 0.011, 0.011, 0.010, 0.012, 0.019, 0.027, 0.038, 0.044, 0.052, 0.005])
ezero = np.min(e_exp)
einf = np.max(e_exp)
ig_fc = 500
ig_alpha = 0.35
def CCRI(f_exp, fc, alpha):
    x = np.log(f_exp / fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    RI = np.sqrt(R ** 2 + I ** 2)
    return RI

def CCiQ(f_exp, fc, alpha):
    x = np.log(f_exp / fc)
    R = ezero + 1/2 * (einf - ezero) * (1 + np.sinh((1 - alpha) * x) / (np.cosh((1 - alpha) * x) + np.sin(1/2 * alpha * math.pi)))
    I = 1/2 * (einf - ezero) * np.cos(alpha * math.pi / 2) / (np.cosh((1 - alpha) * x) + np.sin(alpha * math.pi / 2))
    iQ = I / R
    return iQ
poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
einf, ezero, and f_exp are all constants, and the variables I need to optimize are fc and alpha (ig_fc and ig_alpha are their initial guesses, where ig stands for initial guess). In the code above I get two different fc and alpha values because I solve the two fits independently. I need, however, to solve them simultaneously so that fc and alpha are universal.
Is there a way to solve two different functions to provide universal solutions for fc and alpha?
The docs say the following about the second value returned by curve_fit:
pcov
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
So if you want to minimize the overall error, you need to combine the errors of both your fits.
def objective(what, ever):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
    # not sure if this is the correct equation, but you can start with it
    err_total = np.sum(np.sqrt(np.diag(pcovRI))) + np.sum(np.sqrt(np.diag(pcoviQ)))
    return err_total
On total errors of 2d Gaussian functions:
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Update:
Since you want poptRI and poptiQ to be the same, you need to minimize their distance.
This can be done like
from numpy import linalg

def objective(what, ever):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(ig_fc, ig_alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(ig_fc, ig_alpha))
    delta = linalg.norm(poptiQ - poptRI)
    return delta
Minimizing this function will (should) result in similar values for poptRI and poptiQ. You take the parameters as vectors, and try to minimize the length of their delta vector.
However, this approach assumes that poptRI and poptiQ (and their coefficients) are roughly in the same range, since you are applying a metric to them. If, say, one of them is on the order of 2000 and the other on the order of 2, the optimizer will favour tuning the first one. But maybe this is fine.
If you somehow want to treat them the same you need to normalize them.
One approach (assuming all coefficients are similar) could be
linalg.norm((poptiQ / linalg.norm(poptiQ)) - (poptRI / linalg.norm(poptRI)))
You normalize the results to unit vectors, subtract them, and then take the norm of the difference.
The same is true for the inputs to the function, but it might not be that important there. See the links below.
But this strongly depends on the problem you are trying to solve. There is no general solution.
Some links related to this:
Is normalization useful/necessary in optimization?
Why do we have to normalize the input for an artificial neural network?
Another objective function:
Is this what you are trying to do?
You want to find the best fc and alpha so the fit results of both functions are as close as possible?
def objective(fc, alpha):
    poptRI, pcovRI = curve_fit(CCRI, f_exp, e_exp, p0=(fc, alpha))
    poptiQ, pcoviQ = curve_fit(CCiQ, f_exp, iq_exp, p0=(fc, alpha))
    delta = linalg.norm(poptiQ - poptRI)
    return delta
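A minimal way to drive such an objective, assuming the data, CCRI/CCiQ and the initial guesses defined in the question. Nelder-Mead is just one choice of derivative-free method, and whether searching over the p0 initial guesses like this is meaningful depends on your problem:

from scipy.optimize import minimize

# search for fc and alpha that make the two fitted parameter vectors agree
res = minimize(lambda p: objective(p[0], p[1]),
               x0=[ig_fc, ig_alpha], method='Nelder-Mead')
best_fc, best_alpha = res.x
print(best_fc, best_alpha, res.fun)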
I'm trying to implement a multiclass logistic regression classifier that distinguishes between k different classes.
This is my code.
import numpy as np
from scipy.special import expit

def cost(X, y, theta, regTerm):
    (m, n) = X.shape
    J = (np.dot(-(y.T),np.log(expit(np.dot(X,theta))))-np.dot((np.ones((m,1))-y).T,np.log(np.ones((m,1)) - (expit(np.dot(X,theta))).reshape((m,1))))) / m + (regTerm / (2 * m)) * np.linalg.norm(theta[1:])
    return J

def gradient(X, y, theta, regTerm):
    (m, n) = X.shape
    grad = np.dot(((expit(np.dot(X,theta))).reshape(m,1) - y).T,X)/m + (np.concatenate(([0],theta[1:].T),axis=0)).reshape(1,n)
    return np.asarray(grad)

def train(X, y, regTerm, learnRate, epsilon, k):
    (m, n) = X.shape
    theta = np.zeros((k, n))
    for i in range(0, k):
        previousCost = 0
        currentCost = cost(X, y, theta[i, :], regTerm)
        while np.abs(currentCost - previousCost) > epsilon:
            print(theta[i, :])
            theta[i, :] = theta[i, :] - learnRate * gradient(X, y, theta[i, :], regTerm)
            print(theta[i, :])
            previousCost = currentCost
            currentCost = cost(X, y, theta[i, :], regTerm)
    return theta

trX = np.load('trX.npy')
trY = np.load('trY.npy')
theta = train(trX, trY, 2, 0.1, 0.1, 4)
I can verify that cost and gradient are returning values with the right dimensions (cost returns a scalar, and gradient returns a 1 by n row vector), but I get the warning
RuntimeWarning: divide by zero encountered in log
J = (np.dot(-(y.T),np.log(expit(np.dot(X,theta))))-np.dot((np.ones((m,1))-y).T,np.log(np.ones((m,1)) - (expit(np.dot(X,theta))).reshape((m,1))))) / m + (regTerm / (2 * m)) * np.linalg.norm(theta[1:])
Why is this happening and how can I avoid it?
The proper solution here is to add some small epsilon to the argument of the log function. What worked for me was:

epsilon = 1e-5

def cost(X, y, theta):
    m = X.shape[0]
    yp = expit(X @ theta)
    cost = -np.average(y * np.log(yp + epsilon) + (1 - y) * np.log(1 - yp + epsilon))
    return cost
You can clean up the formula by appropriately using broadcasting, the operator * for element-wise products of vectors, and the operator @ for matrix multiplication, and by breaking it up as suggested in the comments.
Here is your cost function:

def cost(X, y, theta, regTerm):
    m = X.shape[0]  # or y.shape[0], or p.shape[0] after the next line; the number of training examples
    p = expit(X @ theta)
    log_loss = -np.average(y * np.log(p) + (1 - y) * np.log(1 - p))
    J = log_loss + regTerm * np.linalg.norm(theta[1:]) / (2 * m)
    return J
You can clean up your gradient function along the same lines.
By the way, are you sure you want np.linalg.norm(theta[1:])? If you're trying to do L2 regularization, the term should be np.linalg.norm(theta[1:]) ** 2.
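If so, that is a small change to the cost above; a sketch of the squared-norm variant (same assumptions as the function above, with expit imported from scipy.special):

def cost(X, y, theta, regTerm):
    m = X.shape[0]
    p = expit(X @ theta)
    log_loss = -np.average(y * np.log(p) + (1 - y) * np.log(1 - p))
    # squared L2 penalty on the non-intercept weights
    return log_loss + regTerm * np.linalg.norm(theta[1:]) ** 2 / (2 * m)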
Cause:
This is happening because, in some cases, whenever y[i] is equal to 1, the value of the sigmoid expit(np.dot(X, theta)) also becomes (numerically) equal to 1.
Cost function:
J = (np.dot(-(y.T),np.log(expit(np.dot(X,theta))))-np.dot((np.ones((m,1))-y).T,np.log(np.ones((m,1)) - (expit(np.dot(X,theta))).reshape((m,1))))) / m + (regTerm / (2 * m)) * np.linalg.norm(theta[1:])
Now, consider the following part in the above code snippet:
np.log(np.ones((m,1)) - (expit(np.dot(X,theta))).reshape((m,1)))
Here, you are computing 1 minus the sigmoid value when that value is 1, so this effectively becomes log(1 - 1) = log(0), which is undefined.
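A minimal way to see this saturation numerically (the score value 40 is only an illustration of a large np.dot(X, theta) entry):

import numpy as np
from scipy.special import expit

z = 40.0              # a large score for one sample
p = expit(z)          # in float64 the sigmoid rounds to exactly 1.0 here
print(p)              # 1.0
print(np.log(1 - p))  # -inf, with "RuntimeWarning: divide by zero encountered in log"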
I'm guessing your data has negative values in it. You can't take the log of a negative number.
import numpy as np
np.log(2)
> 0.69314718055994529
np.log(-2)
> nan
There are a lot of different ways to transform your data that should help, if this is the case.
def cost(X, y, theta):
    yp = expit(X @ theta)
    cost = -np.average(y * np.log(yp) + (1 - y) * np.log(1 - yp))
    return cost

The warning originates from np.log(yp) when yp == 0 and from np.log(1 - yp) when yp == 1. One option is to filter out these values and not pass them into np.log. The other option is to add a small constant to prevent the value from being exactly 0 (as suggested in one of the comments above).
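A third variant along the same lines is to clip the predictions away from 0 and 1; a sketch (the bound 1e-12 is an arbitrary choice):

import numpy as np
from scipy.special import expit

def cost_clipped(X, y, theta, eps=1e-12):
    # keep predictions strictly inside (0, 1) so np.log never sees 0
    yp = np.clip(expit(X @ theta), eps, 1 - eps)
    return -np.average(y * np.log(yp) + (1 - y) * np.log(1 - yp))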
Add a small epsilon value to the argument of the log so that it never becomes exactly zero and the warning goes away. I am not sure whether it will affect the accuracy of the results, though.
I am using PyMC to fit some data to a straight line. The data have outliers, so I adapted some code (third example at the link) written by Jake Vanderplas for his textbook. The method uses a vector variable qi to encode whether each individual data point belongs to the foreground model (which we are fitting to the line) or the background model, which we don't care about.
import numpy as np
import pymc                      # this code uses the PyMC 2.x API
import matplotlib.pyplot as plt
from corner import corner        # corner-plot function (assumed to come from the corner package)

class lin_fit_ol(object):
    '''
    fit a straight line to one independent variable
    (`xi`, with zero errors) and one dependent variable
    (`yi`, with possibly heteroscedastic errors `dyi`)

    Outliers in `yi` are permitted

    Intended to be a complement to a straight-line fit, for model
    testing purposes

    Modified from Vanderplas's code
    (found at http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html)
    '''

    def __init__(self, xi, yi, dyi, value):
        self.xi, self.yi, self.dyi, self.value = xi, yi, dyi, value

        @pymc.stochastic
        def beta(value=np.array([0.5, 1.0])):
            """Slope and intercept parameters for a straight line.
            The likelihood corresponds to the prior probability of the parameters."""
            slope, intercept = value
            prob_intercept = 1 + 0 * intercept
            # uniform prior on theta = arctan(slope)
            # d[arctan(x)]/dx = 1 / (1 + x^2)
            prob_slope = np.log(1. / (1. + slope ** 2))
            return prob_intercept + prob_slope

        @pymc.deterministic
        def model(xi=xi, beta=beta):
            slope, intercept = beta
            return slope * xi + intercept

        # uniform prior on Pb, the fraction of bad points
        Pb = pymc.Uniform('Pb', 0, 1.0, value=0.1)

        # uniform prior on Yb, the centroid of the outlier distribution
        Yb = pymc.Uniform('Yb', -10000, 10000, value=0)

        # uniform prior on log(sigmab), the spread of the outlier distribution
        log_sigmab = pymc.Uniform('log_sigmab', -10, 10, value=5)

        # qi is bernoulli distributed
        # Note: this syntax requires pymc version 2.2
        qi = pymc.Bernoulli('qi', p=1 - Pb, value=np.ones(len(xi)))

        @pymc.deterministic
        def sigmab(log_sigmab=log_sigmab):
            return np.exp(log_sigmab)

        def outlier_likelihood(yi, mu, dyi, qi, Yb, sigmab):
            """likelihood for full outlier posterior"""
            Vi = dyi ** 2
            Vb = sigmab ** 2

            root2pi = np.sqrt(2 * np.pi)

            logL_in = -0.5 * np.sum(
                qi * (np.log(2 * np.pi * Vi) + (yi - mu) ** 2 / Vi))

            logL_out = -0.5 * np.sum(
                (1 - qi) * (np.log(2 * np.pi * (Vi + Vb)) +
                            (yi - Yb) ** 2 / (Vi + Vb)))

            return logL_out + logL_in

        OutlierNormal = pymc.stochastic_from_dist(
            'outliernormal', logp=outlier_likelihood, dtype=np.float,
            mv=True)

        y_outlier = OutlierNormal(
            'y_outlier', mu=model, dyi=dyi, Yb=Yb, sigmab=sigmab, qi=qi,
            observed=True, value=yi)

        self.M = dict(y_outlier=y_outlier, beta=beta, model=model,
                      qi=qi, Pb=Pb, Yb=Yb, log_sigmab=log_sigmab,
                      sigmab=sigmab)

        self.sample_invoked = False

    def sample(self, iter, burn, calc_deviance=True):
        self.S0 = pymc.MCMC(self.M)
        self.S0.sample(iter=iter, burn=burn)
        self.trace = self.S0.trace('beta')

        self.btrace = self.trace[:, 0]
        self.mtrace = self.trace[:, 1]

        self.sample_invoked = True

    def triangle(self):
        assert self.sample_invoked == True, \
            'Must sample first! Use sample(iter, burn)'

        corner(self.trace[:], labels=['$m$', '$b$'])

    def plot(self, xlab='$x$', ylab='$y$'):
        # plot the data points
        plt.errorbar(self.xi, self.yi, yerr=self.dyi, fmt='.k')

        # do some shimmying to get quantile bounds
        xa = np.linspace(self.xi.min(), self.xi.max(), 100)
        A = np.vander(xa, 2)
        # generate all possible lines
        lines = np.dot(self.trace[:], A.T)
        quantiles = np.percentile(lines, [16, 84], axis=0)
        plt.fill_between(xa, quantiles[0], quantiles[1],
                         color="#8d44ad", alpha=0.5)

        # plot circles around points identified as outliers
        qi = self.S0.trace('qi')[:]
        Pi = qi.astype(float).mean(0)
        outlier_x = self.xi[Pi < 0.32]
        outlier_y = self.yi[Pi < 0.32]
        plt.scatter(outlier_x, outlier_y, lw=1, s=400, alpha=0.5,
                    facecolors='none', edgecolors='red')

        plt.xlabel(xlab)
        plt.ylabel(ylab)

    def ICs(self):
        self.MAP = pymc.MAP(self.M)
        self.MAP.fit()
        self.BIC = self.MAP.BIC
        self.AIC = self.MAP.AIC
        self.logp = self.MAP.logp
        self.logp_at_max = self.MAP.logp_at_max
        return self.AIC, self.BIC
So, when we calculate the BIC and AIC using this model, we get very large values (since there are lots of points). This makes total sense. However, this disfavors having many data points, which irks me. Plus, the large AIC and BIC would make a casual observer believe that the other model (which fits poorly as a result of the outliers) is actually the better model.
Am I missing a subtlety of the BIC and AIC here, or is a harsh reality of using mixture models that you always have to use a bunch of extra binary parameters to denote the membership of your datapoints?
I recommend the book "Introduction to Statistical Learning".
On page 212 you can find the formulas for AIC and BIC. In each of these formulas the sample size appears in the denominator, so the result should not be influenced by the number of samples, at least not in such an obvious way.
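For context, the generic definitions, with $k$ parameters, $n$ samples and maximized likelihood $\hat{L}$, are

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}.
$$

As far as I recall, the regression-specific versions in that book rescale these criteria by $1/n$, which is the per-sample scaling referred to above.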