I'm trying to implement a fast entropy calculation for a float list of probabilities.
Instead of looping through a list and checking for zero each time, I'm attempting to mask zeros using numpy's built-in masking functionality. It works absolutely fine unless I put it into a function, at which point it breaks. Any suggestions?
import numpy as np
# Works fine!!
distribution = np.array([0.20, 0.3, 0.25, 0.25, 0])
log_dist = np.log2(distribution, out=np.zeros_like(distribution), where=(distribution!=0))
entropy = -np.sum(distribution * log_dist)
print(entropy)
# Breaks!
def calculate_entropy(distribution):
    log_dist = np.log2(distribution, out=np.zeros_like(distribution), where=(distribution!=0))
    entropy = -np.sum(distribution * log_dist)
    return entropy
calculate_entropy([0.20, 0.3, 0.25, 0.25, 0])
output:
nan
Error message:
/var/folders/bt/vk3t9rnn2jz5d1wgj2rc3v200000gn/T/ipykernel_61321/2272953976.py:3: RuntimeWarning: divide by zero encountered in log2
log_dist = np.log2(distribution, out=np.zeros_like(distribution), where=(distribution!=0))
/var/folders/bt/vk3t9rnn2jz5d1wgj2rc3v200000gn/T/ipykernel_61321/2272953976.py:4: RuntimeWarning: invalid value encountered in multiply
entropy = -np.sum(distribution * log_dist)
I was expecting the function to work exactly the same, what am I missing?
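Ugh, I'm an idiot. I forgot to convert the list into a numpy array. With a plain list, distribution != 0 compares the whole list to 0 and evaluates to the single value True, so nothing is masked and log2(0) is computed anyway. Fix: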
def calculate_entropy(distribution):
    distribution = np.array(distribution)
    log_dist = np.log2(distribution, out=np.zeros_like(distribution), where=(distribution!=0))
    entropy = -np.sum(distribution * log_dist)
    return entropy
calculate_entropy([0.20, 0.3, 0.25, 0.25, 0])
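A slightly more defensive variant of the fix above, as a sketch rather than the definitive version: it converts with np.asarray and dtype=float (an addition that is not in the original post) so that integer inputs such as [1, 0] also work. Without the float conversion, np.zeros_like on an integer array yields an integer out buffer that np.log2 cannot write its float result into.

```python
import numpy as np

def calculate_entropy(distribution):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = np.asarray(distribution, dtype=float)
    # Only evaluate log2 where p != 0; masked positions keep the 0 from `out`.
    log_p = np.log2(p, out=np.zeros_like(p), where=(p != 0))
    return -np.sum(p * log_p)

# A plain list compared with 0 gives one bool, not an element-wise mask.
print([0.20, 0.3, 0.25, 0.25, 0] != 0)
print(calculate_entropy([0.20, 0.3, 0.25, 0.25, 0]))
```

The first print shows why the un-converted list version silently disables the mask.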
I have the following function, which I need to fit using the least-squares method (I am using lmfit).
y = a * exp(-x/b) + c
For example, I have the following data:
profitlist = [-10000, 100.00, 1000.00, 100000.00, 1000000.00]
utilitylist = [0, 0.2, 0.4, 0.6, 1]
The app returns the following error:
ValueError: NaN values detected in your input data or the output of your objective/model function - fitting algorithms cannot handle this! Please read https://lmfit.github.io/lmfit-py/faq.html#i-get-errors-from-nan-in-my-fit-what-can-i-do for more information.
The problem seems to be that exp(-x/b) overflows to inf (so the model returns inf or -inf) if profitList contains a large negative number (-1000 worked, -100000 did not).
The values in profitList can be very large floats, and they are not always the same. So how can I run the optimization with such huge numbers? It seems that lmfit does not support decimal numbers, which would fix the issue... What can I do to make it work?
import numpy as np
from lmfit import Parameters, minimize, fit_report

class LeastSquares:
    def __init__(self, profitList, utilityList):
        self.profitList = np.asarray(profitList)
        self.utilityList = np.asanyarray(utilityList)

    def function(self, params, x):
        a = params["a"]
        b = params["b"]
        c = params["c"]
        return a * np.exp(-x/b) + c

    def residual(self, params, x, y):
        return (y - self.function(params, x))**2

    def setParameters(self, a_start, b_start, c_start):
        parameters = Parameters()
        parameters.add(name="a", value=a_start, min=None, max=0, vary=True)
        parameters.add(name="b", value=b_start, vary=True, min=0.1, max=None)
        parameters.add(name="c", value=c_start, vary=True)
        return parameters

    def startOptimalization(self):
        parameters = self.setParameters(-1, 1, 1)
        result = minimize(self.residual, parameters, args=(self.profitList, self.utilityList), method="leastsq")
        result.params.pretty_print()
        print(fit_report(result))
        print("SSE")
        print(np.sum(result.residual))
As you have seen, numpy.exp(arg) gives Infinity for any argument greater than ~709, and you will need to avoid such extreme values; the underlying solvers simply cannot handle them. Since your argument is -x/b, you need to make sure that b is not so small that it blows up the argument to numpy.exp().
In fact, your code shows that you do set a lower bound on b of 0.1.
But with values of profitlist extending to 1e7, that lower bound is too small to prevent Infinity: your lower limit on b would have to be around 14,000.
If your values for profitlist change for each optimization run, you may need to do something like this (in your startOptimalization):
parameters = self.setParameters(-1, 1, 1)
parameters['b'].min = max(abs(self.profitList))/700.0
result = minimize(self.residual, parameters, args=(self.profitList, self.utilityList), method="leastsq")
result.params.pretty_print()
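To make the arithmetic behind that lower bound concrete, here is a minimal numpy-only sketch. The helper name min_b_for_finite_exp and the max_arg=700.0 default are illustrative assumptions, not part of lmfit; the data is the profitlist from the question.

```python
import numpy as np

def min_b_for_finite_exp(x, max_arg=700.0):
    """Smallest b with |x / b| <= max_arg for every x (hypothetical helper)."""
    return np.max(np.abs(np.asarray(x, dtype=float))) / max_arg

profitlist = np.array([-10000, 100.00, 1000.00, 100000.00, 1000000.00])
b_min = min_b_for_finite_exp(profitlist)

print(b_min)
# With b at this bound, every exponent -x/b stays within [-700, 700].
print(np.all(np.isfinite(np.exp(-profitlist / b_min))))
```

The bound is conservative: only negative x can push -x/b towards overflow, while large positive x merely underflows to 0, so a bound based on the most negative value alone would probably also be enough.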
Also, when fitting exponential changes, it is often helpful to compute your exponential model function and then take the residual as the difference between the logarithm of your data and the logarithm of your model, effectively doing the fit in log-space, as you would likely plot the data.
And, finally, don't take the square or the sum of squares of the difference yourself; just return the residual array with its sign intact. That is, you will probably be better off using something like:
def residual(self, params, x, y):
    return np.log(y) - np.log(self.function(params, x))
I'm trying to calculate the DOP values for a set of GPS satellites in Python 2.7.2 using numpy 1.9.3.
I found a guide on how to do this but I'm having trouble translating it to python.
Here's what I tried so far:
import numpy as np
# First I defined 3 variables for each satellite as described in the guide.
sat_1_1 = np.sin(np.deg2rad(136)) * np.cos(np.deg2rad(14))
sat_1_2 = np.cos(np.deg2rad(136)) * np.cos(np.deg2rad(14))
sat_1_3 = np.sin(np.deg2rad(14))
sat_2_1 = np.sin(np.deg2rad(329)) * np.cos(np.deg2rad(48))
sat_2_2 = np.cos(np.deg2rad(329)) * np.cos(np.deg2rad(48))
sat_2_3 = np.sin(np.deg2rad(48))
sat_3_1 = np.sin(np.deg2rad(253)) * np.cos(np.deg2rad(36))
sat_3_2 = np.cos(np.deg2rad(253)) * np.cos(np.deg2rad(36))
sat_3_3 = np.sin(np.deg2rad(36))
sat_4_1 = np.sin(np.deg2rad(188)) * np.cos(np.deg2rad(9))
sat_4_2 = np.cos(np.deg2rad(188)) * np.cos(np.deg2rad(9))
sat_4_3 = np.sin(np.deg2rad(9))
# Next I created the line-of-sight matrix:
LOS_Matrix = np.array([[sat_1_1, sat_1_2, sat_1_3, 1.0],
                       [sat_2_1, sat_2_2, sat_2_3, 1.0],
                       [sat_3_1, sat_3_2, sat_3_3, 1.0],
                       [sat_4_1, sat_4_2, sat_4_3, 1.0]])
# Then its transpose:
LOS_Matrix_t = LOS_Matrix.transpose()
# Next the guide says to compute the covariance matrix, which is said to be equal to the inverse of LOS_Matrix * LOS_Matrix_t, so:
cov_matrix = np.linalg.inv(LOS_Matrix * LOS_Matrix_t)
# This should now let me calculate the DOP values such as GDOP, PDOP, etc.
PDOP = np.sqrt(cov_matrix[0, 0] + cov_matrix[1, 1] + cov_matrix[2, 2])
# This comes out as 2.25575033021, which is possible, though it seems suspiciously low.
# Also, TDOP can't be computed since cov_matrix[3, 3] is a negative number, so something must be wrong, I guess?
I'm a python noob and math isn't my strong suit either, I only got this far by googling error message after error message.
I'm now at a point where it runs without any error message, but it doesn't seem correct either; otherwise the TDOP value should be computable, for example.
Does anyone have an idea where the issue lies?
Cheers
cov_matrix = np.linalg.inv(LOS_Matrix * LOS_Matrix_t)
Should probably be
cov_matrix = np.linalg.inv(LOS_Matrix.dot(LOS_Matrix_t))
I know, I know, it's confusing. But in numpy you have two different types: ndarray, which you should use, and matrix, which you should not use. For ndarray, multiplication with * defaults to element-wise multiplication.
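A small sketch of the difference, using the azimuth and elevation values from the question. One assumption to check against your guide: the DOP formulation I am familiar with inverts A^T A (transpose first), whereas the thread writes the product in the other order. For a square matrix the two orderings give different diagonals, so the ordering below is my assumption rather than something the thread confirms.

```python
import numpy as np

# Azimuth and elevation in degrees, taken from the question.
az = np.deg2rad([136.0, 329.0, 253.0, 188.0])
el = np.deg2rad([14.0, 48.0, 36.0, 9.0])

# One row per satellite: [sin(az)cos(el), cos(az)cos(el), sin(el), 1].
A = np.column_stack([np.sin(az) * np.cos(el),
                     np.cos(az) * np.cos(el),
                     np.sin(el),
                     np.ones_like(az)])

elementwise = A * A.T      # element-wise product, NOT a matrix product
matrix_prod = A.T.dot(A)   # matrix product A^T A (assumed DOP convention)

Q = np.linalg.inv(matrix_prod)
PDOP = np.sqrt(Q[0, 0] + Q[1, 1] + Q[2, 2])
TDOP = np.sqrt(Q[3, 3])
GDOP = np.sqrt(np.trace(Q))
print(PDOP, TDOP, GDOP)
```

Because A^T A is symmetric positive definite whenever A has full rank, the diagonal of its inverse is positive, so TDOP is always computable with a true matrix product.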
I am moving some code from R to Anaconda Python. The R code uses qnorm, documented as "quantile function for the normal distribution with mean equal to mean and standard deviation equal to sd."
The call and parameters are:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
p           vector of probabilities.
mean        vector of means.
sd          vector of standard deviations.
log.p       logical; if TRUE, probabilities p are given as log(p).
lower.tail  logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
I don't see any equivalent in pandas.Series. Have I missed it, is it on another object, or is there some equivalent in another library?
A lot of this equivalent functionality is found in scipy.stats. In this case, you're looking for scipy.stats.norm.ppf.
qnorm(p, mean = 0, sd = 1) is equivalent to scipy.stats.norm.ppf(p, loc=0, scale=1).
>>> import scipy.stats as st
>>> st.norm.ppf([0.01, 0.99])
array([-2.32634787, 2.32634787])
>>> st.norm.ppf([0.01, 0.99], loc=10, scale=0.1)
array([ 9.76736521, 10.23263479])
Just to expand on @miradulo's answer: if you also want qnorm(lower.tail=FALSE), you can multiply by -1. Note that this works for the standard normal (mean 0), which is symmetric about zero:
In R:
qnorm(0.8, lower.tail = F)
-0.8416212
In Python:
from scipy.stats import norm
norm.ppf(0.8) * -1
-0.8416212
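The sign flip relies on the distribution being symmetric about 0. As a hedged sketch of a more general route, scipy.stats.norm also exposes isf (the inverse survival function), which gives the upper-tail quantile directly and keeps working with a non-zero loc. The loc=10, scale=2 values below are arbitrary illustration values.

```python
from scipy.stats import norm

p = 0.8

# Upper-tail quantile for the standard normal: three equivalent forms.
print(norm.isf(p))       # inverse survival function
print(norm.ppf(1 - p))   # lower-tail quantile of the complement
print(-norm.ppf(p))      # sign flip, valid only when the mean is 0

# With a non-zero mean, use isf (or ppf(1 - p)); the sign flip no longer matches.
print(norm.isf(p, loc=10, scale=2))
print(-norm.ppf(p, loc=10, scale=2))
```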
I have the following code which attempts to minimize a log likelihood function.
#!/usr/bin/python
import math
import random
import numpy as np
from scipy.optimize import minimize

def loglikelihood(params, data):
    (mu, alpha, beta) = params
    tlist = np.array(data)
    r = np.zeros(len(tlist))
    for i in xrange(1,len(tlist)):
        r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
    loglik = -tlist[-1]*mu
    loglik = loglik+alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
    loglik = loglik+np.sum(np.log(mu+alpha*r))
    return -loglik

atimes = [ 148.98894201, 149.70253172, 151.13717804, 160.35968355,
           160.98322609, 161.21331798, 163.60755544, 163.68994973,
           164.26131871, 228.79436067]
a= 0.01
alpha = 0.5
beta = 0.6
print loglikelihood((a, alpha, beta), atimes)
res = minimize(loglikelihood, (0.01, 0.1,0.1), method = 'BFGS',args = (atimes,))
print res
It gives me
28.3136498357
./test.py:17: RuntimeWarning: invalid value encountered in log
loglik = loglik+np.sum(np.log(mu+alpha*r))
status: 2
success: False
njev: 14
nfev: 72
hess_inv: array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
fun: 32.131359359964378
x: array([ 0.01, 0.1 , 0.1 ])
message: 'Desired error not necessarily achieved due to precision loss.'
jac: array([ -2.8051672 , 13.06962156, -48.97879982])
Notice that it hasn't managed to optimize the parameters at all and the minimized value 32 is bigger than 28 which is what you get with a= 0.01, alpha = 0.5, beta = 0.6 . It's possible this problem could be avoided by choosing better initial guesses but if so, how can I do this automatically?
Nelder-Mead, TNC and SLSQP work as drop-in replacements. None of the other methods do.
I copied your example and experimented a little. It looks like, if you stick with the BFGS solver, after a few iterations mu + alpha * r will contain some negative numbers, and that's how you get the RuntimeWarning.
The easiest fix I can think of is to switch to the Nelder-Mead solver.
res = minimize(loglikelihood, (0.01, 0.1,0.1), method = 'Nelder-Mead',args = (atimes,))
And it will give you this result:
28.3136498357
status: 0
nfev: 159
success: True
fun: 27.982451280648817
x: array([ 0.01410906, 0.68346023, 0.90837568])
message: 'Optimization terminated successfully.'
nit: 92
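For readers on Python 3, here is a sketch of the same experiment ported from the Python 2 code in the question (range instead of xrange, print as a function). One deliberate deviation, which is my own choice: the recursion uses np.exp instead of math.exp, so an overflow during the search yields inf with a warning rather than raising OverflowError.

```python
import numpy as np
from scipy.optimize import minimize

def loglikelihood(params, data):
    """Negative log-likelihood from the question, ported to Python 3."""
    mu, alpha, beta = params
    tlist = np.asarray(data, dtype=float)
    r = np.zeros(len(tlist))
    for i in range(1, len(tlist)):
        r[i] = np.exp(-beta * (tlist[i] - tlist[i - 1])) * (1 + r[i - 1])
    loglik = -tlist[-1] * mu
    loglik += alpha / beta * np.sum(np.exp(-beta * (tlist[-1] - tlist)) - 1)
    loglik += np.sum(np.log(mu + alpha * r))
    return -loglik

atimes = [148.98894201, 149.70253172, 151.13717804, 160.35968355,
          160.98322609, 161.21331798, 163.60755544, 163.68994973,
          164.26131871, 228.79436067]

start_value = loglikelihood((0.01, 0.5, 0.6), atimes)
res = minimize(loglikelihood, (0.01, 0.1, 0.1), method="Nelder-Mead", args=(atimes,))
print(start_value)
print(res.x, res.fun)
```

Nelder-Mead uses no gradients, which is presumably why it copes better when the objective briefly becomes NaN; the exact iterates may differ between SciPy versions.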
Another solution (that worked for me) is to scale your function (and gradients) to values closer to 0. For example, my problem came up when I had to evaluate a log-likelihood of 60k points. This meant that my log-likelihood was a very large number. Conceptually, the log-likelihood was a very, very spiky function.
The gradients started off large (to climb this spiky mountain) and then became moderately small, but never less than the default gtol parameter in the BFGS routine (which is the threshold that all gradients must be below for termination). Also, by this time I had essentially arrived at the correct values (I was using generated data, so I knew the true values).
What was happening was that my gradients were approx. 60k * the average individual gradient value, and even if the average individual gradient value was small, say less than 1e-8, 60k * 1e-8 > gtol. So I was never satisfying the threshold even though I had arrived at the solution.
Conceptually, because of this very spiky mountain, the algorithm was making small steps but stepping over the true minimum, and it never achieved an average individual gradient << 1e-8, which implies my gradients never went under gtol.
Two solutions:
1) Scale your log-likelihood and gradients by a factor, like 1/n where n is the number of samples.
2) Scale your gtol: for example "gtol": 1e-7 * n
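A minimal sketch of option 1, under stated assumptions: the data is synthetic Gaussian data generated here (not the poster's data), and the model is a plain Gaussian likelihood parameterized by mu and log(sigma), chosen only because its maximum-likelihood solution is known in closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=60_000)  # synthetic data (assumption)

def neg_loglik(params, x):
    """Gaussian negative log-likelihood (up to a constant); grows with len(x)."""
    mu, log_sigma = params
    return np.sum(0.5 * ((x - mu) / np.exp(log_sigma)) ** 2 + log_sigma)

def scaled_neg_loglik(params, x):
    """Option 1: divide by n so values and gradients stay of order 1."""
    return neg_loglik(params, x) / len(x)

res = minimize(scaled_neg_loglik, x0=(0.0, 0.0), args=(data,), method="BFGS")
print(res.x[0], np.exp(res.x[1]))

# Option 2 would keep neg_loglik unscaled and loosen the tolerance instead,
# e.g. options={"gtol": 1e-7 * len(data)} as suggested above.
```

Dividing by n does not move the minimum; it only changes the scale on which gtol is judged.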
Facing the same warning, I solved it by rewriting the log-likelihood function to take log(params) and log(data) as arguments, instead of params and data.
That way, I avoid using np.log() in the likelihood function or Jacobian where possible.
Watch out for negative arguments to the log() function: clip them, and tell the optimizer that they are bad by adding a penalty:
#!/usr/bin/python
import math
import random
import numpy as np
from scipy.optimize import minimize

def loglikelihood(params, data):
    (mu, alpha, beta) = params
    tlist = np.array(data)
    r = np.zeros(len(tlist))
    for i in xrange(1,len(tlist)):
        r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
    loglik = -tlist[-1]*mu
    loglik += alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
    argument = mu + alpha * r
    limit = 1e-6
    if np.min(argument) < limit:
        # add a penalty for too small argument of log
        loglik += np.sum(np.minimum(0.0, argument - limit)) / limit
        # keep argument of log above the limit
        argument = np.maximum(argument, limit)
    loglik += np.sum(np.log(argument))
    return -loglik

atimes = [ 148.98894201, 149.70253172, 151.13717804, 160.35968355,
           160.98322609, 161.21331798, 163.60755544, 163.68994973,
           164.26131871, 228.79436067]
a= 0.01
alpha = 0.5
beta = 0.6
print loglikelihood((a, alpha, beta), atimes)
res = minimize(loglikelihood, (0.01, 0.1,0.1), method = 'BFGS',args = (atimes,))
print res
I know I am late, but I do 3 optimizations in series. First I use Nelder-Mead to get close; without first getting close, I get way too many overflow errors. I then copy res.x to the starting parameters for the next optimizing routine. I have found that Powell is the most reliable, and it usually does a pretty good job. BUT, I then do another minimization using Nelder-Mead again to avoid falling into local minima.
Usually, there isn't much improvement after using the Powell minimization.
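The chaining described above can be sketched as follows. The objective here is SciPy's built-in Rosenbrock function (scipy.optimize.rosen), used purely as a stand-in since the poster's own objective is not shown; the starting point is likewise an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.2, 1.0])

# Each stage starts from the previous stage's result (res.x).
stage1 = minimize(rosen, x0, method="Nelder-Mead")        # get close
stage2 = minimize(rosen, stage1.x, method="Powell")       # refine
stage3 = minimize(rosen, stage2.x, method="Nelder-Mead")  # final check

for name, res in [("Nelder-Mead", stage1), ("Powell", stage2), ("Nelder-Mead", stage3)]:
    print(name, res.x, res.fun)
```

All three methods are derivative-free, so none of them evaluates a numerical gradient that could be polluted by NaN or inf values.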
I'm using statsmodels' weighted least squares regression, but getting some really huge values.
Here's my code:
import numpy as np
import statsmodels.api as sm
X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
w = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
temp_g = sm.WLS(y, X, w).fit()
Now, what I understand is that in WLS regression, just like in any linear regression problem, we provide the endog vector and the exog matrix, and the function finds the line of best fit and tells us what the coefficient/regression parameter for each feature ought to be. For example, in my data, where each observation consists of 3 features, I'm expecting there to be 3 parameters.
So I fetch them like this:
parameters = temp_g.params # I'm hoping I've got this right! Or do I need to use "fittedvalues" instead?
The issue is that I'm getting really huge values like this:
temp g params :
[ -7.66645036e+198 -9.01935337e+197 5.86257969e+198]
or this:
temp g params :
[-2.77777778 -0.44444444 1.88888889]
This is creating problems in further usage of these parameters, especially since I have some exponents to work with as well: I need to raise e to the power of some of the regression parameters, which is proving impossible given such big numbers, because I keep getting overflow errors when using exp().
Is this normal? Am I doing something wrong? Or is there a specific way to make them useful?
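One check that may be worth running, offered as a sketch that is not part of the original question: compare the rank of X with its number of columns. If the rank is smaller, the columns are linearly dependent and the least-squares coefficients are not uniquely determined, which is one plausible reason for unstable or very large parameter values.

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [1, 2, 3], [4, 5, 6],
              [1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6], [1, 2, 3]])

# X contains only two distinct rows, so its rank cannot exceed 2.
rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])
```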