Python won't show me np.exp marketshare values

I'm trying to estimate marketshares with the following formula:
c = np.exp(-Mu*a)/(np.exp(-Mu*a)+np.exp(-Mu*b))
in which a and b are 9x9 matrices whose cell values can be larger than 1000. Because the exponentials become so small, Python returns NaN values. To improve the precision of the estimation I have already tried np.float128, but all this does is raise an error saying that numpy has no attribute called float128. I have also tried longdouble, again without success. Are there other ways to make Python show the actual cell values instead of NaN?

You have:
c = np.exp(-Mu*a)/(np.exp(-Mu*a)+np.exp(-Mu*b))
Multiplying the numerator and denominator by e^(Mu*a), you get:
c = 1/(1+np.exp(Mu*(a-b)))
This is just a reformulation of the same formula.
Now, if the exp term still underflows and you do not need a more precise result, then your c is simply very close to 1. And if you still need to control the precision, you can take the log of both sides and use the Taylor expansion of log(1+x).
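As an illustration (not from the original answer), the reformulated expression can be evaluated with scipy.special.expit, which computes 1/(1+exp(-x)) without overflow. Mu, a and b below are made-up stand-ins for the question's data:
import numpy as np
from scipy.special import expit  # stable logistic function: 1 / (1 + exp(-x))

rng = np.random.default_rng(0)
Mu = 0.01
a = rng.uniform(0.0, 2000.0, size=(9, 9))  # placeholder 9x9 data; cells can exceed 1000
b = rng.uniform(0.0, 2000.0, size=(9, 9))

# c = exp(-Mu*a) / (exp(-Mu*a) + exp(-Mu*b))  ==  1 / (1 + exp(Mu*(a - b)))
c = expit(Mu * (b - a))
print(c)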


Numpy float mean calculation precision

I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off in the 15th and 16th significant figures.
Can anybody show how this difference accumulates over the 30K-element mean, and whether there is a way to avoid it?
In case it matters my OS is 64 bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++i)
    sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2^p, then ULP(x) = 2^(p−52).) With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is "approximately linear," but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•10^8 times the ULP of the maximum element value. The ULP is 2^−52 times the greatest power of two not less than m, so that is about 2.7•10^8 • 2^−52 = 6•10^−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
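As a quick numeric check (my own sketch, not part of the original answer), the relative bound for n = 32,769 works out as quoted:
n = 32769
ulp_over_m = 2.0 ** -52                  # ULP(m)/m for IEEE-754 binary64, up to the binade approximation
bound = 0.25 * n * (n + 1) * ulp_over_m
print(bound)                             # about 6e-8, matching the figure above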
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 in steps of 100 and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2^−24, and double has enough precision so that the sum of up to 2^29 such values is exact.)
The green line is c·n·√n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back toward the green line.
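For reference, here is a rough sketch of a similar experiment (not the author's original code) that measures the float32 summation error against a float64 reference; it uses 100 samples per size instead of 10,000 to keep it quick:
import numpy as np

rng = np.random.default_rng(0)
sizes = range(100, 32801, 100)
mean_errors = []
for n in sizes:
    samples = rng.random((100, n), dtype=np.float32)  # uniform values in [0, 1)
    s32 = samples.sum(axis=1, dtype=np.float32)       # single-precision accumulation
    s64 = samples.sum(axis=1, dtype=np.float64)       # double-precision reference
    mean_errors.append(np.abs(s32.astype(np.float64) - s64).mean())
print(mean_errors[-1])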
This is due to the inability of the float64 type to store the sum of your float numbers with full precision. To get around this problem you need to use a larger data type. Numpy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
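If you want to check what longdouble actually is on your platform (my own aside: on some platforms, notably Windows, it is just an alias for float64 and the float128 name does not exist), you can inspect it directly:
import numpy as np

print(np.finfo(np.longdouble))                    # precision and eps of longdouble on this machine
print(np.dtype(np.longdouble).itemsize * 8, "bits of storage")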
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float64).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-precision floats like float128 (if your system supports it), but the errors will always accumulate whenever you're summing a large number of floats together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
    if np.all(arr == arr[0]):
        return arr[0]
    else:
        return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
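A minimal illustration of that last point, using made-up data similar to the question's:
import numpy as np

a = np.full(32769, 3.699822718929953)
print(a.mean() == a[0])             # may be False because of accumulated rounding
print(np.isclose(a.mean(), a[0]))   # True within the default tolerances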

np.divide Making Entire Vector Nan

I am using np.divide to divide two vectors. The numerator is all floats and the denominator is a mix of reasonably sized floats, extremely small floats, and np.inf. The resulting vector has np.nan in every position, even though only a handful of entries should. How can I fix this so that I get np.nan where appropriate and floats everywhere else?
Try this (note that when you pass where= you should also pass out=, otherwise the skipped entries are left uninitialized):
np.true_divide(A, B, out=np.zeros_like(A, dtype=float), where=(A != 0) | (B != 0))
or
C = A / B  # may print warnings; suppress them with np.errstate if you want
C[np.isnan(C)] = 0
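A sketch of that second approach with made-up data (mixing zeros, inf and tiny floats as in the question), using np.errstate to silence the warnings:
import numpy as np

A = np.array([1.0, 0.0, 3.0, 4.0])
B = np.array([2.0, 0.0, np.inf, 1e-300])   # placeholder denominator

with np.errstate(divide='ignore', invalid='ignore'):
    C = A / B                              # 0/0 produces nan, x/inf produces 0
C[np.isnan(C)] = 0                         # keep the finite results, zero out the nans
print(C)                                   # [5.e-01 0.e+00 0.e+00 4.e+299]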

How to prevent division by zero or replace infinite values in Theano?

I'm using a cost function in Theano which involves a regularizer term that requires me to compute this term:
T.sum(c / self.squared_euclidean_distances)
as some values of self.squared_euclidean_distances might be zero, this produces NaN values. How can I work around this problem? I tried to use T.isinf but was not successful. One solution would be to replace the zeros in self.squared_euclidean_distances with a small number, or to replace the infinite values in T.sum(c / self.squared_euclidean_distances) with zero. I just don't know how to replace those values in Theano.
Take a look at T.switch. You could do for example
T.switch(T.eq(self.squared_euclidean_distances, 0), 0, c / self.squared_euclidean_distances)
(Or, upstream, make sure that you never compare a vector with itself using the squared Euclidean distance.)
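A minimal sketch of that pattern, assuming a plain matrix input named sq_dist standing in for self.squared_euclidean_distances and a scalar c:
import theano
import theano.tensor as T

sq_dist = T.matrix('sq_dist')   # hypothetical stand-in for self.squared_euclidean_distances
c = 1.0                         # hypothetical constant from the cost function

# Contribute 0 wherever the distance is exactly zero instead of dividing by zero
reg = T.sum(T.switch(T.eq(sq_dist, 0), 0, c / sq_dist))
f = theano.function([sq_dist], reg)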

normalization of the same vector gives different values in two cases?

r_capr
Out[148]: array([[-0.42300825, 0.90516059, 0.04181294]])
np.linalg.norm(r_capr.T)
Out[149]: 0.99999999760432712
a.T
Out[150]: array([[-0.42300825, 0.90516059, 0.04181294]])
np.linalg.norm(a.T)
Out[151]: 1.0
In the above we can see that for the same vector we get different norms. Why is this happening?
Machines are not 100% precise with numbers, since they are stored with finite precision (depending on the architecture it could be 16- to 128-bit floating point), so numbers that need precision close to the limit of the floating-point mantissa are more prone to errors. Given the machine precision error, you can safely assume those numbers are actually the same. When computing norms it may make more sense to scale or otherwise modify your numbers to get less error-prone results.
Also, using dot(x, x) instead of an L2 norm can be more accurate, since it avoids the square root.
See http://en.wikipedia.org/wiki/Machine_epsilon for a better discussion since this is actually a fairly complex topic.
Your exact error is caused by machine rounding, but since your vectors are not actually identical (you are showing two logically equivalent vectors, but their internal representations will differ), the norm calculation is probably being carried out at different precisions.
See this:
import numpy as np

a = np.mat('-0.42300825; 0.90516059; 0.04181294', np.float32)
r = np.mat('-0.42300825; 0.90516059; 0.04181294', np.float64)
print(np.linalg.norm(a))
print(np.linalg.norm(r))
and compare the results. You will get exactly the behaviour you are seeing. You can also verify this by checking the dtype property of your matrices.

harmonic mean in python

The Harmonic Mean function in Python (scipy.stats.hmean) requires that the input be positive numbers.
For example:
from scipy import stats
print(stats.hmean([-50.2, 100.5]))
results in:
ValueError: Harmonic mean only defined if all elements greater than zero
I don't mathematically see why this should be the case, except for the rare instance where you would end up dividing by zero. Instead of checking for a divide by zero, hmean() throws an error upon inputting any negative number, whether a harmonic mean can be found or not.
Am I missing something here in the maths? Or is this really a limitation in SciPy?
How would you go about finding the harmonic mean of a set of numbers which might be positive or negative in python?
The harmonic mean is only defined for sets of positive real numbers. If you try to compute it for sets containing negatives you get all kinds of strange and useless results, even if you don't hit division by zero. For example, applying the formula to the set (3, -3, 4) gives a mean of 12!
You can just use the defining equation of the harmonic mean:
len(a) / np.sum(1.0/a)
But Wikipedia says that the harmonic mean is defined only for positive real numbers:
http://en.wikipedia.org/wiki/Harmonic_mean
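A quick numeric check of that defining equation (my own sketch), including the (3, -3, 4) case mentioned in the previous answer:
import numpy as np

a = np.array([11.0, 13.0, 12.0, 15.0, 17.0])
print(len(a) / np.sum(1.0 / a))    # about 13.27 for all-positive data

b = np.array([3.0, -3.0, 4.0])
print(len(b) / np.sum(1.0 / b))    # 12.0 -- the dubious result quoted above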
There is a statistics library if you are using Python >= 3.6:
https://docs.python.org/3/library/statistics.html
You may use its harmonic_mean function like this. Let's say you have a list of numbers whose harmonic mean you want to find:
import statistics as s
data = [11, 13, 12, 15, 17]
s.harmonic_mean(data)
It also has other useful functions such as stdev, variance, mode, mean, and median.
The mathematical definition of the harmonic mean itself does not forbid applying it to negative numbers (although you may not want to calculate the harmonic mean of +1 and -1). However, it is designed to average quantities like ratios so that each data point gets equal weight, whereas in an arithmetic mean the extreme data points would acquire much higher weight, which is usually undesirable.
So you could either hardcode the definition yourself, as @HYRY suggested, or you may have applied the harmonic mean in the wrong context.
