Use of Inf in MATLAB and Python

I am currently translating a MATLAB program into Python. I successfully ported all the earlier vector operations using numpy. However, I am stuck on the following bit of code, which implements a cosine similarity measure.
% W and ind are different sized matrices
dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)'));
for i=1:length(range)
    dist(ind1(range(i)),i) = -Inf;
    dist(ind2(range(i)),i) = -Inf;
    dist(ind3(range(i)),i) = -Inf;
end
disp(dist)
[~, mx(range)] = max(dist);
I did not understand the following part.
dist(indx(range(i)),i) = -Inf;
What is actually happening when you use
= -Inf;
on the right-hand side?

In Matlab (see: Inf):
Inf returns the IEEE® arithmetic representation for positive infinity.
So Inf produces a value that is greater than all other numeric values, and -Inf produces a value that is guaranteed to be less than any other numeric value. It is generally used when you want to find a maximum iteratively and need an initial value that is guaranteed to be less than your first comparison.
According to Wikipedia (see: IEEE 754 Inf):
Positive and negative infinity are represented thus:
sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.
Python has the same concept using '-inf' (see Note 6 here):
float also accepts the strings “nan” and “inf” with an optional prefix “+” or “-” for Not a Number (NaN) and positive or negative infinity.
>>> a=float('-inf')
>>> a
-inf
>>> b=-27983.444
>>> min(a,b)
-inf
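numpy exposes the same value as np.inf, so -np.inf can be assigned straight into an array, which is what the MATLAB loop above does entry by entry. A minimal demonstration:

import numpy as np

d = np.array([3.0, 1.0, 4.0])
d[1] = -np.inf      # this entry can no longer win a max
print(d.max())      # 4.0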

It just assigns a minus infinity value to the left-hand side.
It may appear weird to assign that value, particularly because a distance cannot be negative. But it looks like it's used for effectively removing those entries from the max computation in the last line.
Python does have infinity (float('-inf'), as shown above), but since dist is really a distance (hence nonnegative), you could use any negative value instead of -inf to achieve the same effect, namely removing those entries from the max computation.
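For illustration, here is a rough numpy port of the MATLAB snippet from the question. This is a sketch, not a drop-in translation: the array names and sizes are made up, rng_idx stands in for the MATLAB variable range (which would shadow a Python builtin), and MATLAB's 1-based indices become 0-based.

import numpy as np

W = np.random.rand(20, 8)
ind1, ind2, ind3 = (np.random.randint(0, 20, size=20) for _ in range(3))
rng_idx = np.arange(5)                      # "range" in the MATLAB code
mx = np.zeros(20, dtype=int)

dist = W @ (W[ind2[rng_idx]].T - W[ind1[rng_idx]].T + W[ind3[rng_idx]].T)
for i in range(len(rng_idx)):
    dist[ind1[rng_idx[i]], i] = -np.inf     # knock these rows out of the max
    dist[ind2[rng_idx[i]], i] = -np.inf
    dist[ind3[rng_idx[i]], i] = -np.inf
mx[rng_idx] = dist.argmax(axis=0)           # column-wise argmax, 0-based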

The -Inf is typically used to initialize a variable so that you can later use it in a comparison in a loop.
For instance, if I want to find the maximum value of a function (and have forgotten the command max), I would write something like:
function maxF = findMax(f,a,b)
    maxF = -Inf;
    x = a:0.001:b;
    for i = 1:length(x)
        if f(x(i)) > maxF
            maxF = f(x(i));
        end
    end
It is a way in MATLAB to make sure that any other value is larger than the current one. The Python equivalent is float('-inf'); for integers in Python 2, the smallest value was -sys.maxint - 1.
See for instance:
Maximum and Minimum values for ints
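A direct Python translation of that MATLAB function, using float('-inf') as the starting value (a sketch; find_max is a made-up name and the 0.001 step is kept from the original):

def find_max(f, a, b):
    max_f = float('-inf')    # guaranteed smaller than any value f returns
    x = a
    while x <= b:
        if f(x) > max_f:
            max_f = f(x)
        x += 0.001
    return max_f

print(find_max(lambda x: -(x - 1.0)**2, 0.0, 2.0))   # close to 0.0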

Related

Numpy float mean calculation precision

I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
#True
However:
a.mean()
#3.6998227189299517
The mean is off in the 15th and 16th significant figures.
Can anybody show how this difference accumulates over the ~30K-element mean, and whether there is a way to avoid it?
In case it matters my OS is 64 bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++i)
sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2^p, then ULP(x) = 2^(p−52).) With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is “approximately linear,” but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•10^8 times the ULP of the maximum element value. The ULP is 2^−52 times the greatest power of two not less than m, so that is about 2.7•10^8•2^−52 = 6•10^−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769, because the first necessarily has no error) all round in the same direction by chance is vanishingly small, but I conjecture one might engineer a sequence of values that gets close to that.
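Plugging in the numbers as a quick check of that arithmetic:

n = 32769
ulps = 0.25 * n * (n + 1)      # ¼·n·(n+1) worst-case rounding steps, in ULPs of m
print(ulps)                    # about 2.7e8
print(ulps * 2.0**-52)         # about 6e-8, as a fraction of m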
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2^−24, and double has enough precision so that the sum of up to 2^29 such values is exact.)
The green line is c n √n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.
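A scaled-down reproduction of that experiment (a sketch: it sums with a naive left-to-right loop, since numpy's own sum uses pairwise summation and would show much smaller errors):

import numpy as np

gen = np.random.default_rng(0)
for n in (100, 1000, 10000, 32800):
    a = gen.random(n, dtype=np.float32)
    s32 = np.float32(0.0)
    for v in a:                          # naive float32 accumulation
        s32 += v
    exact = a.astype(np.float64).sum()   # float64 is effectively exact here
    print(n, abs(float(s32) - exact))    # errors grow on the order of n·√n·ULP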
This is due to the inability of the float64 type to store the sum of your float numbers with full precision. In order to get around this problem you need to use a larger data type, of course*. Numpy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.
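If you want a correctly rounded double-precision sum rather than a wider accumulator, the standard library's math.fsum tracks the partial rounding errors exactly (reconstructing the question's array with np.full, since all of its elements were equal):

import math
import numpy as np

a = np.full(32769, 3.699822718929953)
print(a.mean())                 # accumulates rounding error
print(math.fsum(a) / a.size)    # the sum itself is correctly rounded; only
                                # the final division can still round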
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float64).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-precision floats like float128 (if your system supports it), but the errors will always accumulate whenever you sum a large number of floats together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
    if np.all(arr == arr[0]):
        return arr[0]
    else:
        return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
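For example, with the same array of identical values:

import numpy as np

a = np.full(32769, 3.699822718929953)
print(a.mean() == a[0])             # False: bitten by accumulated rounding
print(np.isclose(a.mean(), a[0]))   # True: equal within a tolerance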

How to correct numerical error in numpy sum

I'm trying to return a vector (1-d numpy array) that has a sum of 1.
The key is that it has to equal 1.0 as it represents a percentage.
However, there seem to be a lot of cases where the sum does not equal 1, even when I divide each element by the total.
In other words, the sum of x does not equal 1.0 even when x = x′/sum(x′).
One of the cases where this occurred was the vector below.
x = np.array([0.090179377557090171, 7.4787182000074775e-05, 0.52465058646452456, 1.3594135000013591e-05, 0.38508165466138505])
The sum of this vector, x.sum(), is 1.0000000000000002, whereas the sum of the vector divided by this value is 0.99999999999999978.
From that point on, dividing by the new sum just flips the result between those two values.
What I did was round the elements in the vector to the 10th decimal place (np.round(x, decimals=10)) and then divide by the sum, which results in a sum of exactly 1.0. This works when I know the size of the numerical error.
Unfortunately, that would not be the case in usual circumstances.
I'm wondering if there is a way to correct the numerical error when only the vector is known, so that the sum will equal exactly 1.0.
Edit:
Is floating point math broken?
That question doesn't answer mine, as it only explains why the difference occurs, not how to resolve it.
A bit of a hacky solution:
x[-1] = 0
x[-1] = 1 - x.sum()
Essentially shoves the numerical errors into the last element of the array.
(No rounding beforehand is needed.)
Note: A mathematically simpler solution:
x[-1] = 1.0 - x[:-1].sum()
does not work, due to different behavior of numpy.sum on whole array vs a slice.
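Wrapped up as a function for reuse (a sketch; normalize is a made-up name, and the True comment is for the question's example vector):

import numpy as np

def normalize(x):
    # Scale x so it sums to exactly 1.0, shoving the rounding
    # error into the last element.
    x = np.asarray(x, dtype=np.float64) / np.sum(x)
    x[-1] = 0.0
    x[-1] = 1.0 - x.sum()
    return x

x = np.array([0.090179377557090171, 7.4787182000074775e-05,
              0.52465058646452456, 1.3594135000013591e-05,
              0.38508165466138505])
print(normalize(x).sum() == 1.0)    # True for this example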

Why does numpy.std() use abs()?

I checked the numpy library and found the following definition for the standard deviation in numpy:
std = sqrt(mean(abs(x - x.mean())**2))
Why is the abs() function used? After all, mathematically the square of a number is positive by definition.
So I thought:
abs(x - x.mean())**2 == (x - x.mean())**2
The square of a real number is always positive, but this is not true for complex numbers.
A very simple example: 1j**2 = -1
A more complex (pun intended) example: (3-2j)**2 = 5-12j
From documentation:
Note that, for complex numbers, std takes the absolute value before squaring, so that the result is always real and nonnegative.
Note:
Python uses j for the imaginary unit, while mathematicians use i.
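A quick numerical check (values chosen arbitrarily):

import numpy as np

x = np.array([3 - 2j, 1 + 1j])
m = x.mean()
print(np.std(x))                           # real and nonnegative
print(np.sqrt(np.mean(np.abs(x - m)**2)))  # same value, by the definition above
print(np.sqrt(np.mean((x - m)**2)))        # complex result without the abs()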

How to get the correct accuracy with big integer division in python

I have a big integer below, as max. How come dividing max by 27 is not equivalent to just completely omitting the first factor, 27? Technically they should be equal, but in Python they are not. How can I get the same answer by dividing the max value by 27 in this example?
max = 27*37*47*30*17*6*20*17*21*43*5*49*49*50*20*42*45*1*22*44
no27 = 37*47*30*17*6*20*17*21*43*5*49*49*50*20*42*45*1*22*44
div27 = (max/27)
modno27 = no27%40
moddiv27 = div27%40
The values printed are:
no27 = 35882855955274315680000000
div27 = 3.5882855955274316e+25
modno27 = 0
moddiv27 = 8.0
Assuming this is Python 3, you used true division, which computes float results, but float (based on C double) has representation limitations that Python ints do not (above ~2**53, it can't represent every integer value).
When you know the number is evenly divisible, use // to preserve int-ness. If it's not evenly divisible, you'll round down, e.g. 5 // 3 == 1. If that's unacceptable, you can use divmod to compute both quotient and remainder at once (so no information is lost) or the fractions.Fraction type or decimal.Decimal type (with appropriate precision) to get more precise results in a single result type.
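Applied to the question's numbers (renaming max to total, since max shadows a Python builtin):

total = 27*37*47*30*17*6*20*17*21*43*5*49*49*50*20*42*45*1*22*44
no27 = 37*47*30*17*6*20*17*21*43*5*49*49*50*20*42*45*1*22*44

div27 = total // 27        # floor division stays in exact int arithmetic
print(div27 == no27)       # True
print(div27 % 40)          # 0, matching no27 % 40

q, r = divmod(total, 27)   # quotient and remainder in one call
print(r)                   # 0, confirming the division was exact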

How to generate exponential variate without negative numbers in Python?

I tried to use this code, but it generates negative numbers:
>>> import random
>>> int(random.expovariate(0.28))
5
I thought about using an if statement, but it would affect my randomness and my final result.
From the documentation of random.expovariate:
Exponential distribution. lambd is 1.0 divided by the desired mean. It should be nonzero. (The parameter would be called “lambda”, but that is a reserved word in Python.) Returned values range from 0 to positive infinity if lambd is positive, and from negative infinity to 0 if lambd is negative.
If you want non-negative results, pass a positive lambd.
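For example (seeded only so the run is repeatable):

import random

random.seed(1)
samples = [random.expovariate(0.28) for _ in range(100_000)]
print(min(samples) >= 0)   # True: a positive lambd never produces negatives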
