Compute a list of rounded proportions


Required:
[10,20,-30] -> [1,2,-3]
[19,-14,15] -> [2,-1,2]
[-1.09,-0.92,0.02] -> [-109,-92,2]
[501.6545,-1857.1,897.543] -> [5,-19,9]
The number closest to zero in each input set should be a single digit number in the output. The proportions must be kept approximately constant, rounding errors accepted.
Context: Converting the number of shares of securities to buy from a model to round lots of 100 using the smallest orders possible.
I can brute force this in a non-pythonic way but I'm looking for pointers on Python functions to use. My background is Java.

In Python you would use numpy for such calculations. I would suggest an algorithm like this:
import numpy as np
def process(array):
    # order of magnitude of the element closest to zero (sign ignored)
    order_of_magnitude = np.floor(np.log10(np.min(np.abs(array))))
    return np.round(array*10**(-order_of_magnitude))
Explanation:
Find the order of magnitude of the smallest element in the array (regardless of sign).
Scale every element down (or up) according to this.
Round the result
You will need to install numpy for this, for example with pip or via your Linux distribution.
Turn your lists into numpy arrays like this:
array = np.array(your_list)
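As a quick check, the three steps above can be run against the four inputs from the question. This is only a minimal sketch: the name `scale_to_single_digit` and the `examples` list are illustrative, the inputs are converted with `np.asarray` so plain lists work, and it assumes no element is exactly zero (the logarithm would be undefined).

```python
import numpy as np

def scale_to_single_digit(values):
    """Scale so the element closest to zero becomes a single digit, then round."""
    arr = np.asarray(values, dtype=float)
    # Order of magnitude of the smallest absolute value, e.g. 14 -> 1, 0.02 -> -2
    order = np.floor(np.log10(np.min(np.abs(arr))))
    return np.round(arr * 10.0 ** (-order)).astype(int)

# The four input sets from the question
examples = [
    [10, 20, -30],
    [19, -14, 15],
    [-1.09, -0.92, 0.02],
    [501.6545, -1857.1, 897.543],
]
for values in examples:
    print(values, "->", scale_to_single_digit(values).tolist())
```

The printed lists should match the required outputs listed in the question.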

Ignoring your examples, I implemented the requirement
The number closest to zero in each input set should be a single digit number in the output. The proportions must be kept approximately constant, rounding errors accepted.
This algorithm normalizes the data by the absolute value of the value closest to zero, and multiplies that result by 9 to keep the smallest number one-digit, thus minimizing the subsequent rounding error.
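import numpy as np
def normalize(l):
    l = np.asarray(l, dtype=float)  # accept plain lists as well as arrays
    m = np.min(np.abs(l))  # magnitude of the value closest to zero
    return np.round(l / m * 9).astype(int)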
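To see what this normalization produces, here is a small sketch applied to two of the question's inputs. The function name `normalize_to_nine` is made up for illustration, and the input is assumed to contain no zeros.

```python
import numpy as np

def normalize_to_nine(values):
    """Scale so the value closest to zero maps to +/-9, then round."""
    arr = np.asarray(values, dtype=float)
    smallest = np.min(np.abs(arr))  # assumed non-zero
    return np.round(arr / smallest * 9).astype(int)

print(normalize_to_nine([10, 20, -30]).tolist())  # [9, 18, -27]
print(normalize_to_nine([19, -14, 15]).tolist())  # [12, -9, 10]
```

Note that the element closest to zero always becomes 9 or -9, so the results differ from the question's examples by design.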

Here is the correct answer, based on @user8408080's answer.
import numpy as np
def process(array):
    order_of_magnitude = np.floor(np.log10(np.min(np.abs(array)))).astype(int).item()
    return np.round(np.asarray(array)*10**(-order_of_magnitude)).astype(int)

Related

Why does NumPy's irfft2 of rfft2 lead to a matrix with one less column when the original matrix has an odd second index?

I am confused by the following behavior of rfft2 and irfft2 in NumPy. If I start with a real matrix that is m x n where n is odd, then if I take rfft2 followed by irfft2, I end up with an m x (n-1) matrix. Since irfft2 is the inverse of rfft2, I would have expected to get back a matrix of size m x n. In addition, the values in the matrix are not what I started with -- see output below.
>>> import numpy as np
>>> x = np.ones((4, 3))
>>> ix = np.fft.rfft2(x)
>>> rx = np.fft.irfft2(ix)
>>> rx.shape
(4, 2)
>>> rx
array([[1.5, 1.5],
       [1.5, 1.5],
       [1.5, 1.5],
       [1.5, 1.5]])
I would appreciate any feedback as to whether I am misinterpreting the results somehow or could this even possibly be a bug? I noticed that the same issue does not occur if the first index is odd and also there is no equivalent issue for rfft and irfft.
Note that I am using Python 3.8.8 with Anaconda distribution on an iMac Pro (2017) running macOS Mojave.

In order to make sure that irfft2 is in fact the inverse of rfft2, you need to let it know the exact shape of your input data when reversing the transformation.
Like so:
import numpy as np
x = np.ones((4, 3))
ix = np.fft.rfft2(x)
rx = np.fft.irfft2(ix, x.shape)
This is necessary precisely for the reason you highlight with your question: The way the transformed data (the "spectrum", ix in your example) is represented for real-valued input data (x) depends on whether the number of samples is odd or even in any of the dimensions.
The (i)rfft* family of functions are all tailored to the common use case where the input data is a series of real numbers, i.e. not complex numbers. The discrete Fourier transform of such an input is usually complex-valued, but has a special symmetry: the negative-frequency components are the complex conjugates of the corresponding positive-frequency components. That is, the spectrum contains essentially the same numbers twice, and half the spectrum already contains the information necessary to reconstruct the input data. Which makes sense: The spectrum is a series of complex numbers, which can be represented as two real numbers each, but the input data does not have that "complexity", as it is real-valued.
Then again, "half the spectrum" is not that clear a term when the length of the data (and thus of the full spectrum) may be odd or even. Mathematically, these two cases must be treated slightly differently. Which is why the length of the data is needed when reconstructing the input signal.
As the NumPy documentation of rfft notes for the one-dimensional case:
If n is even, [the last array element of the spectrum] contains the term representing both positive and negative Nyquist frequency (+fs/2 and -fs/2), and must also be purely real. If n is odd, there is no term at fs/2; [the last array element of the spectrum] contains the largest positive frequency (fs/2*(n-1)/n), and is complex in the general case.
And the documentation of irfft further explains:
The correct interpretation of the hermitian input depends on the length of the original data, as given by n. This is because each input shape could correspond to either an odd or even length signal. By default, irfft assumes an even output length which puts the last entry at the Nyquist frequency; aliasing with its symmetric counterpart. By Hermitian symmetry, the value is thus treated as purely real. To avoid losing information, the correct length of the real input must be given.
So an even-length signal is the default. Which is why you only run into this issue for odd lengths of the array dimension. The documentation of irfftn notes specifically that it is the inverse of rfftn only if called like irfftn(rfftn(x), x.shape).
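The ambiguity can be seen directly in a short sketch: an odd-width and an even-width input produce spectra of the same shape, so `irfft2` cannot know which one it was given unless the shape is passed. The array names below are illustrative only.

```python
import numpy as np

x_odd = np.arange(20, dtype=float).reshape(4, 5)   # odd last axis
x_even = np.arange(16, dtype=float).reshape(4, 4)  # even last axis

spec_odd = np.fft.rfft2(x_odd)
spec_even = np.fft.rfft2(x_even)
# Both spectra have the same shape, so the original width is ambiguous
print(spec_odd.shape, spec_even.shape)

default_back = np.fft.irfft2(spec_odd)               # assumes an even width
exact_back = np.fft.irfft2(spec_odd, s=x_odd.shape)  # original shape supplied
print(default_back.shape, exact_back.shape)
print(np.allclose(exact_back, x_odd))
```

Only the call that receives the original shape recovers the odd-width input.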

FFT implementations are generally fastest for power-of-2 lengths, but the issue here is the odd length: irfft2 assumes an even-length last axis by default, so when you round-trip an odd-length array of reals without passing the original shape, you lose some information. If you try your experiment with a (4,4), you'll see that the output exactly matches the input.

Numpy float mean calculation precision

I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off in the 15th and 16th significant figures.
Can anybody show how this difference accumulates over a mean of 30K values, and whether there is a way to avoid it?
In case it matters my OS is 64 bit.

Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++i)
    sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2^p, then ULP(x) = 2^(p−52).) With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is “approximately linear,” but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•10^8 times the ULP of the maximum element value. The ULP is 2^−52 times the greatest power of two not greater than m, so that is at most about 2.7•10^8•2^−52 = 6•10^−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
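The bound can be compared with an actual naive summation in a short sketch. It uses the array size and value from the question. The choice m = 4.0 (the smallest power of two above the value) and the use of `np.spacing` for ULP and `math.fsum` as the reference sum are my assumptions, not part of the derivation above.

```python
import math
import numpy as np

n = 32769
value = 3.699822718929953
a = np.full(n, value)

# Naive left-to-right accumulation, as in the loop above
running = 0.0
for x in a:
    running += float(x)

# math.fsum returns a correctly rounded sum, used here as the reference
exact = math.fsum(a)
observed_error = abs(running - exact)

# Bound from the text: 1/4 * n * (n + 1) * ULP(m), with every a[i] in [0, m)
m = 4.0
bound = 0.25 * n * (n + 1) * np.spacing(m)

print("observed error:", observed_error)
print("bound:         ", bound)
print("within bound:  ", observed_error <= bound)
```

The observed error should come out below the bound, in line with the remark that the bound is not representative of typical error.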
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2^−24, and double has enough precision so that the sum for up to 2^29 such values is exact.)
The green line is c n √n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.

This is due to the inability of the float64 type to store the sum of your floats with full precision. To get around this problem you need to use a larger data type*. Numpy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.

The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float64).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-accuracy floats like float128 (if your system supports it), but errors will always accumulate whenever you're summing a large number of floats together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
    if np.all(arr == arr[0]):
        return arr[0]
    else:
        return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
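As a minimal sketch of both ideas, the snippet below compares with a tolerance via `np.isclose` and, as one possible alternative that the answer does not name, uses the standard library's `math.fsum`, which rounds only once at the end of the summation. The array is rebuilt from the values in the question.

```python
import math
import numpy as np

a = np.full(32769, 3.699822718929953)
mean = a.mean()

# Tolerance-based comparison instead of ==
print(np.isclose(mean, a[0]))

# math.fsum tracks partial sums exactly and rounds only once at the end,
# so the resulting mean lands within one float spacing of the true value
fsum_mean = math.fsum(a) / a.size
print(abs(fsum_mean - a[0]) <= np.spacing(a[0]))
```

`math.fsum` runs in pure Python over the elements, so it is noticeably slower than `a.mean()` on large arrays.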

Numpy Vectorization - Weird issue

I am performing some vectorized calculations using numpy. I was investigating a bug I am having and I ended up with this line:
(vertices[:,:,:,0]+vertices[:,:,:,1]*256)*4
The result was expected to be 100728 for the index vertices[0,0,17], however, I am getting 35192.
When I changed it to 4.0 instead of 4, I ended up getting the correct value of 100728, thus fixing my bug.
I would like to understand why the floating point matters here, especially since I am using python 3.7 and it is multiplication, not even division.
Extra information:
vertices.shape=(203759, 12, 32, 3)
python==3.7
numpy==1.16.1
Edit 1:
vertices type is "numpy.uint8"
vertices[0, 0, 17] => [94, 98, 63]

The issue here is that you are using integers that are too small: the number overflows and wraps around because numpy uses fixed-width integers rather than arbitrary-precision integers like Python's int. Numpy will "promote" the type of a result based on the inputs, but it won't promote the result based on whether an overflow happens or not (the result type is decided before the actual calculation).
In this case, when you multiply vertices[:,:,:,1]*256 (I shall call this A), 256 cannot be held in a uint8, so the result goes to the next higher type, uint16. This allows the result of the multiplication to hold the correct value, because the maximum possible value of any element in vertices is 255, so the largest value possible is 255*256, which fits just fine in a 16-bit uint.
Then you add vertices[:,:,:,0] + A (I shall call this B). If the largest value of A is 255*256, and the largest value of vertices[:,:,:,0] is 255 (again the largest value of a uint8), the largest sum of the two is equal to 2^16-1 (the largest value you can hold in a 16-bit unsigned int). This is still fine right up until you go for your last multiplication.
When you get to B * 4, numpy again has to decide what the return type should be. The integer 4 easily fits in a uint16, so numpy does not promote the type higher still to a uint32 or uint64, because it does not preemptively avoid overflows, as previously described. This results in any multiplication products greater than 2^16-1 being returned modulo 2^16.
If you instead use a floating point number (4. or 4.0), numpy sees this as a "higher" value type that cannot fit inside a uint16, so it promotes the result to floating point, which can accommodate much higher numbers without overflowing.
If you don't want to change the entire vertices array to a larger dtype, you could simply take the result B and convert that before you multiply by 4, as such: B.astype(np.uint64) * 4. This will allow you to hold much larger values without overflowing (though it does not eliminate the problem entirely if the multiplier is ever much larger than 4).
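The wraparound can be reproduced with the vertex from the question. This sketch makes every intermediate dtype explicit (the names `low`, `high`, `wrapped`, and `widened` are illustrative), so it does not rely on how a particular NumPy version promotes Python integers.

```python
import numpy as np

# One vertex from the question, stored as uint8
v = np.array([[94, 98, 63]], dtype=np.uint8)

# Explicit uint16 intermediates, mirroring the promotion described above
low = v[:, 0].astype(np.uint16)
high = v[:, 1].astype(np.uint16)
b = low + high * np.uint16(256)        # 25182, fits in uint16

wrapped = b * np.uint16(4)             # stays uint16, wraps modulo 2**16
widened = b.astype(np.uint32) * np.uint32(4)

print(wrapped.dtype, int(wrapped[0]))  # uint16 35192
print(widened.dtype, int(widened[0]))  # uint32 100728
```

The wrapped value is 100728 - 65536 = 35192, which is exactly the wrong result reported in the question.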

Sum of positive numbers results in a negative number

I am using numpy to do the always fun "count the triangles in an adjacency matrix" task. (Given an nxn Adjacency matrix, how can one compute the number of triangles in the graph (Matlab)?)
Given my matrix A, numpy.matmul() computes the cube of A without problem, but for a large matrix numpy.trace() returns a negative number.
I extracted the diagonal using numpy.diagonal() and summed the entries using the built-in sum() and also using a for loop -- both returned the same negative number as numpy.trace().
An attempt with math.fsum() finally returned the (presumably correct) number 4,088,103,618 -- a seemingly small number for both Python and my 64-bit operating system, especially since the Python documentation claims integer values are unlimited.
Surely this is an overflow or undefined behavior issue, but where does the inconsistency come from? I have performed the test on the following post to successfully validate my system architecture as 64 bit, and therefore numpy should also be a 64 bit package.
Do I have Numpy 32 bit or 64 bit?
To visualize the summation process print statements were added to the for-loop, output appears as follows with an asterisk marking the interesting line.
.
.
.
adding diag val 2013124 to the running total 2140898426 = 2142911550
adding diag val 2043358 to the running total 2142911550 = 2144954908
adding diag val 2035410 to the running total 2144954908 = 2146990318
adding diag val 2000416 to the running total 2146990318 = -2145976562 *
adding diag val 2062276 to the running total -2145976562 = -2143914286
adding diag val 2092890 to the running total -2143914286 = -2141821396
adding diag val 2092854 to the running total -2141821396 = -2139728542
.
.
.
Why would adding 2000416 to 2146990318 create an overflow? The sum is only 2148990734 -- a very small number for python!

Numpy doesn't use the "python types" but rather underlying C types, and you have to specify one that meets your needs. By default, an array of integers will be given the "int_" type, which the docs describe as:
int_ Default integer type (same as C long; normally either int64 or int32)
Hence why you're seeing the overflow. You'll have to specify some other type when you construct your array so that it doesn't overflow.

When you do the addition with scalars you probably get a Warning:
>>> import numpy as np
>>> np.int32(2146990318) + np.int32(2035410)
RuntimeWarning: overflow encountered in long_scalars
-2145941568
So yes, it is overflow related. The maximum 32-bit integer is 2,147,483,647!
To make sure your arrays support a bigger range of values you could cast the array (I assume you operate on an array) to int64 (or a floating point value):
array = array.astype('int64') # makes sure the values are 64 bit integers
or when creating the array:
import numpy as np
array = np.array(something, dtype=np.int64)
NumPy uses fixed-size integers, not arbitrary-precision integers. By default it's either a 32-bit integer or a 64-bit integer; which one depends on your system. For example, Windows uses int32 even when Python + NumPy is compiled for 64-bit.
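The step marked with an asterisk in the question can be reproduced in a small sketch by forcing the accumulator type. The two numbers are taken from the printed trace; the `dtype` argument of `sum` selects the accumulator, which is the assumption that makes the result independent of the platform's default integer.

```python
import numpy as np

# Running total and next diagonal value from the marked line in the question
diag = np.array([2146990318, 2000416], dtype=np.int32)

total32 = diag.sum(dtype=np.int32)   # 32-bit accumulator: wraps around
total64 = diag.sum(dtype=np.int64)   # 64-bit accumulator: no overflow

print(int(total32))  # -2145976562
print(int(total64))  # 2148990734
```

Passing `dtype=np.int64` to `np.trace` or `np.sum` in the same way should avoid the negative result without converting the whole matrix first.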

Does this truly generate a random floating point number? (Python)

For an introduction to Python course, I'm looking at generating a random floating point number in Python, and I have seen a standard recommended code of
import random
lower = 5
upper = 10
range_width = upper - lower
x = random.random() * range_width + lower
for a random floating point from 5 up to but not including 10.
It seems to me that the same effect could be achieved by:
import random
x = random.randrange(5, 10) + random.random()
Since that would give an integer of 5, 6, 7, 8, or 9, and then tack on a decimal to it.
The question I have is would this second code still give a fully even probability distribution, or would it not keep the full randomness of the first version?

According to the documentation, yes: random() does produce a uniform distribution.
random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator.
So both code examples should be fine. To shorten your code, you can equally do:
random.uniform(5, 10)
Note that uniform(a, b) is simply a + (b - a) * random() so the same as your first example.
The second example depends on the version of Python you're using.
Prior to 3.2, randrange() could produce slightly uneven distributions.
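The equivalence between uniform(a, b) and the first example can be checked by reseeding the generator, as in this sketch. It relies on the formula quoted above, which is how CPython implements `uniform`; the seed value is arbitrary.

```python
import random

random.seed(12345)
via_uniform = random.uniform(5, 10)

random.seed(12345)  # same seed, so random() returns the same value again
via_formula = 5 + (10 - 5) * random.random()

print(via_uniform == via_formula)
print(5 <= via_uniform <= 10)
```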

There is a difference. Your second method is theoretically superior, although in practice it only matters for large ranges. Indeed, both methods will give you a uniform distribution. But only the second method can return all values in the range that are representable as a floating point number.
Since your range is so small, there is no appreciable difference. But still there is a difference, which you can see by considering a larger range. If you take a random real number between 0 and 1, you get a floating-point representation with a given number of bits. Now suppose your range is, say, in the order of 2**32. By multiplying the original random number by this range, you lose 32 bits of precision in the result. Put differently, there will be gaps between the values that this method can return. The gaps are still there when you multiply by 4: You have lost the two least significant bits of the original random number.

The two methods can give different results, but you'll only notice the difference in fairly extreme situations (with very wide ranges). For instance, if you generate random numbers between 0 and 2/sys.float_info.epsilon (9007199254740992.0, or a little more than 9 quadrillion), you'll notice that the version using multiplication will never give you any floats with fractional values. If you increase the maximum bound to 4/sys.float_info.epsilon, you won't get any odd integers, only even ones. That's because the 64-bit floating point type Python uses doesn't have enough precision to represent all integers at the upper end of that range, and it's trying to maintain a uniform distribution (so it omits small odd integers and fractional values even though those can be represented in parts of the range).
The second version of the calculation will give extra precision to the smaller random numbers generated. For instance, if you're generating numbers between 0 and 2/sys.float_info.epsilon and the randrange call returned 0, you can use the full precision of the random call to add a fractional part to the number. On the other hand if the randrange returned the largest number in the range (2/sys.float_info.epsilon - 1), very little of the precision of the fraction would be used (the number will round to the nearest integer without any fractional part remaining).
Adding a fractional value also can't help you deal with ranges that are too large for every integer to be represented. If randrange returns only even numbers, adding a fraction usually won't make odd numbers appear (it can in some parts of the range, but not for others, and the distribution may be very uneven). Even for ranges where all integers can be represented, the odds of a specific floating point number appearing will not be entirely uniform, since the smaller numbers can be more precisely represented. Large but imprecise numbers will be more common than smaller but more precisely represented ones.
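The gaps produced by the multiplication method can be observed directly. This sketch assumes CPython's `random.random()`, which returns multiples of 2**-53; the sample count of 10,000 is arbitrary.

```python
import random
import sys

width = 2 / sys.float_info.epsilon      # 2**53 = 9007199254740992.0
samples = [random.random() * width for _ in range(10_000)]
# random() returns multiples of 2**-53, so every product is a whole number
print(all(x == int(x) for x in samples))

wider = 4 / sys.float_info.epsilon      # 2**54
even_only = [random.random() * wider for _ in range(10_000)]
# Doubling the width doubles the gap: only even integers can appear
print(all(int(x) % 2 == 0 for x in even_only))
```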
