numpy.sum() giving strange results on large arrays

numpy.sum() giving strange results on large arrays - python

I seem to have found a pitfall with using .sum() on numpy arrays but I'm unable to find an explanation. Essentially, if I try to sum a large array then I start getting nonsensical answers but this happens silently and I can't make sense of the output well enough to Google the cause.
For example, this works exactly as expected:
a = sum(xrange(2000))
print('a is {}'.format(a))
b = np.arange(2000).sum()
print('b is {}'.format(b))
Giving the same output for both:
a is 1999000
b is 1999000
However, this does not work:
c = sum(xrange(200000))
print('c is {}'.format(c))
d = np.arange(200000).sum()
print('d is {}'.format(d))
Giving the following output:
c is 19999900000
d is -1474936480
And on an even larger array, it's possible to get back a positive result. This is more insidious because I might not identify that something unusual was happening at all. For example this:
e = sum(xrange(100000000))
print('e is {}'.format(e))
f = np.arange(100000000).sum()
print('f is {}'.format(f))
Gives this:
e is 4999999950000000
f is 887459712
I guessed that this was to do with data types and indeed even using the python float seems to fix the problem:
e = sum(xrange(100000000))
print('e is {}'.format(e))
f = np.arange(100000000, dtype=float).sum()
print('f is {}'.format(f))
Giving:
e is 4999999950000000
f is 4.99999995e+15
I have no background in Comp. Sci. and found myself stuck (perhaps this is a dupe). Things I've tried:
numpy arrays have a fixed size. Nope; this seems to show I should hit a MemoryError first.
I might somehow have a 32-bit installation (probably not relevant); nope, I followed this and confirmed I have 64-bit.
Other examples of weird sum behaviour; nope (?) I found this but I can't see how it applies.
Can someone please explain briefly what I'm missing and tell me what I need to read up on? Also, other than remembering to define a dtype each time, is there a way to stop this happening or give a warning?
Possibly relevant:
Windows 7
numpy 1.11.3
Running out of Enthought Canopy on Python 2.7.9

On Windows (on 64-bit system too) the default integer NumPy uses if you convert from Python ints is 32-bit. On Linux and Mac it is 64-bit.
Specify a 64-bit integer and it will work:
d = np.arange(200000, dtype=np.int64).sum()
print('d is {}'.format(d))
Output:
c is 19999900000
d is 19999900000
While not most elegant, you can do some monkey patching, using functools.partial:
from functools import partial
np.arange = partial(np.arange, dtype=np.int64)
From now on np.arange works with 64-bit integers as default.

This is clearly numpy's integer type overflowing 32-bits. Normally you can configure numpy to fail in such situations using np.seterr:
>>> import numpy as np
>>> np.seterr(over='raise')
{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
>>> np.int8(127) + np.int8(2)
FloatingPointError: overflow encountered in byte_scalars
However, sum is explicitly documented with the behaviour "No error is raised on overflow", so you might be out of luck here. Using numpy is often a trade-off of performance for convenience!
You can however manually specify the dtype for the accumulator, like this:
>>> a = np.ones(129)
>>> a.sum(dtype=np.int8) # will overflow
-127
>>> a.sum(dtype=np.int64) # no overflow
129
Watch ticket #593, because this is an open issue and it might be fixed by numpy devs sometime.

I'm not a numpy expert, but can reproduce your arange(200000) result in pure Python:
>>> s = 0
>>> for i in range(200000):
... s += i
... s &= 0xffffffff
>>> s
2820030816
>>> s.bit_length()
32
>>> s - 2**32 # adjust for that "the sign bit" is set
-1474936480
In other words, the result you're seeing is what I expect if numpy is doing its arithmetic on signed 2's-complement 32-bit integers.
Since I'm not a numpy expert, I can't suggest a good approach to never getting surprised (I would have left this as a comment, but I couldn't show nicely formatted code then).

Numpy's default integer type is the same as the C long type. Now, this isn't guaranteed to be 64-bits on a 64-bit platform. In fact, on Windows, long is always 32-bits.
As a result, the numpy sum is overflowing the value and looping back around.
Unfortunately, as far as I know, there is no way to change the default dtype. You'll have to specify it as np.int64 every time.
You could try to create your own arange:
def arange(*args, **kw):
return np.arange(dtype=np.int64, *args, **kw)
and then use that version instead of numpy's.
EDIT: If you want to flag this, you could just put something like this in the top of your code:
assert np.array(0).dtype.name != 'int32', 'This needs to be run with 64-bit integers!'

Related

Numpy power returns negative value

I want to plot the Poisson distribution and get negative probabilities for lambda >= 9.
This code generates plots for different lambdas:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import factorial
for lambda_val in range(1, 12, 2):
plt.figure()
k = np.arange(0,20)
y = np.power(lambda_val, k)*np.exp(-lambda_val)/factorial(k)
plt.bar(k, y)
plt.title('lambda = ' + str(lambda_val))
plt.xlabel('k')
plt.ylabel('probability')
plt.ylim([-0.1, 0.4])
plt.grid()
plt.show()
Please see these two plots:
Lambda = 5 looks fine in my opinion.
Lambda = 9 not.
I'm quite sure it has something to do with np.power because
np.power(11, 9)
gives me: -1937019605, whereas
11**9
gives me: 2357947691 (same in WolframAlpha).
But if I avoid np.power and use
y = (lambda_val**k)*math.exp(-lambda_val)/factorial(k)
for calculating the probability, I get negative values as well. I am totally confused. Can anybody explain me the effect or what am I doing wrong? Thanks in advance. :)

Your problem is due to 32-bit integer overflows. This happens because Numpy is sometimes compiled with 32-bit integer even though the platform (OS + processor) is a 64-bit one. There is an overflow because Numpy automatically transform the unbounded integer of the Python interpreter to the native np.int_ type. You can check if this type is a 64-bit one using np.int_ is np.int64. AFAIK, the default Numpy binary package compiled for Windows available on Python Pip use 32-bit integers and the one of the Linux packages use 64-bit integers (assuming you are on a 64-bit platform).
The issue can be easily reproduced using:
In [546]: np.power(np.int32(11), np.int32(9))
Out[546]: -1937019605
It can also be solved using:
In [547]: np.power(np.int64(11), np.int64(9))
Out[547]: 2357947691
In the second expression, you use k which is of type np.int_ by default and this is certainly why you get the same problem. Hopefully, you can specify to Numpy that the integer should be bigger. Note that Numpy have some implicit rule to avoid overflow but this is hard to avoid them in all case without strongly impacting performance. Here is a fixed formula:
k = np.arange(0, 20, dtype=np.int64)
y = np.power(lambda_val, k) * np.exp(-lambda_val) / factorial(k)
The rule of thumb is to be very careful about implicit conversions when you get unexpected results.

datetime.timedelta(x,y) returns TypeError on CoCalc.com but works elsewhere -- Why?

My code works on onlinegdb.com but not on CoCalc.com.
import datetime
slowduration = datetime.timedelta(0,1)
print(slowduration)
Returns
TypeError: unsupported type for timedelta seconds component: sage.rings.integer.Integer
It isn't clear to me if this is a feature or a bug.

To complement #kcrisman's answer and the "int(0), int(1)" trick...
Two other options if one wants to stick to the Sage kernel are
(1) disable the preparser with preparser(False),
(2) append r (for "raw") to the integers, eg datetime.timedelta(0r, 1r).
See also similar questions and answers around Sage's preparsing of floats and integers:
(a) Stack Overflow question 40578746: Sage and NumPy
(b) Stack Overflow question 28426920: Unsized object with numpy.random.permutation
(c) Stack Overflow question 16289354: Why is range(0, log(len(list), 2)) slow?
Finally, note that code can be loaded into Sage from external files using either:
load('/path/to/filename.py')
load('/path/to/filename.sage')
where .sage files will get "Sage-preparsed" while .py files will not.
This gives a third option to bypass the preparser: load code from a .pyfile.

If anyone else has a problem like this - It turns out that I was using the Sage math kernel and not the Python math kernel. This website offers something like 15 different kernels.

Jacob's self-answer is correct; here are a few more details.
In SageMath there is something called a preparser which interprets things so that integers are mathematical integers, not Python ints. So for example:
sage: preparse('1+1')
'Integer(1)+Integer(1)'
There is a lot more that involves - try preparse('f(x)=x^2') for some real fun. But yes, it's a feature.
To fix your problem within the Sage kernel, though, you could just do this:
import datetime
slowduration = datetime.timedelta(int(0),int(1))
print(slowduration)
to get 0:00:01 as your answer.

Hypothesis strategy generating inf when specifically asked not to

from functools import partial
import hypothesis as h
import hypothesis.strategies as hs
import hypothesis.extra.numpy as hnp
import numpy as np
floats_notnull = partial(hs.floats, allow_nan=False, allow_infinity=False)
complex_notnull = partial(hs.complex_numbers, allow_nan=False, allow_infinity=False)
data_strategy_real = hnp.arrays(
np.float64,
hs.tuples(hs.integers(min_value=2, max_value=50),
hs.integers(min_value=2, max_value=5)),
floats_notnull()
)
data_strategy_complex = hnp.arrays(
np.complex64,
hs.tuples(hs.integers(min_value=2, max_value=50), hs.just(1)),
complex_notnull()
)
data_strategy = hs.one_of(data_strategy_real, data_strategy_complex)
If you run data_strategy.example() a couple times, you'll notice that some of the values in the result have infinite real or imaginary parts. My intention here was to specifically disallow infinite or NaN parts.
What am I doing wrong?
Update: if I use
data_strategy = hs.lists(complex_notnull, min_size=2, max_size=50)
and convert that to an array inside my test, the problem appears to go away. Are the complex numbers overflowing? I'm not getting the usual deprecation warning about overflow from Hypothesis.
And if I use
data_strategy = data_strategy_real
no infs appear.

The complex64 type is too small and it's overflowing. Somehow Hypothesis is failing to catch this.
Yep, the root cause of this problem is that you're generating 64-bit finite floats, then casting them to 32-bit (because complex64 is a pair of 32-bit floats). You can fix that with the width=32 argument to floats():
floats_notnull_32 = partial(hs.floats, allow_nan=False, allow_infinity=False, width=32)
And you're not getting the usual overflow check because it's only implemented for floats and integers at the moment. I've opened (edit: and fixed) issue #1591 to check complex and string types too.

The complex64 type is too small and it's overflowing. Somehow Hypothesis is failing to catch this.
Switching to complex128 fixed the problem for now.

Optimising python function with numba

I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
#nb.jit(nopython=True)
def edge_profile(input):
pos = input[:,:3]
val = input[:,3]
centre = np.mean(pos,axis=0).astype(np.int32)
diff = np.absolute(pos-centre).sum(axis=1)
cell_edge = np.zeros(3)
for i in range(3):
idx = np.where(diff==i+1)[0]
idy = np.where(val[idx]==1)[0]
cell_edge[i] = len(idy)
return cell_edge.astype(np.int32)
However this produces an extremely large error message which I have unable to use to diagnose the problem. I have tried specifying the input types as follows:
#nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
however this produces an equally large error message.
I fell that I am probably using some function/feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.

Numba should work ok so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized. So its unlikely you will see a speed up even if you did get it to work. You haven't mentioned what your OS is. Under ubuntu 14.04 you can get it to work through some steps outlined here.

numpy float: 10x slower than builtin in arithmetic operations?

I am getting really weird timings for the following code:
import numpy as np
s = 0
for i in range(10000000):
s += np.float64(1) # replace with np.float32 and built-in float
built-in float: 4.9 s
float64: 10.5 s
float32: 45.0 s
Why is float64 twice slower than float? And why is float32 5 times slower than float64?
Is there any way to avoid the penalty of using np.float64, and have numpy functions return built-in float instead of float64?
I found that using numpy.float64 is much slower than Python's float, and numpy.float32 is even slower (even though I'm on a 32-bit machine).
numpy.float32 on my 32-bit machine. Therefore, every time I use various numpy functions such as numpy.random.uniform, I convert the result to float32 (so that further operations would be performed at 32-bit precision).
Is there any way to set a single variable somewhere in the program or in the command line, and make all numpy functions return float32 instead of float64?
EDIT #1:
numpy.float64 is 10 times slower than float in arithmetic calculations. It's so bad that even converting to float and back before the calculations makes the program run 3 times faster. Why? Is there anything I can do to fix it?
I want to emphasize that my timings are not due to any of the following:
the function calls
the conversion between numpy and python float
the creation of objects
I updated my code to make it clearer where the problem lies. With the new code, it would seem I see a ten-fold performance hit from using numpy data types:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 34.56s
float32: 35.11s
float: 3.53s
Just for the hell of it, I also tried:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
s = np.float64(1)
for i in range(10000000):
s = float(s)
s = (s + 8) * s % 2399232
s = np.float64(s)
print(s)
print('Runtime:', datetime.now() - START_TIME)
The execution time is 13.28 s; it's actually 3 times faster to convert the float64 to float and back than to use it as is. Still, the conversion takes its toll, so overall it's more than 3 times slower compared to the pure-python float.
My machine is:
Intel Core 2 Duo T9300 (2.5GHz)
WinXP Professional (32-bit)
ActiveState Python 3.1.3.5
Numpy 1.5.1
EDIT #2:
Thank you for the answers, they help me understand how to deal with this problem.
But I still would like to know the precise reason (based on the source code perhaps) why the code below runs 10 times slow with float64 than with float.
EDIT #3:
I rerun the code under the Windows 7 x64 (Intel Core i7 930 # 3.8GHz).
Again, the code is:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 16.1s
float32: 16.1s
float: 3.2s
Now both np floats (either 64 or 32) are 5 times slower than the built-in float. Still, a significant difference. I'm trying to figure out where it comes from.
END OF EDITS

CPython floats are allocated in chunks
The key problem with comparing numpy scalar allocations to the float type is that CPython always allocates the memory for float and int objects in blocks of size N.
Internally, CPython maintains a linked list of blocks each large enough to hold N float objects. When you call float(1) CPython checks if there is space available in the current block; if not it allocates a new block. Once it has space in the current block it simply initializes that space and returns a pointer to it.
On my machine each block can hold 41 float objects, so there is some overhead for the first float(1) call but the next 40 run much faster as the memory is allocated and ready.
Slow numpy.float32 vs. numpy.float64
It appears that numpy has 2 paths it can take when creating a scalar type: fast and slow. This depends on whether the scalar type has a Python base class to which it can defer for argument conversion.
For some reason numpy.float32 is hard-coded to take the slower path (defined by the _WORK0 macro), while numpy.float64 gets a chance to take the faster path (defined by the _WORK1 macro). Note that scalartypes.c.src is a template which generates scalartypes.c at build time.
You can visualize this in Cachegrind. I've included screen captures showing how many more calls are made to construct a float32 vs. float64:
float64 takes the fast path
float32 takes the slow path
Updated - Which type takes the slow/fast path may depend on whether the OS is 32-bit vs 64-bit. On my test system, Ubuntu Lucid 64-bit, the float64 type is 10 times faster than float32.

Operating with Python objects in a heavy loop like that, whether they are float, np.float32, is always slow. NumPy is fast for operations on vectors and matrices, because all of the operations are performed on big chunks of data by parts of the library written in C, and not by the Python interpreter. Code run in the interpreter and/or using Python objects is always slow, and using non-native types makes it even slower. That's to be expected.
If your app is slow and you need to optimize it, you should try either converting your code to a vector solution that uses NumPy directly, and is fast, or you could use tools such as Cython to create a fast implementation of the loop in C.

Perhaps, that is why you should use Numpy directly instead of using loops.
s1 = np.ones(10000000, dtype=np.float)
s2 = np.ones(10000000, dtype=np.float32)
s3 = np.ones(10000000, dtype=np.float64)
np.sum(s1) <-- 17.3 ms
np.sum(s2) <-- 15.8 ms
np.sum(s3) <-- 17.3 ms

The answer is quite simple: the memory allocation might be part of it, but the biggest problem is that arithmetic operations for numpy scalars is done using "ufuncs" which are meant to be fast for several hundred values not just 1. There is some overhead in choosing the correct function to call and setting up the loops. Overhead which is un-necessary for scalars.
It was easier to just have the scalars be converted to 0-d arrays and then passed to the corresponding numpy ufunc then write separate calculation methods for each of the many different scalar types that NumPy supports.
The intent was that optimized versions of the scalar math would be added to the type-objects in C. This could still happen, but it never has happened because no-one has been motivated enough to do it. Possibly because the work-around is to convert numpy scalars to Python scalars which do have optimized arithmetic.

Summary
If an arithmetic expression contains both numpy and built-in numbers, Python arithmetics works slower. Avoiding this conversion removes almost all of the performance degradation I reported.
Details
Note that in my original code:
s = np.float64(1)
for i in range(10000000):
s = (s + 8) * s % 2399232
the types float and numpy.float64 are mixed up in one expression. Perhaps Python had to convert them all to one type?
s = np.float64(1)
for i in range(10000000):
s = (s + np.float64(8)) * s % np.float64(2399232)
If the runtime is unchanged (rather than increased), it would suggest that's what Python indeed was doing under the hood, explaining the performance drag.
Actually, the runtime fell by 1.5 times! How is it possible? Isn't the worst thing that Python could possibly have to do was these two conversions?
I don't really know. Perhaps Python had to dynamically check what needs to be converted into what, which takes time, and being told what precise conversions to perform makes it faster. Perhaps, some entirely different mechanism is used for arithmetics (which doesn't involve conversions at all), and it happens to be super-slow on mismatched types. Reading numpy source code might help, but it's beyond my skill.
Anyway, now we can obviously speed things up more by moving the conversions out of the loop:
q = np.float64(8)
r = np.float64(2399232)
for i in range(10000000):
s = (s + q) * s % r
As expected, the runtime is reduced substantially: by another 2.3 times.
To be fair, we now need to change the float version slightly, by moving the literal constants out of the loop. This results in a tiny (10%) slowdown.
Accounting for all these changes, the np.float64 version of the code is now only 30% slower than the equivalent float version; the ridiculous 5-fold performance hit is largely gone.
Why do we still see the 30% delay? numpy.float64 numbers take the same amount of space as float, so that won't be the reason. Perhaps the resolution of the arithmetic operators takes longer for user-defined types. Certainly not a major concern.

If you're after fast scalar arithmetic, you should be looking at libraries like gmpy rather than numpy (as others have noted, the latter is optimised more for vector operations rather than scalar ones).

I can confirm the results also. I tried to see what it would look like using all numpy types, and the difference persists. So then, my tests were:
def testStandard(length=100000):
s = 1.0
addend = 8.0
modulo = 2399232.0
startTime = datetime.now()
for i in xrange(length):
s = (s + addend) * s % modulo
return datetime.now() - startTime
def testNumpy(length=100000):
s = np.float64(1.0)
addend = np.float64(8.0)
modulo = np.float64(2399232.0)
startTime = datetime.now()
for i in xrange(length):
s = (s + addend) * s % modulo
return datetime.now() - startTime
So at this point, the numpy types are all interacting with each other, but the 10x difference persists (2 sec vs 0.2 sec).
If I had to guess, I would say that there are two possible reasons for why the default float types are much faster. The first possibility is that python performs significant optimizations under the hood for dealing with certain numeric operations or looping in general (e.g. loop unrolling). The second possibility is that the numpy types involves an extra layer of abstraction (i.e. having to read from an address). To look into the effects of each, I did a few extra checks.
One difference could be the result of python having to take extra steps to resolve the float64 types. Unlike compiled languages that generate efficient tables, python 2.6 (and maybe 3) has a significant cost for resolving things that you'd generally think of as free. Even a simple X.a resolution has to resolve the dot operator EVERY time it is called. (Which is why if you have a loop that calls instance.function() you're better off having a variable "function = instance.function" declared outside the loop).
From my understanding, when you use python standard operators, these are fairly similar to using the ones from "import operator." If you substitute add, mul, and mod in for your +, *, and %, you see a static performance hit of about 0.5 sec versus the standard operators (to both cases). This means that by wrapping the operators, the standard python float operations get 3x slower. If you do one further, using operator.add and those variants adds on 0.7 sec approximately (over 1m trials, starting with 2 sec and 0.2 sec respectively). That's verging on the 5x slowness. So basically, if each of these issues happens twice, you're basically at the 10x slower point.
So let's assume we're the python interpreter for a moment. Case 1, we do an operation on native types, let's say a+b. Under the hood, we can check the types of a and b and dispatch our addition to python's optimized code. Case 2, we have an operation of two other types (also a+b). Under the hood, we check if they're native types (they're not). We move on to the 'else' case. The else case sends us to something like a.add(b). a.add can then do a dispatch to numpy's optimized code. So at this point we have had additional overhead of an extra branch, one '.' get slots property, and a function call. And we've only got into the addition operation. We then have to use the result to create a new float64 (or alter an existing float64). Meanwhile, the python native code probably cheats by treating its types specially to avoid this sort of overhead.
Based on the above examination of the costliness of python function calls and scoping overhead, it would be pretty easy for numpy to incur a 9x penalty just getting to and from its c math functions. I can entirely imagine this process taking many times longer than a simple math operation call. For each operation, the numpy library will have to wade through layers of python to get to its C implementation.
So in my opinion, the reason for this is probably captured in this effect:
length = 10000000
class A():
X = 10
startTime = datetime.now()
for i in xrange(length):
x = A.X
print "Long Way", datetime.now() - startTime
startTime = datetime.now()
y = A.X
for i in xrange(length):
x = y
print "Short Way", datetime.now() - startTime
This simple case shows a difference of 0.2 sec vs 0.14 sec (short way faster, obviously). I think what you're seeing is mainly just a bunch of those issues adding up.
To avoid this, I can think of a a couple possible solutions that mainly echo what has been said. The first solution is to try to keep your evaluations inside NumPy as much as possible, as Selinap said. A large amount of the losses are probably due to the interfacing. I would look into ways to dispatch your job into numpy or some other numeric library optimized in C (gmpy has been mentioned). The goal should be to push as much into C at the same time as possible, then get the result(s) back. You want to put in big jobs, not lots of small jobs.
The second solution, of course, would be to do more of your intermediate and small operations in python if you can. Clearly, using the native objects are going to be faster. They're going to be the first options on all the branch statements and will always have the shortest path to C code. Unless you have a specific need for fixed precision calculation or other issues with the default operators, I don't see why one wouldn't use the straight python functions for many things.

Really strange...I confirm the results in Ubuntu 11.04 32bit, python 2.7.1, numpy 1.5.1 (official packages):
import numpy as np
def testfloat():
s = 0
for i in range(10000000):
s+= float(1)
def testfloat32():
s = 0
for i in range(10000000):
s+= np.float32(1)
def testfloat64():
s = 0
for i in range(10000000):
s+= np.float64(1)
%time testfloat()
CPU times: user 4.66 s, sys: 0.06 s, total: 4.73 s
Wall time: 4.74 s
%time testfloat64()
CPU times: user 11.43 s, sys: 0.07 s, total: 11.50 s
Wall time: 11.57 s
%time testfloat32()
CPU times: user 47.99 s, sys: 0.09 s, total: 48.08 s
Wall time: 48.23 s
I don't see why float32 should be 5 times slower that float64.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.