Efficient Python array with 100 million zeros? - python

What is an efficient way to initialize and access elements of a large array in Python?
I want to create an array in Python with 100 million entries, unsigned 4-byte integers, initialized to zero. I want fast array access, preferably with contiguous memory.
Strangely, NumPy arrays seem to be performing very slowly. Are there alternatives I can try?
There is the array.array module, but I don't see a method to efficiently allocate a block of 100 million entries.
Responses to comments:
I cannot use a sparse array. It will be too slow for this algorithm because the array becomes dense very quickly.
I know Python is interpreted, but surely there is a way to do fast array operations?
I did some profiling, and I get about 160K array accesses (looking up or updating an element by index) per second with NumPy. This seems very slow.

I have done some profiling, and the results are completely counterintuitive.
For simple array access operations, numpy and array.array are roughly 10x slower than native Python lists.
Note that for array access, I am doing operations of the form:
a[i] += 1
Profiles:
[0] * 20000000
Access: 2.3M / sec
Initialization: 0.8s
numpy.zeros(shape=(20000000,), dtype=numpy.int32)
Access: 160K/sec
Initialization: 0.2s
array.array('L', [0] * 20000000)
Access: 175K/sec
Initialization: 2.0s
array.array('L', (0 for i in range(20000000)))
Access: 175K/sec, presumably, based upon the profile for the other array.array
Initialization: 6.7s
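For reference, a rough way to reproduce this kind of measurement is with timeit; the index and iteration count below are arbitrary choices, and the absolute numbers will vary by machine:
import timeit

setups = {
    "list":  "a = [0] * 20000000",
    "numpy": "import numpy; a = numpy.zeros(20000000, dtype=numpy.int32)",
}
stmt = "a[123456] += 1"   # single-element read-modify-write, as in the question

for name, setup in setups.items():
    elapsed = timeit.timeit(stmt, setup=setup, number=1000000)
    print("%s: %.0f accesses/sec" % (name, 1000000 / elapsed))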

Just a reminder of how Python's integers work: if you allocate a list by saying
a = [0] * K
you need the memory for the list (sizeof(PyListObject) + K * sizeof(PyObject*)) and the memory for the single integer object 0. As long as the numbers in the list stay below the magic number V that Python uses for caching, you are fine because those are shared, i.e. any name that points to a number n < V points to the exact same object. You can find this value by using the following snippet:
>>> i = 0
>>> j = 0
>>> while i is j:
...     i += 1
...     j += 1
...
>>> i  # on my system!
257
This means that as soon as the counts go above this number, the memory you need is sizeof(PyListObject) + K * sizeof(PyObject*) + d * sizeof(PyIntObject), where d ≤ K is the number of integers above V (== 256). On a 64 bit system, sizeof(PyIntObject) == 24 and sizeof(PyObject*) == 8, i.e. the worst-case memory consumption is 3,200,000,000 bytes.
With numpy.ndarray or array.array, memory consumption is constant after initialization, but you pay for the wrapper objects that are created transparently, as Thomas Wouters said. You should probably think about converting the update code (which accesses and increments the positions in the array) to C, either with Cython or scipy.weave.

Try this:
x = [0] * 100000000
It takes just a few seconds to execute on my machine, and access is close to instant.

If you are not able to vectorize your calculations, Python/Numpy will be slow. Numpy is fast because vectorized calculations occur at a lower level than Python. The core numpy functions are all written in C or Fortran. Hence sum(a) is not a Python loop with many accesses, it's a single low-level C call.
Numpy's Performance Python demo page has a good example with different options. You can easily get a 100x increase by using a lower-level compiled language, Cython, or vectorized functions where feasible. This blog post shows a 43-fold increase using Cython for a numpy use case.
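If the positions being updated can be collected into an index array first (an assumption about the algorithm, not something the question states), one vectorized option is np.add.at, which applies all the increments in a single C-level call and handles repeated indices correctly. A rough sketch:
import numpy as np

a = np.zeros(20000000, dtype=np.int32)
idx = np.random.randint(0, a.size, size=100000)   # hypothetical batch of update positions

# Python-level loop: one interpreted iteration (and one boxed integer) per update
for i in idx:
    a[i] += 1

# vectorized equivalent: one call into C, correct even when idx repeats an index
a[:] = 0
np.add.at(a, idx, 1)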

It's unlikely you'll find anything faster than numpy's array. The array itself is implemented about as efficiently as it would be in, say, C (and is basically the same as array.array, just with more functionality).
If you want to speed up your code, you'll have to do it by doing just that. Even though the array is implemented efficiently, accessing it from Python code has certain overhead; for example, indexing the array produces integer objects, which have to be created on the fly. numpy offers a number of operations implemented efficiently in C, but without seeing the actual code that isn't performing as well as you want it's hard to make any specific suggestions.

For fast creation, use the array module.
Using the array module is ~5 times faster for creation, but about twice as slow for accessing elements compared to a normal list:
# Create array
python -m timeit -s "from array import array" "a = array('I', '\x00' * 100000000)"
10 loops, best of 3: 204 msec per loop
# Access array
python -m timeit -s "from array import array; a = array('I', '\x00' * 100000000)" "a[4975563]"
10000000 loops, best of 3: 0.0902 usec per loop
# Create list
python -m timeit "a = [0] * 100000000"
10 loops, best of 3: 949 msec per loop
# Access list
python -m timeit -s "a = [0] * 100000000" "a[4975563]"
10000000 loops, best of 3: 0.0417 usec per loop

In addition to the other excellent solutions, another way is to use a dict instead of an array (elements which exist are non-zero, otherwise they're zero). Lookup time is O(1).
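A minimal sketch of the dict approach, using a plain dict and .get so that reading an unset position does not store anything (a defaultdict would insert a key on read):
counts = {}                 # only positions that were ever written consume memory

def get_count(i):
    return counts.get(i, 0)   # unset positions read as zero

def increment(i):
    counts[i] = counts.get(i, 0) + 1

increment(4975563)
print(get_count(4975563))   # 1
print(get_count(12345))     # 0, and still nothing stored for this index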
You might also check that your application stays resident in RAM rather than swapping. It's only 381 MB, but the system may not be giving you all of it for whatever reason.
However, there are also some really fast sparse matrix implementations (SciPy and ndsparse). They are written in low-level C and might also be good.

If the access speed of array.array is acceptable for your application, compact storage is most important, you want to use only standard modules (no NumPy dependency), and you are on a platform that has /dev/zero, the following may be of interest to you. It initialises array.array about 27 times faster than array.array('L', [0]*size):
import array

size = 100000000           # number of 'L' entries to allocate
myarray = array.array('L')
f = open('/dev/zero', 'rb')
myarray.fromfile(f, size)  # read 'size' zero-valued items straight from /dev/zero
f.close()
In How to initialise an integer array.array object with zeros in Python, I'm looking for an even better way.
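If /dev/zero is not available, one portable alternative is to multiply a one-element array; the repetition happens at the C level (the same trick is discussed for array('d', ...) further down). A sketch, with the size taken from the question:
import array

size = 100000000                        # the 100 million entries from the question

# repeating a one-element array is done entirely in C, no temporary Python list
myarray = array.array('L', [0]) * size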

I would simply create a custom data type that doesn't initialize ANY values.
If you want to read an index position that has NOT been initialized, you return zeroes. Still, do not initialize any storage.
If you want to read an index position that HAS been initialized, simply return the value.
If you want to write to an index position that has NOT been initialized, initialize it, and store the input.
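A minimal sketch of such a lazily-initialized type, backed by a dict; the class name and the bounds checking are made up for illustration:
class LazyZeroArray:
    """Array-like object whose unset positions read as zero."""

    def __init__(self, length):
        self._length = length
        self._data = {}               # only explicitly written positions are stored

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if not 0 <= i < self._length:
            raise IndexError(i)
        return self._data.get(i, 0)   # uninitialized -> 0, no storage touched

    def __setitem__(self, i, value):
        if not 0 <= i < self._length:
            raise IndexError(i)
        self._data[i] = value         # initialize on first write

a = LazyZeroArray(100000000)
print(a[12345])   # 0, nothing allocated
a[12345] = 7
print(a[12345])   # 7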

NumPy is the appropriate tool for a large, fixed-size, homogeneous array. Accessing individual elements of anything in Python isn't going to be all that fast, though whole-array operations can often be conducted at speeds similar to C or Fortran. If you need to do operations on millions and millions of elements individually quickly, there is only so much you can get out of Python.
What sort of algorithm are you implementing? How do you know that using sparse arrays is too slow if you haven't tried it? What do you mean by "efficient"? Do you want quick initialization? Is that the bottleneck of your code?

Related

Does NumPy array really take less memory than python list?

Please refer to the execution below:
import sys
import numpy as np
_list = [2,55,87]
print(f'1 - Memory used by Python List - {sys.getsizeof(_list)}')
narray = np.array([2,55,87])
size = narray.size * narray.itemsize
print(f'2 - Memory usage of np array using itemsize - {size}')
print(f'3 - Memory usage of np array using getsizeof - {sys.getsizeof(narray)}')
Here is what I get in result
1 - Memory used by Python List - 80
2 - Memory usage of np array using itemsize - 12
3 - Memory usage of np array using getsizeof - 116
One way of calculating suggests the numpy array consumes far less memory, but the other says it consumes more than a regular python list. Shouldn't I be using getsizeof with a numpy array? What am I doing wrong here?
Edit - I just checked: an empty python list consumes 56 bytes, whereas an empty np array consumes 104. Is this space being used for pointers to the associated built-in methods and attributes?
The calculation using:
size = narray.size * narray.itemsize
does not include the memory consumed by non-element attributes of the array object. This can be verified by the documentation of ndarray.nbytes:
>>> x = np.zeros((3,5,2), dtype=np.complex128)
>>> x.nbytes
480
>>> np.prod(x.shape) * x.itemsize
480
In the above link, it can be read that ndarray.nbytes:
Does not include memory consumed by non-element attributes of the array object.
Note that from the code above you can conclude that your calculation excludes non-element attributes given that the value is equal to the one from ndarray.nbytes.
A list of the non-element attributes can be found in the section Array Attributes, including here for completeness:
ndarray.flags Information about the memory layout of the array.
ndarray.shape Tuple of array dimensions.
ndarray.strides Tuple of bytes to step in each dimension when traversing an array.
ndarray.ndim Number of array dimensions.
ndarray.data Python buffer object pointing to the start of the array’s data.
ndarray.size Number of elements in the array.
ndarray.itemsize Length of one array element in bytes.
ndarray.nbytes Total bytes consumed by the elements of the array.
ndarray.base Base object if memory is from some other object.
With regard to sys.getsizeof, it can be read in the documentation (emphasis mine) that:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
Search on [numpy]getsizeof produces many potential duplicates.
The basic points are:
a list is a container, and the getsizeof docs warn us that it returns only the size of the container, not the size of the elements it references. So by itself it is an unreliable measure of the total size of a list (or tuple or dict).
getsizeof is a fairly good measure for arrays, if you take into account the roughly 100 bytes of "overhead". That overhead is a big part of a small array and a minor thing for a large one. nbytes is the simpler way of judging array memory use.
But for views, the data buffer is shared with the base, and doesn't count when using getsizeof.
object dtype arrays contain references like lists, so the same getsizeof caution applies.
Overall I think understanding how arrays and lists are stored is a more useful way of judging their respective memory use. Focus more on computational efficiency than on memory use. For small stuff and iterative uses, lists are better. Arrays are best when they are large and you use array methods to do the calculations.
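A small illustration of those points (the exact byte counts are platform- and version-dependent):
import sys
import numpy as np

a = np.zeros(1000000, dtype=np.int32)
print(a.nbytes)          # 4000000: just the element data
print(sys.getsizeof(a))  # about 4000000 plus ~100 bytes of per-object overhead

v = a[::2]               # a view: shares a's data buffer
print(v.nbytes)          # 2000000: bytes the view exposes
print(sys.getsizeof(v))  # only the ~100-byte object; the shared buffer is not counted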
Because numpy arrays have shapes, strides, and other member variables that define the data layout, it is reasonable that they (might) require some extra memory for this!
A list on the other hand has no specific type, or shape, etc.
Although, if you start appending elements to a list instead of simply writing them into an array, and also go to larger numbers of elements, e.g. 1e7, you will see different behaviour!
Example case:
import numpy as np
import sys
N = int(1e7)
narray = np.zeros(N)
mylist = []
for i in range(N):
    mylist.append(narray[i])
print("size of np.array:", sys.getsizeof(narray))
print("size of list :", sys.getsizeof(mylist))
On my (ASUS) Ubuntu 20.04 PC I get:
size of np.array: 80000104
size of list : 81528048
Note that it is not only the memory footprint that matters for an application's efficiency! The data layout is sometimes far more important.

Best Way to Append C Arrays (or any arrays) in Cython

I have a function in an inner loop that takes two arrays and combines them. To get a feel for what it's doing look at this example using lists:
a = [[1, 2, 3]]
b = [[4, 5, 6],
     [7, 8, 9]]

def combinearrays(a, b):
    a = a + b
    return a

def main():
    print(combinearrays(a, b))
The output would be:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
The key thing here is that I always have the same number of columns, but I want to append rows together. Also, the values are always ints.
As an added wrinkle, I cheated and created a as a list within a list. But in reality, it might be a single dimensional array that I want to still combine into a 2D array.
I am currently doing this using Numpy in real life (i.e. not the toy problem above) and this works. But I really want to make this as fast as possible and it seems like c arrays should be faster. Obviously one problem with c arrays if I pass them as parameters, is I won't know the actual number of rows in the arrays passed. But I can always add additional parameters to pass that.
So it's not like I don't have a solution to this problem using Numpy, but I really want to know what the single fastest way to do this is in Cython. Since this is a call inside an inner loop, it's going to get called thousands of times. So every little savings is going to count big.
One obvious idea here would be to use malloc or something like that.
While I'm not convinced this is the only option, let me recommend the simple option of building a standard Python list using append and then using np.vstack or np.concatenate at the end to build a full Numpy array.
Numpy arrays store all the data essentially contiguously in memory (this isn't 100% true if you're taking slices, but for freshly allocated memory it's basically true). When you resize the array it may get lucky, have unallocated memory after the array, and be able to reallocate in place. However, in general this won't happen and the entire contents of the array will need to be copied to the new location. (This will likely apply to any solution you devise yourself with malloc/realloc.)
Python lists are good for two reasons:
They are internally a list of PyObject* (in this case, pointers to the Numpy arrays it contains). If copying is needed during the resize, you are only copying the pointers to the arrays, not the whole arrays.
They are designed to handle resizing/appending intelligently by over-allocating the space needed, so that they only need to re-allocate more memory occasionally. Numpy arrays could have this feature, but it's less obviously a good thing for Numpy than it is for Python lists (if you have a 10GB data array that barely fits in memory, do you really want it over-allocated?).
My proposed solution uses the flexible, easily-resized list class to build your array, and then only finalizes to the inflexible but faster Numpy array at the end, therefore (largely) getting the best of both.
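A minimal sketch of that list-then-finalize approach in plain Python/Numpy (the function name is illustrative):
import numpy as np

def combine_rows(rows):
    collected = []                                  # cheap appends: only pointers move on resize
    for row in rows:
        collected.append(np.asarray(row, dtype=np.intc))
    return np.vstack(collected)                     # one contiguous 2-D array, built once at the end

a = [[1, 2, 3]]
b = [[4, 5, 6], [7, 8, 9]]
print(combine_rows(a + b))   # [[1 2 3] [4 5 6] [7 8 9]]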
A completely untested outline of the same structure using C to allocate would look like:
import numpy as np
from libc.stdlib cimport malloc, free, realloc

def build_rows(num_rows, row_length):   # illustrative wrapper so the outline has a context to return from
    cdef int** ptr_array = NULL
    cdef int* current_row = NULL
    # just to be able to return a numpy array
    cdef int[:, ::1] out
    rows_allocated = 0
    try:
        for row in range(num_rows):
            ptr_array = <int**>realloc(ptr_array, sizeof(int*) * (row + 1))
            ptr_array[row] = <int*>malloc(sizeof(int) * row_length)
            current_row = ptr_array[row]
            rows_allocated = row + 1
            # fill in data on current_row
        # pass to numpy so we can access it in Python. There are other
        # ways of transferring the data to Python...
        out = np.empty((rows_allocated, row_length), dtype=np.intc)
        for row in range(rows_allocated):
            for n in range(row_length):
                out[row, n] = ptr_array[row][n]
        return out.base
    finally:
        # clean up the memory we have allocated
        for row in range(rows_allocated):
            free(ptr_array[row])
        free(ptr_array)
This is unoptimized - a better version would over-allocate ptr_array to avoid resizing each time. Because of this I don't actually expect it to be quick, but it's meant as an indication of how to start.

Why is it more efficient to create a small array, and then expand it, rather than to create an array entirely from a large list?

I found the following code in a book but couldn't get a complete explanation.
from array import array

x = array('d', [0] * 1000000)
x = array('d', [0]) * 1000000
The first case creates an array of length 1000000 directly from a million-element list, while the second creates a single-element array and then multiplies it by the same factor.
The second case is about 100 times faster than the first.
What is the exact reason for the speed difference? How does Python's implementation of arrays play a role?
A Python list stores Python objects, but an array.array object stores raw C data types.
The first line requires individually processing every object in [0] * 1000000, following pointers and performing type checking and dynamic dispatch and reference counting and all that a million times to process every element and convert its data to a raw C double. Every element happens to be the same, but the array constructor doesn't know that. There's also the expense of building and cleaning up the million-element list.
The second line is way simpler. Python can just memcpy the array's contents a million times.
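A quick way to check the gap on your own machine (absolute timings and the exact ratio will vary):
import timeit

setup = "from array import array"
slow = timeit.timeit("array('d', [0] * 1000000)", setup=setup, number=20)
fast = timeit.timeit("array('d', [0]) * 1000000", setup=setup, number=20)
print("from full list:   %.3fs" % slow)
print("multiplied array: %.3fs" % fast)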

Put python long integers into memory without any space between them

I want to put many large long integers into memory without any space between them. How do I do that with Python 2.7 code on Linux?
The large long integers all use the same number of bits. There is about 4 GB of data in total. Leaving spaces of a few bits so that each long integer uses a multiple of 8 bits in memory is OK. I want to do bitwise operations on them later.
So far, I am using a python list. But I am not sure if that leaves no space in memory between the integers. Can ctypes help?
Thank you.
The old code uses bitarray (https://pypi.python.org/pypi/bitarray/0.8.1)
import bitarray
data = bitarray.bitarray()
with open('data.bin', 'rb') as f:
    data.fromfile(f)
result = data[:750000] & data[750000:750000*2]
This works and the bitarray has no gaps in memory. However, the bitarray bitwise AND is about 6 times slower than native Python's bitwise operation on long integers on this computer. Slicing the bitarray in the old code and accessing an element of the list in the newer code take roughly the same amount of time.
Newer code:
import cPickle as pickle
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)
# data is a list of python's (long) integers
result = data[0] & data[1]
Numpy:
In the above code, result = data[0] & data[1] creates a new long integer.
Numpy has the out option for numpy.bitwise_and. That would avoid creating a new numpy array. Yet, numpy's bool array seems to use one byte per bool instead of one bit per bool. While converting the bool array into a numpy.uint8 array avoids this problem, counting the number of set bits is too slow.
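For what it's worth, here is a sketch of the uint8 route described above, using random data in place of the real file; numpy.bitwise_and with out= reuses an existing array instead of allocating a new one, and the unpackbits-based popcount at the end is the step the question found too slow:
import numpy as np

nbytes = 750000 // 8                               # 750000 bits packed 8 per byte
x = np.random.randint(0, 256, nbytes, dtype=np.uint8)
y = np.random.randint(0, 256, nbytes, dtype=np.uint8)

out = np.empty_like(x)
np.bitwise_and(x, y, out=out)                      # writes into out; no new array allocated

set_bits = int(np.unpackbits(out).sum())           # popcount: the slow step mentioned above
print(set_bits)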
Python's native array module can't handle the large long integers:
import array

xstr = ''
for i in xrange(750000):
    xstr += '1'
x = int(xstr, 2)
ar = array.array('l', [x, x, x])
# OverflowError: Python int too large to convert to C long
You can use the array module, for example:
from array import array
ar = array('l', [25L, 26L, 27L])
ar[1] # 26L
The primary cause for space inefficiency is the internal structure of a Python long. Assuming a 64-bit platform, Python only uses 30 out of 32 bits to store a value. The gmpy2 library provides access to the GMP (GNU Multiple Precision Arithmetic Library). The internal structure of the gmpy2.mpz type uses all available bits. Here is the size difference for storing a 750000-bit value.
>>> import gmpy2
>>> import sys
>>> a=long('1'*750000, 2)
>>> sys.getsizeof(a)
100024
>>> sys.getsizeof(gmpy2.mpz(a))
93792
The & operation with gmpy2.mpz is also significantly faster.
$ python -m timeit -s "a=long('A'*93750,16);b=long('7'*93750)" "c=a & b"
100000 loops, best of 3: 7.78 usec per loop
$ python -m timeit -s "import gmpy2;a=gmpy2.mpz('A'*93750,16);b=gmpy2.mpz('7'*93750)" "c=a & b"
100000 loops, best of 3: 4.44 usec per loop
If all your operations are done in-place, the gmpy2.xmpz type allows changing the internal value of an instance without creating a new instance, and it is faster as long as everything stays in-place.
$ python -m timeit -s "import gmpy2;a=gmpy2.xmpz('A'*93750,16);b=gmpy2.xmpz('7'*93750)" "a &= b"
100000 loops, best of 3: 3.31 usec per loop
Disclaimer: I maintain the gmpy2 library.

Binary array in python

How do I create a big array in Python, as efficiently as creating one in C/C++:
byte *data = (byte*)malloc(10000);
or
byte *data = new byte[10000];
in python...?
Have a look at the array module:
import array
array.array('B', [0] * 10000)
Instead of passing a list to initialize it, you can pass a generator, which is more memory efficient.
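For example (a small sketch; the generator yields values one at a time, so no 10000-element temporary list is built):
import array

a = array.array('B', (0 for _ in range(10000)))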
You can pre-allocate a list with:
l = [0] * 10000
which will be slightly faster than .appending to it (as it avoids intermediate reallocations). However, this will generally allocate space for a list of pointers to integer objects, which will be larger than an array of bytes in C.
If you need memory efficiency, you could use an array object, e.g.:
import array, itertools
a = array.array('b', itertools.repeat(0, 10000))
Note that these may be slightly slower to use in practice, as accessing elements involves a boxing step (each value must first be converted to a Python int object).
You can efficiently create a big array with the array module, but using it won't be as fast as C. If you intend to do some math, you'd be better off with numpy.array.
Check this question for comparison.
Typically with python, you'd just create a list
mylist = []
and use it as an array. Alternatively, I think you might be looking for the array module. See http://docs.python.org/library/array.html.
