Python numpy array vs list

I need to perform some calculations on a large list of numbers.
Do array.array or numpy.array offer significant performance boost over typical arrays?
I don't have to do complicated manipulations on the arrays; I just need to be able to access and modify values,
e.g.
import numpy
x = numpy.array([0] * 1000000)
for i in range(1, len(x)):
    x[i] = x[i-1] + i
So I will not really be needing concatenation, slicing, etc.
Also, it looks like array throws an error if I try to assign values that don't fit in a C long:
import numpy
a = numpy.array([0])
a[0] += 1232234234234324353453453
print(a)
On the console I get:
a[0] += 1232234234234324353453453
OverflowError: Python int too large to convert to C long
Is there a variation of array that lets me put in unbounded Python integers?
Or would doing it that way take away the point of having arrays in the first place?

You first need to understand the difference between arrays and lists.
An array is a contiguous block of memory consisting of elements of some type (e.g. integers).
You cannot change the size of an array once it is created.
It therefore follows that each integer element in an array has a fixed size, e.g. 4 bytes.
On the other hand, a list is merely an "array" of addresses (which also have a fixed size).
But then each element holds the address of something else in memory, which is the actual integer that you want to work with. Of course, the size of this integer is irrelevant to the size of the array. Thus you can always create a new (bigger) integer and "replace" the old one without affecting the size of the array, which merely holds the address of an integer.
Of course, this convenience of a list comes at a cost: Performing arithmetic on the integers now requires a memory access to the array, plus a memory access to the integer itself, plus the time it takes to allocate more memory (if needed), plus the time required to delete the old integer (if needed). So yes, it can be slower, so you have to be careful what you're doing with each integer inside an array.
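A rough way to see the list's extra indirection and per-object overhead in memory terms (a sketch; exact numbers vary by platform and Python version, and sys.getsizeof accounting is only approximate):
import sys, array

lst = list(range(1000))
arr = array.array('i', range(1000))

# The list stores pointers, and each int object carries its own overhead on top.
print(sys.getsizeof(lst) + sum(sys.getsizeof(n) for n in lst))
# The array stores the raw 4-byte C ints contiguously, so it is several times smaller.
print(sys.getsizeof(arr))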

Your first example could be sped up. Python loops and access to individual items of a NumPy array are slow. Use vectorized operations instead:
import numpy as np
x = np.arange(1000000).cumsum()
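As a quick sanity check (a sketch on a smaller n), the cumulative sum reproduces the original loop exactly:
import numpy as np

n = 1000
loop = np.zeros(n, dtype=np.int64)
for i in range(1, n):
    loop[i] = loop[i-1] + i

vec = np.arange(n).cumsum()
print(np.array_equal(loop, vec))   # True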
You can put unbounded Python integers into a NumPy array:
a = np.array([0], dtype=object)
a[0] += 1232234234234324353453453
In this case, arithmetic will be slower than with fixed-size C integers.

For most uses, lists are fine. Sometimes, though, working with NumPy arrays is more convenient. For example:
a=[1,2,3,4,5,6,7,8,9,10]
b=[5,8,9]
If you want to access the elements of list 'a' at the discrete indices given in list 'b', writing
a[b]
will not work.
But when you use them as NumPy arrays, you can simply write
a[b]
to get array([ 6,  9, 10]) as the output.
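Concretely (a minimal runnable sketch):
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b = [5, 8, 9]
print(a[b])   # array([ 6,  9, 10])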

Do array.array or numpy.array offer significant performance boost over typical arrays?
I tried to test this a bit with the following code:
import timeit, math, array
from functools import partial
import numpy as np
# from the question
def calc1(x):
    for i in range(1, len(x)):
        x[i] = x[i-1] + 1

# a floating point operation
def calc2(x):
    for i in range(0, len(x)):
        x[i] = math.sin(i)
L = int(1e5)
# np
print('np 1: {:.5f} s'.format(timeit.timeit(partial(calc1, np.array([0] * L)), number=20)))
print('np 2: {:.5f} s'.format(timeit.timeit(partial(calc2, np.array([0] * L)), number=20)))
# np but with vectorized form
vfunc = np.vectorize(math.sin)
print('np 2 vectorized: {:.5f} s'.format(timeit.timeit(partial(vfunc, np.arange(0, L)), number=20)))
# with list
print('list 1: {:.5f} s'.format(timeit.timeit(partial(calc1, [0] * L), number=20)))
print('list 2: {:.5f} s'.format(timeit.timeit(partial(calc2, [0] * L), number=20)))
# with array
print('array 1: {:.5f} s'.format(timeit.timeit(partial(calc1, array.array("f", [0] * L)), number=20)))
print('array 2: {:.5f} s'.format(timeit.timeit(partial(calc2, array.array("f", [0] * L)), number=20)))
And the results were that list executes fastest here (Python 3.3, NumPy 1.8):
np 1: 2.14277 s
np 2: 0.77008 s
np 2 vectorized: 0.44117 s
list 1: 0.29795 s
list 2: 0.66529 s
array 1: 0.66134 s
array 2: 0.88299 s
This seems counterintuitive: for these simple, element-by-element examples there doesn't seem to be any advantage in using numpy or array over a plain list.

To the OP: for your use case, use lists.
My rules for when to use which, considering robustness and speed:
list: (most robust, fastest for mutable cases)
Ex. When your list is constantly mutating as in a physics simulation. When you are "creating" data from scratch that may be unpredictable in nature.
np.array: (less robust, fastest for linear algebra & data post-processing)
Ex. When you are "post processing" a data set that you have already collected via sensors or a simulation; performing operations that can be vectorized.

Do array.array or numpy.array offer significant performance boost over typical arrays?
It can, depending on what you're doing.
Or would doing it that way take away the point of having arrays in the first place?
Pretty much, yeah.

Use a = numpy.zeros(number_of_elements, dtype=numpy.int64), which gives you an array of 64-bit integers. These can store any integer between -2^63 and 2^63 - 1 (approximately -9.2*10^18 to 9.2*10^18), which is usually more than enough.
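For example (a minimal sketch):
import numpy as np

a = np.zeros(1000000, dtype=np.int64)   # one million 64-bit integers, all zero
a[0] = 2**62                            # fits comfortably in int64
# a[0] = 2**63 would raise OverflowError: int64 holds -2**63 .. 2**63 - 1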

Related

List comprehension for np matrixes

I have two np.matrixes, one of which I'm trying to normalize. I know, in general, list comprehensions are faster than for loops, so I'm trying to convert my double for loop into a list expression.
# normalize the rows and columns of A by B
for i in range(1, q+1):
    for j in range(1, q+1):
        A[i-1, j-1] = A[i-1, j-1] / (B[i-1] / B[j-1])
This is what I have gotten so far:
A = np.asarray([A/(B[i-1]/B[j-1]) for i, j in zip(range(1,q+1), range(1,q+1))])
but I think I'm taking the wrong approach because I'm not seeing any significant time difference.
Any help would be appreciated.
First, if you really do mean np.matrix, stop using np.matrix. It has all sorts of nasty incompatibilities, and its role is obsolete now that the @ operator for matrix multiplication exists. Even if you're stuck on a Python version without @, using the dot method with normal ndarrays is still better than dealing with np.matrix.
You shouldn't use any sort of Python-level iteration construct with NumPy arrays, whether for loops or list comprehensions, unless you're sure you have no better options. Assuming A is 2D and B is 1D with shapes (q, q) and (q,) respectively, what you should instead do for this case is
A *= B
A /= B[:, np.newaxis]
broadcasting the operation over A. This will allow NumPy to perform the iteration at C level directly over the arrays' underlying data buffers, without having to create wrapper objects and perform dynamic dispatch on every operation.
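A quick way to convince yourself that the broadcast form matches the double loop (a sketch with random data):
import numpy as np

q = 5
rng = np.random.default_rng(0)
A = rng.random((q, q))
B = rng.random(q)

# the original double loop
expected = A.copy()
for i in range(1, q + 1):
    for j in range(1, q + 1):
        expected[i-1, j-1] = expected[i-1, j-1] / (B[i-1] / B[j-1])

# the broadcast version
result = A.copy()
result *= B
result /= B[:, np.newaxis]

print(np.allclose(expected, result))   # True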

Python Typed Array of a Certain Size

This will create an empty array of type signed int:
import array
a = array.array('i')
What is an efficient (performance-wise) way to specify the array length (as well as the array's rank, i.e. its number of dimensions)?
I understand that NumPy allows you to specify the array size at creation, but can it be done in standard Python?
Initialising an array of fixed size in python
That question deals mostly with lists, and gives no consideration to performance. The main reason to use an array instead of a list is performance.
The array constructor accepts as a 2nd argument an iterable. So, the following works to efficiently create and initialize the array to 0..N-1:
x = array.array('i', range(N))
This does not create a separate N element vector or list.
(If using python 2, use xrange instead). Of course, if you need different initialization you may use generator object instead of range. For example, you can use generator expressions to fill the array with zeros:
a = array.array('i', (0 for i in range(N)))
The array module has no 2D (or higher-dimensional) arrays; you have to construct one from a list of 1D arrays, for example as sketched below.
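A minimal sketch of such a construction:
import array

rows, cols = 3, 5
# a "2D" structure: a list of rows, each row a 1D array of 4-byte signed ints
matrix = [array.array('i', [0]) * cols for _ in range(rows)]
matrix[1][2] = 42
print(matrix[1])   # array('i', [0, 0, 42, 0, 0])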
The truth is, if you are looking for a high performance implementation, you should probably use Numpy.
It's simple and fast to just use:
array.array('i', [0]) * n
Timing of different ways to initialize an array on my machine (a way to reproduce these numbers is sketched after the list):
n = 10 ** 7
array('i', [0]) * n # 21.9 ms
array('i', [0]*n) # 395.2 ms
array('i', range(n)) # 810.6 ms
array('i', (0 for _ in range(n))) # 1238.6 ms
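These numbers can be reproduced with something like the following (a sketch; absolute values depend on the machine and Python version):
import timeit

setup = "from array import array; n = 10**7"
statements = [
    "array('i', [0]) * n",
    "array('i', [0]*n)",
    "array('i', range(n))",
    "array('i', (0 for _ in range(n)))",
]
for stmt in statements:
    t = timeit.timeit(stmt, setup=setup, number=3) / 3
    print('{:<40s} {:8.1f} ms'.format(stmt, t * 1000))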
You said
The main reason to use an array instead of a list is performance.
Surely arrays use less memory than lists.
But in my experiments, I found no evidence that an array is always faster than a normal list.

How to efficiently construct a numpy array from a large set of data?

If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping entries as I go?
# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)
Edit:
Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
I'm just putting this here as it's a bit long for a comment.
Have you read the numpy documentation for array?
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements (dtype,
order, etc.).
...
"""
When you say you don't want to copy the data of the original array when creating the numpy array, what data structure are you hoping to end up with?
A lot of the speed-up you get from using NumPy comes from its C arrays being contiguous in memory. A Python list is just an array of pointers to objects, so every access has to go and find the actual object - which isn't the case in NumPy, as it's not written in Python and stores its data contiguously.
If you want to just have the numpy array reference the python arrays in your 2D array, then you'll lose the performance gains.
If you do np.array(my_2D_python_array, copy=False), I don't know offhand what it will actually produce, but you could easily test it yourself: look at the shape of the array, and see what kind of objects it houses.
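Such a test might look like this (a sketch; note that on NumPy 1.x copy=False means "copy only if needed", while NumPy 2.x instead raises an error when a copy is unavoidable):
import numpy as np

py_2d = [[1, 2, 3], [4, 5, 6]]
arr = np.asarray(py_2d)        # converting a nested list always builds a new buffer
print(arr.shape, arr.dtype)    # (2, 3) and an integer dtype
arr[0, 0] = 99
print(py_2d[0][0])             # still 1 -- the array has its own data, not a view of the list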
If you want the numpy array to be contiguous though, at some point you're going to have to allocate all of the memory it needs (and if it's as large as you're suggesting, it sounds like it might be difficult to find a contiguous section large enough).
Sorry that was pretty rambling, just a comment. How big are the actual arrays you're looking at?
Here's a small sample program whose CPU and memory usage I plotted:
from __future__ import division
#Make a large python 2D array
N, M = 10000, 18750
print "%i x %i = %i doubles = %f GB" % (N, M, N * M, N*M*8/10**9)
#grab pid to moniter memory and cpu usage
import os
pid = os.getpid()
os.system("python moniter.py -p " + str(pid) + " &")
print "building python matrix"
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]
import numpy
from datetime import datetime
print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=True)
print datetime.now(), "deleting array"
del(np1)
print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=False)
print datetime.now(), "deleting array"
del(np1)
In that plot, points 1, 2, and 3 mark where each of the matrices finishes being created. Note that the native Python list of lists takes up much more memory than the NumPy ones - Python objects each have their own overhead, and the lists are lists of objects. For the NumPy array this is not the case, so it is considerably smaller.
Also note that using copy=False on the Python object has no effect - new data is always created. You could get around this by creating a NumPy array of Python objects (using dtype=object), but I wouldn't advise it.

How do I fill two (or more) numpy arrays from a single iterable of tuples?

The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4 GB of RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
    # You, of course, have a different gendata...
    for i in xrange(N):
        yield (np.random.random(), str(i))
N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the NumPy book. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there. Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are the default column names. Since I defined the dtype as '<f8,|S20' and did not provide column names, NumPy named the first column 'f0' and the second 'f1'. If we had used
dtype=[('fval','<f8'), ('text','|S20')]
then the structured array arr would have column names 'fval' and 'text'.
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.)
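A minimal sketch of that fixed-length behaviour (longer values are silently truncated):
import numpy as np

arr = np.zeros(2, dtype=[('fval', '<f8'), ('text', '|S20')])
arr[0] = (0.5, b'short string')
arr[1] = (1.5, b'a string that is much longer than twenty bytes')
print(arr['text'][1])   # b'a string that is muc' -- cut at 20 bytes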
NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even a multidimensional one) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or somehow be redesigned. NumPy is simply not built this way.
NumPy does have an object dtype which lets each element hold a pointer to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if that is too small it will try to resize the array. If the original block of memory can be extended, you are in luck; but if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which slows down the performance significantly.
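For example (a sketch of the count parameter; the generator and sizes here are made up):
import numpy as np

n = 1000
gen = ((float(i), str(i).encode()) for i in range(n))
# count=n tells NumPy the number of rows up front, so it can allocate once
arr = np.fromiter(gen, dtype='<f8,|S20', count=n)
print(arr.shape, arr.dtype.names)   # (1000,) ('f0', 'f1')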
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in xrange(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.

numpy: efficient execution of a complex reshape of an array

I am reading a vendor-provided large binary array into a 2D numpy array tempfid(M, N)
# load data
data=numpy.fromfile(file=dirname+'/fid', dtype=numpy.dtype('i4'))
# convert to complex data
fid=data[::2]+1j*data[1::2]
tempfid=fid.reshape(I*J*K, N)
and then I need to reshape it into a 4D array useful4d(N,I,J,K) using non-trivial mappings for the indices. I do this with a for loop along the following lines:
for idx in range(M):
    i = f1(idx)  # f1, f2, and f3 are functions involving / and % as well as some lookups
    j = f2(idx)
    k = f3(idx)
    newfid[:, i, j, k] = tempfid[idx, :]  # SLOW! CAN WE IMPROVE THIS?
Converting to complex takes 33% of the time, while copying these M slices takes the remaining 66%. Calculating the indices is fast irrespective of whether I do this one by one in a loop as shown or by numpy.vectorizing the operation and applying it to an arange(M).
Is there a way to speed this up? Any help on more efficient slicing, copying (or not) etc appreciated.
EDIT:
As learned in the answer to question "What's the fastest way to convert an interleaved NumPy integer array to complex64?" the conversion to complex can be sped up by a factor of 6 if a view is used instead:
fid = data.astype(numpy.float32).view(numpy.complex64)
idx = numpy.arange(M)
i = numpy.vectorize(f1)(idx)
j = numpy.vectorize(f2)(idx)
k = numpy.vectorize(f3)(idx)
# you can index arrays with other arrays
# that lets you specify this operation in one line.
newfid[:, i,j,k] = tempfid.T
I've never used numpy's vectorize. Vectorize just means that numpy will call your Python function multiple times. In order to get speed, you need to use whole-array operations like the one I showed here, and like the one you used to get the complex numbers.
EDIT
The problem is that the dimension of size 128 was first in newfid, but last in tempfid. This is easily fixed by using .T, which takes the transpose.
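To make the dimension issue concrete, here is a toy-sized check (a sketch; the index functions f1, f2, f3 below are hypothetical stand-ins for the real mappings):
import numpy as np

M, N, I, J, K = 6, 4, 1, 2, 3          # M == I*J*K
tempfid = np.arange(M * N).reshape(M, N)

def f1(idx): return idx // (J * K)     # hypothetical index mappings
def f2(idx): return (idx // K) % J
def f3(idx): return idx % K

# slow version: copy one slice per iteration
expected = np.empty((N, I, J, K), dtype=tempfid.dtype)
for idx in range(M):
    expected[:, f1(idx), f2(idx), f3(idx)] = tempfid[idx, :]

# vectorized version: index with whole arrays and transpose tempfid
idx = np.arange(M)
newfid = np.empty((N, I, J, K), dtype=tempfid.dtype)
newfid[:, f1(idx), f2(idx), f3(idx)] = tempfid.T
print(np.array_equal(expected, newfid))   # True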
How about this: set up your indices using the vectorized versions of f1, f2, f3 (not necessarily using np.vectorize, but perhaps just writing a function that takes an array and returns an array), then use np.ix_:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html
to get the index arrays. Then reshape tempfid to the same shape as newfid and then use the results of np.ix_ to set the values. For example:
tempfid = np.arange(10)
i = f1(idx) # i = [4,3,2,1,0]
j = f2(idx) # j = [1,0]
ii = np.ix_(i,j)
newfid = tempfid.reshape((5,2))[ii]
This maps the elements of tempfid onto a new shape with a different ordering.
