Preallocating ndarrays - python

How can I preallocate arrays of arrays so I can do appending a bit more efficiently. In Matlab there is a function called cell(required_length) which preallocates 'cells' which can store arrays.
I have an array that currently looks like:
a=np.array([[1,2],[1],[2,3,4]])
b=np.array([[20,2]])
However I wish to append 1000s of more array which are like the 'b' shown but varying in size.

This isn't just a quesiton of preallocating an array, such as np.empty((100,), dtype=int). It's as much a question about how to collect a large number of lists into one structure, whether it be a list or numpy array. The comparison with MATLAB cells is enough, in my opinion, to warrant further discussion.
I think you should be using Python lists. They can contain lists or other objects (incuding arrays) of varying sizes. You can easily append more items (or use extend to add multiple objects). Python has had them for ever; MATLAB added cells to approximate that flexibility.
np.arrays with dtype=object are similar - arrays of pointers to objects such as lists. For the most part they are just lists with an array wrapper. You can initial an array to some large size, and insert/set items.
A = np.empty((10,),dtype=object)
produces an array with 10 elements, each None.
A[0] = [1,2,3]
A[1] = [2,3]
...
You can also concatenate elements to an existing array, but the result is new one. There is a np.append function, but it is just a cover for concatenate; it should not be confused with the list append.
If it must be an array, you can easily construct it from the list at the end. That's what your np.array([[1,2],[1],[2,3,4]]) does.
How to add to numpy array entries of different size in a for loop (similar to Matlab's cell arrays)?
On the issue of speed, let's try simple time tests
def witharray(n):
result=np.empty((n,),dtype=object)
for i in range(n):
result[i]=list(range(i))
return result
def withlist(n):
result=[]
for i in range(n):
result.append(list(range(i)))
return result
which produce
In [111]: withlist(4)
Out[111]: [[], [0], [0, 1], [0, 1, 2]]
In [112]: witharray(4)
Out[112]: array([[], [0], [0, 1], [0, 1, 2]], dtype=object)
In [113]: np.array(withlist(4))
Out[113]: array([[], [0], [0, 1], [0, 1, 2]], dtype=object)
timetests
In [108]: timeit withlist(400)
1000 loops, best of 3: 1.87 ms per loop
In [109]: timeit witharray(400)
100 loops, best of 3: 2.13 ms per loop
In [110]: timeit np.array(withlist(400))
100 loops, best of 3: 8.95 ms per loop
Simply constructing a list of lists is fastest. But if the result must be an object type array, then assigning values to an empty array is faster.

Related

Iterate over numpy with index (numpy equivalent of python enumerate)

I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is in comparison, and writing to use a vectorized operation is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in a case of swapping out a form like some row iteration .apply(lambda)
#MSeifert's answer much better provides this and will be significantly more performant on a dataset of any real size
More general Answer by #cs95 covering and comparing alternatives to iteration in Pandas
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for index, values in np.ndenumerate(A):
print(index, values) # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
for _, val in np.ndenumerate(A):
ind = np.argwhere(A==val)
print val, ind
In this case you will obtain the array of indexes if value appears in array not once.

Multi-dimensional slicing a list of strings with numpy

Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?
No, you cannot do that. For numpy np.array(["abc", "def", "ghi"]) is a 1D array of strings, therefore you cannot use 2D slicing.
You could either define your array as a 2D array or characters, or simply use list comprehension for slicing,
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')
Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
dtype='|S3')
A S1,S2 dtype views each element as 2 strings, with 1 and 2 char each:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
dtype=[('f0', 'S1'), ('f1', 'S2')])
select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
dtype='|S2')
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']],
dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
dtype='|S2')
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
Some timings with the 1000 x 64 list that wflynny tests
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop # mine's slower computer
In [99]: timeit np.array(my_list_64).view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr =np.array(my_list_64)
.....: arr.view('S1').reshape(64,-1)[1:,1:].flatten().view('S63') .....:
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
See my edit history for my earlier notes on np.char.
As per Joe Kington here, python is very good at string manipulations and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would urge against it.
[s[1:] for s in my_list[1:]]
is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
for _ in range(randint(2, 64))])
for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.
Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it how you like, and then build it back together. Before this would require a copy, as #hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
I'm working on another layer of abstraction, specifically for string arrays called np.slice_ (currently work-in-progress in PR #20694, but the code is functional). If that should get accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])
Your slicing is incorrectly syntaxed. You only need to do my_list[1:] to get what you need. If you want to copy the elements twice onto a list, You can do something = mylist[1:].extend(mylist[1:])

How to extract those rows in a 2d array whose first element is in a different list? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
Suppose I have this array
array([[100, 1],
[200, 2],
[300, 3],
[400, 4],
[440, 3]])
And I have this list or a 1d array [100,300].
I want my operation to output [1,3].
How can I do this in numpy.
I am actually using these numpy arrays within Theano (a machine learning library which speeds up computation using gpu). I will have lots of rows. Numpy arrays allow me to seamlessly use them as Tensor objects in Theano. But if I had to use a dictionary I would have to do that in plain Python, and I am not sure if that will hold up well, once I move on to large data. So I am actually looking for a numpy operation, some trick in indexing or something like that.
You could use np.in1d:
In [12]: arr
Out[12]:
array([[100, 1],
[200, 2],
[300, 3],
[400, 4],
[440, 3]])
In [14]: vals = [100, 300]
In [23]: np.in1d(arr[:,0], vals)
Out[23]: array([ True, False, True, False, False], dtype=bool)
In [24]: arr[np.in1d(arr[:,0], vals), 1]
Out[24]: array([1, 3])
If you need to call np.in1d for many different values of vals, then it may pay to prepare a dict as arshajii suggests, since after preparing the dict (a O(n) operation, where n = len(arr)), looking up the values would be a O(m) operation, where m = len(vals).
If n gets very large however, a dict may require too much memory. In that case you may need to use np.in1d.
If the index (key) values are all ints and of small magnitude, there is a NumPy indexing trick you could use to get O(m) performance without using a dict:
In [30]: big = np.full(arr[:,0].max()+1, np.nan)
In [31]: big[arr[:,0]] = arr[:,1]
In [32]: big[vals]
Out[32]: array([ 1., 3.])
Preparing big is an O(n) operation, but indexing big[vals] is O(m). If arr[:,0].max() is small and the key values are ints, the advantage of using big is that it requires less memory than using a dict.
In [33]: %timeit arr[np.in1d(arr[:,0], vals), 1]
10000 loops, best of 3: 21.5 µs per loop
In [34]: %timeit big[vals]
1000000 loops, best of 3: 1.23 µs per loop
Compare with arshajii's solution:
In [38]: d = dict(arr)
In [40]: %timeit [d[k] for k in vals]
1000000 loops, best of 3: 447 ns per loop
So the best method to use depends on the size of arr and vals, how many times you will be performing this operation, how much memory you have, and if the keys are small ints. You'll need to benchmark on data relevant to your use case to make a good decision.
I would simply convert your array to a dictionary:
>>> a = array([[100, 1],
... [200, 2],
... [300, 3],
... [400, 4],
... [440, 3]])
>>>
>>> keys = [100, 300]
>>>
>>> d = dict(a)
>>>
>>> [d[k] for k in keys]
[1, 3]
If you're sure that all values to search for are actually present in the search array, you could also use np.searchsorted. Seems faster compared to the other suggestions, for large arrays.
s = np.sort(A[:,0])
A[np.searchsorted(s, values), 1]
If the array to search in is already sorted, you can omit the sort off course and the operation will be even quicker.

numpy array partial sums with weights

I have a numpy array, say, [a,b,c,d,e,...], and would like to compute an array that would look like [x*a+y*b, x*b+y*c, x*c+y*d,...]. The idea that I have is to first split the original array into something like [[a,b],[b,c],[c,d],[d,e],...] and then attack this creature with np.average specifying the axis and weights (x+y=1 in my case), or even use np.dot. Unfortunately, I don't know how to create such array of [a,b],[b,c],... pairs. Any help, or completely different idea even to accomplish the major task, are much appreciated :-)
The quickest, simplest would be to manually extract two slices of your array and add them together:
>>> arr = np.arange(5)
>>> x, y = 10, 1
>>> x*arr[:-1] + y*arr[1:]
array([ 1, 12, 23, 34])
This will turn into a pain if you want to generalize it to triples, quadruples... But you can create your array of pairs from the original array with as_strided in a much more general form:
>>> from numpy.lib.stride_tricks import as_strided
>>> arr_pairs = as_strided(arr, shape=(len(arr)-2+1,2), strides=arr.strides*2)
>>> arr_pairs
array([[0, 1],
[1, 2],
[2, 3],
[3, 4]])
Of course the nice thing about using as_strided is that, just like with the array slices, there is no data copying involved, just messing with the way memory is viewed, so creating this array is virtually costless.
And now probably the fastest is to use np.dot:
>>> xy = [x, y]
>>> np.dot(arr_pairs, xy)
array([ 1, 12, 23, 34])
This looks like a correlate problem.
a
Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7])
b
Out[62]: array([1, 2])
np.correlate(a,b,mode='valid')
Out[63]: array([ 2, 5, 8, 11, 14, 17, 20])
Depending on array size and BLAS dot can be faster, your milage will vary greatly:
arr = np.random.rand(1E6)
b = np.random.rand(2)
np.allclose(jamie_dot(arr,b),np.convolve(arr,b[::-1],mode='valid'))
True
%timeit jamie_dot(arr,b)
100 loops, best of 3: 16.1 ms per loop
%timeit np.correlate(arr,b,mode='valid')
10 loops, best of 3: 28.8 ms per loop
This is with an intel mkl BLAS and 8 cores, np.correlate will likely be faster for most implementations.
Also an interesting observation from #Jamie's post:
%timeit b[0]*arr[:-1] + b[1]*arr[1:]
100 loops, best of 3: 8.43 ms per loop
His comment also eliminated the use of np.convolve(a,b[::-1],mode=valid) to the simpler correlate syntax.
If you have a small array, I would create a shifted copy:
shifted_array=numpy.append(original_array[1:],0)
result_array=x*original_array+y*shifted_array
Here you have to store your array twice in memory, so this solution is very memory inefficient, but you can get rid of the for loops.
If you have large arrays, you really need a loop (but much rather a list comprehension):
result_array=[x*original_array[i]+y*original_array[i+1] for i in xrange(len(original_array)-1)]
It gives you the same result as a python list, except for the last item, which should be treated differently anyway.
Based on some random trials, for arrays smaller than 2000 items. the first solution seems to be faster than the second one, but runs into MemoryError even for relatively small arrays (a few 10s of thousands on my PC).
So generally, use a list comprehension, but if you surely know that you will run this only on small (max. 1-2 thousand) arrays, you have a better shot.
Creating a new list like [[a,b],[b,c],[c,d],[d,e],...] would be both memory and time inefficient, as you also need a for loop (or similar) to create it, and you have to store every old value in a new array twice, so you would end up with storing your original array three times.
Another way is to create the right pairs in the array a = np.array([a,b,c,d,e,...]), reshape according to the size of array b = np.array([x, y, ...]) and then take advantage of numpy broadcasting rules:
a = np.arange(8)
b = np.array([1, 2])
a = a.repeat(2)[1:-1]
ans = a.reshape(-1, b.shape[0]).dot(b)
Timings (on my computer):
#Ophion's solution:
# 100000 loops, best of 3: 4.67 µs per loop
This solution:
# 100000 loops, best of 3: 9.78 µs per loop
So, it is slower. #Jaime's solution is better since it does not copy the data like this one.

How to find which numpy array contains the maximum value on an element by element basis?

Given a list of numpy arrays, each with the same dimensions, how can I find which array contains the maximum value on an element-by-element basis?
e.g.
import numpy as np
def find_index_where_max_occurs(my_list):
# d = ... something goes here ...
return d
a=np.array([1,1,3,1])
b=np.array([3,1,1,1])
c=np.array([1,3,1,1])
my_list=[a,b,c]
array_of_indices_where_max_occurs = find_index_where_max_occurs(my_list)
# This is what I want:
# >>> print array_of_indices_where_max_occurs
# array([1,2,0,0])
# i.e. for the first element, the maximum value occurs in array b which is at index 1 in my_list.
Any help would be much appreciated... thanks!
Another option if you want an array:
>>> np.array((a, b, c)).argmax(axis=0)
array([1, 2, 0, 0])
So:
def f(my_list):
return np.array(my_list).argmax(axis=0)
This works with multidimensional arrays, too.
For the fun of it, I realised that #Lev's original answer was faster than his generalized edit, so this is the generalized stacking version which is much faster than the np.asarray version, but it is not very elegant.
np.concatenate((a[None,...], b[None,...], c[None,...]), axis=0).argmax(0)
That is:
def bystack(arrs):
return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
Some explanation:
I've added a new axis to each array: arr[None,...] is equivalent to arr[np.newaxis,...] which is the same as arr[np.newaxis,:,:,:] where the ... expands to be the appropriate number dimensions. The reason for this is because np.concatenate will then stack along the new dimension, which is 0 since the None is at the front.
So, for example:
In [286]: a
Out[286]:
array([[0, 1],
[2, 3]])
In [287]: b
Out[287]:
array([[10, 11],
[12, 13]])
In [288]: np.concatenate((a[None,...],b[None,...]),axis=0)
Out[288]:
array([[[ 0, 1],
[ 2, 3]],
[[10, 11],
[12, 13]]])
In case it helps to understand, this would work too:
np.concatenate((a[...,None], b[...,None], c[...,None]), axis=a.ndim).argmax(a.ndim)
where the new axis is now added at the end, so we must stack and maximize along that last axis, which will be a.ndim. For a, b, and c being 2d, we could do this:
np.concatenate((a[:,:,None], b[:,:,None], c[:,:,None]), axis=2).argmax(2)
Which is equivalent to the dstack I mentioned in my comment above (dstack adds a third axis to stack along if it doesn't exist in the arrays).
To test:
N = 10
M = 2
a = np.random.random((N,)*M)
b = np.random.random((N,)*M)
c = np.random.random((N,)*M)
def bystack(arrs):
return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
def byarray(arrs):
return np.array(arrs).argmax(axis=0)
def byasarray(arrs):
return np.asarray(arrs).argmax(axis=0)
def bylist(arrs):
assert arrs[0].ndim == 1, "ndim must be 1"
return [np.argmax(x) for x in zip(*arrs)]
In [240]: timeit bystack((a,b,c))
100000 loops, best of 3: 18.3 us per loop
In [241]: timeit byarray((a,b,c))
10000 loops, best of 3: 89.7 us per loop
In [242]: timeit byasarray((a,b,c))
10000 loops, best of 3: 90.0 us per loop
In [259]: timeit bylist((a,b,c))
1000 loops, best of 3: 267 us per loop
[np.argmax(x) for x in zip(*my_list)]
Well, this is just a list, but you know how to make it an array if you want. :)
To explain what this does: zip(*my_list) is equivalent to zip(a,b,c), which gives you a generator to loop over. Each step in the loop gives you a tuple like (a[i], b[i], c[i]), where i is the step in the loop. Then, np.argmax gives you the index of that tuple for the element with the largest value.

Categories

Resources