Let's say I have a Numpy array called a:
a = np.array([2,3,8,11,30,39,44,49,55,61])
I would like to retrieve multiple intervals based on two other arrays:
l = np.array([2,5,42])
r = np.array([10,40,70])
Doing something equivalent to this:
a[(a > l) & (a < r)]
With this as the desired output:
Out[1]: [[3 8],[ 8 11 30 39],[44 49 55 61]]
Of course I could do a simple for loop iterating over l and r, but the real life dataset is huge, so I would like to prevent looping as much as possible.
You can't avoid looping given the ragged nature of the output. But we should try to reduce the compute done while iterating. So, here's one way that simply slices into the input array while iterating, since most of the compute is done up front by getting the start, stop indices per group with searchsorted (this assumes a is sorted) -
lidx = np.searchsorted(a,l,'right')
ridx = np.searchsorted(a,r,'left')
out = [a[i:j] for (i,j) in zip(lidx,ridx)]
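As a quick check of the searchsorted approach on the data from the question, here is a minimal self-contained sketch. It assumes a is sorted, which np.searchsorted requires; the names lidx, ridx and out follow the snippet above.

```python
import numpy as np

a = np.array([2, 3, 8, 11, 30, 39, 44, 49, 55, 61])  # must be sorted
l = np.array([2, 5, 42])    # exclusive lower bounds
r = np.array([10, 40, 70])  # exclusive upper bounds

# side='right' skips elements equal to the lower bound (a > l);
# side='left' stops before elements equal to the upper bound (a < r).
lidx = np.searchsorted(a, l, side='right')
ridx = np.searchsorted(a, r, side='left')

# Basic slices are views into a, so the loop copies no data.
out = [a[i:j] for i, j in zip(lidx, ridx)]
print(out)
```

The loop still runs once per interval, but each iteration is only a slice, so the cost is dominated by the two searchsorted calls.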
Here's one approach, broadcasting to obtain the indexing arrays, and using np.split to split the array:
# boolean mask of shape (len(a), 3): column k marks the elements of a inside window k
w = (a[:,None] > l) & (a[:,None] < r)
# row indices (positions in a) where the condition is satisfied
ix, _ = np.where(w)
# split points: cumulative number of elements per window (column sums)
np.split(a[ix], np.cumsum(w.sum(0)))[:-1]
# [array([3, 8]), array([ 8, 11, 30, 39]), array([44, 49, 55, 61])]
I have an array:
a = [1, 3, 5, 7, 29 ... 5030, 6000]
This array gets created from a previous process, and the length of the array could be different (it is depending on user input).
I also have an array:
b = [3, 15, 67, 78, 138]
(Which could also be completely different)
I want to use the array b to slice the array a into multiple arrays.
More specifically, I want the result arrays to be:
array1 = a[:3]
array2 = a[3:15]
...
arrayn = a[138:]
Where n = len(b).
My first thought was to create a 2D array slices with dimension (len(b), something). However we don't know this something beforehand so I assigned it the value len(a) as that is the maximum amount of numbers that it could contain.
I have this code:
slices = np.zeros((len(b), len(a)))
for i in range(1, len(b)):
    slices[i] = a[b[i-1]:b[i]]
But I get this error:
ValueError: could not broadcast input array from shape (518) into shape (2253412)
You can use numpy.split:
np.split(a, b)
Example:
np.split(np.arange(10), [3,5])
# [array([0, 1, 2]), array([3, 4]), array([5, 6, 7, 8, 9])]
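To connect this back to the slices listed in the question, here is a small sketch. The array a is a stand-in (an assumption, since the real data is not shown); b is taken from the question. Note that np.split returns len(b) + 1 pieces, because the final piece a[138:] is included.

```python
import numpy as np

a = np.arange(200)            # stand-in for the real data (assumption)
b = [3, 15, 67, 78, 138]      # split points from the question

parts = np.split(a, b)        # len(b) + 1 pieces

# The pieces correspond to a[:3], a[3:15], ..., a[138:]
first_ok = np.array_equal(parts[0], a[:3])
second_ok = np.array_equal(parts[1], a[3:15])
last_ok = np.array_equal(parts[-1], a[138:])
lengths = [len(p) for p in parts]
print(first_ok, second_ok, last_ok, lengths)
```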
b.insert(0,0)
result = []
for i in range(1,len(b)):
    sub_list = a[b[i-1]:b[i]]
    result.append(sub_list)
result.append(a[b[-1]:])
You are getting the error because you are attempting to create a ragged array. This is not allowed in numpy.
An improvement on @Bohdan's answer:
from itertools import zip_longest
result = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]
The trick here is that zip_longest makes the final slice go from b[-1] to None, which is equivalent to a[b[-1]:], removing the need for special processing of the last element.
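The pairing that zip_longest produces can be inspected directly. This is only an illustrative sketch: a is a stand-in array (an assumption), and the int() conversion is there just to make the printed pairs easier to read.

```python
from itertools import zip_longest
import numpy as np

a = np.arange(200)            # stand-in data (assumption)
b = [3, 15, 67, 78, 138]

starts = np.r_[0, b]          # one element longer than b
# zip_longest pads the shorter sequence (b) with None at the end
pairs = [(int(s), e) for s, e in zip_longest(starts, b)]
print(pairs)

# a[138:None] is the same as a[138:], so the last slice needs no special case
result = [a[s:e] for s, e in pairs]
```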
Please do not select this. This is just a thing I added for fun. The "correct" answer is @Psidom's answer.
Consider a matrix M1 giving values for all combinations x,y. Consider a partition f(x)->X and a partition g(y)->Y. Furthermore consider an operation p(A) on a set A of numbers, i.e. max(A) or sum(A).
The mappings f,g can be used to create from M1 a block matrix M2 where all x that are mapped to the same X are adjacent, and the same for all y.
This matrix M2 has a block for each combination of the 'sets' X,Y.
Now I would like to condense this matrix M2 into another matrix M3 by applying p on each block separately. M3 has one value for each combination of X,Y.
Ideally, I would like to skip the transformation of M1 into M2 using f and g on the fly.
What would be the most efficient way to perform such operation and would it be possible to deploy numpy or scipy for it?
Special case: Actually, in my case x and y are identical and there is only one function f applied to both of them. I only care about the part of M2 that is under the diagonal.
The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices X and Y. You can then group by the partition group indices and finally restructure the matrix back into its original form.
For example, if your matrix is
>>> M1 = np.arange(25).reshape((5,5))
>>> M1
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
and your partitions are
>>> def f(x):
...     return np.array([1,1,1,2,2])[x]
>>> def g(y):
...     return np.array([3,4,4,4,5])[y]
From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a DataFrame and using its stack() method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.
>>> st = pd.DataFrame(M1).stack().to_frame('M1')
>>> st
M1
0 0 0
1 1
2 2
3 3
4 4
1 0 5
...
4 3 23
4 24
(I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:
>>> st['X'] = f(st.index.get_level_values(0))
>>> st['Y'] = g(st.index.get_level_values(1))
Then you can group by those indices and apply your aggregation function of choice.
>>> stp = st.groupby(['X', 'Y']).agg(p)
You will have to define p (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like sum(), you can just use st.groupby(...).sum() because Pandas has built-in support for that and a few other standard functions, but agg is general and works for any reduction function p you can provide.
Finally, the unstack() method will convert the DataFrame back into the properly 2D "matrix form", and then if you want you can use the to_numpy() method to turn it back into a pure Numpy array. (as_matrix() did the same in older pandas but was removed in version 1.0.)
>>> M3 = stp.unstack().to_numpy()
>>> M3
array([[ 15, 63, 27],
[ 35, 117, 43]])
If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example. However I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that having an additional 2 or 3 copies of them would exhaust the available memory.
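For the specific case where p is a sum, the same grouping can be sketched in plain NumPy without flattening, using np.add.at. This is an assumption-laden sketch rather than a general answer: fx and gy are hypothetical names for the partition labels of each row and column, and other reductions would need a different ufunc (for example np.maximum.at with a suitable initial value).

```python
import numpy as np

M1 = np.arange(25).reshape(5, 5)
fx = np.array([1, 1, 1, 2, 2])   # f applied to the row indices
gy = np.array([3, 4, 4, 4, 5])   # g applied to the column indices

# Map arbitrary group labels to consecutive integers 0..k-1
_, xi = np.unique(fx, return_inverse=True)
_, yi = np.unique(gy, return_inverse=True)

M3 = np.zeros((xi.max() + 1, yi.max() + 1), dtype=M1.dtype)
# Broadcast the row and column group ids to M1's shape and accumulate;
# np.add.at handles repeated target indices correctly (unbuffered add).
np.add.at(M3, (xi[:, None], yi[None, :]), M1)
print(M3)
```

This avoids building M2 and needs only the two small index arrays in addition to M1 and M3.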
Let M1 be a numpy n x m array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:
xs = sorted(set(f(i) for i in range(n)))
ys = sorted(set(g(i) for i in range(m)))
To build a block matrix for each X,Y you can use numpy boolean indexing along with the grid-construction helper ix_ to select only rows and columns belonging to X and Y, respectively. Finally, apply p to the selected submatrix:
from numpy import zeros, arange, ix_
ii, jj = arange(n), arange(m)
M3 = zeros((len(xs), len(ys)))
for k, X in enumerate(xs):
    for l, Y in enumerate(ys):
        M3[k,l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])
The partitions f and g have to apply element-wise to numpy arrays for this to work. As mentioned in the other answer the numpy.vectorize decorator can be used to achieve this.
To give an example:
from __future__ import division
import numpy as np
n = m = 5
M1 = np.arange(25).reshape(5,5)
f = lambda x: x // 3 # f(ii) = [0, 0, 0, 1, 1]
g = lambda x: (x+2) // 3 # g(jj) = [0, 1, 1, 1, 2]
p = np.sum
# M3 = [[ 15., 63., 27.],
#       [ 35., 117., 43.]]
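If a partition function only works on single values, numpy.vectorize can wrap it so that comparisons like f(ii) == X produce a boolean mask. A minimal sketch, where f_scalar is a hypothetical partition made up for illustration:

```python
import numpy as np

def f_scalar(x):
    # Hypothetical partition written for a single integer only:
    # the conditional expression would fail on a whole array.
    return 0 if x < 3 else 1

# np.vectorize applies f_scalar element by element (a convenience,
# not a speed-up: it still loops in Python internally).
f = np.vectorize(f_scalar)

ii = np.arange(5)
labels = f(ii)
mask = labels == 1
print(labels, mask)
```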
I ran into the same problem some years later, and in my opinion the best solution is as follows:
M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][: , g(y) == j])
This assumes that f takes values on [0,1,..,n-1] and that g takes values on [0,1,..,m-1]
An example would be
import numpy as np
M1 = np.random.random((4,6))
print(M1)
x = range(4)
y = range(6)
p = np.sum
def f(x):
    return np.array([0,0,1,2])[x]
def g(y):
    return np.array([0,1,1,0,1,0])[y]
n = 3 # number of elements in partition f
m = 2 # number of elements in partition g
M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][: , g(y) == j])
print(M2)
To automate n and m you can use len(set(f(x))) and len(set(g(y)))
I'm looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)
This sounds kind of awkward, so have an example:
# Test values:
values = [
np.array([10, 11, 10, 11, 10, 11, 10]),
np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# * *
Note that elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays were also the same (namely 10 and 21). Similar for elements with indices 3 and 5 (3).
The algorithm has to deal with an arbitrary number of (equally-size) input arrays, and also return, for each resulting number, what values of the original arrays they correspond to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm:
import numpy as np
def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1 # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?
Finally cracked it with a vectorized solution! It was an interesting problem. We have to tag each pair of values taken from corresponding elements of the arrays in the list, and the tag has to reflect the pair's uniqueness among the other pairs. So, we can use np.unique, abusing all its optional arguments, and finally do some additional work to keep the order for the final output. Here's the implementation, basically done in three stages -
# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)
# Do the heavy work with np.unique to give us :
# 1. Starting indices of unique elems,
# 2. Array that has unique IDs for each element in idx, and
# 3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
return_inverse=True,return_counts=True)
# Best part happens here : Use mask to ignore the repeated elems and re-tag
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
Runtime test
Let's compare the proposed vectorized approach against the original code. Since the proposed code gives us the group IDs only, for a fair benchmark let's trim off the parts of the original code that are not needed for that. So, here are the function definitions -
def groupify(values): # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group
def groupify_vectorized(values): # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                    return_inverse=True,return_counts=True)
    mask = ~np.in1d(unqID,np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays -
In [345]: # Input list with random elements
...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
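The question also asked for the values behind each group ID. A possible variation on the same np.unique idea, sketched here as an assumption rather than as part of the benchmarked code, is to call np.unique on the stacked columns with axis=1 and then renumber the labels by first appearance. The names labels and meanings are illustrative.

```python
import numpy as np

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]

arr = np.vstack(values)  # one column per index
uniq_cols, first_idx, inverse = np.unique(
    arr, axis=1, return_index=True, return_inverse=True)
inverse = inverse.ravel()  # guard against version-dependent output shape

# np.unique sorts the columns; renumber so labels follow first appearance.
order = np.argsort(first_idx)
rank = np.empty_like(order)
rank[order] = np.arange(order.size)

labels = rank[inverse]
meanings = uniq_cols[:, order].T   # meanings[k] holds the values for label k
print(labels)
print(meanings)
```

Unlike ravel_multi_index, this does not build a linear index, so it does not depend on the value ranges being small, at the cost of a slower row-wise unique.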
This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:
import numpy as np
# Test values:
values = [
np.array([10, 11, 10, 11, 10, 11, 10]),
np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]
# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)
# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False
# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)
# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]): # this element has not been labelled yet
        labelled[k] = val # so label it
        labelled[row] = val # and label every element satisfying the test
        val += 1
print(labelled)
# outputs [ 0. 1. 2. 3. 0. 3. 4.]
It is about a factor of 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup should be better for more arrays.
The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy arrayset operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:
import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:
unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
If I understand correctly, you are trying to hash values according to columns. It's better to convert the columns into arbitrary values by themselves, and then find the hashes from them.
So you actually want to hash on list(np.array(values).T).
This functionality is already built into Pandas. You don't need to write it. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert each inner list to a string with list(map(str, np.array(values).T)) and factorize that! (The list() call is needed in Python 3, where map returns an iterator.)
>>> import pandas as pd
>>> pd.factorize(list(map(str, np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))
I have converted your list of arrays into an array, and then into a string ...
I have a big 1D array of data. I have a starts array of indexes into that data where important things happened. I want to get an array of ranges so that I get windows of length L, one for each starting point in starts. Bogus sample data:
data = np.linspace(0,10,50)
starts = np.array([0,10,21])
length = 5
I want to instinctively do something like
data[starts:starts+length]
But really, I need to turn starts into 2D array of range "windows." Coming from functional languages, I would think of it as a map from a list to a list of lists, like:
np.apply_along_axis(lambda i: np.arange(i,i+length), 0, starts)
But that won't work because apply_along_axis only allows scalar return values.
You can do this:
pairs = np.vstack([starts, starts + length]).T
ranges = np.apply_along_axis(lambda p: np.arange(*p), 1, pairs)
data[ranges]
Or you can do it with a list comprehension:
data[np.array([np.arange(i,i+length) for i in starts])]
Or you can do it iteratively. (Bleh.)
Is there a concise, idiomatic way to slice into an array at certain start points like this? (Pardon the numpy newbie-ness.)
data = np.linspace(0,10,50)
starts = np.array([0,10,21])
length = 5
For a NumPy only way of doing this, you can use numpy.meshgrid() as described here
http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html
As hpaulj pointed out in the comments, meshgrid actually isn't needed for this problem as you can use array broadcasting.
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
# indices = sum(np.meshgrid(np.arange(length), starts))
indices = np.arange(length) + starts[:, np.newaxis]
# array([[ 0, 1, 2, 3, 4],
# [10, 11, 12, 13, 14],
# [21, 22, 23, 24, 25]])
data[indices]
returns
array([[ 0. , 0.20408163, 0.40816327, 0.6122449 , 0.81632653],
[ 2.04081633, 2.24489796, 2.44897959, 2.65306122, 2.85714286],
[ 4.28571429, 4.48979592, 4.69387755, 4.89795918, 5.10204082]])
If you need to do this many times, you can use as_strided() to create a sliding-window array of data
data = np.linspace(0,10,50000)
length = 5
starts = np.random.randint(0, len(data)-length, 10000)
from numpy.lib.stride_tricks import as_strided
sliding_window = as_strided(data, (len(data) - length + 1, length),
(data.itemsize, data.itemsize))
Then you can use:
sliding_window[starts]
to get what you want.
It's also faster than creating the index array.
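On NumPy 1.20 or later, sliding_window_view builds the same kind of windowed view without hand-written strides. A minimal sketch using the sample data from the question, compared against the broadcast index array:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

data = np.linspace(0, 10, 50)
starts = np.array([0, 10, 21])
length = 5

# Read-only view of shape (len(data) - length + 1, length); no data is copied
windows = sliding_window_view(data, length)
result = windows[starts]                      # shape (3, 5)

# Same values as indexing with the broadcast index array
indices = np.arange(length) + starts[:, np.newaxis]
same = np.array_equal(result, data[indices])
print(windows.shape, result.shape, same)
```

Because the shape and strides are computed by the library, this avoids the out-of-bounds memory access that a wrong shape passed to as_strided can cause.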
I have a 2D numpy array that I need to take the max of along a specific axis. I then need to later know which indexes were selected for this operation as a mask for another operation which is only done on those same indexes but on another array of the same shape.
Right now I'm doing it by using 2d array indexing, but it's slow and kind of convoluted, particularly the mgrid hack to generate the row indexes. It's just [0,1] for this example but I need the robustness to work with arbitrary shapes.
a = np.array([[0,0,5],[0,0,5]])
b = np.array([[1,1,1],[1,1,1]])
columnIndexes = np.argmax(a,axis=1)
rowIndexes = np.mgrid[0:a.shape[0],0:columnIndexes.size-1][0].flatten()
b[rowIndexes,columnIndexes] = b[rowIndexes,columnIndexes]+1
b should now be array([[1,1,2],[1,1,2]]) since it performed the operation on b for only the indexes of the max along the columns of a.
Anyone know a better way? Preferably using just boolean masking arrays so that I can port this code to run on a GPU without too much hassle. Thanks!
I will suggest an answer but with slightly different data.
c = np.array([[0,1,1],[2,1,0]]) # note that the first row has a duplicated max
d = np.array([[0,10,10],[20,10,0]]) # data to be changed
c_argmax = np.argmax(c,axis=1)[:,np.newaxis]
b_map1 = c_argmax == np.arange(c.shape[1])
# now use the bool map as you described
d[b_map1] += 1
d
[out]
array([[ 0, 11, 10],
[21, 10, 0]])
Note that I created an original with a duplicate of the largest number. The above works with argmax as you requested, but you might have wanted to increment all max values, as in:
c_max = np.max(c,axis=1)[:,np.newaxis]
b_map2 = c_max == c
d[b_map2] += 1
d
[out]
array([[ 0, 12, 11],
[22, 10, 0]])
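If integer indexing is acceptable, the mgrid construction from the question can be replaced by a plain np.arange over the rows. A minimal sketch on the question's data (rows and cols are illustrative names):

```python
import numpy as np

a = np.array([[0, 0, 5], [0, 0, 5]])
b = np.array([[1, 1, 1], [1, 1, 1]])

rows = np.arange(a.shape[0])     # one row index per row of a
cols = np.argmax(a, axis=1)      # column of the (first) max in each row

# Pairs (rows[k], cols[k]) select exactly one element per row
b[rows, cols] += 1
print(b)
```

The boolean-mask versions above are probably the better fit when the code has to be ported to a GPU, as the question asks; this variant is mainly a simpler replacement for the mgrid line.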