How to pythonically get the max of a numpy argwhere function

I want to use numpy argwhere to find where a maximum in my data is. Below is a sample set that describes what I am doing:
bins = np.arange(10)
data = np.array([[6],[4],[8],[5]])
np.argwhere(bins<data)
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[2, 4],
[2, 5],
[2, 6],
[2, 7],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4]])
What I want from this data is
array([[0,5],
[1,3],
[2,7],
[3,4]])
This could be done with a for loop, but I was wondering if there was a more pythonic way to do this.
EDIT:
What I have done for now is use Pandas and groupby. I am still wondering if this is the best method.
t = pd.DataFrame(np.argwhere(bins<data))
time = t.groupby(0)
time.max()
1
0
0 5
1 3
2 7
3 4
Now that I have this, I have a new problem. Let's say I have another set of data:
BigData = np.array([[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9]])
How can I use the array I obtained
array([[0,5],
[1,3],
[2,7],
[3,4]])
with this new data to get, for each row of BigData, the average of the values up to the index in the second column? I.e.
(0+1+2+3+4) / 5
(0+1+2) / 3
(0+1+2+3+4+5+6) / 7
(0+1+2+3) / 4
would be the return of BigData, assuming that we got the index value of where this happens in column two.

Here's a fairly short NumPy solution that's also pretty fast:
A = np.argwhere(bins<data)
print(A[np.r_[A[1:,0] != A[:-1,0], True]])
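To spell out why that one-liner works: np.argwhere returns its index pairs in row-major order, so within each row the column indices are already ascending and the last entry of each group is the maximum. A minimal sketch of the same idea with the mask split out (the name is_last_of_row is mine, purely for illustration):

```python
import numpy as np

bins = np.arange(10)
data = np.array([[6], [4], [8], [5]])

# argwhere yields (row, column) pairs sorted by row, then by column.
A = np.argwhere(bins < data)

# True wherever the next pair belongs to a different row. The final pair
# always closes its group, hence the appended True.
is_last_of_row = np.r_[A[1:, 0] != A[:-1, 0], True]

result = A[is_last_of_row]
print(result)
```

This relies on every row having at least one True entry; a row with no match simply does not appear in the result.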

Here's a NumPy solution. It is not as readable as the Pandas version, but timing suggests it is much faster:
>>> arr = np.argwhere(bins<data)
>>> arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])
>>> %timeit arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
10000 loops, best of 3: 32.7 µs per loop
>>> %%timeit
... t = pd.DataFrame(arr)
... time = t.groupby(0)
... time.max()
...
1000 loops, best of 3: 1 ms per loop

The following seems to be pretty fast for me, taking advantage of argmax working left -> right:
>>> bins[::-1][(bins[::-1] < data).argmax(axis=1)]
array([5, 3, 7, 4])
For me %timeit shows that this takes around 11µs.
However, manipulating the array to have the index as the first column (as follows) increases time to around 25µs:
>>> np.column_stack(
... [np.arange(data.shape[0]), bins[::-1][(bins[::-1] < data).argmax(axis=1)]])
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])
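The follow-up part of the question (averaging each row of BigData up to the found index) is not covered by the answers above. Here is a rough sketch of one way to do it without a loop, assuming every row of bins<data has at least one True and the found index is greater than zero (otherwise the division below breaks). The names mask, last, prefix and row_means are my own:

```python
import numpy as np

bins = np.arange(10)
data = np.array([[6], [4], [8], [5]])
BigData = np.tile(np.arange(10), (4, 1))  # same values as in the question

mask = bins < data                         # shape (4, 10)

# Index of the last True in each row: argmax on the reversed row finds the
# first True from the right, which is then mapped back to a normal index.
last = mask.shape[1] - 1 - mask[:, ::-1].argmax(axis=1)

# Select the columns strictly before that index, i.e. BigData[i, :last[i]].
prefix = np.arange(BigData.shape[1]) < last[:, None]

row_means = (BigData * prefix).sum(axis=1) / prefix.sum(axis=1)
print(last)
print(row_means)
```

The boolean prefix mask stands in for the per-row slices, which have different lengths and so cannot be taken with a single ordinary slice.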


Intersection of 2-d numpy arrays

I am looking for a way to get the intersection between two 2-dimensional numpy.array of shape (n_1, m) and (n_2, m). Note that n_1 and n_2 can differ but m is the same for both arrays. Here are two minimal examples with the expected results:
import numpy as np
array1a = np.array([[2], [2], [5], [1]])
array1b = np.array([[5], [2]])
array_intersect(array1a, array1b)
## array([[2],
## [5]])
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
array_intersect(array2a, array2b)
## array([[2, 1],
## [3, 3]])
If someone has a clue on how I should implement the array_intersect function, I would be very grateful!
How about using sets?
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
a = set((tuple(i) for i in array2a))
b = set((tuple(i) for i in array2b))
a.intersection(b) # {(2, 1), (3, 3)}
Another approach would be to harness the broadcasting feature
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
test = array2a[:, None] == array2b  # shape (5, 3, 2): each row of array2a against each row of array2b
print(array2b[test.all(axis=2).any(axis=0)]) # [[2 1]
# [3 3]]
but this is less readable imo. [edit]: or use the unique and set combination. In short, there are many options!
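For reference, the broadcasting idea can be wrapped into the array_intersect function the question asks for. This is only a sketch: it builds an (n_1, n_2, m) boolean array, so it is a reasonable fit for small inputs but can use a lot of memory for large ones. The intermediate names are my own.

```python
import numpy as np

def array_intersect(a, b):
    # Compare every row of a with every row of b: shape (n_1, n_2, m).
    equal = a[:, None, :] == b[None, :, :]
    # Two rows match only if all m entries agree.
    row_matches = equal.all(axis=2)
    # Keep the rows of b matched by at least one row of a, without duplicates.
    return np.unique(b[row_matches.any(axis=0)], axis=0)

array1a = np.array([[2], [2], [5], [1]])
array1b = np.array([[5], [2]])
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])

print(array_intersect(array1a, array1b))
print(array_intersect(array2a, array2b))
```

np.unique with axis=0 also sorts the rows, which is why the result for the first example comes back as [[2], [5]] rather than in the order of array1b.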
Here's a way to do without any loops or list comprehensions, assuming you have scipy installed (I haven't tested for speed):
In [31]: from scipy.spatial.distance import cdist
In [32]: np.unique(array1a[np.where(cdist(array1a, array1b) == 0)[0]], axis=0)
Out[32]:
array([[2],
[5]])
In [33]: np.unique(array2a[np.where(cdist(array2a, array2b) == 0)[0]], axis=0)
Out[33]:
array([[2, 1],
[3, 3]])
Construct a set of tuples from the first array and test each line of the second array. Or vice versa.
def array_intersect(a, b):
    s = {tuple(x) for x in a}
    return np.unique([x for x in b if tuple(x) in s], axis=0)
The numpy-indexed package (disclaimer: I am its author) was created with the exact purpose of providing such functionality in an expressive and efficient manner:
import numpy_indexed as npi
npi.intersect(a, b)
Note that the implementation is fully vectorized; that is, there are no loops over the arrays in Python.
arr1 = np.arange(20000).reshape(-1,2)
arr2 = arr1.copy()
np.random.shuffle(arr2)
print(len(arr1)) #10000
%%timeit
res= np.array([x
for x in set(tuple(x) for x in arr1) & set(tuple(x) for x in arr2)
])
83.7 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

How do I sort a 2D NumPy array? [duplicate]

How do I sort a NumPy array by its nth column?
For example, given:
a = array([[9, 2, 3],
[4, 5, 6],
[7, 0, 5]])
I want to sort the rows of a by the second column to obtain:
array([[7, 0, 5],
[9, 2, 3],
[4, 5, 6]])
To sort by the second column of a:
a[a[:, 1].argsort()]
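A short worked version of that one-liner on the array from the question, in case the intermediate step helps. argsort returns the row order that sorts the chosen column, and indexing with it reorders whole rows (the names order and sorted_rows are mine):

```python
import numpy as np

a = np.array([[9, 2, 3],
              [4, 5, 6],
              [7, 0, 5]])

# Column 1 is [2, 5, 0]; argsort gives the positions that sort it.
order = a[:, 1].argsort()

# Indexing with an integer array reorders the rows and returns a new array.
sorted_rows = a[order]
print(order)
print(sorted_rows)
```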
@steve's answer is actually the most elegant way of doing it.
For the "correct" way see the order keyword argument of numpy.ndarray.sort
However, you'll need to view your array as an array with fields (a structured array).
The "correct" way is quite ugly if you didn't initially define your array with fields...
As a quick example, to sort it and return a copy:
In [1]: import numpy as np
In [2]: a = np.array([[1,2,3],[4,5,6],[0,0,1]])
In [3]: np.sort(a.view('i8,i8,i8'), order=['f1'], axis=0).view(np.int64)
Out[3]:
array([[0, 0, 1],
[1, 2, 3],
[4, 5, 6]])
To sort it in-place:
In [6]: a.view('i8,i8,i8').sort(order=['f1'], axis=0) #<-- returns None
In [7]: a
Out[7]:
array([[0, 0, 1],
[1, 2, 3],
[4, 5, 6]])
@Steve's really is the most elegant way to do it, as far as I know...
The only advantage to this method is that the "order" argument is a list of the fields to sort by. For example, you can sort by the second column, then the third column, then the first column by supplying order=['f1','f2','f0'].
You can sort on multiple columns as per Steve Tjoa's method by using a stable sort like mergesort and sorting the indices from the least significant to the most significant columns:
a = a[a[:,2].argsort()] # First sort doesn't need to be stable.
a = a[a[:,1].argsort(kind='mergesort')]
a = a[a[:,0].argsort(kind='mergesort')]
This sorts by column 0, then 1, then 2.
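As a sanity check, the chained stable sorts should agree with np.lexsort, which takes its keys with the primary key last. A small sketch on random test data (the seeded array is made up for illustration, and I use kind='stable' for every pass to keep it simple):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=(20, 3))   # small value range, so ties are common

# Chain of stable sorts, least significant column first.
b = a[a[:, 2].argsort(kind='stable')]
b = b[b[:, 1].argsort(kind='stable')]
b = b[b[:, 0].argsort(kind='stable')]

# lexsort: the LAST key is the primary one, so column 0 goes last.
c = a[np.lexsort((a[:, 2], a[:, 1], a[:, 0]))]

print(np.array_equal(b, c))
```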
In case someone wants to make use of sorting at a critical part of their programs here's a performance comparison for the different proposals:
import numpy as np
table = np.random.rand(5000, 10)
%timeit table.view('f8,f8,f8,f8,f8,f8,f8,f8,f8,f8').sort(order=['f9'], axis=0)
1000 loops, best of 3: 1.88 ms per loop
%timeit table[table[:,9].argsort()]
10000 loops, best of 3: 180 µs per loop
import pandas as pd
df = pd.DataFrame(table)
%timeit df.sort_values(9, ascending=True)
1000 loops, best of 3: 400 µs per loop
So, it looks like indexing with argsort is the quickest method so far...
From the NumPy mailing list, here's another solution:
>>> a
array([[1, 2],
[0, 0],
[1, 0],
[0, 2],
[2, 1],
[1, 0],
[1, 0],
[0, 0],
[1, 0],
[2, 2]])
>>> a[np.lexsort(np.fliplr(a).T)]
array([[0, 0],
[0, 0],
[0, 2],
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[1, 2],
[2, 1],
[2, 2]])
As the Python documentation wiki suggests:
a = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
a = sorted(a, key=lambda a_entry: a_entry[1])
print(a)
Output:
[[0, 0, 1], [1, 2, 3], [4, 5, 6]]
I had a similar problem.
My Problem:
I want to calculate an SVD and need to sort my eigenvalues in descending order. But I want to keep the mapping between eigenvalues and eigenvectors.
My eigenvalues were in the first row and the corresponding eigenvector below it in the same column.
So I want to sort a two-dimensional array column-wise by the first row in descending order.
My Solution
a = a[::, a[0,].argsort()[::-1]]
So how does this work?
a[0,] is just the first row I want to sort by.
Now I use argsort to get the order of indices.
I use [::-1] because I need descending order.
Lastly I use a[::, ...] to get the columns in the right order. Note that indexing with an index array returns a copy, not a view.
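A tiny made-up example of the steps above, with the values to sort by in the first row and the associated data in the rows below (the numbers and the names order and sorted_cols are only for illustration):

```python
import numpy as np

a = np.array([[1.0, 3.0, 2.0],      # row 0: the values to sort by
              [10.0, 30.0, 20.0],   # remaining rows travel with their column
              [11.0, 31.0, 21.0]])

order = a[0].argsort()[::-1]        # column order for descending row 0
sorted_cols = a[:, order]           # reorder whole columns
print(sorted_cols)
```

Each column stays intact, so the pairing between the first-row value and the entries below it is preserved.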
import numpy as np
a=np.array([[21,20,19,18,17],[16,15,14,13,12],[11,10,9,8,7],[6,5,4,3,2]])
y=np.argsort(a[:,2],kind='mergesort')# a[:,2]=[19,14,9,4]
a=a[y]
print(a)
Desired output is [[6,5,4,3,2],[11,10,9,8,7],[16,15,14,13,12],[21,20,19,18,17]]
Note that argsort(numArray) returns the indices that would put numArray in sorted order.
Example:
x=np.array([8,1,5])
z=np.argsort(x) # [1,2,0] are the indices that sort the array
print(x[z]) # integer (fancy) indexing, which reorders the array using the indices saved in z
The result would be [1 5 8]
A little more complicated lexsort example - descending on the 1st column, secondarily ascending on the 2nd. The tricks with lexsort are that it sorts on rows (hence the .T), and gives priority to the last.
In [120]: b=np.array([[1,2,1],[3,1,2],[1,1,3],[2,3,4],[3,2,5],[2,1,6]])
In [121]: b
Out[121]:
array([[1, 2, 1],
[3, 1, 2],
[1, 1, 3],
[2, 3, 4],
[3, 2, 5],
[2, 1, 6]])
In [122]: b[np.lexsort(([1,-1]*b[:,[1,0]]).T)]
Out[122]:
array([[3, 1, 2],
[3, 2, 5],
[2, 1, 6],
[2, 3, 4],
[1, 1, 3],
[1, 2, 1]])
Here is another solution considering all columns (more compact way of J.J's answer);
ar=np.array([[0, 0, 0, 1],
[1, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 1],
[0, 0, 1, 0],
[1, 1, 0, 0]])
Sort with lexsort,
ar[np.lexsort(([ar[:, i] for i in range(ar.shape[1]-1, -1, -1)]))]
Output:
array([[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 1],
[1, 0, 1, 0],
[1, 1, 0, 0]])
Pandas Approach Just For Completeness:
a = np.array([[9, 2, 3],
[4, 5, 6],
[7, 0, 5]])
a = pd.DataFrame(a)
a.sort_values(1, ascending=True).to_numpy()
array([[7, 0, 5], # '1' means sort by second column
[9, 2, 3],
[4, 5, 6]])
Did the Benchmark, comparing with the accepted answer:
%timeit pandas_df.sort_values(9, ascending=True)
1000 loops, best of 3: 400 µs per loop
%timeit numpy_table[numpy_table[:,9].argsort()]
10000 loops, best of 3: 180 µs per loop
It is an old question, but if you need to generalize this to arrays with more than two dimensions, here is a solution that can be easily generalized:
np.einsum('ij->ij', a[a[:,1].argsort(),:])
This is overkill for two dimensions, and a[a[:,1].argsort()] would be enough per @steve's answer; however, that answer cannot be generalized to higher dimensions. You can find an example of a 3D array in this question.
Output:
[[7 0 5]
[9 2 3]
[4 5 6]]
# sort the rows by the first column (index 0)
indexofsort=np.argsort(dataset[:,0],axis=-1,kind='stable')
dataset = dataset[indexofsort,:]
def sort_np_array(x, column=None, flip=False):
    x = x[np.argsort(x[:, column])]
    if flip:
        x = np.flip(x, axis=0)
    return x
Array in the original question:
a = np.array([[9, 2, 3],
[4, 5, 6],
[7, 0, 5]])
The result of the sort_np_array function as expected by the author of the question:
sort_np_array(a, column=1, flip=False)
[2]: array([[7, 0, 5],
[9, 2, 3],
[4, 5, 6]])
Thanks to this post: https://stackoverflow.com/a/5204280/13890678
I found a more "generic" answer using structured array.
I think one advantage of this method is that the code is easier to read.
import numpy as np
a = np.array([[9, 2, 3],
[4, 5, 6],
[7, 0, 5]])
struct_a = np.rec.fromarrays(
    a.transpose(), names="col1, col2, col3", formats="i8, i8, i8"
)
struct_a.sort(order="col2")
print(struct_a)
[(7, 0, 5) (9, 2, 3) (4, 5, 6)]
Simply use sorted, with the index of the column you want to sort by as the key.
a = np.array([[1,1], [1,-1], [-1,1], [-1,-1]])
print (a)
a = a.tolist()
a = np.array(sorted(a, key=lambda a_entry: a_entry[0]))
print (a)

Fill 2-D numpy array with index location

I've been trying to figure out a clean, pythonic way to fill each element of an empty numpy array with the index value(s) of that element, without using for loops. For 1-D, it's easy, you can just use something like np.arange or just a basic range. But at 2-D and higher dimensions, I'm stumped on how to easily do this.
(Edit: Or just build a regular list like this, then np.array(lst) it. I think I just answered my question - use a list comprehension?)
Example:
rows = 4
cols = 4
arr = np.empty((rows, cols, 2)) # 4x4 matrix with [x,y] location
for y in range(rows):
    for x in range(cols):
        arr[y, x] = [y, x]
'''
Expected output:
[[[0,0], [0,1], [0,2], [0,3]],
[[1,0], [1,1], [1,2], [1,3]],
[[2,0], [2,1], [2,2], [2,3]],
[[3,0], [3,1], [3,2], [3,3]]]
'''
What you are showing is a meshgrid of a 4x4 matrix; you can either use np.mgrid and then move the first axis to the end:
np.moveaxis(np.mgrid[:rows,:cols], 0, -1)
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
Or use np.meshgrid with matrix indexing ij:
np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
another way using np.indices and concatenate
np.concatenate([x.reshape(4,4,1) for x in np.indices((4,4))],2)
or with np.dstack
np.dstack(np.indices((4,4)))
Some benchmarking, since you have a ton of possibilities:
def Psidom_mrgid(rows,cols):
    np.mgrid[:rows, :cols].transpose((1, 2, 0))
def Psidom_mesh(rows,cols):
    np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))
def Mad_tile(rows,cols):
    r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
    c = np.tile(np.arange(cols), (rows, 1))
    result = np.stack((r, c), axis=-1)
def bora_comp(rows,cols):
    x = [[[i, j] for j in range(cols)] for i in range(rows)]
def djk_ind(rows,cols):
    np.concatenate([x.reshape(rows, cols, 1) for x in np.indices((rows, cols))], 2)
def devdev_mgrid(rows,cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    np.dstack(index_tuple).reshape((rows, cols, 2))
In[8]: %timeit Psidom_mrgid(1000,1000)
100 loops, best of 3: 15 ms per loop
In[9]: %timeit Psidom_mesh(1000,1000)
100 loops, best of 3: 9.98 ms per loop
In[10]: %timeit Mad_tile(1000,1000)
100 loops, best of 3: 15.3 ms per loop
In[11]: %timeit bora_comp(1000,1000)
1 loop, best of 3: 221 ms per loop
In[12]: %timeit djk_ind(1000,1000)
100 loops, best of 3: 9.72 ms per loop
In[13]: %timeit devdev_mgrid(1000,1000)
10 loops, best of 3: 20.6 ms per loop
I guess that's pretty pythonic:
[[[i,j] for j in range(5)] for i in range(5)]
Output:
[[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4]],
[[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]],
[[2, 0], [2, 1], [2, 2], [2, 3], [2, 4]],
[[3, 0], [3, 1], [3, 2], [3, 3], [3, 4]],
[[4, 0], [4, 1], [4, 2], [4, 3], [4, 4]]]
Check out numpy.mgrid, which will return two arrays with the i and j indices. To combine them you can stack the arrays and reshape them. Something like this:
import numpy as np
def index_pair_array(rows, cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    return np.dstack(index_tuple).reshape((rows, cols, 2))
There are a few ways of doing this numpythonically.
One way is using np.tile and np.stack:
r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
c = np.tile(np.arange(cols), (rows, 1))
result = np.stack((r, c), axis=-1)
A better way of getting the coordinates might be np.meshgrid:
rc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
result = np.stack(rc, axis=-1)
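Since several of the approaches in this thread claim to build the same array, here is a quick check that they agree with the loop from the question. This is a sketch on a non-square shape so that a swapped row/column order would show up; the variable names are mine.

```python
import numpy as np

rows, cols = 3, 4

# Reference result: the explicit loop from the question.
expected = np.empty((rows, cols, 2), dtype=int)
for y in range(rows):
    for x in range(cols):
        expected[y, x] = [y, x]

# np.mgrid and np.indices both return shape (2, rows, cols);
# moving axis 0 to the end gives (rows, cols, 2).
via_mgrid = np.moveaxis(np.mgrid[:rows, :cols], 0, -1)
via_indices = np.moveaxis(np.indices((rows, cols)), 0, -1)

# meshgrid with indexing='ij' returns two (rows, cols) arrays to stack.
rc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
via_meshgrid = np.stack(rc, axis=-1)

for candidate in (via_mgrid, via_indices, via_meshgrid):
    print(np.array_equal(candidate, expected))
```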

Can numpy strides stride only within subarrays?

I have a really big NumPy array (145000 rows * 550 cols), and I want to create rolling slices within subarrays. I tried to implement it with a function. The function lagged_vals behaves as expected, but np.lib.stride_tricks.as_strided does not behave the way I want it to -
def lagged_vals(series,l):
    # Garbage implementation but still right
    return np.concatenate(
        [[x[i:i+l] for i in range(x.shape[0]) if i+l <= x.shape[0]] for x in series],
        axis=0)
# Sample 2D numpy array
something = np.array([[1,2,2,3],[2,2,3,3]])
lagged_vals(something,2) # Works as expected
# array([[1, 2],
# [2, 2],
# [2, 3],
# [2, 2],
# [2, 3],
# [3, 3]])
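np.lib.stride_tricks.as_strided(something,
    (something.shape[0]*something.shape[1] - 1, 2),  # one row fewer, so the last window stays inside the buffer
    (8,8))  # strides in bytes, assuming 8-byte integers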
# array([[1, 2],
# [2, 2],
# [2, 3],
# [3, 2], <--- across subarray stride, which I do not want
# [2, 2],
# [2, 3],
# [3, 3]])
How do I remove that particular row in the np.lib.stride_tricks implementation? And how can I scale this cross array stride removal for a big numpy array ?
Sure, that's possible with np.lib.stride_tricks.as_strided. Here's one way -
from numpy.lib.stride_tricks import as_strided
L = 2 # window length
shp = a.shape
strd = a.strides
out_shp = shp[0],shp[1]-L+1,L
out_strd = strd + (strd[1],)
out = as_strided(a, out_shp, out_strd).reshape(-1,L)
Sample input, output -
In [177]: a
Out[177]:
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
In [178]: out
Out[178]:
array([[0, 1],
[1, 2],
[2, 3],
[4, 5],
[5, 6],
[6, 7]])
Note that the final reshape forces a copy. That can't be avoided if we need the final output to be 2D. If we are okay with a 3D output, we can skip the reshape and get a view, as shown with the sample case -
In [181]: np.shares_memory(a, out)
Out[181]: False
In [182]: as_strided(a, out_shp, out_strd)
Out[182]:
array([[[0, 1],
[1, 2],
[2, 3]],
[[4, 5],
[5, 6],
[6, 7]]])
In [183]: np.shares_memory(a, as_strided(a, out_shp, out_strd) )
Out[183]: True
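If your NumPy is recent enough (1.20 or later, as far as I know), numpy.lib.stride_tricks.sliding_window_view does the same per-row windowing without hand-computed strides. A sketch on the array from the question, not a drop-in replacement for the answer above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

something = np.array([[1, 2, 2, 3],
                      [2, 2, 3, 3]])
L = 2  # window length

# Windows slide along axis 1 only, so they never cross from one row
# into the next. The result is a read-only view of shape (2, 3, 2).
windows = sliding_window_view(something, L, axis=1)

# Flattening to 2D forces a copy, as with the as_strided version.
flat = windows.reshape(-1, L)
print(flat)
```

For a 145000 x 550 array the 3D view costs no extra memory, while the flattened 2D copy is roughly L times the size of the input.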

Search Numpy array with multiple values

I have a 2D NumPy array with duplicate values.
I am searching the array like this:
In [104]: import numpy as np
In [105]: array = np.array
In [106]: a = array([[1, 2, 3],
...: [1, 2, 3],
...: [2, 5, 6],
...: [3, 8, 9],
...: [4, 8, 9],
...: [4, 2, 3],
...: [5, 2, 3]])
In [107]: num_list = [1, 4, 5]
In [108]: for i in num_list:
     ...:     print(a[np.where(a[:,0] == i)])
...:
[[1 2 3]
[1 2 3]]
[[4 8 9]
[4 2 3]]
[[5 2 3]]
The input is a list of numbers to look up in column 0.
The end result I want is the resulting rows in any format like array, list or tuple for example
array([[1, 2, 3],
[1, 2, 3],
[4, 8, 9],
[4, 2, 3],
[5, 2, 3]])
My code works fine but doesn't seem pythonic. Is there any better searching strategy with multiple values?
like a[np.where(a[:,0] == l)] where only one time lookup is done to get all the values.
my real array is large
Approach #1 : Using np.in1d -
a[np.in1d(a[:,0], num_list)]
Approach #2 : Using np.searchsorted -
num_arr = np.sort(num_list) # Sort num_list and get as array
# Get indices of occurrences of first column in num_list
idx = np.searchsorted(num_arr, a[:,0])
# Take care of out of bounds cases
idx[idx==len(num_arr)] = 0
out = a[a[:,0] == num_arr[idx]]
You can do
a[numpy.in1d(a[:, 0], num_list), :]
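On newer NumPy versions np.isin is the recommended spelling of np.in1d (which, as far as I know, is deprecated in NumPy 2.0). A minimal sketch with the data from the question; the names mask and out are mine:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [1, 2, 3],
              [2, 5, 6],
              [3, 8, 9],
              [4, 8, 9],
              [4, 2, 3],
              [5, 2, 3]])
num_list = [1, 4, 5]

# One boolean per row: is the first-column value among the wanted numbers?
mask = np.isin(a[:, 0], num_list)

# Boolean indexing keeps the matching rows in their original order.
out = a[mask]
print(out)
```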
