I've been trying to figure out a clean, pythonic way to fill each element of an empty numpy array with the index value(s) of that element, without using for loops. For 1-D it's easy: you can just use something like np.arange or a basic range. But at 2-D and higher dimensions, I'm stumped on how to do this easily.
(Edit: Or just build a regular list like this, then np.array(lst) it. I think I just answered my question - use a list comprehension?)
Example:
import numpy as np

rows = 4
cols = 4
arr = np.empty((rows, cols, 2))  # 4x4 grid of [y, x] locations
for y in range(rows):
    for x in range(cols):
        arr[y, x] = [y, x]
'''
Expected output:
[[[0,0], [0,1], [0,2], [0,3]],
[[1,0], [1,1], [1,2], [1,3]],
[[2,0], [2,1], [2,2], [2,3]],
[[3,0], [3,1], [3,2], [3,3]]]
'''
What you are showing is a meshgrid of a 4x4 matrix. You can either use np.mgrid and move the leading axis to the end:
np.moveaxis(np.mgrid[:rows,:cols], 0, -1)
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
Or use np.meshgrid with matrix ('ij') indexing:
np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))
#array([[[0, 0],
# [0, 1],
# [0, 2],
# [0, 3]],
# [[1, 0],
# [1, 1],
# [1, 2],
# [1, 3]],
# [[2, 0],
# [2, 1],
# [2, 2],
# [2, 3]],
# [[3, 0],
# [3, 1],
# [3, 2],
# [3, 3]]])
Another way is to use np.indices and np.concatenate:
np.concatenate([x.reshape(4, 4, 1) for x in np.indices((4, 4))], axis=2)
Or np.dstack:
np.dstack(np.indices((4, 4)))
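The same idea extends to any number of dimensions, since np.indices returns one index grid per axis; a minimal sketch (the shape here is just an example):
shape = (4, 4)                              # works for any N-D shape
idx = np.stack(np.indices(shape), axis=-1)  # result shape: shape + (len(shape),)
print(idx.shape)                            # (4, 4, 2)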
Some benchmarking, since you have a ton of possibilities:
def Psidom_mgrid(rows, cols):
    np.mgrid[:rows, :cols].transpose((1, 2, 0))

def Psidom_mesh(rows, cols):
    np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))

def Mad_tile(rows, cols):
    r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
    c = np.tile(np.arange(cols), (rows, 1))
    result = np.stack((r, c), axis=-1)

def bora_comp(rows, cols):
    x = [[[i, j] for j in range(cols)] for i in range(rows)]

def djk_ind(rows, cols):
    np.concatenate([x.reshape(rows, cols, 1) for x in np.indices((rows, cols))], 2)

def devdev_mgrid(rows, cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    np.dstack(index_tuple).reshape((rows, cols, 2))
In[8]: %timeit Psidom_mgrid(1000,1000)
100 loops, best of 3: 15 ms per loop
In[9]: %timeit Psidom_mesh(1000,1000)
100 loops, best of 3: 9.98 ms per loop
In[10]: %timeit Mad_tile(1000,1000)
100 loops, best of 3: 15.3 ms per loop
In[11]: %timeit bora_comp(1000,1000)
1 loop, best of 3: 221 ms per loop
In[12]: %timeit djk_ind(1000,1000)
100 loops, best of 3: 9.72 ms per loop
In[13]: %timeit devdev_mgrid(1000,1000)
10 loops, best of 3: 20.6 ms per loop
I guess that's pretty pythonic:
[[[i,j] for j in range(5)] for i in range(5)]
Output:
[[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4]],
[[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]],
[[2, 0], [2, 1], [2, 2], [2, 3], [2, 4]],
[[3, 0], [3, 1], [3, 2], [3, 3], [3, 4]],
[[4, 0], [4, 1], [4, 2], [4, 3], [4, 4]]]
Check out numpy.mgrid, which will return two arrays with the i and j indices. To combine them you can stack the arrays and reshape them. Something like this:
import numpy as np
def index_pair_array(rows, cols):
    index_tuple = np.mgrid[0:rows, 0:cols]
    return np.dstack(index_tuple).reshape((rows, cols, 2))
There are a few ways of doing this numpythonically.
One way is using np.tile and np.stack:
r = np.tile(np.arange(rows).reshape(rows, 1), (1, cols))
c = np.tile(np.arange(cols), (rows, 1))
result = np.stack((r, c), axis=-1)
A better way of getting the coordinates might be np.meshgrid:
rc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
result = np.stack(rc, axis=-1)
Related
I am looking for a way to get the intersection of two 2-dimensional numpy arrays of shapes (n_1, m) and (n_2, m). Note that n_1 and n_2 can differ, but m is the same for both arrays. Here are two minimal examples with the expected results:
import numpy as np
array1a = np.array([[2], [2], [5], [1]])
array1b = np.array([[5], [2]])
array_intersect(array1a, array1b)
## array([[2],
## [5]])
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
array_intersect(array2a, array2b)
## array([[2, 1],
## [3, 3]])
If someone has a clue on how I should implement the array_intersect function, I would be very grateful!
How about using sets?
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
a = set((tuple(i) for i in array2a))
b = set((tuple(i) for i in array2b))
a.intersection(b) # {(2, 1), (3, 3)}
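If you need the result back as a numpy array rather than a set of tuples, one small follow-up (sorting only because the iteration order of a set is arbitrary):
np.array(sorted(a & b))
# array([[2, 1],
#        [3, 3]])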
Another approach would be to harness the broadcasting feature:
import numpy as np
array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
test = array2a[:, None] == array2b    # shape (5, 3, 2)
print(array2b[test.all(-1).any(0)])   # [[2 1]
                                      #  [3 3]]
but this is less readable, IMO. [Edit]: or use a unique-based combination, sketched below. In short, there are many options!
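One reading of that unique-based idea, as a sketch: deduplicate each array first, so that a row count of two in the concatenation can only mean the row is present in both arrays.
ua = np.unique(array2a, axis=0)
ub = np.unique(array2b, axis=0)
uniq, counts = np.unique(np.concatenate([ua, ub]), axis=0, return_counts=True)
print(uniq[counts > 1])  # [[2 1]
                         #  [3 3]]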
Here's a way to do it without any loops or list comprehensions, assuming you have scipy installed (I haven't tested for speed):
In [31]: from scipy.spatial.distance import cdist
In [32]: np.unique(array1a[np.where(cdist(array1a, array1b) == 0)[0]], axis=0)
Out[32]:
array([[2],
[5]])
In [33]: np.unique(array2a[np.where(cdist(array2a, array2b) == 0)[0]], axis=0)
Out[33]:
array([[2, 1],
[3, 3]])
Construct a set of tuples from the first array and test each row of the second array against it. Or vice versa.
def array_intersect(a, b):
    s = {tuple(x) for x in a}
    return np.unique([x for x in b if tuple(x) in s], axis=0)
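Applied to the second example from the question:
array_intersect(array2a, array2b)
# array([[2, 1],
#        [3, 3]])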
The numpy-indexed package (disclaimer: I am its author) was created with the exact purpose of providing such functionality in an expressive and efficient manner:
import numpy_indexed as npi
npi.intersect(a, b)
Note that the implementation is fully vectorized; that is, there are no Python-level loops over the arrays.
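For the arrays from the question, usage would look like this (a sketch, assuming the package is installed via pip; the row order of the result may differ from the other answers):
import numpy as np
import numpy_indexed as npi

array2a = np.array([[1, 2], [3, 3], [2, 1], [1, 3], [2, 1]])
array2b = np.array([[2, 1], [1, 4], [3, 3]])
print(npi.intersect(array2a, array2b))  # the rows [2, 1] and [3, 3]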
arr1 = np.arange(20000).reshape(-1,2)
arr2 = arr1.copy()
np.random.shuffle(arr2)
print(len(arr1)) #10000
%%timeit
res = np.array([x for x in set(tuple(x) for x in arr1) & set(tuple(x) for x in arr2)])
83.7 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I came up with this question while trying to apply a Caesar cipher to a matrix with different shift values for each row, i.e. given a matrix X
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
with shift values of S = array([0, 1, 1]), the output needs to be
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
This is easy to implement by the following code:
Y = []
for i in range(X.shape[0]):
    if S[i] > 0:
        Y.append(X[i, S[i]:].tolist() + X[i, :S[i]].tolist())
    else:
        Y.append(X[i, :].tolist())
Y = np.array(Y)
This is a cyclic left shift. How can I do this more efficiently using numpy arrays?
Update: This example applies the shift to the columns of a matrix. Suppose that we have a 3D array
array([[[8, 1, 8],
[8, 6, 2],
[5, 3, 7]],
[[4, 1, 0],
[5, 9, 5],
[5, 1, 7]],
[[9, 8, 6],
[5, 1, 0],
[5, 5, 4]]])
Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to
array([[[8, 1, 7],
[8, 6, 8],
[5, 3, 2]],
[[4, 1, 7],
[5, 9, 0],
[5, 1, 5]],
[[9, 8, 4],
[5, 1, 6],
[5, 5, 0]]])
Approach #1: Use modulus to implement the cyclic pattern and get the new column indices, then simply use advanced indexing to extract the elements, giving us a vectorized solution, like so -
def cyclic_slice(X, S):
    m, n = X.shape
    idx = np.mod(np.arange(n) + S[:, None], n)
    return X[np.arange(m)[:, None], idx]
Approach #2: We can also leverage the power of strides for further speedup. The idea is to concatenate the sliced-off portion from the start onto the end, create sliding windows of length equal to the number of columns, and finally index into the appropriate window numbers to get the same rolled-over effect. The implementation would be like so -
def cyclic_slice_strided(X, S):
    X2 = np.column_stack((X, X[:, :-1]))
    s0, s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided
    m, n1 = X.shape
    n2 = X2.shape[1]
    X2_3D = strided(X2, shape=(m, n2 - n1 + 1, n1), strides=(s0, s1, s1))
    return X2_3D[np.arange(len(S)), S]
Sample run -
In [34]: X
Out[34]:
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
In [35]: S
Out[35]: array([0, 1, 1])
In [36]: cyclic_slice(X, S)
Out[36]:
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
Runtime test -
In [75]: X = np.random.rand(10000,100)
    ...: S = np.random.randint(0,100,(10000))

# @Moses Koledoye's soln
In [76]: %%timeit
    ...: Y = []
    ...: for i, x in zip(S, X):
    ...:     Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop
In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop
In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop
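On newer NumPy (>= 1.20), the windows from approach #2 can also be built with sliding_window_view instead of raw as_strided; a minimal sketch under that version assumption:
from numpy.lib.stride_tricks import sliding_window_view

def cyclic_slice_window(X, S):
    # pad each row with its own head, take every window of row length,
    # then pick window number S[i] for row i
    X2 = np.column_stack((X, X[:, :-1]))
    windows = sliding_window_view(X2, X.shape[1], axis=1)
    return windows[np.arange(len(S)), S]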
Adaptation for the 3D case
Adapting approach #1 for the 3D case, we would have -
shift = 'left'
axis = 1  # axis along which S is to be used (axis=1 for rows)

n = X.shape[axis]
if shift == 'left':
    Sa = S
else:
    Sa = -S
# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]
# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]
# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]
There could be a way to have a generic solution for a generic axis, but I will keep it to this point.
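As a step toward that generic solution, np.take_along_axis (NumPy >= 1.15) accepts the modulus-based index array directly and takes an axis argument; a sketch for the 2D row case:
def cyclic_slice_ta(X, S):
    n = X.shape[1]
    idx = (np.arange(n) + S[:, None]) % n  # same cyclic indices as approach #1
    return np.take_along_axis(X, idx, axis=1)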
You could shift each row using np.roll and use the new rows to build the output array:
Y = []
for i, x in zip(S, X):
    Y.append(np.roll(x, -i))
print(np.array(Y))
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
I have a really big numpy array (145000 rows * 550 cols), and I want to create rolling slices within subarrays. I tried to implement it with a function. The function lagged_vals behaves as expected, but np.lib.stride_tricks does not behave the way I want it to -
def lagged_vals(series, l):
    # Garbage implementation but still right
    return np.concatenate([[x[i:i+l] for i in range(x.shape[0]) if i+l <= x.shape[0]]
                           for x in series], axis=0)
# Sample 2D numpy array
something = np.array([[1,2,2,3],[2,2,3,3]])
lagged_vals(something,2) # Works as expected
# array([[1, 2],
# [2, 2],
# [2, 3],
# [2, 2],
# [2, 3],
# [3, 3]])
np.lib.stride_tricks.as_strided(something,
                                (something.shape[0]*something.shape[1], 2),
                                (8, 8))
# array([[1, 2],
# [2, 2],
# [2, 3],
# [3, 2], <--- across subarray stride, which I do not want
# [2, 2],
# [2, 3],
# [3, 3])
How do I remove that particular row in the np.lib.stride_tricks implementation? And how can I scale this cross-array stride removal for a big numpy array?
Sure, that's possible with np.lib.stride_tricks.as_strided. Here's one way -
from numpy.lib.stride_tricks import as_strided
L = 2 # window length
shp = a.shape
strd = a.strides
out_shp = shp[0],shp[1]-L+1,L
out_strd = strd + (strd[1],)
out = as_strided(a, out_shp, out_strd).reshape(-1,L)
Sample input, output -
In [177]: a
Out[177]:
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
In [178]: out
Out[178]:
array([[0, 1],
[1, 2],
[2, 3],
[4, 5],
[5, 6],
[6, 7]])
Note that the last step of reshaping forces it to make a copy there. But that can't be avoided if we need the final output to be 2D. If we are okay with a 3D output, skip that reshape and thus get a view, as shown with the sample case -
In [181]: np.shares_memory(a, out)
Out[181]: False
In [182]: as_strided(a, out_shp, out_strd)
Out[182]:
array([[[0, 1],
[1, 2],
[2, 3]],
[[4, 5],
[5, 6],
[6, 7]]])
In [183]: np.shares_memory(a, as_strided(a, out_shp, out_strd) )
Out[183]: True
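On NumPy >= 1.20, the same 3D view can be obtained without manual stride arithmetic via sliding_window_view; a minimal sketch:
from numpy.lib.stride_tricks import sliding_window_view

out3d = sliding_window_view(a, 2, axis=1)  # a view of a, shape (2, 3, 2)
out2d = out3d.reshape(-1, 2)               # copies, same rows as `out` above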
Given an array:
arr = np.array([[1, 3, 7], [4, 9, 8]]); arr
array([[1, 3, 7],
[4, 9, 8]])
And given its indices:
np.indices(arr.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
How would I be able to stack them neatly one against the other to form a new 2D array? This is what I'd like:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
This is my current solution:
def foo(arr):
return np.hstack((np.indices(arr.shape).reshape(2, arr.size).T, arr.reshape(-1, 1)))
It works, but is there something shorter/more elegant to carry this operation out?
Use array initialization and then broadcasted assignment to fill in the indices and the array values in subsequent steps -
def indices_merged_arr(arr):
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    out.shape = (-1, 3)
    return out
Note that we are avoiding the use of np.indices(arr.shape), which could have slowed things down.
Sample run -
In [10]: arr = np.array([[1, 3, 7], [4, 9, 8]])
In [11]: indices_merged_arr(arr)
Out[11]:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
Performance
arr = np.random.randn(100000, 2)
%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 4.97 ms per loop
%timeit pd.DataFrame(indices_merged_arr_divakar(arr), columns=['x', 'y', 'value'])
100 loops, best of 3: 3.82 ms per loop
%timeit pd.DataFrame(indices_merged_arr_eric(arr), columns=['x', 'y', 'value'], dtype=np.float32)
100 loops, best of 3: 5.59 ms per loop
Note: timings include the conversion to a pandas DataFrame, which is the eventual use case for this solution.
A more generic answer for nd arrays, that handles other dtypes correctly:
def indices_merged_arr(arr):
    out = np.empty(arr.shape, dtype=[
        ('index', np.intp, arr.ndim),
        ('value', arr.dtype)
    ])
    out['value'] = arr
    for i, l in enumerate(arr.shape):
        shape = (1,)*i + (-1,) + (1,)*(arr.ndim - 1 - i)
        out['index'][..., i] = np.arange(l).reshape(shape)
    return out.ravel()
This returns a structured array with an index column and a value column, which can be of different types.
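A small run on the question's array, for reference:
arr = np.array([[1, 3, 7], [4, 9, 8]])
res = indices_merged_arr(arr)
print(res['index'])  # [[0 0]
                     #  [0 1]
                     #  [0 2]
                     #  [1 0]
                     #  [1 1]
                     #  [1 2]]
print(res['value'])  # [1 3 7 4 9 8]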
I want to use numpy argwhere to find where a maximum in my data is. Below is a sample set that describes what I am doing:
bins = np.arange(10)
data = np.array([[6],[4],[8],[5]])
np.argwhere(bins<data)
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[2, 4],
[2, 5],
[2, 6],
[2, 7],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4]])
What I want from this data is
array([[0,5],
[1,3],
[2,7],
[3,4]])
This could be done with a for loop, but I was wondering if there was a more pythonic way to do this.
EDIT:
What I have done now is use pandas and groupby. I am still wondering if this is the best method.
t = pd.DataFrame(np.argwhere(bins<data))
time = t.groupby(0)
time.max()
   1
0
0  5
1  3
2  7
3  4
Now that I have this, I have a new problem. Let's say I have another set of data:
BigData = np.array([[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9],
[0,1,2,3,4,5,6,7,8,9]])
How can I use the array I achieved
array([[0,5],
[1,3],
[2,7],
[3,4]])
to average each row of BigData up to (but not including) the index in the second column? I.e.
(0+1+2+3+4) / 5
(0+1+2) / 3
(0+1+2+3+4+5+6) / 7
(0+1+2+3) / 4
would be the result, taking each row's cutoff index from the second column.
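For reference, a minimal sketch of those per-row means, assuming the [[0, 5], ...] index array from above (a plain Python loop, since each row has a different cutoff):
idx = np.array([[0, 5], [1, 3], [2, 7], [3, 4]])
means = np.array([BigData[r, :c].mean() for r, c in idx])
print(means)  # [2.  1.  3.  1.5]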
Here's a fairly short NumPy solution that's also pretty fast:
A = np.argwhere(bins < data)
print(A[np.r_[A[1:, 0] != A[:-1, 0], True]])
Since np.argwhere returns its rows in sorted order, the mask keeps the last row of each first-column group, which is exactly the row with the maximum second-column value.
Here's a NumPy solution. It is not as readable as the Pandas version, but timing suggests it is much faster:
>>> arr = np.argwhere(bins<data)
>>> arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])
>>> %timeit arr[np.where(np.diff(np.vstack((arr, [arr[-1][0]+1, arr[-1][1]])), axis=0)[:,0] > 0)[0]]
10000 loops, best of 3: 32.7 µs per loop
>>> %%timeit
... t = pd.DataFrame(arr)
... time = t.groupby(0)
... time.max()
...
1000 loops, best of 3: 1 ms per loop
The following seems to be pretty fast for me, taking advantage of argmax working left -> right:
>>> bins[::-1][(bins[::-1] < data).argmax(axis=1)]
array([5, 3, 7, 4])
For me %timeit shows that this takes around 11µs.
However, manipulating the array to have the index as the first column (as follows) increases time to around 25µs:
>>> np.column_stack(
... [np.arange(data.shape[0]), bins[::-1][(bins[::-1] < data).argmax(axis=1)]])
array([[0, 5],
[1, 3],
[2, 7],
[3, 4]])