Sorry if the title is a little confusing, but I'll explain more here. Say I have a large array with a small number of unique elements that looks like this:
arr = np.array([[0,0,1,1,1,1,1],
                [0,2,0,0,1,1,1],
                [0,2,0,0,1,1,1],
                [0,2,1,1,1,0,0],
                [0,3,2,2,0,2,1]])
In this case, the array is 5x7 for example purposes, but in reality, I could be working with something as large as a 10000x10000 array (still with a small number of unique elements).
I was wondering how to iterate through each row and 'count' the number of times the value changes as you move from left to right, as well as the number of constant elements between transitions.
For example, in the above array, the first row has 1 transition, and lengths 2 and 5 for the values 0 and 1, respectively. In the second-to-last row, there are 3 transitions, with lengths 1, 1, 3, and 2, for the values 0, 2, 1, and 0, respectively.
Ideally, some function transition_count would take arr above and return something like:
row0: [1, (0,2), (1,5)]
row1: [3, (0,1), (2,1), (0,2), (1,3)]
row2: ...
and so forth.
My thinking is to iterate through each row of the array, arr[i,:], and analyze it separately (maybe as a list?). But even for just a single row, I'm not sure how to count the number of transitions and obtain the length of each constant run.
Any help would be appreciated, thank you!
This works on a per-row basis. Not sure we can readily vectorize further given the jagged nature of the output.
for row in arr:
    d = np.diff(row) != 0                               # True where consecutive values differ
    idx = np.concatenate(([0], np.flatnonzero(d) + 1))  # start index of each constant run
    c = np.diff(np.concatenate((idx, [len(row)])))      # length of each run
    print(len(c))
    print('v', row[idx])
    print('c', c)
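Wrapped into a transition_count function that returns the format asked for (a minimal sketch along the same lines):
import numpy as np

def transition_count(arr):
    # Per row: [number_of_transitions, (value, run_length), ...]
    out = []
    for row in arr:
        d = np.diff(row) != 0                               # True where the value changes
        idx = np.concatenate(([0], np.flatnonzero(d) + 1))  # start of each constant run
        c = np.diff(np.concatenate((idx, [len(row)])))      # length of each run
        out.append([len(c) - 1] + list(zip(row[idx], c)))   # transitions = runs - 1
    return out

# transition_count(arr)[0] -> [1, (0, 2), (1, 5)]
# transition_count(arr)[1] -> [3, (0, 1), (2, 1), (0, 2), (1, 3)]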
Here is a fully vectorized solution, if you are willing to accept a slightly different output format:
d = np.diff(arr, axis=1) != 0
t = np.ones(shape=arr.shape, dtype=bool)   # True at the start of each run
t[:, 1:] = d
e = np.ones(shape=arr.shape, dtype=bool)   # True at the end of each run
e[:, :-1] = d
sr, sc = np.nonzero(t)   # rows/cols of run starts
er, ec = np.nonzero(e)   # rows/cols of run ends
v = arr[sr, sc]          # value of each run
print(sr)
print(sc)
print(v)
print(ec - sc + 1)       # length of each run
Note: you can group and split these outputs by sr to arrive at your originally stated format; but usually it is best to stay away from jagged arrays entirely if you can (and you almost always can!), including in any downstream processing.
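One way to do that grouping (a sketch using the sr, sc, ec, v arrays from the snippet above):
lengths = ec - sc + 1
pairs = np.stack((v, lengths), axis=1)
# sr is sorted (np.nonzero returns row-major order), so searchsorted finds
# where each new row's runs begin
splits = np.searchsorted(sr, np.arange(1, arr.shape[0]))
per_row = np.split(pairs, splits)
# per_row[1] -> array([[0, 1], [2, 1], [0, 2], [1, 3]])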
Here's a vectorized way to get all values and counts -
# Look for interval changes and pad with bool 1s on either side to set the
# first interval for each row and to mark the boundary with the next row
p = np.ones((len(a),1), dtype=bool)
m = np.hstack((p, a[:,:-1]!=a[:,1:], p))
# Look for interval change indices in flattened array version
intv = m.sum(1).cumsum()-1
# Get index and counts
idx = np.diff(np.flatnonzero(m.ravel()))
count = np.delete(idx, intv[:-1])
val = a[m[:,:-1]]
To get the final per-row split, slice based on rows -
# Get couples and setup offsetted interval change indices
grps = np.c_[val,count]
intvo = np.r_[0,intv-np.arange(len(intv))]
# Finally slice and get output
out = [grps[i:j] for (i,j) in zip(intvo[:-1], intvo[1:])]
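For the sample array from the question, the per-row groups should come out like so (each row of out[i] is a (value, count) pair):
out[0]   # row 0: two 0s, then five 1s
# array([[0, 2],
#        [1, 5]])
out[3]   # row 3: runs of 0, 2, 1, 0 with lengths 1, 1, 3, 2
# array([[0, 1],
#        [2, 1],
#        [1, 3],
#        [0, 2]])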
Benchmarking
Solutions to get counts and values, as functions:
# @Eelco Hoogendoorn's soln
def eh(arr):
    d = np.diff(arr, axis=1) != 0
    t = np.ones(shape=arr.shape, dtype=bool)
    t[:, 1:] = d
    e = np.ones(shape=arr.shape, dtype=bool)
    e[:, :-1] = d
    sr, sc = np.nonzero(t)
    er, ec = np.nonzero(e)
    v = arr[sr, sc]
    return ec - sc + 1, v
# Function form of proposed solution from this post
def grouped_info(a):
    p = np.ones((len(a),1), dtype=bool)
    m = np.hstack((p, a[:,:-1]!=a[:,1:], p))
    intv = m.sum(1).cumsum()-1
    idx = np.diff(np.flatnonzero(m.ravel()))
    count = np.delete(idx, intv[:-1])
    val = a[m[:,:-1]]
    return count, val
We will try to get closer to your actual 10000x10000 use case by repeating the given sample along both axes and timing the proposed solutions.
In [48]: a
Out[48]:
array([[0, 0, 1, 1, 1, 1, 1],
       [0, 2, 0, 0, 1, 1, 1],
       [0, 2, 0, 0, 1, 1, 1],
       [0, 2, 1, 1, 1, 0, 0],
       [0, 3, 2, 2, 0, 2, 1]])
In [49]: a = np.repeat(np.repeat(a,1000,axis=0),1000,axis=1)
In [50]: %timeit grouped_info(a)
126 ms ± 7.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit eh(a)
389 ms ± 41.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Say I have two arrays, A and B:
A = np.array([[1,2,3],
              [4,5,6]])
B = np.array([[1,0,1],
              [0,0,1]])
I want to do an element-wise multiplication in a convolutional-like manner, i.e. cyclically shift the columns one step at a time (so, for example, column 1 becomes column 2 and column 3 wraps around to become column 1) and multiply each shifted copy element-wise with B.
This should yield a (3, 2, 3) array: one 2x3 product for each of the 3 shifts.
We can concatenate A with a slice of itself and then get sliding windows out of that. To get those windows, we can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided. Then, multiply those windows with B for the final output. More info on the use of as_strided-based view_as_windows.
Hence, we will have one vectorized solution like so -
In [70]: from skimage.util.shape import view_as_windows
In [71]: A1 = np.concatenate((A,A[:,:-1]),axis=1)
In [74]: view_as_windows(A1,A.shape)[0]*B
Out[74]:
array([[[1, 0, 3],
        [0, 0, 6]],

       [[2, 0, 1],
        [0, 0, 4]],

       [[3, 0, 2],
        [0, 0, 5]]])
We can also leverage multiple cores with the numexpr module for the final step of broadcasted multiplication, which should do better on larger arrays. For the sample case:
In [53]: import numexpr as ne
In [54]: w = view_as_windows(A1,A.shape)[0]
In [55]: ne.evaluate('w*B')
Out[55]:
array([[[1, 0, 3],
        [0, 0, 6]],

       [[2, 0, 1],
        [0, 0, 4]],

       [[3, 0, 2],
        [0, 0, 5]]])
Timings on large arrays comparing the two proposed methods -
In [56]: A = np.random.rand(500,500)
...: B = np.random.rand(500,500)
In [57]: A1 = np.concatenate((A,A[:,:-1]),axis=1)
...: w = view_as_windows(A1,A.shape)[0]
In [58]: %timeit w*B
...: %timeit ne.evaluate('w*B')
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 228 ms per loop
Squeezing the best out of the strided-based method
If you really want to squeeze the best out of the strided-view-based approach, go with the original np.lib.stride_tricks.as_strided based one to avoid the function-call overhead of view_as_windows -
def vaw_with_as_strided(A,B):
    A1 = np.concatenate((A,A[:,:-1]),axis=1)
    s0,s1 = A1.strides
    S = (A.shape[1],)+A.shape
    w = np.lib.stride_tricks.as_strided(A1,shape=S,strides=(s1,s0,s1))
    return w*B
Comparing against @Paul Panzer's array-assignment based one, the crossover seems to be at 19x19 shaped arrays -
In [33]: n = 18
...: A = np.random.rand(n,n)
...: B = np.random.rand(n,n)
In [34]: %timeit vaw_with_as_strided(A,B)
...: %timeit pp(A,B)
10000 loops, best of 3: 22.4 µs per loop
10000 loops, best of 3: 21.4 µs per loop
In [35]: n = 19
...: A = np.random.rand(n,n)
...: B = np.random.rand(n,n)
In [36]: %timeit vaw_with_as_strided(A,B)
...: %timeit pp(A,B)
10000 loops, best of 3: 24.5 µs per loop
10000 loops, best of 3: 24.5 µs per loop
So, for anything smaller than 19x19, array-assignment seems to be better and for larger than those, strided-based one should be the way to go.
Just a note on view_as_windows/as_strided. Neat as these functions are, it is useful to know that they have a rather pronounced constant overhead. Here is a comparison between @Divakar's view_as_windows based solution (vaw) and a copy-reshape based approach by me.
As you can see, vaw is not very fast on small to medium sized operands and only begins to shine above array sizes of about 30x30.
Code:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from skimage.util.shape import view_as_windows
B = BenchmarkBuilder()
@B.add_function()
def vaw(A,B):
    A1 = np.concatenate((A,A[:,:-1]),axis=1)
    w = view_as_windows(A1,A.shape)[0]
    return w*B

@B.add_function()
def pp(A,B):
    m,n = A.shape
    aux = np.empty((n,m,2*n),A.dtype)
    AA = np.concatenate([A,A],1)
    aux.reshape(-1)[:-n].reshape(n,-1)[...] = AA.reshape(-1)[:-1]
    return aux[...,:n]*B

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 16):
        dim_size = int(1.4**exp)
        a = np.random.rand(dim_size,dim_size)
        b = np.random.rand(dim_size,dim_size)
        yield dim_size, MultiArgument([a,b])
r = B.run()
r.plot()
import pylab
pylab.savefig('vaw.png')
Run a for loop over the number of columns and use np.roll() along axis=1 to shift the columns, then do the element-wise multiplication.
Refer to the accepted answer in this reference.
Hope this helps.
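A sketch of that suggestion, using the A and B from the question (note the shift needs to be negative, i.e. a roll to the left, to reproduce the window order shown in the accepted answer):
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[1, 0, 1], [0, 0, 1]])

# Stack A rolled left by 0, 1, ..., n-1 columns, then multiply each shift by B.
out = np.stack([np.roll(A, -k, axis=1) for k in range(A.shape[1])]) * B
# out matches Out[74] above
This makes one full pass over A per shift, so the strided approaches above should still win on large inputs.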
I could actually pad the array on both sides with 2 columns (to get a 2x5 array)
and run a conv2 with B as the kernel; I think that would be more efficient.
I have the following numpy row matrix.
X = np.array([1,2,3])
I want to create a block matrix as follows:
1 0 0
2 1 0
3 2 1
0 3 2
0 0 3
How can I do this using numpy?
If you read the desired output matrix top-down, then left-right, you see the pattern 1,2,3, 0,0,0, 1,2,3, 0,0,0, 1,2,3. You can use that pattern to easily create a linear array, and then reshape it into the two-dimensional form:
import numpy as np
X = np.array([1,2,3])
N = len(X)
zeros = np.zeros_like(X)
m = np.hstack((np.tile(np.hstack((X,zeros)),N-1),X)).reshape(N,-1).T
print(m)
gives
[[1 0 0]
 [2 1 0]
 [3 2 1]
 [0 3 2]
 [0 0 3]]
Approach #1 : Using np.lib.stride_tricks.as_strided -
from numpy.lib.stride_tricks import as_strided as strided
def zeropad_arr_v1(X):
    n = len(X)
    z = np.zeros(len(X)-1,dtype=X.dtype)
    X_ext = np.concatenate((z, X, z))
    s = X_ext.strides[0]
    return strided(X_ext[n-1:], (2*n-1,n), (s,-s), writeable=False)
Note that this creates a read-only output. If you need to write to it later on, simply make a copy by appending .copy() at the end.
Approach #2 : Using concatenation with zeros and then clipping/slicing -
def zeropad_arr_v2(X):
    n = len(X)
    X_ext = np.concatenate((X, np.zeros(n,dtype=X.dtype)))
    return np.tile(X_ext, n)[:-n].reshape(-1,n,order='F')
Approach #1, being a strides-based method, should be very efficient.
Sample runs -
In [559]: X = np.array([1,2,3])
In [560]: zeropad_arr_v1(X)
Out[560]:
array([[1, 0, 0],
       [2, 1, 0],
       [3, 2, 1],
       [0, 3, 2],
       [0, 0, 3]])

In [561]: zeropad_arr_v2(X)
Out[561]:
array([[1, 0, 0],
       [2, 1, 0],
       [3, 2, 1],
       [0, 3, 2],
       [0, 0, 3]])
Runtime test
In [611]: X = np.random.randint(0,9,(1000))
# Approach #1 (read-only)
In [612]: %timeit zeropad_arr_v1(X)
100000 loops, best of 3: 8.74 µs per loop
# Approach #1 (writable)
In [613]: %timeit zeropad_arr_v1(X).copy()
1000 loops, best of 3: 1.05 ms per loop
# Approach #2
In [614]: %timeit zeropad_arr_v2(X)
1000 loops, best of 3: 705 µs per loop
# @user8153's solution
In [615]: %timeit hstack_app(X)
100 loops, best of 3: 2.26 ms per loop
Another writable solution:
def block(X):
    n = X.size
    zeros = np.zeros((2*n-1, n), X.dtype)
    zeros[::2] = X
    return zeros.reshape(n,-1).T
Timing:
In [2]: %timeit block(X)
600 µs ± 33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For example, I have two ndarrays: the shape of train_dataset is (10000, 28, 28) and the shape of val_dataset is (2000, 28, 28).
Other than iterating, is there an efficient way to use the numpy array functions to find the overlap between the two ndarrays?
One trick I learnt from Jaime's excellent answer here is to use an np.void dtype in order to view each row in the input arrays as a single element. This allows you to treat them as 1D arrays, which can then be passed to np.in1d or one of the other set routines.
import numpy as np
def find_overlap(A, B):
    if not A.dtype == B.dtype:
        raise TypeError("A and B must have the same dtype")
    if not A.shape[1:] == B.shape[1:]:
        raise ValueError("the shapes of A and B must be identical apart from "
                         "the row dimension")

    # reshape A and B to 2D arrays, forcing a copy if necessary in order to
    # ensure that they are C-contiguous
    A = np.ascontiguousarray(A.reshape(A.shape[0], -1))
    B = np.ascontiguousarray(B.reshape(B.shape[0], -1))

    # void type that views each row in A and B as a single item
    t = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))

    # use in1d to find rows in A that are also in B
    return np.in1d(A.view(t), B.view(t))
For example:
gen = np.random.RandomState(0)
A = gen.randn(1000, 28, 28)
dupe_idx = gen.choice(A.shape[0], size=200, replace=False)
B = A[dupe_idx]
A_in_B = find_overlap(A, B)
print(np.all(np.where(A_in_B)[0] == np.sort(dupe_idx)))
# True
This method is much more memory-efficient than Divakar's, since it doesn't require broadcasting out to an (m, n, ...) boolean array. In fact, if A and B are row-major then no copying is required at all.
For comparison I've slightly adapted Divakar's and B. M.'s solutions.
def divakar(A, B):
    A.shape = A.shape[0], -1
    B.shape = B.shape[0], -1
    return (B[:,None] == A).all(axis=2).any(0)

def bm(A, B):
    t = 'S' + str(A.size // A.shape[0] * A.dtype.itemsize)
    ma = np.frombuffer(np.ascontiguousarray(A), t)
    mb = np.frombuffer(np.ascontiguousarray(B), t)
    return (mb[:, None] == ma).any(0)
Benchmarks:
In [1]: na = 1000; nb = 200; rowshape = 28, 28
In [2]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
divakar(A, B)
....:
1 loops, best of 3: 244 ms per loop
In [3]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
....:
100 loops, best of 3: 2.81 ms per loop
In [4]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
....:
100 loops, best of 3: 15 ms per loop
As you can see, B. M.'s solution is slightly faster than mine for small n, but np.in1d scales better than testing equality for all elements (O(n log n) rather than O(n²) complexity).
In [5]: na = 10000; nb = 2000; rowshape = 28, 28
In [6]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
....:
1 loops, best of 3: 271 ms per loop
In [7]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
....:
10 loops, best of 3: 123 ms per loop
Divakar's solution is intractable on my laptop for arrays of this size, since it requires generating a 15GB intermediate array whereas I only have 8GB RAM.
Memory permitting, you could use broadcasting, like so -
val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]
Sample run -
In [55]: train_dataset
Out[55]:
array([[[1, 1],
        [1, 1]],

       [[1, 0],
        [0, 0]],

       [[0, 0],
        [0, 1]],

       [[0, 1],
        [0, 0]],

       [[1, 1],
        [1, 0]]])

In [56]: val_dataset
Out[56]:
array([[[0, 1],
        [1, 0]],

       [[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])

In [57]: val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]
Out[57]:
array([[[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])
If the elements are integers, you could collapse every axis=(1,2) block of the input arrays into a scalar, treating each block as a linearly indexable number, and then efficiently use np.in1d or np.intersect1d to find the matches.
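A sketch of what that collapsing could look like (block_ids is a hypothetical helper; since it treats each flattened block as a multi-dimensional index, it is only viable while the product of the per-position value ranges fits in an integer, i.e. for small blocks rather than full 28x28 images):
import numpy as np

def block_ids(x, dims):
    # Map each block to a single linear index over the value ranges in dims.
    flat = x.reshape(x.shape[0], -1)
    return np.ravel_multi_index(flat.T, dims)

# a, b: integer arrays of shape (n, r, c); dims must be shared by both
dims = np.maximum(a.reshape(len(a), -1).max(0), b.reshape(len(b), -1).max(0)) + 1
mask = np.in1d(block_ids(a, dims), block_ids(b, dims))  # rows of a present in b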
Full broadcasting generates here a 10000x2000x28x28 boolean array, roughly 15 GB.
For efficiency, you can:
pack the data, so the pairwise comparison array is only 10000x2000, about 20 MB:
from pylab import *
N = 10000
a = rand(N,28,28)
b = a[randint(0,N,N//5)]
packedtype = 'S' + str(a.size//a.shape[0]*a.dtype.itemsize)  # 'S6272'
ma = frombuffer(a, packedtype)  # ma.shape = (10000,)
mb = frombuffer(b, packedtype)  # mb.shape = (2000,)
%timeit a[:,None]==b    # 102 s
%timeit ma[:,None]==mb  # 800 ms
allclose((a[:,None]==b).all((2,3)), ma[:,None]==mb)  # True
Beyond the memory savings, the packed version benefits from lazy string comparison, which breaks at the first differing byte:
In [31]: %timeit a[:100]==b[:100]
10000 loops, best of 3: 175 µs per loop
In [32]: %timeit a[:100]==a[:100]
10000 loops, best of 3: 133 µs per loop
In [34]: %timeit ma[:100]==mb[:100]
100000 loops, best of 3: 7.55 µs per loop
In [35]: %timeit ma[:100]==ma[:100]
10000 loops, best of 3: 156 µs per loop
Matching indices are then given by (ma[:,None]==mb).nonzero().
Or use in1d, for (Na+Nb) ln(Na+Nb) complexity, against the Na*Nb of the full comparison:
%timeit in1d(ma,mb).nonzero()  # 590 ms
Not a big gain here, but asymptotically better.
Solution
def overlap(a,b):
    """
    returns a boolean index array for input array b representing
    elements in b that are also found in a
    """
    aa = a.repeat(b.shape[0], axis=0)  # each block of a repeated len(b) times
    bb = b.repeat(a.shape[0], axis=0)  # each block of b repeated len(a) times
    c = aa == bb
    c = c[::a.shape[0]]
    return c.all(axis=(1,2))
You can use the returned index array to index b and extract the elements which are also found in a:
b[overlap(a,b)]
Explanation
For simplicity's sake I assume you have imported everything from numpy for this example:
from numpy import *
So, for example, given two ndarrays
a = arange(4*2*2).reshape(4,2,2)
b = arange(3*2*2).reshape(3,2,2)
we repeat a and b so that they have the same shape
aa = a.repeat(b.shape[0],axis=0)
bb = b.repeat(a.shape[0],axis=0)
we can then simply compare the elements of aa and bb
c = aa == bb
Finally, to get the indices of the elements in b which are also found in a, we look at every 4th, or in general every a.shape[0]-th, element of c:
cc = c[::a.shape[0]]
Then we reduce each sub-array to a single boolean that is True only where all of its entries are True:
cc.all(axis=(1,2))
In our example we get
array([True, True, True], dtype=bool)
To check, change the first element of b
b[0] = array([[50,60],[70,80]])
and we get
array([False, True, True], dtype=bool)
This question comes from Google's online deep learning course.
The following is my solution:
count = 0  # number of overlapping rows
for i in range(val_dataset.shape[0]):  # iterate over all rows of val_dataset
    # a row overlaps if it equals at least one row of train_dataset
    overlap = (train_dataset == val_dataset[i,:,:]).all(axis=1).all(axis=1).sum()
    if overlap:
        count += 1
print(count)
Broadcasting is used for the per-row comparison instead of an inner loop. You may test the performance difference.
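Memory permitting, the remaining loop can also be collapsed into one broadcast expression, along the lines of the earlier answers (a sketch):
# builds the (n_train, n_val, 28, 28) boolean intermediate discussed above
matches = (train_dataset[:, None] == val_dataset).all(axis=(2, 3))
print(matches.any(axis=0).sum())  # number of val_dataset rows found in train_dataset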
I'm looking for tips on how to write a function (or a recommendation of one that already exists) that calculates the difference between all entries in an array, i.e. an implementation of diff() but for all entry combinations, not just consecutive pairs.
Here is an example of what I want:
# example array
a = [3, 2, 5, 1]
Now we want to apply a function that returns the difference between all combinations of entries. Given that len(a) == 4, the total number of combinations is N*(N-1)*0.5 = 6 for N = 4 (if the length of a were 5, the total would be 10, and so on). The function should return the following for vector a:
result = some_function(a)
print result
array([-1, 2, -2, 3, -1, -4])
So the function would be similar to pdist, but instead of calculating the Euclidean distance it should simply calculate the difference between the Cartesian coordinates along one axis, e.g. the z-axis if we assume the entries in a are coordinates. As you can see, I need the sign of each difference to know which side of the axis each point lies on.
Thanks.
Something like this?
>>> import itertools as it
>>> a = [3, 2, 5, 1]
>>> [y - x for x, y in it.combinations(a, 2)]
[-1, 2, -2, 3, -1, -4]
So I tried out the methods proposed by wim and Joe (and their combined suggestion), and this is what I came up with:
import itertools as it
import numpy as np
a = np.random.randint(10, size=1000)
def cartesian_distance(x):
    return np.subtract.outer(x,x)[np.tril_indices(x.shape[0],k=-1)]
%timeit cartesian_distance(a)
%timeit [y - x for x, y in it.combinations(a, 2)]
10 loops, best of 3: 97.9 ms per loop
1 loops, best of 3: 333 ms per loop
For smaller entries:
a = np.random.randint(10, size=10)
def cartesian_distance(x):
    return np.subtract.outer(x,x)[np.tril_indices(x.shape[0],k=-1)]
%timeit cartesian_distance(a)
%timeit [y - x for x, y in it.combinations(a, 2)]
10000 loops, best of 3: 78.6 µs per loop
10000 loops, best of 3: 40.1 µs per loop