I have a large numpy array. Is there a way to subtract from each element every element that comes after it, and store the results in a new list/array, without using a loop?
A simple example of what I mean:
a = numpy.array([4,3,2,1])
result = [4-3, 4-2, 4-1, 3-2, 3-1, 2-1] = [1, 2, 3, 1, 2, 1]
Note that the 'real' array I am working with doesn't contain numbers in sequence. This is just to make the example simple.
I know the result should have n*(n-1)/2 elements, where n is the size of the array.
Is there a way to do this without using a loop, but by repeating the array in a 'smart' way?
Thanks!
temp = a[:, None] - a
result = temp[np.triu_indices(len(a), k=1)]
Perform all pairwise subtractions to produce temp, including subtracting elements from themselves and subtracting earlier elements from later elements, then use triu_indices to select the results we want. (a[:, None] adds an extra length-1 axis to a.)
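For reference, here is what temp and result look like for the small array from the question (a quick illustrative sketch, not part of the original answer):
import numpy as np

a = np.array([4, 3, 2, 1])
temp = a[:, None] - a                        # shape (4, 4): temp[i, j] == a[i] - a[j]
result = temp[np.triu_indices(len(a), k=1)]  # keep only entries strictly above the diagonal
print(temp)
# [[ 0  1  2  3]
#  [-1  0  1  2]
#  [-2 -1  0  1]
#  [-3 -2 -1  0]]
print(result)                                # [1 2 3 1 2 1]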
Note that almost all of the runtime is spent constructing result from temp (because triu_indices is slow and using indices to select the upper triangle of an array is slow). If you can use temp directly, you can save a lot of time:
In [13]: a = numpy.arange(2000)
In [14]: %%timeit
....: temp = a[:, None] - a
....:
100 loops, best of 3: 6.99 ms per loop
In [15]: %%timeit
....: temp = a[:, None] - a
....: result = temp[numpy.triu_indices(len(a), k=1)]
....:
10 loops, best of 3: 51.7 ms per loop
Here's a masking-based approach for the extraction after the broadcasted subtractions. The mask creation again makes use of broadcasting (double broadcasting powered, so to speak) -
r = np.arange(a.size)
out = (a[:, None] - a)[r[:,None] < r]
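To make the mask concrete, here is what it looks like on the small example from the question (an illustrative check only):
import numpy as np

a = np.array([4, 3, 2, 1])
r = np.arange(a.size)
mask = r[:, None] < r          # True strictly above the diagonal
print(mask.astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
print((a[:, None] - a)[mask])  # [1 2 3 1 2 1]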
Runtime test
Vectorized approaches -
# @user2357112's solution
def pairwise_diff_triu_indices_based(a):
    return (a[:, None] - a)[np.triu_indices(len(a), k=1)]

# Proposed in this post
def pairwise_diff_masking_based(a):
    r = np.arange(a.size)
    return (a[:, None] - a)[r[:,None] < r]
Timings -
In [109]: a = np.arange(2000)
In [110]: %timeit pairwise_diff_triu_indices_based(a)
10 loops, best of 3: 36.1 ms per loop
In [111]: %timeit pairwise_diff_masking_based(a)
100 loops, best of 3: 11.8 ms per loop
A closer look at the performance parameters involved
Let's dig a bit deeper into these timings to study how much the mask-based approach helps. There are two parts to compare: mask creation vs. indices creation, and mask-based boolean indexing vs. integer-based indexing.
How much does mask creation help?
In [37]: r = np.arange(a.size)
In [38]: %timeit np.arange(a.size)
1000000 loops, best of 3: 1.88 µs per loop
In [39]: %timeit r[:,None] < r
100 loops, best of 3: 3 ms per loop
In [40]: %timeit np.triu_indices(len(a), k=1)
100 loops, best of 3: 14.7 ms per loop
About 5x improvement on mask creation over index setup.
How much does boolean indexing help over integer-based indexing?
In [41]: mask = r[:,None] < r
In [42]: idx = np.triu_indices(len(a), k=1)
In [43]: subs = a[:, None] - a
In [44]: %timeit subs[mask]
100 loops, best of 3: 4.15 ms per loop
In [45]: %timeit subs[idx]
100 loops, best of 3: 10.9 ms per loop
About 2.5x improvement here.
a = [4, 3, 2, 1]
differences = ((x - y) for i, x in enumerate(a) for y in a[i+1:])
for diff in differences:
    # do something with the difference
    pass
Check out itertools.combinations:
from itertools import combinations
l = [4, 3, 2, 1]
result = []
for n1, n2 in combinations(l, 2):
    result.append(n1 - n2)
print result
Results in:
[1, 2, 3, 1, 2, 1]
combinations returns an iterator, so this is memory-friendly even for very large lists :)
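If you just want the list in one go, the same idea can be written as a list comprehension (equivalent, just more compact):
from itertools import combinations

l = [4, 3, 2, 1]
result = [n1 - n2 for n1, n2 in combinations(l, 2)]
print(result)  # [1, 2, 3, 1, 2, 1]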
simple example:
a = array([[[1, 0, 0],
            [0, 2, 0],
            [0, 0, 3]],
           [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]])
result = []
for i in a:
    result.append(i.sum())
# result == [6, 3]
Is there a numpy function doing this faster? If it helps: a contains only diagonal matrices.
Edit:
I just realized that a contains scipy CSC sparse matrices, i.e. it is a 1-D numpy array containing matrices, and I cannot apply the sum function with axis=(1, 2).
A proper use of the axis parameter of np.sum() would do:
import numpy as np
np.sum(a, axis=(1, 2))
# [6, 3]
While the above should be the generic preferred method, if your input is actually diagonal over axes 1 and 2, then summing all the zeros is bound to be inefficient (read O(n²k), with the same n and k as in the gen_a() function below). Using np.sum() on np.diag() inside a loop can be much better (read O(nk)). Possibly, using a list comprehension is the way to go:
import numpy as np
np.array([np.sum(np.diag(x)) for x in a])
# [6, 3]
To give some idea of the relative speed, let's write a function to generate inputs of arbitrary size:
def gen_a(n, k):
    return np.array([
        np.diag(np.ones(n, dtype=int))
        if i % 2 else
        np.diag(np.arange(1, n + 1, dtype=int))
        for i in range(k)])
print(gen_a(3, 2))
# [[[1 0 0]
#   [0 2 0]
#   [0 0 3]]
#  [[1 0 0]
#   [0 1 0]
#   [0 0 1]]]
Now we can time it for different input sizes. I have also included a list comprehension without the np.diag() call, which is essentially a slightly more concise version of your approach.
a = gen_a(3, 2)
%timeit np.array([np.sum(np.diag(x)) for x in a])
# 100000 loops, best of 3: 16 µs per loop
%timeit np.sum(a, axis=(1, 2))
# 100000 loops, best of 3: 4.51 µs per loop
%timeit np.array([np.sum(x) for x in a])
# 100000 loops, best of 3: 10 µs per loop
a = gen_a(3000, 2)
%timeit np.array([np.sum(np.diag(x)) for x in a])
# 10000 loops, best of 3: 20.5 µs per loop
%timeit np.sum(a, axis=(1, 2))
# 100 loops, best of 3: 17.8 ms per loop
%timeit np.array([np.sum(x) for x in a])
# 100 loops, best of 3: 17.8 ms per loop
a = gen_a(3, 2000)
%timeit np.array([np.sum(np.diag(x)) for x in a])
# 100 loops, best of 3: 14.8 ms per loop
%timeit np.sum(a, axis=(1, 2))
# 10000 loops, best of 3: 34 µs per loop
%timeit np.array([np.sum(x) for x in a])
# 100 loops, best of 3: 8.93 ms per loop
a = gen_a(300, 200)
%timeit np.array([np.sum(np.diag(x)) for x in a])
# 1000 loops, best of 3: 1.67 ms per loop
%timeit np.sum(a, axis=(1, 2))
# 100 loops, best of 3: 17.8 ms per loop
%timeit np.array([np.sum(x) for x in a])
# 100 loops, best of 3: 19.3 ms per loop
And we observe that, depending on the values of n and k, one or the other solution is faster.
For larger n, the list comprehension gets faster, but only if np.diag() is used.
Conversely, for smaller n and larger k, the raw speed of np.sum() can outperform the explicit looping.
Let's say I have two large 2-D numpy arrays of the same dimensions (say 2000x2000). I want to sum them element-wise. I was wondering if there is a faster way than np.add().
Edit: I am adding an example similar to what I am using now. Is there a way to speed this up?
# a and b are the two matrices I already have. Dimension is 2000x2000
# shift is also a list that is previously known
for j in range(100000):
    b = np.roll(b, shift[j], axis=0)
    a = np.add(a, b)
Approach #1 (Vectorized)
We can use modulo arithmetic to simulate the circular behavior of roll/circshift, and with broadcasted indices covering all rows we get a fully vectorized approach, like so -
n = b.shape[0]
idx = n-1 - np.mod(shift.cumsum()[:,None]-1 - np.arange(n), n)
a += b[idx].sum(0)
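As a quick sanity check (illustrative only, with small made-up inputs, not part of the original answer), the index formula reproduces the result of explicit rolling:
import numpy as np

n = 4
b = np.arange(n * n).reshape(n, n)
shift = np.array([1, 3, 2])

# Reference: the original rolling loop
a_ref = np.zeros_like(b)
bb = b.copy()
for s in shift:
    bb = np.roll(bb, s, axis=0)
    a_ref += bb

# Vectorized index formula from above
idx = n-1 - np.mod(shift.cumsum()[:,None]-1 - np.arange(n), n)
a_vec = b[idx].sum(0)

print(np.array_equal(a_ref, a_vec))  # True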
Approach #2 (Loopy one)
b_ext = np.row_stack((b, b[:-1]))
start_idx = n-1 - np.mod(shift.cumsum()-1, n)
for j in range(start_idx.size):
    a += b_ext[start_idx[j]:start_idx[j]+n]
Colon notation vs. indices for slicing
The idea here is to do minimal work once we are inside the loop. We pre-compute the start row index of each iteration before entering the loop, so all we need to do inside it is slice with colon notation, which gives a view into the array, and add it up. This should be much better than rolling, which has to compute all of those row indices and produce an expensive copy at every iteration.
Here's a bit more into the view and copy concepts when slicing with colon and indices -
In [11]: a = np.random.randint(0,9,(10))
In [12]: a
Out[12]: array([8, 0, 1, 7, 5, 0, 6, 1, 7, 0])
In [13]: a[3:8]
Out[13]: array([7, 5, 0, 6, 1])
In [14]: a[[3,4,5,6,7]]
Out[14]: array([7, 5, 0, 6, 1])
In [15]: np.may_share_memory(a, a[3:8])
Out[15]: True
In [16]: np.may_share_memory(a, a[[3,4,5,6,7]])
Out[16]: False
Runtime test
Function definitions -
def original_loopy_app(a,b):
    for j in range(shift.size):
        b = np.roll(b, shift[j], axis=0)
        a += b

def vectorized_app(a,b):
    n = b.shape[0]
    idx = n-1 - np.mod(shift.cumsum()[:,None]-1 - np.arange(n), n)
    a += b[idx].sum(0)

def modified_loopy_app(a,b):
    n = b.shape[0]
    b_ext = np.row_stack((b, b[:-1]))
    start_idx = n-1 - np.mod(shift.cumsum()-1, n)
    for j in range(start_idx.size):
        a += b_ext[start_idx[j]:start_idx[j]+n]
Case #1:
In [5]: # Setup input arrays
...: N = 200
...: M = 1000
...: a = np.random.randint(11,99,(N,N))
...: b = np.random.randint(11,99,(N,N))
...: shift = np.random.randint(0,N,M)
...:
In [6]: original_loopy_app(a1,b1)
...: vectorized_app(a2,b2)
...: modified_loopy_app(a3,b3)
...:
In [7]: np.allclose(a1, a2) # Verify results
Out[7]: True
In [8]: np.allclose(a1, a3) # Verify results
Out[8]: True
In [9]: %timeit original_loopy_app(a1,b1)
...: %timeit vectorized_app(a2,b2)
...: %timeit modified_loopy_app(a3,b3)
...:
10 loops, best of 3: 107 ms per loop
10 loops, best of 3: 137 ms per loop
10 loops, best of 3: 48.2 ms per loop
Case #2:
In [13]: # Setup input arrays (datasets are exactly 1/10th of original sizes)
...: N = 200
...: M = 10000
...: a = np.random.randint(11,99,(N,N))
...: b = np.random.randint(11,99,(N,N))
...: shift = np.random.randint(0,N,M)
...:
In [14]: %timeit original_loopy_app(a1,b1)
...: %timeit modified_loopy_app(a3,b3)
...:
1 loops, best of 3: 1.11 s per loop
1 loops, best of 3: 481 ms per loop
So, we are looking at 2x+ speedup there with the modified loopy approach!
I'm using numpy einsum to calculate the dot products of an array of column vectors pts, of shape (3, N), with itself, resulting in a matrix dotps, of shape (N, N), containing all the dot products. This is the code I use:
dotps = np.einsum('ij,ik->jk', pts, pts)
This works, but I only need the values above the main diagonal, i.e. the upper triangular part of the result without the diagonal. Is it possible to compute only these values with einsum, or in any other way that is faster than using einsum to compute the whole matrix?
My pts array can be quite large, so if I could calculate only the values I need, that would roughly double my computation speed.
You can slice relevant columns and then use np.einsum -
R,C = np.triu_indices(N,1)
out = np.einsum('ij,ij->j',pts[:,R],pts[:,C])
Sample run -
In [109]: N = 5
...: pts = np.random.rand(3,N)
...: dotps = np.einsum('ij,ik->jk', pts, pts)
...:
In [110]: dotps
Out[110]:
array([[ 0.26529103, 0.30626052, 0.18373867, 0.13602931, 0.51162729],
[ 0.30626052, 0.56132272, 0.5938057 , 0.28750708, 0.9876753 ],
[ 0.18373867, 0.5938057 , 0.84699103, 0.35788749, 1.04483158],
[ 0.13602931, 0.28750708, 0.35788749, 0.18274288, 0.4612556 ],
[ 0.51162729, 0.9876753 , 1.04483158, 0.4612556 , 1.82723949]])
In [111]: R,C = np.triu_indices(N,1)
...: out = np.einsum('ij,ij->j',pts[:,R],pts[:,C])
...:
In [112]: out
Out[112]:
array([ 0.30626052, 0.18373867, 0.13602931, 0.51162729, 0.5938057 ,
0.28750708, 0.9876753 , 0.35788749, 1.04483158, 0.4612556 ])
Optimizing further -
Let's time our approach and see if there's any scope for improvement performance-wise.
In [126]: N = 5000
In [127]: pts = np.random.rand(3,N)
In [128]: %timeit np.triu_indices(N,1)
1 loops, best of 3: 413 ms per loop
In [129]: R,C = np.triu_indices(N,1)
In [130]: %timeit np.einsum('ij,ij->j',pts[:,R],pts[:,C])
1 loops, best of 3: 1.47 s per loop
Staying within the memory constraints, it doesn't look like we can do much about optimizing np.einsum. So, let's shift the focus to np.triu_indices.
For N = 4, we have :
In [131]: N = 4
In [132]: np.triu_indices(N,1)
Out[132]: (array([0, 0, 0, 1, 1, 2]), array([1, 2, 3, 2, 3, 3]))
It seems to create a regular, shifting pattern. This could be written with a cumulative sum that has shifts at positions 3 and 5. Thinking generically, we would end up coding it something like this -
def triu_indices_cumsum(N):
    # Length of R and C index arrays
    L = (N*(N-1))//2

    # Positions along the R and C arrays that indicate
    # shifting to the next row of the full array
    shifts_idx = np.arange(2,N)[::-1].cumsum()

    # Initialize "shift" arrays that finally lead to R and C
    shifts1_arr = np.zeros(L,dtype=int)
    shifts2_arr = np.ones(L,dtype=int)

    # At the shift positions along the shift arrays set appropriate values,
    # such that, when cumulatively summed, they lead to the desired R and C arrays.
    shifts1_arr[shifts_idx] = 1
    shifts2_arr[shifts_idx] = -np.arange(N-2)[::-1]

    # Final cumsum to give R, C
    R_arr = shifts1_arr.cumsum()
    C_arr = shifts2_arr.cumsum()
    return R_arr, C_arr
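As a quick sanity check (not in the original answer), we can confirm that the cumsum-based version reproduces np.triu_indices for a small N, assuming the triu_indices_cumsum() defined above:
import numpy as np

N = 6
R_ref, C_ref = np.triu_indices(N, 1)
R_arr, C_arr = triu_indices_cumsum(N)
print(np.array_equal(R_arr, R_ref), np.array_equal(C_arr, C_ref))  # True True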
Let's time it for various N's!
In [133]: N = 100
In [134]: %timeit np.triu_indices(N,1)
10000 loops, best of 3: 122 µs per loop
In [135]: %timeit triu_indices_cumsum(N)
10000 loops, best of 3: 61.7 µs per loop
In [136]: N = 1000
In [137]: %timeit np.triu_indices(N,1)
100 loops, best of 3: 17 ms per loop
In [138]: %timeit triu_indices_cumsum(N)
100 loops, best of 3: 16.3 ms per loop
Thus, it looks like for decent N's, the customized cumsum based triu_indices might be worth a look!
I've got a numpy array of strictly increasing "cutoff" values of length m, and a pandas series of values of length n (though the index isn't important and it could be cast to a numpy array).
I need to come up with an efficient way of producing a length-m vector of counts of the number of elements in the pandas series that are less than the jth element of the "cutoff" array.
I could do this via a list comprehension:
output = array([(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar])
but I was wondering if there is any way to do this that leverages more of numpy's magic speed, as I have to do this quite a few times inside multiple loops and it keeps crashing my computer.
Thanks!
Is this what you are looking for?
In [36]: a = np.random.random(20)
In [37]: a
Out[37]:
array([ 0.68574307, 0.15743428, 0.68006876, 0.63572484, 0.26279663,
0.14346269, 0.56267286, 0.47250091, 0.91168387, 0.98915746,
0.22174062, 0.11930722, 0.30848231, 0.1550406 , 0.60717858,
0.23805205, 0.57718675, 0.78075297, 0.17083826, 0.87301963])
In [38]: b = np.array((0.3,0.7))
In [39]: np.sum(a[:,None]<b[None,:], axis=0)
Out[39]: array([ 8, 16])
In [40]: np.sum(a[:,None]<b, axis=0) # b's new axis above is unnecessary...
Out[40]: array([ 8, 16])
In [41]: (a[:,None]<b).sum(axis=0) # even simpler
Out[41]: array([ 8, 16])
Timings are always well received (for a longish array of 2e6 elements)
In [47]: a = np.random.random(2000000)
In [48]: %timeit (a[:,None]<b).sum(axis=0)
10 loops, best of 3: 78.2 ms per loop
In [49]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
1 loop, best of 3: 448 ms per loop
For a smaller array
In [50]: a = np.random.random(2000)
In [51]: %timeit (a[:,None]<b).sum(axis=0)
10000 loops, best of 3: 89 µs per loop
In [52]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 141 µs per loop
Edit
Divakar says that things may be different for lengthy b's, let's see
In [71]: a = np.random.random(2000)
In [72]: b =np.random.random(200)
In [73]: %timeit (a[:,None]<b).sum(axis=0)
1000 loops, best of 3: 1.44 ms per loop
In [74]: %timeit np.searchsorted(a, b, 'right',sorter=a.argsort())
10000 loops, best of 3: 172 µs per loop
Quite different indeed! Thank you for prompting my curiosity.
The OP should probably test for their own use case: is the sample very long relative to the cutoff sequence or not, and where is the break-even point?
Edit #2
I made a blooper in my timings: I forgot the axis=0 argument to .sum().
I've edited in the corrected statement and, of course, the corrected timing. My apologies.
You can use np.searchsorted for some NumPy magic -
# Convert to numpy array for some "magic"
pan_series_arr = np.array(pan_series)
# Let the magic begin!
sortidx = pan_series_arr.argsort()
out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
Explanation
You are performing [(pan_series < cutoff_val).sum() for cutoff_val in cutoff_ar], i.e. for each element in cutoff_ar we count the number of pan_series elements that are less than it. With np.searchsorted, we find the position at which each element of cutoff_ar would be inserted into a sorted pan_series_arr so that it lands to the 'right' of the existing entries. That insertion index is exactly the number of pan_series elements below the current cutoff_ar element, which gives us the desired output. (Note that with side='right', elements equal to a cutoff are also counted; if your data can contain ties and you want a strict less-than count, use side='left'.)
Sample run
In [302]: cutoff_ar
Out[302]: array([ 1, 3, 9, 44, 63, 90])
In [303]: pan_series_arr
Out[303]: array([ 2, 8, 69, 55, 97])
In [304]: [(pan_series_arr < cutoff_val).sum() for cutoff_val in cutoff_ar]
Out[304]: [0, 1, 2, 2, 3, 4]
In [305]: sortidx = pan_series_arr.argsort()
...: out = np.searchsorted(pan_series_arr,cutoff_ar,'right',sorter=sortidx)
...:
In [306]: out
Out[306]: array([0, 1, 2, 2, 3, 4])
In numpy, is there a nice idiomatic way of testing if all rows are equal in a 2d array?
I can do something like
np.all([np.array_equal(M[0], M[i]) for i in xrange(1,len(M))])
This seems to mix python lists with numpy arrays which is ugly and presumably also slow.
Is there a nicer/neater way?
One way is to check that every row of the array arr is equal to its first row arr[0]:
(arr == arr[0]).all()
Using equality == is fine for integer values, but if arr contains floating point values you could use np.isclose instead to check for equality within a given tolerance:
np.isclose(a, a[0]).all()
If your array contains NaN and you want to avoid the tricky NaN != NaN issue, you could combine this approach with np.isnan:
(np.isclose(a, a[0]) | np.isnan(a)).all()
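A quick illustration of why the plain == check fails in the presence of NaN while the combined expression passes (small made-up array):
import numpy as np

a = np.array([[1.0, np.nan],
              [1.0, np.nan]])
print((a == a[0]).all())                          # False, because nan != nan
print((np.isclose(a, a[0]) | np.isnan(a)).all())  # True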
Simply check whether there is only one unique item in the array (note that this tests whether all elements are equal, which is stronger than all rows being equal):
>>> arr = np.array([[1]*10 for _ in xrange(5)])
>>> len(np.unique(arr)) == 1
True
A solution inspired from unutbu's answer:
>>> arr = np.array([[1]*10 for _ in xrange(5)])
>>> np.all(np.all(arr == arr[0,:], axis = 1))
True
One problem with your code is that you're creating an entire list first and then applying np.all() on it, so no short-circuiting happens in your version. It would be better to use Python's all() with a generator expression, which can stop at the first mismatching row.
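For example, the short-circuiting version (the same expression that is timed last in the comparisons below):
all(np.array_equal(M[0], M[i]) for i in xrange(1, len(M)))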
Timing comparisons:
>>> M = arr = np.array([[3]*100] + [[2]*100 for _ in xrange(1000)])
>>> %timeit np.all(np.all(arr == arr[0,:], axis = 1))
1000 loops, best of 3: 272 µs per loop
>>> %timeit (np.diff(M, axis=0) == 0).all()
1000 loops, best of 3: 596 µs per loop
>>> %timeit np.all([np.array_equal(M[0], M[i]) for i in xrange(1,len(M))])
100 loops, best of 3: 10.6 ms per loop
>>> %timeit all(np.array_equal(M[0], M[i]) for i in xrange(1,len(M)))
100000 loops, best of 3: 11.3 µs per loop
>>> M = arr = np.array([[2]*100 for _ in xrange(1000)])
>>> %timeit np.all(np.all(arr == arr[0,:], axis = 1))
1000 loops, best of 3: 330 µs per loop
>>> %timeit (np.diff(M, axis=0) == 0).all()
1000 loops, best of 3: 594 µs per loop
>>> %timeit np.all([np.array_equal(M[0], M[i]) for i in xrange(1,len(M))])
100 loops, best of 3: 9.51 ms per loop
>>> %timeit all(np.array_equal(M[0], M[i]) for i in xrange(1,len(M)))
100 loops, best of 3: 9.44 ms per loop
It is worth mentioning that the version above will not work for multidimensional arrays.
For example, for a three-dimensional square image tensor img of shape [256, 256, 3], we need to check whether the [256, 256] layer is the same across the RGB channels.
In this case, we need to use broadcasting:
(img == img[:, :, 0, np.newaxis]).all()
A plain img[:, :, 0] gives us shape [256, 256], but we need [256, 256, 1] to broadcast across the layers.
Regarding Alex's answer about NaN, we now have:
np.isclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)