Short intro
I have two paired lists of 2D numpy arrays (see below) - paired in the sense that index 0 in array1 corresponds to index 0 in array2. For each of the pairs I want to get all the combinations of all rows in the 2D numpy arrays, as answered by Divakar here.
Array example
arr1 = [
np.vstack([[1,6,3,9], [8,5,6,7]]),
np.vstack([[1,6,3,9]]),
np.vstack([[1,6,3,9], [8,5,6,7],[8,5,6,7]])
]
arr2 = [
np.vstack([[8,8,8,8]]),
np.vstack([[8,8,8,8]]),
np.vstack([[1,6,3,9], [8,5,6,7],[8,5,6,7]])
]
Working code
Note: unlike the linked answer, my columns are fixed (always 4), hence I replaced the use of shape with the hardcoded value 4 (or 8 in np.zeros).
def merge(a1, a2):
# From: https://stackoverflow.com/questions/47143712/combination-of-all-rows-in-two-numpy-arrays
m1 = a1.shape[0]
m2 = a2.shape[0]
out = np.zeros((m1, m2, 8), dtype=int)
out[:, :, :4] = a1[:, None, :]
out[:, :, 4:] = a2
out.shape = (m1 * m2, -1)
return out
total = np.concatenate([merge(arr1[i], arr2[i]) for i in range(len(arr1))])
print(total)
Question
While the above works fine, it looks inefficient to me as it:
involves looping through the arrays
"appends" (in list list comprehsion) to the total array, requiring it to allocate memory each time
creates multiple zero arrays (in the merge function), whereas I could create an empty one at the start? related to the point above
I perform this operation thousands of times on arrays with millions of elements, so any suggestions on how to transform this code into something more efficient?
To be honest, this seems pretty hard to optimize. Each step in the loop has a different size, so there likely isn't any purely vectorized way of doing these things. You can try pre-allocating the memory and writing in place, rather than allocating many pieces and finally concatenating the results, but I'd bet that doesn't help you much (unless you are under such constrained conditions that you don't have enough RAM to store everything twice, of course).
Feel free to try the following approach on your larger data, but I'd be surprised if you get any significant speedup (or even that you don't get slower results!).
# Use scalar product to get the final size
result = np.zeros((np.dot([len(x) for x in arr1], [len(x) for x in arr2]), 8), dtype=int)
start = 0
for a1, a2 in zip(arr1, arr2):
end = start + len(a1) * len(a2)
result[start:end, :4] = np.repeat(a1, len(a2), axis=0)
result[start:end, 4:] = np.tile(a2, (len(a1), 1))
start = end
This is what I wanted to see - the list and the merge results:
In [60]: arr1
Out[60]:
[array([[1, 6, 3, 9],
[8, 5, 6, 7]]),
array([[1, 6, 3, 9]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [61]: arr2
Out[61]:
[array([[8, 8, 8, 8]]),
array([[8, 8, 8, 8]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [63]: merge(arr1[0],arr2[0]) # a (2,4) with (1,4) => (2,8)
Out[63]:
array([[1, 6, 3, 9, 8, 8, 8, 8],
[8, 5, 6, 7, 8, 8, 8, 8]])
In [64]: merge(arr1[1],arr2[1]) # a (1,4) with (1,4) => (1,8)
Out[64]: array([[1, 6, 3, 9, 8, 8, 8, 8]])
In [65]: merge(arr1[2],arr2[2]) # a (3,4) with (3,4) => (9,8)
Out[65]:
array([[1, 6, 3, 9, 1, 6, 3, 9],
[1, 6, 3, 9, 8, 5, 6, 7],
[1, 6, 3, 9, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7]])
And total is (12,8), combining all the "rows".
The list comprehension is, more cleanly stated:
[merge(a,b) for a,b in zip(arr1,arr2)]
The lists, while the same length, have arrays with different numbers of rows, and the merge is also different.
People often ask about building an array iteratively, and we consistently say: collect the results in a list, and do one concatenate (or similar) construction at the end. The equivalent loop is:
In [70]: alist = []
...: for a,b in zip(arr1,arr2):
...: alist.append(merge(a,b))
This is usually competitive with predefining the total array and assigning rows. And in your case, to get the final shape of total, you'd have to iterate through the lists and record the number of rows, etc.
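(Concretely, that bookkeeping is just a sum over the pairs - a small sketch:)
n_rows = sum(a.shape[0] * b.shape[0] for a, b in zip(arr1, arr2))   # 2*1 + 1*1 + 3*3 = 12
total = np.zeros((n_rows, 8), dtype=int)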
Unless the computation is trivial, the iteration mechanism is a minor part of the total time. I'm pretty sure that here, it's calling merge 3 times that's taking most of the time. For a task like this I wouldn't worry too much about memory use, including the creation of the zeros. You have to, one way or another, use memory for a (12,8) final result. Building that from a (2,8), (1,8), and (9,8) isn't a big issue.
The list comprehension with concatenate and without:
In [72]: timeit total = np.concatenate([merge(a,b) for a,b in zip(arr1,arr2)])
22.4 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit [merge(a,b) for a,b in zip(arr1,arr2)]
16.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Calling merge 3 times with any of the pairs takes about the same time.
Oh, another thing, don't try to 'reuse' the out array across merge calls. When accumulating results like this in a list, reuse of the arrays is dangerous. Each merge call must return its own array, not a "recycled" one.
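To make that concrete, here's a tiny illustration of the pitfall (buf is a hypothetical shared buffer, reused across iterations):
import numpy as np
buf = np.zeros((1, 8), dtype=int)     # one shared buffer, reused instead of allocating each time
alist = []
for k in range(3):
    buf[:] = k                        # overwrite the buffer in place
    alist.append(buf)                 # every list entry is the *same* object
print([int(a[0, 0]) for a in alist])  # [2, 2, 2] - the earlier results were clobbered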
I have two arrays:
In [32]: a
Out[32]:
array([[1, 2, 3],
[2, 3, 4]])
In [33]: b
Out[33]:
array([[ 8, 9],
[ 9, 10]])
I would like to get the following:
In [35]: c
Out[35]:
array([[ 1, 2, 3, 8, 9],
[ 2, 3, 4, 9, 10]])
i.e. append the first and second value of b[0] = array([8, 9]) as the last two values of a[0]
and append the first and second value of b[1] = array([9,10]) as the last two values of a[1].
The second answer in this link: How to add multiple extra columns to a NumPy array does not work and I do not understand the accepted answer.
You could try with np.hstack:
a=np.array([[1, 2, 3],
[2, 3, 4]])
b=np.array([[ 8, 9],
[ 9, 10]])
print(np.hstack((a,b)))
output:
[[ 1 2 3 8 9]
[ 2 3 4 9 10]]
Or, since the first answer of the link you attached is claimed to be faster than concatenate (although, as you can see from G.Anderson's timings below, the fastest here was concatenate), here is an explanation of that first answer so you can use it:
#So you create an array of the same shape as the expected concatenate output:
res = np.zeros((2,5),int)
res
[[0 0 0 0 0]
[0 0 0 0 0]]
#Then you assign the first array to res[:,:3], which is the first 3 elements of each row
res[:,:3]
[[0 0 0]
[0 0 0]]
res[:,:3]=a #assign
res[:,:3]
[[1, 2, 3]
[2, 3, 4]]
#Then you assign the second array to res[:,3:], which is the last two elements of each row
res[:,3:]
[[0 0]
[0 0]]
res[:,3:]=b #assign
res[:,3:]
[[ 8, 9]
[ 9, 10]]
#And finally:
res
[[ 1 2 3 8 9]
[ 2 3 4 9 10]]
You can do concatenate:
np.concatenate([a,b], axis=1)
Output:
array([[ 1, 2, 3, 8, 9],
[ 2, 3, 4, 9, 10]])
You can use np.append with the axis parameter for joining two arrays on a given axis
np.append(a,b, axis=1)
array([[ 1, 2, 3, 8, 9],
[ 2, 3, 4, 9, 10]])
Adding timings for the top three answers, for completeness' sake. Note that these timings will vary based on the machine running the code, and may scale at different rates for different sizes of array.
%timeit np.append(a,b, axis=1)
2.81 µs ± 438 ns per loop
%timeit np.concatenate([a,b], axis=1)
2.32 µs ± 375 ns per loop
%timeit np.hstack((a,b))
4.41 µs ± 489 ns per loop
From the numpy documentation about numpy.concatenate:
Join a sequence of arrays along an existing axis.
and from the question, I understood that this is what you want:
import numpy as np
a = np.array([[1, 2, 3],
[2, 3, 4]])
b = np.array([[ 8, 9],
[ 9, 10]])
c = np.concatenate((a, b), axis=1)
print ("a: ", a)
print ("b: ", b)
print ("c: ", c)
output:
a: [[1 2 3]
[2 3 4]]
b: [[ 8 9]
[ 9 10]]
c: [[ 1 2 3 8 9]
[ 2 3 4 9 10]]
Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).
>>> arr1
array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
>>> arr2
array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4]])
For every row-vector in arr1, I want to count the occurrence of that row vector in arr2 and generate a vector of these counts. So for this example, the result would be
[3, 0, 1, 1, 3]
What is an efficient way to do this?
First approach:
The obvious approach of just looping over the row-vectors of arr1 and generating a corresponding boolean vector on arr2 seems very slow.
np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)
And it seems like a bad algorithm, as I have to check the same rows multiple times.
Second approach: I could store the row counts in a collections.Counter, and then just access that with apply_along_axis.
cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
This seems to be somewhat faster, but I feel like there has to still be a more direct approach than this.
Here's a NumPy approach after converting the inputs to 1D equivalents and then sorting and using np.searchsorted along with np.bincount for the counting -
def searchsorted_based(a, b):
    # Map each row to a single integer (linear index) so rows can be compared as scalars
    dims = np.maximum(a.max(0), b.max(0)) + 1
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)
    # Unique row-IDs of a, plus the mapping from each original row of a to its unique ID
    unq_a1D, IDs = np.unique(a1D, return_inverse=1)
    # For each row of b, find where it would sit among a's unique rows
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx == unq_a1D.size] = 0            # clip positions that fell off the end
    mask = unq_a1D[fidx] == b1D               # keep only exact matches
    # Count matches per unique row of a; minlength guards the edge case where the
    # largest unique row of a has no match in b
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    out = count[IDs]                          # spread the counts back to a's original rows
    return out
Sample run -
In [308]: a
Out[308]:
array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
In [309]: b
Out[309]:
array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4],
[2, 1, 5]])
In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])
Runtime test -
In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]
In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]
In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True
# @Nickil Maveli's soln
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop
In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop
numpy:
Start off by gathering the linear index equivalents of the row and column subscripts of a2 using np.ravel_multi_index. Add 1 to the column-wise maxima to account for numpy's 0-based indexing. Get the respective counts for the unique rows present through np.unique(). Next, find the matching rows between the unique rows of a2 and a1 by extending a1 with a new dimension towards the right axis (i.e. broadcasting) and extract the indices of the matching rows for both arrays.
Initialize an array of zeros and fill its values by indexing with the obtained indices.
def comp2D_vect(a1, a2):
    # Collapse each row of a2 into a single linear index
    midx = np.ravel_multi_index(a2.T, a2.max(0)+1)
    # Unique linear indices, the position of their first occurrence, and their counts
    a, idx, cnt = np.unique(midx, return_counts=True, return_index=True)
    # Broadcast-compare every row of a1 against the unique rows of a2
    m1, m2 = (a1[:, None] == a2[idx]).all(-1).nonzero()
    out = np.zeros(a1.shape[0], dtype=int)
    out[m1] = cnt[m2]                  # rows of a1 with no match keep a count of 0
    return out
benchmarks:
For: a2 = a2.repeat(100000, axis=0)
%%timeit
df = pd.DataFrame(a2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(a1.T.tolist(), fill_value=0).values
10 loops, best of 3: 67.2 ms per loop  # @Ted Petrou's solution
%timeit comp2D_vect(a1, a2)
10 loops, best of 3: 34 ms per loop # Posted solution
%timeit searchsorted_based(a1,a2)
10 loops, best of 3: 27.6 ms per loop  # @Divakar's solution (winner)
Pandas would be a good tool for this. You can put arr2 into a dataframe and use the groupby method to count the number of occurrences of each row, and then reindex the result with arr1.
arr1=np.array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
arr2 = np.array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4]])
df = pd.DataFrame(arr2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
Output
a  b  c
3  1  4    3
4  3  5    0
6  5  4    1
      4    1
3  1  4    3
dtype: int64
Timings
Create a lot more data first
arr2_2 = arr2.repeat(100000, axis=0)
Now time it:
%%timeit
cnter = Counter(tuple(row) for row in arr2_2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
1 loop, best of 3: 704 ms per loop
%%timeit
df = pd.DataFrame(arr2_2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
10 loops, best of 3: 53.8 ms per loop
Is there something like numpy.argmin(x), but for median?
a quick approximation:
numpy.argsort(data)[len(data)//2]
In general, this is an ill-posed question because an array does not necessarily contain its own median for numpy's definition of the median. For example:
>>> np.median([1, 2])
1.5
But when the length of the array is odd, the median will generally be in the array, so asking for its index does make sense:
>>> np.median([1, 2, 3])
2
For odd-length arrays, an efficient way to determine the index of the median value is by using the np.argpartition function. For example:
import numpy as np
def argmedian(x):
return np.argpartition(x, len(x) // 2)[len(x) // 2]
# Works for odd-length arrays, where the median is in the array:
x = np.random.rand(101)
print("median in array:", np.median(x) in x)
# median in array: True
print(x[argmedian(x)], np.median(x))
# 0.5819150016674371 0.5819150016674371
# Doesn't work for even-length arrays, where the median is not in the array:
x = np.random.rand(100)
print("median in array:", np.median(x) in x)
# median in array: False
print(x[argmedian(x)], np.median(x))
# 0.6116799104572843 0.6047559243909065
This is quite a bit faster than the accepted sort-based solution as the size of the array grows:
x = np.random.rand(1000)
%timeit np.argsort(x)[len(x)//2]
# 10000 loops, best of 3: 25.4 µs per loop
%timeit np.argpartition(x, len(x) // 2)[len(x) // 2]
# 100000 loops, best of 3: 6.03 µs per loop
It seems like an old question, but I found a nice way to do it:
import random
import numpy as np
#some random list with 20 elements
a = [random.random() for i in range(20)]
#find the median index of a
medIdx = a.index(np.percentile(a,50,interpolation='nearest'))
The neat trick here is the percentile built-in option for nearest interpolation, which returns a "real" median value from the list (an element that actually occurs in it), so it is safe to search for it afterwards.
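One caveat worth adding: in newer NumPy releases (1.22+, as far as I know) the interpolation keyword of np.percentile was renamed to method, so the same trick would presumably read:
medIdx = a.index(np.percentile(a, 50, method='nearest'))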
You can keep the indices with the elements (zip), sort, and return the element in the middle (or the two elements in the middle); however, sorting is O(n log n). The following method is O(n) in terms of time complexity.
import numpy as np
def arg_median(a):
if len(a) % 2 == 1:
return np.where(a == np.median(a))[0][0]
else:
l,r = len(a) // 2 - 1, len(a) // 2
left = np.partition(a, l)[l]
right = np.partition(a, r)[r]
return [np.where(a == left)[0][0], np.where(a == right)[0][0]]
print(arg_median(np.array([ 3, 9, 5, 1, 15])))
# 1 3 5 9 15, median=5, index=2
print(arg_median(np.array([ 3, 9, 5, 1, 15, 12])))
# 1 3 5 9 12 15, median=5,9, index=2,1
Output:
2
[2, 1]
The idea is that if there is only one median (the array has odd length), then it returns the index of the median. If we need to average two elements (the array has even length), then it returns the indices of these two elements in a list.
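For comparison, the O(n log n) zip-and-sort alternative mentioned at the start of this answer would look roughly like this (a sketch, not the method above):
def arg_median_by_sorting(a):
    # Sort the indices by their values and pick the middle one(s)
    order = sorted(range(len(a)), key=lambda i: a[i])    # O(n log n)
    if len(a) % 2 == 1:
        return order[len(a) // 2]
    return [order[len(a) // 2 - 1], order[len(a) // 2]]

print(arg_median_by_sorting([3, 9, 5, 1, 15]))      # 2
print(arg_median_by_sorting([3, 9, 5, 1, 15, 12]))  # [2, 1]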
The problem with the accepted answer numpy.argsort(data)[len(data)//2] is that it only works for 1-dimensional arrays. For n-dimensional arrays we need to use a different solution, which is based on the answer proposed by @Hagay.
import numpy as np
# Initialize random 2d array, a
a = np.random.randint(0, 7, size=16).reshape(4,4)
array([[3, 1, 3, 4],
[5, 2, 1, 4],
[4, 2, 4, 2],
[6, 1, 0, 6]])
# Get the argmedians
np.stack(np.nonzero(a == np.percentile(a,50,interpolation='nearest')), axis=1)
array([[0, 0],
[0, 2]])
# Initialize random 3d array, a
a = np.random.randint(0, 10, size=27).reshape(3,3,3)
array([[[3, 5, 3],
[7, 4, 3],
[8, 3, 0]],
[[2, 6, 1],
[7, 8, 8],
[0, 6, 5]],
[[0, 7, 8],
[3, 1, 0],
[9, 6, 7]]])
# Get the argmedians
np.stack(np.nonzero(a == np.percentile(a,50,interpolation='nearest')), axis=1)
array([[0, 0, 1],
[1, 2, 2]])
The accepted answer numpy.argsort(data)[len(data)//2] cannot handle arrays with NaNs.
For a 2-D array, to get the median column index along axis=1 (i.e. along each row):
df = pd.DataFrame({'a': [1, 2, 3.3, 4],
'b': [80, 23, np.nan, 88],
'c': [75, 45, 76, 67],
'd': [5, 4, 6, 7]})
data = df.to_numpy()
# data
array([[ 1. , 80. , 75. , 5. ],
[ 2. , 23. , 45. , 4. ],
[ 3.3, nan, 76. , 6. ],
[ 4. , 88. , 67. , 7. ]])
# median, ignoring NaNs
amedian = np.nanmedian(data, axis=1)
aabs = np.abs(data.T-amedian).T
idx = np.nanargmin(aabs, axis=1)
idx
array([2, 1, 3, 2])
# the accepted answer; please note the third index is 2, the corresponding cell value is 76, which should not be the median value of row [ 3.3, nan, 76. , 6. ]
idx = np.argsort(data)[:, len(data[0])//2]
idx
array([2, 1, 2, 2])
Since the third row [ 3.3, nan, 76. , 6. ] contains a NaN, its NaN-aware median value is 6 (column index 3), not 76 (column index 2), which is what the accepted answer picks.
Consider two lists:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
I want a resulting list c where
c = [0, 0, 2, 0, 4, 5, 0 ,0 ,0 ,0]
is a list of length len(b) with values taken from b defined by indices specified in a and zeros elsewhere.
What is the most elegant way of doing this?
Use a list comprehension with a conditional expression and enumerate.
This LC iterates over the index and the value of the list b; if the index i is found within a, it sets the element to v, otherwise it sets it to 0.
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [v if i in a else 0 for i, v in enumerate(b)]
print(c)
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]
Note: If a is large then you may be best converting it to a set first, before using in. The time complexity for using in with a list is O(n) whilst for a set it is O(1) (in the average case for both); a sketch of this is shown after the equivalent loop below.
The list comprehension is roughly equivalent to the following code (for explanation):
c = []
for i, v in enumerate(b):
if i in a:
c.append(v)
else:
c.append(0)
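As for the note above, the set-based variant is a one-line change (a small sketch; a_set is just a hypothetical name):
a_set = set(a)
c = [v if i in a_set else 0 for i, v in enumerate(b)]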
As you have the option of using numpy, I've included a simple method below which initialises an array filled with zeros and then uses integer array indexing to replace the elements.
import numpy as np
a2 = np.array(a)
b2 = np.array(b)
c = np.zeros(len(b2), dtype=int)  # dtype=int so the result matches b's integers
c[a2] = b2[a2]                    # index the numpy array b2, not the plain list b
When timing the three methods (my list comp, my numpy, and Jon's method) the following results are given for N = 1000, a = list(range(0, N, 10)), and b = list(range(N)).
In [170]: %timeit lc_func(a,b)
100 loops, best of 3: 3.56 ms per loop
In [171]: %timeit numpy_func(a2,b2)
100000 loops, best of 3: 14.8 µs per loop
In [172]: %timeit jon_func(a,b)
10000 loops, best of 3: 22.8 µs per loop
This is to be expected. The numpy function is fastest, but both Jon's function and the numpy one are much faster than the list comprehension. If I increase the number of elements to 100,000 then the gap between numpy and Jon's method gets even larger.
Interestingly enough though, for small N Jon's function is the best! I suspect this is because the overhead of creating numpy arrays outweighs any gain over plain lists.
Moral of the story: large N? Go with numpy. Small N? Go with Jon.
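For reference, the three wrappers timed above would presumably be along these lines (a sketch: lc_func and numpy_func wrap the code from this answer, jon_func wraps the pre-initialised-list approach shown in the next answer):
import numpy as np

def lc_func(a, b):
    # list-comprehension version
    return [v if i in a else 0 for i, v in enumerate(b)]

def numpy_func(a2, b2):
    # numpy version: zeros plus integer array indexing
    c = np.zeros(len(b2), dtype=int)
    c[a2] = b2[a2]
    return c

def jon_func(a, b):
    # pre-initialised list, overwritten in place
    c = [0] * len(b)
    for el in a:
        c[el] = b[el]
    return c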
The other option is to pre-initialise the target list with 0s - a fast operation - then overwrite the values at the suitable indices, e.g.:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0] * len(b)
for el in a:
c[el] = b[el]
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]