Suppose we are given two 2D numpy arrays a and b with the same number of rows. Assume furthermore that we know that each row i of a and b has at most one element in common, though this element may occur multiple times. How can we find this element as efficiently as possible?
An example:
import numpy as np
a = np.array([[1, 2, 3],
[2, 5, 2],
[5, 4, 4],
[2, 1, 3]])
b = np.array([[4, 5],
[3, 2],
[1, 5],
[0, 5]])
desiredResult = np.array([[np.nan],
[2],
[5],
[np.nan]])
It is easy to come up with a straightforward implementation by applying intersect1d along the first axis:
from itertools import starmap
desiredResult = np.array(list(starmap(np.intersect1d, zip(a, b))))
Apparently, using Python's built-in set operations is even quicker. Converting the result to the desired form is easy.
However, I need an implementation that is as efficient as possible. Hence, I do not like the starmap, as I suppose that it requires a Python call for every row. I would like a purely vectorized option, and would be happy if it even exploited our additional knowledge that there is at most one common value per row.
Does anyone have ideas how I could speed up the task and implement the solution more elegantly? I would be okay with using C code or cython, but the coding effort should not be too much.
Approach #1
Here's a vectorized one based on searchsorted2d -
# Sort each row of a and b in-place
a.sort(1)
b.sort(1)
# Use 2D searchsorted row-wise between a and b
idx = searchsorted2d(a,b)
# "Clip-out" out of bounds indices
idx[idx==a.shape[1]] = 0
# Get mask of valid ones i.e. matches
mask = np.take_along_axis(a,idx,axis=1)==b
# Use argmax to get first match as we know there's at most one match
match_val = np.take_along_axis(b,mask.argmax(1)[:,None],axis=1)
# Finally use np.where to choose between valid match
# (decided by any one True in each row of mask)
out = np.where(mask.any(1)[:,None],match_val,np.nan)
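The snippet above calls searchsorted2d without defining it. The following is a minimal sketch of what such a helper could look like, not necessarily the answerer's own implementation: it assumes each row of a is already sorted and shifts every row by a multiple of the combined value span so that a single 1D np.searchsorted call handles all rows. The wrapper name common_by_searchsorted is hypothetical, and for floats with many rows the offsets may cost some precision.

```python
import numpy as np

def searchsorted2d(a, b):
    """Row-wise np.searchsorted; each row of `a` must already be sorted."""
    m, n = a.shape
    # Offset every row by a multiple of the combined value span so that
    # the flattened rows of `a` form one globally sorted 1D array.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    offsets = (hi - lo + 1) * np.arange(m)[:, None]
    pos = np.searchsorted((a + offsets).ravel(), (b + offsets).ravel())
    # Convert flat positions back to per-row positions.
    return pos.reshape(m, -1) - n * np.arange(m)[:, None]

def common_by_searchsorted(a, b):
    # Hypothetical wrapper around the steps of Approach #1;
    # it sorts copies instead of sorting the inputs in place.
    a = np.sort(a, axis=1)
    b = np.sort(b, axis=1)
    idx = searchsorted2d(a, b)
    idx[idx == a.shape[1]] = 0
    mask = np.take_along_axis(a, idx, axis=1) == b
    match_val = np.take_along_axis(b, mask.argmax(1)[:, None], axis=1)
    return np.where(mask.any(1)[:, None], match_val, np.nan)

a = np.array([[1, 2, 3], [2, 5, 2], [5, 4, 4], [2, 1, 3]])
b = np.array([[4, 5], [3, 2], [1, 5], [0, 5]])
result = common_by_searchsorted(a, b)
print(result)
```

On the example from the question, rows 0 and 3 come out as NaN and rows 1 and 2 as 2 and 5.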
Approach #2
Numba-based one for memory efficiency -
from numba import njit
@njit(parallel=True)
def numba_f1(a,b,out):
    n,a_ncols = a.shape
    b_ncols = b.shape[1]
    for i in range(n):
        for j in range(a_ncols):
            for k in range(b_ncols):
                m = a[i,j]==b[i,k]
                if m:
                    break
            if m:
                out[i] = a[i,j]
                break
    return out
def find_first_common_elem_per_row(a,b):
    out = np.full(len(a),np.nan)
    numba_f1(a,b,out)
    return out
Approach #3
Here's another vectorized one based on stacking and sorting -
r = np.arange(len(a))
ab = np.hstack((a,b))
idx = ab.argsort(1)
ab_s = ab[r[:,None],idx]
m = ab_s[:,:-1] == ab_s[:,1:]
m2 = (idx[:,1:]*m)>=a.shape[1]
m3 = m & m2
out = np.where(m3.any(1),b[r,idx[r,m3.argmax(1)+1]-a.shape[1]],np.nan)
Approach #4
For an elegant one, we can make use of broadcasting for a resource-hungry method -
m = (a[:,None]==b[:,:,None]).any(2)
out = np.where(m.any(1),b[np.arange(len(a)),m.argmax(1)],np.nan)
Doing some research, I found that checking whether two lists are disjoint runs in O(n+m), where n and m are the lengths of the lists (see here). The idea is that insertion and lookup of elements run in constant time for hash maps. Therefore, inserting all elements from the first list into a hash map takes O(n) operations, and checking for each element in the second list whether it is already in the hash map takes O(m) operations. Therefore, solutions based on sorting, which run in O(n log(n) + m log(m)), are not optimal asymptotically.
Though the solutions by @Divakar are highly efficient in many use cases, they are less efficient if the second dimension is large. Then, a solution based on hash maps is better suited. I have implemented it as follows in cython:
import numpy as np
cimport numpy as np
import cython
from libc.math cimport NAN
from libcpp.unordered_map cimport unordered_map
np.import_array()
@cython.boundscheck(False)
@cython.wraparound(False)
def get_common_element2d(np.ndarray[double, ndim=2] arr1,
                         np.ndarray[double, ndim=2] arr2):
    cdef np.ndarray[double, ndim=1] result = np.empty(arr1.shape[0])
    cdef int dim1 = arr1.shape[1]
    cdef int dim2 = arr2.shape[1]
    cdef int i, j
    cdef unordered_map[double, int] tmpset = unordered_map[double, int]()
    for i in range(arr1.shape[0]):
        for j in range(dim1):
            # insert arr1[i, j] as key without assigned value
            tmpset[arr1[i, j]]
        for j in range(dim2):
            # check whether arr2[i, j] is in tmpset
            if tmpset.count(arr2[i,j]):
                result[i] = arr2[i,j]
                break
        else:
            # no break: the row has no common element
            result[i] = NAN
        tmpset.clear()
    return result
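For readers who do not want to set up cython, the same hash-based idea can be sketched in pure Python with the built-in set. This is only an illustration of the O(n + m) lookup per row, not a replacement for the compiled version, and the function name first_common_per_row is hypothetical.

```python
import numpy as np

def first_common_per_row(a, b):
    """Return the common element of each row pair, or NaN if there is none."""
    out = np.full(len(a), np.nan)
    # tolist() converts rows to plain Python scalars, which hash cheaply.
    for i, (row_a, row_b) in enumerate(zip(a.tolist(), b.tolist())):
        seen = set(row_a)      # O(n) insertions into the hash set
        for x in row_b:        # at most O(m) constant-time lookups
            if x in seen:
                out[i] = x
                break          # at most one common value per row is assumed
    return out

a = np.array([[1, 2, 3], [2, 5, 2], [5, 4, 4], [2, 1, 3]])
b = np.array([[4, 5], [3, 2], [1, 5], [0, 5]])
common = first_common_per_row(a, b)
print(common)
```

The early break relies on the assumption from the question that each row pair shares at most one distinct value.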
I have created test cases as follows:
import numpy as np
import timeit
from itertools import starmap
from mycythonmodule import get_common_element2d
m, n = 3000, 3000
a = np.random.rand(m, n)
b = np.random.rand(m, n)
for i, row in enumerate(a):
    if np.random.randint(2):
        common = np.random.choice(row, 1)
        b[i][np.random.choice(np.arange(n), np.random.randint(min(n,20)), False)] = common
# we need to copy the arrays on each test run, otherwise they
# will remain sorted, which would bias the results
%timeit [set(aa).intersection(bb) for aa, bb in zip(a.copy(), b.copy())]
# returns 3.11 s ± 56.8 ms
%timeit list(starmap(np.intersect1d, zip(a.copy(), b.copy())))
# returns 1.83 s ± 55.4
# test sorting method
# divakarsMethod1 is approach #1 in @Divakar's answer
%timeit divakarsMethod1(a.copy(), b.copy())
# returns 1.88 s ± 18 ms
# test hash map method
%timeit get_common_element2d(a.copy(), b.copy())
# returns 1.46 s ± 22.6 ms
These results seem to indicate that the naive approach is actually better than some vectorized versions. However, the vectorized algorithms play to their strengths when many rows with fewer columns are considered (a different use case). In these cases, the vectorized approaches are more than 5 times faster than the naive approach, and the sorting method turns out to be best.
Conclusion: I will go with the HashMap-based cython version, because it is among the most efficient variants in both use cases. If I had to set up cython first, I would use the sorting-based method.
Not sure if this is faster, but we can try a couple things here:
Method 1 np.intersect1d with list comprehension
[np.intersect1d(arr[0], arr[1]) for arr in list(zip(a,b))]
# Out
[array([], dtype=int32), array([2]), array([5]), array([], dtype=int32)]
Or to list:
[np.intersect1d(arr[0], arr[1]).tolist() for arr in list(zip(a,b))]
# Out
[[], [2], [5], []]
Method 2 set with list comprehension:
[list(set(arr[0]) & set(arr[1])) for arr in list(zip(a,b))]
# Out
[[], [2], [5], []]
I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a=np.zeros(6).astype('int')
b=[3,2,5]
a[b]+=1
With repeats, I've come up with the following method.
b=[3,2,5,2] # indices to increment by one each replicate
bbins=np.bincount(b)
b.sort() # sort b because bincount is sorted
incr=bbins[np.nonzero(bbins)] # create increment array
bu=np.unique(b) # sorted, unique indices (len(bu)=len(incr))
a[bu]+=incr
Is this the best way? Is there a risk involved with assuming that the np.bincount and np.unique operations would result in the same sorted order? Am I missing some simple Numpy operation to solve this?
In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])
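To make the difference concrete, here is a small sketch that puts the buffered fancy-index increment next to np.add.at and an np.bincount variant. The variable names buffered, unbuffered and counted are only illustrative.

```python
import numpy as np

b = [3, 2, 5, 2]  # index 2 is repeated

# Plain fancy indexing is buffered: index 2 is only incremented once.
buffered = np.zeros(6, dtype=int)
buffered[b] += 1

# np.add.at accumulates over repeated indices.
unbuffered = np.zeros(6, dtype=int)
np.add.at(unbuffered, b, 1)

# np.bincount with minlength yields the per-index counts directly.
counted = np.zeros(6, dtype=int)
counted += np.bincount(b, minlength=len(counted))

print(buffered)    # [0 0 1 1 0 1]
print(unbuffered)  # [0 0 2 1 0 1]
print(counted)     # [0 0 2 1 0 1]
```

Passing minlength keeps the bincount result the same length as the target array, so no slicing is needed.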
After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)
If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np
a = np.zeros( 100000, int )
b = np.array( [99999, 99997, 99999] )
blo, bhi = b.min(), b.max()
bbins = np.bincount( b - blo )
a[blo:bhi+1] += bbins
print(a[blo:bhi+1]) # 1 0 2
Suppose I have 2 matrices M and N (both have > 1 columns). I also have an index matrix w with 2 columns, one for M and one for N. The indices for N are unique, but the indices for M may appear more than once. The operation I would like to perform is,
for i,j in w:
    M[i] += N[j]
Is there a more efficient way to do this other than a for loop?
For completeness, in numpy >= 1.8 you can also use np.add's at method:
In [8]: m, n = np.random.rand(2, 10)
In [9]: m_idx, n_idx = np.random.randint(10, size=(2, 20))
In [10]: m0 = m.copy()
In [11]: np.add.at(m, m_idx, n[n_idx])
In [13]: m0 += np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
In [14]: np.allclose(m, m0)
Out[14]: True
In [15]: %timeit np.add.at(m, m_idx, n[n_idx])
100000 loops, best of 3: 9.49 us per loop
In [16]: %timeit np.bincount(m_idx, weights=n[n_idx], minlength=len(m))
1000000 loops, best of 3: 1.54 us per loop
Aside from the obvious performance disadvantage, np.add.at has a couple of advantages:
np.bincount converts its weights to double precision floats, whereas .at will operate with your array's native type. This makes it the simplest option for dealing e.g. with complex numbers.
np.bincount only adds weights together, whereas you have an at method for all ufuncs, so you can repeatedly multiply, or logical_and, or whatever you feel like.
But for your use case, np.bincount is probably the way to go.
Using also m_ind, n_ind = w.T, just do M += np.bincount(m_ind, weights=N[n_ind], minlength=len(M))
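Note that np.bincount only accepts 1D weights, while the question speaks of matrices with more than one column. For that case, a sketch with np.add.at, which adds whole rows and accumulates repeated row indices, might look as follows. The example data here is made up for illustration.

```python
import numpy as np

# Made-up example data: 3 rows in M, 4 rows in N, 2 columns each.
M = np.zeros((3, 2))
N = np.arange(8.0).reshape(4, 2)
w = np.array([[0, 0],
              [0, 1],
              [2, 3]])  # row 0 of M is updated twice

# Reference result from the explicit loop.
expected = M.copy()
for i, j in w:
    expected[i] += N[j]

# Vectorized version: rows of N[n_ind] are accumulated into M[m_ind].
m_ind, n_ind = w.T
np.add.at(M, m_ind, N[n_ind])

print(np.allclose(M, expected))
```

If M and N are 1D, the np.bincount one-liner above remains the faster choice according to the timings quoted earlier.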
For clarity, let's define
>>> m_ind, n_ind = w.T
Then the for loop
for i, j in zip(m_ind, n_ind):
    M[i] += N[j]
updates the entries M[np.unique(m_ind)]. The values that get written to it are N[n_ind], which must be grouped by m_ind. (The fact that there's an n_ind in addition to m_ind is actually tangential to the question; you could just set N = N[n_ind].) There happens to be a SciPy class that does exactly this: scipy.sparse.csr_matrix.
Example data:
>>> m_ind, n_ind = np.array([[0, 0, 1, 1], [2, 3, 0, 1]])
>>> M = np.arange(2, 6)
>>> N = np.logspace(2, 5, 4)
The result of the for loop is that M becomes [110002 1103 4 5]. We get the same result with a csr_matrix as follows. As I said earlier, n_ind isn't relevant, so we get rid of that first.
>>> N = N[n_ind]
>>> from scipy.sparse import csr_matrix
>>> update = csr_matrix((N, m_ind, [0, len(N)])).toarray()
The CSR constructor builds a matrix with the required values at the required indices; the third part of its argument is the row pointer array (indptr), which here says that the single row 0 holds the values N[0:len(N)] at the column indices m_ind[0:len(N)]. Duplicates are summed:
>>> update
array([[ 110000., 1100.]])
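This has shape (1, m_ind.max() + 1), which equals (1, len(np.unique(m_ind))) here because the indices in m_ind are contiguous and start at 0, so it can be added in directly: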
>>> M[np.unique(m_ind)] += update.ravel()
>>> M
array([110002, 1103, 4, 5])
Given a list of numpy arrays, each with the same dimensions, how can I find which array contains the maximum value on an element-by-element basis?
e.g.
import numpy as np
def find_index_where_max_occurs(my_list):
    # d = ... something goes here ...
    return d
a=np.array([1,1,3,1])
b=np.array([3,1,1,1])
c=np.array([1,3,1,1])
my_list=[a,b,c]
array_of_indices_where_max_occurs = find_index_where_max_occurs(my_list)
# This is what I want:
# >>> print array_of_indices_where_max_occurs
# array([1,2,0,0])
# i.e. for the first element, the maximum value occurs in array b which is at index 1 in my_list.
Any help would be much appreciated... thanks!
Another option if you want an array:
>>> np.array((a, b, c)).argmax(axis=0)
array([1, 2, 0, 0])
So:
def f(my_list):
    return np.array(my_list).argmax(axis=0)
This works with multidimensional arrays, too.
For the fun of it, I realised that @Lev's original answer was faster than his generalized edit, so here is a generalized stacking version. It is much faster than the np.asarray version, though not very elegant.
np.concatenate((a[None,...], b[None,...], c[None,...]), axis=0).argmax(0)
That is:
def bystack(arrs):
    return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
Some explanation:
I've added a new axis to each array: arr[None,...] is equivalent to arr[np.newaxis,...], which for a 3-D array is the same as arr[np.newaxis,:,:,:]; the ... expands to the appropriate number of dimensions. The reason for this is that np.concatenate will then stack along the new dimension, which is 0 since the None is at the front.
So, for example:
In [286]: a
Out[286]:
array([[0, 1],
[2, 3]])
In [287]: b
Out[287]:
array([[10, 11],
[12, 13]])
In [288]: np.concatenate((a[None,...],b[None,...]),axis=0)
Out[288]:
array([[[ 0, 1],
[ 2, 3]],
[[10, 11],
[12, 13]]])
In case it helps to understand, this would work too:
np.concatenate((a[...,None], b[...,None], c[...,None]), axis=a.ndim).argmax(a.ndim)
where the new axis is now added at the end, so we must stack and maximize along that last axis, which will be a.ndim. For a, b, and c being 2d, we could do this:
np.concatenate((a[:,:,None], b[:,:,None], c[:,:,None]), axis=2).argmax(2)
Which is equivalent to the dstack I mentioned in my comment above (dstack adds a third axis to stack along if it doesn't exist in the arrays).
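In NumPy 1.10 and later, np.stack inserts the new axis itself, so the manual arr[None,...] step can be left out. A short sketch using the arrays from the question, with the concatenate version alongside for comparison:

```python
import numpy as np

a = np.array([1, 1, 3, 1])
b = np.array([3, 1, 1, 1])
c = np.array([1, 3, 1, 1])
my_list = [a, b, c]

# np.stack adds a new leading axis and joins the arrays along it.
by_stack = np.stack(my_list, axis=0).argmax(axis=0)

# Same result with the explicit new-axis-plus-concatenate formulation.
by_concat = np.concatenate([arr[None, ...] for arr in my_list], axis=0).argmax(0)

print(by_stack)   # [1 2 0 0]
print(by_concat)  # [1 2 0 0]
```

Where several arrays tie for the maximum, argmax reports the first one in the list, as in the last position here.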
To test:
N = 10
M = 2
a = np.random.random((N,)*M)
b = np.random.random((N,)*M)
c = np.random.random((N,)*M)
def bystack(arrs):
    return np.concatenate([arr[None,...] for arr in arrs], axis=0).argmax(0)
def byarray(arrs):
    return np.array(arrs).argmax(axis=0)
def byasarray(arrs):
    return np.asarray(arrs).argmax(axis=0)
def bylist(arrs):
    assert arrs[0].ndim == 1, "ndim must be 1"
    return [np.argmax(x) for x in zip(*arrs)]
In [240]: timeit bystack((a,b,c))
100000 loops, best of 3: 18.3 us per loop
In [241]: timeit byarray((a,b,c))
10000 loops, best of 3: 89.7 us per loop
In [242]: timeit byasarray((a,b,c))
10000 loops, best of 3: 90.0 us per loop
In [259]: timeit bylist((a,b,c))
1000 loops, best of 3: 267 us per loop
[np.argmax(x) for x in zip(*my_list)]
Well, this is just a list, but you know how to make it an array if you want. :)
To explain what this does: zip(*my_list) is equivalent to zip(a,b,c), which gives you an iterator to loop over. Each step in the loop gives you a tuple like (a[i], b[i], c[i]), where i is the step in the loop. Then, np.argmax gives you the index within that tuple of the element with the largest value.
I have a 2D array containing integers (both positive or negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.
So if the array is like:
1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1
The result should be
1 3 2 2 2 1
Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.
I can iterate over the columns finding mode one at a time but I was hoping numpy might have some in-built function to do that. Or if there is a trick to find that efficiently without looping.
Check scipy.stats.mode() (inspired by @tom10's comment):
import numpy as np
from scipy import stats
a = np.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
m = stats.mode(a)
print(m)
Output:
ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))
As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]:
print(m[0])
Output:
[[1 3 2 2 1 1]]
Update
The scipy.stats.mode function has been significantly optimized since this post, and would be the recommended method
Old answer
This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts arg as True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:
import numpy
def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except Exception:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))
    # If array is 1-D and numpy version is >= 1.9, numpy.unique will suffice
    # (compare (major, minor) as a tuple so that numpy 2.x also qualifies)
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]
    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    # (index with a tuple; indexing with a list of slices fails on recent numpy)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1
    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    index = tuple(index)
    return sort[index], counts[index]
Result:
In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
[5, 2, 2, 1, 4, 1],
[3, 3, 2, 2, 1, 1]])
In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
Some benchmarks:
In [4]: import scipy.stats
In [5]: a = numpy.random.randint(1,10,(1000,1000))
In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop
In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop
In [8]: a = numpy.random.randint(1,500,(1000,1000))
In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop
In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop
In [11]: a = numpy.random.random((200,200))
In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop
In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop
EDIT: Provided more of a background and modified the approach to be more memory-efficient
If you want to use numpy only:
x = [-1, 2, 1, 3, 3]
vals,counts = np.unique(x, return_counts=True)
gives
(array([-1, 1, 2, 3]), array([1, 1, 1, 2]))
And extract it:
index = np.argmax(counts)
vals[index]  # 3
A neat solution that only uses numpy (not scipy nor the Counter class):
A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])
np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)
array([1, 3, 2, 2, 1, 1])
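One caveat: np.bincount only accepts non-negative integers, while the question mentions both positive and negative values. A possible variant, sketched here with np.unique in place of np.bincount and with the hypothetical name column_mode, handles negative entries as well. On ties it returns the smallest of the tied values.

```python
import numpy as np

def column_mode(arr):
    """Mode of every column; ties resolve to the smallest tied value."""
    def mode_1d(col):
        vals, counts = np.unique(col, return_counts=True)
        return vals[np.argmax(counts)]
    # Apply the 1D helper to each column (axis 0).
    return np.apply_along_axis(mode_1d, 0, arr)

A = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
print(column_mode(A))  # [1 3 2 2 1 1]

B = np.array([[-1, 2],
              [-1, 3],
              [4, 3]])
print(column_mode(B))  # [-1  3]
```

Like the one-liner above, this still calls a Python function once per column, so it is convenient rather than fast.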
Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
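Note that np.argmax(counts) returns only the first maximum, so remember to discard the mode when several values tie for the highest count (that is, when np.sum(counts == counts.max()) > 1). To validate whether it is actually representative of the central tendency of your data, you may also check whether it falls inside your standard deviation interval.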
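A tie check along these lines could be sketched as follows. The helper name modes_with_ties is hypothetical; it returns every value that reaches the maximum count, so a result with more than one element signals an ambiguous mode.

```python
import numpy as np

def modes_with_ties(x):
    """Return all values that share the highest count, in sorted order."""
    vals, counts = np.unique(x, return_counts=True)
    return vals[counts == counts.max()]

unique_mode = modes_with_ties([-1, 2, 1, 3, 3])
tied_modes = modes_with_ties([1, 1, 2, 2, 3])

print(unique_mode)  # [3]
print(tied_modes)   # [1 2]
```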
The simplest way in Python to get the mode of a list or array a:
import statistics
a=[7,4,4,4,4,25,25,6,7,4867,5,6,56,52,32,44,4,4,44,4,44,4]
print(f"{statistics.mode(a)} is the mode (most frequently occurring number)")
That's it
I think a very simple way would be to use the Counter class. You can then use the most_common() function of the Counter instance as mentioned here.
For 1-d arrays:
import numpy as np
from collections import Counter
nparr = np.arange(10)
nparr[2] = 6
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
For multiple dimensional arrays (little difference):
import numpy as np
from collections import Counter
nparr = np.arange(100) # 100 elements, so that the (10,2,5) reshape below is valid
nparr[2] = 6
nparr[3] = 6
nparr = nparr.reshape((10,2,5)) #same thing but we add this to reshape into ndarray
mode = Counter(nparr.flatten()).most_common(1) # just use .flatten() method
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
This may or may not be an efficient implementation, but it is convenient.
from collections import Counter
n = int(input())
data = sorted([int(i) for i in input().split()])
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]
print(mode)
Counter(data) counts the frequency of each value and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts by the keys, not the frequency. Finally, the outer sorted with key = lambda x: x[1] orders the pairs by frequency, and reverse = True tells Python to sort from the largest frequency to the smallest.
If you want the mode as an int value, here is the easiest way.
I was trying to find the mode of an array using scipy.stats, but the output of the code looks like:
ModeResult(mode=array(2), count=array([[1, 2, 2, 2, 1, 2]])). I only want the integer output, so if you want the same, just try this:
import numpy as np
from scipy import stats
numbers = list(map(int, input().split()))
print(int(stats.mode(numbers)[0]))
The last line is enough to print the mode value in Python: print(int(stats.mode(numbers)[0]))
If you wish to use only numpy and do it without using the index of the array, the following implementation combining dictionaries with numpy can be used.
val,count = np.unique(x,return_counts=True)
freq = {}
for v,c in zip(val,count):
    freq[v] = c
# (value, count) pair with the highest count
mode = sorted(freq.items(),key =lambda kv :kv[1])[-1]
Finding Mode using dict in python
def mode(x):
    d={}
    k=0
    v=0
    for i in x:
        d[i]=d.get(i,0)+1
        if d[i]>v:
            k=i
            v=d[i]
    print(d)
    return k
print(mode(x))