I have the following type of array:
a = array([[1,1,1],
           [1,1,1],
           [1,1,1],
           [2,2,2],
           [2,2,2],
           [2,2,2],
           [3,3,0],
           [3,3,0],
           [3,3,0]])
I would like to count the number of occurrences of each type of array such as
[1,1,1]:3, [2,2,2]:3, and [3,3,0]: 3
How could I achieve this in Python? Is it possible without using a for loop and counting into a dictionary? It has to be fast and should take less than 0.1 seconds or so. I looked into Counter, numpy bincount, etc., but those count individual elements, not whole rows.
Thanks.
If you don't mind mapping the rows to tuples just to get the count, you can use a Counter dict, which runs in 28.5 µs on my machine using Python 3 and is well below your threshold:
In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop
In [6]: c = Counter(map(tuple, a))
In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
collections.Counter can do this conveniently, and almost like the example given.
>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
...     c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
This converts each row to a tuple, which can be used as a dictionary key because tuples are immutable (and therefore hashable). Lists and arrays are mutable, so they can't be used as dict keys.
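A quick illustration of why the conversion matters (hypothetical snippet):
>>> d = {}
>>> d[(1, 1, 1)] = 3        # tuples are hashable, so this works
>>> d[[1, 1, 1]] = 3        # lists are not
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'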
Why do you want to avoid using for loops?
And similar to @Padraic Cunningham's much cooler answer:
>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
You could collapse the rows into a 1D array of linear indices by treating the elements of each row as coordinates on a multidimensional grid with np.ravel_multi_index. Then, np.unique on those linear indices gives us the position of the first occurrence of each unique row (return_index) and, with the optional argument return_counts, the counts. Thus, the implementation would look something like this -
def unique_rows_counts(a):
    # Calculate linear indices using rows from a
    lidx = np.ravel_multi_index(a.T, a.max(0)+1)
    # Get the unique indices and their counts
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    # Return the unique rows from a and their respective counts
    return a[unq_idx], counts
Sample run -
In [64]: a
Out[64]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [3, 3, 0],
       [3, 3, 0],
       [3, 3, 0]])
In [65]: unqrows, counts = unique_rows_counts(a)
In [66]: unqrows
Out[66]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])
Benchmarking
Assuming you are okay with either numpy arrays or collections as outputs, one can benchmark the solutions provided thus far, like so -
Function definitions:
import numpy as np
from collections import Counter
def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0)+1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))

def forloop_Counter(a):
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c
Timings:
In [53]: a = np.random.randint(0,4,(10000,5))
In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop
In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop
In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop
The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for this kind of operation:
import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)
Note that this works for arrays of any dimensionality or datatype.
Since numpy 1.13.0, np.unique can be used with an axis argument:
>>> np.unique(a, axis=0, return_counts=True)
(array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]]), array([3, 3, 3]))
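If you want the dictionary-style output shown in the question, the two returned arrays combine directly (a small sketch; row_counts is just an illustrative name):
vals, counts = np.unique(a, axis=0, return_counts=True)
row_counts = {tuple(map(int, row)): int(cnt) for row, cnt in zip(vals, counts)}
# {(1, 1, 1): 3, (2, 2, 2): 3, (3, 3, 0): 3}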
Related
For example, for
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
I want to get
[2, 2, 3]
Is there a way to do this without for loops or using np.vectorize?
Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
Approach #1
One vectorized approach with sorting -
In [8]: b = np.sort(a,axis=1)
In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
Approach #2
Another method, for ints that aren't very large, is to offset each row by a different amount so that elements from different rows land in disjoint value ranges, then do one binned summation with np.bincount and count the number of non-zero bins per row -
n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
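To make the offsetting step concrete, here is what a_off looks like for the small example array (a sketch; the intermediate values are worked out by hand):
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
n = a.max() + 1                                   # 5
a_off = a + (np.arange(a.shape[0])[:, None]) * n  # row i is shifted into the range [i*n, (i+1)*n)
# array([[ 1,  0,  0],
#        [ 6,  5,  5],
#        [12, 13, 14]])
# One global bincount over the flattened a_off then counts each row's elements in its own block of bins.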
Runtime test
Approaches as funcs -
import numpy as np
import pandas as pd
from toolz import compose

def sorting(a):
    b = np.sort(a, axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a + (np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n) != 0).sum(1)

# From @wim's post
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)
Case #1 : Square shaped one
In [164]: np.random.seed(0)
In [165]: a = np.random.randint(0,5,(10000,10000))
In [166]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop
Case #2 : Large number of rows
In [167]: np.random.seed(0)
In [168]: a = np.random.randint(0,5,(1000000,10))
In [169]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop
Extending to number of unique elements per column
To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -
def nunique_percol_sort(a):
    b = np.sort(a, axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a + (np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n) != 0).sum(1)
Generic ndarray with generic axis
Let's see how we can extend this to ndarrays of generic dimensionality and count the unique elements along a generic axis. We will make use of np.diff with its axis param to get the consecutive differences and hence make it generic, like so -
def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis)+1
Sample runs -
In [77]: a
Out[77]:
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])
In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])
In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
If you are working with floating point numbers and want to base the uniqueness on some tolerance value rather than an exact match, we can use np.isclose. Two such options would be -
(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
For a custom tolerance value, pass it to np.isclose via its rtol/atol arguments.
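Wrapped into a helper for convenience (a sketch; nunique_tol is a hypothetical name and the tolerance is passed through as np.isclose's atol):
def nunique_tol(a, axis, tol=1e-8):
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    # consecutive sorted values within tol of each other are treated as equal
    return (~np.isclose(d, 0, atol=tol)).sum(axis) + 1

nunique_tol(np.array([[1.0, 1.0 + 1e-12, 2.0]]), axis=1)  # -> array([2])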
This solution via np.apply_along_axis isn't vectorised and involves a Python-level loop, but it is relatively intuitive, using just len and np.unique.
import numpy as np
from toolz import compose
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.apply_along_axis(compose(len, np.unique), 1, a) # [2, 2, 3]
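If you'd rather avoid the toolz dependency here, a plain lambda should give the same result (a quick sketch):
np.apply_along_axis(lambda row: len(np.unique(row)), 1, a)  # [2, 2, 3]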
A one-liner using sort:
In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
Are you open to considering pandas? DataFrames have a dedicated method for this:
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3
Given a 3-dimensional numpy array, how do I find the indices of the n smallest values? The index of the minimum value can be found as:
i,j,k = np.where(my_array == my_array.min())
Here's one approach for generic n-dims and generic N smallest numbers -
def smallestN_indices(a, N):
    idx = a.ravel().argsort()[:N]
    return np.stack(np.unravel_index(idx, a.shape)).T
Each row of the 2D output array holds the indexing tuple that corresponds to one of the smallest array values.
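For example, each output row can be turned into an indexing tuple to pull the corresponding values back out of a (a sketch; idx and vals are illustrative names):
idx = smallestN_indices(a, N=3)
vals = a[tuple(idx.T)]   # equivalent to a[idx[:, 0], idx[:, 1], ...]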
We can also use argpartition, but that might not maintain the order. So we need a bit of additional work with argsort there -
def smallestN_indices_argpartition(a, N, maintain_order=False):
    idx = np.argpartition(a.ravel(), N)[:N]
    if maintain_order:
        idx = idx[a.ravel()[idx].argsort()]
    return np.stack(np.unravel_index(idx, a.shape)).T
Sample run -
In [141]: np.random.seed(1234)
...: a = np.random.randint(111,999,(2,5,4,3))
...:
In [142]: smallestN_indices(a, N=3)
Out[142]:
array([[0, 3, 2, 0],
       [1, 2, 3, 0],
       [1, 2, 2, 1]])
In [143]: smallestN_indices_argpartition(a, N=3)
Out[143]:
array([[1, 2, 3, 0],
       [0, 3, 2, 0],
       [1, 2, 2, 1]])
In [144]: smallestN_indices_argpartition(a, N=3, maintain_order=True)
Out[144]:
array([[0, 3, 2, 0],
       [1, 2, 3, 0],
       [1, 2, 2, 1]])
Runtime test -
In [145]: a = np.random.randint(111,999,(20,50,40,30))
In [146]: %timeit smallestN_indices(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3, maintain_order=True)
...:
10 loops, best of 3: 97.6 ms per loop
100 loops, best of 3: 8.32 ms per loop
100 loops, best of 3: 8.34 ms per loop
I'm trying to lexicographically rank array components. The below code works fine, but I'd like to assign equal ranks to equal elements.
import numpy as np
values = np.asarray([
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 2]
])
# need to flip, because for `np.lexsort` last
# element has highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the order in
# which the elements should be in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)
The ranks variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required because the rows [1, 2, 3] (indices 0 and 3) should share the same rank. Continuous rank numbers are OK, so [2, 0, 3, 2, 1] is also an acceptable rank array.
Here's one approach -
# Get lexsorted indices and hence sorted values by those indices
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_vals = values[lexsort_idx]
# Mask of steps where rows shift (there are no duplicates in subsequent rows)
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]
# Get the stepped indices (indices shift at non duplicate rows) and
# the index values are scaled corresponding to row numbers
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))
# Re-arrange the stepped indices based on the original order of rows
# This is basically same as the original code does in last 4 steps,
# just in a concise manner
out_idx = stepped_idx[lexsort_idx.argsort()]
Sample step-by-step intermediate outputs -
In [55]: values
Out[55]:
array([[1, 2, 3],
       [1, 1, 1],
       [2, 2, 3],
       [1, 2, 3],
       [1, 1, 2]])
In [56]: lexsort_idx
Out[56]: array([1, 4, 0, 3, 2])
In [57]: lexsort_vals
Out[57]:
array([[1, 1, 1],
       [1, 1, 2],
       [1, 2, 3],
       [1, 2, 3],
       [2, 2, 3]])
In [58]: mask
Out[58]: array([ True, True, True, False, True], dtype=bool)
In [59]: stepped_idx
Out[59]: array([0, 1, 2, 2, 4])
In [60]: lexsort_idx.argsort()
Out[60]: array([2, 0, 4, 3, 1])
In [61]: stepped_idx[lexsort_idx.argsort()]
Out[61]: array([2, 0, 4, 2, 1])
Performance boost
To compute lexsort_idx.argsort() more efficiently, we could use the helper below, which is essentially what the original code does in its last 4 lines -
def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
Thus, lexsort_idx.argsort() could be alternatively computed with argsort_unique(lexsort_idx).
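As a quick sanity check (a sketch):
assert np.array_equal(argsort_unique(lexsort_idx), lexsort_idx.argsort())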
Runtime test
Applying a few more optimization tricks, we would have a version like so -
def numpy_app(values):
    lexsort_idx = np.lexsort(values.T[::-1])
    lexsort_v = values[lexsort_idx]
    mask = np.concatenate(([False], (lexsort_v[1:] == lexsort_v[:-1]).all(1)))
    stepped_idx = np.arange(mask.size)
    stepped_idx[mask] = 0
    np.maximum.accumulate(stepped_idx, out=stepped_idx)
    return stepped_idx[argsort_unique(lexsort_idx)]

# @Warren Weckesser's rankdata-based method as a func for timings -
from scipy.stats import rankdata

def scipy_app(values):
    v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
    return rankdata(v, method='min') - 1
Timings -
In [97]: a = np.random.randint(0,9,(10000,3))
In [98]: out1 = numpy_app(a)
In [99]: out2 = scipy_app(a)
In [100]: np.allclose(out1, out2)
Out[100]: True
In [101]: %timeit scipy_app(a)
100 loops, best of 3: 5.32 ms per loop
In [102]: %timeit numpy_app(a)
100 loops, best of 3: 1.96 ms per loop
Here's a way to do it using scipy.stats.rankdata (with method='min'), by viewing the 2-d array as a 1-d structured array:
In [15]: values
Out[15]:
array([[1, 2, 3],
       [1, 1, 1],
       [2, 2, 3],
       [1, 2, 3],
       [1, 1, 2]])
In [16]: v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
In [17]: rankdata(v, method='min') - 1
Out[17]: array([2, 0, 4, 2, 1])
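Since the question also accepts continuous rank numbers, method='dense' should give that variant directly from the same structured view (a sketch):
rankdata(v, method='dense') - 1
# array([2, 0, 3, 2, 1])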
The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
       [1, 2]])
Note: For Python 2, it is itertools.izip_longest.
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: out
Out[28]:
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])
*Please note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not very computationally demanding, so it should have minimal effect on the total runtime.
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, for relatively larger datasets with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be decided on a case-by-case basis!
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# array([[1, 0],
# [1, 2]], dtype=int32)
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
       [1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)
You can convert to a pandas DataFrame first, and after that convert it to a numpy array:
import pandas as pd

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
#    0  1    2    3
# 0  1  2  3.0  NaN
# 1  4  5  NaN  NaN
# 2  6  7  8.0  9.0
npl = df.to_numpy()
print(npl)
# [[ 1.  2.  3. nan]
#  [ 4.  5. nan nan]
#  [ 6.  7.  8.  9.]]
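To get the dense int32 array with a 0 placeholder that the question asks for, you can fill the NaNs before converting (a sketch; npl_int is an illustrative name):
npl_int = df.fillna(0).to_numpy().astype(np.int32)
print(npl_int)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]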
I was getting a numpy broadcast error with Alexander's answer, so I added a small variation using numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates a tensor padded with 0.
or a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)
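If you need a NumPy array rather than a tensor, calling .numpy() on the result should work under TF 2.x eager execution (a sketch):
padded_v.numpy()
# array([[1, 0],
#        [1, 2]], dtype=int32)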
Consider two lists:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
I want a resulting list c where
c = [0, 0, 2, 0, 4, 5, 0 ,0 ,0 ,0]
is a list of length len(b) with values taken from b defined by indices specified in a and zeros elsewhere.
What is the most elegant way of doing this?
Use a list comprehension with a conditional expression and enumerate.
This LC iterates over the index and the value of the list b; if the index i is found within a, it sets the element to v, otherwise it sets it to 0.
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [v if i in a else 0 for i, v in enumerate(b)]
print(c)
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]
Note: If a is large then you may be best converting to a set first, before using in. The time complexity for using in with a list is O(n) whilst for a set it is O(1) (in the average case for both).
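For example, the set variant looks like this (a sketch; a_set is an illustrative name):
a_set = set(a)
c = [v if i in a_set else 0 for i, v in enumerate(b)]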
The list comprehension is roughly equivalent to the following code (for explanation):
c = []
for i, v in enumerate(b):
    if i in a:
        c.append(v)
    else:
        c.append(0)
As you have the option of using numpy, I've included a simple method below which initialises an array filled with zeros and then uses integer-array indexing to replace the elements.
import numpy as np
a2 = np.array(a)
b2 = np.array(b)
c = np.zeros(len(b2))
c[a2] = b2[a2]
When timing the three methods (my list comp, my numpy, and Jon's method) the following results are given for N = 1000, a = list(range(0, N, 10)), and b = list(range(N)).
In [170]: %timeit lc_func(a,b)
100 loops, best of 3: 3.56 ms per loop
In [171]: %timeit numpy_func(a2,b2)
100000 loops, best of 3: 14.8 µs per loop
In [172]: %timeit jon_func(a,b)
10000 loops, best of 3: 22.8 µs per loop
This is to be expected: the numpy function is fastest, but both Jon's function and the numpy one are much faster than the list comprehension. If I increase the number of elements to 100,000 then the gap between numpy and Jon's method gets even larger.
Interestingly enough though, for small N Jon's function is the best! I suspect this is because the overhead of creating numpy arrays outweighs any gain over plain lists at that size.
Moral of the story: large N? Go with numpy. Small N? Go with Jon.
The other option is to pre-initialise the target list with 0s (a fast operation), then overwrite the values at the suitable indices, e.g.:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0] * len(b)
for el in a:
    c[el] = b[el]
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]