I am seeing behaviour with numpy bincount that I cannot make sense of. I want to bin the values in a 2D array in a row-wise manner and see the behaviour below. Why would it work with dbArray but fail with simarray?
>>> dbArray
array([[1, 0, 1, 0, 1],
[1, 1, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 1, 0, 1, 0]])
>>> N.apply_along_axis(N.bincount,1,dbArray)
array([[2, 3],
[0, 5],
[1, 4],
[4, 1],
[3, 2],
[3, 2]], dtype=int64)
>>> simarray
array([[2, 0, 2, 0, 2],
[2, 1, 2, 1, 2],
[2, 1, 1, 1, 2],
[2, 0, 1, 0, 1],
[1, 0, 1, 1, 2],
[1, 1, 1, 1, 1]])
>>> N.apply_along_axis(N.bincount,1,simarray)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
N.apply_along_axis(N.bincount,1,simarray)
File "C:\Python27\lib\site-packages\numpy\lib\shape_base.py", line 118, in apply_along_axis
outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (3)
The problem is that bincount isn't always returning the same shaped objects, in particular when values are missing. For example:
>>> m = np.array([[0,0,1],[1,1,0],[1,1,1]])
>>> np.apply_along_axis(np.bincount, 1, m)
array([[2, 1],
[1, 2],
[0, 3]])
>>> [np.bincount(m[i]) for i in range(m.shape[0])]
[array([2, 1]), array([1, 2]), array([0, 3])]
works, but:
>>> m = np.array([[0,0,0],[1,1,0],[1,1,0]])
>>> m
array([[0, 0, 0],
[1, 1, 0],
[1, 1, 0]])
>>> [np.bincount(m[i]) for i in range(m.shape[0])]
[array([3]), array([1, 2]), array([1, 2])]
>>> np.apply_along_axis(np.bincount, 1, m)
Traceback (most recent call last):
File "<ipython-input-49-72e06e26a718>", line 1, in <module>
np.apply_along_axis(np.bincount, 1, m)
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/shape_base.py", line 117, in apply_along_axis
outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (1)
won't.
You could use the minlength parameter and pass it using a lambda or partial or something:
>>> np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=m)
array([[3, 0],
[1, 2],
[1, 2]])
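If you prefer functools.partial over a lambda, the same idea looks like this (a small sketch; the only assumption is that minlength is computed from the array's maximum so every row produces the same number of bins):
import numpy as np
from functools import partial

m = np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]])

# every row must produce bins for 0 .. m.max(), so all row results have equal length
bincount_fixed = partial(np.bincount, minlength=m.max() + 1)
counts = np.apply_along_axis(bincount_fixed, axis=1, arr=m)
print(counts)
# [[3 0]
#  [1 2]
#  [1 2]]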
As @DSM has already mentioned, bincount of a 2d array cannot be done without knowing the maximum value of the array, because otherwise the rows would produce arrays of inconsistent sizes.
But thanks to the power of numpy's indexing, it is fairly easy to write a faster implementation of a 2d bincount that doesn't rely on concatenation or anything like that.
import numpy as np

def bincount2d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    indexing = np.arange(len(arr))
    for col in arr.T:
        count[indexing, col] += 1
    return count
t = np.array([[1,2,3],[4,5,6],[3,2,2]], dtype=np.int64)
print(bincount2d(t))
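For this small example, bins defaults to t.max() + 1 == 7, so every row gets seven counters and the print above should show:
[[0 1 1 1 0 0 0]
 [0 0 0 0 1 1 1]
 [0 0 2 1 0 0 0]]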
P.S.
This:
import time

# use well-defined random data (np.empty would leave arbitrary values in the array)
t = np.random.randint(0, 100, size=[10000, 100], dtype=np.int64)
s = time.time()
bincount2d(t)
e = time.time()
print(e - s)
gives a result roughly 2 times faster than this:
t = np.random.randint(0, 100, size=[100, 10000], dtype=np.int64)
s = time.time()
bincount2d(t)
e = time.time()
print(e - s)
because the for loop iterates over the columns. So it's better to transpose your 2d array if shape[0] < shape[1].
UPD
I don't think you can do better than this (using Python alone, I mean):
def bincount2d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    indexing = (np.ones_like(arr).T * np.arange(len(arr))).T
    np.add.at(count, (indexing, arr), 1)
    return count
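As a quick sanity check (a sketch, reusing the t defined above), both versions agree with a per-row np.bincount:
t = np.array([[1, 2, 3], [4, 5, 6], [3, 2, 2]], dtype=np.int64)
expected = np.array([np.bincount(row, minlength=t.max() + 1) for row in t])
assert (bincount2d(t) == expected).all()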
This is a function that does exactly what you want, but without any loops.
def sub_sum_partition(a, partition):
    """
    Generalization of np.bincount(partition, a).
    Sums rows of a matrix for each value of an array of non-negative ints.

    :param a: array_like
    :param partition: array_like, 1 dimension, non-negative ints
    :return: array of shape (one larger than the largest value in partition, *a.shape[1:]).
        The i-th element is the sum of rows j in 'a' s.t. partition[j] == i
    """
    assert partition.shape == (len(a),)
    n = np.prod(a.shape[1:], dtype=int)
    bins = ((np.tile(partition, (n, 1)) * n).T + np.arange(n, dtype=int)).reshape(-1)
    sums = np.bincount(bins, a.reshape(-1))
    if n > 1:
        sums = sums.reshape(-1, *a.shape[1:])
    return sums
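A quick usage sketch (the values are illustrative):
a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
partition = np.array([0, 1, 0])   # rows 0 and 2 go to group 0, row 1 to group 1
print(sub_sum_partition(a, partition))
# [[6. 8.]
#  [3. 4.]]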
Related: How to get a list of all indices of repeated elements in a numpy array
I have a large NumPy integer array with a distinct set of values, e.g.,
[0, 1, 0, 0, 0, 2, 2]
From this, I would like to get all values along with a set of indices where they occur. The following works, but the explicit comparison == appears less than optimal to me.
import numpy as np
arr = [0, 1, 0, 0, 0, 2, 2]
vals = np.unique(arr)
d = {val: np.where(arr == val)[0] for val in vals}
print(d)
{0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}
Any better ideas?
Another solution:
arr = np.array([0, 1, 0, 0, 0, 2, 2])
a = arr.argsort()
v, cnt = np.unique(arr, return_counts=True)
x = dict(zip(v, np.split(a, cnt.cumsum()[:-1])))
print(x)
Prints:
{0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}
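One caveat worth noting: argsort is not stable by default, so within a group the indices are not formally guaranteed to come out in ascending order. If that matters, kind="stable" fixes it (a small sketch of the same solution):
arr = np.array([0, 1, 0, 0, 0, 2, 2])
a = arr.argsort(kind="stable")   # stable sort keeps equal values in original index order
v, cnt = np.unique(arr, return_counts=True)
x = dict(zip(v, np.split(a, cnt.cumsum()[:-1])))
# {0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}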
But the speed-up depends on your data (how big is the array, how many unique elements are in the array...)
Some benchmark (Ubuntu 20.04 on AMD 3700x, Python 3.9.7, numpy==1.21.5):
import numpy as np
import perfplot

NUM_UNIQUE_VALUES = 10

def make_data(n):
    return np.random.randint(0, NUM_UNIQUE_VALUES, n)

def k1(arr):
    vals = np.unique(arr)
    return {val: np.where(arr == val)[0] for val in vals}

def k2(arr):
    a = arr.argsort()
    v, cnt = np.unique(arr, return_counts=True)
    return dict(zip(v, np.split(a, cnt.cumsum()[:-1])))

perfplot.show(
    setup=make_data,
    kernels=[k1, k2],
    labels=["Nico", "Andrej"],
    equality_check=None,
    n_range=[2 ** k for k in range(1, 25)],
    xlabel="2**N",
    logx=True,
    logy=True,
)
With NUM_UNIQUE_VALUES = 10: (benchmark plot omitted)
With NUM_UNIQUE_VALUES = 1024: (benchmark plot omitted)
Getting bins from an array of 1 million elements (changing only the number of unique values):
def make_data(n):
    return np.random.randint(0, n, 1_000_000)
Here's an alternative, but I don't think this is any better. This creates an "index" array, inserts that as a second column, and sorts the rows. You'll see the final result has the values in order, with their original indices in the second column.
>>> arr = np.array([0,1,0,0,0,2,2])
>>> ndx = np.arange(arr.shape[0])
>>> ndx
array([0, 1, 2, 3, 4, 5, 6])
>>> both = np.vstack((arr,ndx)).T
>>> both
array([[0, 0],
[1, 1],
[0, 2],
[0, 3],
[0, 4],
[2, 5],
[2, 6]])
>>> both[both[:,0].argsort()]
array([[0, 0],
[0, 2],
[0, 3],
[0, 4],
[1, 1],
[2, 5],
[2, 6]])
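If you do want the same dict as in the question from this layout, one way (a sketch, reusing both from the session above; the stable sort keeps the original index order within each value) is to split the index column wherever the value column changes:
srt = both[both[:, 0].argsort(kind="stable")]
splits = np.flatnonzero(np.diff(srt[:, 0])) + 1   # positions where the value changes
d = dict(zip(srt[np.r_[0, splits], 0], np.split(srt[:, 1], splits)))
# {0: array([0, 2, 3, 4]), 1: array([1]), 2: array([5, 6])}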
I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...
array([[1, 1, 0, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 1, 2],
[0, 1, 3, 2],
[1, 1, 0, 1],
[0, 1, 0, 0]])
With this in mind, I want to reformat this array into subarrays based on the first two columns. Using How to split a numpy array based on a column? as a reference, I made this array into a list of arrays with ...
import pandas as pd

df = pd.DataFrame(array)
df['4'] = df[0].astype(str) + df[1].astype(str)
df['4'] = df['4'].astype(int)
arr = df.to_numpy()
y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])]
where y is ...
[array([[0, 0, 1, 2, 0]]),
array([[0, 1, 0, 0, 1],
[0, 1, 3, 2, 1],
[0, 1, 0, 0, 1]]),
array([[ 1, 1, 0, 1, 11],
[ 1, 1, 1, 0, 11],
[ 1, 1, 0, 1, 11]])]
This works fine but it takes far too long for y to run. The amount of time it takes increases exponentially with every row. I am playing around with hundreds of millions of rows and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.
Any ideas on how to speed this up?
What about using the numpy_indexed library:
import numpy as np
import numpy_indexed as npi
a = np.array([[1, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 2],
              [0, 1, 3, 2],
              [1, 1, 0, 1],
              [0, 1, 0, 0]])
key = np.dot(a[:,:2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)
Output
y
[array([[0, 0, 1, 2]]),
array([[0, 1, 0, 0],
[0, 1, 3, 2],
[0, 1, 0, 0]]),
array([[ 1, 1, 0, 1],
[ 1, 1, 1, 0],
[ 1, 1, 0, 1]])]
You can easily install the library with:
> pip install numpy-indexed
Let me know if this performs better.
from collections import defaultdict
import numpy as np

outgen = defaultdict(list)

# arr: the input numpy array, :type: np.ndarray
c = map(lambda x: ((x[0], x[1]), x), arr)
for key, val in c:
    outgen[key].append(val)

# outgen: the required output, :type: list[np.ndarray]
outgen = [np.array(x) for x in outgen.values()]
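The same idea wrapped up as a function, for easier benchmarking against the other answers (group_rows is just an illustrative name):
from collections import defaultdict
import numpy as np

def group_rows(arr):
    # group rows by their (col0, col1) pair, preserving first-appearance order
    groups = defaultdict(list)
    for row in arr:
        groups[(row[0], row[1])].append(row)
    return [np.array(rows) for rows in groups.values()]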
You can use np.unique directly here.
unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
{tuple(key): arr[indexer == i, :] for i, key in enumerate(unique)}
This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays you could sort it by the unique key and then work with slices. This might be helpful if there are many unique values leading to a long list.
arr[:] = arr[np.argsort(indexer), :] # not sure if this is guaranteed to preserve the order within each group
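A sketch of what the slice-based bookkeeping could look like (group boundaries come from the per-key counts; kind="stable" keeps the within-group row order):
unique, indexer, counts = np.unique(arr[:, :2], axis=0,
                                    return_inverse=True, return_counts=True)
sorted_arr = arr[np.argsort(indexer, kind="stable")]
bounds = np.r_[0, counts.cumsum()]
# rows belonging to unique[k] are sorted_arr[bounds[k]:bounds[k + 1]]
groups = [sorted_arr[bounds[k]:bounds[k + 1]] for k in range(len(unique))]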
EDIT:
Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8ms for 1 million rows of single digit integers (vs > 100ms for np.unique).
import pandas as pd
from pandas.core.sorting import get_group_index  # pandas internal

columns = x[:, 0], x[:, 1]
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, list(map(len, unique_values)), sort=False, xnull=False)
It uses the internal algorithm of DataFrame.drop_duplicates.
Note that the ordering of the keys is not the sort order of the unique tuples.
There is also a newer open source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. The creation of the key takes around 4ms:
import riptable as rt
columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)
Here, unique_values is a tuple containing two arrays which can be zipped to get the unique tuples
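For example (a small sketch), the unique pairs can be recovered with:
unique_tuples = list(zip(*unique_values))   # e.g. [(0, 0), (0, 1), (1, 1)]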
I want to create a matrix from a function, such that the (3,3) matrix C has values equal to 1 if the row index is smaller than a given threshold k.
import numpy as np
k = 3
C = np.fromfunction(lambda i,j: 1 if i < k else 0, (3,3))
However, this piece of code throws the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()", and I do not really understand why.
The code for fromfunction is:
dtype = kwargs.pop('dtype', float)
args = indices(shape, dtype=dtype)
return function(*args, **kwargs)
You see it calls function just once - with the whole array of indices. It is not iterative.
In [672]: idx = np.indices((3,3))
In [673]: idx
Out[673]:
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
Your lambda expects scalar i, j values, not whole index arrays:
lambda i,j: 1 if i < k else 0
idx < 3 is a 3d boolean array. The error arises when that is used in an if context.
np.vectorize or np.frompyfunc is better if you want to apply a scalar function to a set of arrays:
In [677]: np.vectorize(lambda i,j: 1 if i < 2 else 0)(idx[0],idx[1])
Out[677]:
array([[1, 1, 1],
[1, 1, 1],
[0, 0, 0]])
However it isn't faster than more direct iterative approaches, and it is way slower than functions that operate on whole arrays.
One of many whole-array approaches:
In [680]: np.where(np.arange(3)[:,None]<2, np.ones((3,3),int), np.zeros((3,3),int))
Out[680]:
array([[1, 1, 1],
[1, 1, 1],
[0, 0, 0]])
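An even shorter whole-array variant (a sketch of the same idea) is to broadcast the row condition and cast the boolean result:
mask = np.arange(3)[:, None] < 2              # column vector of row conditions, shape (3, 1)
np.broadcast_to(mask, (3, 3)).astype(int)
# array([[1, 1, 1],
#        [1, 1, 1],
#        [0, 0, 0]])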
As suggested by @MarkSetchell, you need to vectorize your function:
k = 3
f = lambda i,j: 1 if i < k else 0
C = np.fromfunction(np.vectorize(f), (3,3))
and you get:
C
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
The problem is that np.fromfunction does not iterate over all elements; it calls your function once, with an array of indices for each dimension. You can use np.where() to apply a condition based on those indices, choosing from two alternatives depending on the condition:
import numpy as np
k = 3
np.fromfunction(lambda i, j: np.where(i < k, 1, 0), (5,3))
which gives:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[0, 0, 0],
[0, 0, 0]])
This avoids naming the lambda, without things becoming too unwieldy. On my laptop, this approach was about 20 times faster than np.vectorize().
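If you want to check that timing claim on your own machine, something like this works (a sketch; the shape is just an example size):
import timeit
import numpy as np

k = 3
shape = (500, 500)
t_where = timeit.timeit(
    lambda: np.fromfunction(lambda i, j: np.where(i < k, 1, 0), shape), number=5)
t_vectorize = timeit.timeit(
    lambda: np.fromfunction(np.vectorize(lambda i, j: 1 if i < k else 0), shape), number=5)
print(t_where, t_vectorize)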
I have an array that I've labeled using scipy.ndimage and I'd like to multiply each element by a factor specific to its corresponding label. I thought I could use ndimage.labeled_comprehension for this, however I can't seem to figure out how to pass an argument to the function. For example:
import numpy as np
from scipy import ndimage

a = np.random.random(9).reshape(3,3)
lbls = np.repeat(np.arange(3),3).reshape(3,3)
ndx = np.arange(0,lbls.max()+1)
factors = np.random.randint(10,size=3)
>>> lbls
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> ndx
array([0, 1, 2])
>>> factors
array([5, 4, 8])
def fn(a, x):
    return a*x
>>> b = ndimage.labeled_comprehension(a, labels=lbls, index=ndx, func=fn, out_dtype=float, default=0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/tgrant/anaconda/envs/python2/lib/python2.7/site-packages/scipy/ndimage/measurements.py", line 416, in labeled_comprehension
do_map([input], temp)
File "/Users/tgrant/anaconda/envs/python2/lib/python2.7/site-packages/scipy/ndimage/measurements.py", line 411, in do_map
output[i] = func(*[inp[l:h] for inp in inputs])
TypeError: fn() takes exactly 2 arguments (1 given)
As expected it gives an error since fn() needs factors fed into it somehow. Is labeled_comprehension able to do this?
Index into factors and then simply multiply with the image array -
a*factors[lbls]
Sample run -
In [483]: a # image/data array
Out[483]:
array([[ 0.10682998, 0.29631501, 0.08501469],
[ 0.46944505, 0.88346229, 0.75672908],
[ 0.11381292, 0.24096868, 0.86438641]])
In [484]: factors # scaling factors
Out[484]: array([8, 1, 1])
In [485]: lbls # labels
Out[485]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
In [486]: factors[lbls] # factors populated based on the labels
Out[486]:
array([[8, 8, 8],
[1, 1, 1],
[1, 1, 1]])
In [487]: a*factors[lbls] # finally scale the image array
Out[487]:
array([[ 0.85463981, 2.37052006, 0.68011752],
[ 0.46944505, 0.88346229, 0.75672908],
[ 0.11381292, 0.24096868, 0.86438641]])
Say you have two matrices, A is 2x2 and B is 2x7 (2 rows, 7 columns). I want to create a matrix C of shape 2x7, out of copies of A. The problem is np.hstack only understands situations where the column numbers divide (say 2 and 8, thus you can easily stack 4 copies of A to get C), but what about when they do not? Any ideas?
A = [[0,1]      B = [[1,2,3,4,5,6,7],      C = [[0,1,0,1,0,1,0],
     [2,3]]          [1,2,3,4,5,6,7]]           [2,3,2,3,2,3,2]]
Here's an approach with modulus -
In [23]: ncols = 7 # No. of cols in output array
In [24]: A[:,np.mod(np.arange(ncols),A.shape[1])]
Out[24]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
Or with % operator -
In [27]: A[:,np.arange(ncols)%A.shape[1]]
Out[27]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
For such repeated indices, using np.take would be more performant -
In [29]: np.take(A, np.arange(ncols)%A.shape[1], axis=1)
Out[29]:
array([[0, 1, 0, 1, 0, 1, 0],
[2, 3, 2, 3, 2, 3, 2]])
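np.tile plus a slice gives the same result (a sketch, reusing A and ncols from above; note it does materialize the extra columns before trimming):
reps = -(-ncols // A.shape[1])          # ceiling division: enough copies of A
np.tile(A, (1, reps))[:, :ncols]
# array([[0, 1, 0, 1, 0, 1, 0],
#        [2, 3, 2, 3, 2, 3, 2]])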
A solution without numpy (although the np solution posted above is a lot nicer):
A = [[0,1],
     [2,3]]
B = [[1,2,3,4,5,6,7],
     [1,2,3,4,5,6,7]]

i_max, j_max = len(A), len(A[0])
C = []
for i, line_b in enumerate(B):
    line_c = [A[i % i_max][j % j_max] for j, _ in enumerate(line_b)]
    C.append(line_c)
print(C)
The first solution is very nice. Another possible way would be to still use hstack; if you don't want the pattern repeated fully, you can use array slicing to get just the values you need:
a.shape  # (2, 2)
b.shape  # (2, 7)
repeats = int(np.ceil(b.shape[1] / a.shape[1]))
c = np.hstack([a] * repeats)[:, :b.shape[1]]   # stack enough copies, then trim the extra columns
c
array([[0, 1, 0, 1, 0, 1, 0],
       [2, 3, 2, 3, 2, 3, 2]])