Counting occurrences of elements of one array given another array - python

I want to find the frequency of the values of one array (arr1) given another array (arr2). They are both one-dimensional, and arr2 is sorted and has no repeating elements.
Example:
arr1 = np.array([1, 0, 3, 0, 3, 0, 3, 0, 8, 0, 1, 8, 0])
arr2 = np.array([0, 1, 2, 8])
The output should be: freq = np.array([6, 2, 0, 2])
What I was trying was this:
arr2, freq = np.unique(arr1, return_counts=True)
But this method doesn't output values that have frequency of 0.

One way to do it can be like below:
import numpy as np
arr1 = np.array([1, 0, 3, 0, 3, 0, 3, 0, 8, 0, 1, 8, 0])
arr2 = np.array([0, 1, 2, 8])
arr3, freq = np.unique(arr1, return_counts=True)
dict_ = dict(zip(arr3, freq))
freq = np.array([dict_[i] if i in dict_ else 0 for i in arr2])
freq
Output:
[6, 2, 0, 2]
Alternative One-liner Solution
import numpy as np
arr1 = np.array([1, 0, 3, 0, 3, 0, 3, 0, 8, 0, 1, 8, 0])
arr2 = np.array([0, 1, 2, 8])
freq = np.array([np.count_nonzero(arr1 == i) for i in arr2])
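Both answers above loop over arr2 in Python. Because arr2 is sorted and has no repeats, a loop-free variant is also possible. The following is only a sketch, assuming np.searchsorted and np.bincount (neither is used in the answers above); the names pos and hit are illustrative:

```python
import numpy as np

arr1 = np.array([1, 0, 3, 0, 3, 0, 3, 0, 8, 0, 1, 8, 0])
arr2 = np.array([0, 1, 2, 8])  # must be sorted and free of duplicates

# For every value of arr1, find the slot in arr2 where it would be inserted.
pos = np.searchsorted(arr2, arr1)
# Values larger than arr2[-1] get slot len(arr2); clip so indexing stays valid.
pos = np.minimum(pos, len(arr2) - 1)
# A value is really present in arr2 only if the slot holds that same value.
hit = arr2[pos] == arr1
# Count the hits per slot; minlength keeps the zero-frequency entries.
freq = np.bincount(pos[hit], minlength=len(arr2))
print(freq)
```

Values of arr1 that do not occur in arr2 (such as 3 here) are dropped by the hit mask, so they do not affect the counts.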

Related

Python Numpy - Create 2d array where length is based on 1D array

Sorry for the confusing title, but I'm not sure how to make it more concise. Here are my requirements:
arr1 = np.array([3,5,9,1])
arr2 = ?(arr1)
arr2 would then be:
[
[0,1,2,0,0,0,0,0,0],
[0,1,2,3,4,0,0,0,0],
[0,1,2,3,4,5,6,7,8],
[0,0,0,0,0,0,0,0,0]
]
It doesn't need to vary based on the max, the shape is known in advance. So to start I've been able to get a shape of zeros:
arr2 = np.zeros((len(arr1),max_len))
And then of course I could do a for loop over arr1 like this:
for i, element in enumerate(arr1):
    arr2[i,0:element] = np.arange(element)
but that would likely take a long time and both dimensions here are rather large (arr1 is a few million rows, max_len is around 500). Is there a clean optimized way to do this in numpy?
Building on a 'padding' idea posted by @Divakar some years ago:
In [161]: res = np.arange(9)[None,:].repeat(4,0)
In [162]: res[res>=arr1[:,None]] = 0
In [163]: res
Out[163]:
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6, 7, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
Try this with itertools.zip_longest -
import numpy as np
import itertools
l = map(range, arr1)
# list() is needed because recent NumPy versions reject a bare iterator here
arr2 = np.column_stack(list(itertools.zip_longest(*l, fillvalue=0)))
print(arr2)
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6, 7, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
I am adding a slight variation on @hpaulj's answer because you mentioned that max_len is around 500 and you have millions of rows. In this case, you can precompute a 500 by 500 matrix containing all possible rows and index into it using arr1:
import numpy as np
np.random.seed(0)
max_len = 500
arr = np.random.randint(0, max_len, size=10**5)
# generate all unique rows first, then index
# can be faster if max_len << len(arr)
# 53 ms
template = np.tril(np.arange(max_len)[None,:].repeat(max_len,0), k=-1)
res = template[arr,:]
# 173 ms
res1 = np.arange(max_len)[None,:].repeat(arr.size,0)
res1[res1>=arr[:,None]] = 0
assert (res == res1).all()
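The same padded result can also be built in one expression with broadcasting and np.where, which avoids the in-place masking step. This is a sketch only; max_len is set to 9 to match the question's example, and the name cols is illustrative:

```python
import numpy as np

arr1 = np.array([3, 5, 9, 1])
max_len = 9  # assumed to be known in advance, as stated in the question

cols = np.arange(max_len)
# cols has shape (max_len,) and arr1[:, None] has shape (len(arr1), 1),
# so the comparison broadcasts to shape (len(arr1), max_len).
# Keep the column index where it is below the row's length, else write 0.
arr2 = np.where(cols < arr1[:, None], cols, 0)
print(arr2)
```

Whether this or the precomputed template is faster depends on the sizes involved, so it is worth timing both on the real data.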

Replace a single value with multiple values

Given a mask numpy array such as:
mask = np.array([0, 0, 1, 0, 0, 0, 1, ...])
I want to replace each 1 with a target vector. Example:
target = np.array([5, 4, 3, 2, 1])
mask = np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,...])
output = np.array([0, 0, 5, 4, 3, 2, 1, 0, 5, 4, 3, 2, ...])
# Overlaps:
mask = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0,...])
output = np.array([0, 0, 5, 4, 3, 2, 5, 4, 3, 2, ...])
Naively, one can write this via the following (ignoring boundary problems):
output = np.zeros_like(mask)
for i, x in enumerate(mask):
    if x == 1:
        output[i:i+len(target)] = target
I'm wondering whether this is possible without resorting to a for loop.
numpy supports assigning value for the same index multiple times in one go, like so:
mask = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
padding_idx = [2,3,4,5,6,5,6,7,8,9,8,9,10,11]
padding_values = [5,4,3,2,1,5,4,3,2,1,5,4,3,2]
mask[padding_idx] = padding_values
>>> mask
array([0, 0, 5, 4, 3, 5, 4, 3, 5, 4, 3, 2])
You just need to find out padding_idx and padding_values.
Note that padding_values = [5,4,3,2,1,5,4,3,2,1,5,4,3,2] has one value missing, because the last copy of the target runs past the end of mask. So you also need to drop the indices that fall out of bounds. After that you can use broadcasting:
vector = np.array([5,4,3,2,1])
N = len(vector)
mask = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
idx = np.flatnonzero(mask)
#Broadcast
padding_idx = idx[:,None] + np.arange(N)
padding_values = np.repeat(vector[np.newaxis, :], len(idx), axis=0)
#Flatten
padding_idx = padding_idx.ravel()
padding_values = padding_values.ravel()
#Keep only the positions that lie inside mask
valid = padding_idx < len(mask)
#Go!
mask[padding_idx[valid]] = padding_values[valid]
>>> mask
array([0, 0, 5, 4, 3, 5, 4, 3, 5, 4, 3, 2])
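Not a full answer, but some thoughts: The for loop runs n Python-level iterations, where n = len(mask). We can use np.split to get that down to about k iterations, where k = number of 1s in mask:
def set_target(mask, target):
    i, = np.where(mask == 1)
    # np.split cuts before every 1, so the first chunk holds
    # whatever precedes the first 1 and must stay unchanged
    head, *chunks = np.split(mask, i)
    output = [head]
    for split in chunks:
        if len(split) > len(target):
            split = split.copy()  # avoid writing into mask itself
            split[:len(target)] = target
            output.append(split)
        else:
            output.append(target[:len(split)])
    return np.concatenate(output, 0)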
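A fully loop-free alternative is to compute, for every position, how far it is from the most recent 1 and use that distance as an index into the target. The sketch below is not taken from the answers above; it assumes np.maximum.accumulate for the running "last seen" index, and the function name stamp_target is hypothetical:

```python
import numpy as np

def stamp_target(mask, target):
    """Write `target` starting at every 1 in `mask`; later copies overwrite earlier ones."""
    mask = np.asarray(mask)
    target = np.asarray(target)
    positions = np.arange(len(mask))
    # Index of the most recent 1 at or before each position (-1 if none yet).
    last_one = np.maximum.accumulate(np.where(mask == 1, positions, -1))
    # Distance from that 1; this is the index into target.
    offset = positions - last_one
    # A position receives a target value only if a 1 has been seen
    # and the distance is still within the target's length.
    inside = (last_one >= 0) & (offset < len(target))
    output = np.zeros(len(mask), dtype=target.dtype)
    output[inside] = target[offset[inside]]
    return output

target = np.array([5, 4, 3, 2, 1])
print(stamp_target(np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]), target))
print(stamp_target(np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0]), target))
```

Because each position only looks at the latest 1, overlapping copies and the truncation at the end of the array need no special handling.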

change gaps in numpy array according to gap size

I need to filter out short nonzero series that lie between zeros. For example, this array:
t = np.array([1, 3, 1, 0, 0, 1, 8, 3, 0, 8, 2, 4, 7, 0,0,4,1])
should become:
array([1, 3, 1, 0, 0, 0, 0, 0, 0, 8, 2, 4, 7, 0, 0, 4, 1])
I found the first indices of the nonzero sequences and counted the number of nonzeros between them. I wrote the following. It works, but it looks awful. I tried other stuff but got errors.
How can I rewrite it in a more Pythonic way?
minseq = 4 # length of minimal non zero seq
p = np.where(fhr>0, 1, 0).astype(int)
s = np.array([1]+ list(np.diff(p)))
sind = np.where(s==1)[0][1:]
print(sind)
for i in range(len(sind) - 1):
    s1 = sind[i]
    e1 = sind[i+1]
    subfhr = np.where(fhr[s1:e1] > 0, 1, 0).sum()
    if (subfhr < minseq):
        print(s1, e1, subfhr)
        fhr[s1:e1] = 0
out:
[ 5 9 15]
5 9 3
array([1, 3, 1, 0, 0, 0, 0, 0, 0, 8, 2, 4, 7, 0, 0, 4, 1])
You can use image-processing based binary_closing -
import numpy as np
from scipy.ndimage import binary_closing
def remove_small_nnz(a, W):
    K = np.ones(W, dtype=int)
    m = a==0
    p = binary_closing(m,K)
    a[~m & p] = 0
    return a
Sample run -
In [97]: a
Out[97]: array([1, 3, 1, 0, 0, 1, 8, 3, 0, 8, 2, 4, 7, 0, 0, 4, 1])
In [98]: remove_small_nnz(a, W=3)
Out[98]: array([1, 3, 1, 0, 0, 1, 8, 3, 0, 8, 2, 4, 7, 0, 0, 4, 1])
In [99]: remove_small_nnz(a, W=4)
Out[99]: array([1, 3, 1, 0, 0, 0, 0, 0, 0, 8, 2, 4, 7, 0, 0, 4, 1])
In [100]: remove_small_nnz(a, W=5)
Out[100]: array([1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1])
Since you're only looking for nonzeros, you can cast the array to boolean, and look for spots where there is a sequence of however many Trues in a row as you're looking for.
import numpy as np
def orig(fhr, minseq):
    p = np.where(fhr>0, 1, 0).astype(int)
    s = np.array([1]+ list(np.diff(p)))
    sind = np.where(s==1)[0][1:]
    for i in range(len(sind) - 1):
        s1 = sind[i]
        e1 = sind[i+1]
        subfhr = np.where(fhr[s1:e1] > 0, 1, 0).sum()
        if (subfhr < minseq):
            fhr[s1:e1] = 0
    return fhr
def update(fhr, minseq):
    # convert the sequence to boolean
    nonzero = fhr.astype(bool)
    # stack the boolean array with lagged copies of itself
    seqs = np.stack([nonzero[i:-minseq+i] for i in range(minseq)],
                    axis=1)
    # find the spots where the sequence is long enough
    # (np.bool was removed from recent NumPy versions, so use the builtin bool)
    inseq = np.r_[np.zeros(minseq, bool), seqs.sum(axis=1) == minseq]
    # the start and end of the series are assumed to be included in the result
    inseq[minseq] = True
    inseq[-1] = True
    # make sure that the full sequence is included.
    # There may be a way to vectorize this further
    for ind in np.where(inseq)[0]:
        inseq[ind-minseq:ind] = True
    # Apply the inseq array as a mask
    return inseq * fhr
fhr = np.array([1, 3, 1, 0, 0, 1, 8, 3, 0, 8, 2, 4, 7, 0,0,4,1])
minseq = 4
# orig() modifies its argument in place, so pass it a copy
print(np.all(orig(fhr.copy(), minseq) == update(fhr, minseq)))
# True
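Another way to look at the problem is run-length based: locate the start and end of every nonzero run with np.diff, then zero the runs that are shorter than minseq and do not touch either end of the array. The following is a sketch under that reading of the question (leading and trailing runs are always kept, as in the expected output); the function name drop_short_runs is hypothetical:

```python
import numpy as np

def drop_short_runs(a, minseq):
    """Zero out nonzero runs shorter than minseq that lie strictly between zeros."""
    a = np.asarray(a)
    nz = a != 0
    # Pad with False on both sides so every run has a rising and a falling edge.
    edges = np.diff(np.concatenate(([False], nz, [False])).astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first index of each run
    ends = np.flatnonzero(edges == -1)    # one past the last index of each run
    short = (ends - starts) < minseq
    # Runs touching the start or the end of the array are kept regardless.
    short &= (starts > 0) & (ends < len(a))
    # Mark the short runs with +1 / -1 and integrate to get a removal mask.
    delta = np.zeros(len(a) + 1, dtype=int)
    delta[starts[short]] = 1
    delta[ends[short]] = -1
    remove = np.cumsum(delta[:-1]) > 0
    out = a.copy()
    out[remove] = 0
    return out

t = np.array([1, 3, 1, 0, 0, 1, 8, 3, 0, 8, 2, 4, 7, 0, 0, 4, 1])
print(drop_short_runs(t, 4))
```

Unlike the in-place versions above, this returns a new array and leaves the input untouched.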

How can I get the indexes of a numpy array that contain ones

How can I get the indexes of the elements that equal 1 in a numpy array, in an elegant way?
I tried to do a loop:
indexes = []
for i in range(len(array)):
    if array[i] == 1:
        indexes += [i]
Use np.where:
a = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0])
np.where(a)
Output:
(array([2, 3, 5, 6, 7], dtype=int64),)
Or np.nonzero:
a.nonzero()
Output:
(array([2, 3, 5, 6, 7], dtype=int64),)
You can also index into np.arange:
np.arange(len(a))[a.astype(bool)]
Output:
array([2, 3, 5, 6, 7])
numpy.argwhere() also works well here. For a 1-D input it returns an array of shape (N, 1), so we also have to remove the singleton dimension using squeeze(). Below are two cases:
If your input is a 0-1 array, then:
In [101]: a = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0])
In [102]: np.argwhere(a).squeeze()
Out[102]: array([2, 3, 5, 6, 7])
On the other hand, if you have a generic array, then:
In [98]: np.random.seed(23)
In [99]: arr = np.random.randint(0, 5, 10)
In [100]: arr
Out[100]: array([3, 0, 1, 0, 4, 3, 2, 1, 3, 3])
In [106]: np.argwhere(arr == 1).squeeze()
Out[106]: array([2, 7])
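One more option, not mentioned in the answers above, is np.flatnonzero, which always returns a flat 1-D index array. A small sketch (the second array is made up for illustration) that also shows why this can matter when there is exactly one match:

```python
import numpy as np

a = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0])
idx = np.flatnonzero(a == 1)
print(idx)

# With a single match, flatnonzero still gives a 1-D array of length 1,
# whereas argwhere(...).squeeze() collapses to a 0-d array.
b = np.array([0, 1, 0])
single = np.flatnonzero(b == 1)
squeezed = np.argwhere(b == 1).squeeze()
print(single.shape, squeezed.shape)
```

Comparing with `== 1` rather than relying on truthiness keeps the result correct for arrays that contain values other than 0 and 1.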

Count how often integer y occurs right after integer x in a numpy array

I have a very large numpy.array of integers, where each integer is in the range [0, 31].
I would like to count, for every pair of integers (a, b) in the range [0, 31] (e.g. [0, 1], [7, 9], [18, 0]) how often b occurs right after a.
This would give me a (32, 32) matrix of counts.
I'm looking for an efficient way to do this with numpy. Raw python loops would be too slow.
Here's one way...
To make the example easier to read, I'll use a maximum value of 9 instead of 31:
In [178]: maxval = 9
Make a random input for the example:
In [179]: np.random.seed(123)
In [180]: x = np.random.randint(0, maxval+1, size=100)
Create the result, initially all 0:
In [181]: counts = np.zeros((maxval+1, maxval+1), dtype=int)
Now add 1 to each coordinate pair, using numpy.add.at to ensure that duplicates are counted properly:
In [182]: np.add.at(counts, (x[:-1], x[1:]), 1)
In [183]: counts
Out[183]:
array([[2, 1, 1, 0, 1, 0, 1, 1, 1, 1],
[2, 1, 1, 3, 0, 2, 1, 1, 1, 1],
[0, 2, 1, 1, 4, 0, 2, 0, 0, 0],
[1, 1, 1, 3, 3, 3, 0, 0, 1, 2],
[1, 1, 0, 1, 1, 0, 2, 2, 2, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 2],
[0, 4, 2, 3, 1, 0, 2, 1, 0, 1],
[0, 1, 1, 1, 0, 0, 2, 0, 0, 3],
[1, 2, 0, 1, 0, 0, 1, 0, 0, 0],
[2, 0, 2, 2, 0, 0, 2, 2, 0, 0]])
For example, the number of times 6 is followed by 1 is
In [188]: counts[6, 1]
Out[188]: 4
We can verify that with the following expression:
In [189]: ((x[:-1] == 6) & (x[1:] == 1)).sum()
Out[189]: 4
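An alternative to np.add.at is to encode each consecutive pair as a single integer and count the codes with np.bincount. This is a sketch rather than a benchmarked recommendation; the random input and the names n and codes are illustrative, and the cast to int64 is an assumption that guards against overflow when the input uses a narrow dtype such as uint8:

```python
import numpy as np

n = 32  # number of distinct values, i.e. integers in [0, 31]
rng = np.random.default_rng(0)
x = rng.integers(0, n, size=10_000)

# Widen the dtype first so that a * n + b cannot overflow.
x64 = x.astype(np.int64)
# Encode each consecutive pair (a, b) as the single integer a * n + b.
codes = x64[:-1] * n + x64[1:]
# Count every code; minlength keeps pairs that never occur.
counts = np.bincount(codes, minlength=n * n).reshape(n, n)

# Cross-check against the np.add.at approach.
expected = np.zeros((n, n), dtype=int)
np.add.at(expected, (x[:-1], x[1:]), 1)
print((counts == expected).all())
```

As before, counts[a, b] is the number of times b occurs right after a.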
You can use numpy's built-in diff routine together with boolean arrays.
import numpy as np
test_array = np.array([1, 2, 3, 1, 2, 4, 5, 1, 2, 6, 7])
a, b = (1, 2)
sum(np.bitwise_and(test_array[:-1] == a, np.diff(test_array) == b - a))
# 3
If your array is multi-dimensional, you will need to flatten it first or make some small modifications to the code above.
