Count Overlap Between Neighboring Indices in NumPy Array - python

I have a NumPy array of integers:
x = np.array([1, 0, 2, 1, 4, 1, 4, 1, 0, 1, 4, 3, 0, 1, 0, 2, 1, 4, 3, 1, 4, 1, 0])
and another array of indices that references the array above:
indices = np.array([22, 12, 8, 1, 14, 21, 7, 0, 13, 19, 5, 3, 9, 16, 2, 15, 11, 18, 20, 6, 4, 10, 17])
For every pair of neighboring indices, I need to count how many consecutive values in x overlap, starting at the two referenced positions. For example, indices[2] and indices[3] are 8 and 1, respectively, and both reference positions in x. Starting at x[8] and x[1], we count how many consecutive values are equal, stopping under specific conditions (see below). In other words, we check:
x[8] == x[1]
x[9] == x[2] # increment each index by one
...          # keep incrementing both indices until one of the following holds:
stop if i >= x.shape[0]
stop if j >= x.shape[0]
stop if x[i] == 0
stop if x[j] == 0
stop if x[i] != x[j]
In reality, we do this for all neighboring index pairs:
out = np.zeros(indices.shape[0], dtype=int)
for idx in range(indices.shape[0]-1):
    count = 0
    i = indices[idx]
    j = indices[idx + 1]
    k = 0
    # while i+k < x.shape[0] and j+k < x.shape[0] and x[i+k] != 0 and x[j+k] != 0 and x[i+k] == x[j+k]:
    while i+k < x.shape[0] and j+k < x.shape[0] and x[i+k] == x[j+k]:
        count += 1
        k += 1
    out[idx] = k
And the output is:
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 2, 3, 0, 3, 0, 1, 0, 2, 2, 1, 2, 0]  # old output, when the x[i] == 0 and x[j] == 0 conditions are included
[1 2 1 4 0 2 2 5 1 4 3 2 3 0 3 0 1 0 3 2 1 2 0]
I'm looking for a vectorized way to do this in NumPy.

This should do the trick (I am ignoring the two conditions x[i] == 0 and x[j] == 0):
out = np.zeros(indices.shape[0], dtype=int)
for idx in range(indices.shape[0]-1):
    i = indices[idx]
    j = indices[idx + 1]
    l = len(x) - max(i, j)
    x1 = x[i:i+l]
    x2 = x[j:j+l]
    # Add False at the end to handle the case in which the slices are exactly the same
    x0 = np.append(x1 == x2, False)
    out[idx] = np.argmin(x0)
Notice that with np.argmin I am exploiting the following two facts:
False < True
np.argmin only returns the first instance of the min in the array
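A quick demonstration of both facts (a minimal sketch with made-up data):
import numpy as np
# On a boolean array, argmin returns the position of the first False,
# i.e. the length of the initial run of matches.
print(np.argmin(np.array([True, True, False, True])))  # 2
# On an all-True array argmin returns 0, which is why a trailing
# False is appended above before calling argmin.
print(np.argmin(np.append(np.array([True, True]), False)))  # 2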
Performance Analysis
Regarding time performance, I tested with N=10**5 and N=10**6, and as suggested in the comments, this cannot compete with numba jit.
def f(x, indices):
    out = np.zeros(indices.shape[0], dtype=int)
    for idx in range(indices.shape[0]-1):
        i = indices[idx]
        j = indices[idx + 1]
        l = len(x) - max(i, j)
        x1 = x[i:i+l]
        x2 = x[j:j+l]
        x0 = np.append(x1 == x2, False)
        out[idx] = np.argmin(x0)
    return out
N=100_000
x = np.random.randint(0,10, N)
indices = np.arange(0, N)
np.random.shuffle(indices)
%timeit f(x, indices)
3.67 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
N=1_000_000
x = np.random.randint(0,10, N)
indices = np.arange(0, N)
np.random.shuffle(indices)
%time f(x, indices)
Wall time: 8min 20s
(I did not have the patience to let %timeit finish)
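For reference, a sketch of the numba version mentioned in the comments (assuming numba is installed; it simply JIT-compiles the original pairwise loop from the question):
import numba
import numpy as np

@numba.njit
def f_numba(x, indices):
    # Same loop as in the question, compiled to machine code by numba
    out = np.zeros(indices.shape[0], dtype=np.int64)
    for idx in range(indices.shape[0] - 1):
        i = indices[idx]
        j = indices[idx + 1]
        k = 0
        while i + k < x.shape[0] and j + k < x.shape[0] and x[i + k] == x[j + k]:
            k += 1
        out[idx] = k
    return out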


How to find the max array from both sides

Given an integer array A, I need to pick B elements from either left or right end of the array A to get maximum sum. If B = 4, then you can pick the first four elements or the last four elements or one from front and three from back etc.
Example input:
A = [5, -2, 3, 1, 2]
B = 3
The correct answer is 8 (by picking 5 from the left, and 1 and 2 from the right).
My code:
def solve(A, B):
    n = len(A)
    # track left-most index and right-most index i, j
    i = 0
    j = n - 1
    Sum = 0
    B2 = B  # B for looping and B2 for reference
    # Add elements from the front
    for k in range(B):
        Sum += A[k]
    ans = Sum
    # Add elements from the back
    for _ in range(B2):
        # Remove an element from the front
        Sum -= A[i]
        # Add an element from the back
        Sum += A[j]
        ans = max(ans, Sum)
    return ans
But the answer I get is 6.
Solution
def max_bookend_sum(x, n):
    bookends = x[-n:] + x[:n]
    return max(sum(bookends[i : i + n]) for i in range(n + 1))
Explanation
Let n = 3 and take x,
>>> x = [4, 9, -7, 4, 0, 4, -9, -8, -6, 9]
Grab the "right" n elements, concatenate with the "left" n:
>>> bookends = x[-n:] + x[:n]
>>> bookends # last three elements from x, then first three
[-8, -6, 9, 4, 9, -7]
Take "sliding window" groups of n elements:
>>> [bookends[i : i + n] for i in range(n + 1)]
[[-8, -6, 9], [-6, 9, 4], [9, 4, 9], [4, 9, -7]]
Now, instead of producing the sublists, sum each window and take the max:
>>> max(sum(bookends[i : i + n]) for i in range(n + 1))
22
For your large array A from the comments:
>>> max(sum(bookends[i : i + n]) for i in range(n + 1))
6253
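If B is large, re-summing every window makes this O(B²); a running-sum variant of the same idea (a sketch; max_bookend_sum_running is a made-up name) keeps it O(B):
def max_bookend_sum_running(x, n):
    # Same bookends trick, but slide the window: add the element
    # entering on the right, subtract the one leaving on the left.
    bookends = x[-n:] + x[:n]
    window = sum(bookends[:n])
    best = window
    for i in range(n):
        window += bookends[i + n] - bookends[i]
        best = max(best, window)
    return best

print(max_bookend_sum_running([4, 9, -7, 4, 0, 4, -9, -8, -6, 9], 3))  # 22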
Solution based on sum of the left and right slices:
Data = [-533, -666, -500, 169, 724, 478, 358, -38, -536, 705, -855, 281, -173, 961, -509, -5, 942, -173, 436, -609,
-396, 902, -847, -708, -618, 421, -284, 718, 895, 447, 726, -229, 538, 869, 912, 667, -701, 35, 894, -297, 811,
322, -667, 673, -336, 141, 711, -747, -132, 547, 644, -338, -243, -963, -141, -277, 741, 529, -222, -684,
35] # to avoid var shadowing
def solve(A, B):
    m = None
    for i in range(B + 1):
        # take i elements from the left slice and r = B - i from the right
        r = B - i
        tmp = sum(A[0:i]) + (sum(A[-r:]) if r > 0 else 0)
        m = tmp if m is None else max(m, tmp)
    return m

print(solve(Data, 48))  # 6253
A recursive approach with comments.
def solve(A, B, start_i=0, end_i=None):
    # set end_i to the index of the last element
    if end_i is None:
        end_i = len(A) - 1
    # base case 1: we have no more moves
    if B == 0:
        return 0
    # base case 2: only two elements remain in the window; take both
    # if we have two moves left, otherwise the larger of the two
    if end_i - start_i == 1:
        return A[start_i] + A[end_i] if B >= 2 else max(A[start_i], A[end_i])
    # next, we need to choose whether to use one of our moves on
    # the left side of the array or the right side. We compute both,
    # then check which one is better.
    # pick the left side to sum
    sum_left = A[start_i] + solve(A, B - 1, start_i + 1, end_i)
    # pick the right side to sum
    sum_right = A[end_i] + solve(A, B - 1, start_i, end_i - 1)
    # return the max of both options
    return max(sum_left, sum_right)

arr = [5, -2, 3, 1, 2]
print(solve(arr, 3))  # prints 8
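The plain recursion explores on the order of 2^B paths; a memoized variant (a sketch using functools.lru_cache; solve_cached is a made-up name) reduces this to the O(B²) distinct (B, start_i, end_i) states:
from functools import lru_cache

def solve_cached(A, B):
    @lru_cache(maxsize=None)
    def rec(b, start_i, end_i):
        # base case: no more moves
        if b == 0:
            return 0
        # take from the left or from the right, whichever is better
        return max(A[start_i] + rec(b - 1, start_i + 1, end_i),
                   A[end_i] + rec(b - 1, start_i, end_i - 1))
    return rec(B, 0, len(A) - 1)

print(solve_cached([5, -2, 3, 1, 2], 3))  # 8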
The idea is if we have this list:
[5, 1, 1, 8, 2, 10, -2]
Then the possible numbers for B=3 would be:
lhs = [5, 1, 1] # namely L[+0], L[+1], L[+2]
rhs = [2, 10, -2] # namely R[-3], R[-2], R[-1]
The possible combinations would be:
[5, 1, 1] # L[+0], L[+1], L[+2]
[5, 1, -2] # L[+0], L[+1], R[-1]
[5, 10, -2] # L[+0], R[-2], R[-1]
[2, 10, -2] # R[-3], R[-2], R[-1]
As you can see, we can easily perform forward and backward iterations: start from all L (L[+0], L[+1], L[+2]), then iteratively replace the last L with an R (R[-1], then R[-2], then R[-3]) until all are R (R[-3], R[-2], R[-1]).
def solve(A, B):
    max_sum = None
    for lhs, rhs in zip(range(B, -1, -1), range(0, -(B+1), -1)):
        combined = A[0:lhs] + (A[rhs:] if rhs < 0 else [])
        combined_sum = sum(combined)
        max_sum = combined_sum if max_sum is None else max(max_sum, combined_sum)
    return max_sum
for A in [
    [5, 1, 1, 8, 2, 10, -2],
    [5, 6, 1, 8, 2, 10, -2],
    [5, 6, 3, 8, 2, 10, -2],
]:
    print(A)
    print("\t1 =", solve(A, 1))
    print("\t2 =", solve(A, 2))
    print("\t3 =", solve(A, 3))
    print("\t4 =", solve(A, 4))
Output
[5, 1, 1, 8, 2, 10, -2]
    1 = 5
    2 = 8
    3 = 13
    4 = 18
[5, 6, 1, 8, 2, 10, -2]
    1 = 5
    2 = 11
    3 = 13
    4 = 20
[5, 6, 3, 8, 2, 10, -2]
    1 = 5
    2 = 11
    3 = 14
    4 = 22
public int solve(int[] A, int B) {
    int sum = 0;
    int n = A.length - 1;
    for (int k = 0; k < B; k++) {
        sum += A[k];
    }
    int ans = sum;
    int B2 = B - 1;
    for (int j = n; j > n - B; j--) {
        sum -= A[B2];
        sum += A[j];
        ans = Math.max(ans, sum);
        B2--;
    }
    return ans;
}

Remove elements from Numpy array until y has equivalent elements in each value

I have an array y composed of 0s and 1s, but at different frequencies.
For example:
y = np.array([0, 0, 1, 1, 1, 1, 0])
And I have an array x of the same length.
x = np.array([0, 1, 2, 3, 4, 5, 6])
The idea is to filter out elements until there are the same number of 0 and 1.
A valid solution would be to remove index 5:
x = np.array([0, 1, 2, 3, 4, 6])
y = np.array([0, 0, 1, 1, 1, 0])
A naive method I can think of: get the difference between the value frequencies of y (in this case 4 - 3 = 1), create a mask for y == 1, and switch random elements from True to False until the difference is 0. Then create a mask for y == 0, OR the two masks together, and apply the result to both x and y.
This doesn't really seem the best "python/numpy way" of doing it though.
Any suggestions? Something like randomly selecting n elements from the value with the highest count, where n is the count of the value with the lowest count.
If this is easier with pandas then that would work for me too.
Naive algorithm (assuming there are more 1s than 0s):
mask_pos = y == 1
mask_neg = y == 0
pos = len(y[mask_pos])
neg = len(y[mask_neg])
diff = pos - neg
while diff > 0:
    rand = np.random.randint(0, len(y))
    if mask_pos[rand]:
        mask_pos[rand] = False
        diff -= 1
mask_final = mask_pos | mask_neg
y_new = y[mask_final]
x_new = x[mask_final]
This naive algorithm is really slow.
One way to do that with NumPy is this:
import numpy as np

# Makes a mask to balance ones and zeros
def balance_binary_mask(binary_array):
    # Work on a flat boolean copy so that ~ inverts 0/1 correctly
    binary_array = np.asarray(binary_array).ravel().astype(bool)
    # Count number of ones
    z = np.count_nonzero(binary_array)
    # If there are fewer ones than zeros
    if z <= len(binary_array) // 2:
        # Invert the array so the majority value becomes True
        binary_array = ~binary_array
    # Find the positions of the majority value
    idx = np.nonzero(binary_array)[0]
    # Number of elements to remove
    rem = 2 * len(idx) - len(binary_array)
    # Pick random indices to remove
    rem_idx = np.random.choice(idx, size=rem, replace=False)
    # Make mask
    mask = np.ones_like(binary_array, dtype=bool)
    # Mask elements to remove
    mask[rem_idx] = False
    return mask
# Test
np.random.seed(0)
y = np.array([0, 0, 1, 1, 1, 1, 0])
x = np.array([0, 1, 2, 3, 4, 5, 6])
m = balance_binary_mask(y)
print(m)
# [ True True True True False True True]
y = y[m]
x = x[m]
print(y)
# [0 0 1 1 1 0]
print(x)
# [0 1 2 3 5 6]
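Since the question mentions pandas: a sketch of the same balancing with GroupBy.sample (assumes pandas >= 1.1; the column names are made up):
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5, 6],
                   "y": [0, 0, 1, 1, 1, 1, 0]})
# Sample every class down to the size of the smallest one
n_min = df["y"].value_counts().min()
balanced = df.groupby("y").sample(n=n_min, random_state=0)
print(balanced.sort_index())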

I need help getting stDev without using import math - python

# import math
def mean(values):
    return sum(values) * 1.0 / len(values)

def std(values):
    length = len(values)
    if length < 2:
        return "Standard deviation requires at least two data points"
    m = mean(values)
    total_sum = 0
    for i in range(length):
        total_sum += (values[i] - m) ** 2
    under_root = total_sum * 1.0 / length
    return math.sqrt(under_root)  # this fails, since math was never imported

vals = [5]
stan_dev = std(vals)
print(stan_dev)

values = [1, 2, 3, 4, 5]
stan_dev = std(values)
print(stan_dev)
__________________________________________________________________________
from functools import reduce  # needed in Python 3

lst = [3, 19, 21, 1435, 653342]
sum = reduce((lambda x, y: x + y), lst)
print(sum)
I need to be able to get the stDev without using sum or len.
I need to 'unpack' the stDev???
You can do it with two loops (there are shorter ways but this is simple):
arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Calculate the mean first
N, X = 0, 0
for xi in arr:
    N += 1
    X += xi
mean = X / N
# Calculate the standard deviation
DSS = 0
for xi in arr:
    DSS += (xi - mean)**2
std = (DSS / N)**(1/2)
Outputs 4.5 for mean and 2.872 for std.
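A single-pass alternative (a sketch of Welford's online algorithm) that also avoids sum and len, and only walks the data once:
arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
n, mean, M2 = 0, 0.0, 0.0
for xi in arr:
    # Update the running mean and the running sum of squared deviations
    n += 1
    delta = xi - mean
    mean += delta / n
    M2 += delta * (xi - mean)
std = (M2 / n) ** 0.5
print(mean, std)  # 4.5 2.8722813232690143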

Generate lexicographic series efficiently in Python

I want to generate a lexicographic series of numbers such that for each number the sum of digits is a given constant. It is somewhat similar to 'subset sum problem'. For example if I wish to generate 4-digit numbers with sum = 3 then I have a series like:
[3 0 0 0]
[2 1 0 0]
[2 0 1 0]
[2 0 0 1]
[1 2 0 0] ... and so on.
I was able to do it successfully in Python with the following code:
import numpy as np
M = 4  # No. of digits
N = 3  # Target sum
a = np.zeros((1, M), int)
b = np.zeros((1, M), int)
a[0][0] = N
jj = 0
while a[jj][M-1] != N:
    ii = M - 2
    while a[jj][ii] == 0:
        ii = ii - 1
    kk = ii
    if kk > 0:
        b[0][0:kk-1] = a[jj][0:kk-1]
    b[0][kk] = a[jj][kk] - 1
    b[0][kk+1] = N - sum(b[0][0:kk+1])
    b[0][kk+2:] = 0
    a = np.concatenate((a, b), axis=0)
    jj += 1
for ii in range(0, len(a)):
    print(a[ii])
print(len(a))
I don't think it is a very efficient way (as I am a Python newbie). It works fine for small values of M and N (< 10) but becomes really slow beyond that. I wish to use it for M ~ 100 and N ~ 6. How can I make my code more efficient, or is there a better way to code it?
A very effective algorithm, adapted from Jorg Arndt's book "Matters Computational"
(Chapter 7.2, "Co-lexicographic order for compositions into exactly k parts"):
n = 4
k = 3
x = [0] * n
x[0] = k
while True:
    print(x)
    v = x[-1]
    if k == v:
        break
    x[-1] = 0
    j = -2
    while 0 == x[j]:
        j -= 1
    x[j] -= 1
    x[j+1] = 1 + v
[3, 0, 0, 0]
[2, 1, 0, 0]
[2, 0, 1, 0]
[2, 0, 0, 1]
[1, 2, 0, 0]
[1, 1, 1, 0]
[1, 1, 0, 1]
[1, 0, 2, 0]
[1, 0, 1, 1]
[1, 0, 0, 2]
[0, 3, 0, 0]
[0, 2, 1, 0]
[0, 2, 0, 1]
[0, 1, 2, 0]
[0, 1, 1, 1]
[0, 1, 0, 2]
[0, 0, 3, 0]
[0, 0, 2, 1]
[0, 0, 1, 2]
[0, 0, 0, 3]
Number of compositions and time in seconds for plain Python (perhaps numpy arrays are faster), for n=100 and k = 2, 3, 4, 5 (2.8 GHz Cel-1840):
k  compositions  seconds
2  5050          0.040000200271606445
3  171700        0.9900014400482178
4  4421275       20.02204465866089
5  91962520      372.03577995300293
I expect about 2 hours for the 100/6 generation.
The same with numpy arrays (x = np.zeros((n,), dtype=int)) gives worse results, but perhaps because I don't know how to use them properly:
k  compositions  seconds
2  5050          0.07999992370605469
3  171700        2.390003204345703
4  4421275       54.74532389640808
Native code (this is Delphi; C/C++ compilers might optimize better) generates 100/6 in 21 seconds:
3 171700 0.012
4 4421275 0.125
5 91962520 1.544
6 1609344100 20.748
Can't go to sleep until all the measurements are done :)
MSVS VC++: 18 seconds! (O2 optimization)
5  91962520      1.466
6  1609344100    18.283
So about 100 million variants per second. A lot of time is wasted checking empty cells (because the fill ratio is small). The speed described by Arndt is reached at higher k/n ratios, and is about 300-500 million variants per second:
n=25, k=15: 25140840660 compositions in 60.981 s, i.e. about 400 million per second
My recommendations (a sketch combining them follows this list):
- Rewrite it as a generator utilizing yield, rather than a loop that concatenates to a global array on each iteration.
- Keep a running sum instead of recalculating the sum of a slice of the array representation of the number.
- Operate on a single instance of your working number representation instead of slicing a copy of it into a temporary variable on each iteration.
Note that no particular order is implied.
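For example, a sketch applying the first and third recommendations to the Arndt-style loop above (compositions is a made-up name):
def compositions(n, k):
    # Yields each composition of k into n parts in co-lexicographic
    # order; the list is reused in place, so copy it (tuple(x)) if you
    # need to keep a yielded value.
    x = [0] * n
    x[0] = k
    while True:
        yield x
        v = x[-1]
        if v == k:
            return
        x[-1] = 0
        j = -2
        while x[j] == 0:
            j -= 1
        x[j] -= 1
        x[j + 1] = 1 + v

for c in compositions(4, 3):
    print(c)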
I have a better solution using itertools, as follows:
from itertools import product

n = 4  # number of elements
s = 3  # sum of elements
r = []
for x in range(n):
    r.append(x)  # candidate digit values 0..n-1 (enough here, since s <= n-1)
result = [p for p in product(r, repeat=n) if sum(p) == s]
print(len(result))
print(result)
I am saying this is better because it took 0.1 secs on my system, while your code with numpy took 0.2 secs. But for n=100 and s=6, this code would have to go through all the combinations, and I think it would take days to compute the results.
I found a solution using itertools as well (source: https://bugs.python.org/msg144273). The code follows:
def combinations_with_replacement(iterable, r):
    # combinations_with_replacement('ABC', 2) --> AA AB AC BB BC CC
    pool = tuple(iterable)
    n = len(pool)
    if not n and r:
        return
    indices = [0] * r
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != n - 1:
                break
        else:
            return
        indices[i:] = [indices[i] + 1] * (r - i)
        yield tuple(pool[i] for i in indices)
int_part = lambda n, k: (tuple(map(c.count, range(k)))
                         for c in combinations_with_replacement(range(k), n))

for item in int_part(3, 4):
    print(item)
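Note that the recipe above is the documented pure-Python equivalent of the built-in itertools.combinations_with_replacement, so int_part can use the C implementation directly (a sketch):
from itertools import combinations_with_replacement

def int_part(n, k):
    # Each multiset of n values drawn from range(k) maps to one
    # composition: count how often every position appears in it.
    return (tuple(map(c.count, range(k)))
            for c in combinations_with_replacement(range(k), n))

for item in int_part(3, 4):
    print(item)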

Python: how to avoid loop?

I have a list of entries
l = [5, 3, 8, 12, 24]
and a matrix M
M:
12 34 5 8 7
0 24 12 3 1
I want to find the indices in the matrix where the numbers in l appear. For the k-th entry of l, I want to save a random pair of indices i, j where M[i][j] == l[k]. I am doing the following:
indI = []
indJ = []
for i in l:
    tmp = np.where(M == i)
    rd = randint(len(tmp))
    indI.append(tmp[0][rd])
    indJ.append(tmp[1][rd])
I would like to see if there is a way to avoid that loop
One way in which you should be able to significantly speed up your code is to avoid duplicate work:
tmp = np.where(M == i)
As this gives you a list of all locations in M where the value is equal to i, it must be searching through the entire matrix. So for each element in l, you are searching through the full matrix.
Instead of doing that, try indexing your matrix as a first step:
matrix_index = {}
for i in range(len(M)):
    for j in range(len(M[i])):
        if M[i][j] not in matrix_index:
            matrix_index[M[i][j]] = [(i, j)]
        else:
            matrix_index[M[i][j]].append((i, j))
Then for each value in l, instead of doing a costly search through the full matrix, you can just get it straight from your matrix index.
Note: I haven't worked with numpy very much, so I may have gotten some of the specific syntax incorrect. There may also be a more idiomatic way of doing this in numpy.
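A sketch of the lookup step that replaces the per-value matrix scan (assumes matrix_index built as above and that every value of l occurs in M):
import random

indI = []
indJ = []
for v in l:
    # One dictionary lookup plus one random pick per value,
    # instead of a full scan of M
    i, j = random.choice(matrix_index[v])
    indI.append(i)
    indJ.append(j)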
If both l and M are not large, like the following:
In: l0 = [5, 3, 8, 12, 34, 1, 12]
In: M0 = [[12, 34, 5, 8, 7],
In:       [ 0, 24, 12, 3, 1]]
In: l = np.asarray(l0)
In: M = np.asarray(M0)
You can try this:
In: np.where(l[None, None, :] == M[:, :, None])
Out:
(array([0, 0, 0, 0, 0, 1, 1, 1, 1]), <- i
array([0, 0, 1, 2, 3, 2, 2, 3, 4]), <- j
array([3, 6, 4, 0, 2, 3, 6, 1, 5])) <- k
The three rows are i, j, and k, respectively; read each column to get every (i, j, k) triple you need. For example, the first column [0, 0, 3] means M[0, 0] == l[3], and the second column [0, 0, 6] says M[0, 0] == l[6], and so on. I think these are what you want.
However, this numpy trick cannot be extended to very large inputs, such as 2M elements in l or 2500x2500 elements in M. It would need quite a lot of memory and a very long time to compute... if it is lucky enough not to crash from running out of memory. :)
One solution that does not use the word for is:
c = np.apply_along_axis(
    lambda row: np.random.choice(np.argwhere(row).ravel()),
    1,
    M.ravel()[np.newaxis, :] == l[:, np.newaxis],
)
indI, indJ = c // M.shape[1], c % M.shape[1]
Note that while that solves the problem, M.ravel()[np.newaxis, :] == l[:, np.newaxis] will quickly produce MemoryErrors. A more pragmatic approach would be to get the indices of interest through something like
s = np.argwhere(M.ravel()[np.newaxis, :] == l[:, np.newaxis])
and then do the random-choice post-processing by hand (a sketch follows). This, however, probably does not yield any significant performance improvement over your search.
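A sketch of that by-hand post-processing: each row of s pairs an index into l with a flat index into M, so one random hit can be drawn per value (assumes every value of l occurs in M):
# s[:, 0] indexes l, s[:, 1] is the flat index into M
c = np.array([np.random.choice(s[s[:, 0] == k, 1])
              for k in range(len(l))])
indI, indJ = c // M.shape[1], c % M.shape[1]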
What makes it slow, though, is that you search through the entire matrix in every step of your loop; pre-sorting the matrix (at a certain cost) gives you a straightforward way of making each individual search much faster:
In [312]: %paste
def direct_search(M, l):
    indI = []
    indJ = []
    for i in l:
        tmp = np.where(M == i)
        rd = np.random.randint(len(tmp[0]))  # Note the fix here
        indI.append(tmp[0][rd])
        indJ.append(tmp[1][rd])
    return indI, indJ

def using_presorted(M, l):
    a = np.argsort(M.ravel())
    M_sorted = M.ravel()[a]
    def find_indices(i):
        s = np.searchsorted(M_sorted, i)
        j = 0
        while M_sorted[s + j] == i:
            yield a[s + j]
            j += 1
    indices = [list(find_indices(i)) for i in l]
    c = np.array([np.random.choice(i) for i in indices])
    return c // M.shape[1], c % M.shape[1]
## -- End pasted text --
In [313]: M = np.random.randint(0, 1000000, (1000, 1000))
In [314]: l = np.random.choice(M.ravel(), 1000)
In [315]: %timeit direct_search(M, l)
1 loop, best of 3: 4.76 s per loop
In [316]: %timeit using_presorted(M, l)
1 loop, best of 3: 208 ms per loop
In [317]: indI, indJ = using_presorted(M, l) # Let us check that it actually works
In [318]: np.all(M[indI, indJ] == l)
Out[318]: True
