So here is my problem: I have an array like this:
arr = array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
In this array I want to find n consecutive values, all greater than zero, with the biggest sum. In this example, with n = 5, this would be array([20, 26, 32, 37, 52]) and the index would be 5.
What I tried is of course a loop:
n = 5
max_sum = 0
max_loc = 0
for i in range(arr.size - n):
    if all(arr[i:i + n] > 0) and arr[i:i + n].sum() > max_sum:
        max_sum = arr[i:i + n].sum()
        max_loc = i
print(max_loc)
This is fine for a handful of short arrays, but of course I need to run it on many arrays that are not so short.
I was experimenting with numpy so I would only have to iterate non-zero value groups:
diffs = np.concatenate((np.array([False]), np.diff(arr > 0)))
groups = np.split(arr, np.where(diffs)[0])
for group in groups:
    if group.sum() > 0 and group.size >= n:
        ...
I believe this is nice, but not the right direction. I am looking for a simpler and faster numpy / pandas solution that really uses the power of these packages.
Using cross-correlation, numpy.correlate, is a possible, concise and fast solution:
n = 5
arr[arr<0] = np.iinfo(arr.dtype).min # the most negative integer representable in arr's dtype
# Thanks for the np.iinfo suggestion, @Corralien
idx = np.argmax(np.correlate(arr, np.ones(n), 'valid'))
idx, arr[idx:(idx+n)]
Another possible solution:
n, l = 5, arr.size
arr[arr<0] = np.iinfo(arr.dtype).min # the most negative integer representable in arr's dtype
# Thanks for the np.iinfo suggestion, @Corralien
idx = np.argmax([np.sum(np.roll(arr,-x)[:n]) for x in range(l-n+1)])
idx, arr[idx:(idx+n)]
Output:
(5, array([20, 26, 32, 37, 52]))
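Why this works: correlating with a window of ones is simply a moving sum over every length-n window. A quick sanity check on the question's array (a sketch):

import numpy as np

arr = np.array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
n = 5

# np.correlate with a vector of ones produces the sum of each sliding window
window_sums = np.correlate(arr, np.ones(n), 'valid')
assert np.allclose(window_sums, [arr[i:i + n].sum() for i in range(arr.size - n + 1)])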
You can use sliding_window_view:
from numpy.lib.stride_tricks import sliding_window_view
N = 5
win = sliding_window_view(arr, N)
idx = ((win.sum(axis=1)) * ((win>0).all(axis=1))).argmax()
print(idx, arr[idx:idx+N])
# Output
5 [20 26 32 37 52]
Answer greatly enhanced by chrslg to save memory and keep win as a view.
Update
A nice bonus is this should work with Pandas Series just fine.
N = 5
idx = pd.Series(arr).where(lambda x: x > 0).rolling(N).sum().shift(-N+1).idxmax()
print(idx, arr[idx:idx+N])
# Output
5 [20 26 32 37 52]
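For reference, here is the same chain unrolled step by step on the question's array (a sketch of what each call contributes):

import numpy as np
import pandas as pd

arr = np.array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
N = 5

s = pd.Series(arr).where(lambda x: x > 0)  # non-positive values become NaN
sums = s.rolling(N).sum()                  # NaN as soon as any value in the window is NaN
starts = sums.shift(-N + 1)                # label each window sum with its start index
idx = starts.idxmax()                      # idxmax skips the NaN windows
print(idx, arr[idx:idx + N])               # 5 [20 26 32 37 52]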
I want to extract the longest sequence of consecutive non NaN values from an array in Python. So from this one:
a = [NaN, NaN, NaN, 1, 4, NaN, NaN, NaN, 4, 6, 8, 4, 6, 6, 4, 3, 2, NaN, NaN, NaN, 2, NaN, NaN, NaN]
I would like to get
a_nonNaN_long = [4, 6, 8, 4, 6, 6, 4, 3, 2]
So the way I was thinking to go about this is to get the first non NaN value using this function
import math

def firstNonNan(listfloats):
    for i, item in enumerate(listfloats):
        if not math.isnan(item):
            return i
And then use the index from this function in a while loop to get a subsection of the array, until I find the longest consecutive sequence of non-NaN values. I wonder if anybody has some other/better way to do it?
You can use itertools.groupby to get non-NaN stretches, then max for the longest:
from itertools import groupby
import math
out = max((list(g) for k,g in groupby(a, math.isnan) if not k), key=len)
Output:
[4, 6, 8, 4, 6, 6, 4, 3, 2]
Used input:
NaN = float('nan')
a = [NaN, NaN, NaN, 1, 4, NaN, NaN, NaN, 4, 6, 8, 4, 6, 6, 4, 3, 2, NaN, NaN, NaN, 2, NaN, NaN, NaN]
To see whether numpy-based vectorized approaches or pure python algorithms for finding the bounds of non-NaN sequences could be performance-competitive with a python groupby answer such as the one by @mozway, I used perfplot to benchmark four strategies: pure python with groupby, pure python without groupby, numpy, and numba.
In one run, I used a python list as input (as in the question), which puts the onus on the numpy and numba runs to do the round-trip conversion from list to np.array and back, as well as to deal with heterogeneous data types given that NaN is float and the values are int.
Here's how it looks: [perfplot timing chart not reproduced here]
In the second run, I used a python list as input for the non-numpy solutions, and an np.array as input (with NaN replaced by an integer sentinel outside the range of actual values, to allow a dtype of int32) and output, to see whether recasting the problem in numpy-friendly array types would help the numpy/numba solutions.
Here are the results: [perfplot timing chart not reproduced here]
Conclusion:
The numpy and numba solutions are about 1.5 orders of magnitude faster if they are allowed to use homogeneous numpy input and output. Otherwise (if they must incur the round-trip overhead of list-to-homogeneous-numpy conversion and back) they land more or less on top of the pure python solutions.
Pure python groupby beats non-groupby by a bit, and in the second case, numba beats numpy by a little.
Note that none of the others beats the groupby solution by @mozway for simplicity of expression.
Here is the benchmark code:
NaN = float("nan")
a = [NaN, NaN, NaN, 1, 4, NaN, NaN, NaN, 4, 6, 8, 4, 6, 6, 4, 3, 2, NaN, NaN, NaN, 2, NaN, NaN, NaN]
import numpy as np
aNp = np.array([0 if v is NaN else v for v in a], np.int32)
def foo_1(a):
    mnMinus1 = min(v for v in a if not v is NaN) - 1
    np_a = np.array([mnMinus1 if v is NaN else v for v in a], np.int32)
    return list(np_foo_1(np_a, mnMinus1))

def foo_2(a):
    mnMinus1 = min(v for v in a if not v is NaN) - 1
    np_a = np.array([mnMinus1 if v is NaN else v for v in a], np.int32)
    return list(np_foo_2(np_a, mnMinus1))
def np_foo_1(b, mnMinus1):
    R = np.concatenate((np.array([mnMinus1]), b[:-1]))
    bIsnan = b==mnMinus1
    RIsnan = R==mnMinus1
    nonNanL = (~bIsnan) & RIsnan
    nonNanR = bIsnan & ~RIsnan
    index = np.arange(len(b))
    left, right = index[nonNanL], index[nonNanR]
    lens = right - left
    i = lens.argmax()
    result = b[left[i]:right[i]]
    return result
from numba import njit

@njit
def np_foo_2(b, mnMinus1):
    R = np.concatenate((np.array([mnMinus1]), b[:-1]))
    bIsnan = b==mnMinus1
    RIsnan = R==mnMinus1
    nonNanL = (~bIsnan) & RIsnan
    nonNanR = bIsnan & ~RIsnan
    index = np.arange(len(b))
    left, right = index[nonNanL], index[nonNanR]
    lens = right - left
    i = lens.argmax()
    result = b[left[i]:right[i]]
    return result
def foo_3(a):
    LR = []
    left, right = 0, 0
    while left < len(a):
        while right < len(a) and a[right] is not NaN:
            right += 1
        LR.append((left, right))
        left = right + 1
        while left < len(a) and a[left] is NaN:
            left += 1
        right = left + 1
    #i = max(zip(LR, range(len(LR))), key = lambda x: x[0][1] - x[0][0])[-1]
    i, mx = 0, 0
    for j in range(len(LR)):
        cur = LR[j][1] - LR[j][0]
        if cur > mx:
            i, mx = j, cur
    return a[LR[i][0]:LR[i][1]]

    # alternative comprehension-based variant (unreachable after the return above, kept for reference)
    left = [i for i, v in enumerate(a) if v is not NaN and (not i or a[i-1] is NaN)]
    right = [i for i, v in enumerate(a) if v is NaN and (False if not i else a[i-1] is not NaN)]
    #i = max(zip(left, right, range(len(left))), key = lambda x: x[1] - x[0])[-1]
    i, mx = 0, 0
    for j in range(len(left)):
        cur = right[j] - left[j]
        if cur > mx:
            i, mx = j, cur
    return a[left[i]:right[i]]
from itertools import groupby
import math
def foo_4(a):
    out = max((list(g) for k,g in groupby(a, math.isnan) if not k), key=len)
    return out
def dual2_foo_1(a, np_a):
    return foo_1(a)

def dual2_foo_2(a, np_a):
    return foo_2(a)

def dual2_foo_3(a, np_a):
    return foo_3(a)

def dual2_foo_4(a, np_a):
    return foo_4(a)

def dual_foo_1(a, np_a):
    return np_foo_1(np_a, 0)

def dual_foo_2(a, np_a):
    return np_foo_2(np_a, 0)

def dual_foo_3(a, np_a):
    return foo_3(a)

def dual_foo_4(a, np_a):
    return foo_4(a)
foo_count = 4
foo_names=['foo_' + str(i + 1) for i in range(foo_count)]
foo_labels=['numpy', 'numpy_numba', 'python_find_seq_bounds', 'python_groupby']
exec("foo_funcs=[" + ','.join(f"foo_{str(i + 1)}" for i in range(foo_count)) + "]")
exec("dual_foo_funcs=[" + ','.join(f"dual_foo_{str(i + 1)}" for i in range(foo_count)) + "]")
exec("dual2_foo_funcs=[" + ','.join(f"dual2_foo_{str(i + 1)}" for i in range(foo_count)) + "]")
for foo in foo_names:
    print(f'{foo} output:')
    print(eval(f"{foo}(a)"))
import perfplot
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams["figure.autolayout"] = True
perfplot.show(
    setup=lambda n: (a * n, np.array([0 if v is NaN else v for v in a * n], np.int32)),
    kernels=dual_foo_funcs,
    labels=foo_labels,
    n_range=[2 ** k for k in range(11)],
    equality_check=np.allclose,
    xlabel='n/24',
    logx="auto",
    logy="auto"
)
perfplot.show(
    setup=lambda n: (a * n, np.array([0 if v is NaN else v for v in a * n], np.int32)),
    kernels=dual2_foo_funcs,  # dual2_* variants: every kernel takes the python list input
    labels=foo_labels,
    n_range=[2 ** k for k in range(11)],
    equality_check=np.allclose,
    xlabel='n/24',
    logx="auto",
    logy="auto"
)
I have a numpy array with many rows in it that look roughly as follows:
0, 50, 50, 2, 50, 1, 50, 99, 50, 50
50, 2, 1, 50, 50, 50, 98, 50, 50, 50
0, 50, 50, 98, 50, 1, 50, 50, 50, 50
0, 50, 50, 50, 50, 99, 50, 50, 2, 50
2, 50, 50, 0, 98, 1, 50, 50, 50, 50
I am given a variable n<50. Each row, of length 10, has the following in it:
Every number from 0 to n, with one possibly missing. In the example above, n=2.
Possibly a 98, which will be in the place of the missing number, if there is a number missing.
Possibly a 99, which will be in the place of the missing number, if there is a number missing, and there is not already a 98.
Many 50's.
What I want to get is an array with all the indices of the 0s in the first row, all the indices of the 1s in the second row, all the indices of the 2s in the third row, etc. For the above example, my desired output is this:
0, 6, 0, 0, 3
5, 2, 5, 5, 5
3, 1, 3, 8, 0
You may have noticed the catch: sometimes, exactly one of the numbers is replaced either by a 98, or a 99. It's pretty easy to write a for loop which determines which number, if any, was replaced, and uses that to get the array of indices.
Is there a way to do this with numpy?
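For reference, the loop-based baseline described above might look something like the sketch below (data holds the example rows from the question and n = 2; the variable names are just for illustration):

import numpy as np

data = np.array([
    [ 0, 50, 50,  2, 50,  1, 50, 99, 50, 50],
    [50,  2,  1, 50, 50, 50, 98, 50, 50, 50],
    [ 0, 50, 50, 98, 50,  1, 50, 50, 50, 50],
    [ 0, 50, 50, 50, 50, 99, 50, 50,  2, 50],
    [ 2, 50, 50,  0, 98,  1, 50, 50, 50, 50],
])
n = 2

out = np.empty((n + 1, len(data)), dtype=int)
for r, row in enumerate(data):
    placeholder = np.where((row == 98) | (row == 99))[0]  # index of the 98/99 stand-in, if any
    for v in range(n + 1):
        hit = np.where(row == v)[0]
        out[v, r] = hit[0] if hit.size else placeholder[0]
print(out)  # matches the desired output above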
The following numpy solution rather aggressively uses the assumptions listed in the OP. If they are not 100% guaranteed, some more checks may be in order.
The mildly clever bit (even if I say so myself) is to use the data values themselves to find the right destination rows for their indices. For example, all the 2's need their indices stored in row 2 of the output array. Using this, we can bulk-store most of the indices in a single operation.
Example input is in array data:
n = 2
y,x = data.shape
out = np.empty((y,n+1),int)
# find 98 falling back to 99 if necessary
# and fill output array with their indices
# if neither exists some nonsense will be written but that does no harm
# most of this will be overwritten later
out.T[...] = ((data-98)&127).argmin(axis=1)
# find n+1 lowest values in each row
idx = data.argpartition(n,axis=1)[:,:n+1]
# construct auxiliary indexer
yr = np.arange(y)[:,None]
# put indices of low values where they belong
out[yr,data[yr,idx[:,:-1]]] = idx[:,:-1]
#      ^^^^^^^^^^^^^^^^^^^
# the clever bit: the data values themselves select the output row for each index
# rows with no missing number still need the last value
nomiss, = (data[range(y),idx[:,n]] == n).nonzero()
out[nomiss,n] = idx[nomiss,n]
# admire
print(out.T)
outputs:
[[0 6 0 0 3]
 [5 2 5 5 5]
 [3 1 3 8 0]]
I don't think you're getting away without a for-loop here. But here's how you could go about it.
For each number from 0 to n, find all of the locations where it is known. Example:
locations = np.argwhere(data == 1)
print(locations)
[[0 5]
 [1 2]
 [2 5]
 [4 5]]
You can then turn this into a map for easy lookup per number in n:
known = {
    i: dict(np.argwhere(data == i))
    for i in range(n + 1)
}

from pprint import pprint
pprint(known)
{0: {0: 0, 2: 0, 3: 0, 4: 3},
1: {0: 5, 1: 2, 2: 5, 4: 5},
2: {0: 3, 1: 1, 3: 8, 4: 0}}
Do the same for the unknown numbers:
unknown = dict(np.argwhere((data == 98) | (data == 99)))
pprint(unknown)
{0: 7, 1: 6, 2: 3, 3: 5, 4: 4}
And now, for each location in the result, you can look up the index in the known map and fall back to the unknown one.
result = np.array(
    [
        [known[i].get(j, unknown.get(j)) for j in range(len(data))]
        for i in range(n + 1)
    ]
)
print(result)
[[0 6 0 0 3]
 [5 2 5 5 5]
 [3 1 3 8 0]]
Bonus: Getting fancy with dictionary constructor and unpacking:
from collections import OrderedDict

unknown = np.argwhere((data == 98) | (data == 99))
results = np.array([
    [*OrderedDict((*unknown, *np.argwhere(data == i))).values()]
    for i in range(n + 1)
])
print(results)
consider array1 and array2, with:
array1 = [a1 a2 NaN ... an]
array2 = [[NaN b2 b3 ... bn],
          [b21 NaN b23 ... b2n],
          ...]
Both arrays are numpy arrays. There is an easy way to compute the Euclidean distance between array1 and each row of array2:
EuclideanDistance = np.sqrt(((array1 - array2)**2).sum(axis=1))
What messes up this computation are the NaN values. Of course, I could easily replace NaN with some number. But instead, I want to do the following:
When I compare array1 with row_x of array2, I count the columns in which one of the arrays has NaN and the other doesn't. Let's assume the count is 3. I will then delete these columns from both arrays and compute the Euclidean distance between the two. In the end, I add a minus_value * count to the calculated distance.
Now, I cannot think of a fast and efficient way to do this. Can somebody help me?
Here are a few of my ideas:
minus = 1000
dist = np.zeros(shape=(array1.shape[0])) # this array will store the distance of array1 to each row of array2
array1 = np.repeat(array1, array2.shape[0], axis=0) # now array1 has the same dimensions as array2
for i in range(0, array1.shape[0]):
    boolarray = np.logical_or(np.isnan(array1[i]), np.isnan(array2[i]))
    count = boolarray.sum()
    deleteIdxs = boolarray.nonzero() # this should give the indices where boolarray is True
    dist[i] = np.sqrt(((np.delete(array1[i], deleteIdxs, axis=0) - np.delete(array2[i], deleteIdxs, axis=0))**2).sum(axis=0))
    dist[i] = dist[i] + count*minus
These lines look more than ugly to me, however. Also, I keep getting an index error: Apparently deleteIdxs contains an index that is out of range for array1. Don't know how this can even be.
You can find all the indices where the value is NaN using:
indices_1 = np.isnan(array1)
indices_2 = np.isnan(array2)
Which you can combine to:
indices_total = indices_1 + indices_2
And you can keep all the non-NaN values using:
array_1_not_nan = array1[~indices_total]
array_2_not_nan = array2[~indices_total]
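Building on these masks, one possible way to finish the computation row by row might be the following sketch (it assumes array1 is 1-D, array2 is 2-D, and minus is the per-column penalty from the question):

import numpy as np

mask = np.isnan(array1) | np.isnan(array2)   # shape (rows, cols); array1 broadcasts over the rows
diff = np.where(mask, 0.0, array1 - array2)  # masked columns contribute nothing to the sum
# every column with a NaN in either array is penalised here;
# use ^ (xor) instead of | if columns where both are NaN should not count
dist = np.sqrt((diff ** 2).sum(axis=1)) + mask.sum(axis=1) * minus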
I would write a function to handle the distance calculation. I am sure there is a faster and more efficient way to write this (list comprehensions, aggregations, etc.), but readability counts, right? :)
import numpy as np
def calculate_distance(fixed_arr, var_arr, penalty):
    s_sum = 0.0
    counter = 0
    for num_1, num_2 in zip(fixed_arr, var_arr):
        if np.isnan(num_1) or np.isnan(num_2):
            counter += 1
        else:
            s_sum += (num_1 - num_2) ** 2
    return np.sqrt(s_sum) + penalty * counter, counter
array1 = np.array([1, 2, 3, np.NaN, 5, 6])
array2 = np.array(
    [
        [3, 4, 9, 3, 4, 8],
        [3, 4, np.NaN, 3, 4, 8],
        [np.NaN, 9, np.NaN, 3, 4, 8],
        [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
    ]
)
dist = np.zeros(len(array2))
minus = 10
for index, arr in enumerate(array2):
    dist[index], _ = calculate_distance(array1, arr, minus)
print(dist)
You have to think about the value for the minus variable very carefully. Is adding a random value really useful?
As @Nathan suggested, a more resource-efficient version can easily be implemented:
fixed_arr = array1
penalty = minus
dist = [
    (
        lambda indices=(np.isnan(fixed_arr) + np.isnan(var_arr)): np.linalg.norm(
            fixed_arr[~indices] - var_arr[~indices]
        )
        + (indices == True).sum() * penalty
    )()
    for var_arr in array2
]
print(dist)
However I would only try to implement something like this if I absolutely needed to (if it's the bottleneck). For all other times I would be happy to sacrifice some resources in order to gain some readability and extensibility.
You can filter out the columns containing nan with:
mask1 = np.isnan(arr1)
mask2 = np.isnan(arr2).any(0)
mask = ~(mask1 | mask2)
# the two filtered arrays
arr1[mask], arr2[:, mask]
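To finish with the question's penalty term, a possible follow-up sketch continuing from the mask above (assuming minus is the per-dropped-column penalty and arr2 is 2-D):

count = (~mask).sum()  # number of columns dropped because of NaN
dist = np.sqrt(((arr1[mask] - arr2[:, mask]) ** 2).sum(axis=1)) + count * minus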
I am trying to apply conditional statements in a numpy array and to get a boolean array with 1 and 0 values.
So far I have tried np.where(), but it only allows 3 arguments and in my case I have more conditions.
I first create the array randomly:
numbers = np.random.uniform(1,100,5)
Now, if the value is lower than 30, I would like to get a 0. If the value is greater than 70, I would like to get a 1. And if the value is between 30 and 70, I would like to get a random number between 0 and 1; if this number is greater than 0.5, then the value from the array should get 1 as a boolean value, and otherwise 0. I guess this is done again with the np.random functions, but I don't know how to apply all of the arguments.
If the input array is:
[10,40,50,60,90]
Then the expected output should be:
[0,1,0,1,1]
where the three values in the middle are random, so they can differ between runs.
Thank you in advance!
Use numpy.select; the third condition can be simplified with numpy.random.choice:
numbers = np.array([10,40,50,60,90])
print (numbers)
[10 40 50 60 90]
a = np.select([numbers < 30, numbers > 70], [0, 1], np.random.choice([1,0], size=len(numbers)))
print (a)
[0 0 1 0 1]
If the third condition needs to compare against 0.5, you can convert the boolean mask to integers (mapping True, False to 1, 0):
b = (np.random.rand(len(numbers)) > .5).astype(int)
#alternative
#b = np.where(np.random.rand(len(numbers)) > .5, 1, 0)
a = np.select([numbers < 30, numbers > 70], [0, 1], b)
Or you can chain 3 times numpy.where:
a = np.where(numbers < 30, 0,
             np.where(numbers > 70, 1,
                      np.where(np.random.rand(len(numbers)) > .5, 1, 0)))
Or use np.select:
a = np.select([numbers < 30, numbers > 70, np.random.rand(len(numbers)) > .5],
              [0, 1, 1], 0)