I have a binary numpy array, mostly zero-valued, and I want to fill the gaps between non-zero values with a given value, but in an alternating way.
For example:
[0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0]
should result in either
[0,0,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,1,0,0]
or
[1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,0,1,1,1]
The idea is: while scanning the array left to right, fill the 0 values with 1 up to the next 1, if you didn't do so up to the previous 1.
I can do this iteratively in this way:
A = np.array([0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0])
ones_index = np.where(A == 1)[0]
begins = ones_index[::2] # beginnings of filling sections
ends = ones_index[1::2] # ends of filling sections
from itertools import zip_longest
# fill those sections
for begin, end in zip_longest(begins, ends, fillvalue=len(A)):
    A[begin:end] = 1
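For reference, running this loop on the example leaves A as the first of the two variants above:
print(A.tolist())
# [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]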
but I'm looking for a more efficient solution, maybe with numpy broadcasting. Any ideas?
One nice answer to this question is that we can produce the first result via np.logical_xor.accumulate(arr) | arr and the second via ~np.logical_xor.accumulate(arr) | arr. A quick demonstration:
A = np.array([0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0])
print(np.logical_xor.accumulate(A) | A)
print(~np.logical_xor.accumulate(A) | A)
The resulting output:
[0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0]
[1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1]
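Why this works: np.logical_xor.accumulate toggles a running parity bit at every 1, so it is True from each odd-numbered 1 up to (but excluding) the next 1; OR-ing with A then restores the 1s that close each section. A look at the accumulate step alone (my own annotation):
print(np.logical_xor.accumulate(A).astype(int))
# [0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0]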
Alternatively, the parity of the running count of ones marks the same sections: arr.cumsum() is odd from each odd-numbered 1 up to (but excluding) the next 1, and np.where keeps the closing 1s from arr itself:
np.where(arr.cumsum() % 2 == 1, 1, arr)
# array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
#        0, 0, 1, 1, 1, 1, 0, 0])
I have two dataframe columns containing sequences of 0 and -1.
Using the count method I can calculate the number of times the 1st column equals -1 (3 times) and the number of times the 2nd column equals -1 (2 times). What I would actually like is the number of times both columns x and y equal -1 simultaneously (it should be 1 in the given example), something like count = df1['x'][df1['x'] == df1['y'] == -1].count(), but I cannot put two conditions directly into count.
Is there a simple way to do it (using count or some other workaround)?
Thanks in advance!
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
df1 = pd.DataFrame({
    "x": [0, 0, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0, 0, -1, 0],
    "y": [0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, -1, 0],
})
df1
     x  y
0    0  0
1    0  0
2    0  0
3   -1  0
4    0  0
5   -1  0
6    0 -1
7    0  0
8    0  0
9    0  0
10   0  0
11   0  0
12   0  0
13  -1 -1
14   0  0
count = df1['x'][df1['x'] == -1].count()
count
3
count = df1['y'][df1['y'] == -1].count()
count
2
You can use eq + all to get a boolean Series that returns True if both columns are equal to -1 at the same time. Then sum fetches the total:
out = df1[['x','y']].eq(-1).all(axis=1).sum()
Output:
1
Sum x and y, and count the rows where they add to -2, i.e. where both are -1:
(df1.x + df1.y).eq(-2).sum()
1
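For completeness, the two conditions can also be combined with an explicit boolean AND, which sidesteps the chained-comparison problem from the question (my own variant, equivalent to the answers above):
count = ((df1['x'] == -1) & (df1['y'] == -1)).sum()
count
1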
Is there any way to make the code below more efficient?
for i in range(0, len(df)):
    current_row = df.iloc[i]
    if i > 0:
        previous_row = df.iloc[i-1]
    else:
        previous_row = current_row
    if (current_row['A'] != 1):
        if ((current_row['C'] < 55) and (current_row['D'] >= -1)):
            df.loc[i,'F'] = previous_row['F'] + 1
        else:
            df.loc[i,'F'] = previous_row['F']
For example, if the dataframe is like the one below:
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0], 'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1], 'F': [1, 1, 1, 0, 0, 0, 1, 1, 1]})
My output should look like this:
>>> df
   A  C  D  F
0  1  1  1  1
1  1  1  1  1
2  1  1  1  1
3  0  0  0  2
4  0  0  0  3
5  0  0  0  4
6  1  1  1  1
7  0  1  1  2
8  0  1  1  3
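A vectorized sketch of the same logic (my own, not from an answer; it assumes the frame starts with an A == 1 row as in the example, and the helper names anchor, step, group and base are mine) groups the rows at every A == 1 and cumulative-sums the increments within each group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0], 'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1], 'F': [1, 1, 1, 0, 0, 0, 1, 1, 1]})

anchor = df['A'].eq(1)                                   # rows whose F is left untouched
step = (~anchor & df['C'].lt(55) & df['D'].ge(-1)).astype(int)   # rows that add +1
group = anchor.cumsum()                                  # new group at every A == 1 row
base = df['F'].where(anchor).ffill()                     # F of the most recent A == 1 row
df['F'] = np.where(anchor, df['F'], base + step.groupby(group).cumsum()).astype(int)
# df['F'] is now [1, 1, 1, 2, 3, 4, 1, 2, 3]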
Essentially, I want to convert consecutive duplicates of Trues to False, as the title suggests.
For example, say I have an array of 0s and 1s:
x = pd.Series([1,0,0,1,1])
should become:
y = pd.Series([0,0,0,0,1])
# where the 1st element of x becomes 0 since it's not part of a consecutive run,
# the 4th element becomes 0 because it's the first instance of the consecutive duplicate,
# and everything else should remain the same.
This can also apply to runs of more than two. Say I have a much longer array, e.g.:
x = pd.Series([1,0,0,1,1,1,0,1,1,0,1,1,1,1,0,0,1,1,1,1,1])
becomes:
y = pd.Series([0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1])
The posts I have searched are mostly about deleting consecutive duplicates, which does not retain the original length. In this case, the original length should be retained.
It is something like the following code:
for i in range(len(x)):
    if x[i] == x[i+1]:
        x[i] = True
    else:
        x[i] = False
but this gives me a never-ending run, and it does not accommodate runs of more than two.
Pandas solution: create a Series, build consecutive-group ids with shift and cumsum, then use Series.duplicated to keep only the last 1 of each run longer than one element:
s = pd.Series(x)
g = s.ne(s.shift()).cumsum()
s1 = (~g.duplicated(keep='last') & g.duplicated(keep=False) & s.eq(1)).astype(int)
print(s1.tolist())
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
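For intuition, the intermediate group ids g label each consecutive run with its own integer (a small illustration of my own, not part of the original answer):
import pandas as pd

x = [1,0,0,1,1,1,0,1,1,0,1,1,1,1,0,0,1,1,1,1,1]
s = pd.Series(x)
g = s.ne(s.shift()).cumsum()   # new id whenever the value changes
print(g.tolist())
# [1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 9]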
EDIT:
For multiple columns, use a function:
x = pd.Series([1,0,0,1,1,1,0,1,1,0,1,1,1,1,0,0,1,1,1,1,1])
df = pd.DataFrame({'a':x, 'b':x})
def f(s):
    g = s.ne(s.shift()).cumsum()
    return (~g.duplicated(keep='last') & g.duplicated(keep=False) & s.eq(1)).astype(int)
df = df.apply(f)
print(df)
    a  b
0   0  0
1   0  0
2   0  0
3   0  0
4   0  0
5   1  1
6   0  0
7   0  0
8   1  1
9   0  0
10  0  0
11  0  0
12  0  0
13  1  1
14  0  0
15  0  0
16  0  0
17  0  0
18  0  0
19  0  0
20  1  1
Vanilla Python:
x = [1,0,0,1,1,1,0,1,1,0,1,1,1,1,0,0,1,1,1,1,1]
counter = 0
for i, e in enumerate(x):
    if not e:
        counter = 0          # a 0 ends the current run
        continue
    # Zero out a 1 if it starts a run or is not the last of its run;
    # only the final 1 of each run survives.
    if not counter or (i < len(x) - 1 and x[i+1]):
        counter += 1
        x[i] = 0
print(x)
Prints:
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since the counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot get past the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 1
>>> np.where(
        s > threshold,
        s.to_frame()            # Convert series to dataframe.
         .assign(_dummy_=1)     # Add column of ones.
         .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
         .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)                      # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
Yes, we can use diff-style differencing along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting part is taken care of by assigning an appropriate negative value at the end of each interval, so that the cumulative sum drops back to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh
    # Get start, stop indices of the above-threshold intervals
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
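To make the reset step concrete, here is a trace on the question's sample (my own annotation, not from the original answer): each stop index receives minus the length of the ramp it closes, so the cumulative sum falls back to zero there.
# a    = [0, 0, 2, 2, 2, 2, 0, 3, 3, 0],  thresh = 1
# mask = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
# out  = [0, 0, 1, 1, 1, 1, -4, 1, 1, -2]
# out.cumsum() -> [0, 0, 1, 2, 3, 4, 0, 1, 2, 0]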
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do a fair benchmark is to tile the sample posted in the question a large number of times and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))
stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
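The idea (my own annotation, not from the original answer): array_split cuts mask at every change point, so each piece is a constant run; cumsum over a run of Trues produces the ramp, and over a run of Falses produces zeros.
# pieces of mask: [0 0] [1 1 1 1] [0] [1 1] [0]
# cumsums:        [0 0] [1 2 3 4] [0] [1 2] [0]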
I have two large matrices (1800 rows by 1800 columns), epeq and triax, whose columns look like:
epeq = [0, 1, 1, 2, 1, 0, 3, 3, 1, 1, 0, 2, 1, 1, 1]
triax = [-1, 1, 3, 1, -2, -3, -1, 1, 2, 3, 2, 1, -1, -3, -1, 1]
As you can see, the triax columns have cycles of positive and negative elements. I want a cumulative sum of epeq at the beginning of each cycle in triax, and this value should stay constant throughout the cycle, like this:
epeq_cr = [0, 1, 1, 1, 1, 1, 1, 11, 11, 11, 11, 11, 11, 11, 11, 17]
and I want to apply this procedure to all columns of the epeq matrix. I have this code, but something is missing:
epeq_cr = np.copy(epeq)
for g in range(1, len(epeq_cr)):
    for h in range(len(epeq_cr[g])):
        if (triax[g-1][h] < 0 and triax[g][h] > 0):
            epeq_cr[g][h] = np.cumsum()...
I've run out of time to look at this now but I'd start by figuring out where the cycles start in the triax:
epeq = np.array([1, 1, 2, 1, 0, 3, 3, 1, 1, 0, 2, 1, 1, 1])
triax = np.array([-1, 1, 3, 1, -2, -3, -1, 1, 2, 3, 2, 1, -1, -3, -1, 1])
t_shift = np.roll(triax, 1)
t_shift[0] = 0
cycle_starts = np.argwhere((triax > 0) & (t_shift < 0)).flatten()
array([ 1, 7, 15])
So for any position i in epeq_cr, you need to find the largest cycle start in cycle_starts that is at most i, and use sum(epeq[:start+1]):
epeq_cr = np.copy(epeq)
for g in range(1, len(epeq_cr)):
    for h in range(len(epeq_cr[g])):
        if (triax[g-1][h] <= 0 and triax[g][h] >= 0):
            epeq_cr[g][h] = sum(epeq[v][h] for v in range(g+1))
        else:
            epeq_cr[g][h] = epeq_cr[g-1][h]
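Following that idea, a vectorized version for a single column could look like this (a sketch of my own, assuming epeq and triax columns of equal length; fill_cycles is a hypothetical helper name, and the full 1800-column matrices would be processed column by column):
import numpy as np

def fill_cycles(epeq, triax):
    # Negative-to-positive transitions in triax mark the start of each cycle.
    t_shift = np.roll(triax, 1)
    t_shift[0] = 0
    cycle_starts = np.flatnonzero((triax > 0) & (t_shift < 0))
    csum = np.cumsum(epeq)   # running sum of epeq
    # Index of the most recent cycle start at or before each position.
    pos = np.searchsorted(cycle_starts, np.arange(len(epeq)), side='right') - 1
    # Before the first cycle start, keep the original epeq values.
    return np.where(pos >= 0, csum[cycle_starts[np.maximum(pos, 0)]], epeq)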