I have a time series dataframe whose values are 1 or 0 (true/false). I wrote a function that loops through all rows with value 1. Given a user-defined integer parameter called n_hold, it sets the next n_hold rows after each such row to 1 as well.
For example, in the dataframe below the loop reaches row 2016-08-05. If n_hold = 2, then both 2016-08-08 and 2016-08-09 are set to 1 too:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 0
2016-08-09 0
2016-08-10 0
The resulting df will then be:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 1
2016-08-09 1
2016-08-10 0
The problem is that this runs tens of thousands of times, and my current solution, which loops over the rows containing ones and slices the dataframe, is far too slow. Is there a much faster way to do this?
Here is my (slow) solution, x is the initial signal dataframe:
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
    row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
    if (row_idx + n_hold) >= len(x):
        break
    final_signal[row_idx:(row_idx + n_hold + 1)] = 1
I completely changed the answer, because the original code treats consecutive 1 values differently:
Explanation:
The solution first keeps only the first 1 of each run of consecutive 1s: where with a chained boolean mask (ne, i.e. not equal !=, against shift, combined with x == 1) turns every other value into NaN. It then forward fills with ffill and its limit parameter, and finally replaces the remaining NaNs with 0:
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
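To make the three steps concrete, here is a minimal sketch applied to the six-row example from the question. The helper name hold_signal is hypothetical, and .fillna(0).astype(int) is used instead of the downcast argument, which recent pandas releases flag as deprecated.

```python
import pandas as pd

def hold_signal(x, n_hold):
    # Step 1: keep only the first 1 of each run; everything else becomes NaN.
    entries = x.where(x.ne(x.shift()) & x.eq(1))
    # Step 2: carry each entry forward by at most n_hold rows.
    held = entries.ffill(limit=n_hold)
    # Step 3: the remaining NaNs are rows outside any holding window.
    return held.fillna(0).astype(int)

idx = pd.to_datetime(["2016-08-03", "2016-08-04", "2016-08-05",
                      "2016-08-08", "2016-08-09", "2016-08-10"])
x = pd.Series([0, 0, 1, 0, 0, 0], index=idx)

held = hold_signal(x, n_hold=2)
print(held.tolist())  # [0, 0, 1, 1, 1, 0]
```

Because the whole computation is a handful of vectorized calls, the cost per call does not depend on how many entry rows the signal contains.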
Timings and comparing outputs:
np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)
def orig(x):
    n_hold = 2
    entry_sig_diff = x.diff()
    entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
    final_signal = x * 0
    for i in range(0, len(entry_sig_dt)):
        row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
        if (row_idx + n_hold) >= len(x):
            break
        final_signal[row_idx:(row_idx + n_hold + 1)] = 1
    return final_signal
#print (orig(x))
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)
df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
input orig new
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 1 1
7 0 1 1
8 0 1 1
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 0 0 0
#check outputs
#print (s.values == orig(x).values)
Timings:
%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every row starts with 1, so I have to shift each row's values left until the first value is 1.
For example:
0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1
The result has to be this:
1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0
I don't want to use for, while, etc., because I need a faster method with pandas or numpy.
Do you have any ideas for this problem?
You can use cummax to mask every leading position that has to be shifted out as NaN, then sort each row so that the NaNs move to the end:
df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]:
1 2 3 4 5 6 7
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
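The masking step can be inspected on its own, and the per-row sorted call can be swapped for a stable argsort on the NaN flags. The following is a sketch under that assumption (it is not the code from the answer above), using the four rows from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 1]])

# True from the first 1 of each row onward; leading zeros stay False.
seen_one = df.cummax(axis=1).ne(0)
# Leading zeros become NaN, everything else is kept.
masked = df.where(seen_one).to_numpy(dtype=float)

# A stable argsort on the NaN flags keeps the non-NaN values in their
# original order and moves the NaNs to the end of each row.
order = np.argsort(np.isnan(masked), axis=1, kind="stable")
justified = np.take_along_axis(masked, order, axis=1)

result = pd.DataFrame(np.nan_to_num(justified).astype(int))
print(result)
```

The stable sort matters here: an unstable sort could reorder the kept values within a row.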
Or we can use the justify function written by Divakar (much faster than the apply with sorted):
pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]:
0 1 2 3 4 5 6
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
You can make use of numpy.ogrid here:
a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]
array([[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0]], dtype=int64)
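It may help to see that this indexing is a per-row roll rather than a true shift. The values that wrap around to the end of a row are exactly the leading zeros, so the result is the same as shifting left and padding with zeros. A small sketch that checks this against np.roll (the modulo form below is an equivalent rewrite of the negative-index arithmetic, written out for readability):

```python
import numpy as np

a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1]])
m, n = a.shape

# Column of the first 1 in each row = how far that row must move left.
shift = a.argmax(1)

# r has shape (m, 1) and c has shape (1, n); broadcasting them together
# yields one (row, column) index pair per output cell.
r, c = np.ogrid[:m, :n]
rolled = a[r, (c + shift[:, None]) % n]

# Reference: roll every row separately in a Python loop.
expected = np.array([np.roll(row, -k) for row, k in zip(a, shift)])
print(np.array_equal(rolled, expected))  # True
```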
Timings
In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %%timeit
...: a = df.values
...: s = a.argmax(1) * - 1
...: m, n = a.shape
...: r, c = np.ogrid[:m, :n]
...: s[s < 0] += n
...: c = c - s[:, None]
...: pd.DataFrame(a[r, c])
...:
...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For performance, you can use numba. It is an elementary loop, but effective given JIT compilation and the use of more basic objects at C level:
import numpy as np
from numba import njit

@njit
def shifter(A):
    res = np.zeros(A.shape)
    for i in range(res.shape[0]):
        # find the column of the first non-zero value in this row
        start = 0
        for j in range(res.shape[1]):
            if A[i, j] != 0:
                start = j
                break
        # copy the tail of the row to the front; the rest stays zero
        res[i, :res.shape[1]-start] = A[i, start:]
    return res
Performance benchmarking
def jpp(df):
    return pd.DataFrame(shifter(df.values).astype(int))

def user348(df):
    a = df.values
    s = a.argmax(1) * - 1
    m, n = a.shape
    r, c = np.ogrid[:m, :n]
    s[s < 0] += n
    c = c - s[:, None]
    return pd.DataFrame(a[r, c])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
assert np.array_equal(jpp(df).values, user348(df).values)
%timeit jpp(df) # 9.2 ms per loop
%timeit user348(df) # 18.5 ms per loop
Here is a stride_tricks solution, which is fast because it enables slice-wise copying.
def pp(x):
    n, m = x.shape
    am = x.argmax(-1)
    mam = am.max()
    xx = np.empty((n, m + mam), x.dtype)
    xx[:, :m] = x
    xx[:, m:] = 0
    xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
    return xx[np.arange(x.shape[0]), am]
It pads the input with the required number of zeros and then creates a sliding window view using as_strided. This view is addressed using fancy indexing, but because the last dimension is not indexed, the copying of lines is optimized and fast.
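The same pad-then-window idea can be sketched with numpy.lib.stride_tricks.sliding_window_view (available since NumPy 1.20), which builds the strided view without hand-written strides. The function name shift_left is hypothetical, and this variant is an illustration that has not been benchmarked against pp:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def shift_left(x):
    n, m = x.shape
    first = x.argmax(-1)  # column of the first 1 in each row
    # Pad on the right with as many zero columns as the largest shift.
    padded = np.concatenate([x, np.zeros((n, first.max()), x.dtype)], axis=1)
    # windows[i, k] is the length-m slice of row i starting at column k.
    windows = sliding_window_view(padded, m, axis=1)
    # Pick, for every row, the window that starts at its first 1.
    return windows[np.arange(n), first]

x = np.array([[0, 1, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1]])
shifted = shift_left(x)
print(shifted)
```

The window view itself copies nothing; only the final fancy-indexing step materializes one row of length m per input row.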
How fast? For large enough inputs on par with numba:
x = np.random.randint(0, 2, (10000, 10))
from timeit import timeit
shifter(x) # that should compile it, right?
print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))
Sample output:
0.8630472810036736
0.7336142909916816
I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted within each block, and the InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1,len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was at least 10 minutes for the previous row. Is it possible to do this without loops, using shift() or similar, so that it runs much faster?
I don't see/know how to use internal vectorized Pandas/Numpy methods for shifting Series/Array using a non-scalar / vector step, but we can use Numba here:
from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    # for every row, look up the value step[i]-1 rows further down
    for i in range(N):
        res[i] = s[i+step[i]-1]
    return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
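For comparison, the per-row lookup s[i + step[i] - 1] can also be written as a single NumPy integer-array (fancy) indexing operation. The following is a sketch, assuming InvInstance never points past the end of the frame; it has not been timed against the Numba version:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": [5, 8, 9, 19, 20, 3, 8, 13],
                   "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1]})

t = df["Time"].to_numpy()
# Position of the last row of the block that each row belongs to.
end_pos = np.arange(len(df)) + df["InvInstance"].to_numpy() - 1
# Time remaining until the end of the block, for every row at once.
m = pd.Series(t[end_pos] - t, index=df.index)

# Crossover: below 10 now, at least 10 on the previous row.
df["10mMark"] = ((m < 10) & (m.shift() >= 10)).astype(int)
print(df)
```

The first row compares against NaN after the shift, which evaluates to False, so it is marked 0 just as in the loop version.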
Timing for an 8,000-row DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
...: df["10mMark"] = 0
...: for i in range(1,len(df)):
...: r = int(df.InvInstance.iloc[i])
...: rprev = int(df.InvInstance.iloc[i-1])
...: m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
...: mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
...: df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
Actually, your m is the time delta between the time of a row and the time at the end of its block, and mprev is the same thing for the previous row (so it is just m shifted by one row). My idea is to create a column containing the time at the end of the block: first identify each block, then merge in the last time obtained with groupby on block. Then compute the difference into a column 'm', and use np.where and shift to fill the column 10mMark.
# a column with an incremental value at each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back fill so that every row of a block carries the same block number
df['block'] = df['block'].bfill()
# now merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('','_last'), how='left')
# create column m with just a difference
df['m'] = df['Time_last'] - df['Time']
# now you can use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10),1,0)
# drop the helper columns (axis must be passed by keyword in recent pandas)
df = df.drop(['block', 'Time_last', 'm'], axis=1)
Your final result before dropping, to see what has been created, looks like:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result.
It is not as efficient as @MaxU's Numba solution, but with a df of 8000 rows as he used, I get a speed-up factor of about 350.
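The merge can probably be avoided by broadcasting each block's last Time with groupby(...).transform('last'). A minimal sketch of that variant follows; the names block and time_last are illustrative, and no timing claim is made for it:

```python
import pandas as pd

df = pd.DataFrame({"Time": [5, 8, 9, 19, 20, 3, 8, 13],
                   "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1]})

# A new block starts on the row after each InvInstance == 1.
block = df["InvInstance"].eq(1).shift(fill_value=False).cumsum()
# Broadcast the last Time of each block back onto all of its rows.
time_last = df.groupby(block)["Time"].transform("last")

m = time_last - df["Time"]
df["10mMark"] = ((m < 10) & (m.shift() >= 10)).astype(int)
print(df)
```

Since transform returns a Series aligned with the original index, no helper columns have to be created and dropped afterwards.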
I have a 10x10 numpy matrix and want to zero the values in some columns according to a vector [1,0,0,0,0,1,0,0,1,0]. How can I do this with the best performance? Other Python libraries are also acceptable if they work better.
The simplest way to do this is multiplication. Multiplying a value by 0 zeroes it out, and multiplying a value by 1 has no effect, so multiplying your matrix with your vector will do exactly what you want:
m = np.random.randint(1, 10, (10,10))
v = np.array([1,0,0,0,0,1,0,0,1,0])
print(m * v)
Output:
[[7 0 0 0 0 5 0 0 5 0]
[8 0 0 0 0 5 0 0 6 0]
[1 0 0 0 0 5 0 0 9 0]
[1 0 0 0 0 6 0 0 1 0]
[5 0 0 0 0 8 0 0 5 0]
[5 0 0 0 0 4 0 0 9 0]
[1 0 0 0 0 3 0 0 9 0]
[1 0 0 0 0 9 0 0 8 0]
[6 0 0 0 0 4 0 0 6 0]
[1 0 0 0 0 6 0 0 1 0]]
You were concerned that multiplying might be too slow, and wanted to know how to do it by selecting. That's easy too:
bv = v.astype(bool)
m[:,bv] = 0
print(m)
Or, instead of astype, you could use bv = v == 1; since you end up with the exact same bool array, I can't imagine that would make a difference.
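One detail worth checking when switching between the two approaches: multiplication zeroes the columns where v is 0, while m[:,bv] = 0 zeroes the columns where v is 1. To get the same result as m * v by selection, the mask has to be inverted. A short sketch, assuming the goal is to keep the columns flagged with 1:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.integers(1, 10, size=(10, 10))
v = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0])

# Broadcasting: every row of m is multiplied element-wise by v.
by_multiply = m * v

# Boolean selection: zero the columns that are NOT flagged.
keep = v.astype(bool)
by_mask = m.copy()          # work on a copy so m stays intact
by_mask[:, ~keep] = 0

# np.where builds a new array without modifying m.
by_where = np.where(keep, m, 0)

print(np.array_equal(by_multiply, by_mask))   # True
print(np.array_equal(by_multiply, by_where))  # True
```

Note that the in-place assignment modifies the array it is applied to, whereas m * v and np.where allocate a new one.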
So, which is fastest?
In [123]: %timeit m*v
2.87 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [124]: bv = v.astype(np.bool)
In [125]: %timeit m[:,v.astype(np.bool)]
5.02 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [127]: bv = v==1
In [128]: %timeit m[:,v.astype(np.bool)]
5.03 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So, the "slow" way actually runs in less than two thirds the time.
Also, it takes only 5 microseconds no matter how you do it—which is what you should expect, given how small the array is.
I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.
Timestamp Boolean_condition
1 1
2 1
3 0
4 1
5 1
6 1
7 0
8 0
9 1
10 0
How to count the total number of time intervals within the whole series that meet this condition?
The desired output should look like this:
Timestamp Boolean_condition Event_number
1 1 1
2 1 1
3 0 NaN
4 1 2
5 1 2
6 1 2
7 0 NaN
8 0 NaN
9 1 3
10 0 NaN
You can create a Series from the cumsum of two combined masks and then set the rows with 0 to NaN with the Series.mask function:
mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
Name: Boolean_condition, dtype: int32
df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
Timestamp Boolean_condition Event_number
0 1 1 1.0
1 2 1 1.0
2 3 0 NaN
3 4 1 2.0
4 5 1 2.0
5 6 1 2.0
6 7 0 NaN
7 8 0 NaN
8 9 1 3.0
9 10 0 NaN
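An alternative sketch counts the starts of intervals instead of their ends. This is not the code above but a variation on it; it also answers the "total number of intervals" part directly and begins numbering at 1 even if the series starts with 0:

```python
import pandas as pd

df = pd.DataFrame({"Timestamp": range(1, 11),
                   "Boolean_condition": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

cond = df["Boolean_condition"].eq(1)
# An interval starts where the condition holds but did not hold
# on the previous row (the first row has no previous row).
starts = cond & ~cond.shift(fill_value=False)

# Running count of starts, kept only on rows inside an interval.
df["Event_number"] = starts.cumsum().where(cond)

n_events = int(starts.sum())
print(n_events)  # 3
print(df)
```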
Timings:
#[100000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)

def jez1(df):
    # note: ~ is a logical NOT only if Boolean_condition has bool dtype
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)
In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop
In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop
In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
You could try the following:
1) Get all values of the True instance (here, 1); these make up isone.
2) Take its corresponding set of indices and convert it to a series, so that the new series has the earlier computed indices as both its index and its values. Take the difference between successive rows and check whether it equals 1. This becomes our boolean mask.
3) Compare isone with the obtained boolean mask and take the cumulative sum over the positions where they are not equal (an adjacency check between elements). This gives the group labels.
4) Using loc on the indices of isone, assign the codes obtained by converting grp to Categorical format to a new column, Event_number.
isone = df.Bolean_condition[df.Bolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
Faster approach:
Using only numpy:
1) Get its array representation.
2) Compute the indices of the non-zero values (here, the 1's).
3) Insert NaN at the beginning of this array, which acts as a starting point for taking the difference between successive entries.
4) Initialize a new array filled with NaNs, of the same shape as the original array.
5) Whenever the difference between successive entries is not equal to 1, a new group starts and the cumulative sum increases; otherwise the entries fall in the same group. These values are written at the indices where there were 1's before.
6) Assign these back to the new column.
def nick(df):
    b = df.Bolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df
Timings:
For a DF of 10,000 rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
Timestamp=np.arange(10000),
Bolean_condition=np.random.choice(np.array([0,1]), 10000, p=[0.4, 0.6]))
)
df1.shape
# (10000, 2)
def jez(df):
    mask0 = df.Bolean_condition.eq(0)
    mask2 = df.Bolean_condition.ne(df.Bolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return (df)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop
%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop
For a DF containing 1 million rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
Timestamp=np.arange(1000000),
Bolean_condition=np.random.choice(np.array([0,1]), 1000000, p=[0.4, 0.6]))
)
df1.shape
# (1000000, 2)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop
%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
This should work but might be a bit slow for a very long df.
df = pd.concat([df,pd.Series([0]*len(df), name = '2')], axis = 1)
if df.iloc[0,1] == 1:
    counter = 1
    df.iloc[0, 2] = counter
else:
    counter = 0
    df.iloc[0,2] = 0
previous = df.iloc[0,1]
for y,x in df.iloc[1:,].iterrows():
    print(y)
    if x[1] == 1 and previous == 1:
        previous = x[1]
        df.iloc[y, 2] = counter
    if x[1] == 0:
        previous = x[1]
        df.iloc[y,2] = 0
    if x[1] == 1 and previous == 0:
        counter += 1
        previous = x[1]
        df.iloc[y,2] = counter
A custom function does the trick. Here is a solution in Matlab code:
Boolean_condition = [1 1 0 1 1 1 0 0 1 0];
Event_number = [NA NA NA NA NA NA NA NA NA NA];
loop_event_number = 1;
for timestamp=1:10
    if Boolean_condition(timestamp)==1
        Event_number(timestamp) = loop_event_number;
        last_event_number = loop_event_number;
    else
        loop_event_number = last_event_number +1;
    end
end
% Event_number = 1 1 NA 2 2 2 NA NA 3 NA
I have a df like so:
Count
1
0
1
1
0
0
1
1
1
0
and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count and a 0 if there is not. So in the new column each row would get a 1 based on this criteria being met in the column Count. My desired output would then be:
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
I am thinking I may need itertools, but I have been reading about it and haven't found what I need yet. I would also like to use this method for any number of consecutive occurrences, not just 2. For example, sometimes I need to detect 10 consecutive occurrences; I just use 2 in the example here.
You could:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
From here you can, for any threshold:
threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)
to get:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
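The one-liner is easier to follow when split into its intermediate steps. Below is a minimal sketch wrapped in a function so that the threshold can be varied; the name flag_runs is hypothetical:

```python
import pandas as pd

def flag_runs(s, threshold):
    # A new run starts wherever the value differs from the previous row,
    # so the cumulative sum of those change points labels each run.
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run that each row belongs to.
    run_len = s.groupby(run_id).transform("size")
    # Flag only rows that are 1 and sit in a long enough run.
    return (s.eq(1) & (run_len >= threshold)).astype(int)

count = pd.Series([1, 0, 1, 1, 0, 0, 1, 1, 1, 0])
print(flag_runs(count, 2).tolist())  # [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
print(flag_runs(count, 3).tolist())  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```

Using s.eq(1) in place of the multiplication by Count keeps the result 0/1 valued even if the column holds values other than 0 and 1.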
In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared to:
%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
Not sure if this is optimized, but you can give it a try:
from itertools import groupby
import pandas as pd
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
df['new_Value'] = pd.Series(l)
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0