Converting one int to multiple bool columns in pandas - python

Background
I have a data frame of integers. Each integer encodes a set of features that are either present or absent for that row.
I want these features to be named columns in my data frame.
Problem
My current solution explodes in memory and is crazy slow. How do I improve the memory efficiency of this?
import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).apply(pd.Series).rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"])))
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
It seems to be the .apply(pd.Series) that is slowing this down. Everything else is quite fast until I add this.
I cannot skip it because a simple list will not make a dataframe.

You can use the numpy.binary_repr method:
In [336]: df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=4)))) \
.add_prefix('f')
Out[336]:
f0 f1 f2 f3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
or
In [346]: pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
...: columns=np.arange(1,5)) \
...: .add_prefix('f')
...:
Out[346]:
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0

Here's a vectorized NumPy approach -
def num2bin(nums, width):
    return ((nums[:,None] & (1 << np.arange(width-1,-1,-1)))!=0).astype(int)
Sample run -
In [70]: df
Out[70]:
some_int
0 1
1 5
2 3
3 8
4 4
In [71]: pd.DataFrame( num2bin(df.some_int.values, 4), \
columns = ["f1", "f2", "f3", "f4"])
Out[71]:
f1 f2 f3 f4
0 0 0 0 1
1 0 1 0 1
2 0 0 1 1
3 1 0 0 0
4 0 1 0 0
Explanation
1) Inputs :
In [98]: nums = np.array([1,5,3,8,4])
In [99]: width = 4
2) Get the powers of 2, one for each bit position :
In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])
3) Convert nums to a 2D array so that we can later do an element-wise bitwise AND between it and the powers of 2 in a vectorized manner, following the rules of broadcasting :
In [101]: nums[:,None]
Out[101]:
array([[1],
[5],
[3],
[8],
[4]])
In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]:
array([[0, 0, 0, 1],
[0, 4, 0, 1],
[0, 0, 2, 1],
[8, 0, 0, 0],
[0, 4, 0, 0]])
To understand the bitwise AND, consider the number 5 from nums and AND it against each of the powers of 2, [8,4,2,1] :
In [103]: 5 & 8 # 0101 & 1000
Out[103]: 0
In [104]: 5 & 4 # 0101 & 0100
Out[104]: 4
In [105]: 5 & 2 # 0101 & 0010
Out[105]: 0
In [106]: 5 & 1 # 0101 & 0001
Out[106]: 1
Thus, we see that there is no intersection with 8 or 2, whereas the others give non-zero results.
4) In the final stage, look for matches (non-zeros) and convert those to 1s and the rest to 0s: comparing against 0 gives a boolean array, which is then converted to int dtype :
In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))
In [108]: matches!=0
Out[108]:
array([[False, False, False, True],
[False, True, False, True],
[False, False, True, True],
[ True, False, False, False],
[False, True, False, False]], dtype=bool)
In [109]: (matches!=0).astype(int)
Out[109]:
array([[0, 0, 0, 1],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
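Since the question is about memory, the same broadcasting trick can stop one step earlier and keep the boolean array instead of converting it to int (NumPy stores each bool in one byte). A minimal sketch under that assumption; the helper name int_to_flags and the sample frame are illustrative only:

```python
import numpy as np
import pandas as pd

def int_to_flags(values, names):
    """Expand non-negative ints into one boolean column per bit (illustrative helper)."""
    values = np.asarray(values)
    width = len(names)
    # Powers of two from the most significant bit down to 1, e.g. [8, 4, 2, 1]
    bit_masks = 1 << np.arange(width - 1, -1, -1)
    # Broadcasting (n, 1) against (width,) yields an (n, width) array
    flags = (values[:, None] & bit_masks) != 0
    return pd.DataFrame(flags, columns=list(names))

df = pd.DataFrame({'some_int': [1, 5, 3, 8, 4]})
features = int_to_flags(df['some_int'].to_numpy(), ["f1", "f2", "f3", "f4"])
print(features.dtypes.unique())
print(features.astype(int))
```

Keeping the columns as bool also lets them be used directly as masks, e.g. df[features['f4']].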
Runtime test
In [58]: df = pd.DataFrame({'some_int':range(100000)})
# @jezrael's soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop
# @jezrael's soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop
# @jezrael's soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop
# @MaxU's soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop
# @MaxU's soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop
# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop

I think you need:
a = pd.DataFrame(df['some_int'].astype(int)
.apply(bin)
.str[2:]
.str.zfill(4)
.apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Another solution, thanks Jon Clements and ayhan:
a = pd.DataFrame(df['some_int'].apply(lambda x: list('{:04b}'.format(x))).values.tolist(),
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
A bit changed:
a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values],
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Timings:
df = pd.DataFrame({'some_int':range(100000)})
In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop
In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop
In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop
In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop
In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop
In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop

Related

Step left rows values until not null in python

I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every row starts with 1, so I have to shift each row's values left until the first value is 1.
For example:
0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1
The result has to be this:
1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0
I don't want to use for, while, etc., because I need a faster method with pandas or numpy.
Do you have any ideas for this problem?
You can use cummax to mask every leading position that needs to be shifted out as NaN, then sort each row so that the NaNs move to the end:
df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]:
1 2 3 4 5 6 7
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
Or use the justify function written by Divakar (much faster than the apply with sorted):
pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]:
0 1 2 3 4 5 6
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
You can make use of numpy.ogrid here:
a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]
array([[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0]], dtype=int64)
Timings
In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %%timeit
...: a = df.values
...: s = a.argmax(1) * - 1
...: m, n = a.shape
...: r, c = np.ogrid[:m, :n]
...: s[s < 0] += n
...: c = c - s[:, None]
...: pd.DataFrame(a[r, c])
...:
...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
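The ogrid code above works because each row's leading zeros can be rotated around to the end of the row. The same rotation can be written with an explicit modulo on the column indices, which may be easier to follow. A minimal sketch, assuming a 2D array of 0s and 1s; rotate_rows_left is a hypothetical name:

```python
import numpy as np

def rotate_rows_left(a):
    """Rotate each row left so it starts at its first 1 (illustrative helper)."""
    a = np.asarray(a)
    n_rows, n_cols = a.shape
    # Column of the first 1 in each row (0 for an all-zero row, which is then left as-is)
    first_one = a.argmax(axis=1)
    # For output column j, read input column (j + first_one) modulo the row length
    cols = (np.arange(n_cols) + first_one[:, None]) % n_cols
    return a[np.arange(n_rows)[:, None], cols]

sample = np.array([[0, 1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 1]])
shifted = rotate_rows_left(sample)
print(shifted)
```

Only zeros wrap around, because everything before the first 1 in a row is zero by definition.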
For performance, you can use numba. An elementary loop, but effective given JIT-compilation and use of more basic objects at C-level:
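import numpy as np
from numba import njit

@njit
def shifter(A):
    res = np.zeros(A.shape)
    for i in range(res.shape[0]):
        start = 0
        # find the column of the first non-zero value in this row
        for j in range(res.shape[1]):
            if A[i, j] != 0:
                start = j
                break
        # copy the row from that column onward; the tail stays zero
        res[i, :res.shape[1]-start] = A[i, start:]
    return res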
Performance benchmarking
def jpp(df):
    return pd.DataFrame(shifter(df.values).astype(int))
def user348(df):
    a = df.values
    s = a.argmax(1) * - 1
    m, n = a.shape
    r, c = np.ogrid[:m, :n]
    s[s < 0] += n
    c = c - s[:, None]
    return pd.DataFrame(a[r, c])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
assert np.array_equal(jpp(df).values, user348(df).values)
%timeit jpp(df) # 9.2 ms per loop
%timeit user348(df) # 18.5 ms per loop
Here is a stride_tricks solution, which is fast because it enables slice-wise copying.
def pp(x):
    n, m = x.shape
    am = x.argmax(-1)
    mam = am.max()
    xx = np.empty((n, m + mam), x.dtype)
    xx[:, :m] = x
    xx[:, m:] = 0
    xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
    return xx[np.arange(x.shape[0]), am]
It pads the input with the required number of zeros and then creates a sliding window view using as_strided. This is addressed using fancy indexing, but because the last dimension is not indexed, copying of lines is optimized and fast.
How fast? For large enough inputs on par with numba:
x = np.random.randint(0, 2, (10000, 10))
from timeit import timeit
shifter(x) # that should compile it, right?
print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))
Sample output:
0.8630472810036736
0.7336142909916816
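On NumPy 1.20 or later, the padded sliding-window idea can also be expressed with sliding_window_view, which builds the strided view without computing strides by hand. This is a sketch of the same technique rather than the answer's exact code, and it has not been benchmarked here; pp_windows is a hypothetical name:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pp_windows(x):
    """Shift each row left to its first 1 via a sliding-window view (illustrative)."""
    x = np.asarray(x)
    n, m = x.shape
    first_one = x.argmax(axis=-1)
    # Pad on the right with as many zeros as the largest shift needs
    padded = np.zeros((n, m + first_one.max()), dtype=x.dtype)
    padded[:, :m] = x
    # windows[i, k] is row i of the padded array starting at column k, length m
    windows = sliding_window_view(padded, m, axis=1)
    # Pick, for every row, the window that starts at its first 1
    return windows[np.arange(n), first_one]

sample = np.array([[0, 1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 1]])
shifted = pp_windows(sample)
print(shifted)
```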

Vectorized cummulative sum based on value in array numpy [duplicate]

Let's say I have a Pandas DataFrame df:
Date Value
01/01/17 0
01/02/17 0
01/03/17 1
01/04/17 0
01/05/17 0
01/06/17 0
01/07/17 1
01/08/17 0
01/09/17 0
For each row, I want to efficiently calculate the days since the last occurrence of Value=1.
So that df becomes:
Date Value Last_Occurence
01/01/17 0 NaN
01/02/17 0 NaN
01/03/17 1 0
01/04/17 0 1
01/05/17 0 2
01/06/17 0 3
01/07/17 1 0
01/08/17 0 1
01/09/17 0 2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i,'Value']==1)
    df.loc[i, 'Last_Occurence'] = i-last
But it seems very inefficient for extremely large data sets and probably isn't right anyway.
Here's a NumPy approach -
def intervaled_cumsum(a, trigger_val=1, start_val = 0, invalid_specifier=-1):
    out = np.ones(a.size,dtype=int)
    idx = np.flatnonzero(a==trigger_val)
    if len(idx)==0:
        return np.full(a.size,invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier
        return out
A few sample runs on array data to showcase the usage, covering various combinations of trigger and start values :
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
[-1, 0, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0, 0, 1],
[-1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 2],
[ 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 0],
[ 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 3, 4, 5, 6, 1]])
Using it to solve our case :
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
Date Value Last_Occurence
0 01/01/17 0 -1
1 01/02/17 0 -1
2 01/03/17 1 0
3 01/04/17 0 1
4 01/05/17 0 2
5 01/06/17 0 3
6 01/07/17 1 0
7 01/08/17 0 1
8 01/09/17 0 2
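If NaN is preferred over -1 for the rows before the first 1, as in the question's expected output, a running maximum of the last trigger position gives the same distances. Note that this swaps in np.maximum.accumulate for the cumsum trick above; since_last_trigger is a hypothetical name:

```python
import numpy as np

def since_last_trigger(a, trigger_val=1):
    """Rows since the most recent trigger_val, NaN before the first one (illustrative)."""
    a = np.asarray(a)
    pos = np.arange(a.size)
    # Position of each trigger, -1 elsewhere; the running max carries the last hit forward
    last_hit = np.maximum.accumulate(np.where(a == trigger_val, pos, -1))
    out = (pos - last_hit).astype(float)
    out[last_hit < 0] = np.nan  # no trigger seen yet
    return out

values = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0])
result = since_last_trigger(values)
print(result)
```

The result is float because NaN cannot be stored in an integer array.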
Runtime test
Approaches -
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                     cumsum()).cumcount().where(mask))
# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
Date Value Last_Occurance
0 01/01/17 0 NaN
1 01/02/17 0 NaN
2 01/03/17 1 0.0
3 01/04/17 0 1.0
4 01/05/17 0 2.0
5 01/06/17 0 3.0
6 01/07/17 1 0.0
7 01/08/17 0 1.0
8 01/09/17 0 2.0
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you have to have nan for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
You don't have to recompute last at every step of the for loop. Initialize a variable outside the loop
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter what method you select, iterating the whole table once is inevitable.

Pandas: Find row wise frequent value

I have a dataset with binary values. I want to find the most frequent value in each row. The dataset has a couple of million records. What would be the most efficient way to do it? Following is a sample of the dataset.
import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1 bit2 bit2 bit4 bit5 frequent freq_count
0 0 0 1 1 0 3
1 1 1 0 0 1 3
1 0 1 1 1 1 4
I want to create the frequent and freq_count columns as in the sample above. These are not part of the original dataset and will be created after looking at all rows.
Here's one approach -
def freq_stat(df):
    a = df.values
    zero_c = (a==0).sum(1)
    one_c = a.shape[1] - zero_c
    df['frequent'] = (zero_c<=one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df
Sample run -
In [305]: df
Out[305]:
bit1 bit2 bit2.1 bit4 bit5
0 0 0 0 1 1
1 1 1 1 0 0
2 1 0 1 1 1
In [308]: freq_stat(df)
Out[308]:
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
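Note that freq_stat writes the new columns into the frame it is given, and that a tie between zeros and ones is reported as 1 because of the <= comparison. A non-mutating sketch of the same counting idea, assuming every column holds only 0s and 1s; row_mode_binary and the column names are hypothetical:

```python
import numpy as np
import pandas as pd

def row_mode_binary(df):
    """Return a copy of df with per-row mode and its count (illustrative helper)."""
    bits = df.to_numpy()
    # With only 0s and 1s, the row sum is the number of ones
    one_c = bits.sum(axis=1)
    zero_c = bits.shape[1] - one_c
    return df.assign(
        frequent=(one_c >= zero_c).astype(int),  # ties resolve to 1, as in freq_stat
        freq_count=np.maximum(zero_c, one_c),
    )

df = pd.DataFrame(
    [[0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [1, 0, 1, 1, 1]],
    columns=['bit1', 'bit2', 'bit3', 'bit4', 'bit5'],
)
result = row_mode_binary(df)
print(result)
```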
Benchmarking
Let's test this one against the fastest approach from @jezrael's soln :
from scipy import stats
def mod(df): # @jezrael's best soln
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df
Also, let's use the same setup from the other post and get the timings -
In [323]: np.random.seed(100)
...: N = 10000
...: #[10000 rows x 20 columns]
...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
...:
# @jezrael's soln
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop
# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
You can use scipy.stats.mode:
from scipy import stats
a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))
df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Use Counter.most_common:
from collections import Counter
def f(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Another solution:
def f(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a,b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Alternative:
from collections import defaultdict
def f(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Timings:
np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))
In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop
In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop
In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop
In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop
In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop
from collections import Counter
from collections import defaultdict
from scipy import stats
def f1(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])
def f2(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a,b])
def f3(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
def mod(df):
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df
def mod1(df):
    a = df.values
    b = stats.mode(a, axis=1)
    df['a'] = b[0][:, 0]
    df['b'] = b[1][:, 0]
    return df

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So, the resulting array should tell me how many consecutive values have fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in python which returns the corresponding numpy array:
def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, however I do not get through the reset part. Is someone aware of a more elegant numpy-style solution of above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
s > threshold,
s
.to_frame() # Convert series to dataframe.
.assign(_dummy_=1) # Add column of ones.
.groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_'] # shift-cumsum pattern
.transform(lambda x: x.cumsum()), # Cumsum the ones per group.
0) # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The reset is handled by assigning an appropriate negative value at the end of each interval, so that the cumulative sum drops back to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a>thresh
    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False] ))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0,s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1<len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
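To see where the reset comes from, it can help to print the intermediate arrays for the input from the question. The sketch below repeats the steps of intervaled_ramp with named intermediates; the variable names are illustrative only:

```python
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
thresh = 1

mask = a > thresh
# Pad with False on both sides so every run has a start edge and a stop edge
mask_ext = np.concatenate(([False], mask, [False]))
edges = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
starts, stops = edges[::2], edges[1::2]  # runs start at 2 and 7, stop at 6 and 9

steps = mask.astype(int)
# A run that reaches the end of the array has stop == len(a); nothing to reset there
valid_stops = stops[stops < len(a)]
# At each stop, subtract the run length so the running sum returns to zero
steps[valid_stops] = starts[:len(valid_stops)] - valid_stops

ramp = steps.cumsum()
print(starts, stops)
print(steps)   # +1 inside a run, minus the run length right after it
print(ramp)
```

Here the first run has length 4, so position 6 holds -4, and the second run has length 2, so position 9 holds -2.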
Runtime test
One way to do a fair benchmark is to take the sample posted in the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])

How to count the number of time intervals that meet a boolean condition within a pandas dataframe?

I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.
Timestamp Boolean_condition
1 1
2 1
3 0
4 1
5 1
6 1
7 0
8 0
9 1
10 0
How to count the total number of time intervals within the whole series that meet this condition?
The desired output should look like this:
Timestamp Boolean_condition Event_number
1 1 1
2 1 1
3 0 NaN
4 1 2
5 1 2
6 1 2
7 0 NaN
8 0 NaN
9 1 3
10 0 NaN
You can create Series with cumsum of two masks and then create NaN by function Series.mask:
mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
Name: Boolean_condition, dtype: int32
df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
Timestamp Boolean_condition Event_number
0 1 1 1.0
1 2 1 1.0
2 3 0 NaN
3 4 1 2.0
4 5 1 2.0
5 6 1 2.0
6 7 0 NaN
7 8 0 NaN
8 9 1 3.0
9 10 0 NaN
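Another way to think about the numbering is to count where each run of 1s begins, instead of counting runs of 0s and adding 1. A sketch of that variant (not the code from the answer above); because it counts the starts of 1-runs, it also numbers the first event 1 when the series begins with 0:

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': range(1, 11),
    'Boolean_condition': [1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
})

cond = df['Boolean_condition'].astype(bool)
# True only on the first row of each run of 1s
starts = cond & ~cond.shift(fill_value=False)
# Running count of runs, blanked out where the condition is not met
df['Event_number'] = starts.cumsum().where(cond)
print(df)
```

The column comes out as float because the rows outside any event hold NaN.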
Timings:
#[100000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df
def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)
def jez1(df):
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)
In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop
In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop
In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
You could try the following:
1) Select all True values (here, 1); these make up isone.
2) Take its corresponding set of indices and convert this to a series representation so that the new series has both its index and values as the earlier computed indices. Take the difference between successive rows and check whether it equals 1. This becomes our boolean mask.
3) Compare isone with the obtained boolean mask and take the cumulative sum of the positions where they differ (an adjacency check between elements). This gives us the group labels.
4) Using loc with the indices of isone, we convert grp to Categorical, take its codes, and assign them to a new column, Event_number.
isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
Faster approach:
Using only numpy:
1) Get its array representation.
2) Compute the indices of the non-zero entries (here, the 1's).
3) Insert NaN at the beginning of this array, which acts as a starting point when taking the difference between successive entries.
4) Initialize a new array filled with NaN's of the same shape as the original array.
5) Whenever the difference between successive indices is not equal to 1, a new group starts; the cumulative sum of these flags gives the group number, and consecutive indices fall in the same group. These values are written at the indices where there were 1's before.
6) Assign these back to the new column.
def nick(df):
    b = df.Bolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df
Timings:
For a DF of 10,000 rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
Timestamp=np.arange(10000),
Bolean_condition=np.random.choice(np.array([0,1]), 10000, p=[0.4, 0.6]))
)
df1.shape
# (10000, 2)
def jez(df):
    mask0 = df.Bolean_condition.eq(0)
    mask2 = df.Bolean_condition.ne(df.Bolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return (df)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop
%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop
For a DF containing 1 million rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
Timestamp=np.arange(1000000),
Bolean_condition=np.random.choice(np.array([0,1]), 1000000, p=[0.4, 0.6]))
)
df1.shape
# (1000000, 2)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop
%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
This should work but might be a bit slow for a very long df.
df = pd.concat([df,pd.Series([0]*len(df), name = '2')], axis = 1)
if df.iloc[0,1] == 1:
    counter = 1
    df.iloc[0, 2] = counter
else:
    counter = 0
    df.iloc[0,2] = 0
previous = df.iloc[0,1]
for y,x in df.iloc[1:,].iterrows():
    print(y)
    if x[1] == 1 and previous == 1:
        previous = x[1]
        df.iloc[y, 2] = counter
    if x[1] == 0:
        previous = x[1]
        df.iloc[y,2] = 0
    if x[1] == 1 and previous == 0:
        counter += 1
        previous = x[1]
        df.iloc[y,2] = counter
A custom function does the trick. Here is a solution in Matlab code:
Boolean_condition = [1 1 0 1 1 1 0 0 1 0];
Event_number = [NA NA NA NA NA NA NA NA NA NA];
loop_event_number = 1;
for timestamp=1:10
    if Boolean_condition(timestamp)==1
        Event_number(timestamp) = loop_event_number;
        last_event_number = loop_event_number;
    else
        loop_event_number = last_event_number +1;
    end
end
% Event_number = 1 1 NA 2 2 2 NA NA 3 NA
