Relative row selection in pandas? - python

Is it possible in pandas to select the 5 rows before/after a specific row if they match a specific condition?
For instance, is it possible to start from the row where a is 19, and then select the five preceding rows for which b is True (thus selecting the rows where a is 16, 13, 10, 7, and 4)? I would call this 'relative' location. (Is there a better term for this? A place I can read about this type of lookup?)
 a      b
=============
 0   True
 1   False
 4   True
 7   True
 9   False
10   True
13   True
16   True
18   False
19   True

try this:
In [31]: df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
Out[31]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
Step by step:
index of the element where a==19:
In [32]: df[df.a == 19].index[0]
Out[32]: 9
now we can list all elements where b is True and whose index is less than 9:
In [30]: df.ix[(df.b) & (df.index <9)].tail(5)
Out[30]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
now combine both of them:
In [33]: df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
Out[33]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
Speed it up a little bit:
In [103]: idx19 = df[df.a == 19].index[0]
In [104]: idx19
Out[104]: 9
In [107]: %timeit df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
1000 loops, best of 3: 973 µs per loop
In [108]: %timeit df.ix[(df.b) & (df.index < idx19)].tail(5)
1000 loops, best of 3: 564 µs per loop
PS: so it got about 42% faster
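Note that .ix has since been removed from pandas (1.0+). Assuming the same frame, a rough modern equivalent of the selection above with .loc would be:
idx19 = df[df.a == 19].index[0]            # index label of the row where a == 19
df.loc[df.b & (df.index < idx19)].tail(5)  # last five True rows before it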

Related

How to create a new column containing the largest value in a list that is smaller than cell value in an existing column?

I have a pandas dataframe that looks like:
a
0 0
1 -2
2 4
3 1
4 6
I also have a list
A = [-1, 2, 5, 7]
I want to add a new column called 'b', that contains the largest value in A that is smaller than the cell value in column 'a'. If no such value exists, I want the value in 'b' to be 'X'. So, the goal is to get:
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
How do I achieve this?
There is a built-in function, merge_asof:
s=pd.DataFrame({'a':A,'b':A})
pd.merge_asof(df.assign(index=df.index).sort_values('a'),s,on='a').set_index('index').sort_index().fillna('X')
Out[284]:
a b
index
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
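If the one-liner is hard to follow, the same idea can be spelled out step by step. A sketch (note that merge_asof allows exact matches by default, so allow_exact_matches=False would be needed if "strictly smaller" matters for values that also appear in A):
s = pd.DataFrame({'a': A, 'b': A})                  # lookup table, already sorted on 'a'
left = df.assign(index=df.index).sort_values('a')   # merge_asof needs the left side sorted too
merged = pd.merge_asof(left, s, on='a', allow_exact_matches=False)  # last A value strictly below each 'a'
result = merged.set_index('index').sort_index().fillna('X')         # restore original order, mark misses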
def largest_min(x):
    less_than = list(filter(lambda l: l < x, A))
    if len(less_than):
        return max(less_than)
    return 'X'

df['b'] = df['a'].apply(largest_min)
Edited: to fix an error and to return 'X' when no value is found.
Not sure of a pandas method, but numpy.searchsorted is a perfect fit here.
Finds indices where elements should be inserted to maintain order.
Once you have the indices at which your elements would be inserted to maintain the sort, you can look at the element to the left of each index in the lookup array to find the closest smaller element. If an element would be inserted at the beginning of the list (index 0), no smaller element exists in the lookup list, and we account for that case with np.where.
A = np.array([-1, 2, 5, 7])
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
This method will be much faster than apply here.
df = pd.concat([df]*10_000)
%%timeit
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')
6.09 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['a'].apply(largest_min)
196 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is another way to do it:
df1 = pd.Series(A)

def filler(val):
    v = df1[df1 < val.iloc[0]].max()
    return v

df.assign(b=df.apply(filler, axis=1).fillna('X'))
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
df = pd.DataFrame({'a':[0,1,4,1,6]})
A = [-1,2,5,7]
new_list = []
for i in df.iterrows():
    for j in range(len(A)):
        if A[j] < i[1]['a']:
            print(A[j])
            pass
        elif j == 0:
            new_list.append(A[j])
            break
        else:
            new_list.append(A[j-1])
            break
df['b'] = new_list
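For completeness, the standard-library bisect module gives the same "largest value smaller than x" lookup without writing the inner loop by hand (a sketch, assuming A is kept sorted):
from bisect import bisect_left

def largest_smaller(x, lookup=A):
    i = bisect_left(lookup, x)           # insertion point that keeps lookup sorted
    return lookup[i - 1] if i else 'X'   # the element to the left is the largest value < x

df['b'] = df['a'].apply(largest_smaller)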

Comparing values from Pandas data frame column using offset values from another column

I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted and the InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1,len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was 10 or more in the previous row. Is it possible to do this without loops, using vectorized operations such as shift(), so that it runs much faster?
I don't see/know how to use internal vectorized Pandas/Numpy methods for shifting Series/Array using a non-scalar / vector step, but we can use Numba here:
from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i+step[i]-1]
    return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for an 8,000-row DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
    ...: df["10mMark"] = 0
    ...: for i in range(1,len(df)):
    ...:     r = int(df.InvInstance.iloc[i])
    ...:     rprev = int(df.InvInstance.iloc[i-1])
    ...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    ...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    ...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
    ...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
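As an aside, the dynamic shift can also be done with plain NumPy fancy indexing, since InvInstance is just an offset into the positional index (a sketch, assuming the default 0..n-1 index from the example):
pos = np.arange(len(df)) + df.InvInstance.to_numpy() - 1   # position of the last row of each block
m = pd.Series(df.Time.to_numpy()[pos] - df.Time.to_numpy(), index=df.index)
df['10mMark'] = np.where((m < 10) & (m.shift() >= 10), 1, 0)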
Actually, your m is the time delta between the time of a row and the time at the end of its 'block', and mprev is the same thing for the previous row (so it is effectively a shift of m). My idea is to create a column containing the time at the end of each block: first identify each block, then merge in the last time of each block using a groupby on block. Then calculate the difference to create a column 'm', and finally use np.where and shift to fill the 10mMark column.
# a column with an incremental value for each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so that every row of a block gets the same block number
df['block'] = df['block'].bfill()
# merge to create a column Time_last holding the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# create column m as the difference
df['m'] = df['Time_last'] - df['Time']
# now use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns
df = df.drop(['block', 'Time_last', 'm'], axis=1)
The final result before dropping, to show what has been created, looks like:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the 10mMark column has the expected result.
It is not as efficient as @MaxU's solution with Numba, but with an 8000-row df as he used, I get a speed-up factor of about 350.

How to count the number of time intervals that meet a boolean condition within a pandas dataframe?

I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.
Timestamp Boolean_condition
1 1
2 1
3 0
4 1
5 1
6 1
7 0
8 0
9 1
10 0
How to count the total number of time intervals within the whole series that meet this condition?
The desired output should look like this:
Timestamp Boolean_condition Event_number
1 1 1
2 1 1
3 0 NaN
4 1 2
5 1 2
6 1 2
7 0 NaN
8 0 NaN
9 1 3
10 0 NaN
You can create a Series as the cumsum of two masks and then set the NaNs with Series.mask:
mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
Name: Boolean_condition, dtype: int32
df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
Timestamp Boolean_condition Event_number
0 1 1 1.0
1 2 1 1.0
2 3 0 NaN
3 4 1 2.0
4 5 1 2.0
5 6 1 2.0
6 7 0 NaN
7 8 0 NaN
8 9 1 3.0
9 10 0 NaN
Timings:
# [100000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)

def jez1(df):
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)
In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop
In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop
In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
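For reference, another idiomatic way to get the same labels is to number the runs of consecutive True values with groupby().ngroup() (a sketch):
is_on = df.Boolean_condition.eq(1)
run_id = is_on.ne(is_on.shift()).cumsum()                  # one id per run of equal values
events = run_id[is_on]                                     # keep only the True runs
df['Event_number'] = events.groupby(events).ngroup() + 1   # rows outside a run become NaN on assignment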
You could try the following:
1) Get all the values where the condition is True (here, 1); this is isone.
2) Take its corresponding indices and convert them to a Series, so that the new Series has those indices as both its index and its values. Take the difference between successive rows and check whether it equals 1. This becomes our boolean mask.
3) Compare isone with the obtained boolean mask, and take the cumulative sum wherever they are not equal (an adjacency check between elements). This gives the group labels.
4) Using loc with the indices of isone, assign the codes obtained by converting the grp array to Categorical to a newly created column, Event_number.
isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
Faster approach:
Using only numpy:
1) Get the column's array representation.
2) Compute the indices of the non-zero (here, 1) values.
3) Insert NaN at the beginning of this array to act as a starting point when taking differences between successive rows.
4) Initialize a new array filled with NaNs, of the same shape as the original array.
5) Whenever the difference between successive rows is not equal to 1, take the cumulative sum; otherwise the elements fall in the same group. These values are written back at the indices where the 1's were.
6) Assign these back to the new column.
def nick(df):
    b = df.Boolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df
Timings:
For a DF of 10,000 rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(10000),
    Boolean_condition=np.random.choice(np.array([0,1]), 10000, p=[0.4, 0.6]))
)
df1.shape
# (10000, 2)

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return (df)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop
%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop
For a DF containing 1 million rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(1000000),
    Boolean_condition=np.random.choice(np.array([0,1]), 1000000, p=[0.4, 0.6]))
)
df1.shape
# (1000000, 2)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop
%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
This should work but might be a bit slow for a very long df.
df = pd.concat([df, pd.Series([0]*len(df), name='2')], axis=1)
if df.iloc[0,1] == 1:
    counter = 1
    df.iloc[0,2] = counter
else:
    counter = 0
    df.iloc[0,2] = 0
previous = df.iloc[0,1]
for y, x in df.iloc[1:,].iterrows():
    print(y)
    if x[1] == 1 and previous == 1:
        previous = x[1]
        df.iloc[y,2] = counter
    if x[1] == 0:
        previous = x[1]
        df.iloc[y,2] = 0
    if x[1] == 1 and previous == 0:
        counter += 1
        previous = x[1]
        df.iloc[y,2] = counter
A custom function does the trick. Here is a solution in Matlab code:
Boolean_condition = [1 1 0 1 1 1 0 0 1 0];
Event_number = nan(1,10);
loop_event_number = 1;
for timestamp = 1:10
    if Boolean_condition(timestamp) == 1
        Event_number(timestamp) = loop_event_number;
        last_event_number = loop_event_number;
    else
        loop_event_number = last_event_number + 1;
    end
end
% Event_number = 1 1 NaN 2 2 2 NaN NaN 3 NaN
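Translated to Python/pandas, the same loop might look like this (a sketch kept deliberately close to the Matlab version, using np.nan for the missing values; like the original it assumes the series starts with a 1):
import numpy as np

event_number = np.full(len(df), np.nan)
loop_event_number = 1
for i, flag in enumerate(df.Boolean_condition):
    if flag == 1:
        event_number[i] = loop_event_number
        last_event_number = loop_event_number
    else:
        loop_event_number = last_event_number + 1
df['Event_number'] = event_number
# Event_number -> 1 1 NaN 2 2 2 NaN NaN 3 NaN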

updating prior rows of dataframe conditionally upon current

I have a dataframe (ev), and I would like to iterate over it: whenever the value of the 'trig' column is 64, I need to update the value of the 'critical' column 4 rows above, changing it to 999. I tried the code below but it does not change anything, though it seems it should work.
for i in range(0,len(ev)):
    if ev['trig'][i] == 64:
        ev['critical'][i-4] == 999
Try this, you were close: note the difference between a single "=" (assignment) and a double "==" (comparison).
for i in range(0,len(ev)):
    if ev['trig'][i] == 64:
        ev['critical'][i-4] = 999
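Note that chained assignment like ev['critical'][i-4] = 999 can raise SettingWithCopyWarning and may silently stop working in newer pandas; a safer variant of the same loop uses .loc (a sketch, assuming the default integer index):
for i in range(len(ev)):
    if ev['trig'].iloc[i] == 64 and i >= 4:          # guard so i-4 does not wrap around
        ev.loc[ev.index[i - 4], 'critical'] = 999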
I think you can do this with mask, using fillna(False) because after shift you get NaN:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 5 4
5 7 6
6 8 8
7 9 9
8 0 64
9 7 6
10 6 7
11 4 8
12 3 6
13 5 64
mask = (ev.trig == 64).shift(-4).fillna(False)
print (mask)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 False
13 False
Name: trig, dtype: bool
ev['critical'] = ev.critical.mask(mask, 999)
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 999 4
5 7 6
6 8 8
7 9 9
8 0 64
9 999 6
10 6 7
11 4 8
12 3 6
13 5 64
EDIT:
Timings:
I think it is better to avoid iteration in pandas, because on a large dataframe it is very slow:
len(df)=1400:
In [66]: %timeit (jez(ev))
1000 loops, best of 3: 1.29 ms per loop
In [67]: %timeit (mer(ev1))
10 loops, best of 3: 49.9 ms per loop
len(df)=14k:
In [59]: %timeit (jez(ev))
100 loops, best of 3: 2.49 ms per loop
In [60]: %timeit (mer(ev1))
1 loop, best of 3: 501 ms per loop
len(df)=140k:
In [63]: %timeit (jez(ev))
100 loops, best of 3: 15.8 ms per loop
In [64]: %timeit (mer(ev1))
1 loop, best of 3: 6.32 s per loop
Code for timings:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
ev = pd.concat([ev]*100).reset_index(drop=True)
#ev = pd.concat([ev]*1000).reset_index(drop=True)
#ev = pd.concat([ev]*10000).reset_index(drop=True)
ev1 = ev.copy()
def jez(df):
    df['critical'] = df.critical.mask((df.trig == 64).shift(-4).fillna(False), 999)
    return df

def mer(df):
    for i in range(0,len(df)):
        if df['trig'][i] == 64:
            df['critical'][i-4] = 999
    return df
print (jez(ev))
print (mer(ev1))

Why am I getting strange behavior for creating several boolean series?

I have a DataFrame to which I am adding several boolean columns. For each column, I initialize it to False and then set some values to True. If I do this for one and then for another, the first gets reinitialized to all False. For example,
In [170]: df['racedif']=False
In [171]: df['racedif'][~ df.newpers]=df.ptdtrace[~ df.newpers]!=df.ptdtrace.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
In [172]: df.racedif.sum()
Out[172]: 28
In [173]: df.sexdif.sum()
Out[173]: 0
In [174]: df['sexdif']=False
In [175]: df['sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
In [176]: df.sexdif.sum()
Out[176]: 31
In [177]: df.racedif.sum()
Out[177]: 0
But if I first initialize them both to False before setting values, this does not happen.
In [203]: df['sexdif']=False
...: df['racedif']=False
...: df['sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
...: df['racedif'][~ df.newpers]=df.ptdtrace[~ df.newpers]!=df.ptdtrace.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
...:
In [204]: df.sexdif.sum()
Out[204]: 31
In [205]: df.racedif.sum()
Out[205]: 28
Why is this happening and is this a bug?
Added a simpler example that does not have the same problem. Why?
In [255]: df.x=False
In [256]: df.x[df.is456]=df['truth'][df.is456]
In [257]: df.x
Out[257]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: x, dtype: bool
In [258]: df.y=False
In [259]: df.y[df.is456]=df['truth'][df.is456]
In [260]: df.y
Out[260]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: y, dtype: bool
In [261]: df.x
Out[261]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: x, dtype: bool
Non-chained indexing
In [281]: df.loc[:,'sexdif']=False
In [282]: df.sexdif.sum()
Out[282]: 0
In [283]: df.loc[:,'sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
In [284]: df.sexdif.sum()
Out[284]: 31
In [285]: df.loc[:,'racedif']=False
In [286]: df.sexdif.sum()
Out[286]: 0
You are chain indexing, see the docs here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy
The bottom line is to use
df.loc[row_indexer, col_indexer] = value
to assign, and not
df[col_indexer][row_indexer] = value
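Applied to the example above, the sexdif assignment could be written with a single .loc call (a sketch using the OP's column names; groupby(...).shift() is equivalent to the apply(pd.Series.shift) used in the question):
mask = ~df.newpers
prev_sex = df.pesex.groupby(df.personid).shift()   # previous pesex within each personid
df['sexdif'] = False
df.loc[mask, 'sexdif'] = df.pesex[mask] != prev_sex[mask]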
