Get n rows before specific value in pandas - python

Say, i have the following dataframe:
import pandas as pd
dict = {'val':[3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
'label': [0, 2, 1, -1, 1, 2, -1, -1,1, 1, -1]}
df = pd.DataFrame(dict)
df
val label
0 3.2 0
1 2.4 2
2 -2.3 1
3 -4.9 -1
4 3.2 1
5 2.4 2
6 -2.3 -1
7 -4.9 -1
8 2.4 1
9 -2.3 1
10 -4.9 -1
I want to take each n (for example 2) rows before -1 value in column label. In the given df first -1 appears at index 3, we take 2 rows before it and drop index 3, then next -1 appears at index 6, we again keep 2 rows before and etc. The desired output is as following:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Thanks for any ideas!

You can get the index values and then get the previous two row index values:
idx = df[df.label == -1].index
filtered_idx = (idx-1).union(idx-2)
filtered_idx = filtered_idx[filtered_idx > 0]
df_new = df.iloc[filtered_idx]
output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Speed comparison with for a for loop solution:
# create large df:
import numpy as np
df = pd.DataFrame(np.random.random((20000000,2)), columns=["val","label"])
df.loc[df.sample(frac=0.01).index, "label"] = - 1
def vectorized_filter(df):
idx = df[df.label == -1].index
filtered_idx = (idx -1).union(idx-2)
df_new = df.iloc[filtered_idx]
return df_new
def loop_filter(df):
filter = df.loc[df['label'] == -1].index
req_idx = []
for idx in filter:
if idx == 0:
continue
elif idx == 1:
req_idx.append(idx-1)
else:
req_idx.append(idx-2)
req_idx.append(idx-1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
return df2
%timeit vectorized_filter(df)
%timeit loop_filter(df)
vectorized runs ~20x faster on my machine

Here's a solution:
new_df = pd.DataFrame()
markers = df[df.label.eq(-1)].index
for marker in markers:
new_df = new_df.append(df[marker-2:marker])
new_df.reset_index().drop_duplicates().set_index("index")
Result:
val label
index
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1

filter = df.loc[df['label'] == -1].index
req_idx = []
for idx in filter:
if idx == 0:
continue
elif idx == 1:
req_idx.append(idx-1)
else:
req_idx.append(idx-2)
req_idx.append(idx-1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
print(df2)
Output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
This should also work if you have the label as -1 in the first two rows

Related

Rolling and Mode function to get the majority of voting for rows in pandas Dataframe

I have a pandas Dataframe:
np.random.seed(0)
df = pd.DataFrame({'Close': np.random.uniform(0, 100, size=10)})
lbound, ubound = 0, 1
change = df["Close"].diff()
df["Change"] = change
df["Result"] = np.select([ np.isclose(change, 1) | np.isclose(change, 0) | np.isclose(change, -1),
# The other conditions
(change > 0) & (change > ubound),
(change < 0) & (change < lbound),
change.between(lbound, ubound)],[0, 1, -1, 0])
Close Change Result
0 54.881350 NaN 0
1 71.518937 16.637586 1
2 60.276338 -11.242599 -1
3 54.488318 -5.788019 -1
4 42.365480 -12.122838 -1
5 64.589411 22.223931 1
6 43.758721 -20.830690 -1
7 89.177300 45.418579 1
8 96.366276 7.188976 1
9 38.344152 58.022124 -1
Problem statement - Now, I want the majority of voting for index 1,2,3,4 assigned to index 0, index 2,3,4,5 assigned to index 1 of result columns, and so on for all the subsequent indexes.
I tried:
df['Voting'] = df['Result'].rolling(window = 4,min_periods=1).apply(lambda x: x.mode()[0]).shift()
But,this doesn't give the result I intend. It takes the first 4 rolling window and applies the mode function.
Close Change Result Voting
0 54.881350 NaN 0 NaN
1 71.518937 16.637586 1 0.0
2 60.276338 -11.242599 -1 0.0
3 54.488318 -5.788019 -1 -1.0
4 42.36548 -12.122838 -1 -1.0
5 64.589411 22.223931 1 -1.0
6 43.758721 -20.830690 -1 -1.0
7 89.177300 45.418579 1 -1.0
8 96.366276 7.188976 1 -1.0
9 38.344152 -58.022124 -1 1.0
Result I Intend - Rolling window of 4(index 1,2,3,4) should be set and mode function be applied and result
should be assigned to index 0,then next rolling window(index 2,3,4,5) and result should
be assigned to index 1 and so on..
You have to reverse your list before then shift of 1 (because you don't want the current index in the result):
majority = lambda x: 0 if len((m := x.mode())) > 1 else m[0]
df['Voting'] = (df[::-1].rolling(4, min_periods=1)['Result']
.apply(majority).shift())
print(df)
# Output
Close Change Result Voting
0 54.881350 NaN 0 -1.0
1 71.518937 16.637586 1 -1.0
2 60.276338 -11.242599 -1 -1.0
3 54.488318 -5.788019 -1 0.0
4 42.365480 -12.122838 -1 1.0
5 64.589411 22.223931 1 0.0
6 43.758721 -20.830690 -1 1.0
7 89.177300 45.418579 1 0.0
8 96.366276 7.188976 1 -1.0
9 38.344152 58.022124 -1 NaN

How can I fill empty DataFrame based on conditions?

I have following dataframe called condition:
[0] [1] [2] [3]
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
For easier reproduction:
import numpy as np
import pandas as pd
n=4
t=3
condition = pd.DataFrame([[0,0,1,0], [0,1,0,0], [0,0,0, 1], [0,0,0, 1]], columns=['0','1', '2', '3'])
condition.index=np.arange(1,n+1)
Further I have several dataframes that should be filled in a foor loop
df = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) ) #NaN DataFrame
df_2 = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) )
df_3 = pd.DataFrame(3,index = range(1,n+1),columns= range(t+1) )
for i,t in range(t,-1,-1):
if condition[t]==1:
df.loc[:,t] = df_3.loc[:,t]**2
df_2.loc[:,t]=0
elif (condition == 0 and no 1 in any column after t)
df.loc[:,t] = 2.5
....
else:
df.loc[:,t] = 5
df_2.loc[:,t]= df.loc[:,t+1]
I am aware that this for loop is not correct, but what I wanted to do, is to check elementwise condition (recursevly) and if it is 1 (in condition) to fill dataframe df with squared valued of df_3. If it is 0 in condition, I should differentiate two cases.
In the first case, there are no 1 after 0 (row 1 and 2 in condition) then df = 2.5
Second case, there was 1 after and fill df with 5 (row 3 and 4)
So the dataframe df should look something like this
[0] [1] [2] [3]
1 5 5 9 2.5
2 5 9 2.5 2.5
3 5 5 5 9
4 5 5 5 9
The code should include for loop.
Thanks!
I am not sure if this is what you want, but based on your desired output you can do this with only masking operations (which is more efficient than looping over the rows anyway). Your code could look like this:
is_one = condition.astype(bool)
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)
df = pd.DataFrame(5, index=condition.index, columns=condition.columns)
df_2 = pd.DataFrame(2.5, index=condition.index, columns=condition.columns)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)
df.where(~is_one, other=df_3 * df_3, inplace=True)
df.where(~is_after_one, other=df_2, inplace=True)
which yields:
0 1 2 3
1 5 5 9.0 2.5
2 5 9 2.5 2.5
3 5 5 5.0 9.0
4 5 5 5.0 9.0
EDIT after comment:
If you really want to loop explicitly over the rows and columns, you could do it like this with the same result:
n_rows = condition.index.size
n_cols = condition.columns.size
for row_index in range(n_rows):
for col_index in range(n_cols):
cond = condition.iloc[row_index, col_index]
if col_index < n_cols - 1:
rest_row = condition.iloc[row_index, col_index + 1:].to_list()
else:
rest_row = []
if cond == 1:
df.iloc[row_index, col_index] = df_3.iloc[row_index, col_index] ** 2
elif cond == 0 and 1 not in rest_row:
# fill whole row at once
df.iloc[row_index, col_index:] = 2.5
# stop iterating over the rest
break
else:
df.iloc[row_index, col_index] = 5
df_2.loc[:, col_index] = df.iloc[:, col_index + 1]
The result is the same, but this is much more inefficient and ugly, so I would not recommend it like this

Create a New Column Based on Some Values From Other Column in Pandas

Imagine that this Data Set:
A
1 2
2 4
3 3
4 5
5 5
6 5
I would like to create new column by this condition from A:
if A[i] < A[i-1] then B[i] = -1 else B[i] = 1
the result is:
A B
1 2 NaN
2 4 1
3 3 -1
4 5 1
5 7 1
6 6 -1
All codes and solutions that I have found just compare the rows in same location.
Use the diff function. then the sign function:
df.assign(B = np.sign(df.A.diff()))
Out[248]:
A B
0 2 NaN
1 4 1.0
2 3 -1.0
3 5 1.0
4 7 1.0
5 6 -1.0
df['B']=[1 if i!=0 and df['A'][i] < df['A'][i-1] else -1 for i,v in enumerate(df['A'])]
or
df['B']=[1 if i!=0 and df['A'][i] < df['A'][i-1] else -1 for i in range(len(df['A']))]
Edit (for three states like greater, less, and equal):
import numpy as np
df['B']=np.NAN*len(df.a)
for i in range(1,len(df['a'])):
if df['a'][i] < df['a'][i-1]: df['B'][i]=1
elif df['a'][i] == df['a'][i-1]: df['B'][i]=0
else: df['B'][i]=-1
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [2,4,3,5,7,6]})
df['B'] = np.where(df['A'] < df['A'].shift(1), -1, 1)
in order to keep the nan in the beginning:
df['B'] = np.where(df['A'].shift(1).isna(), np.nan, df['B'])

populate row with opposite value of the xth previous row, if condition is true

Following is the Dataframe I am starting from:
import pandas as pd
import numpy as np
d= {'PX_LAST':[1,2,3,3,3,1,2,1,1,1,3,3],'ma':[2,2,2,2,2,2,2,2,2,2,2,2],'action':[0,0,1,0,0,-1,0,1,0,0,-1,0]}
df_zinc = pd.DataFrame(data=d)
df_zinc
Now, I need to add a column called 'buy_sell', which:
when 'action'==1, populates with 1 if 'PX_LAST' >'ma', and with -1 if 'PX_LAST'<'ma'
when 'action'==-1, populates with the opposite of the previous non-zero value that was populated
FYI: in my data, the row that needs to be filled with the opposite of the previous non-zero item is always at the same distance from the previous non-zero item (i.e., 2 in the current example). This should facilitate making the code.
the code that I made so far is the following. It seems right to me. Do you have any fixes to propose?
while index < df_zinc.shape[0]:
if df_zinc['action'][index] == 1:
if df_zinc['PX_LAST'][index]<df_zinc['ma'][index]:
df_zinc.loc[index,'buy_sell'] = -1
else:
df_zinc.loc[index,'buy_sell'] = 1
elif df_zinc['action'][index] == -1:
df_zinc['buy_sell'][index] = df_zinc['buy_sell'][index-3]*-1
index=index+1
df_zinc
the resulting dataframe would look like this:
df_zinc['buy_sell'] = [0,0,1,0,0,-1,0,-1,0,0,1,0]
df_zinc
So, this would be my suggestion according to the example output (and assuming I understood the question properly:
def buy_sell(row):
if row['action'] == 0:
return 0
if row['PX_LAST'] > row['ma']:
return 1 * (-1 if row['action'] == 0 else 1)
else:
return -1 * (-1 if row['action'] == 0 else 1)
return 0
df_zinc = df_zinc.assign(buy_sell=df_zinc.apply(buy_sell, axis=1))
df_zinc
This should behave as expected by the rules. It does not take into account the possibility of 'PX_LAST' being equal to 'ma', returning 0 by default, as it was not clear what rule to follow in that scenario.
EDIT
Ok, after the new logic explained, I think this should do the trick:
def assign_buysell(df):
last_nonzero = None
def buy_sell(row):
nonlocal last_nonzero
if row['action'] == 0:
return 0
if row['action'] == 1:
if row['PX_LAST'] < row['ma']:
last_nonzero = -1
elif row['PX_LAST'] > row['ma']:
last_nonzero = 1
elif row['action'] == -1:
last_nonzero = last_nonzero * -1
return last_nonzero
return df.assign(buy_sell=df.apply(buy_sell, axis=1))
df_zinc = assign_buysell(df_zinc)
This solution is independent of how long ago the nonzero value was seen, it simply remembers the last nonzero value and pipes the opposite wen action is -1.
You can use np.select, and use np.nan as a label for the rows that satisfy the third condition:
c1 = df_zinc.action.eq(1) & df_zinc.PX_LAST.gt(df_zinc.ma)
c2 = df_zinc.action.eq(1) & df_zinc.PX_LAST.lt(df_zinc.ma)
c3 = df_zinc.action.eq(-1)
df_zinc['buy_sell'] = np.select([c1,c2, c3], [1, -1, np.nan])
Now in order to fill NaNs with the value from n rows above (in this case 3), you can fillna with a shifted version of the dataframe:
df_zinc['buy_sell'] = df_zinc.buy_sell.fillna(df_zinc.buy_sell.shift(3)*-1)
Output
PX_LAST ma action buy_sell
0 1 2 0 0.0
1 2 2 0 0.0
2 3 2 1 1.0
3 3 2 0 0.0
4 3 2 0 0.0
5 1 2 -1 -1.0
6 2 2 0 0.0
7 1 2 1 -1.0
8 1 2 0 0.0
9 1 2 0 0.0
10 3 2 -1 1.0
11 3 2 0 0.0
I would use np.select for this, since you have multiple conditions:
conditions = [
(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']),
(df_zinc['action'] == -1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
(df_zinc['action'] == -1) & (df_zinc['PX_LAST'] < df_zinc['ma'])
]
choices = [1, -1, 1, -1]
df_zinc['buy_sell'] = np.select(conditions, choices, default=0)
result
print(df_zinc)
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0
here my solution using the function shift() to trap the data of 3th up row:
df_zinc['buy_sell'] = 0
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']), 'buy_sell'] = -1
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']), 'buy_sell'] = 1
df_zinc.loc[df_zinc['action'] == -1, 'buy_sell'] = -df_zinc['buy_sell'].shift(3)
df_zinc['buy_sell'] = df_zinc['buy_sell'].astype(int)
print(df_zinc)
output:
PX_LAST ma action buy_sell
0 1 2 0 0
1 2 2 0 0
2 3 2 1 1
3 3 2 0 0
4 3 2 0 0
5 1 2 -1 -1
6 2 2 0 0
7 1 2 1 -1
8 1 2 0 0
9 1 2 0 0
10 3 2 -1 1
11 3 2 0 0

Pandas DataFrame use previous row value for complicated 'if' conditions to determine current value

I want to know if there is any faster way to do the following loop? Maybe use apply or rolling apply function to realize this
Basically, I need to access previous row's value to determine current cell value.
df.ix[0] = (np.abs(df.ix[0]) >= So) * np.sign(df.ix[0])
for i in range(1, len(df)):
for col in list(df.columns.values):
if ((df[col].ix[i] > 1.25) & (df[col].ix[i-1] == 0)) | :
df[col].ix[i] = 1
elif ((df[col].ix[i] < -1.25) & (df[col].ix[i-1] == 0)):
df[col].ix[i] = -1
elif ((df[col].ix[i] <= -0.75) & (df[col].ix[i-1] < 0)) | ((df[col].ix[i] >= 0.5) & (df[col].ix[i-1] > 0)):
df[col].ix[i] = df[col].ix[i-1]
else:
df[col].ix[i] = 0
As you can see, in the function, I am updating the dataframe, I need to access the most updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
you can use .shift() function for accessing previous or next values:
previous value for col column:
df['col'].shift()
next value for col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
prev_row = {}
def wrapper(curr_row, **kwargs):
val = func(curr_row, prev_row)
prev_row.update(curr_row)
prev_row[new_col] = val
return val
return wrapper
#apply_func_decorator
def running_total(curr_row, prev_row):
return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
#maxU has it right with shift, I think you can even compare dataframes directly, something like this:
df_prev = df.shift(-1)
df_out = pd.DataFrame(index=df.index,columns=df.columns)
df_out[(df>1.25) & (df_prev == 0)] = 1
df_out[(df<-1.25) & (df_prev == 0)] = 1
df_out[(df<-.75) & (df_prev <0)] = df_prev
df_out[(df>.5) & (df_prev >0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column, sending to a list and doing the updating, then just importing back again. Something like this:
df.ix[0] = (np.abs(df.ix[0]) >= 1.25) * np.sign(df.ix[0])
for col in df.columns.tolist():
currData = df[col].tolist()
for currRow in range(1,len(currData)):
if currData[currRow]> 1.25 and currData[currRow-1]== 0:
currData[currRow] = 1
elif currData[currRow] < -1.25 and currData[currRow-1]== 0:
currData[currRow] = -1
elif currData[currRow] <=-.75 and currData[currRow-1]< 0:
currData[currRow] = currData[currRow-1]
elif currData[currRow]>= .5 and currData[currRow-1]> 0:
currData[currRow] = currData[currRow-1]
else:
currData[currRow] = 0
df[col] = currData

Categories

Resources