Iterating a Pandas dataframe over 'n' next rows - python

I have this Pandas dataframe df:
station  a_d  direction
a        0    0
a        0    0
a        1    0
a        0    0
a        1    0
b        0    0
b        1    0
c        0    0
c        1    0
c        0    1
c        1    1
b        0    1
b        1    1
b        0    1
b        1    1
a        0    1
a        1    1
a        0    0
a        1    0
I'd like to assign a value_id that increments when the direction value changes, and that applies only from the last [0,1] a_d pair of rows for each station; earlier rows for that station stay blank. The last rows (in this example the last two) can be ignored. In other words:
station  a_d  direction  value_id
a        0    0
a        0    0
a        1    0
a        0    0          0
a        1    0          0
b        0    0          0
b        1    0          0
c        0    0          0
c        1    0          0
c        0    1          1
c        1    1          1
b        0    1
b        1    1
b        0    1          1
b        1    1          1
a        0    1          1
a        1    1          1
a        0    0
a        1    0
Using df.iterrows() I wrote this script:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif df.loc[i-1, 'direction'] != df.loc[i, 'direction']:
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station']) and (df.loc[i+2, 'direction'] == df.loc[i, 'direction']):
            break
        else:
            df.loc[i, 'value_id'] = value_id
It works, but it's very slow. With a 10*10^6-row dataframe I need a faster way. Any idea?
#user5402's code works well, but I noticed that adding a break after the last else also reduces computation time:
df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif df.loc[i-1, 'direction'] != df.loc[i, 'direction']:
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station']) and (df.loc[i+2, 'direction'] == df.loc[i, 'direction']):
            break
        else:
            df.loc[i, 'value_id'] = value_id
            break

You are not effectively using z in the inner for loop: you never access the i+z-th row, only the i-th, i+1-th, and i+2-th rows.
You can replace that inner for loop with:
if i+1 > len(df)-1:
    pass
elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
    pass
elif df.loc[i+2, 'station'] == df.loc[i, 'station'] and df.loc[i+2, 'direction'] == df.loc[i, 'direction']:
    pass
else:
    df.loc[i, 'value_id'] = value_id
Note that I also slightly optimized the second elif: at that point you already know that df.loc[i+1,'a_d'] does not equal df.loc[i,'a_d'].
Not having to loop over z will save a lot of time.
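For the incrementing counter itself, a vectorized sketch (my own suggestion, not part of the answer above) can replace the outer loop entirely. It only produces the id that increases on each direction change; it does not reproduce the blank rows around station transitions or skip the trailing rows:

```python
import pandas as pd

# the sample data from the question
df = pd.DataFrame({
    'station':   list('aaaaabbccccbbbbaaaa'),
    'a_d':       [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'direction': [0] * 9 + [1] * 8 + [0] * 2,
})

# a new id starts wherever `direction` differs from the previous row;
# the cumulative count of those change points numbers the runs 0, 1, 2, ...
df['value_id'] = df['direction'].ne(df['direction'].shift()).cumsum() - 1
```

The blank-cell logic from the question would still need a second masking step on top of this id.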

Related

Exclude future signals

I don't know if I'm solving this the right way: a 1 in Signal should produce one Buy (1) and then no further 1 Buys until a -1 Signal produces a -1 Sell. The same goes for -1 Sell. Anyone got any smart input?
df.loc[0, "Buy"] = 0
x = 0
y = 0  # initialize so the first -1 signal doesn't raise a NameError
for i in range(len(df)):
    if df.loc[i, "Signal"] == 1 and x == 0:
        df.loc[i, "Buy"] = 1
        x = 1
        y = 0
    elif df.loc[i, "Signal"] == -1 and y == 0:
        df.loc[i, "Buy"] = -1
        x = 0
        y = 1
    else:
        df.loc[i, "Buy"] = ""
Dataframe
Signal  Buy
 0
 1       1
 1
 1
 0
 0
-1      -1
 0
 0
 1       1
 1
 0
 0
-1      -1
 0
-1
 0
-1
 0
 0
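One vectorized way to get this alternating pattern (a sketch of my own, assuming Signal only takes the values -1, 0, and 1) is to drop the zeros and then keep only the sign changes:

```python
import pandas as pd

# the Signal column from the example above
df = pd.DataFrame({'Signal': [0, 1, 1, 1, 0, 0, -1, 0, 0, 1,
                              1, 0, 0, -1, 0, -1, 0, -1, 0, 0]})

nz = df['Signal'][df['Signal'] != 0]   # ignore rows with no signal
df['Buy'] = nz[nz != nz.shift()]       # keep only the first of each run of equal signals
```

Rows without a trade are left as NaN; assigning the filtered series back aligns on the index automatically.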

Vectorized function with counter on pandas dataframe column

Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is for all consecutive values below 5 to share the same id, and for all values above five to have 0 (or NA, or a negative value; it doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
counter = 1  # not initialized in the original snippet
for i in range(0, df.shape[0]):
    if (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] == 0):
        new_id = counter  # assign new id
        counter += 1
    elif (df.loc[df.index[i], 'condition'] == 1) & (df.loc[df.index[i-1], 'condition'] != 0):
        new_id = counter - 1  # assign current id
    elif df.loc[df.index[i], 'condition'] == 0:
        new_id = df.loc[df.index[i], 'condition']  # assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. Therefore I tried different kinds of vectorization but I so far failed to keep it from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if/else function, but that does not let me keep a counter across rows.
There are already plenty of similar posts about this, but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use cumsum(), as you did in your first try; just modify it a bit:
# calculate delta
df['delta'] = df['condition']-df['condition'].shift(1)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
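A slight variant of the same idea (my own sketch, not from the answer above) sidesteps the NaN that diff/shift leaves in the first row by counting the 0-to-1 transitions directly:

```python
import pandas as pd

d = {'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
     'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]}
df = pd.DataFrame(data=d)

# a cluster starts where condition flips from 0 to 1;
# cumulatively counting those starts numbers the clusters 1, 2, ...
starts = df['condition'].eq(1) & df['condition'].shift(fill_value=0).eq(0)
df['new_id'] = starts.cumsum() * df['condition']  # rows with condition 0 get 0
```

Using shift(fill_value=0) keeps everything integer, so the result needs no NaN handling.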
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0]  # first value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id

df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit: you can also use numba, which sped the function up considerably for me: from about 1 s to ~60 ms.
You have to pass NumPy arrays into the function, i.e. call it with df["condition"].values.
from numba import njit
import numpy as np

@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0  # first value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res

df["new_id"] = func(df["condition"].values)

Create a multiple if statement and a function to replace values in columns

I am working with a large dataset, but in order to simplify, I will use a simpler example (whose some rows have been deleted to clean the dataset) in order to illustrate the problem.
Imagine I have this dataset:
    Code_0  Code_1  Code_3  Code_4  Code_5  Code_6  Code_7
3   1       1       3       0       0       1       1
9   0       0       0       0       0       0       1
10  4       2       3       1       1       0       0
15  0       3       0       5       1       1       1
So, what I want to do in every row is: if Code_5 equals one and Code_0 is bigger than 0, add one to Code_0. Likewise, if Code_6 == 1 and Code_1 is bigger than 0, add one to Code_1. Finally, if Code_7 == 1 and Code_3 is bigger than 0, add one to Code_3. So I want to have this:
    Code_0  Code_1  Code_3  Code_4  Code_5  Code_6  Code_7
3   1       2       4       0       0       1       1
9   0       0       0       0       0       0       1
10  5       2       3       1       1       0       0
15  0       4       0       5       1       1       1
I wrote this code, but it is not working. In any case, I think there may be a better approach.
def add_one(x):
    if x == df_data['Code_5']:
        if x == 1 and df_data['Code_0'] > 0:
            df_data['Code_0'] = df_data['Code_0'] + 1
    if x == df_data['Code_6']:
        if x == 1 and df_data['Code_1'] > 0:
            df_data['Code_1'] = df_data['Code_1'] + 1
    if x == df_data['Code_7']:
        if x == 1 and df_data['Code_2'] > 0:
            df_data['Code_2'] = df_data['Code_2'] + 1

df_data['Code_0'] = df_data['Code_5'].apply(add_one)
df_data['Code_1'] = df_data['Code_6'].apply(add_one)
df_data['Code_2'] = df_data['Code_7'].apply(add_one)
Anyone can help me, please?
You can simplify by passing the complete row:
In [163]: def add_one(row):
     ...:     if row['Code_5'] == 1 and row['Code_0'] > 0:
     ...:         row['Code_0'] = row['Code_0'] + 1
     ...:     if row['Code_6'] == 1 and row['Code_1'] > 0:
     ...:         row['Code_1'] = row['Code_1'] + 1
     ...:     if row['Code_7'] == 1 and row['Code_2'] > 0:
     ...:         row['Code_2'] = row['Code_2'] + 1
     ...:     return row
     ...:

In [164]: add_one
Out[164]: <function __main__.add_one(row)>

In [165]: df = df.apply(lambda x: add_one(x), axis=1)
axis=1 applies the function row-wise (each row is passed across its columns).
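For a large frame, boolean masks avoid the Python-level apply loop entirely. A sketch of that approach (my own; note it uses Code_3 for the third rule, as the question's prose describes, rather than the Code_2 that appears in the posted code):

```python
import pandas as pd

# the sample data from the question
df = pd.DataFrame({'Code_0': [1, 0, 4, 0], 'Code_1': [1, 0, 2, 3],
                   'Code_3': [3, 0, 3, 0], 'Code_4': [0, 0, 1, 5],
                   'Code_5': [0, 0, 1, 1], 'Code_6': [1, 0, 0, 1],
                   'Code_7': [1, 1, 0, 1]}, index=[3, 9, 10, 15])

# each (flag column, target column) pair from the question's three rules
for flag, target in [('Code_5', 'Code_0'), ('Code_6', 'Code_1'), ('Code_7', 'Code_3')]:
    mask = (df[flag] == 1) & (df[target] > 0)
    df.loc[mask, target] += 1  # increment only where both conditions hold
```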

Looping over a pandas column and creating a new column if it meets conditions

I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random
import pandas as pd

p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times: if a value is 0, it changes to 1 with probability p, and the results become the new final column (I am simulating an on/off process per time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this: append columns named 1, 2, ...
# continue from the question code ...
# column names are 1, 2, ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check the final column
        if df.iloc[i, col-1:col][0] == "0":
            if random.random() < p:
                tmp.append("0")
            else:
                tmp.append("1")
        else:  # == "1"
            tmp.append("1")
    # append new column
    df[str(col)] = tmp
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
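The per-row inner loop can also be vectorized with NumPy, leaving only the outer loop over time steps. This is a sketch of my own, using integer 0/1 values rather than the strings in the answer above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility
p = 0.5
n_rows, n_steps = 5, 4

# initial column: 1 with probability p, else 0
df = pd.DataFrame({'start': (rng.random(n_rows) < p).astype(int)})

col = 'start'
for step in range(1, n_steps + 1):
    prev = df[col].to_numpy()
    # each 0 turns into 1 with probability p; a 1 always stays 1
    flip = rng.random(n_rows) < p
    df[str(step)] = np.where((prev == 0) & flip, 1, prev)
    col = str(step)
```

Because a 1 never reverts, each row's values are non-decreasing across the columns.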

Assign a new column to dataframe based on data from current row and previous row

I have 2 columns of data called level 1 event and level 2 event.
Both are columns of 1s and zeros.
lev_1 lev_2 lev_2_&_lev_1
0 1 0 0
1 0 0 0
2 1 0 0
3 1 1 1
4 1 0 0
col['lev_2_&_lev_1'] = 1 if lev_2 of the current row and lev_1 of the previous row are both 1.
I have achieved this by using a for loop.
i = 1
while i < a.shape[0]:
    # parentheses needed: & binds tighter than ==
    if (a['lev_1'].iloc[i - 1] == 1) & (a['lev_2'].iloc[i] == 1):
        a['lev_2_&_lev_1'].iloc[i] = 1
    i += 1
I wanted to know a computationally efficient way to do this because my original df is very big.
Thank you!
Use np.where and .shift():
df['lev_2_&_lev_1'] = np.where(df['lev_2'].eq(1) & df['lev_1'].shift().eq(1), 1, 0)
lev_1 lev_2 lev_2_&_lev_1
0 1 0 0
1 0 0 0
2 1 0 0
3 1 1 1
4 1 0 0
Explanation
df['lev_2'].eq(1): checks if current row is equal to 1
df['lev_1'].shift().eq(1): checks if previous row is equal to 1
np.where(condition, 1, 0): if condition is True return 1 else 0
You want:
(df['lev_2'] & df['lev_1'].shift(fill_value=0)).astype(int)
Using shift(fill_value=0) keeps the shifted column integer, so the bitwise & stays valid and the first row is handled.
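As a quick check, the np.where version from the first answer reproduces the sample data exactly:

```python
import numpy as np
import pandas as pd

# the sample data from the question
df = pd.DataFrame({'lev_1': [1, 0, 1, 1, 1], 'lev_2': [0, 0, 0, 1, 0]})

# 1 where the current lev_2 and the previous lev_1 are both 1, else 0
df['lev_2_&_lev_1'] = np.where(df['lev_2'].eq(1) & df['lev_1'].shift().eq(1), 1, 0)
```

The NaN that shift() leaves in the first row compares as not-equal to 1, so row 0 correctly gets 0.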
