pandas loop on elements - python

I would like to compute a time delta in a dataframe (with a condition), so I wrote a loop:
for i in range(1, len(df.index)):
    if df.type[i] == df.type[i-1]:
        df.delta[i] = df.time[i] - df.time[i-1]
    else:
        df.delta[i] = ''
but this seems far from optimal: it is very slow and raises a SettingWithCopyWarning (which I don't understand). What is the best way to do such a computation?

I would use .shift() for that: it returns the series with its values shifted down by one row.
So with no condition you would just need df["time"] - df["time"].shift(), but since you want to add a condition, .where() helps. Here is a one-line solution:
(df["time"] - df["time"].shift()).where(df["type"] == df["type"].shift(), "")
Or, as suggested in the other answer, you can use diff:
df["time"].diff().where(df["type"] == df["type"].shift(), "")

You should use a vectorised approach. For example, you can use numpy.where with pd.Series.shift and pd.Series.diff:
df['delta'] = np.where(df['type'] == df['type'].shift(), df['time'].diff(), np.nan)
Note I strongly recommend you do not use an empty string '' as your alternative value, as this will force your series to have object dtype instead of float.
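A quick illustration of the dtype point on a toy series:
import pandas as pd

s = pd.Series([1.5, 2.5])
print(s.where([True, False], '').dtype)  # object: floats mixed with strings
print(s.where([True, False]).dtype)      # float64: NaN keeps the numeric dtype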

My approach would be to use pandas.apply():
type_prev = ''
time_prev = 0

def lambda_func(row):
    global type_prev
    global time_prev
    if row['type'] == type_prev:  # compare with the previous type (not time)
        time_diff = row['time'] - time_prev
    else:
        time_diff = ''
    time_prev = row['time']
    type_prev = row['type']
    return time_diff

df['delta'] = df.apply(lambda_func, axis=1)  # axis=1 applies row-wise

Related

Optimization of this Python function

def split_trajectories(df):
    trajectories_list = []
    count = 0
    for record in range(len(df)):
        if record == 0:
            continue
        if df['time'].iloc[record] - df['time'].iloc[record - 1] > pd.Timedelta('0 days 00:00:30'):
            temp_df = df[count:record].reset_index()
            if not temp_df.empty:
                if len(temp_df) > 50:
                    trajectories_list.append(temp_df)
            count = record
    return trajectories_list
This is a Python function that receives a pandas dataframe and divides it into a list of dataframes wherever the time delta is greater than 30 seconds, keeping only pieces that contain more than 50 records. In my case I need to execute this function thousands of times and I wonder if anyone can help me optimize it. Thanks in advance!
I have tried to optimize it as far as I can.
You're doing a few things right here, like iterating using a range instead of .iterrows and using .iloc.
Simple things you can do:
Switch .iloc to .iat since you only need 'time' anyway
Don't recalculate Timedelta every time, just save it as a variable
The big thing is that you don't need to actually save each temp_df, or even create it. You can save a tuple (count, record) and retrieve from df afterwards, as needed. (Although, I have to admit I don't quite understand what you're accomplishing with the count:record logic; should one of your .iloc's involve count instead?)
def split_trajectories(df):
    trajectories_list = []
    td = pd.Timedelta('0 days 00:00:30')
    count = 0
    for record in range(1, len(df)):
        if df['time'].iat[record] - df['time'].iat[record - 1] > td:
            if record - count > 50:
                new_tuple = (count, record)
                trajectories_list.append(new_tuple)
            count = record
    return trajectories_list
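A hypothetical usage sketch for materialising the actual sub-frames later from those (count, record) tuples (variable names invented):
bounds = split_trajectories(df)
trajectories = [df.iloc[start:stop].reset_index() for start, stop in bounds]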
Maybe you can try the one below. You could keep count if you want to track the number of df's, or else the length of trajectories_list will give you that info.
def split_trajectories(df):
    trajectories_list = []
    df_difference = df['time'].diff()
    if not df_difference.empty and df_difference.shape[0] > 50:
        trajectories_list.append(df_difference)
    return trajectories_list
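If you want to go fully vectorised, here is a rough sketch along those lines, assuming 'time' holds datetimes (note this variant also keeps the trailing segment, which the original loop drops):
import numpy as np
import pandas as pd

def split_trajectories(df, gap=pd.Timedelta(seconds=30), min_len=50):
    # positions where the gap to the previous row exceeds the threshold
    breaks = np.flatnonzero(df['time'].diff() > gap)
    starts = np.r_[0, breaks]
    stops = np.r_[breaks, len(df)]
    return [df.iloc[a:b] for a, b in zip(starts, stops) if b - a > min_len]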

How can I use .fillna with specific values?

df["load_weight"] = df.loc[(df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")].fillna(1000, inplace=True)
I want to change the NaN values in the "load_weight" column, but only for the rows that contain "HORNSBY BEND" and "BRUSH". The code above set the whole "load_weight" column to None instead. What did I do wrong?
I would use a mask for boolean indexing:
m = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, 'load_weight'].fillna(1000)
NB. you can't keep inplace=True when you assign the output. This is what was causing your data to be replaced with None as methods called with inplace=True return nothing.
Alternative with only boolean indexing:
m1 = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
m2 = df['load_weight'].isna()
df.loc[m1&m2, "load_weight"] = 1000
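A toy example of the mask approach (data invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dropoff_site": ["HORNSBY BEND", "HORNSBY BEND", "OTHER"],
    "load_type": ["BRUSH", "BRUSH", "BRUSH"],
    "load_weight": [np.nan, 250.0, np.nan],
})

m = (df["dropoff_site"] == "HORNSBY BEND") & (df["load_type"] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, "load_weight"].fillna(1000)
print(df)  # only the first row's NaN becomes 1000; the 'OTHER' row stays NaN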
Instead of fillna, you can directly use df.loc to do the required imputation:
df.loc[(df['dropoff_site'] == 'HORNSBY BEND') & (df['load_type'] == 'BRUSH')
       & (df['load_weight'].isnull()), 'load_weight'] = 1000

Dropping rows at specific minutes

I am trying to drop rows at specific minutes (05, 10, 20).
I have datetime as an index:
df5['Year'] = df5.index.year
df5['Month'] = df5.index.month
df5['Day']= df5.index.day
df5['Day_of_Week']= df5.index.day_name()
df5['hour']= df5.index.strftime('%H')
df5['Min']= df5.index.strftime('%M')
df5
Then I run the code below:
def clean(df5):
    for i in range(len(df5)):
        hour = pd.Timestamp(df5.index[i]).hour
        minute = pd.Timestamp(df5.index[i]).minute
        if df5 = df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
            df.drop(axis=1, index=i, inplace=True)
but it returns an invalid syntax error.
Looping is not necessary here, and is not recommended.
Use DatetimeIndex.minute with Index.isin and an inverted mask ~ for filtering with boolean indexing:
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
To reuse the column df5['Min'], use string values:
df5 = df5[~df5['Min'].isin(['05', '10', '20'])]
All together:
def clean(df5):
    return df5[~df5.index.minute.isin([5, 10, 20])]
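A quick self-contained check of clean (synthetic index, invented data):
import pandas as pd

idx = pd.date_range("2023-01-01", periods=6, freq="5min")  # minutes 0, 5, 10, 15, 20, 25
df5 = pd.DataFrame({"val": range(6)}, index=idx)

print(clean(df5).index.minute.tolist())  # [0, 15, 25]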
You can just do it using boolean indexing, assuming that the index is already parsed as datetime.
df5 = df5[~((df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20))]
Or the inverted form of the same answer (after negating each condition, they must be combined with & rather than |):
df5 = df5[(df5.index.minute != 5) & (df5.index.minute != 10) & (df5.index.minute != 20)]
Generally speaking, the right syntax to combine a logical OR inside an if statement is the following:
today = 'Saturday'
if today == 'Sunday' or today == 'Saturday':
    print('Today is off. Rest at home')
In your case, you should probably use something like this:
if df5 == df5[(df5.index.minute == 5)] or df5 == df5[(df5.index.minute == 10)]:
......
FINAL NOTE:
You made some mistakes using == and =
In Python (and many other programming languages), a single equal sign = is used to assign a value to a variable, whereas a double equal sign == is used to check whether two expressions give the same value.
= is an assignment operator
== is an equality operator

How to add a new column to a pandas dataframe while iterating over the rows?

I want to generate a new column using some columns that already exist, but I think it is too difficult to do with an apply function. Can I generate a new column (ftp_price here) while iterating through this dataframe? Here is my code. When I call product_df['ftp_price'], I get a KeyError.
for index, row in product_df.iterrows():
    current_curve_type_df = curve_df[curve_df['curve_surrogate_key'] == row['curve_surrogate_key_x']]
    min_tmp_df = row['start_date'] - current_curve_type_df['datab_map'].apply(parse)
    min_tmp_df = min_tmp_df[min_tmp_df > timedelta(days=0)]
    curve = current_curve_type_df.loc[min_tmp_df.idxmin()]
    tmp_diff = row['end_time'] - np.array(row['start_time'])
    if np.isin(0, tmp_diff):
        idx = np.where(tmp_diff == 0)
        col_name = COL_NAMES[idx[0][0]]
        row['ftp_price'] = curve[col_name]
    else:
        idx = np.argmin(tmp_diff > 0)
        p_plus_one_rate = curve[COL_NAMES[idx]]
        p_minus_one_rate = curve[COL_NAMES[idx - 1]]
        d_plus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx]]
        d_minus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx - 1]]
        row['ftp_price'] = p_minus_one_rate + (p_plus_one_rate - p_minus_one_rate) * (row['start_date'] - d_minus_one_days) / (d_plus_one_days - d_minus_one_days)
An alternative for setting a new value at a particular index is using at:
for index, row in product_df.iterrows():
    product_df.at[index, 'ftp_price'] = val
Also, you should read why using iterrows should be avoided
A row can be a view or a copy (and is often a copy), so changing it would not change the original dataframe. The correct way is to always change the original dataframe using loc or iloc:
product_df.loc[index, 'ftp_price'] = ...
That being said, you should try to avoid explicitly iterating over the rows of a dataframe when possible...
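A minimal illustration of the view/copy point (hypothetical frame):
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
for index, row in df.iterrows():
    row["b"] = row["a"] * 2  # mutates a copy; df is unchanged
print("b" in df.columns)  # False

for index, row in df.iterrows():
    df.loc[index, "b"] = row["a"] * 2  # writes to the original frame
print(df)  # column 'b' now exists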

pandas: setting last N rows of multi-index to NaN for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift and
thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, so I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)  # the temporary column is named 'tmpShift', not 'tmp'
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
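As for the original question (blanking the last N rows of each group positionally, without apply), one possible sketch, assuming contiguous groups as in this setup and run before the drop above:
import numpy as np

N = 1  # how many trailing rows per group to blank
ends = df.groupby(level=0).size().cumsum().values  # one past each group's last row
tail_pos = np.concatenate([np.arange(e - N, e) for e in ends])
df.iloc[tail_pos, df.columns.get_loc('tmpShift')] = np.nan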
