How to lag columns in batch in a dataframe - Python

I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the names of the columns I need to lag. Is there a way to lag them in batch, or in just a few lines?
Say I have a dataframe like the one below:
col1  col2  col3  col4  col5  col6  ...  col100
   1     2     3     4     5     6  ...       8
   3     9    15    19    21    23  ...      31
The only way I know is to do it one by one, i.e. run df['col1_lag'] = df['col1'].shift(1) for each column.
That seems like too much for so many columns. Is there a better way to do this? Thanks in advance.

Use shift with add_suffix to build a new DataFrame, then join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
   col1  col2  col3  col4  col5  col6  col100  col1_lag  col2_lag  col3_lag  \
0     1     2     3     4     5     6       8       NaN       NaN       NaN
1     3     9    15    19    21    23      31       1.0       2.0       3.0

   col4_lag  col5_lag  col6_lag  col100_lag
0       NaN       NaN       NaN         NaN
1       4.0       5.0       6.0         8.0
To lag only some columns, select them with a list first:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
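
The same idea can be wrapped in a small helper when the lag length or the suffix needs to vary. This is only a sketch: the function name add_lags and its parameters are hypothetical, and the three-column frame is toy data standing in for the real one.

```python
import pandas as pd

# Toy frame standing in for the 100-column one in the question.
df = pd.DataFrame({
    "col1": [1, 3, 5],
    "col2": [2, 9, 7],
    "col3": [3, 15, 11],
})

def add_lags(frame, cols, periods=1, suffix="_lag"):
    """Return a new frame with shifted copies of `cols` appended (hypothetical helper)."""
    lagged = frame[cols].shift(periods).add_suffix(suffix)
    return frame.join(lagged)

# Lag only the listed columns by one row.
out = add_lags(df, ["col1", "col3"])
print(out)
```

Because the shifted columns are built in one call and joined once, the number of columns to lag does not change the amount of code.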

Related

Concatenating two dataframes on common index, is there more efficient way to do this?

I have two dataframes, first one is:
   col1  col2  col3
1    14     2     6
2    12     3     3
3     9     4     2
Second one is:
   col4  col5  col6
2    14     2     6
3    12     3     3
I want to combine them so that the result keeps the index values of the second one and the row values of the first one.
The result should look like this:
col1 col2 col3
2 12 3 3
3 9 4 2
My solution was pd.concat([df2, df1], axis=1).drop(df2, axis=1), but I believe there is a more efficient way to do this.
You can use the index of df2 with loc on df1:
df1.loc[df2.index]
Output:
col1 col2 col3
2 12 3 3
3 9 4 2
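
One caveat: in recent pandas versions, loc raises a KeyError if any label in df2.index is missing from df1. A minimal sketch of a more defensive variant, assuming that only the labels common to both frames are wanted, uses Index.intersection first. The frames below are rebuilt from the question's data.

```python
import pandas as pd

# Frames rebuilt from the question's example.
df1 = pd.DataFrame(
    {"col1": [14, 12, 9], "col2": [2, 3, 4], "col3": [6, 3, 2]},
    index=[1, 2, 3],
)
df2 = pd.DataFrame(
    {"col4": [14, 12], "col5": [2, 3], "col6": [6, 3]},
    index=[2, 3],
)

# Keep only labels present in both indexes, then select those rows of df1.
common = df1.index.intersection(df2.index)
result = df1.loc[common]
print(result)
```

When every label of df2 is known to exist in df1, the plain df1.loc[df2.index] is enough.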

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is shown below:
col1 col2
0 A 9
1 B 8
Your solution works well for me, provided the NaNs are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
import numpy as np
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
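
To see the difference between the two cases, here is a small self-contained sketch. The six-row frame is made-up data in the shape of the question, and it swaps in pd.to_numeric(..., errors="coerce") for the replace call as one possible way to turn the string 'NaN' into a real missing value.

```python
import pandas as pd

# Made-up data: the 'NaN' entries are strings, not missing values.
df = pd.DataFrame({
    "col1": ["A", "B", "A", "B", "A", "B"],
    "col2": [3, 4, "NaN", 8, 9, "NaN"],
})

# Coerce anything non-numeric (including the string 'NaN') to a real NaN.
df["col2"] = pd.to_numeric(df["col2"], errors="coerce")

# 'last' skips missing values, so each group returns its last non-NaN entry.
out = df.groupby(["col1"], sort=False).agg({"col2": "last"}).reset_index()
print(out)
```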

Split the data frame based on consecutive row values differences

I have a data frame like this,
df
col1 col2 col3
1 2 3
2 5 6
7 8 9
10 11 12
11 12 13
13 14 15
14 15 16
Now I want to split this into multiple data frames wherever the difference in col1 between two consecutive rows is more than 1.
So the resulting data frames will look like:
df1
col1 col2 col3
1 2 3
2 5 6
df2
col1 col2 col3
7 8 9
df3
col1 col2 col3
10 11 12
11 12 13
df4
col1 col2 col3
13 14 15
14 15 16
I can do this with a for loop that stores the indices, but that will increase execution time. I am looking for a pandas shortcut or a Pythonic way to do this efficiently.
You could define a custom grouper by taking the diff, checking where it is greater than 1, and taking the cumsum of the resulting boolean series. Then group by the result and build a dictionary from the groupby object:
d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))
print(d[0])
col1 col2 col3
0 1 2 3
1 2 5 6
print(d[1])
col1 col2 col3
2 7 8 9
A more detailed break-down:
df.assign(difference=(diff:=df.col1.diff()),
condition=(gt1:=diff.gt(1)),
grouper=gt1.cumsum())
col1 col2 col3 difference condition grouper
0 1 2 3 NaN False 0
1 2 5 6 1.0 False 0
2 7 8 9 5.0 True 1
3 10 11 12 3.0 True 2
4 11 12 13 1.0 False 2
5 13 14 15 2.0 True 3
6 14 15 16 1.0 False 3
You can also pull out the target column and work with it as a series instead of grouping the whole frame as in the answer above, which keeps everything smaller. It runs faster on the example, but I don't know how the two approaches scale, which depends on how many splits you end up making.
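import numpy as np
row_bool = df['col1'].diff() > 1
split_inds, = np.where(row_bool)
# add the start (0) and the end (len(df)) as outer boundaries
split_inds = np.insert(arr=split_inds, obj=[0, len(split_inds)], values=[0, len(df)])
df_list = []  # tuples have no append, so collect in a list first
for n in range(len(split_inds) - 1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1], :]
    df_list.append(tempdf)
df_tup = tuple(df_list)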
(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)
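
If a plain list of frames is preferred over a dictionary or a tuple, the cumsum grouper can also feed a list comprehension. A minimal sketch on the question's data follows; the names group_id and frames are only illustrative.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "col1": [1, 2, 7, 10, 11, 13, 14],
    "col2": [2, 5, 8, 11, 12, 14, 15],
    "col3": [3, 6, 9, 12, 13, 15, 16],
})

# A new group starts wherever col1 jumps by more than 1.
group_id = df["col1"].diff().gt(1).cumsum()

# One sub-frame per group, in order of appearance.
frames = [part for _, part in df.groupby(group_id)]

for part in frames:
    print(part, end="\n\n")
```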

Join two data frames by matching two columns of one against a single column of the other, based on some conditions?

I have a dataframe like this:
df1
col1 col2 col3 col4
1 2 A S
3 4 A P
5 6 B R
7 8 B B
I have another data frame:
df2
col5 col6 col3
9 10 A
11 12 R
I want to join these two data frames: a row of df1 should be joined if its col3 or col4 value matches a col3 value in df2.
The final data frame will look like:
df3
col1 col2 col3 col5 col6
1 2 A 9 10
3 4 A 9 10
5 6 R 11 12
If the col3 value is present in df2, the row is joined on col3; otherwise it is joined on col4, provided that value is present in col3 of df2.
How can I do this in the most efficient way using pandas/Python?
Use two merges with the default inner join. For the second one, filter out the df2 rows already matched in df3, then concat the results together:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
df = pd.concat([df3, df4],ignore_index=True)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9 10
1 3 4 A 9 10
2 5 6 R 11 12
EDIT: Use a left join and finish with combine_first:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3', how='left')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2, on='col3', how='left'))
df = df3.combine_first(df4)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9.0 10.0
1 3 4 A 9.0 10.0
2 5 6 B 11.0 12.0
3 7 8 B NaN NaN
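
Another way to express the fallback rule, sketched here under the assumption that each row should use col3 when that value exists in df2 and col4 otherwise, is to build a single join key with Series.where and merge once. The name join_key is illustrative, and this is a different technique from the two-merge answers above.

```python
import pandas as pd

# Frames rebuilt from the question.
df1 = pd.DataFrame({
    "col1": [1, 3, 5, 7],
    "col2": [2, 4, 6, 8],
    "col3": ["A", "A", "B", "B"],
    "col4": ["S", "P", "R", "B"],
})
df2 = pd.DataFrame({"col5": [9, 11], "col6": [10, 12], "col3": ["A", "R"]})

# Keep col3 where it appears in df2, otherwise fall back to col4.
join_key = df1["col3"].where(df1["col3"].isin(df2["col3"]), df1["col4"])

# Replace the two candidate columns with the single key and inner-merge once.
result = (
    df1.drop(columns=["col3", "col4"])
       .assign(col3=join_key)
       .merge(df2, on="col3")
)
print(result)
```

Passing how='left' to merge would keep the unmatched row as well, similar to the EDIT above.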

Compare each of the column values and return final value based on conditions

I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply a condition to the column values and return the final result in a new column.
The condition is to assign a value based on this order of priority, where 2 has the highest priority: [2,1,3,0,4]
I tried to define a function to append the final results but wasn't really getting anywhere. Any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First, you may want to get rid of the NaNs:
df = df.fillna(5)
Then apply a function to every row to find your value:
def func(x, l=[2,1,3,0,4,5]):
    # return the first value in priority order that occurs in the row
    for j in l:
        if j in x:
            return j
df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
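Maybe a little late, but here is another option:
import numpy as np
def f(x):
    # return the first priority value found in the row, or NaN if none match
    for i in [2,1,3,0,4]:
        if i in x.tolist():
            return i
    return np.nan
df["col4"] = df.apply(f, axis=1)
And the output: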
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
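
Both answers loop over rows with apply. A vectorized sketch of the same priority rule, assuming the priority list from the question, tests each priority value against the whole frame and lets np.select pick the first match. The names value_cols and conditions are illustrative.

```python
import numpy as np
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 2, 0],
    "col2": [2, 3, 4, np.nan, 2],
    "col3": [3, np.nan, np.nan, np.nan, np.nan],
})

priority = [2, 1, 3, 0, 4]
value_cols = ["col1", "col2", "col3"]

# One boolean array per priority value: does the row contain it?
conditions = [df[value_cols].eq(p).any(axis=1).to_numpy() for p in priority]

# np.select takes the first condition that holds, so list order is the priority.
df["col4"] = np.select(conditions, priority, default=np.nan)
print(df)
```

Rows that contain none of the priority values get NaN, which matches the fallback in the second answer.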
