Save previous entry per group / id and date in a column - python

I have a dataframe in Python with the following sorted format:
df
Name Date Value
A 01.01.20 10
A 02.01.20 20
A 03.01.20 15
B 01.01.20 5
B 02.01.20 10
B 03.01.20 5
C 01.01.20 3
C 03.01.20 6
Not every Name has every date filled. How can I create a new column with the previous date's value (if it is missing, just take the current value), so that it leads to:
Name Date Value Previos
A 01.01.20 10 10
A 02.01.20 20 10
A 03.01.20 15 20
B 01.01.20 5 5
B 02.01.20 10 5
B 03.01.20 5 10
C 01.01.20 3 3
C 03.01.20 6 6
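For reference, the sample frame can be rebuilt like this (a minimal sketch of the data shown above, with the dates kept as strings):
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Name':  ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'Date':  ['01.01.20', '02.01.20', '03.01.20',
                             '01.01.20', '02.01.20', '03.01.20',
                             '01.01.20', '03.01.20'],
                   'Value': [10, 20, 15, 5, 10, 5, 3, 6]})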

Use DataFrameGroupBy.shift with Series.fillna:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
df['Previos'] = df.groupby('Name')['Value'].shift().fillna(df['Value'])
print (df)
Name Date Value Previos
0 A 2020-01-01 10 10.0
1 A 2020-01-02 20 10.0
2 A 2020-01-03 15 20.0
3 B 2020-01-01 5 5.0
4 B 2020-01-02 10 5.0
5 B 2020-01-03 5 10.0
6 C 2020-01-01 3 3.0
7 C 2020-01-03 6 3.0
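If the float display is unwanted (shift introduces NaN, which forces a float dtype), the column can be cast back afterwards, e.g.:
# safe here because fillna has already removed the NaN values
df['Previos'] = df['Previos'].astype(int)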
But if you need to shift by one calendar day, so that the last group keeps its current value as in the desired output, the solution is different: the values are shifted with a one-day frequency per Name on a DatetimeIndex, and DataFrame.join adds the result back as the new column:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
# shift each Name's values forward by one calendar day, keyed by (Name, Date)
s = (df.set_index('Date').groupby('Name', group_keys=True)['Value']
       .apply(lambda x: x.shift(freq='D')).rename('Previous'))
df = df.join(s, on=['Name','Date'])
df['Previous'] = df['Previous'].fillna(df['Value'])
df = df.set_index('Date')
print (df)
Name Value Previous
Date
2020-01-01 A 10 10.0
2020-01-02 A 20 10.0
2020-01-03 A 15 20.0
2020-01-01 B 5 5.0
2020-01-02 B 10 5.0
2020-01-03 B 5 10.0
2020-01-01 C 3 3.0
2020-01-03 C 6 6.0
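For comparison, the same one-day rule can also be sketched with a merge instead of a join (an alternative sketch, assuming Date is still an ordinary datetime column rather than the index):
# each row's Value becomes the 'previous' value of the following calendar day
prev = (df.assign(Date=df['Date'] + pd.Timedelta(days=1))
          .rename(columns={'Value': 'Previous'})[['Name', 'Date', 'Previous']])
out = df.merge(prev, on=['Name', 'Date'], how='left')
out['Previous'] = out['Previous'].fillna(out['Value'])  # no previous day -> keep current value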

Related

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below, based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes; the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value for the first row whenever the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
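For reference, the frame above can be rebuilt like this (a minimal sketch; B and C are assumed to be plain time strings):
import pandas as pd
import numpy as np

# sample data from the question
df = pd.DataFrame({
    'A': [1]*6 + [2]*9,
    'B': ['5:43:00', '6:19:00', '6:53:00', '7:29:00', '8:03:00', '8:43:00',
          '6:07:00', '6:42:00', '7:15:00', '7:53:00', '8:30:00', '9:07:00',
          '9:41:00', '10:17:00', '10:52:00'],
    'C': ['5:24:00', '5:47:00', '6:23:00', '6:55:00', '7:31:00', '8:05:00',
          '5:40:00', '6:11:00', '6:45:00', '7:17:00', '7:55:00', '8:32:00',
          '9:09:00', '9:46:00', '10:20:00'],
})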
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
.dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas. With
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
the following becomes possible:
df['D'] = (df.groupby('A')
             .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
             .reset_index(drop=True))
You can always drop these new columns later.
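For example, once D is computed:
# drop the helper datetime columns again
df = df.drop(columns=['B_dt', 'C_dt'])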

Rolling sum of groups by period

I have got this dataframe:
lst=[['01012021','A',10],['01012021','B',20],['02012021','A',12],['02012021','B',23]]
df2=pd.DataFrame(lst,columns=['Date','FN','AuM'])
I would like to get the rolling sum by date and FN. The desired result looks like this:
lst=[['01012021','A',10,''],['01012021','B',20,''],['02012021','A',12,22],['02012021','B',23,43]]
df2=pd.DataFrame(lst,columns=['Date','FN','AuM','Roll2PeriodSum'])
Would you please help me?
Thank you
Solution if the datetimes are consecutive (the Date column is not used for counting within groups):
df2['Roll2PeriodSum'] = (df2.groupby('FN').AuM
                            .rolling(2)
                            .sum()
                            .reset_index(level=0, drop=True))
print (df2)
Date FN AuM Roll2PeriodSum
0 01012021 A 10 NaN
1 01012021 B 20 NaN
2 02012021 A 12 22.0
3 02012021 B 23 43.0
Solution with datetimes (the Date column is used for the 2-day window):
df2['Date'] = pd.to_datetime(df2['Date'], format='%d%m%Y')
df = (df2.join(df2.set_index('Date')
                  .groupby('FN').AuM
                  .rolling('2D')
                  .sum()
                  .rename('Roll2PeriodSum'), on=['FN','Date']))
print (df)
Date FN AuM Roll2PeriodSum
0 2021-01-01 A 10 10.0
1 2021-01-01 B 20 20.0
2 2021-01-02 A 12 22.0
3 2021-01-02 B 23 43.0
If the first value per group should stay missing (matching the first solution), add min_periods=2:
df = (df2.join(df2.set_index('Date')
                  .groupby('FN').AuM
                  .rolling('2D', min_periods=2)
                  .sum()
                  .rename('Roll2PeriodSum'), on=['FN','Date']))
print (df)
Date FN AuM Roll2PeriodSum
0 2021-01-01 A 10 NaN
1 2021-01-01 B 20 NaN
2 2021-01-02 A 12 22.0
3 2021-01-02 B 23 43.0
Use groupby.rolling.sum:
df2['Roll2PeriodSum'] = (
    df2.assign(Date=pd.to_datetime(df2['Date'], format='%d%m%Y'))
       .groupby('FN').rolling(2)['AuM'].sum().droplevel(0)
)
print(df2)
# Output
Date FN AuM Roll2PeriodSum
0 01012021 A 10 NaN
1 01012021 B 20 NaN
2 02012021 A 12 22.0
3 02012021 B 23 43.0
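If the literal empty strings from the desired output are really wanted (at the cost of turning the column into object dtype), the NaN values can be replaced afterwards, e.g.:
# optional: mimic the '' placeholders from the question; this makes the column object dtype
df2['Roll2PeriodSum'] = df2['Roll2PeriodSum'].fillna('')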

How to add rows based on a condition with another dataframe

I have two dataframes as follows:
agreement
agreement_id activation term_months total_fee
0 A 2020-12-01 24 4800
1 B 2021-01-02 6 300
2 C 2021-01-21 6 600
3 D 2021-03-04 6 300
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
I want to add another row to the payments dataframe when the total payments for an agreement_id in payments are equal to the total_fee for that agreement_id in agreement. The new row contains a zero value for payment, and its date is calculated as min(date) (from payments) plus term_months (from agreement).
Here's the result I want for the payments dataframe:
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
11 2 C 2021-07-21 0
12 3 D 2021-09-04 0
The additional rows are rows 11 and 12. For agreement_ids 'C' and 'D', the total payments are equal to the total_fee shown in the agreement dataframe.
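For reference, the two frames can be rebuilt like this (a minimal sketch of the data shown above, with dates kept as ISO strings):
import pandas as pd

agreement = pd.DataFrame({
    'agreement_id': ['A', 'B', 'C', 'D'],
    'activation': ['2020-12-01', '2021-01-02', '2021-01-21', '2021-03-04'],
    'term_months': [24, 6, 6, 6],
    'total_fee': [4800, 300, 600, 300],
})
payments = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3],
    'agreement_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'],
    'date': ['2020-12-01', '2021-02-02', '2021-02-03', '2021-05-01',
             '2021-01-02', '2021-01-09', '2021-03-01', '2021-04-23',
             '2021-01-21', '2021-03-04', '2021-05-03'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 600, 150, 150],
})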
import pandas as pd
import numpy as np
First, convert the 'date' column of the payments dataframe to datetime dtype using the to_datetime() method:
payments['date']=pd.to_datetime(payments['date'])
Next, aggregate the payments per agreement_id using the groupby() method:
newdf=payments.groupby('agreement_id').agg({'payment':'sum','date':'min','cust_id':'first'}).reset_index()
Now use a boolean mask to keep only the rows that meet your condition:
newdf=newdf[agreement['total_fee']==newdf['payment']].assign(payment=np.nan)
Note: assign() sets payment to NaN for these rows. Also note that the comparison above relies on agreement and newdf sharing the same row order (both are ordered by agreement_id), since the == aligns on the index.
Now shift the dates by term_months using pd.tseries.offsets.DateOffset together with the apply() method:
newdf['date']=newdf['date']+agreement['term_months'].apply(lambda x:pd.tseries.offsets.DateOffset(months=x))
Note: the above line may emit a performance warning; it is only a warning, not an error, and the result is still correct.
Finally, combine the frames with the concat() method and fill the missing values with fillna():
result=pd.concat((payments,newdf),ignore_index=True).fillna(0)
Printing result now gives the desired output:
#output
cust_id agreement_id date payment
0 1 A 2020-12-01 200.0
1 1 A 2021-02-02 200.0
2 1 A 2021-02-03 100.0
3 1 A 2021-05-01 200.0
4 1 B 2021-01-02 50.0
5 1 B 2021-01-09 20.0
6 1 B 2021-03-01 80.0
7 1 B 2021-04-23 90.0
8 2 C 2021-01-21 600.0
9 3 D 2021-03-04 150.0
10 3 D 2021-05-03 150.0
11 2 C 2021-07-21 0.0
12 3 D 2021-09-04 0.0
Note: to get exactly the same output as above, use the astype() method to cast the payment column from float back to int:
result['payment']=result['payment'].astype(int)
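A more compact sketch of the same idea, merging on agreement_id instead of relying on the two frames sharing the same row order (an alternative sketch, not the answer's exact code):
payments['date'] = pd.to_datetime(payments['date'])

# one summary row per agreement: total paid, first payment date, customer id
summary = (payments.groupby('agreement_id', as_index=False)
                   .agg(payment=('payment', 'sum'),
                        date=('date', 'min'),
                        cust_id=('cust_id', 'first'))
                   .merge(agreement[['agreement_id', 'term_months', 'total_fee']],
                          on='agreement_id'))

# keep only the fully paid agreements and build the extra zero-payment rows
done = summary[summary['payment'] == summary['total_fee']].copy()
done['date'] = done.apply(lambda r: r['date'] + pd.DateOffset(months=r['term_months']), axis=1)
done['payment'] = 0

result = pd.concat([payments, done[payments.columns]], ignore_index=True)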

How to calculate the percentage in a group of columns in a pandas dataframe while keeping the original format of the data

I have a dataset given below:
date product_category product_type amount
2020-01-01 A 1 15
2020-01-01 A 2 25
2020-01-01 A 3 10
2020-01-02 B 1 15
2020-01-02 B 2 10
2020-01-03 C 2 100
2020-01-03 C 1 250
2020-01-03 C 3 150
I am trying to convert this data so that amount is normalized within each product_category and date, as shown below:
date product_category product_type amount
2020-01-01 A 1 0.30
2020-01-01 A 2 0.50
2020-01-01 A 3 0.20
2020-01-02 B 1 0.60
2020-01-02 B 2 0.40
2020-01-03 C 2 0.20
2020-01-03 C 1 0.50
2020-01-03 C 3 0.30
Is there any way to do this with pandas dataframes, updating the original pandas dataframe?
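For reference, the dataset can be rebuilt like this (a minimal sketch, with the date column kept as strings):
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02',
             '2020-01-02', '2020-01-03', '2020-01-03', '2020-01-03'],
    'product_category': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'product_type': [1, 2, 3, 1, 2, 2, 1, 3],
    'amount': [15, 25, 10, 15, 10, 100, 250, 150],
})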
Use GroupBy.transform with 'sum' to broadcast the aggregated sum back to every row, so you can divide the original amount column by it:
#to new column
df['norm'] = df['amount'].div(df.groupby(['date','product_category'])['amount'].transform('sum'))
#rewrite original column
#df['amount'] = df['amount'].div(df.groupby(['date','product_category'])['amount'].transform('sum'))
print (df)
date product_category product_type amount norm
0 2020-01-01 A 1 15 0.3
1 2020-01-01 A 2 25 0.5
2 2020-01-01 A 3 10 0.2
3 2020-01-02 B 1 15 0.6
4 2020-01-02 B 2 10 0.4
5 2020-01-03 C 2 100 0.2
6 2020-01-03 C 1 250 0.5
7 2020-01-03 C 3 150 0.3

Duplicate Quantity as new columns

This is my table (it is built by the code below).
I want to add, for each (ShopID, ProductID) pair, the Quantity of every other (ShopID, ProductID) pair as a new column, named Quantity_ShopID_ProductID.
Following is my code:
from datetime import date
import pandas as pd
df = pd.DataFrame({"Date": [date(2019,10,1), date(2019,10,2), date(2019,10,1), date(2019,10,2),
                            date(2019,10,1), date(2019,10,2), date(2019,10,1), date(2019,10,2)],
                   "ShopID": [1,1,1,1,2,2,2,2],
                   "ProductID": [1,1,2,2,1,1,2,2],
                   "Quantity": [3,3,4,4,5,5,6,6]})
for sid in df.ShopID.unique():
    for pid in df.ProductID.unique():
        col_name = 'Quantity{}_{}'.format(sid, pid)
        print(col_name)
        df1 = df[(df.ShopID==sid) & (df.ProductID==pid)][['Date','Quantity']]
        df1.rename(columns={'Quantity': col_name}, inplace=True)
        display(df1)
        df = df.merge(df1, how="left", on="Date")
        df.loc[(df.ShopID==sid) & (df.ProductID==pid), col_name] = None
print(df)
The problem is that it runs very slowly, as I have over 108 different (ShopID, ProductID) combinations over a 3-year period. Is there any way to make it more efficient?
Method 1: using pivot_table with join (vectorized solution)
We can pivot your Quantity values per (ShopID, ProductID) to columns and then join them back to your original dataframe. This should be much faster than your for-loops since it is a vectorized approach:
piv = df.pivot_table(index=['ShopID', 'ProductID'], columns=['ShopID', 'ProductID'], values='Quantity')
piv2 = piv.ffill().bfill()
piv3 = piv2.mask(piv2.eq(piv))
final = df.set_index(['ShopID', 'ProductID']).join(piv3).reset_index()
Output
ShopID ProductID dt Quantity (1, 1) (1, 2) (2, 1) (2, 2)
0 1 1 2019-10-01 3 NaN 4.0 5.0 6.0
1 1 1 2019-10-02 3 NaN 4.0 5.0 6.0
2 1 2 2019-10-01 4 3.0 NaN 5.0 6.0
3 1 2 2019-10-02 4 3.0 NaN 5.0 6.0
4 2 1 2019-10-01 5 3.0 4.0 NaN 6.0
5 2 1 2019-10-02 5 3.0 4.0 NaN 6.0
6 2 2 2019-10-01 6 3.0 4.0 5.0 NaN
7 2 2 2019-10-02 6 3.0 4.0 5.0 NaN
Method 2, using GroupBy, mask, where:
We can speed up your code by using GroupBy and mask + where instead of two for-loops:
groups = df.groupby(['ShopID', 'ProductID'])
for grp, data in groups:
    m = df['ShopID'].eq(grp[0]) & df['ProductID'].eq(grp[1])
    values = df['Quantity'].where(m).ffill().bfill()
    df[f'Quantity_{grp[0]}_{grp[1]}'] = values.mask(m)
Output
dt ShopID ProductID Quantity Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
0 2019-10-01 1 1 3 NaN 4.0 5.0 6.0
1 2019-10-02 1 1 3 NaN 4.0 5.0 6.0
2 2019-10-01 1 2 4 3.0 NaN 5.0 6.0
3 2019-10-02 1 2 4 3.0 NaN 5.0 6.0
4 2019-10-01 2 1 5 3.0 4.0 NaN 6.0
5 2019-10-02 2 1 5 3.0 4.0 NaN 6.0
6 2019-10-01 2 2 6 3.0 4.0 5.0 NaN
7 2019-10-02 2 2 6 3.0 4.0 5.0 NaN
This is a pivot and merge problem with a little extra:
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])
# define the new column names
df['new_col'] = 'Quantity_' + df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
# new data to merge:
pivot = df.pivot_table(index='Date',
                       columns='new_col',
                       values='Quantity')
# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)
# mask
mask = new_df['new_col'].values[:, None] == pivot.columns.values
# adding the None values:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
Output:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
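Once the wide columns exist, the helper column can be dropped if it is no longer needed, e.g.:
# remove the intermediate label column
new_df = new_df.drop(columns='new_col')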
Test data with similar size to your actual data:
import numpy as np

# 3 years of dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
# 12 shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
                                 names=('Date','ShopID', 'ProductID'))
# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity': np.random.randint(0,10, len(idx))},
                  index=idx).reset_index()
The above code took about 10 seconds on an i5 laptop :-)
