Merge one file into another file in groups - python

In Python and Pandas, I have one dataframe for 2018 which looks like this:
Date Stock_id Stock_value
02/01/2018 1 4
03/01/2018 1 2
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
and a dataframe with one column which has all the 2018 dates like the following:
Date
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
etc
I want to merge these so that my first dataframe has the full set of 2018 dates for each stock, with NAs wherever there was no data.
Basically, I want a row for each stock and each date of 2018 (where rows without data are filled with NAs).
Thus, I want to have the following as an output for the sample above:
Date Stock_id Stock_value
01/01/2018 1 NA
02/01/2018 1 4
03/01/2018 1 2
04/01/2018 1 NA
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
05/01/2018 2 NA
How can I do this?
I tested
data = data_1.merge(data_2, on='Date', how='outer')
and
data = data_1.merge(data_2, on='Date', how='right')
but I still got essentially the original dataframe: no per-stock date rows were added, only a few rows that were NA everywhere (merging on Date alone cannot produce a row per (Stock_id, Date) combination).

Use itertools.product to build all combinations of Stock_id and Date, then merge with a left join:
from itertools import product

df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)

c = ['Stock_id', 'Date']
df = pd.DataFrame(list(product(df1['Stock_id'].unique(), df2['Date'])), columns=c)
print(df)
Stock_id Date
0 1 2018-01-01
1 1 2018-01-02
2 1 2018-01-03
3 1 2018-01-04
4 1 2018-01-05
5 1 2018-01-06
6 2 2018-01-01
7 2 2018-01-02
8 2 2018-01-03
9 2 2018-01-04
10 2 2018-01-05
11 2 2018-01-06
and
df = df[['Date','Stock_id']].merge(df1, how='left')
# if necessary, specify both columns
# df = df[['Date','Stock_id']].merge(df1, how='left', on=['Date','Stock_id'])
print(df)
Date Stock_id Stock_value
0 2018-01-01 1 NaN
1 2018-01-02 1 4.0
2 2018-01-03 1 2.0
3 2018-01-04 1 NaN
4 2018-01-05 1 7.0
5 2018-01-06 1 NaN
6 2018-01-01 2 6.0
7 2018-01-02 2 9.0
8 2018-01-03 2 4.0
9 2018-01-04 2 6.0
10 2018-01-05 2 NaN
11 2018-01-06 2 NaN
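If your pandas is recent enough, a cross merge builds the same grid without itertools; a minimal sketch, assuming pandas >= 1.2 (which added how='cross') and the df1/df2 from above:
ids = df1[['Stock_id']].drop_duplicates()      # unique stocks
base = df2.merge(ids, how='cross')             # every Date x Stock_id pair
df = base.merge(df1, on=['Date', 'Stock_id'], how='left')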
Another idea, though it is likely to be slow on large data:
df = (df1.groupby('Stock_id')[['Date','Stock_value']]
         .apply(lambda x: x.set_index('Date').reindex(df2['Date']))
         .reset_index())
print(df)
Stock_id Date Stock_value
0 1 2018-01-01 NaN
1 1 2018-01-02 4.0
2 1 2018-01-03 2.0
3 1 2018-01-04 NaN
4 1 2018-01-05 7.0
5 1 2018-01-06 NaN
6 2 2018-01-01 6.0
7 2 2018-01-02 9.0
8 2 2018-01-03 4.0
9 2 2018-01-04 6.0
10 2 2018-01-05 NaN
11 2 2018-01-06 NaN
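Note that both variants leave Date as datetime64; if the original dd/mm/yyyy strings are needed in the output (an assumption about your downstream use), strftime converts back:
df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')  # back to strings like 01/01/2018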

Related

Fill dates on dataframe within groups with same ending

This is what I have:
df = pd.DataFrame({'item': [1,1,2,2,1,1],
                   'shop': ['A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2,3,1,4,4,5]]),
                   'qty': [5,6,7,8,9,10]})
print(df)
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 2 A 2018-01-01 7
3 2 A 2018-01-04 8
4 1 B 2018-01-04 9
5 1 B 2018-01-05 10
This is what I want:
out = pd.DataFrame({'item': [1,1,1,1,2,2,2,2,2,1,1],
                    'shop': ['A','A','A','A','A','A','A','A','A','B','B'],
                    'date': pd.to_datetime(['2018.01.' + str(x) for x in [2,3,4,5,1,2,3,4,5,4,5]]),
                    'qty': [5,6,0,0,7,0,0,8,0,9,10]})
print(out)
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 1 A 2018-01-04 0
3 1 A 2018-01-05 0
4 2 A 2018-01-01 7
5 2 A 2018-01-02 0
6 2 A 2018-01-03 0
7 2 A 2018-01-04 8
8 2 A 2018-01-05 0
9 1 B 2018-01-04 9
10 1 B 2018-01-05 10
This is what I achieved so far:
df.set_index('date').groupby(['item', 'shop']).resample("D")['qty'].sum().reset_index(name='qty')
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 1 B 2018-01-04 9
3 1 B 2018-01-05 10
4 2 A 2018-01-01 7
5 2 A 2018-01-02 0
6 2 A 2018-01-03 0
7 2 A 2018-01-04 8
I want to complete the missing dates (by day!) so that each group [item-shop] will end with the same date.
Ideas?
The key here is to find the min date within each group and the global max date, then create the per-group date range, explode it, and merge back:
# find the min date for each shop under each item
s = df.groupby(['item','shop'])[['date']].min()
# find the global max
s['datemax'] = df['date'].max()
# combine the two results
s['date'] = [pd.date_range(x, y) for x, y in zip(s['date'], s['datemax'])]
out = s.explode('date').reset_index().merge(df, how='left').fillna(0)
out
item shop date datemax qty
0 1 A 2018-01-02 2018-01-05 5.0
1 1 A 2018-01-03 2018-01-05 6.0
2 1 A 2018-01-04 2018-01-05 0.0
3 1 A 2018-01-05 2018-01-05 0.0
4 1 B 2018-01-04 2018-01-05 9.0
5 1 B 2018-01-05 2018-01-05 10.0
6 2 A 2018-01-01 2018-01-05 7.0
7 2 A 2018-01-02 2018-01-05 0.0
8 2 A 2018-01-03 2018-01-05 0.0
9 2 A 2018-01-04 2018-01-05 8.0
10 2 A 2018-01-05 2018-01-05 0.0
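The datemax helper is not part of the desired output; a short cleanup sketch (the int cast assumes you want qty back as integers):
out = out.drop(columns='datemax')
out['qty'] = out['qty'].astype(int)   # qty became float via fillna(0)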
I think this gives you what you want (columns are ordered differently)
max_date = df.date.max()

def reindex_to_max_date(df):
    return df.set_index('date').reindex(
        pd.date_range(df.date.min(), max_date, name='date'), fill_value=0)

res = df.groupby(['shop', 'item']).apply(reindex_to_max_date)
res = res.qty.reset_index()
I grouped by shop, item to give the same sort order as you have in out, but these can be swapped.
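If the exact column order of out matters, a reorder afterwards does it; a sketch, assuming res has columns shop, item, date, qty after reset_index:
res = res[['item', 'shop', 'date', 'qty']]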
Not sure if this is the most efficient way, but one idea is to create a dataframe with all the dates and do a left join at shop-item level, as follows.
Initial data
import pandas as pd

df = pd.DataFrame({'item': [1,1,2,2,1,1],
                   'shop': ['A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2,3,1,4,4,5]]),
                   'qty': [5,6,7,8,9,10]})
df = df.set_index('date')\
       .groupby(['item', 'shop'])\
       .resample("D")['qty']\
       .sum()\
       .reset_index(name='qty')
Dataframe with all dates
We first get the min and max dates
rg = df.agg({"date": ["min", "max"]})
and then we create a df with all possible dates
df_dates = pd.DataFrame(
    {"date": pd.date_range(
        start=rg["date"]["min"],
        end=rg["date"]["max"])
    })
Complete dates
Now for every shop-item group we do a left join with all possible dates
def complete_dates(x, df_dates):
    item = x["item"].iloc[0]
    shop = x["shop"].iloc[0]
    x = pd.merge(df_dates, x,
                 on=["date"],
                 how="left")
    x["item"] = item
    x["shop"] = shop
    return x
And we finally apply this function to the original df.
df.groupby(["item", "shop"])\
.apply(lambda x:
complete_dates(x, df_dates)
)\
.reset_index(drop=True)
date item shop qty
0 2018-01-01 1 A NaN
1 2018-01-02 1 A 5.0
2 2018-01-03 1 A 6.0
3 2018-01-04 1 A NaN
4 2018-01-05 1 A NaN
5 2018-01-01 1 B NaN
6 2018-01-02 1 B NaN
7 2018-01-03 1 B NaN
8 2018-01-04 1 B 9.0
9 2018-01-05 1 B 10.0
10 2018-01-01 2 A 7.0
11 2018-01-02 2 A 0.0
12 2018-01-03 2 A 0.0
13 2018-01-04 2 A 8.0
14 2018-01-05 2 A NaN
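The desired out uses 0 instead of NaN for the missing quantities; assuming the chained result above was assigned to a variable, a fillna finishes the job:
res = res.fillna({'qty': 0})   # res: the chained result above (assumption)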
You could use the complete function from pyjanitor to expose the missing values; the end date is the max of date, while the starting date varies per group of item and shop.
Create a dictionary that pairs the target column date to a new date range:
new_date = {"date" : lambda date: pd.date_range(date.min(), df['date'].max())}
Pass the new_date variable to complete :
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

df.complete([new_date], by=['item', 'shop']).fillna(0)
item shop date qty
0 1 A 2018-01-02 5.0
1 1 A 2018-01-03 6.0
2 1 A 2018-01-04 0.0
3 1 A 2018-01-05 0.0
4 1 B 2018-01-04 9.0
5 1 B 2018-01-05 10.0
6 2 A 2018-01-01 7.0
7 2 A 2018-01-02 0.0
8 2 A 2018-01-03 0.0
9 2 A 2018-01-04 8.0
10 2 A 2018-01-05 0.0
complete is just an abstraction of pandas functions that makes it easier to explicitly expose missing values in a Pandas dataframe.
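For intuition, here is a rough pandas-only sketch of what complete is doing here (my approximation, not pyjanitor's actual internals): reindex each item-shop group from its own min date up to the global max.
max_date = df['date'].max()
out = (df.set_index('date')
         .groupby(['item', 'shop'])['qty']
         .apply(lambda g: g.reindex(pd.date_range(g.index.min(), max_date, name='date')))
         .fillna(0)
         .reset_index())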

Add rows in date gaps

I need to insert rows in my dataframe:
This is my df:
I want this result, grouped by client; that is, I have to create this for every client present in my dataframe.
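The question's screenshots are missing; here is a minimal reconstruction of df that is consistent with the output shown below (the exact values and dates are inferred, so treat them as an assumption):
import pandas as pd

df = pd.DataFrame({'client': [1, 1, 1, 2, 2],
                   'month': ['31/03/2017', '31/05/2017', '31/12/2017',
                             '30/09/2018', '31/10/2018'],
                   'col1': [20, 90, 100, None, 7]})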
Try something like this:
df['month'] = pd.to_datetime(df.month, format='%d/%m/%Y',dayfirst=True ,errors='coerce')
df.set_index(['month']).groupby(['client']).resample('M').asfreq().drop('client', axis=1).reset_index()
client month col1
0 1 2017-03-31 20.0
1 1 2017-04-30 NaN
2 1 2017-05-31 90.0
3 1 2017-06-30 NaN
4 1 2017-07-31 NaN
5 1 2017-08-31 NaN
6 1 2017-09-30 NaN
7 1 2017-10-31 NaN
8 1 2017-11-30 NaN
9 1 2017-12-31 100.0
10 2 2018-09-30 NaN
11 2 2018-10-31 7.0

How to join a table with each group of a dataframe in pandas

I have a dataframe like the one below. Each date is the Monday of its week.
df = pd.DataFrame({'date': ['2020-04-20', '2020-05-11', '2020-05-18',
                            '2020-04-20', '2020-04-27', '2020-05-04', '2020-05-18'],
                   'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range; both have some missing dates, which means the count for that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the missing name and count values with 0's. Does anyone know how to achieve that, i.e. how to join another table with a grouped table?
I tried this with no luck...
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can do reindex after creating a MultiIndex of all date-name combinations (note that pd.merge expects DataFrames, so passing a GroupBy object as above fails):
idx = pd.MultiIndex.from_product([df_dates.date, df.name.unique()], names=['date', 'name'])
s = df.set_index(['date', 'name']).reindex(idx, fill_value=0).reset_index().sort_values('name')
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or
s = df.pivot(*df.columns).reindex(df_dates.date).fillna(0).reset_index().melt('date')
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
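melt names the value column value by default; to match the original header, rename it (and cast back to int if desired):
s = s.rename(columns={'value': 'count'})
s['count'] = s['count'].astype(int)   # values became float via fillna(0)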
If you are looking just to fill in the union of dates present in df, you can do:
(df.set_index(['date','name'])
   .unstack('date', fill_value=0)
   .stack()
   .reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122

Pandas sum over a date range for each category separately

I have a dataframe with timeseries of sales transactions for different items:
import pandas as pd
from datetime import timedelta
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
# Create datetimes and data
df_1['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_1['item'] = 1
df_1['sales'] = 2
df_2['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_2['item'] = 2
df_2['sales'] = 3
df_3['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_3['item'] = 3
df_3['sales'] = 4
df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item'])
df
Resulting dataframe:
date item sales
0 2018-01-01 1 2
1 2018-01-02 1 2
2 2018-01-03 1 2
3 2018-01-04 1 2
4 2018-01-05 1 2
0 2018-01-01 2 3
1 2018-01-02 2 3
2 2018-01-03 2 3
3 2018-01-04 2 3
4 2018-01-05 2 3
0 2018-01-01 3 4
1 2018-01-02 3 4
2 2018-01-03 3 4
3 2018-01-04 3 4
4 2018-01-05 3 4
I want to compute a sum of "sales" for a given item in a given time window. I can't use pandas rolling.sum
because the timeseries is sparse (e.g. 2018-01-01 > 2018-01-04 > 2018-01-06 > etc.).
I've tried this solution (for a time window of 2 days):
df['start_date'] = df['date'] - timedelta(3)
df['end_date'] = df['date'] - timedelta(1)
df['rolled_sales'] = df.apply(lambda x: df.loc[(df.date >= x.start_date) &
                                               (df.date <= x.end_date), 'sales'].sum(),
                              axis=1)
but it results in sums of sales across all items for a given time window:
date item sales start_date end_date rolled_sales
0 2018-01-01 1 2 2017-12-29 2017-12-31 0
1 2018-01-02 1 2 2017-12-30 2018-01-01 9
2 2018-01-03 1 2 2017-12-31 2018-01-02 18
3 2018-01-04 1 2 2018-01-01 2018-01-03 27
4 2018-01-05 1 2 2018-01-02 2018-01-04 27
0 2018-01-01 2 3 2017-12-29 2017-12-31 0
1 2018-01-02 2 3 2017-12-30 2018-01-01 9
2 2018-01-03 2 3 2017-12-31 2018-01-02 18
3 2018-01-04 2 3 2018-01-01 2018-01-03 27
4 2018-01-05 2 3 2018-01-02 2018-01-04 27
0 2018-01-01 3 4 2017-12-29 2017-12-31 0
1 2018-01-02 3 4 2017-12-30 2018-01-01 9
2 2018-01-03 3 4 2017-12-31 2018-01-02 18
3 2018-01-04 3 4 2018-01-01 2018-01-03 27
4 2018-01-05 3 4 2018-01-02 2018-01-04 27
My goal is to have rolled_sales computed for each item separately, like this:
date item sales start_date end_date rolled_sales
0 2018-01-01 1 2 2017-12-29 2017-12-31 0
1 2018-01-02 1 2 2017-12-30 2018-01-01 2
2 2018-01-03 1 2 2017-12-31 2018-01-02 4
3 2018-01-04 1 2 2018-01-01 2018-01-03 6
4 2018-01-05 1 2 2018-01-02 2018-01-04 8
0 2018-01-01 2 3 2017-12-29 2017-12-31 0
1 2018-01-02 2 3 2017-12-30 2018-01-01 3
2 2018-01-03 2 3 2017-12-31 2018-01-02 6
3 2018-01-04 2 3 2018-01-01 2018-01-03 9
4 2018-01-05 2 3 2018-01-02 2018-01-04 12
0 2018-01-01 3 4 2017-12-29 2017-12-31 0
1 2018-01-02 3 4 2017-12-30 2018-01-01 4
2 2018-01-03 3 4 2017-12-31 2018-01-02 8
3 2018-01-04 3 4 2018-01-01 2018-01-03 12
4 2018-01-05 3 4 2018-01-02 2018-01-04 16
I tried to apply solution suggested here: Pandas rolling sum for multiply values separately
but failed.
Any ideas?
Many Thanks in advance :)
Andy
Total sales with a 2-day rolling window per item:
z = df.sort_values('date').set_index('date').groupby('item').rolling('2d')['sales'].sum()
Output:
item date
1 2018-01-01 2.0
2018-01-02 4.0
2018-01-03 4.0
2018-01-04 4.0
2018-01-05 4.0
2 2018-01-01 3.0
2018-01-02 6.0
2018-01-03 6.0
2018-01-04 6.0
2018-01-05 6.0
3 2018-01-01 4.0
2018-01-02 8.0
2018-01-03 8.0
2018-01-04 8.0
2018-01-05 8.0
Name: sales, dtype: float64
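z comes back as a Series with an (item, date) MultiIndex; if you need a flat DataFrame again, reset the index:
z = z.reset_index()   # columns: item, date, sales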
Total sales from the last 2 rows (days) per item; note cumcount counts from the start of each group, so count from the end instead:
df[df.groupby('item').cumcount(ascending=False) < 2].groupby('item')['sales'].sum()
Total sales between start_date and end_date per item:
start_date = pd.to_datetime('2017-12-2')
end_date = pd.to_datetime('2018-12-2')
df[df['date'].between(start_date, end_date)].groupby('item')['sales'].sum()
df['rolled_sum'] = (df.groupby('item')
                      .rolling('3D', on='date').sum()['sales']
                      .to_numpy()  # drop the groupby index so values align positionally
                    )
After some data wrangling (I removed some rows to simulate sparse dates, and added helper columns "start_date" and "end_date" at a 3-day distance from a given date), the final output looks like this:
date item sales start_date end_date rolled_sum
0 2018-01-01 1 2 2017-12-30 2018-01-01 2.0
3 2018-01-04 1 2 2018-01-02 2018-01-04 2.0
4 2018-01-05 1 2 2018-01-03 2018-01-05 4.0
7 2018-01-08 1 2 2018-01-06 2018-01-08 2.0
9 2018-01-10 1 2 2018-01-08 2018-01-10 4.0
12 2018-01-03 2 3 2018-01-01 2018-01-03 3.0
13 2018-01-04 2 3 2018-01-02 2018-01-04 6.0
15 2018-01-06 2 3 2018-01-04 2018-01-06 6.0
17 2018-01-08 2 3 2018-01-06 2018-01-08 6.0
18 2018-01-09 2 3 2018-01-07 2018-01-09 6.0
19 2018-01-10 2 3 2018-01-08 2018-01-10 9.0
21 2018-01-02 3 4 2017-12-31 2018-01-02 4.0
23 2018-01-04 3 4 2018-01-02 2018-01-04 8.0
25 2018-01-06 3 4 2018-01-04 2018-01-06 8.0
26 2018-01-07 3 4 2018-01-05 2018-01-07 8.0
27 2018-01-08 3 4 2018-01-06 2018-01-08 12.0
28 2018-01-09 3 4 2018-01-07 2018-01-09 12.0
29 2018-01-10 3 4 2018-01-08 2018-01-10 12.0
The magic was in the rolling window parameter: instead of 3, I should use '3D'.
Many Thanks for Your help :)
Andy

Merging multiple dataframes using month datetime

I have three dataframes, each with a date column. I want to left join the three using the date column. Dates are present in the form 'yyyy-mm-dd', but I want to merge using 'yyyy-mm' only.
df1
Date X
31-05-2014 1
30-06-2014 2
31-07-2014 3
31-08-2014 4
30-09-2014 5
31-10-2014 6
30-11-2014 7
31-12-2014 8
31-01-2015 1
28-02-2015 3
31-03-2015 4
30-04-2015 5
df2
Date Y
01-09-2014 1
01-10-2014 4
01-11-2014 6
01-12-2014 7
01-01-2015 2
01-02-2015 3
01-03-2015 6
01-04-2015 4
01-05-2015 3
01-06-2015 4
01-07-2015 5
01-08-2015 2
df3
Date Z
01-07-2015 9
01-08-2015 2
01-09-2015 4
01-10-2015 1
01-11-2015 2
01-12-2015 3
01-01-2016 7
01-02-2016 4
01-03-2016 9
01-04-2016 2
01-05-2016 4
01-06-2016 1
I tried:
df4 = pd.merge(df1, df2, how='left', on='Date')
Result:
Date X Y
0 2014-05-31 1 NaN
1 2014-06-30 2 NaN
2 2014-07-31 3 NaN
3 2014-08-31 4 NaN
4 2014-09-30 5 NaN
5 2014-10-31 6 NaN
6 2014-11-30 7 NaN
7 2014-12-31 8 NaN
8 2015-01-31 1 NaN
9 2015-02-28 3 NaN
10 2015-03-31 4 NaN
11 2015-04-30 5 NaN
Use Series.dt.to_period with month periods and merge the list of DataFrames with functools.reduce:
import functools

dfs = [df1, df2, df3]
dfs = [x.assign(per=x['Date'].dt.to_period('m')) for x in dfs]
df = functools.reduce(lambda left, right: pd.merge(left, right, on='per', how='left'), dfs)
print(df)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Alternative:
df1['per'] = df1['Date'].dt.to_period('m')
df2['per'] = df2['Date'].dt.to_period('m')
df3['per'] = df3['Date'].dt.to_period('m')
df4 = pd.merge(df1, df2, how='left', on='per').merge(df3, how='left', on='per')
print(df4)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
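If only the month key and the value columns are needed, the helper Date columns can be dropped at the end; a sketch, assuming the df4 from above:
result = df4[['per', 'X', 'Y', 'Z']]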
