How to merge dataframes and fill values - python

I am trying to merge the two DataFrames below so that each code is listed on each date, with the quantity filled as 0 where the code was not present in the original dataframe on that date. I have put an example of my input and desired output below, but my live data will have over a year's worth of dates and over 20,000 codes.
Input data:
df1
date
0 2021-05-03
1 2021-05-04
2 2021-05-05
3 2021-05-06
4 2021-05-07
5 2021-05-08
6 2021-05-09
7 2021-05-10
df2
date code qty
0 2021-05-03 A 2
1 2021-05-06 A 5
2 2021-05-07 A 4
3 2021-05-08 A 5
4 2021-05-10 A 6
5 2021-05-04 B 1
6 2021-05-08 B 4
Desired Output:
date code qty
03/05/2021 A 2
03/05/2021 B 0
04/05/2021 A 0
04/05/2021 B 1
05/05/2021 A 0
05/05/2021 B 0
06/05/2021 A 5
06/05/2021 B 0
07/05/2021 A 4
07/05/2021 B 0
08/05/2021 A 5
08/05/2021 B 4
09/05/2021 A 0
09/05/2021 B 0
10/05/2021 A 6
10/05/2021 B 0
I have tried the merge below, but the output I get is not what I want:
df_new = df1.merge(df2, how='left', on='date')
date code qty
0 2021-05-03 A 2.0
1 2021-05-04 B 1.0
2 2021-05-05 NaN NaN
3 2021-05-06 A 5.0
4 2021-05-07 A 4.0
5 2021-05-08 A 5.0
6 2021-05-08 B 4.0
7 2021-05-09 NaN NaN
8 2021-05-10 A 6.0

This is better suited to a reindex: create all combinations, set the index, reindex to those combinations, fillna, and then reset the index.
import pandas as pd

idx = pd.MultiIndex.from_product([df1.date, df2['code'].unique()],
                                 names=['date', 'code'])
df2 = (df2.set_index(['date', 'code'])
          .reindex(idx)
          .fillna(0, downcast='infer')
          .reset_index())
date code qty
0 2021-05-03 A 2
1 2021-05-03 B 0
2 2021-05-04 A 0
3 2021-05-04 B 1
4 2021-05-05 A 0
5 2021-05-05 B 0
6 2021-05-06 A 5
7 2021-05-06 B 0
8 2021-05-07 A 4
9 2021-05-07 B 0
10 2021-05-08 A 5
11 2021-05-08 B 4
12 2021-05-09 A 0
13 2021-05-09 B 0
14 2021-05-10 A 6
15 2021-05-10 B 0

One option with pivot_table and stack:
(df2.pivot_table(index='date', columns='code', fill_value=0)
    .reindex(df1.date, fill_value=0)
    .stack('code')
    .reset_index()
)
Output:
date code qty
0 2021-05-03 A 2
1 2021-05-03 B 0
2 2021-05-04 A 0
3 2021-05-04 B 1
4 2021-05-05 A 0
5 2021-05-05 B 0
6 2021-05-06 A 5
7 2021-05-06 B 0
8 2021-05-07 A 4
9 2021-05-07 B 0
10 2021-05-08 A 5
11 2021-05-08 B 4
12 2021-05-09 A 0
13 2021-05-09 B 0
14 2021-05-10 A 6
15 2021-05-10 B 0

Do a cross-join between df1 and the unique values of code, then use df.fillna():
In [480]: x = pd.DataFrame(df2.code.unique())
In [483]: y = df1.assign(key=1).merge(x.assign(key=1), on='key').drop(columns='key').rename(columns={0: 'code'})
In [486]: res = y.merge(df2, how='left').fillna(0)
In [487]: res
Out[487]:
date code qty
0 2021-05-03 A 2.0
1 2021-05-03 B 0.0
2 2021-05-04 A 0.0
3 2021-05-04 B 1.0
4 2021-05-05 A 0.0
5 2021-05-05 B 0.0
6 2021-05-06 A 5.0
7 2021-05-06 B 0.0
8 2021-05-07 A 4.0
9 2021-05-07 B 0.0
10 2021-05-08 A 5.0
11 2021-05-08 B 4.0
12 2021-05-09 A 0.0
13 2021-05-09 B 0.0
14 2021-05-10 A 6.0
15 2021-05-10 B 0.0
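As a side note, on pandas >= 1.2 the key=1 trick can be written more directly with merge(how='cross'); a minimal sketch rebuilding the question's sample data (frame contents taken from the tables above):

```python
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(
    ['2021-05-03', '2021-05-04', '2021-05-05', '2021-05-06',
     '2021-05-07', '2021-05-08', '2021-05-09', '2021-05-10'])})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2021-05-03', '2021-05-06', '2021-05-07',
                            '2021-05-08', '2021-05-10', '2021-05-04', '2021-05-08']),
    'code': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'qty': [2, 5, 4, 5, 6, 1, 4]})

# every date paired with every code, then bring in the real quantities
res = (df1.merge(df2[['code']].drop_duplicates(), how='cross')
          .merge(df2, on=['date', 'code'], how='left')
          .fillna({'qty': 0})
          .astype({'qty': int}))
```

The fillna({'qty': 0}) followed by astype keeps qty an integer column, matching the desired output.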

Add column in dataframe from another dataframe matching the id and based on condition in date columns pandas

My problem is a complex and confusing one, and I haven't been able to find the answer anywhere.
I basically have two dataframes: one is the price history of certain products, and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add the price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates such that updated date <= transaction date for that particular record, i.e. finding the latest price update on or before the transaction (the MAX updated date that is <= transaction date)
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but need to incorporate the date condition now.
Expected result for sample data:
Expected Result
Any guidance in the correct direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
    invoice,
    price_history,
    left_on="transaction_date",
    right_on="updated",
    by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
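One caveat worth adding: merge_asof requires both frames to be sorted by their respective on keys (transaction_date and updated here), otherwise it raises an error. A self-contained sketch with a trimmed version of the sample data (values taken from the tables above):

```python
import pandas as pd

price_history = pd.DataFrame({
    'product_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'updated': pd.to_datetime(['2022-01-01'] * 3 + ['2022-01-15'] * 3
                              + ['2022-02-01'] * 3),
    'price': [5.0, 5.5, 5.7, 6.0, 6.5, 6.7, 7.0, 7.5, 7.7]})
invoice = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2022-01-02'] * 3 + ['2022-01-14'] * 3
                                       + ['2022-05-10'] * 3),
    'product_id': [1, 2, 3] * 3,
    'quantity': [2, 3, 4, 1, 4, 2, 4, 2, 1]})

# both frames must be sorted on the asof keys before merging
res = pd.merge_asof(
    invoice.sort_values('transaction_date'),
    price_history.sort_values('updated'),
    left_on='transaction_date',
    right_on='updated',
    by='product_id')
```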
merge_asof with the direction argument:
merged = pd.merge_asof(
    left=invoice,
    right=price_history,
    left_on="transaction_date",
    right_on="updated",
    by="product_id",
    direction="backward",
    suffixes=("", "_y"),
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7

pandas Dataframe: Subtract a groupby mean of subset data from the full original data

I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data whose index is a datetime object (monthly, say 100 years = 100 yr * 12 mo) and whose 10 columns are station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of above data, e.g. top 50 years (i.e., 50yr*12mn),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate monthly mean for each month for each stations (columns). e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract above from the [original] NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate a monthly mean from a subset of the original data (e.g., using only 50 years), and subtract that monthly mean from the original data month by month.
It was easy to subtract if I were using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do if the mean is calculated from a [subset] data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) shape looks like the following:
Input shape with random values
Only the values of output are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace non-matched values with NaN using DataFrame.where, so that after GroupBy.transform you get the same index as the original DataFrame and can subtract directly:
import numpy as np
import pandas as pd

np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create new column m for months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
And on this column, mean_sub is merged by DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
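Both ideas above can also be condensed without a helper column: compute mean_sub from the subset, look it up per row by month, and restore the datetime index before subtracting. A sketch using the same seeded data (dates written out explicitly rather than via date_range):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
idx = pd.to_datetime(['2000-01-31', '2000-04-30', '2000-07-31', '2000-10-31',
                      '2001-01-31', '2001-04-30', '2001-07-31', '2001-10-31',
                      '2002-01-31', '2002-04-30'])
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)), index=idx)

top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]

# monthly mean of the subset, indexed by month number (here 1, 4, 7, 10)
mean_sub = data_sub.groupby(data_sub.index.month).mean()

# broadcast the subset means to every row by month, restore the index, subtract
anom = data_org - mean_sub.reindex(data_org.index.month).set_axis(data_org.index)
```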

How to join a table with each group of a dataframe in pandas

I have a dataframe like below. Each date is Monday of each week.
df = pd.DataFrame({'date' :['2020-04-20', '2020-05-11','2020-05-18',
'2020-04-20', '2020-04-27','2020-05-04','2020-05-18'],
'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range. Both have some missing dates, which means the count in that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the missing name and count values with 0's. Does anyone know how to achieve that? How can I join another table with a grouped table?
I tried the following with no luck:
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can do a reindex with a MultiIndex created from all combinations:
idx = pd.MultiIndex.from_product([df_dates.date, df.name.unique()],
                                 names=['date', 'name'])
s = df.set_index(['date', 'name']).reindex(idx, fill_value=0).reset_index().sort_values('name')
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or
s = (df.pivot(index='date', columns='name', values='count')
       .reindex(df_dates.date).fillna(0).reset_index().melt('date'))
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
If you are only looking to fill in the union of dates already present in df, you can do:
(df.set_index(['date', 'name'])
   .unstack('date', fill_value=0)
   .stack().reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122
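For completeness, the failed merge attempt in the question is close; df.groupby('name') cannot be passed to merge directly, but on pandas >= 1.2 a cross join of the dates with the unique names followed by a left join does the same job. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2020-04-20', '2020-05-11', '2020-05-18',
                            '2020-04-20', '2020-04-27', '2020-05-04', '2020-05-18'],
                   'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'count': [23, 44, 125, 6, 9, 10, 122]})
df_dates = pd.DataFrame({'date': ['2020-04-20', '2020-04-27', '2020-05-04',
                                  '2020-05-11', '2020-05-18']})

# every (date, name) pair, then the real counts, with 0 for the gaps
out = (df_dates.merge(df[['name']].drop_duplicates(), how='cross')
               .merge(df, on=['date', 'name'], how='left')
               .fillna({'count': 0})
               .astype({'count': int})
               .sort_values(['name', 'date'], ignore_index=True))
```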

Merging multiple dataframe using month datetime

I have three dataframes, each with a date column. I want to left join the three using the date column. Dates are present in the form 'yyyy-mm-dd', but I want to merge the dataframes using 'yyyy-mm' only.
df1
Date X
31-05-2014 1
30-06-2014 2
31-07-2014 3
31-08-2014 4
30-09-2014 5
31-10-2014 6
30-11-2014 7
31-12-2014 8
31-01-2015 1
28-02-2015 3
31-03-2015 4
30-04-2015 5
df2
Date Y
01-09-2014 1
01-10-2014 4
01-11-2014 6
01-12-2014 7
01-01-2015 2
01-02-2015 3
01-03-2015 6
01-04-2015 4
01-05-2015 3
01-06-2015 4
01-07-2015 5
01-08-2015 2
df3
Date Z
01-07-2015 9
01-08-2015 2
01-09-2015 4
01-10-2015 1
01-11-2015 2
01-12-2015 3
01-01-2016 7
01-02-2016 4
01-03-2016 9
01-04-2016 2
01-05-2016 4
01-06-2016 1
Try:
df4 = pd.merge(df1,df2, how='left', on='Date')
Result:
Date X Y
0 2014-05-31 1 NaN
1 2014-06-30 2 NaN
2 2014-07-31 3 NaN
3 2014-08-31 4 NaN
4 2014-09-30 5 NaN
5 2014-10-31 6 NaN
6 2014-11-30 7 NaN
7 2014-12-31 8 NaN
8 2015-01-31 1 NaN
9 2015-02-28 3 NaN
10 2015-03-31 4 NaN
11 2015-04-30 5 NaN
Use Series.dt.to_period with monthly periods and merge the list of DataFrames:
import functools
dfs = [df1, df2, df3]
dfs = [x.assign(per=x['Date'].dt.to_period('m')) for x in dfs]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='per', how='left'), dfs)
print (df)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Alternative:
df1['per'] = df1['Date'].dt.to_period('m')
df2['per'] = df2['Date'].dt.to_period('m')
df3['per'] = df3['Date'].dt.to_period('m')
df4 = pd.merge(df1,df2, how='left', on='per').merge(df3, how='left', on='per')
print (df4)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
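Both variants assume the Date columns are already datetime; with the dd-mm-yyyy strings shown in the question, they need to be parsed first with dayfirst=True. A self-contained sketch of the reduce-based merge, trimmed to df1 and df2 for brevity:

```python
import functools
import pandas as pd

df1 = pd.DataFrame({'Date': ['30-09-2014', '31-10-2014', '30-11-2014'],
                    'X': [5, 6, 7]})
df2 = pd.DataFrame({'Date': ['01-09-2014', '01-10-2014', '01-11-2014'],
                    'Y': [1, 4, 6]})

dfs = [df1, df2]
# parse the dd-mm-yyyy strings, then derive a monthly period to merge on
dfs = [x.assign(per=pd.to_datetime(x['Date'], dayfirst=True).dt.to_period('M'))
       for x in dfs]
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='per', how='left'), dfs)
```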

Merge one file to other file in groups

In Python and Pandas, I have one dataframe for 2018 which looks like this:
Date Stock_id Stock_value
02/01/2018 1 4
03/01/2018 1 2
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
and a dataframe with one column which has all the 2018 dates like the following:
Date
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
etc
I want to merge these to get my first dataframe with the full 2018 dates for each stock, and with NAs wherever there was no data.
Basically, I want a row for each stock for each date of 2018 (where rows that do not have any data should be filled in with NAs).
Thus, I want to have the following as an output for the sample above:
Date Stock_id Stock_value
01/01/2018 1 NA
02/01/2018 1 4
03/01/2018 1 2
04/01/2018 1 NA
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
05/01/2018 2 NA
How can I do this?
I tested
data = data_1.merge(data_2, on='Date' , how='outer')
and
data = data_1.merge(data_2, on='Date' , how='right')
but I still got the original dataframe with no new dates added, only some extra rows filled entirely with NAs.
Use itertools.product to build all combinations of Stock_id and date, then merge with a left join:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
from itertools import product
c = ['Stock_id','Date']
df = pd.DataFrame(list(product(df1['Stock_id'].unique(), df2['Date'])), columns=c)
print (df)
Stock_id Date
0 1 2018-01-01
1 1 2018-01-02
2 1 2018-01-03
3 1 2018-01-04
4 1 2018-01-05
5 1 2018-01-06
6 2 2018-01-01
7 2 2018-01-02
8 2 2018-01-03
9 2 2018-01-04
10 2 2018-01-05
11 2 2018-01-06
and
df = df[['Date','Stock_id']].merge(df1, how='left')
#if necessary specify both columns
#df = df[['Date','Stock_id']].merge(df1, how='left', on=['Date','Stock_id'])
print (df)
Date Stock_id Stock_value
0 2018-01-01 1 NaN
1 2018-01-02 1 4.0
2 2018-01-03 1 2.0
3 2018-01-04 1 NaN
4 2018-01-05 1 7.0
5 2018-01-06 1 NaN
6 2018-01-01 2 6.0
7 2018-01-02 2 9.0
8 2018-01-03 2 4.0
9 2018-01-04 2 6.0
10 2018-01-05 2 NaN
11 2018-01-06 2 NaN
Another idea, though it may be slow on large data:
df = (df1.groupby('Stock_id')[['Date','Stock_value']]
         .apply(lambda x: x.set_index('Date').reindex(df2['Date']))
         .reset_index())
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 NaN
1 1 2018-01-02 4.0
2 1 2018-01-03 2.0
3 1 2018-01-04 NaN
4 1 2018-01-05 7.0
5 1 2018-01-06 NaN
6 2 2018-01-01 6.0
7 2 2018-01-02 9.0
8 2 2018-01-03 4.0
9 2 2018-01-04 6.0
10 2 2018-01-05 NaN
11 2 2018-01-06 NaN
