I am trying to merge a pandas DataFrame with a pivot table, and the merge changes the column names. Can I retain the pivot table's original two-level column names instead of having each pair collapsed into a single tuple column?
df1:
     pn  number_of_records        date
A103310                  0  2017-09-01
B103309                  0  2017-06-01
C103308                  0  2017-03-01
D103307                  2  2016-12-01
E103306                  2  2016-09-01
df2 which is a pivot table:
     pn    t1
           a1   b1   c1
A103310  3432  324  124
B103309   342  123  984
C103308   435  487  245
D103307   879  358  234
E103306   988  432  235
Doing a merge on these dataframes:
df1_df2 = pd.merge(df1, df2, how="left", on="pn")
gives me the column names as:
pn number_of_records date (t1,a1) (t1,b1) (t1,c1)
How can I instead have them as:
pn  number_of_records  date    t1
                               a1  b1  c1
in the dataframe after the merge?
Add a level to the columns of df1 so that both frames carry two-level columns before merging:

# keys=[''] adds an empty level; swaplevel moves it below the original names, matching df2's layout
pd.concat([df1], axis=1, keys=['']).swaplevel(0, 1, 1).merge(df2, on='pn')
        pn number_of_records        date    t1
                                            a1   b1   c1
0  A103310                 0  2017-09-01  3432  324  124
1  B103309                 0  2017-06-01   342  123  984
2  C103308                 0  2017-03-01   435  487  245
3  D103307                 2  2016-12-01   879  358  234
4  E103306                 2  2016-09-01   988  432  235
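An equivalent, slightly more explicit way to add that level is MultiIndex.from_product (a sketch; it builds the same two-level columns on df1 by hand instead of the concat/swaplevel trick):

import pandas as pd

# Assumption: df1 and df2 are the frames shown above, with df2 carrying
# two-level columns such as ('pn', '') and ('t1', 'a1').
df1_lifted = df1.copy()
df1_lifted.columns = pd.MultiIndex.from_product([df1.columns, ['']])
df1_df2 = df1_lifted.merge(df2, how='left', on='pn')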
I have two data frames df1 and df2 as shown below:
df1
Date ID Amount BillNo1
10/08/2020 ABBCSQ1ZA 878 2020/156
10/08/2020 ABBCSQ1ZA 878 2020/157
10/12/2020 AC928Q1ZS 3998 343SONY
10/14/2020 AC9268RE3 198 432
10/16/2020 AA171E1Z0 5490 AIPO325
10/19/2020 BU073C1ZW 3432 IDBI436-Total
10/19/2020 BU073C1ZW 3432 IDBI437-Total
df2
Date ID Amount BillNo2
10/08/2020 ABBCSQ1ZA 878 156
10/11/2020 ATRC95REW 115 265
10/14/2020 AC9268RE3 198 A/432
10/16/2020 AA171E1Z0 5490 325
10/19/2020 BU073C1ZW 3432 436
10/19/2020 BU073C1ZW 3432 437
My final answer should be:
Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/156 156 # 156 matches
10/14/2020 AC9268RE3 198 432 A/432 # 432 matches
10/16/2020 AA171E1Z0 5490 AIPO325 325 # 325 matches
10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 # 436 matches
10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 # 437 matches
Non Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/157 NaN
10/12/2020 AC928Q1ZS 3998 343SONY NaN
10/11/2020 ATRC95REW 115 NaN 265
How do I merge the two dataframes based on a partial string match between the columns ['BillNo1', 'BillNo2']?
You can define your own thresholds, but one proposal is below:
import difflib
from functools import partial

import pandas as pd

# the function below is inspired by https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x, y):
    """x = possibilities, y = input"""
    f = partial(
        difflib.get_close_matches, possibilities=x.unique(), n=1, cutoff=0.5)
    # fuzzy-match each input string against the unique possibilities
    matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
    return pd.DataFrame([y, matches.rename('BillNo2')]).T

temp = get_closest_match(df2['BillNo2'], df1['BillNo1'])

# fall back to a plain substring match for rows the fuzzy step could not
# score above the cutoff (e.g. 'IDBI436-Total' -> '436')
temp['BillNo2'] = (temp['BillNo2']
                   .fillna(df1['BillNo1']
                           .str.extract('(' + '|'.join(df2['BillNo2']) + ')',
                                        expand=False)))

# map the matched BillNo2 back onto df1, then outer-merge with an indicator
merged = (df1.assign(BillNo2=df1['BillNo1'].map(dict(temp.values)))
          .merge(df2.drop_duplicates(), on=['Date', 'ID', 'Amount', 'BillNo2'],
                 how='outer', indicator=True))
print(merged)
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
Once you have the merged df above, you can do:
matched = merged.query("_merge=='both'")
unmatched = merged.query("_merge!='both'")
print("Matched Df \n ", matched,'\n\n',"Unmatched Df \n " , unmatched)
Matched Df
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
Unmatched Df
Date ID Amount BillNo1 BillNo2 _merge
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
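For intuition on the fuzzy step: difflib.get_close_matches keeps candidates whose similarity ratio is at least cutoff and returns the best n of them; the .str.extract fallback then catches strings like 'IDBI436-Total', whose ratio against '436' falls below 0.5. A minimal standalone illustration with values from the frames above:

import difflib

# '432' from BillNo1 against the unique BillNo2 values; 'A/432' scores highest
difflib.get_close_matches('432', ['156', '265', 'A/432', '325', '436', '437'],
                          n=1, cutoff=0.5)
# -> ['A/432']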
Given the following table:
df = pd.DataFrame({'pers_no': [1,1,2], 'start_date': ['2000-03-01','2000-06-01', '2001-04-01'], 'end_date': ['2000-05-01','2000-07-01', '2001-06-01'], 'value': [199,219,249]})
pers_no start_date end_date value
0 1 2000-03-01 2000-05-01 199
1 1 2000-06-01 2000-07-01 219
2 2 2001-04-01 2001-06-01 249
How can I expand the DataFrame to get an extra row for, e.g., each month between start_date and end_date? The result should look like this:
pers_no date value
0 1 2000-03-01 199
1 1 2000-04-01 199
2 1 2000-05-01 199
3 1 2000-06-01 219
4 1 2000-07-01 219
5 2 2001-04-01 249
6 2 2001-05-01 249
7 2 2001-06-01 249
You can make a new column with date_range and explode the data like this (converting the example's string dates to datetimes first):
# the example's start/end columns are strings, so convert them to datetimes first
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)

def get_dt_range(dt):
    # one month-start timestamp per month covered by the row's date range
    return pd.date_range(dt['start_date'], dt['end_date'] + pd.offsets.MonthEnd(), freq='MS')

df['date'] = df[['start_date', 'end_date']].apply(get_dt_range, axis=1)
df.explode('date')[['pers_no', 'date', 'value']]
Output:
pers_no date value
0 1 2000-03-01 199
0 1 2000-04-01 199
0 1 2000-05-01 199
1 1 2000-06-01 219
1 1 2000-07-01 219
2 2 2001-04-01 249
2 2 2001-05-01 249
2 2 2001-06-01 249
You can also do this (it uses a daily frequency; pass freq='MS' to date_range instead for the monthly rows from the question, as in the sketch after the output below):
pd.concat([pd.DataFrame({'Date': pd.date_range(row.start_date, row.end_date, freq='d'),
                         'value': row.value,
                         'pers_no': row.pers_no}, columns=['Date', 'value', 'pers_no'])
           for i, row in df.iterrows()], ignore_index=True)
which gives:
Date value pers_no
0 2000-03-01 199 1
1 2000-03-02 199 1
2 2000-03-03 199 1
3 2000-03-04 199 1
4 2000-03-05 199 1
.. ... ... ...
150 2001-05-28 249 2
151 2001-05-29 249 2
152 2001-05-30 249 2
153 2001-05-31 249 2
154 2001-06-01 249 2
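A variant of the same pattern with freq='MS', which reproduces the monthly rows asked for in the question (a sketch; it also works with the string dates from the original df, since date_range parses them):

import pandas as pd

out = pd.concat(
    [pd.DataFrame({'pers_no': row.pers_no,
                   'date': pd.date_range(row.start_date, row.end_date, freq='MS'),
                   'value': row.value})
     for _, row in df.iterrows()],
    ignore_index=True)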
Is it possible to add a value to a column when the province name in the second dataframe matches the province name in the first dataframe? I searched for answers and wasn't able to find anything useful for my case.
This is the first DataFrame
date province confirmed released deceased
0 2020-03-30 Daegu 6624 3837 111
1 2020-03-30 Gyeongsangbuk-do 1298 772 38
2 2020-03-30 Gyeonggi-do 463 160 5
3 2020-03-30 Seoul 426 92 0
4 2020-03-30 Chungcheongnam-do 127 83 0
...
and this is the second DataFrame
code province latitude longitude
0 12000 Daegu 35.872150 128.601783
1 60000 Gyeongsangbuk-do 36.576032 128.505599
2 20000 Gyeonggi-do 37.275119 127.009466
3 10000 Seoul 37.566953 126.977977
4 41000 Chungcheongnam-do 36.658976 126.673318
...
I would like to turn the first DataFrame into this:
date province confirmed released deceased latitude longitude
0 2020-03-30 Daegu 6624 3837 111 35.872150 128.601783
1 2020-03-30 Gyeongsangbuk-do 1298 772 38 36.576032 128.505599
2 2020-03-30 Gyeonggi-do 463 160 5 37.275119 127.009466
3 2020-03-30 Seoul 426 92 0 37.566953 126.977977
4 2020-03-30 Chungcheongnam-do 127 83 0 36.658976 126.673318
...
Thanks!
The pandas.DataFrame.merge method is what you want to use here.
Using your example DataFrames:
import pandas as pd
df1 = pd.DataFrame(dict(
    date = [
        '2020-03-30', '2020-03-30', '2020-03-30',
        '2020-03-30', '2020-03-30'],
    province = [
        'Daegu', 'Gyeongsangbuk-do', 'Gyeonggi-do',
        'Seoul', 'Chungcheongnam-do'],
    confirmed = [6624, 1298, 463, 426, 127],
    released = [3837, 772, 160, 92, 83],
    deceased = [111, 38, 5, 0, 0],
))

df2 = pd.DataFrame(dict(
    code = [12000, 60000, 20000, 10000, 41000],
    province = [
        'Daegu', 'Gyeongsangbuk-do', 'Gyeonggi-do',
        'Seoul', 'Chungcheongnam-do'],
    latitude = [
        35.872150, 36.576032, 37.275119,
        37.566953, 36.658976],
    longitude = [
        128.601783, 128.505599, 127.009466,
        126.977977, 126.673318],
))

df3 = df1.merge(
    df2[['province', 'latitude', 'longitude']],
    on='province',
)
pd.set_option('display.max_columns', 7)
print(df3)
Output:
date province confirmed released deceased latitude \
0 2020-03-30 Daegu 6624 3837 111 35.872150
1 2020-03-30 Gyeongsangbuk-do 1298 772 38 36.576032
2 2020-03-30 Gyeonggi-do 463 160 5 37.275119
3 2020-03-30 Seoul 426 92 0 37.566953
4 2020-03-30 Chungcheongnam-do 127 83 0 36.658976
longitude
0 128.601783
1 128.505599
2 127.009466
3 126.977977
4 126.673318
Example Code in python tutor
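If some provinces in df1 might not appear in df2, a left merge keeps those rows and simply leaves their coordinates as NaN (a sketch using the same frames):

df3 = df1.merge(
    df2[['province', 'latitude', 'longitude']],
    on='province',
    how='left',  # keep every df1 row even when df2 has no matching province
)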
What you really want to do is merge the two DataFrames on the province column. If you prefer to build it by hand instead:
Make a new DataFrame with all the columns you want.
First loop over the first DataFrame and add all of its values, leaving the columns it does not have empty.
Then loop over the second DataFrame and add its values by comparing each province to the provinces already added to the new DataFrame (a sketch of this loop follows the tables below).
Here's an example
NewDataFrame
date province confirmed released deceased latitude longitude
After adding the first DataFrame
date province confirmed released deceased latitude longitude
0 2020-03-30 Daegu 6624 3837 111
1 2020-03-30 Gyeongsangbuk-do 1298 772 38
2 2020-03-30 Gyeonggi-do 463 160 5
3 2020-03-30 Seoul 426 92 0
4 2020-03-30 Chungcheongnam-do 127 83 0
After adding the second DataFrame
date province confirmed released deceased latitude longitude
0 2020-03-30 Daegu 6624 3837 111 35.872150 128.601783
1 2020-03-30 Gyeongsangbuk-do 1298 772 38 36.576032 128.505599
2 2020-03-30 Gyeonggi-do 463 160 5 37.275119 127.009466
3 2020-03-30 Seoul 426 92 0 37.566953 126.977977
4 2020-03-30 Chungcheongnam-do 127 83 0 36.658976 126.673318
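A minimal sketch of the loop approach described above (assuming the frames are the question's df1 and df2; in practice the single merge from the other answer is simpler):

import pandas as pd

new_rows = []

# step 1: copy df1's rows, leaving the coordinate columns empty for now
for _, row in df1.iterrows():
    new_rows.append({'date': row['date'], 'province': row['province'],
                     'confirmed': row['confirmed'], 'released': row['released'],
                     'deceased': row['deceased'],
                     'latitude': None, 'longitude': None})

# step 2: fill latitude/longitude where the province names match
for _, row in df2.iterrows():
    for new_row in new_rows:
        if new_row['province'] == row['province']:
            new_row['latitude'] = row['latitude']
            new_row['longitude'] = row['longitude']

new_df = pd.DataFrame(new_rows)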
I would like to have a dataframe created by combining only the total (All) row values of two pivot tables, keeping the same column names, including the All column.
testA:
                 sum
            ALL_APPS
MONTH        2012/08  2012/09  2012/10   All
DESCRIPTION
A1               111      112      113   336
A2               121      122      123   366
A3               131      132      133   396
All              363      366      369  1098
testB:
                 sum
            ALL_APPS
MONTH        2012/08  2012/09  2012/10   All
DESCRIPTION
A1               211      212      213   636
A2               221      222      223   666
A3               231      232      233   696
All              663      666      669  1998
As a result I would like to have a data frame that looks like this:
2019/08 2019/09 2019/10 All
363 366 369 1098
663 666 669 1998
I tried:
A=testA.iloc[3]
B=testB.iloc[3]
my_series = pd.concat([A,B],axis=1)
But it does not do what I expected :(
                            All     All
               MONTH
sum  ALL_APPS  2019/08    363.0     NaN
               2019/09    366.0     NaN
               2019/10    369.0     NaN
               All       1098.0     NaN
     CUR_VER   2019/08      NaN   663.0
               2019/09      NaN   666.0
               2019/10      NaN   669.0
               All          NaN  1998.0
Try taking the last row of each pivot (the All totals row) and keeping only the MONTH level of the columns:

my_series = pd.concat([testA.iloc[-1], testB.iloc[-1]], axis=1, ignore_index=True).T
my_series.columns = testA.columns.get_level_values('MONTH')  # drop the sum/ALL_APPS levels
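An equivalent way is to select the totals rows by label and take just their values, relying on both pivots listing the months in the same order (a sketch; 'All' and the MONTH level name are taken from the displays above):

import pandas as pd

# Assumption: testA and testB are the pivots shown above and share the same
# month order, so the raw values can share one set of column labels.
out = pd.DataFrame(
    [testA.loc['All'].to_numpy(), testB.loc['All'].to_numpy()],
    columns=testA.columns.get_level_values('MONTH'),
)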
How do I .merge two DataFrames when one column has to match against two columns?
The goal is to merge the two DataFrames so that every campaign id from the reference table gets a count of its records (leads) in the Data.
The issue is that .merge only compares one column with one column.
The Data is messy: some rows contain id names rather than ids.
It works if I merge one column to one column, or two columns to two columns, but NOT one column against two columns.
Reference table
g_spend =
campaignid id_name cost
154 campaign1 15
155 campaign2 12
1566 campaign33 12
158 campaign4 33
Data
cw =
campaignid
154
154
155
campaign1
campaign33
1566
158
campaign1
campaign1
campaign33
campaign4
Desired output
g_spend =
campaignid id_name cost leads
154 campaign1 15 5
155 campaign2 12 0
1566 campaign33 12 3
158 campaign4 33 2
What I have done:
# only works for one column
cw.head()
grouped_cw = cw.groupby(["campaignid"]).count()
grouped_cw.rename(columns={'reach':'leads'}, inplace=True)
grouped_cw = pd.DataFrame(grouped_cw)
# now merging
g_spend.campaignid = g_spend.campaignid.astype(str)
g_spend = g_spend.merge(grouped_cw, left_on='campaignid', right_index=True)
I would first set id_name as index in g_spend, then do a replace on cw, followed by a value_counts:
s = (cw.campaignid
       .replace(g_spend.set_index('id_name').campaignid)  # map id names to campaign ids
       .value_counts()
       .to_frame('leads'))

g_spend = g_spend.merge(s, left_on='campaignid', right_index=True)
Output:
campaignid id_name cost leads
0 154 campaign1 15 5
1 155 campaign2 12 1
2 1566 campaign33 12 3
3 158 campaign4 33 2
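If a campaign can have no matching rows in cw at all, the inner merge above would drop it; a left merge plus a fill keeps it with 0 leads (a sketch built on the s from above):

g_spend = g_spend.merge(s, left_on='campaignid', right_index=True, how='left')
g_spend['leads'] = g_spend['leads'].fillna(0).astype(int)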