Pandas: Joining two Dataframes based on two criteria matches - python

Hi, I have the following DataFrame, df_send_open, that contains send and open totals:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other DataFrame, df_call_cancel, contains call and cancel totals:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in Excel, I want to add the additional columns from df_call_cancel to df_send_open; however, I need to do it on the unique combination of BOTH date and user_id, and this is where I'm tripping up.
I have two desired DataFrame outcomes (not sure which to go forward with, so I thought I'd ask for both solutions):
Desired Dataframe 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
DataFrame 1 only joins the call and cancel columns if the combination of date and user_id exists in df_send_open, as this is the primary dataframe.
Desired Dataframe 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
DataFrame 2 will do the same as DataFrame 1 but will also add any date and user combinations in df_call_cancel that aren't in df_send_open (see percy).
Many thanks.

merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id', 'name']).fillna(0)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id', 'name']).fillna(0)
This should work for your two cases: one left join and one outer join. Including 'name' in the join keys keeps a single name column (otherwise pandas adds _x/_y suffixes), and fillna(0) replaces the missing call/cancel (and send/open) values with zeros, as in your desired outputs.
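For reference, here is a minimal, self-contained sketch with abbreviated sample data (a subset of the rows shown above) showing both merges end to end:

import pandas as pd

# Abbreviated sample data, assumed to mirror the frames shown above.
df_send_open = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31', '2022-03-30'],
    'user_id': [35, 47, 35],
    'name': ['sally', 'bob', 'sally'],
    'send': [50, 100, 20],
    'open': [20, 55, 5],
})
df_call_cancel = pd.DataFrame({
    'date': ['2022-03-31', '2022-03-31', '2022-03-30'],
    'user_id': [21, 47, 35],
    'name': ['percy', 'bob', 'sally'],
    'call': [54, 150, 13],
    'cancel': [21, 21, 5],
})

keys = ['date', 'user_id', 'name']
# Desired DataFrame 1: keep only the rows already in df_send_open.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=keys).fillna(0)
# Desired DataFrame 2: also keep combinations that exist only in df_call_cancel (e.g. percy).
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=keys).fillna(0)
print(merged_df1)
print(merged_df2)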

Select rows based on condition on rows and columns

Hello, I have a pandas dataframe that I want to clean. Here is an example:
IDBILL  IDBUYER  BILL  DATE
001     768787   45    1897-07-24
002     768787   30    1897-07-24
005     786545   45    1897-08-19
008     657676   89    1989-09-23
009     657676   42    1989-09-23
010     657676   18    1989-09-23
012     657676   51    1990-03-10
016     892354   73    1990-03-10
018     892354   48    1765-02-14
I want to delete the highest bills (and keep the lowest) when the bills are made on the same day, by the same IDBUYER, and their bill IDs follow each other.
To get this:
IDBILL  IDBUYER  BILL  DATE
002     768787   30    1897-07-24
005     786545   45    1897-08-19
010     657676   18    1989-09-23
012     657676   51    1990-03-10
016     892354   73    1990-03-10
018     892354   48    1765-02-14
Thank you in advance
First, convert the 'DATE' column to datetime dtype using the to_datetime() method:
df['DATE'] = pd.to_datetime(df['DATE'])
Try with groupby() method:
result = df.groupby(['IDBUYER', df['DATE'].dt.day], as_index=False)[['IDBILL', 'BILL', 'DATE']].min()
OR
result = df.groupby(['DATE', 'IDBUYER'], sort=False)[['IDBILL', 'BILL']].min().reset_index()
Output of result:
IDBUYER IDBILL BILL DATE
0 657676 12 51 1990-03-10
1 657676 8 18 1989-09-23
2 768787 1 30 1897-07-24
3 786545 5 45 1897-08-19
4 892354 16 73 1990-03-10
5 892354 18 48 1765-02-14
You could try this to keep only the minimum-BILL entry within each run of consecutive IDBILL values:
df['follow_up'] = df['IDBILL'].ne(df['IDBILL'].shift()+1).cumsum()
m = df.groupby(['IDBUYER', 'follow_up', df['DATE']])['BILL'].idxmin()
df.loc[sorted(m)]
# IDBILL IDBUYER BILL DATE follow_up
# 1 2 768787 30 1897-07-24 1
# 2 5 786545 45 1897-08-19 2
# 5 10 657676 18 1989-09-23 3
# 6 12 657676 51 1990-03-10 4
# 7 16 892354 73 1990-03-10 5
# 8 18 892354 48 1765-02-14 6
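For completeness, here is a runnable sketch of that approach on the sample data above (treating IDBILL as plain integers, without the zero-padding, is an assumption so that consecutive IDs can be detected):

import pandas as pd

df = pd.DataFrame({
    'IDBILL':  [1, 2, 5, 8, 9, 10, 12, 16, 18],
    'IDBUYER': [768787, 768787, 786545, 657676, 657676, 657676, 657676, 892354, 892354],
    'BILL':    [45, 30, 45, 89, 42, 18, 51, 73, 48],
    'DATE':    ['1897-07-24', '1897-07-24', '1897-08-19', '1989-09-23',
                '1989-09-23', '1989-09-23', '1990-03-10', '1990-03-10', '1765-02-14'],
})
df['DATE'] = pd.to_datetime(df['DATE'])

# Rows whose IDBILL is not exactly the previous IDBILL + 1 start a new "follow_up" block.
df['follow_up'] = df['IDBILL'].ne(df['IDBILL'].shift() + 1).cumsum()

# Within each buyer / block / date, keep only the row with the smallest BILL.
m = df.groupby(['IDBUYER', 'follow_up', 'DATE'])['BILL'].idxmin()
result = df.loc[sorted(m)].drop(columns='follow_up')
print(result)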

Sort outer multi-index

I want to sort a dataframe highest to lowest based on column B. I can't find an answer on how to sort the outer (i.e. first) index column.
I have this example data:
A B
Item Type
0 X 'rtr' 2
Tier 'sfg' 104
1 X 'zad' 7
Tier 'asd' 132
2 X 'frs' 4
Tier 'plg' 140
3 X 'gfq' 9
Tier 'bcd' 100
Each outer-index group contains a "Tier" row. I want to sort the outer index "Item" based on the "B" column value of each group's "Tier" row. The "A" column can be ignored for sorting purposes but needs to be included in the dataframe.
A B
Item Type
2 X 'frs' 4
Tier 'plg' 140
1 X 'zad' 7
Tier 'asd' 132
0 X 'rtr' 2
Tier 'sfg' 104
3 X 'gfq' 9
Tier 'bcd' 100
New Response #2
Based on all the inputs received, here's the solution. Hope this works for you.
import pandas as pd
df = pd.read_csv("xyz.txt")
df1 = df.copy()
#capture the original index of each row. This will be used for sorting later
df1['idx'] = df1.index
#create a dataframe with only items that match 'Tier'
#assumption is each Index has a row with 'Tier'
tier = df1.loc[df1['Type']=='Tier']
#sort Total for only the Tier rows
tier = tier.sort_values('Total')
#Create a list of the indexes in sorted order
#this will be the order to print the rows
tier_list = tier['Index'].tolist()
# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(tier_list, range(len(tier_list))))
# Generate a rank column that will be used to sort the dataframe numerically
df1['Tier_Rank'] = df1['Index'].map(sorterIndex)
# Now sort the dataframe based on the rank column and the original index
df1.sort_values(['Tier_Rank', 'idx'], ascending=[True, True], inplace=True)
# Drop the temporary columns we created
df1.drop(['Tier_Rank', 'idx'], axis=1, inplace=True)
# Print the dataframe
print(df1)
Based on the source data, here's the final output. Let me know if this is in line with what you were looking for.
Index Type Id ... Intellect Strength Total
12 2 Chest Armor "6917529202229928161" ... 17 8 62
13 2 Gauntlets "6917529202229927889" ... 16 14 60
14 2 Helmet "6917529202223945870" ... 10 9 66
15 2 Leg Armor "6917529202802011569" ... 15 2 61
16 2 Set NaN ... 58 33 249
17 2 Tier NaN ... 5 3 22
24 4 Chest Armor "6917529202229928161" ... 17 8 62
25 4 Gauntlets "6917529202802009244" ... 7 9 63
26 4 Helmet "6917529202223945870" ... 10 9 66
27 4 Leg Armor "6917529202802011569" ... 15 2 61
28 4 Set NaN ... 49 28 252
29 4 Tier NaN ... 4 2 22
42 7 Chest Armor "6917529202229928161" ... 17 8 62
43 7 Gauntlets "6917529202791088503" ... 7 14 61
44 7 Helmet "6917529202223945870" ... 10 9 66
45 7 Leg Armor "6917529202229923870" ... 7 19 57
46 7 Set NaN ... 41 50 246
47 7 Tier NaN ... 4 5 22
0 0 Chest Armor "6917529202229928161" ... 17 8 62
1 0 Gauntlets "6917529202778947311" ... 10 15 62
2 0 Helmet "6917529202223945870" ... 10 9 66
3 0 Leg Armor "6917529202802011569" ... 15 2 61
4 0 Set NaN ... 52 34 251
5 0 Tier NaN ... 5 3 23
6 1 Chest Armor "6917529202229928161" ... 17 8 62
7 1 Gauntlets "6917529202778947311" ... 10 15 62
8 1 Helmet "6917529202223945870" ... 10 9 66
9 1 Leg Armor "6917529202229923870" ... 7 19 57
10 1 Set NaN ... 44 51 247
11 1 Tier NaN ... 4 5 23
18 3 Chest Armor "6917529202229928161" ... 17 8 62
19 3 Gauntlets "6917529202229927889" ... 16 14 60
20 3 Helmet "6917529202223945870" ... 10 9 66
21 3 Leg Armor "6917529202229923870" ... 7 19 57
22 3 Set NaN ... 50 50 245
23 3 Tier NaN ... 5 5 23
30 5 Chest Armor "6917529202229928161" ... 17 8 62
31 5 Gauntlets "6917529202802009244" ... 7 9 63
32 5 Helmet "6917529202223945870" ... 10 9 66
33 5 Leg Armor "6917529202229923870" ... 7 19 57
34 5 Set NaN ... 41 45 248
35 5 Tier NaN ... 4 4 23
36 6 Chest Armor "6917529202229928161" ... 17 8 62
37 6 Gauntlets "6917529202791088503" ... 7 14 61
38 6 Helmet "6917529202223945870" ... 10 9 66
39 6 Leg Armor "6917529202802011569" ... 15 2 61
40 6 Set NaN ... 49 33 250
41 6 Tier NaN ... 4 3 23
[48 rows x 11 columns]
New Response:
Based on the source data file shared, here's the group by and sort. Let me know how you want the values to be sorted. I have assumed that you want it sorted by Index, then Total.
df = df.groupby(['Index', 'Type'])\
       .agg({'Total': 'mean'})\
       .sort_values(['Index', 'Total'])
The output of this will be as follows:
Total
Index Type
0 Tier 23
Leg Armor 61
Chest Armor 62
Gauntlets 62
Helmet 66
Set 251
1 Tier 23
Leg Armor 57
Chest Armor 62
Gauntlets 62
Helmet 66
Set 247
2 Tier 22
Gauntlets 60
Leg Armor 61
Chest Armor 62
Helmet 66
Set 249
3 Tier 23
Leg Armor 57
Gauntlets 60
Chest Armor 62
Helmet 66
Set 245
4 Tier 22
Leg Armor 61
Chest Armor 62
Gauntlets 63
Helmet 66
Set 252
Initial Response:
I don't have your raw data, so I created some data to show you how sorting would work on groupby data. See if this is what you are looking for.
import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Type': ['Wild', 'Captive', 'Wild', 'Captive'],
                   'Air': ['Good', 'Bad', 'Bad', 'Good'],
                   'Max Speed': [380., 370., 24., 26.]})
df = df.groupby(['Animal', 'Type', 'Air'])\
       .agg({'Max Speed': 'mean'})\
       .sort_values('Max Speed')
print(df)
The output will be as follows:
Max Speed
Animal Type Air
Parrot Wild Bad 24.0
Captive Good 26.0
Falcon Captive Bad 370.0
Wild Good 380.0
Without the sort command, the output will be a bit different.
df = df.groupby(['Animal', 'Type', 'Air'])\
       .agg({'Max Speed': 'mean'})
This results in the output below. Max Speed is not sorted; instead, the rows follow the groupby ordering of Animal, then Type:
Max Speed
Animal Type Air
Falcon Captive Bad 370.0
Wild Good 380.0
Parrot Captive Good 26.0
Wild Bad 24.0

How to add a row to every group with pandas groupby?

I wish to add a new row as the first line within each group. My raw dataframe is:
df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'Max', 'Max', 'Max', 'Max', 'Park', 'Tom', 'Tom', 'Tom', 'Tom', 'Wong'],
    'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
    'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person ('ID'), I wish to create a duplicate of the first row within that group. The values of the created row in the 'ID', 'From_num' and 'To_num' columns should be the same as in the original first row, but the 'Date' value should be the old first row's Date plus one day, e.g. for James the newly created row is 'James', 78, 96, '2020-05-13'. The rest of the data stays the same, so my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I wrote some loop conditions, but they are quite slow. If you have any good ideas, please help. Thanks a lot.
Let's try groupby.apply here. We'll prepend a row to the start of each group, like this:
def augment_group(group):
    # Duplicate the group's first row and shift its date forward by one day.
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return first_row.append(group)

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
(df.groupby('ID', as_index=False, group_keys=False)
   .apply(augment_group)
   .reset_index(drop=True))
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
That said, I agree with @Joran Beasley in the comments that this feels like somewhat of an XY problem. Perhaps try clarifying the problem you're trying to solve, instead of asking how to implement what you think is the solution to your issue?
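Note that DataFrame.append was removed in pandas 2.0; here is a sketch of the same idea using pd.concat instead:

import pandas as pd

def augment_group(group):
    # Copy the first row, shift its date forward by one day, and prepend it.
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return pd.concat([first_row, group])

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
out = (df.groupby('ID', as_index=False, group_keys=False)
         .apply(augment_group)
         .reset_index(drop=True))
print(out)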

Pandas: Getting 10 rows above a selected date

I am trying to use get_loc to get the current date and then return the 10 rows above the current date from the data, but I keep getting a Key Error.
Here is my DataFrame, posting_df5:
Posting_date rooms Origin Rooms booked ADR Revenue
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
10 2019-04-10 31 31 31 184.138208 5708.284456
I tried doing the following:
idx = posting_df7.index.get_loc('2019-04-05')
posting_df7 = posting_df5.iloc[idx - 5 : idx + 5]
But I received the following error:
indexer = self._get_level_indexer(key, level=level)
File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2939, in _get_level_indexer
code = level_index.get_loc(key)
File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2019-04-05'
So I then tried setting Posting_date as the index before using get_loc, but that didn't work either:
rooms Origin Rooms booked ADR Revenue
Posting_date
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
Then I used the same get_loc function, but the same error appeared. How can I select the rows based on the required date?
Thanks
Here is a different approach.
Because iloc and get_loc can be tricky, this solution uses boolean masking to return the rows relative to a given date, then uses the head() function to return the number of rows you require.
import pandas as pd

PATH = '/home/user/Desktop/so/room_rev.csv'
# Read in data from a CSV.
df = pd.read_csv(PATH)
# Convert the date column to a `datetime` format.
df['Posting_date'] = pd.to_datetime(df['Posting_date'], format='%Y-%m-%d')
# Sort based on date (note the assignment; sort_values does not sort in place by default).
df = df.sort_values('Posting_date')
Original Dataset:
Posting_date rooms Origin Rooms booked ADR Revenue
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
10 2019-04-10 31 31 31 184.138208 5708.284456
Solution:
Replace the value in the head() function with the number of rows you want to return. Note: There is also a tail() function for the inverse.
df[df['Posting_date'] > '2019-04-05'].head(3)
Output:
Posting_date rooms Origin Rooms booked ADR Revenue
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
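If you do want the get_loc approach from your question, here is a sketch (assuming the Posting_date values are unique): convert the column to datetime, set it as the index, then slice by position.

import pandas as pd

df['Posting_date'] = pd.to_datetime(df['Posting_date'])
df = df.sort_values('Posting_date').set_index('Posting_date')

# Integer position of the selected date in the index.
idx = df.index.get_loc(pd.Timestamp('2019-04-05'))

# The 10 rows above (before) the selected date; fewer if there is not enough history.
before = df.iloc[max(idx - 10, 0):idx]
print(before)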

Calculating expanding mean on 2 columns simultaneously

I have a table of 2 players competing each other:
date plA plB ptsA ptsB
0 01/01/2013 Jeff Tom 78 72
1 15/01/2013 Jeff Tom 52 67
2 01/02/2013 Tom Jeff 91 93
3 15/02/2013 Jeff Tom 83 87
4 01/03/2013 Tom Jeff 65 76
I want to apply an expanding mean such that, for each player, both ptsA and ptsB values count toward that player's running mean, regardless of which column the player appears in. The final output should make it clearer:
date plA plB ptsA ptsB meanA meanB
0 01/01/2013 Jeff Tom 78 72 78 72 # init mean
1 15/01/2013 Jeff Tom 52 67 65 69.5
2 01/02/2013 Tom Jeff 91 93 74.3 76.6 # Tom: (72+67+91)/3, Jeff: (78+52+93)/3
3 15/02/2013 Jeff Tom 83 87 76.5 79.25 # Jeff: (78+52+93+83)/4, Tom: (72+67+91+87)/4
4 01/03/2013 Tom Jeff 65 76 76.4 76.4 # Tom: (72+67+91+87+65)/5, Jeff: (78+52+93+83+76)/5
Now, I started by grouping the data by plA like this:
by_A = players.sort(columns='date').groupby('plA')
players['meanA'] = by_A['ptsA'].apply(pd.expanding_mean)
players['meanB'] = by_A['ptsB'].apply(pd.expanding_mean)
Obviously I need to do the same with groupby('plB'), and then I'm drawing a blank on how to join these two results correctly.
Perhaps pandas offers a built-in, or you have a solution for it?
#EDIT Saullo Castro's solution with slightly different data
date studentA studentB scoreA scoreB meanJeff meanTom meanMaggie
0 2013-01-01 Jeff Tom 78 72 78.000000 72.000000 0.000000
1 2013-01-15 Jeff Maggie 52 67 65.000000 36.000000 33.500000
2 2013-02-01 Tom Jeff 91 93 74.333333 54.333333 22.333333
3 2013-02-15 Jeff Tom 83 87 76.500000 62.500000 16.750000
4 2013-03-01 Tom Jeff 65 76 76.400000 63.000000 13.400000
Maggie's mean should stay 67 all the way.
(Please refer to the fixed solution below.)
One approach is to find all the players' names first:
names = pd.concat((df.plA, df.plB)).unique()
Then create one new column with the expanding mean for each player:
for name in names:
    df['mean' + name] = pd.expanding_mean(df.ptsA*(df.plA == name) + df.ptsB*(df.plB == name))
Resulting in:
date plA plB ptsA ptsB meanJeff meanTom
0 2013-01-01 00:00:00 Jeff Tom 78 72 78.000000 72.000000
1 15/01/2013 Jeff Tom 52 67 65.000000 69.500000
2 2013-01-02 00:00:00 Tom Jeff 91 93 74.333333 76.666667
3 15/02/2013 Jeff Tom 83 87 76.500000 79.250000
4 2013-01-03 00:00:00 Tom Jeff 65 76 76.400000 76.400000
EDIT: Fixed solution:
For more than two names this is how you can build the formula for the expanding mean:
import numpy as np

df = pd.read_excel('stack.xlsx', 'tabelle1')
names = pd.concat((df.plA, df.plB)).unique()
for name in names:
    nA = df.plA == name
    nB = df.plB == name
    df['mean' + name] = np.cumsum(df.ptsA*nA + df.ptsB*nB) / np.maximum(
        1., np.cumsum(1.*np.logical_or(nA, nB)))
Resulting in:
date plA plB ptsA ptsB meanJeff meanTom meanMaggie
0 2013-01-01 00:00:00 Jeff Tom 78 72 78.000000 72.000000 0
1 2013-01-15 00:00:00 Jeff Maggie 52 67 65.000000 72.000000 67
2 2013-02-01 00:00:00 Tom Jeff 91 93 74.333333 81.500000 67
3 2013-02-15 00:00:00 Jeff Tom 83 87 76.500000 83.333333 67
4 2013-03-01 00:00:00 Tom Jeff 65 76 76.400000 78.750000 67
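Note that pd.expanding_mean no longer exists in current pandas; here is a sketch of the same cumulative-mean formula using only present-day Series methods (same column names as above):

import pandas as pd

names = pd.concat([df.plA, df.plB]).unique()
for name in names:
    in_a = df.plA.eq(name)
    in_b = df.plB.eq(name)
    # Points this player scored in each row (0 where the player did not play).
    pts = df.ptsA.where(in_a, 0) + df.ptsB.where(in_b, 0)
    # Matches the player has appeared in so far (at least 1 to avoid division by zero).
    played = (in_a | in_b).cumsum().clip(lower=1)
    df['mean' + name] = pts.cumsum() / played
print(df)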
