Python: add columns to dataframe from another with matching "vlookup"

I'd like to add two columns to an existing dataframe from another dataframe based on a lookup in the name column.
The dataframe to update looks like this:
Player           School           Conf  Cmp  Att
Danny Wuerffel   Florida          1     708  1170
Steve Sarkisian  Brigham Young    0     528  789
Billy Blanton    San Diego State  0     588  920
And I'd like to take the height and weight from this dataframe (actually a JSON file) and add them based on matching Player names:
Name             School           Conf  Height  Weight  Pct   Yds
Danny Wuerffel   Florida          1     6-2     217     60.5  10875
Steve Sarkisian  Brigham Young    0     6-3     230     66.9  7464
Billy Blanton    San Diego State  0     6-0     222     63.9  8165
Code-wise, I tried something like this so far:
existing_dataframe['Height'] = pd.Series(height_weight_df['Height'])
But I'm missing the part that matches them on the name, since the two dataframes aren't in the same order.

Let us try:
existing_dataframe = existing_dataframe.merge(
    height_weight_df[['Name', 'School', 'Height', 'Weight']],
    left_on=['Player', 'School'],
    right_on=['Name', 'School'],
    how='left'
)
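Note that after this merge the Name column duplicates Player; if it's not needed, it can be dropped (a small follow-up, not part of the original answer):
existing_dataframe = existing_dataframe.drop(columns='Name')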

Related

Merging strings of people's names in pandas

I have two datasets that I want to merge based on the person's name. One dataset, player_nationalities, has their full name:
Player, Nationality
Kylian Mbappé, France
Wissam Ben Yedder, France
Gianluigi Donnarumma, Italy
The other dataset, player_ratings, shortens their first names to an initial with a full stop and keeps the other name(s).
Player, Rating
K. Mbappé, 93
W. Ben Yedder, 89
G. Donnarumma, 91
How do I merge these tables based on the column Player and avoid merging people with the same last name? This is my attempt:
df = pd.merge(player_nationality, player_ratings, on='Player', how='left')
Player, Nationality, Rating
K. Mbappé, France, NaN
W. Ben Yedder, France, NaN
G. Donnarumma, Italy, NaN
You would need to normalize the keys in both DataFrames in order to merge them.
One idea would be to create a function to process the full name in player_nationalities and merge on the processed value for the player name, e.g.:
def convert_player_name(name):
    try:
        first_name, last_name = name.split(' ', maxsplit=1)
        return f'{first_name[0]}. {last_name}'
    except ValueError:
        return name

player_nationalities['processed_name'] = [convert_player_name(name) for name in player_nationalities['Player']]
df_merged = player_nationalities.merge(player_ratings, left_on='processed_name', right_on='Player')
[out]
Player_x Nationality processed_name Player_y Rating
0 Kylian Mbappé France K. Mbappé K. Mbappé 93
1 Wissam Ben Yedder France W. Ben Yedder W. Ben Yedder 89
2 Gianluigi Donnarumma Italy G. Donnarumma G. Donnarumma 91
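One caveat: two players who share a first initial and surname would still collide under this key, so the merge is only as reliable as the processed name is unique. As a possible cleanup step (a sketch, assuming the merged frame above), the helper key and the duplicated name column can be dropped:
df_merged = (df_merged
             .drop(columns=['processed_name', 'Player_y'])
             .rename(columns={'Player_x': 'Player'}))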

Make a wide dataframe long and add columns according to another column's name

I need to use some of the column names as part of the df. While keeping the first 3 columns identical, I need to create some other columns based on the content of each row.
Here I have some transactions from some customers:
cust_id cust_first cust_last au_zo au_zo_pay fi_gu fi_gu_pay wa wa_pay
0 1000 Andrew Jones 50.85 debit NaN NaN 69.12 debit
1 1001 Fatima Lee NaN NaN 18.16 debit NaN NaN
2 1002 Sophia Lewis NaN NaN NaN NaN 159.54 credit
3 1003 Edward Bush 45.29 credit 59.63 credit NaN NaN
4 1004 Mark Nunez 20.87 credit 20.87 credit 86.18 debit
First, I need to add a new column, 'city'. Since it is not in the database, it defaults to 'New York'. (That's easy!)
But here is where I am getting stuck:
Add a new column 'store' that holds values according to where the transaction took place: au_zo --> auto zone, fi_gu --> five guys, wa --> walmart
Add a new column 'classification' according to the store previously added: auto zone --> auto-repair, five guys --> food, walmart --> groceries
Column 'amount' holds the transaction value for that customer and store.
Column 'trans_type' holds the value of au_zo_pay, fi_gu_pay, wa_pay respectively.
So at the end it looks like this:
cust_id city cust_first cust_last store classification amount trans_type
0 1000 New York  Andrew Jones auto zone auto-repair 50.85 debit
1 1000 New York Andrew Jones walmart groceries 69.12 debit
2 1001 New York Fatima Lee five guys food 18.16 debit
3 1002 New York Sophia Lewis walmart groceries 159.54 credit
4 1003 New York Edward Bush auto zone auto-repair 45.29 credit
5 1003 New York Edward Bush five guys food 59.63 credit
6 1004 New York Mark Nunez auto zone auto-repair 20.87 credit
7 1004 New York Mark Nunez five guys food 20.87 credit
8 1004 New York Mark Nunez walmart groceries 86.18 debit
I have tried using df.melt() but I don't get the expected results.
Is this something you want?
import pandas as pd

mp = {
    'au_zo': 'auto-repair',
    'wa': 'groceries',
    'fi_gu': 'food'
}

### Read txt data: get pandas df
# I copied and pasted your sample data to a txt file; you can ignore this part
with open(r"C:\Users\orf-haoj\Desktop\test.txt", 'r') as file:
    head, *df = [row.split() for row in file.readlines()]
    df = [row[1:] for row in df]
    df = pd.DataFrame(df, columns=head)

### Here we conduct 2 melts to form melt_1 & melt_2
# this melt is for cols 'au_zo', 'fi_gu', and 'wa', with amount as the value
melt_1 = df.melt(id_vars=['cust_id', 'cust_first', 'cust_last'], value_vars=['au_zo', 'fi_gu', 'wa'], var_name='store', value_name='amount')
# this melt is for cols 'au_zo_pay', 'fi_gu_pay', and 'wa_pay', with trans_type as the value
melt_2 = df.melt(id_vars=['cust_id', 'cust_first', 'cust_last'], value_vars=['au_zo_pay', 'fi_gu_pay', 'wa_pay'], var_name='store pay', value_name='trans_type')
# since I want to join these tables later, it is good to derive one more key, store
melt_2['store'] = melt_2['store pay'].apply(lambda x: '_'.join(x.split('_')[:-1]))

### Remove NaN
# you probably want to switch to melt_1.loc[~melt_1['amount'].isnull()] or similar if you have actual NaN values
melt_1 = melt_1.loc[melt_1['amount'] != 'NaN']
melt_2 = melt_2.loc[melt_2['trans_type'] != 'NaN']

### Inner join based on 4 keys (assuming your data has a one-to-one relationship on these keys)
full_df = melt_1.merge(melt_2, on=['cust_id', 'cust_first', 'cust_last', 'store'], how='inner')
full_df['city'] = 'New York'
full_df['classification'] = full_df['store'].apply(lambda x: mp[x])
Note that this method has a limitation: if the one-to-one relationship does not hold for those four keys, the join will produce an incorrect dataset.
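One way to surface that problem early (a sketch using pandas' validate option, which raises an error if the keys are duplicated on either side):
full_df = melt_1.merge(
    melt_2,
    on=['cust_id', 'cust_first', 'cust_last', 'store'],
    how='inner',
    validate='one_to_one',  # raises pandas.errors.MergeError on duplicate keys
)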
Try this
# assign city column and set index by customer demographic columns
df1 = df.assign(city='New York').set_index(['cust_id', 'city', 'cust_first', 'cust_last'])
# fix column names by completing the abbrs
df1.columns = df1.columns.to_series().replace({'au_zo': 'autozone', 'fi_gu': 'five guys', 'wa': 'walmart'}, regex=True)
# split column names for a multiindex column
df1.columns = pd.MultiIndex.from_tuples([c.split('_') if c.endswith('pay') else [c, 'amount'] for c in df1.columns], names=['store',''])
# stack df1 to make the wide df to a long df
df1 = df1.stack(0).reset_index()
# insert classification column
df1.insert(5, 'classification', df1.store.map({'autozone': 'auto-repair', 'five guys': 'food', 'walmart': 'groceries'}))
df1
One other way is as follows:
df1 is exactly df with renamed columns, i.e. with the name amount in front of the store value:
import re

df1 = (df
       .rename(lambda x: re.sub('(.*)_pay', 'pay:\\1', x), axis=1)
       .rename(lambda x: re.sub('^(((?!cust|pay).)*)$', 'amount:\\1', x), axis=1))
Now pivot to longer using pd.wide_to_long and do the replacement.
df2 = (pd.wide_to_long(df1, stubnames=['amount', 'pay'],
                       i=df1.columns[:3], j='store', sep=':', suffix='\\w+')
       .reset_index().dropna())
store = {'au_zo':'auto zone', 'fi_gu':'five guys', 'wa':'walmart'}
classification = {'au_zo':'auto-repair', 'fi_gu':'food', 'wa':'groceries'}
df2['classification'] = df2['store'].replace(classification)
df2['store'] = df2['store'].replace(store)
cust_id cust_first cust_last store amount pay classification
0 1000 Andrew Jones auto zone 50.85 debit auto-repair
2 1000 Andrew Jones walmart 69.12 debit groceries
4 1001 Fatima Lee five guys 18.16 debit food
8 1002 Sophia Lewis walmart 159.54 credit groceries
9 1003 Edward Bush auto zone 45.29 credit auto-repair
10 1003 Edward Bush five guys 59.63 credit food
12 1004 Mark Nunez auto zone 20.87 credit auto-repair
13 1004 Mark Nunez five guys 20.87 credit food
14 1004 Mark Nunez walmart 86.18 debit groceries
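To match the asker's desired column name exactly, a final rename could be appended (a small follow-up, not part of the original answer):
df2 = df2.rename(columns={'pay': 'trans_type'})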
NB: one option for transforming to long form is with pivot_longer from pyjanitor; it has a lot of options. For this particular use case, we pull out multiple values and multiple names (paired with the appropriate regexes), before using other Pandas functions to rename and add new columns:
# pip install pyjanitor
import pandas as pd
import janitor

mapper = {'au_zo': 'autozone',
          'fi_gu': 'five guys',
          'wa': 'walmart'}

store_mapper = {'autozone': 'repair',
                'five guys': 'food',
                'walmart': 'groceries'}

(df
 .assign(city='New York')
 .pivot_longer(
     index='c*',
     names_to=['ignore', 'store'],
     values_to=['trans_type', 'amount'],
     names_pattern=['.+pay$', '.+'],
     sort_by_appearance=True)
 .dropna()
 .drop(columns='ignore')
 .replace(mapper)
 .assign(classification=lambda df: df.store.map(store_mapper))
)
cust_id cust_first cust_last city trans_type store amount classification
0 1000 Andrew Jones New York debit autozone 50.85 repair
2 1000 Andrew Jones New York debit walmart 69.12 groceries
4 1001 Fatima Lee New York debit five guys 18.16 food
8 1002 Sophia Lewis New York credit walmart 159.54 groceries
9 1003 Edward Bush New York credit autozone 45.29 repair
10 1003 Edward Bush New York credit five guys 59.63 food
12 1004 Mark Nunez New York credit autozone 20.87 repair
13 1004 Mark Nunez New York credit five guys 20.87 food
14 1004 Mark Nunez New York debit walmart 86.18 groceries

How to add a dictionary as the last element to a list of dictionaries?

I would like to add a dictionary to a list that contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the Excel file with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City": "Beijing", "Country": "China", "Population": "24000000", "Area": "6490"}
print(beijing)
Try appending the new row to the DataFrame you read (note that append returns a new DataFrame rather than modifying it in place):
travel_df = travel_df.append(beijing, ignore_index=True)
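Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the current equivalent. And since cities was built with to_dict('records'), the literal answer to the question is a plain list append. A minimal sketch of both:
import pandas as pd

# Modern replacement for the removed DataFrame.append:
travel_df = pd.concat([travel_df, pd.DataFrame([beijing])], ignore_index=True)

# Or simply append the dictionary to the existing list of dicts:
cities.append(beijing)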

Pandas group by but keep another column

Say that I have a dataframe that looks something like this
date location year
0 1908-09-17 Fort Myer, Virginia 1908
1 1909-09-07 Juvisy-sur-Orge, France 1909
2 1912-07-12 Atlantic City, New Jersey 1912
3 1913-08-06 Victoria, British Columbia, Canada 1912
I want to use pandas groupby function to create an output that shows the total number of incidents by year but also keep the location column that will display one of the locations that year. Any which one works. So it would look something like this:
total location
year
1908 1 Fort Myer, Virginia
1909 1 Juvisy-sur-Orge, France
1912 2 Atlantic City, New Jersey
Can this be done without doing funky joining? The furthest I can get is using the normal groupby
df = df.groupby(['year']).count()
But that only gives me something like this
      date  location
year
1908     1         1
1909     1         1
1912     2         2
How can I display one of the locations in this dataframe?
You can use groupby.agg and use 'first' to extract the first location in each group:
res = df.groupby('year')['location'].agg(['first', 'count'])
print(res)
# first count
# year
# 1908 Fort Myer, Virginia 1
# 1909 Juvisy-sur-Orge, France 1
# 1912 Atlantic City, New Jersey 2
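If the columns should mirror the desired layout exactly, a small follow-up rename works (a sketch, assuming the res from above):
res = res.rename(columns={'first': 'location', 'count': 'total'})[['total', 'location']]
print(res)
#       total                   location
# year
# 1908      1        Fort Myer, Virginia
# 1909      1    Juvisy-sur-Orge, France
# 1912      2  Atlantic City, New Jersey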

How do I replace the values in the second dataframe based on the values in the first dataframe?

I have two dataframes, df and df1.
df:
Product_name Name City
Rice Chetwynd Chetwynd, British Columbia, Canada
Wheat Yuma Yuma, AZ, United States
Sugar Dochra Singleton, New South Wales, Australia
Milk India Hyderabad, India
df1:
Product_ID Unique_ID Origin_From Deliver_To
231 125 Sugar Milk
598 125 Milk Wheat
786 125 Rice Sugar
568 125 Sugar Wheat
122 125 Wheat Rice
269 125 Milk Wheat
Final output (df2): take the "Origin_From" and "Deliver_To" values in df1 and look each one up in df; if found, replace the value in df1 with df['City'] plus the original product name in parentheses. The output (df2) would be something like below.
df2:
Product_ID Unique_ID Origin_From Deliver_To
231 125 Singleton, New South Wales, Australia, (Sugar) Hyderabad, India, (Milk)
598 125 Hyderabad, India, (Milk) Yuma, AZ, United States, (Wheat)
786 125 Chetwynd, British Columbia, Canada, (Rice) Singleton, New South Wales, Australia, (Sugar)
568 125 Singleton, New South Wales, Australia, (Sugar) Yuma, AZ, United States, (Wheat)
122 125 Yuma, AZ, United States, (Wheat) Chetwynd, British Columbia, Canada, (Rice)
269 125 Hyderabad, India, (Milk) Yuma, AZ, United States, (Wheat)
I am struggling a bit with it so a couple of shoves in the right direction would really help.
Thanks in advance.
Setup
from io import StringIO
import pandas as pd

df_txt = """Product_name  Name      City
Rice          Chetwynd  Chetwynd, British Columbia, Canada
Wheat         Yuma      Yuma, AZ, United States
Sugar         Dochra    Singleton, New South Wales, Australia
Milk          India     Hyderabad, India"""

df1_txt = """Product_ID  Unique_ID  Origin_From  Deliver_To
231         125        Sugar        Milk
598         125        Milk         Wheat
786         125        Rice         Sugar
568         125        Sugar        Wheat
122         125        Wheat        Rice
269         125        Milk         Wheat"""

df = pd.read_csv(StringIO(df_txt), sep=r'\s{2,}', engine='python')
df1 = pd.read_csv(StringIO(df1_txt), sep=r'\s{2,}', engine='python')
Solution
option 1
m = df.set_index('Product_name').City
df2 = df1.copy()
df2.Origin_From = df1.Origin_From.map(m) + ', (' + df1.Origin_From + ')'
df2.Deliver_To = df1.Deliver_To.map(m) + ', (' + df1.Deliver_To + ')'
df2
option 2
m = df.set_index('Product_name').City
c = ['Origin_From', 'Deliver_To']
fnt = df1[c].stack()
df2 = df1.drop(columns=c).join(fnt.map(m).add(fnt.apply(', ({})'.format)).unstack())
option 3
using merge
c = ['Origin_From', 'Deliver_To']
ds = df1[c].stack().to_frame('Product_name')
ds['City'] = ds.merge(df)['City'].values
df2 = df1.drop(columns=c).join(ds.City.add(', (').add(ds.Product_name).add(')').unstack())
Deeper explanation of option 3
assign the target columns to variable c for convenience
use stack to convert the 2-column dataframe into a series object with a multi-index
anticipating that I'm going to merge, I use to_frame to convert the series object into a single-column dataframe; pd.merge only works on dataframes
more anticipation: I pass the name of the single column to the to_frame method. This is a coincident column name that will be merged on.
add a column named 'City' that holds the results of the merge. I add the values with the values attribute in order to ignore the index of the resulting merge and focus on just the resulting values.
ds now has the index I want in its first level. I leave it stacked while I do some convenient string operations, then unstack. In this form, the indices are aligned and I can leverage join.
I hope that's clear.
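For readers tracing option 3, a minimal sketch of the intermediate ds frame (using the Setup data above; alignment approximate):
c = ['Origin_From', 'Deliver_To']
ds = df1[c].stack().to_frame('Product_name')
print(ds.head(4))
#               Product_name
# 0 Origin_From        Sugar
#   Deliver_To          Milk
# 1 Origin_From         Milk
#   Deliver_To         Wheat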
