Vlookup based on multiple columns in a Python DataFrame

I have two dataframes. I am trying to vlookup the 'Hobby' column from the 2nd dataframe and update the 'Interests' column of the 1st dataframe. Please note that the columns Key, Employee and Industry should match exactly between the two dataframes, but for City, it is acceptable if only the first part of the city matches. Though it is straightforward in Excel, it looks a bit complicated to implement in Python. Any cue on how to proceed would be really helpful. (See the expected output below.)
import pandas as pd

data1 = [['AC32456', 'NYC-URBAN', 'JON', 'BANKING', 'SINGING'],
         ['AD45678', 'WDC-RURAL', 'XING', 'FINANCE', 'DANCING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'IT', 'READING'],
         ['RT45327', 'SINGAPORE-URBAN', 'WOLF', 'SPORTS', 'WALKING'],
         ['Rs454457', 'MUMBAI-RURAL', 'NEMBIAR', 'IT', 'ZUDO']]
data2 = [['AH56245', 'NYC', 'MIKE', 'BANKING', 'BIKING'],
         ['AD45678', 'WDC', 'XING', 'FINANCE', 'TREKKING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'FINANCE', 'SLEEPING'],
         ['RT45327', 'SINGAPORE', 'WOLF', 'SPORTS', 'DANCING'],
         ['RS454457', 'MUMBAI', 'NEMBIAR', 'IT', 'ZUDO']]
List1 = ['Key', 'City', 'Employee', 'Industry', 'Interests']
List2 = ['Key', 'City', 'Employee', 'Industry', 'Hobby']
df1 = pd.DataFrame(data1, columns=List1)
df2 = pd.DataFrame(data2, columns=List2)

Set the index of df1 to the match columns (you can set the index to whatever you want to match on) and then use update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
  Key       Employee  Industry  City             Interests
0 AC32456   JON       BANKING   NYC-URBAN        SINGING
1 AD45678   XING      FINANCE   WDC-RURAL        TREKKING
2 DE43216   EDWARDS   IT        LONDON-URBAN     READING
3 RT45327   WOLF      SPORTS    SINGAPORE-URBAN  DANCING
4 Rs454457  NEMBIAR   IT        MUMBAI-RURAL     ZUDO
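For comparison, here is a minimal sketch of the same lookup done with a left merge plus combine_first instead of update. It assumes df1 and df2 as built above, while both still carry the City_key helper column and before df1's index is set:
# merge on the exact-match keys plus the city prefix
merged = df1.merge(
    df2[['Key', 'Employee', 'Industry', 'City_key', 'Hobby']],
    on=['Key', 'Employee', 'Industry', 'City_key'],
    how='left',
)
# prefer the matched Hobby, fall back to the existing Interests
merged['Interests'] = merged['Hobby'].combine_first(merged['Interests'])
new_df = merged.drop(columns=['Hobby', 'City_key'])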

Related

How do I coalesce Pandas columns only where the beginnings of the columns don't match?

I have a table with some company information that we're trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.
Name        Info
Nike        Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes   Reebok
Adidas      None
We're working with this dataset in Pandas. We'd like to follow the rules below.
1. If the Name column is equal to the left side of the Info column, keep the Name column. This should be dynamic with the length of column 1: for "Nike" it should check the first 4 characters of the Info column, and for "ASG Shoes" the first 9 characters.
2. If rule 1 is false, use the Info column.
3. If Info is None, use the Name column.
The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There's a lot going on here and I want to ensure I'm doing this properly. How can I achieve this output with the most efficient Python code possible?
Name        Info                                                        Clean
Nike        Nike, a footwear manufacturer is headquartered in Oregon.  Nike
ASG Shoes   Reebok                                                      Reebok
Adidas      None                                                        Adidas
You can start by creating another column that contains the length of your Name column; call it Slicers. Then create a function that slices a string by a given number and map it over the columns Info and Slicers, where Info is the string column to be sliced and Slicers supplies the slice length. (There may even be a pandas implementation for this, but I do not know of one.) After that, compare the sliced Info with your Name column and assign all matches to a Clean column. Finally, apply a pandas-style coalesce over the desired columns.
The code implementation is given below:
import pandas as pd

def slicer(strings, slicers):
    # slice only actual strings; pass None through unchanged
    return strings[:slicers] if isinstance(strings, str) else strings

df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None]
})
# Define the length column
df["Slicers"] = df["Name"].str.len()
# Slice the Info column by the length column and overwrite
df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))
# Check whether the sliced Info column and the Name column are equal
mask = df["Name"].eq(df["Slicers"])
# Overwrite where they are equal
df.loc[mask, "Clean"] = df.loc[mask, "Name"]
# Apply coalesce: take the first non-null among Clean, Info, Name
# (bfill(axis=1) replaces the deprecated fillna(method="bfill", axis=1))
coalesce_rules = ["Clean", "Info", "Name"]
df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].bfill(axis=1).iloc[:, 0])
Output:
Name Info Clean
0 Nike Nike, a footwear manufacturer is headquartered... Nike
1 ASG Shoes Reebok Reebok
2 Adidas None Adidas
It only needs around five seconds for 3 million rows. Obviously, I do not know whether this is the most efficient way to solve your problem, but I think it's an efficient one.
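As an alternative sketch on the same toy data (not the approach above, just another candidate worth benchmarking), the three rules can also be expressed with a row-wise startswith check and numpy.where:
import numpy as np

# True where Info is a string that begins with Name
starts = [info.startswith(name) if isinstance(info, str) else False
          for name, info in zip(df["Name"], df["Info"])]
df["Clean"] = np.where(starts, df["Name"], df["Info"].fillna(df["Name"]))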

Merging rows in a dataframe by ID

I have a large excel document of people who have had vaccinations.
I am trying to use python and pandas to work with this data to help work out who still needs further vaccinations and who does not.
I have happily imported the document into pandas as a dataframe
Each person has a unique ID in the dataframe
However, each vaccination has a separate row (rather than each person)
i.e. people who have had more than a single dose of vaccine have multiple rows in the document
I want to join all of the vaccinations together so that each person has a single row and all the vaccinations they have had are listed in their row.
ID  NAME  VACCINE  VACCINE DATE
0   JP    AZ       12/01/2021
1   PL    PF       13/01/2021
0   JP    MO       24/01/2021
1   PL    MO       24/01/2021
2   LK    AZ       12/01/2021
3   MN    AZ       12/01/2021
Should become:
ID  NAME  VACCINE  VACCINE DATE  VACCINE2  VACCINE2 DATE
0   JP    AZ       12/01/2021    MO        24/01/2021
1   PL    PF       13/01/2021    MO        24/01/2021
2   LK    AZ       12/01/2021
3   MN    AZ       12/01/2021
So I want to store all vaccine information for each individual in a single entry.
I have tried to use groupby to do this but it seems to entirely delete the ID field??
Am I using completely the wrong tool?
I don't want to resort to using a for loop to just iterate though every entry as this feels like very much the wrong way to accomplish the task.
old_df = pd.read_excel(filename, sheet_name="Report Data")
new_df = old_df.groupby(["PATIENT ID"]).ffill()
I am trying my best to use this as a way of teaching myself to use pandas but struggling to get anywhere so please forgive my novice level.
EDIT:
I have found this code:
s = raw_file.groupby('ID')['Vaccine Date'].apply(list)
new_file = pd.DataFrame(s.values.tolist(), index=s.index).add_prefix('Vaccine Date ').reset_index()
I modified this from what seemed to be a similar problem I found:
Python, Merging rows with same value in one column
Which seems to be doing part of what I want. It creates new columns for each vaccine date with a slightly adjusted column label. However, I cannot see a way to do this for both Vaccine date AND Vaccine brand at the same time and without losing all other data in the table.
I suppose I could just do it twice and then merge the outputs with the original dataframe to make a new complete dataframe but thought there might be a more elegant solution.
This is what I did for the same problem:
1. Separated the first vaccination row for every name from the dataframe and saved it as a new dataframe.
2. Created a third dataframe, which is the initial dataframe minus the new dataframe from (1).
3. Left-joined both dataframes and dropped NAs.
df = pd.read_csv('<your_dataset_name>.csv')
first_dataframe = df.drop_duplicates(subset='ID', keep='first')
data = df[~df.isin(first_dataframe)].dropna()
final = first_dataframe.merge(data, how='left', on=['ID', 'NAME'])
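For the "do both columns at once" variant the question's edit was after, here is a minimal sketch using cumcount plus pivot. It rebuilds the input table from above; the flattened column names ('VACCINE 2' rather than 'VACCINE2') are one arbitrary choice:
import pandas as pd

# the input table from the question
df = pd.DataFrame({'ID': [0, 1, 0, 1, 2, 3],
                   'NAME': ['JP', 'PL', 'JP', 'PL', 'LK', 'MN'],
                   'VACCINE': ['AZ', 'PF', 'MO', 'MO', 'AZ', 'AZ'],
                   'VACCINE DATE': ['12/01/2021', '13/01/2021', '24/01/2021',
                                    '24/01/2021', '12/01/2021', '12/01/2021']})
# number each person's vaccinations 1, 2, ...
df['N'] = df.groupby('ID').cumcount() + 1
# widen both VACCINE and VACCINE DATE in a single pivot
wide = df.pivot(index=['ID', 'NAME'], columns='N',
                values=['VACCINE', 'VACCINE DATE'])
# flatten the MultiIndex columns: ('VACCINE', 2) -> 'VACCINE 2'
wide.columns = [f'{col} {n}' if n > 1 else col for col, n in wide.columns]
wide = wide.reset_index()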

pandas dataframe - select rows that are similar

Is there a way to select rows that are 'similar', (NOT DUPLICATES!) in a pandas dataframe?
I have a dataframe that has columns including 'school_name' and 'district'.
I want to see if there are any schools that have similar names in different districts.
All I can think of is to select a random school name and manually check if similar names exist in the dataframe, by running something like:
df[df['school_name'].str.contains('english grammar')]
Is there a more efficient way to do this?
edit: I am ultimately going to string match this particular dataframe with another dataframe, on school_name column, while blocking on district column. The issue is that some district names in the two dataframes don't exactly match, due to district names changing over time - df1 is from year 2000 and it has districts in 'city-i', 'city-ii'... format, whereas df2, which is from year 2020, has districts in 'city-south', 'city-north' ... format. The names have changed over time, and the districts have been reshuffled to get merged / separated etc.
So I am trying to remove the 'i', 'ii', 'south' etc to just block on 'city'. Before I do that I want to make sure that there are no similar-sounding names in 'city-i' and 'city-south', (because I wouldn't want to match the similar sounding names together if the schools are actually in a completely different district).
Let's say this is your DataFrame.
df = pd.DataFrame({'school': ['High School',
                              'High School',
                              'Middle School',
                              'Middle School'],
                   'district': ['Main District',
                                'Unknown District',
                                'Unknown District',
                                'Some other District']})
Then you can use a combination of pandas.DataFrame.duplicated() and pandas.groupby():
(df[df.duplicated('school', keep=False)]
   .groupby('school')
   .apply(lambda x: ', '.join(x['district']))
   .to_frame('duplicated names by district'))
This returns a DataFrame which looks like this:
                        duplicated names by district
school
High School          Main District, Unknown District
Middle School  Unknown District, Some other District
To get all duplicated rows only, based on a column school_name:
df[df['school_name'].duplicated(keep=False)]
If you want to get rows for a single school_name value:
# if we are searching for one school, say 'S1'
name = "S1"
df.loc[df['school_name'] == name]
To select rows whose school_name appears in a given list, use isin:
dupl_list = ['S1', 'S2', 'S3']
df.loc[df['school_name'].isin(dupl_list)]
You can make two sets: one holds each school name seen so far, and the other collects any name encountered a second time while traversing the column.
school = set()
duplicate = set()
for i in df1['schl_name']:
    if i not in school:
        school.add(i)
    else:
        duplicate.add(i)
df1[df1['schl_name'].isin(duplicate)]
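All of the above catches exact duplicates only. For genuinely similar (but not identical) names, one option is a fuzzy pass with the standard library's difflib; a rough sketch on made-up toy data echoing the question's example (the 0.8 cutoff is an arbitrary assumption to tune):
from difflib import get_close_matches
import pandas as pd

df = pd.DataFrame({'school_name': ['english grammar school', 'english grammer school',
                                   'city high school'],
                   'district': ['city-i', 'city-south', 'city-ii']})
names = df['school_name'].unique().tolist()
for name in names:
    # compare each name against all the others
    matches = get_close_matches(name, [n for n in names if n != name], n=3, cutoff=0.8)
    if matches:
        print(name, '->', matches)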

Pandas: Merge two dataframes with duplicate rows

Short Question
Within Pandas, what's the most convenient way to merge two dataframes, such that all entries in the left dataframe receive the first matching value from the right dataframe?
Longer Question
Say I have two spreadsheets: people.csv and orders.csv. people.csv contains several columns of information on each person, whereas orders.csv contains the person's full name and the # of orders that person placed.
I need to create a third csv, output.csv, which contains all of the columns from people.csv plus a column from orders.csv, matched on a column present in both spreadsheets (called "FULL_NAME" in one, and "CUSTOMER_FULL_NAME" in the other).
people.csv is sorted on the FULL_NAME field, but contains duplicate rows, such that there are multiple rows with "John Smith" in the FULL_NAME column. There are also duplicate rows within orders.csv but not the same number of duplicates (for example, people.csv may have 4 John Smith entries, but orders.csv may only have two).
If I use the following code:
people = pd.read_csv('people.csv')
orders = pd.read_csv('orders.csv')
result = pd.merge(
    people,
    orders,
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
)
result.to_csv("output.csv")
... I get a CSV where only two of the rows with "John Smith" in the FULL_NAME field have John Smith's number of orders. The rows directly below have no value in that field. That's because orders.csv only contained two rows with matching values for John Smith, whereas people.csv had 4.
Is there a convenient method within Pandas to set the value of one column to the first matching value in the other dataframe, such that all 4 entries contain the first matching value from orders.csv?
EDIT
Here is the full current version of my script, which returns CSVs where some rows are not set with the expected values:
import pandas as pd
community = pd.read_csv("orders.csv")
full = pd.read_csv("people.csv")
result = pd.merge(
    full,
    community.drop_duplicates(subset=['FULL_NAME'], keep='first'),
    left_on="CUSTOMER_FULL_NAME",
    right_on="FULL_NAME",
    how='left',
)
result.to_csv("output.csv")
So I think I'm missing something else here, because some of the rows are matching in the expected way. Here's an example from the output file:
ID FULL_NAME EMPLOYER DIVISION ORDER #
7350 John Smith RiteAid Clinical Research 25
7351 John Smith RiteAid Clinical Research 25
7352 John Smith Costco Sales
7353 John Smith Costco Sales
These John Smith rows don't have a duplicate value within the orders.csv file, so I do think this is partly working, since two of the rows got it. However, I didn't get a match on the John Smith rows that list Costco rather than RiteAid (or other differing fields). This surprises me, since I thought the match was only on the FULL_NAME field.
Any ideas on why the other rows might not be filled in?
You can use drop_duplicates on subset=['CUSTOMER_FULL_NAME'] in the merge, together with how='left', to keep all rows from people, such as:
full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),  # here is the difference
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left',  # and add the how='left'
)
So orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') will contain each name only once, and during the merge the matching will use only that unique row.
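A quick toy run (the data below is made up to mirror the John Smith example above) shows every left-hand row picking up the first matching order count:
import pandas as pd

people = pd.DataFrame({'FULL_NAME': ['John Smith'] * 4,
                       'EMPLOYER': ['RiteAid', 'RiteAid', 'Costco', 'Costco']})
orders = pd.DataFrame({'CUSTOMER_FULL_NAME': ['John Smith', 'John Smith'],
                       'ORDER #': [25, 30]})
full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left',
)
print(full)  # all four John Smith rows get ORDER # 25, the first match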

Trying to parse string and create new columns in data frame in Python pandas

I have the following data frame.
Team      Opponent  Detail
Redskins  Rams      Kirk Cousins .... Penaltyon Bill Smith, Holding:10 yards
What I want to do is create THREE columns using pandas, which would give me the name (in this case Bill Smith), the type of infraction (Holding), and how much it cost the team (10 yards). So it would look like this:
Team      Opponent  Detail  Name        Infraction  Yards
Redskins  Rams      ...     Bill Smith  Holding     10 yards
I used some string manipulation to extract the fields, but don't know how to create the new columns. I have looked through some older posts but cannot seem to get it to work. Thanks!
Your function should return 3 values, such as...
def extract(r):
    # hard-coded slice positions that fit the sample row above
    return r[28:38], r[-8:], r[-16:-9]
First create empty columns:
df["Name"] = df["Infraction"] = df["Yards"] = ""
... and then cast the result of "apply" to a list:
df[["Name", "Infraction", "Yards"]] = list(df.Detail.apply(extract))
You might be interested in this more specific but more extended answer.
In order to create a new column, you can simply do:
your_df['new column'] = something
For example, imagine you want a new column that contains the first word of the column Detail:
# toy dataframe
my_df = pd.DataFrame.from_dict({'Team': ['Redskins'], 'Opponent': ['Rams'],
                                'Detail': ['Penaltyon Bill Smith, Holding:10 yards ']})
# apply a function that retrieves the first word
my_df['new_word'] = my_df.apply(lambda x: x.Detail.split(' ')[0], axis=1)
This creates a column that contains "Penaltyon".
Now, imagine I want two new columns, one for the first word and another for the second word. I can create a new dataframe with those two columns:
new_df = my_df.apply(lambda x: pd.Series({'first': x.Detail.split(' ')[0],
                                          'second': x.Detail.split(' ')[1]}), axis=1)
and now I simply have to concatenate the two dataframes:
pd.concat([my_df, new_df], axis=1)
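Since the goal is three columns at once, a pattern-based sketch with str.extract may be tidier. The regex below is an assumption fitted to the single sample row, not a general parser for penalty strings:
import pandas as pd

df = pd.DataFrame({'Team': ['Redskins'], 'Opponent': ['Rams'],
                   'Detail': ['Kirk Cousins .... Penaltyon Bill Smith, Holding:10 yards']})
# named groups become the new columns
pattern = r'Penaltyon (?P<Name>[^,]+), (?P<Infraction>[^:]+):(?P<Yards>\d+ yards)'
df[['Name', 'Infraction', 'Yards']] = df['Detail'].str.extract(pattern)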
