I have two large datasets: df1 is about 1m rows and df2 is about 10m rows. I need to find matches for the rows in df1 from df2.
I posted an original version of this question separately (see here); it was well answered by @laurent, but I now have some added requirements. I would now like to:
1. Get the fuzz ratios for each of fname and lname as columns in my final matched dataframe.
2. Write the code such that the fuzz-ratio threshold for fname is set to >60, while the threshold for lname is set to >75. In other words, a true match occurs only if the fuzz ratio for fname is >60 and the fuzz ratio for lname is >75; a candidate with a fname ratio of 80 but an lname ratio of 60 is not a true match. While I understand this can be done as post-hoc filtering after (1), it makes sense to apply it at the matching stage, since the thresholds can change which candidate is matched.
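For illustration, the acceptance rule in (2) is just a conjunction of the two thresholds. A minimal sketch (is_true_match and its arguments are hypothetical names, not part of my code):
from fuzzywuzzy import fuzz

# Both thresholds must pass for a pair of names to count as a true match
def is_true_match(fname1, fname2, lname1, lname2):
    return fuzz.ratio(fname1, fname2) > 60 and fuzz.ratio(lname1, lname2) > 75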
Here is an example of my data. @laurent's solution to the original problem can be found in the link above.
import pandas as pd
df1 = pd.DataFrame(
{
"ein": {0: 1001, 1: 1500, 2: 3000},
"ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
}
)
df2 = pd.DataFrame(
{
"lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
"fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
"score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
}
)
Expected output is:
df3 = pd.DataFrame(
{
"df1_ein": {0: 1001, 1: 1500, 2: 3000},
"df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
"fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
"fuzz_ratio_fname: {0: 62, 1: 67, 2: NA},
"df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
"df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
"df2_score": {0: 2.5, 1: 3.5, 2: NA},
}
)
Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios I assigned; the better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches Tom Cruise rather than Thomas Cruz.
I am primarily a Stata user (haha), and the reclink2 ado file can in theory do the above, i.e. if Stata could handle the size of the data. With data of this size, however, nothing even starts after hours.
Here is one way to do it:
import pandas as pd
from fuzzywuzzy import fuzz
# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]
# For each df1 lname, find the best-scoring df2 lname and keep it only if its ratio clears 75
df1["fuzz_ratio_lname"] = (
df1["df1_lname"]
.apply(
lambda x: max(
[(value, fuzz.ratio(x, value)) for value in df2["lname"]],
key=lambda x: x[1],
)
)
.apply(lambda x: x if x[1] > 75 else pd.NA)
)
df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
# Attach the matched df2 row (fname and score) for each accepted lname match
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)
# Compute fname ratios against the matched df2 fname; keep only ratios > 60
df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA
# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
[
"df1_ein",
"df1_ein_name",
"df1_lname",
"df1_fname",
"fuzz_ratio_lname",
"fuzz_ratio_fname",
"df2_lname",
"df2_fname",
"score",
]
]
print(df1)
# Output
df1_ein df1_ein_name df1_lname df1_fname fuzz_ratio_lname \
0 1001 H for Humanity Cooper Bradley 83.0
1 1500 Labor Union Cruise Thomas 100.0
2 3000 Something something Pitt Brad NaN
fuzz_ratio_fname df2_lname df2_fname score
0 62.0 Couper M Brad 2.5
1 67.0 Cruise Tom 3.5
2 <NA> <NA> <NA> NaN
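Given the 1m x 10m sizes mentioned in the question, fuzzywuzzy's pure-Python scoring may still be slow. As a hedged side note, the rapidfuzz library exposes the same ratio scorer with a much faster core; a minimal sketch of the lname step (assuming the renamed df1 columns from the setup above; process.extractOne returns a (choice, score, index) tuple, or None when nothing clears the cutoff):
from rapidfuzz import fuzz, process

choices = df2["lname"].tolist()
# Best-scoring df2 lname per df1 row, or None if no candidate scores above 75
best = df1["df1_lname"].apply(
    lambda x: process.extractOne(x, choices, scorer=fuzz.ratio, score_cutoff=75)
)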
I have a simple dataframe that I would like to split into two, subject to some conditions.
Car    Year  Speed  Cond
BMW    2001  150    X
BMW    2000  150
Audi   1997  200    X
Audi   2000  200
Audi   2012  200
Fiat   2020  180
Mazda  2022  183
What I have to do is move the duplicates to another dataframe and leave only one row per car in my main dataframe.
Rows that are duplicated in the Car column should be separated into their own dataframe, but I don't need the ones that have X in the Cond column there.
In the main dataframe I would like to keep one row per car, and the row that stays should be the one that contains X in the Cond column.
I have this code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep = False)].loc[df['Cond']!='X']
I don't know how to leave only one row in the main dataframe, specifically the one that contains X in the Cond column.
Maybe it's possible to have one command that both removes those rows and selects them into another dataframe according to the rules above?
If I understand the desired logic correctly, you can use groupby.idxmax to select the first X per group if any (else the first row of the group) to keep in the main DataFrame. The rest goes into the other DataFrame (df2).
# get the index of the row with X if any, else of the first one per group
keep = df['Cond'].eq('X').groupby(df['Car']).idxmax()
# the remaining rows go into the other DataFrame
df2 = df.drop(keep)
# keep the selected rows (sort_index restores the original row order)
df = df.loc[keep].sort_index()
Output:
# updated df
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
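As an aside, the same split can be obtained by sorting so that the X rows come first within each car and then deduplicating. A hedged sketch (it relies on a stable sort, and on Cond containing only 'X' or NaN):
# 'X' sorts before NaN with na_position='last', so X rows lead each group
ordered = df.sort_values('Cond', na_position='last', kind='stable')
main = ordered.drop_duplicates('Car').sort_index()      # one row per car, X preferred
rest = ordered[ordered.duplicated('Car')].sort_index()  # everything else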
I am trying to create a new dataframe, new_df, with a new column containing the differences obtained by subtracting identical columns in two separate dataframes: df1 and df2.
I have tried new_df.loc['difference'] = df1.loc['s_values'] - df2.loc['s_values'], but I cannot achieve my result.
where df1 =
stats s_values
gender year
women 2007 height 40
2007 cigarette use 31
and df2 =
stats s_values
gender year
Men 2007 height 10
2007 cigarette use 11
Desired output (I do not want to include the gender index):
new_df =
stats difference
year
2007 height 30
2007 cigarette use 20
You can try this (full example):
Input:
import pandas as pd
df1 = pd.DataFrame({'gender': {0: 'woman', 1: 'woman'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 40, 1: 31}})
df2 = pd.DataFrame({'gender': {0: 'men', 1: 'men'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 10, 1: 11}})
Code:
df = pd.concat([df1,df2], ignore_index=True)
df['s_values'] = df.groupby(['year', 'stats'])['s_values'].diff().abs()
df.dropna(subset=['s_values']).drop('gender', axis=1)
Output:
year stats s_values
2 2007 height 30.0
3 2007 cigarette use 20.0
Note:
If both dataframes have a completely identical structure, it's even shorter:
df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
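For the sample frames above, that one-liner should produce something like:
   year          stats  s_values
0  2007         height        30
1  2007  cigarette use        20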
# Alternative answer: build the frame column by column, then fill the differences row by row
new_df = pd.DataFrame()
new_df["year"] = df1["year"]
new_df["stats"] = df1["stats"]
for i, (val1, val2) in enumerate(zip(df1["s_values"], df2["s_values"])):
    new_df.at[i, "difference"] = val1 - val2
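Side note: assigning cell by cell with .at inside a loop becomes slow on large frames. Assuming the two frames are row-aligned as above, the loop can be replaced by one vectorized subtraction:
new_df["difference"] = df1["s_values"].to_numpy() - df2["s_values"].to_numpy()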
This is my first post, so pardon any missing information. I have a dataset like the one below.
[dataset screenshot]
My expected final output should be like this:
[final output screenshot]
Basically, I would like to iterate over the discounted price and apply the last discounted price to the next year.
For example, in 2019 the NYC budget is $10,000 and the discount is 0.05, so the discounted price is $9,500. The next year the discount becomes 0.64, which should be applied to $9,500, giving $3,420; in 2021 the discount is 0.04, which leaves $3,283.2 as the final discounted price for NYC.
I need to code this in Python using a pandas DataFrame. I think I need a for loop with an if inside it, but I am struggling so far.
I really appreciate any help.
You could use groupby and apply a function that multiplies the (1 - Discount) factors together, giving the fraction of the budget that remains after all the years for a particular city. Then use first over the Budget column groups to get the first budget per group, with the rows indexed as in the original dataframe. This holds because "groupby preserves the order of rows within each group" is guaranteed behavior.
import pandas as pd
data = {'Year': {0: 2019, 1: 2020, 2: 2021, 3: 2019, 4: 2020, 5: 2019},
'City': {0: 'NYC', 1: 'NYC', 2: 'NYC', 3: 'Edison', 4: 'Edison', 5: 'Princeton'},
'Budget': {0: 10000, 1: 10000, 2: 10000, 3: 5000, 4: 5000, 5: 2000},
'Discount': {0: 0.05, 1: 0.64, 2: 0.04, 3: 0.35, 4: 0.06, 5: 0.45}}
df = pd.DataFrame(data)
g = df.groupby("City")
# fraction of the budget remaining after compounding all yearly discounts
disc_prod = g['Discount'].apply(lambda x: (1-x).prod())
# the budget is constant within a city, so take the first row per group
budget = g['Budget'].first()
result = disc_prod * budget
print(result)
City
Edison 3055.0
NYC 3283.2
Princeton 1100.0
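If you want the result back as a DataFrame rather than a Series, one option (the column name Final_Discounted_Price is just an illustrative choice) is:
print(result.rename("Final_Discounted_Price").reset_index())
#         City  Final_Discounted_Price
# 0     Edison                  3055.0
# 1        NYC                  3283.2
# 2  Princeton                  1100.0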
I have data for a certain country that gives the population of certain age groups as a time series. I am trying to multiply the female population numbers by -1 to display them on the other side of a pyramid graph. I have achieved that for one year, i.e. 1960 (see the code below). Now I want to achieve the same result for all the columns from 1960 to 2020.
PakPopulation.loc[PakPopulation['Gender']=="Female",['1960']]=PakPopulation['1960'].apply(lambda x:-x)
I have also tried the following solution, but with no luck:
PakPopulation.loc[PakPopulation['Gender']=="Female",[:,['1960':'2019']]=PakPopulation[:,['1960':'2019']].apply(lambda x:-x)
Schema:
Country  Age Group  Gender  1960   1961   1962
XYZ      0-4        Male    5880k  5887k  6998k
XYZ      0-4        Female  5980k  6887k  7998k
You could build a list of years and use that list as part of your selection:
import pandas as pd
PakPopulation = pd.DataFrame({
'Country': {0: 'XYZ', 1: 'ABC'},
'Age Group': {0: '0-4', 1: '0-4'},
'Gender': {0: 'Male', 1: 'Female'},
'1960': {0: 5880, 1: 5980},
'1961': {0: 5887, 1: 6887},
'1962': {0: 6998, 1: 7998},
})
start_year = 1960
end_year = 1962
years_lst = list(map(str, range(start_year, end_year + 1)))
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] = \
PakPopulation[years_lst].apply(lambda x: -x)
print(PakPopulation)
Output:
Country Age Group Gender 1960 1961 1962
0 XYZ 0-4 Male 5880 5887 6998
1 ABC 0-4 Female -5980 -6887 -7998
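Since negating is just multiplying by -1, an equivalent in-place variant of the same selection (reusing the years_lst built above) is:
PakPopulation.loc[PakPopulation['Gender'] == 'Female', years_lst] *= -1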
I have two non-indexed dataframes, as follows:
df1
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Maria Murphy 30/03/1989
Seth Black 21/06/1991
and df2
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Seth Black 21/06/1991
Joe Maher 28/09/1990
Debby White 03/01/1992
I want to get a data delta, where only the records that are in df2 and not in df1 appear, i.e.:
Joe Maher 28/09/1990
Debby White 03/01/1992
Is there a way to achieve this?
I tried an inner join, but I couldn't find a way to subtract it from df2.
Any help is much appreciated.
You can use a list comprehension together with join to create a unique key for each row of each table, consisting of the first name, last name, and the date field (I assumed date of birth). Each field needs to be converted to a string if it is not one already.
You then use another list comprehension together with enumerate to get the index location of each key in key2 that is not also in key1.
Finally, use iloc to get all rows in df2 based on the indexing from the previous step.
import pandas as pd

df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
key1 = ["".join([first, last, dob])
for first, last, dob in zip(df1.First, df1.Last, df1.dob)]
key2 = ["".join([first, last, dob])
for first, last, dob in zip(df2.First, df2.Last, df2.dob)]
idx = [n for n, k in enumerate(key2)
if k not in key1]
>>> df2.iloc[idx, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
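A different way to get the same "in df2 but not in df1" delta, without building string keys, is a left anti-join using merge with its indicator flag; a sketch under the same column assumptions:
delta = (
    df2.merge(df1, on=["First", "Last", "dob"], how="left", indicator=True)
       .query('_merge == "left_only"')
       .drop(columns="_merge")
)
print(delta)
#    First   Last         dob
# 3    Joe  Maher  28/09/1990
# 4  Debby  White  03/01/1992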
Assuming you do not have any other columns in your dataframe, you could use drop_duplicates as suggested by @SebastianWozny. However, you need to select only the new rows added (not those from df1). You can do that as follows (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
>>> pd.concat([df1, df2]).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
You can concatenate the two frames and use drop_duplicates to get the unique rows, then, as suggested by @Alexander, use iloc to get the rows you want:
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
>>> pd.concat([df1, df2]).drop_duplicates()
First Last dob
0 John Mullen 12/08/1993
1 Lisa Bush 06/12/1990
2 Maria Murphy 30/03/1989
3 Seth Black 21/06/1991
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
>>> pd.concat([df1, df2]).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992