I am trying to clean up typos in this dataset.
Database of employee names
First Last Location
John Smith Calgary
John Smith Toronto
Joh Smith Toronto
Steph Sax Vancouver
Steph Sa Vancouver
Victor Jones Toronto
Stacy Lee Markham
Stac Lee Markham
Stacy Lee Calgary
There are some typos in the first and last name columns. I tried to create a unique identifier and use a groupby statement to isolate likely typos.
I think likely typos would fall under this category:
Match
if ID1 matches another ID1 and ID2 doesn't match
if ID2 matches another ID2 and ID1 doesn't match
This is my desired dataset of likely typos
First Last Location ID1 ID2
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
This is the code I tried so far
df["ID1"] = df["First"] + df["Location"]
df["ID2"] = df["Last"] + df["Location"]
m1 = df.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df[m1|m2]
EDIT: full code that isn't working. It isn't filtering for rows matching in ID2 but not in ID1, so it's not picking up Stacy Lee from Markham...
import pandas as pd

data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Sa', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Caglary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markahm', 'Calgary']}
# Create DataFrame
df10 = pd.DataFrame(data)
df10["ID1"] = df10["First"] + df10["Location"]
df10["ID2"] = df10["Last"] + df10["Location"]
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
The code works, I just had typos in the data...
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
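For reference, the typos were in the data itself: 'Caglary' and 'Markahm' in the Location list. 'Markahm' makes the two Markham IDs differ (LeeMarkham vs LeeMarkahm), so neither mask fires on the Stacy/Stac Lee rows. A minimal sketch of the corrected input, assuming the intended spellings are Calgary and Markham:
data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Sa', 'Jones', 'Lee', 'Lee', 'Lee'],
        # 'Caglary' -> 'Calgary' and 'Markahm' -> 'Markham'
        'Location': ['Calgary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary']}
df10 = pd.DataFrame(data)
With this input, the same m1/m2 masks return exactly the six rows of the desired dataset shown earlier.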
Related
I have a huge dataframe, and there are many typos in the spelling of names.
This is the dataframe I've been working on:
First Last Location ID1 ID2
John Smith Calgary JohnCalgary SmithCalgary
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Victor Jones Toronto VictorToronto JonesToronto
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
Stacy Lee Calgary StacyCalgary LeeCalgary
This is the code
data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Saxe', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Caglary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary'],
        'ID1': ['JohnCalgary', 'JohnToronto', 'JohToronto', 'StephVancouver', 'StephVancouver', 'VictorToronto', 'StacyMarkham', 'StacMarkham', 'StacyCalgary'],
        'ID2': ['SmithCalgary', 'SmithToronto', 'SmithToronto', 'SaxeVancouver', 'SaVancouver', 'JonesToronto', 'LeeMarkham', 'LeeMarkham', 'LeeCalgary']
        }
Even after doing a groupby using ID1 and ID2, where ID2 matches another ID2 and the first name is different (and vice versa), there are still so many typos that I need to filter more.
How can I make it so that rows are kept where
ID1 - matches another ID1 and the Last3 values are the same
ID2 - matches another ID2 and the First3 values are the same
Desired, narrowed down dataset
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
This is what I was trying so far
# First3/Last3 hold the first three letters of each name
df10['First3'] = df10['First'].str[:3]
df10['Last3'] = df10['Last'].str[:3]
m1 = df10.groupby(['ID1', 'Last3'])['ID2'].transform('nunique').gt(1)
m2 = df10.groupby(['ID2', 'First3'])['ID1'].transform('nunique').gt(1)
out10 = df10[m1 | m2]
It has been some time since I last worked in Python.
I have the below data frame, with too many columns to name.
last/first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
Rogers Dave Toronto A4 HR
How do I remove caps in the last/first column and also split that column on " "?
Goal:
last first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
rogers dave Toronto A4 HR
IIUC, you could use str.lower and str.split:
df[['last', 'first']] = (df.pop('last/first')
.str.lower()
.str.split(n=1, expand=True)
)
output:
location job department last first
0 Vancouver A1 servers smith john
1 Toronto A2 eng rogers steve
2 Toronto A4 HR rogers dave
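The goal table also puts last and first at the front; if the column order matters, a small follow-up (a sketch using the column names above) reorders them:
df = df[['last', 'first', 'location', 'job', 'department']]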
I am playing around with some NFL data and I have a column in a dataframe that looks like:
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Russell WilsonR. Wilson SEA
There are 3 bits of information in each cell - FullName, ShortName and Team - which I am hoping to create new columns for.
Expected output:
FullName ShortName Team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Russell Wilson R. Wilson SEA
I've managed to get the Team, but I'm not quite sure how to do all three in one line.
I was thinking of splitting the string by finding the character immediately before the full stop; however, there are some names such as:
Anthony McFarland Jr.A. McFarland PIT
which have multiple full stops.
Anyone have an idea of the best way to approach this? Thanks!
The pandas Series str.extract method is what you're looking for. This regex works for all of the cases you've presented, though there may be some other edge cases.
df = pd.DataFrame({
"bad_col": ["Lamar JacksonL. Jackson BAL", "Patrick Mahomes IIP. Mahomes KC",
"Dak PrescottD. Prescott DAL", "Josh AllenJ. Allen BUF",
"Josh AllenJ. Allen SEA", "Anthony McFarland Jr.A. McFarland PIT"],
})
print(df)
bad_col
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Josh AllenJ. Allen SEA
5 Anthony McFarland Jr.A. McFarland PIT
pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
new_df = df["bad_col"].str.extract(pattern, expand=True)
print(new_df)
full_name short_name team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Josh Allen J. Allen SEA
5 Anthony McFarland Jr. A. McFarland PIT
Breaking down that regex:
(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)
(?P<full_name>.+)(?=[A-Z]\.)
captures any characters UNTIL we see a capital letter followed by a full stop/period. We use a lookahead (?=...) so as not to consume the capital letter and full stop, because this part of the string belongs to the short name.
(?P<short_name>[A-Z]\.\s.*)\s
captures a capital letter (the player's first initial), then a full stop (the period after the initial), then a space, then everything up to the final space (the player's last name). The final space is not included in the capture group.
(?P<team>[A-Z]+)
captures all of the remaining capital letters in the string (which ends up being the player's team)
You've probably noticed that I've used named capture groups, as denoted by the (?P<name>pattern) structure. In pandas, the name of the capture group becomes the name of the column, and whatever is captured in that group becomes the values in that column.
Now to join the new dataframe back to our original one to come full circle:
df = df.join(new_df)
print(df)
bad_col full_name short_name \
0 Lamar JacksonL. Jackson BAL Lamar Jackson L. Jackson
1 Patrick Mahomes IIP. Mahomes KC Patrick Mahomes II P. Mahomes
2 Dak PrescottD. Prescott DAL Dak Prescott D. Prescott
3 Josh AllenJ. Allen BUF Josh Allen J. Allen
4 Josh AllenJ. Allen SEA Josh Allen J. Allen
5 Anthony McFarland Jr.A. McFarland PIT Anthony McFarland Jr. A. McFarland
team
0 BAL
1 KC
2 DAL
3 BUF
4 SEA
5 PIT
My guess is that short names would not contain any full stops after the initial, so you can search for the first full stop from the end of the line. The short name then runs from one character before that full stop up to the space before the team code; everything before that character is the FullName.
This might help.
import re
name = 'Anthony McFarland Jr.A. McFarland PIT'
short_name = re.findall(r'(\w\.\s[\w]+)\s[\w]{3}', name)[0]
full_name = name.replace(short_name, "")[:-4]
team = name[-3:]
print(short_name)
print(full_name)
print(team)
Output:
A. McFarland
Anthony McFarland Jr.
PIT
import pandas as pd
import numpy as np
df = pd.DataFrame({'players':['Lamar JacksonL. Jackson BAL', 'Patrick Mahomes IIP. Mahomes KC',
'Anthony McFarland Jr.A. McFarland PIT']})
def splitName(name):
    # position of the last '.' in the string; the short name's initial sits one character before it
    last_period_pos = np.max(np.where(np.array(list(name)) == '.'))
    full_name = name[:(last_period_pos - 1)]
    short_name_team = name[(last_period_pos - 1):]
    # the last space separates the short name from the team code
    team_pos = np.max(np.where(np.array(list(short_name_team)) == ' '))
    short_name = short_name_team[:team_pos]
    team = short_name_team[(team_pos + 1):]
    return full_name, short_name, team
df['full_name'], df['short_name'], df['team'] = zip(*df.players.apply(splitName))
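Applied to the sample frame, splitName returns ('Lamar Jackson', 'L. Jackson', 'BAL'), ('Patrick Mahomes II', 'P. Mahomes', 'KC') and ('Anthony McFarland Jr.', 'A. McFarland', 'PIT'); zip(*...) then transposes that list of tuples into the three new columns.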
I have two pandas dfs with the exact same column names. One of these columns is named id_number, which is unique within each table (what I mean is that an id_number can only appear once in each df). I want to find all records that have the same id_number but at least one different value in any column, and store these records in a new pandas df.
I've tried merging (more specifically an inner join), but it keeps only one record with each id_number, so I can't look for any differences between the two dfs.
Let me provide some example to provide a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!
Here is one way using nunique: we pick the id_number values with more than one unique entry in any column and slice them out.
s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(axis=1).loc[lambda x: x].index)]
s
Out[654]:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto
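For readability, the same selection can be unpacked into steps (a sketch equivalent to the one-liner above):
s = pd.concat([df1, df2])
# id_numbers for which any column holds more than one distinct value
changed = s.groupby('id_number').nunique().gt(1).any(axis=1)
out = s[s['id_number'].isin(changed[changed].index)]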
Here is a way using pd.concat, drop_duplicates and duplicated:
pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
.loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto
I am looking for an efficient way to select matching rows in two dataframes based on a shared key value, upsert these into a new dataframe covering the intersection of the two, and use that dataframe to compare them and map out their differences.
Example:
DataFrame1
FirstName, City
Mark, London
Mary, Dallas
Abi, Madrid
Eve, Paris
Robin, New York
DataFrame2
FirstName, City
Mark, Berlin
Abi, Delhi
Eve, Paris
Mary, Dallas
Francis, Rome
In the dataframes, I have potential matches/overlaps on FirstName, so the intersection on these is:
Mark, Mary, Abi, Eve
excluded from the join are:
Robin, Francis
I construct a dataframe that allows values from both to be compared:
DataFrameMatch
FirstName_1, FirstName_2, FirstName_Match, City_1, City_2, City_Match
And insert/update (upsert) so that my output is:
DataFrameMatch
FirstName_1 FirstName_2 FirstName_Match City_1 City_2 City_Match
Mark Mark True London Berlin False
Abi Abi True Madrid Delhi False
Mary Mary True Dallas Dallas True
Eve Eve True Paris Paris True
I can then report on the difference between the two lists, and what particular fields are different.
merge
According to your output, you only want rows where 'FirstName' matches. You then want another column that evaluates whether the cities match.
d1.merge(d2, on='FirstName', suffixes=['_1', '_2']).eval('City_Match = City_1 == City_2')
FirstName City_1 City_2 City_Match
0 Mark London Berlin False
1 Mary Dallas Dallas True
2 Abi Madrid Delhi False
3 Eve Paris Paris True
Details
You could do a simple merge and end up with
FirstName City
0 Mary Dallas
1 Eve Paris
That takes all common columns by default, so I had to restrict the join keys via the on argument, hence on='FirstName':
d1.merge(d2, on='FirstName')
FirstName City_x City_y
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
That gets us closer, but now I want to adjust those suffixes:
d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
FirstName City_1 City_2
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Lastly, I'll add a new column that shows the evaluation of 'City_1' being equal to 'City_2'. I chose to use pandas.DataFrame.eval; you can see the results above.
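If you want the exact layout from the question, with FirstName_1/FirstName_2 columns and a FirstName_Match flag, here is a sketch building on the same merge; note that the flag is trivially True, since the merge key is identical on both sides:
out = d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
out['FirstName_1'] = out['FirstName']
out['FirstName_2'] = out['FirstName']
out['FirstName_Match'] = True  # an inner merge on FirstName guarantees the names match
out['City_Match'] = out['City_1'] == out['City_2']
out = out[['FirstName_1', 'FirstName_2', 'FirstName_Match', 'City_1', 'City_2', 'City_Match']]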