Need help to clean data - Python multiple groupby

I have a huge dataframe and there are many typos in the spelling of names.
This is the dataframe I have been working on:
First Last Location ID1 ID2
John Smith Calgary JohnCalgary SmithCalgary
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Victor Jones Toronto VictorToronto JonesToronto
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
Stacy Lee Calgary StacyCalgary LeeCalgary
This is the code:
import pandas as pd

data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Saxe', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Caglary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary'],
        'ID1': ['JohnCalgary', 'JohnToronto', 'JohToronto', 'StephVancouver', 'StephVancouver', 'VictorToronto', 'StacyMarkham', 'StacMarkham', 'StacyCalgary'],
        'ID2': ['SmithCalgary', 'SmithToronto', 'SmithToronto', 'SaxeVancouver', 'SaVancouver', 'JonesToronto', 'LeeMarkham', 'LeeMarkham', 'LeeCalgary']}
df10 = pd.DataFrame(data)
Even after trying a groupby using ID1 and ID2, flagging rows where ID2 matches another ID2 but the first name is different (and vice versa), there are still so many typos that I need to filter further.
How can I make it so that:
ID1 matches another ID1 and the Last3 values are the same
ID2 matches another ID2 and the First3 values are the same
Desired, narrowed-down dataset:
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
This is what I have tried so far:
df10['First3'] = df10['First'].str[:3]
df10['Last3'] = df10['Last'].str[:3]
m1 = df10.groupby(['ID1', 'Last3'])['ID2'].transform('nunique').gt(1)
m2 = df10.groupby(['ID2', 'First3'])['ID1'].transform('nunique').gt(1)
out10 = df10[m1 | m2]

Related

After updating a df from a filtered df, old values reappear

I have a df that I need to filter into a new df, work on, and then use to update the original df. For example:
Street                                   City         State  Zip
4210 Nw Lake Dr                          Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego  Ca - 92131
1124 Ethel St                            Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart          In-46514
My intended output is:
Street                                   City         State  Zip
4210 Nw Lake Dr                          Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego               Ca     92131
1124 Ethel St                            Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart                       In     46514
So first I filtered the original dataframe into a new df and worked on it with the following code:
Filter3_df = Final[Final['State'].isnull()].copy()  # work on an explicit copy, avoids SettingWithCopyWarning
Filter3_df['temp'] = Filter3_df['City'].str.extract('([A-Za-z]+)')
mask2 = Filter3_df['temp'].notnull()
Filter3_df.loc[mask2, 'Zip'] = Filter3_df.loc[mask2, 'City'].str[-5:]
Filter3_df.loc[mask2, 'State'] = Filter3_df.loc[mask2, 'temp']
del Filter3_df['temp']
Filter3_df['City'] = float('NaN')
After this, the table for Filter3_df looks like this:
Street                                   City  State  Zip
9810 Scripps Lake Dr. Suite A San Diego  NaN   Ca     92131
4000 E Bristol St Ste 3 Elkhart          NaN   In     46514
But when I update this filtered df back into the original df using
Final.update(Filter3_df)
I am not getting the intended output. Instead I am getting:
Street                                   City         State  Zip
4210 Nw Lake Dr                          Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego  Ca - 92131   Ca     92131
1124 Ethel St                            Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart          In-46514     In     46514
Kindly let me know where I am going wrong.
From the docs, pandas.DataFrame.update:
Modify in place using non-NA values from another DataFrame.
Replace Filter3_df['City'] = float('NaN'), which is NA for floats, with the value you really want:
Filter3_df['City'] = ""

Matching columns using if statements

I am trying to clean up typos in this dataset.
Database of employee names
First Last Location
John Smith Calgary
John Smith Toronto
Joh Smith Toronto
Steph Sax Vancouver
Steph Sa Vancouver
Victor Jones Toronto
Stacy Lee Markham
Stac Lee Markham
Stacy Lee Calgary
There are some typos in the first and last name columns. I tried to create a unique identifier and use a groupby statement to isolate likely typos.
Likely typos, I think, would fall under these match conditions:
if ID1 matches another ID1 and ID2 doesn't match
if ID2 matches another ID2 and ID1 doesn't match
This is my desired dataset of likely typos
First Last Location ID1 ID2
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
This is the code I tried so far
df["ID1"] = df["First"] + df["Location"]
df["ID2"] = df["Last"] + df["Location"]
m1 = df.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df[m1|m2]
EDIT: full code that isn't working. It isn't filtering for rows that match on ID2 but not on ID1; it's not picking up Stacy Lee from Markham...
data = {'First':['John', 'John', 'Joh', 'Steph','Steph','Victor','Stacy','Stac','Stacy'],
'Last':['Smith','Smith','Smith','Sax','Sa','Jones','Lee','Lee','Lee'],
'Location':['Caglary','Toronto','Toronto','Vancouver','Vancouver','Toronto','Markham','Markahm','Calgary']}
# Create DataFrame
df10 = pd.DataFrame(data)
df10["ID1"] = df10["First"] +df10["Location"]
df10["ID2"] = df10["Last"] + df10["Location"]
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
The code works,
I just had typos in my data ('Caglary' and 'Markahm' above)...
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
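As a quick check, here is a minimal sketch with those two location typos corrected; it reproduces the six desired rows:
import pandas as pd

data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Sa', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Calgary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary']}
df10 = pd.DataFrame(data)
df10['ID1'] = df10['First'] + df10['Location']
df10['ID2'] = df10['Last'] + df10['Location']
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)  # same ID1, different ID2
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)  # same ID2, different ID1
print(df10[m1 | m2])  # the six likely-typo rows from the desired dataset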

Data Cleaning: How to split a Pandas column

It has been some time since I last worked in Python.
I have the data frame below, with too many columns to name.
last/first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
Rogers Dave Toronto A4 HR
How do I remove caps in the last/first column, and also split the last/first column on " "?
Goal:
last first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
rogers dave Toronto A4 HR
IIUC, you could use str.lower and str.split:
df[['last', 'first']] = (df.pop('last/first')
.str.lower()
.str.split(n=1, expand=True)
)
output:
location job department last first
0 Vancouver A1 servers smith john
1 Toronto A2 eng rogers steve
2 Toronto A4 HR rogers dave
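For a self-contained check, the frame can be rebuilt from the table above (a minimal sketch, keeping only the columns shown):
import pandas as pd

df = pd.DataFrame({'last/first': ['smith john', 'rogers steve', 'Rogers Dave'],
                   'location': ['Vancouver', 'Toronto', 'Toronto'],
                   'job': ['A1', 'A2', 'A4'],
                   'department': ['servers', 'eng', 'HR']})
# pop drops last/first from df; n=1 splits only on the first space
df[['last', 'first']] = df.pop('last/first').str.lower().str.split(n=1, expand=True)
print(df)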

Remove unwanted parts from strings in Dataframe

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
My dataframe:
Passengers
1 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
2 Sally Muller, President, Mark Smith, Vicepresident
3 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 Sally Muller, President, John Doe, Chief of Staff, Peter Parker, Special Effects, Lydia Johnson, Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with an endless hand-made copy/paste list of regexes:
df = df.replace(r'President,','', regex=True)
df = df.replace(r'Vicepresident,','', regex=True)
df = df.replace(r'Chief of Staff,','', regex=True)
df = df.replace(r'Special Effects,','', regex=True)
df = df.replace(r'Vice Chief of Staff,','', regex=True)
...
Is there a more comfortable way to do this?
Edit
More accurate example of original df:
Passengers
1 Sally Muller, President, EL Mark Smith, John Doe, Chief of Staff, Peter Gordon, Director of Central Command
2 Sally Muller, President, EL Mark Smith, Vicepresident
3 Sally Muller, President, EL Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Gordon, Dir CC
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 President Sally Muller, John Doe Chief of Staff, Peter Parker, Special Effects, Lydia Johnson , Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe, Peter Gordon
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with an endless hand-made copy/paste list of regexes:
df = df.replace(r'President','', regex=True)
df = df.replace(r'Director of Central Command,','', regex=True)
df = df.replace(r'Dir CC','', regex=True)
df = df.replace(r'Vicepresident','', regex=True)
df = df.replace(r'Chief of Staff','', regex=True)
df = df.replace(r'Special Effects','', regex=True)
df = df.replace(r'Vice Chief of Staff','', regex=True)
...
The messy output looks like:
Passengers
1 Sally Muller, , Mark Smith, John Doe, , Peter Gordon,
2 Sally Muller, Mark Smith,
3 Sally Muller, Mark Smith,, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller,, John Doe, Peter Parker , Lydia Johnson,
...
If every passenger has their title, then you can use str.split + explode, then select every second item starting from the first item, then groupby the index and join back:
out = df['Passengers'].str.split(', ').explode()[::2].groupby(level=0).agg(', '.join)
or str.split + explode and apply a lambda that does the selection + join
out = df['Passengers'].str.split(', ').apply(lambda x: ', '.join(x[::2]))
Output:
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
Edit:
If not everyone has a title, then you can create a set of titles, split, and filter the titles out. If the order of the names doesn't matter in each row, then you can use set difference and cast each set to a list in a list comprehension:
titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}
out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])
If order matters, then you can use a nested list comprehension:
out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
This is one case where apply is actually faster than explode:
df2 = df['Passengers'].apply(lambda x: ', '.join(x.split(', ')[::2])) #.to_frame() # if dataframe needed
output:
Passengers
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia Jo...
We can create a single regex pattern that matches every string you need to remove and replace. This can handle situations where a passenger has no title.
df2 = (df['Passengers']
       .str.replace("(President)|(Vicepresident)|(Chief of Staff)|(Special Effects)|(Vice Chief of Staff)", "", regex=True)
       .replace("( ,)", "", regex=True)
       .str.strip()
       .str.rstrip(","))
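If the title list keeps growing, one option (a sketch, not part of the original answer) is to build the alternation from a plain Python list, escaping each title with re.escape and putting longer titles first so that 'Vice Chief of Staff' is matched before 'Chief of Staff':
import re

titles = ['President', 'Vicepresident', 'Chief of Staff',
          'Special Effects', 'Vice Chief of Staff']
# longest first, so the longer title wins where one contains another
pattern = '|'.join(map(re.escape, sorted(titles, key=len, reverse=True)))
df2 = (df['Passengers']
       .str.replace(pattern, '', regex=True)                # drop the titles
       .str.replace(r'\s*,(?:\s*,)*\s*', ', ', regex=True)  # collapse leftover commas
       .str.strip(' ,'))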

How to find records with same value in one column but different value in another column

I have two pandas dfs with exactly the same column names. One of these columns is id_number, which is unique within each table (what I mean is that an id_number can only appear once in each df). I want to find all records that have the same id_number but at least one different value in any column, and store these records in a new pandas df.
I've tried merging (more specifically an inner join), but that keeps only one record per id_number, so I can't look for differences between the two dfs.
Let me provide some example to provide a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!
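For reference, the example dfs can be rebuilt like this (a minimal sketch taken directly from the tables above):
import pandas as pd

df1 = pd.DataFrame({'id_number': [1, 2, 3, 4, 5],
                    'name': ['John', 'Alex', 'Tyler', 'David', 'Chloe'],
                    'type': ['dev', 'dev', 'dev', 'dev', 'dev'],
                    'city': ['Toronto'] * 5})
df2 = pd.DataFrame({'id_number': [1, 2, 4, 5, 6],
                    'name': ['John', 'Alex', 'David', 'Chloe', 'Kyle'],
                    'type': ['boss', 'dev', 'boss', 'dev', 'dev'],
                    'city': ['Vancouver', 'Vancouver', 'Toronto', 'Toronto', 'Vancouver']})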
Here is one way using nunique: we pick the id_number values where any column has more than one unique entry, then slice them out.
s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(axis=1).loc[lambda x: x].index)]
s
Out[654]:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto
Here is a way using pd.concat, drop_duplicates and duplicated: drop_duplicates(keep=False) first drops the rows that are identical in both dfs (id_number 5 here), then duplicated(keep=False) keeps only the id_numbers that still appear twice:
pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
.loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto
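As a variant on the merge idea from the question (a sketch, not one of the answers above, reusing df1/df2 from the snippet earlier): an inner merge on id_number with suffixes keeps both versions of every column, so rows can be compared side by side and only the changed ids kept:
merged = df1.merge(df2, on='id_number', suffixes=('_1', '_2'))
cols = [c for c in df1.columns if c != 'id_number']
# rows where any column differs between the two frames
changed = merged[(merged[[c + '_1' for c in cols]].to_numpy()
                  != merged[[c + '_2' for c in cols]].to_numpy()).any(axis=1)]
ids = changed['id_number']
out = pd.concat([df1[df1['id_number'].isin(ids)],
                 df2[df2['id_number'].isin(ids)]]).sort_values('id_number')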
