It has been some time since I last worked in Python.
I have the data frame below, with too many columns to name them all:
last/first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
Rogers Dave Toronto A4 HR
How do I remove caps in the last/first column, and also split the last/first column on " "?
Goal:
last first location job department
smith john Vancouver A1 servers
rogers steve Toronto A2 eng
rogers dave Toronto A4 HR
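For reference, here is one way the input frame could be built, transcribed directly from the table above:

import pandas as pd

df = pd.DataFrame({
    'last/first': ['smith john', 'rogers steve', 'Rogers Dave'],
    'location': ['Vancouver', 'Toronto', 'Toronto'],
    'job': ['A1', 'A2', 'A4'],
    'department': ['servers', 'eng', 'HR'],
})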
IIUC, you could use str.lower and str.split (n=1 limits the split to the first space, so exactly two columns come back, and pop drops the original column in the same step):
df[['last', 'first']] = (df.pop('last/first')
.str.lower()
.str.split(n=1, expand=True)
)
output:
location job department last first
0 Vancouver A1 servers smith john
1 Toronto A2 eng rogers steve
2 Toronto A4 HR rogers dave
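The two new columns land at the end of the frame; if the column order from the goal matters, a plain reindex afterwards restores it:

df = df[['last', 'first', 'location', 'job', 'department']]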
Related
I have a huge dataframe and there are many typos in the spelling of names.
This is the dataframe I have been working on:
First Last Location ID1 ID2
John Smith Calgary JohnCalgary SmithCalgary
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Victor Jones Toronto VictorToronto JonesToronto
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
Stacy Lee Calgary StacyCalgary LeeCalgary
This is the code:
data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Saxe', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Caglary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary'],
        'ID1': ['JohnCalgary', 'JohnToronto', 'JohToronto', 'StephVancouver', 'StephVancouver', 'VictorToronto', 'StacyMarkham', 'StacMarkham', 'StacyCalgary'],
        'ID2': ['SmithCalgary', 'SmithToronto', 'SmithToronto', 'SaxeVancouver', 'SaVancouver', 'JonesToronto', 'LeeMarkham', 'LeeMarkham', 'LeeCalgary']
        }
Even after a groupby using ID1 and ID2, keeping rows where ID2 matches another ID2 while the first name differs, and vice versa, there are still so many typos left that I need to filter further.
How can I make it so that:
ID1 matches another ID1 and the Last3 values are the same
ID2 matches another ID2 and the First3 values are the same
Desired, narrowed-down dataset:
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
This is what I was trying so far:
m1 = df10.groupby(['ID1', 'Last3'])['ID2'].transform('nunique').gt(1)
m2 = df10.groupby(['ID2', 'First3'])['ID1'].transform('nunique').gt(1)
out10 = df10[m1 | m2]
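For what it's worth, here is a minimal end-to-end sketch of that two-key idea, assuming First3 and Last3 are simply the first three characters of First and Last (an assumption; the question never defines them):

import pandas as pd

df10 = pd.DataFrame(data)  # the data dict from above
df10['First3'] = df10['First'].str[:3]
df10['Last3'] = df10['Last'].str[:3]

# same first-name/location key and same Last3, but more than one distinct last-name key
m1 = df10.groupby(['ID1', 'Last3'])['ID2'].transform('nunique').gt(1)
# same last-name/location key and same First3, but more than one distinct first-name key
m2 = df10.groupby(['ID2', 'First3'])['ID1'].transform('nunique').gt(1)
out10 = df10[m1 | m2]

On the sample data this keeps the John/Joh Toronto, Steph Vancouver and Stacy/Stac Markham rows, matching the narrowed-down dataset above.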
I am trying to clean up typos in this dataset.
Database of employee names
First Last Location
John Smith Calgary
John Smith Toronto
Joh Smith Toronto
Steph Sax Vancouver
Steph Sa Vancouver
Victor Jones Toronto
Stacy Lee Markham
Stac Lee Markham
Stacy Lee Calgary
There are some typos in the first and last name columns. I tried to create a unique identifier and use a groupby statement to isolate likely typos.
Likely typos, I think, would fall under this category:
Match:
if ID1 matches another ID1 and ID2 doesn't match
if ID2 matches another ID2 and ID1 doesn't match
This is my desired dataset of likely typos
First Last Location ID1 ID2
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
This is the code I tried so far:
df["ID1"] = df["First"] + df["Location"]  # first name + location key
df["ID2"] = df["Last"] + df["Location"]   # last name + location key
m1 = df.groupby('ID1')['ID2'].transform('nunique').gt(1)  # same ID1, more than one ID2
m2 = df.groupby('ID2')['ID1'].transform('nunique').gt(1)  # same ID2, more than one ID1
out = df[m1 | m2]
EDIT: full code that isn't working. It isn't filtering for rows that match on ID2 but differ on ID1; it's not picking up Stacy Lee from Markham...
data = {'First':['John', 'John', 'Joh', 'Steph','Steph','Victor','Stacy','Stac','Stacy'],
'Last':['Smith','Smith','Smith','Sax','Sa','Jones','Lee','Lee','Lee'],
'Location':['Caglary','Toronto','Toronto','Vancouver','Vancouver','Toronto','Markham','Markahm','Calgary']}
# Create DataFrame
df10 = pd.DataFrame(data)
df10["ID1"] = df10["First"] + df10["Location"]
df10["ID2"] = df10["Last"] + df10["Location"]
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
Code works,
I just had typos...
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1|m2]
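As a closing note: the masks were fine all along; presumably the 'Markahm' (and 'Caglary') spellings in the EDIT's data dict were breaking the ID keys, since 'LeeMarkahm' never matches 'LeeMarkham'. With the locations spelled consistently, the same code reproduces the desired dataset:

data = {'First': ['John', 'John', 'Joh', 'Steph', 'Steph', 'Victor', 'Stacy', 'Stac', 'Stacy'],
        'Last': ['Smith', 'Smith', 'Smith', 'Sax', 'Sa', 'Jones', 'Lee', 'Lee', 'Lee'],
        'Location': ['Calgary', 'Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Toronto', 'Markham', 'Markham', 'Calgary']}
df10 = pd.DataFrame(data)
df10["ID1"] = df10["First"] + df10["Location"]
df10["ID2"] = df10["Last"] + df10["Location"]
m1 = df10.groupby('ID1')['ID2'].transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1'].transform('nunique').gt(1)
out = df10[m1 | m2]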
   House Number  Street        First Name  Surname   Age  Relationship to Head of House  Marital Status  Gender  Occupation                         Infirmity  Religion
0  1             Smith Radial  Grace       Patel     46   Head                           Widowed         Female  Petroleum engineer                 None       Catholic
1  1             Smith Radial  Ian         Nixon     24   Lodger                         Single          Male    Publishing rights manager          None       Christian
2  2             Smith Radial  Frederick   Read      87   Head                           Divorced        Male    Retired TEFL teacher               None       Catholic
3  3             Smith Radial  Daniel      Adams     58   Head                           Divorced        Male    Therapist, music                   None       Catholic
4  3             Smith Radial  Matthew     Hall      13   Grandson                       NaN             Male    Student                            None       NaN
5  3             Smith Radial  Steven      Fletcher  9    Grandson                       NaN             Male    Student                            None       NaN
6  4             Smith Radial  Alison      Jenkins   38   Head                           Single          Female  Physiotherapist                    None       Catholic
7  4             Smith Radial  Kelly       Jenkins   12   Daughter                       NaN             Female  Student                            None       NaN
8  5             Smith Radial  Kim         Browne    69   Head                           Married         Female  Retired Estate manager/land agent  None       Christian
9  5             Smith Radial  Oliver      Browne    69   Husband                        Married         Male    Retired Merchandiser, retail       None       None
Hello,
I have the dataset that you can see above. When I tried to convert Age to int, I got this error: ValueError: invalid literal for int() with base 10: '43.54302670766108'
This means there are float values stored as strings in that column. I tried to replace '.' with '0' and then convert, but I failed. Could you help me do that?
df['Age'] = df['Age'].replace('.','0')
df['Age'] = df['Age'].astype('int')
I still got the same error. I think the replace line is not working. Do you know why?
Thanks
Try (your replace does nothing because, without regex=True, Series.replace only replaces cells whose entire value equals '.'):
df['Age'] = df['Age'].replace(r'\..*$', '', regex=True).astype(int)
Or, more drastic (any value containing a non-digit, or an empty string, becomes '0'):
df['Age'] = df['Age'].replace(r'^(?:.*\D.*)?$', '0', regex=True).astype(int)
You do not need to manipulate the strings; you can first convert the values to float and then to int:
df["Age"] = df["Age"].astype('float').astype('int')
I have two pandas DataFrames with exactly the same column names. One of these columns is id_number, which is unique within each table (an id_number can appear at most once in each df). I want to find all records that share an id_number but have at least one different value in any other column, and store these records in a new DataFrame.
I've tried merging (more specifically, an inner join), but that collapses each id_number into a single record, so I can't look for differences between the two dfs.
Let me provide some example to provide a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David Boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!
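For reference, here is one way the example frames could be built, transcribed directly from the tables above:

import pandas as pd

df1 = pd.DataFrame({'id_number': [1, 2, 3, 4, 5],
                    'name': ['John', 'Alex', 'Tyler', 'David', 'Chloe'],
                    'type': ['dev', 'dev', 'dev', 'dev', 'dev'],
                    'city': ['Toronto', 'Toronto', 'Toronto', 'Toronto', 'Toronto']})
df2 = pd.DataFrame({'id_number': [1, 2, 4, 5, 6],
                    'name': ['John', 'Alex', 'David', 'Chloe', 'Kyle'],
                    'type': ['boss', 'dev', 'boss', 'dev', 'dev'],
                    'city': ['Vancouver', 'Vancouver', 'Toronto', 'Toronto', 'Vancouver']})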
Here is one way using nunique: we pick the id_number groups where any column has more than one distinct value, then slice those rows out:
s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby('id_number').nunique().gt(1).any(axis=1).loc[lambda x: x].index)]
Output:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto
Here is a way using pd.concat, drop_duplicates and duplicated: rows that are identical in both frames appear twice in the concatenation and are removed by drop_duplicates(keep=False), so anything that survives with a repeated id_number must differ between the two frames:
pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
.loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto
I am looking for an efficient way to select matching rows in two dataframes based on a shared key value, upsert these into a new dataframe, and use the intersection to map the differences into a third, slightly different dataframe that compares them.
**Example:**
DataFrame1
FirstName, City
Mark, London
Mary, Dallas
Abi, Madrid
Eve, Paris
Robin, New York
DataFrame2
FirstName, City
Mark, Berlin
Abi, Delhi
Eve, Paris
Mary, Dallas
Francis, Rome
In the dataframes, I have potential matching/overlapping rows on FirstName, so the intersection on these is:
Mark, Mary, Abi, Eve
excluded from the join are:
Robin, Francis
I construct a dataframe that allows values from both to be compared:
DataFrameMatch
FirstName_1, FirstName_2, FirstName_Match, City_1, City_2, City_Match
And insert/update (upsert) so my output is:
DataFrameMatch
FirstName_1 FirstName_2 FirstName_Match City_1 City_2 City_Match
Mark Mark True London Berlin False
Abi Abi True Madrid Delhi False
Mary Mary True Dallas Dallas True
Eve Eve True Paris Paris True
I can then report on the difference between the two lists, and what particular fields are different.
merge
According to your output, you only want rows where FirstName matches. You then want another column that evaluates whether the cities match:
d1.merge(d2, on='FirstName', suffixes=['_1', '_2']).eval('City_Match = City_1 == City_2')
FirstName City_1 City_2 City_Match
0 Mark London Berlin False
1 Mary Dallas Dallas True
2 Abi Madrid Delhi False
3 Eve Paris Paris True
Details
You could do a simple merge, which joins on all common columns by default (here both FirstName and City), and end up with only the rows that are identical in both frames:
FirstName City
0 Mary Dallas
1 Eve Paris
So I had to restrict the join keys via the on argument, hence on='FirstName':
d1.merge(d2, on='FirstName')
FirstName City_x City_y
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Which gets us closer, but now I want to adjust those suffixes:
d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
FirstName City_1 City_2
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Lastly, I'll add a new column that shows the evaluation of City_1 being equal to City_2. I chose to use pandas.DataFrame.eval; you can see the results above.
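If the exact DataFrameMatch layout from the question is wanted, with explicit FirstName_1 and FirstName_2 columns, one possible follow-up is to duplicate the join key (FirstName_Match is then trivially True, since the merge only keeps matching names); the column names below simply mirror the question:

out = d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
out = out.assign(FirstName_1=out['FirstName'],
                 FirstName_2=out['FirstName'],
                 FirstName_Match=True,
                 City_Match=out['City_1'].eq(out['City_2']))
out = out[['FirstName_1', 'FirstName_2', 'FirstName_Match',
           'City_1', 'City_2', 'City_Match']]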