I have two different data frames that I need to merge, and the merge column ('title') needs to be cleaned up before the merge can happen. Sample data looks like the following:
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
As expected, the merge will not succeed on the first title. I have been using the replace() method, but it gets unmanageable very quickly because I have hundreds of titles to correct due to spelling differences, case sensitivity, etc.
Any other suggestions regarding how to best cleanup and merge the data?
Full Example:
import pandas as pd
import numpy as np
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
data2['title'].replace(regex=True,inplace=True,to_replace=r"\s\(.*\)",value=r'')
replacements = {
    'title': {
        r'a.b. company *.*': 'ab company'
    }
}
data1.replace(replacements, regex=True, inplace=True)
pd.merge(data1, data2, on='title')
First things first, there is no perfect solution for this problem, but I suggest doing two things:
Do any easy cleaning you can do before hand, including removing any characters you don't expect.
Apply some fuzzy matching logic
You'll see this isn't perfect, since even this example doesn't work 100%.
First, let's start by making your example a tiny bit more complicated, introducing a regular typo (coampany_b instead of company_b, something that won't get picked up by the easy cleaning below).
data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9', 'csdfsjkbku'], 'title': ['a.b. company','company_b', 'coampany_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})
Then let's assume you only expect [a-z] characters, as @Maarten Fabré mentioned. So let's lowercase everything and remove anything else.
data1['cleaned_title'] = data1['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')
data2['cleaned_title'] = data2['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')
Now, let's use difflib's get_close_matches (see the difflib documentation for more details and other options).
import difflib
data1['closestmatch'] = data1.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data2.cleaned_title)[0])
data2['closestmatch'] = data2.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data1.cleaned_title)[0])
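Note that get_close_matches returns an empty list when nothing is similar enough, so indexing with [0] can raise an IndexError. A hedged variant that falls back to None instead:
data1['closestmatch'] = data1.cleaned_title.apply(
    lambda x: next(iter(difflib.get_close_matches(x, data2.cleaned_title)), None))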
Here is the resulting data1, looking good!
id title cleaned_title closestmatch
0 a12bcde0 a.b. company abcompany abcompany
1 b20bcde9 company_b companyb companyb
2 csdfsjkbku coampany_b coampanyb companyb
Now, here is data2, looking a bit less good... We asked it to find the closest match, so it found one even for company_f, which you clearly don't want.
serial_number title cleaned_title closestmatch
0 01a2b345 ab company abcompany abcompany
1 10ab2030 company_b (123) companyb companyb
2 40ab4060 company_f companyf companyb
The ideal scenario is when you have a clean list of company titles on the side, in which case you should find the closest match against that. If you don't, you'll have to get creative or manually clean up the hits and misses.
To wrap this up, you can now perform a regular merge on 'closestmatch'.
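For completeness, a minimal sketch of that last step (using the column names from above; the suffixes just keep the two original title columns apart):
merged = pd.merge(data1, data2, on='closestmatch', suffixes=('_data1', '_data2'))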
You could try to make a simplified_name column in each of the 2 dataframes by setting all characters to lowercase and removing all the non-[a-z ] characters, and join on this column if this doesn't lead to collisions.
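A rough sketch of that idea (assuming only [a-z ] characters matter for matching):
data1['simplified_name'] = data1['title'].str.lower().str.replace(r'[^a-z ]', '', regex=True)
data2['simplified_name'] = data2['title'].str.lower().str.replace(r'[^a-z ]', '', regex=True)
merged = data1.merge(data2, on='simplified_name')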
I am busy making a system that can sort some things from an Excel document; I have added a part of the document here: shorturl.at/DKNP7
It has the following inputs: Day, time, sort, number, gourmet/fondue, sort_exclusive
I want to have this sorted as follows: it must contain the sum of each of the different types.
I have some code, but I doubt it is efficient; I have included the start of the code below.
df = pd.read_excel('Example_excel.xlsm', sheet_name="INVOER")
gourmet = df[['Day', 'Time', 'Sort', 'number', 'Gourmet/Fondue', 'sort exclusive']]
gourmet1 = gourmet.dropna(subset=['Sort'], inplace=False) #if 'Sort' is not filled in it is dropped.
gourmet1.to_excel('test.xlsx', index=False, sheet_name='gourmet')
Maybe it needs to be split into 2 parts, where 1 part is 'exclusief' from 'sort exclusive' and another part is 'populair' and 'deluxe' from the 'sort' column, as sketched further below.
Looking forward to your reply!
One of the things I have tried is to split it:
gourmet_pop_del = gourmet1.groupby(['Day','Sort', 'Gourmet/Fondue' ])['number'].sum()
gourmet_pop_del = gourmet_pop_del.reset_index()
gourmet_pop_del.sort_values(by=['Day', 'Sort','Gourmet/Fondue'], inplace=True)
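For example, the 'exclusive' part might be summed separately, something like the rough sketch below (column names assumed from the description above):
# Hypothetical split: rows with a value in 'sort exclusive' form the exclusive part
gourmet_excl = gourmet1[gourmet1['sort exclusive'].notna()]
gourmet_excl = gourmet_excl.groupby(['Day', 'sort exclusive'])['number'].sum().reset_index()
gourmet_excl.sort_values(by=['Day', 'sort exclusive'], inplace=True)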
I have a Dataframe with 3 columns:
id,name,team
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n
I am trying to apply a regex function so that I remove the unnecessary spaces. I have the code that removes these spaces, however I am unable to loop it through the entire Dataframe.
This is what I have tried thus far:
df['team'] = re.sub(r'[\n\r]*','',df['team'])
But this throws an error AttributeError: 'Series' object has no attribute 're'
Could anyone advise how I could loop this regex through the entire Dataframe df['team'] column?
You are almost there, there are two simple ways of doing this:
# option 1 - faster way
df['team'] = [re.sub(r'[\n\r]*','', str(x)) for x in df['team']]
# option 2
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*','', str(x)))
As long as it's a dataframe, check replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
df['team'].replace( { r"[\n\r]+" : '' }, inplace= True, regex = True)
Regarding the regex, '*' means 0 or more; you really only need '+', which is 1 or more.
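A quick illustration of the difference (plain re, just for demonstration):
import re
re.sub(r'[\n\r]*', '', 'admin\n')  # 'admin' - but '*' also matches the empty string at every position
re.sub(r'[\n\r]+', '', 'admin\n')  # 'admin' - '+' only matches where a newline actually occurs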
Here's a powerful technique to replace multiple words in a pandas column in one step without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC' etc. (all of them are in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.
import re

RemoveDB = pd.read_csv('RemoveDB.csv')
RemoveDB = RemoveDB['REMOVE'].tolist()
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)
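One caveat: joining the words without boundaries can also match inside longer names. If that matters, a hedged variant that escapes each term and matches whole words only might look like this (terms is assumed to be the list produced by the .tolist() step above, kept before the '|'.join):
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, terms)) + r')\b')
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)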
Another example (without regex), but maybe still useful for someone.
id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series([' marketing','admin\n', 'finance\n'])
testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
testsO['team'] = testsO['team'].str.strip()
print(testsO)
I have a large pandas dataframe, on which I am running group by operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    do something in chr_
    write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. scaff_9. But with very large data that includes scaff_9, I am getting two groups for 2. It really isn't an error message and it is not affecting the computation. The issue is when I write the data by group to the file; I am getting two groups for 2 (split unequally).
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and with small data it works well. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Among all the 2 values, are most of them treated as integer objects while some (the later part, close to scaff_9) are treated as string objects? Is this possible?
Sorry, I am only making assumptions here, and it is becoming impossible for me to know the origin of the problem.
Post Edit:
I have also tried to run sort_by(['CHROM']) before doing the groupby, but the problem still persists.
Any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas processes each group separately.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
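A quick sanity check after the cast (assuming CHROM is a regular column, not the index):
print(df['CHROM'].map(type).value_counts())  # should now show only <class 'str'>
print(df.groupby('CHROM').size())            # each CHROM value should appear as exactly one group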
I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18-dataset.
My problem is that the name column in these two datasets uses different formats.
For example: "lionel messi" in one dataset and in the other it is "L. Messi"
I would like to convert "L. Messi" to the lowercase version "lionel messi" for all rows in the dataset.
What would be the most intelligent way to go about this?
One simple way is to convert the names in both dataframes into a common format so they can be matched.* Let's assume that in df1 names are in the L. Messi format and in df2 names are in the lionel messi format. What would a common format look like? You have several choices, but one option would be all lowercase, with just the first initial followed by a period: l. messi.
df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})
df1.names = df1.names.str.lower()
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])
df = df1.merge(df2, left_on='names', right_on='names')
*Note: This approach is totally dependent on the names being "matchable" in this way. There are plenty of cases that could cause this simple approach to fail. If a team has two members, Abby Wambach and Aaron Wambach, they'll both look like a. wambach. If one dataframe tries to differentiate them by using other initials in their name, like m.a. wambach and a.k. wambach, the naive matching will fail. How you handle this depends on the size of your data; maybe you can try to match most players this way, see who gets dropped, and write custom code for them.
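One way to see who gets dropped is an outer merge with the indicator flag, a minimal sketch:
check = df1.merge(df2, on='names', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']  # rows that only appear in one of the two dataframes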
I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -) then the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns as these correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent version of pandas, you can escape a column name that contains special characters with a backtick (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier() ]
# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r
inv_replacements = {replacements[k] : k for k in replacements.keys()}
df = df.rename(columns=replacements) # Rename the columns
df = df.query(query) # Carry out query
df = df.rename(columns=inv_replacements)
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to @chrisb for their answer that pointed me in the right direction.
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']