Python Record Linkage, Fuzzy Match and Deduplication - python

I have 3 datasets of customers, each with 7 columns:
CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude
Every dataset has 13,000-18,000 records. I am trying to fuzzy match between them for deduplication, but my columns don't all carry the same weight in this matching. How can I handle that?
Do you know a good library for my case?

I think the recordlinkage library would suit your purposes.
You can use its Compare object to define the various kinds of matches you need:
import recordlinkage
compare_cl = recordlinkage.Compare()
compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')
Then, when computing the comparison vectors, you can customize how you interpret the results, e.g. requiring that at least 2 features match:
features = compare_cl.compute(pairs, df)
matches = features[features.sum(axis=1) >= 2]
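Since your columns don't carry equal weight, one option (a minimal sketch with purely illustrative weights and cutoff) is to weight the comparison features before summing them:
import pandas as pd

# illustrative weights per comparison label; tune them for your data
weights = pd.Series({'CustomerName': 3.0, 'StoreName': 1.0, 'Address': 2.0})
weighted_score = features.mul(weights, axis=1).sum(axis=1)
weighted_matches = features[weighted_score >= 4.0]  # illustrative cutoff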

Related

searching values from one dataframe in another dataframe using pandas

I have two datasets, patient data and disease data.
The patient dataset has diseases written as alphanumeric codes, which I want to look up in the disease dataset to display the disease name.
Patient dataset snapshot
Disease dataset snapshot
I want to use the groupby function on the ICD column to count the occurrences of each disease and rank them in descending order to display the top 5. I have been trying to find a reference for this, but could not.
Would appreciate the help!
EDIT!!
avg2 = joined.groupby('disease_name').TIME_DELTA.mean().disease_name.value_counts()
I am getting this error "'Series' object has no attribute 'disease_name'"
Assuming that your data are in two pandas dataframes called patients and diseases, and that the diseases dataset has the columns disease_id and disease_name, this could be a solution:
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined.disease_name.value_counts().head(5)
This solution joins the data together and then uses value_counts instead of grouping. It should solve what I perceive you are asking for, even if it is not exactly the functionality you asked for.
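Regarding the error in the edit: after .mean() the result is a Series indexed by disease_name, so it has no disease_name attribute, and the counts already come from value_counts. A sketch building on the join above (assuming TIME_DELTA is a numeric column in the patient data):
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined['disease_name'].value_counts().head(5)
# mean TIME_DELTA per disease, restricted to the 5 most frequent diseases
avg_delta = joined.groupby('disease_name')['TIME_DELTA'].mean().loc[top_5.index]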

Is there a way to speed up Record Linkage in Python for comparing similar records

I am using the recordlinkage library in Python to detect duplicates in several datasets of estate properties that are web scraped weekly from a couple of websites. For the process I use the following numeric variables as the block index: area, rooms, bathrooms and garages, plus these categorical variables: stratum (6 categories) and type (2 categories). For comparing I use the geographical coordinates, the price and the description, the latter with the lcs method; the description is a string that may be up to 1000 characters on some records, but normally contains 300-500 characters.
The issue is that it takes a really long time to compute the comparison, even with 8 jobs (I have tried with fewer cores and it takes even longer). For example, in one dataset I have 60000 records, and when comparing it with itself it takes roughly 10 hours to compute 20000 possible duplicates, but it shouldn't take that long, right? Is there a way to tweak the process to make it faster?
Here is the code I have been using:
import recordlinkage as rl
from recordlinkage.compare import Geographic, Numeric, String

## df_in is a pandas DataFrame with all the required columns
block_vars = ['area', 'rooms', 'bathrooms', 'garages', 'stratum', 'type']
compare_vars = [
    String('description', 'description', method='lcs',
           label='description', threshold=0.95),
    Numeric('originPrice', 'originPrice', method='gauss',
            label='originPrice', offset=0.2, scale=0.2),
    Geographic('latitude', 'longitude', 'latitude', 'longitude',
               method='gauss', offset=0.2, label='location')
]
indexer = rl.index.Block(block_vars)
candidate_links = indexer.index(df_in)
njobs = 8
## This is the part that takes hours
comparer = rl.Compare(compare_vars, n_jobs=njobs)
compare_vectors = comparer.compute(pairs=candidate_links, x=df_in)
## Model training doesn't take too long
ecm = rl.ECMClassifier(binarize=0.5)
ecm.fit(compare_vectors)
pairs_ecm = ecm.predict(compare_vectors)
I was playing with the String comparison methods and found out that some of them are not imported from jellyfish, so they are not optimized. One of them is the lcs method, which for one dataset could take a whole weekend to run, as stated in the question. I changed it to damerau_levenshtein and the comparison now takes less than two minutes.
Hope this helps someone in the future :D
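In case it is useful, that amounts to a one-line change in the compare_vars list above (same threshold and label):
    String('description', 'description', method='damerau_levenshtein',
           label='description', threshold=0.95),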

pandas.Series.str.contains - how to stop at first match?

I have a df with two columns (company_name and sales).
The company_name column includes the name of the company plus a short description (e.g. company X - medical insurance; company Y - travel and medical insurance; company Z - medical and holiday insurance etc.)
I want to add a third column with a binary classification (medical_insurance or travel_insurance) based on the first matching string value included in the company_name.
I have tried using str.contains, but when matching words from different groups are present in company_name (e.g. both medical and travel), str.contains doesn't necessarily classify the row by the first instance (which is what I need).
medical_focused = df.loc[df['company_name'].str.contains(
    'medical|hospital', flags=re.IGNORECASE, na=False), 'classification'] = 'medical_focused'
travel_focused = df.loc[df['company_name'].str.contains(
    'travel|holiday', flags=re.IGNORECASE, na=False), 'classification'] = 'travel_focused'
How can I force str.contains to stop at the first instance?
Thanks!
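One way to keep only the leftmost keyword is to capture it with a single regex via str.extract, which returns the first match in the string, and then map the captured word to a label (the pattern and the mapping below are illustrative):
import re

# extract the first keyword that appears in company_name (leftmost match wins)
keyword = df['company_name'].str.extract(
    r'(medical|hospital|travel|holiday)', flags=re.IGNORECASE, expand=False)
mapping = {'medical': 'medical_focused', 'hospital': 'medical_focused',
           'travel': 'travel_focused', 'holiday': 'travel_focused'}
df['classification'] = keyword.str.lower().map(mapping)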

Merging two Python dataframes using a custom function for approximate matching and threshold score

I have two dataframes that contain web addresses and top-level domains. df1 has ~1 million rows and df2 has ~700k rows. I need to merge the two dataframes to obtain the common web addresses and corresponding domains, along with the columns unique to each dataframe. Because transcribing web addresses and domains can lead to spelling mistakes, I need to use approximate merging.
Here is an example:
df1
   address        tld   test
0  google         .com  14100
1  stackoverflow  .net  19587
2  yahoo!         .com  21633
3  bbcc           .com   9633
4  nytimes        .net  61933
df2
   address          tld   type
0  google           .com     1
1  stackoverrfloow  .net     5
2  bbc              .com     4
4  nytimes          .com     1
Here is the output that I expect:
output
   address        tld    test  type
0  google         .com  14100     1
1  stackoverflow  .net  19587     5
2  bbcc           .com   9633     4
I created a function that returns the percentage match using Levenshtein distance. It is a simple function that takes two strings as inputs and returns the percent match. For example:
string1 = "stackoverflow"
string2 = "stackoverrfloow"
pct_match = pctLevenshtein(string1, string2)
This gives me a percentage match of 0.87. How can I use this function, along with a threshold score above which the approximate match is good enough, to match approximately on the address and tld columns and create the output dataframe? The output above is only a sample, and it may also pick "nytimes" depending on the threshold score. I have tried the following, using difflib's get_close_matches to find approximate matches and then merging, but this is not exactly what I am trying to do.
df2['key1'] = df2.address.map(lambda x: difflib.get_close_matches(x, df1.address)[0])
df2['key2'] = df2.tld.map(lambda x: difflib.get_close_matches(x, df1.tld)[0])
Nothing I have tried so far has worked. I am looking for something like this to work:
df2['key1'] = df2['address'].map(lambda x: pctMatchLevenshtein(x, df1['address']) if pctMatchLevenshtein(x, df1['address'])>0.85 else 0)
Any tips on how to proceed are greatly appreciated. Thanks!
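A sketch of one way to apply a percent-match function row by row with a cutoff, using difflib.SequenceMatcher as a stand-in for pctLevenshtein and a hypothetical 0.85 threshold (exact matching on tld is also an assumption here):
import difflib

def best_match(value, candidates, threshold=0.85):
    # return the candidate with the highest similarity ratio,
    # or None if nothing clears the threshold
    score, match = max((difflib.SequenceMatcher(None, value, c).ratio(), c)
                       for c in candidates)
    return match if score >= threshold else None

df2['key1'] = df2['address'].map(lambda x: best_match(x, df1['address']))
output = df2.dropna(subset=['key1']).merge(
    df1, left_on=['key1', 'tld'], right_on=['address', 'tld'])
Note that this scans every df1 address for every df2 row, so it is worth trying on a sample before running it on the full million-row frame.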

Fuzzy Match between large number of records

I have two data frames. One contains 33,765 companies; the other contains 358,839 companies. I want to find matches between the two using fuzzy matching. Because the number of records is too high, I am trying to break down the records of both data frames according to the first letter of the company name.
For example: for all the companies starting with the letter "A", the 1st data frame has 2,600 records and the 2nd has 25,000. I do a full merge between them and then apply fuzzy matching to keep all the companies with a fuzz value of at least 95.
This still does not work because the number of records is still too high to perform a full merge and then apply fuzzy matching; the kernel dies every time I run these operations. The same approach worked fine when the number of records in both frames was 4 digits.
Also, please suggest if there is a way to automate this for all letters 'A' to 'Z' instead of manually running the code for each letter (without making the kernel die).
Here's my code:
import pandas as pd
from fuzzywuzzy import fuzz  # or: from thefuzz import fuzz

c = 'A'
df1 = df1[df1.companyName.str[0] == c].copy()
df2 = df2[df2.companyName.str[0] == c].copy()
df1['Join'] = 1
df2['Join'] = 1
df3 = pd.merge(df1, df2, left_on='Join', right_on='Join')
df3['Fuzz'] = df3.apply(lambda x: fuzz.ratio(x['companyName_x'], x['companyName_y']), axis=1)
df3.sort_values(['companyName_x', 'Fuzz'], ascending=False, inplace=True)
df4 = df3.groupby('companyName_x', as_index=False).first()
df5 = df4[df4.Fuzz >= 95]
You started going down the right path by chunking records based on a shared attribute (the first letter). In the record linkage literature, this concept is called blocking, and it's critical to reducing the number of comparisons to something tractable.
The way forward is to find even better blocking rules: maybe the first five characters, or a whole word in common.
The dedupe library can help you find good blocking rules. (I'm a core dev for this library)
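Whichever blocking key you settle on, the per-letter runs can be automated by looping over block groups; a minimal sketch (reusing fuzz.ratio from the question, with a block_key function you can tighten to the first five characters or a shared token):
import pandas as pd
from fuzzywuzzy import fuzz

def block_key(name):
    # swap for name[:5].upper() or a shared token for tighter blocks
    return name[:1].upper()

df1['block'] = df1['companyName'].map(block_key)
df2['block'] = df2['companyName'].map(block_key)

results = []
for key, g1 in df1.groupby('block'):
    g2 = df2[df2['block'] == key]
    if g2.empty:
        continue
    # the cross join stays small because it only happens inside one block
    pairs = g1.merge(g2, on='block', suffixes=('_x', '_y'))
    pairs['Fuzz'] = pairs.apply(
        lambda r: fuzz.ratio(r['companyName_x'], r['companyName_y']), axis=1)
    best = (pairs.sort_values(['companyName_x', 'Fuzz'], ascending=False)
                 .groupby('companyName_x', as_index=False).first())
    results.append(best[best['Fuzz'] >= 95])

matches = pd.concat(results, ignore_index=True)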
