I have two data frames. One contains 33765 companies, the other contains 358839 companies. I want to find matches between the two using fuzzy matching. Because the number of records is so high, I am trying to break down the records of both data frames according to the first letter of the company name.
For example: for all the companies starting with the letter "A", the first data frame has 2600 records and the second has 25000. I do a full merge between them and then apply fuzzy matching to keep all the pairs with a fuzz ratio of at least 95.
This still does not work, because the number of records is still too high to perform a full merge and then apply the fuzzy match; the kernel dies every time I run these operations. The same approach worked fine when the number of records in both frames was four-digit.
Also, please suggest whether there is a way to automate this for all letters 'A' to 'Z' instead of manually running the code for each letter (without making the kernel die).
Here's my code:
import pandas as pd
from fuzzywuzzy import fuzz  # or rapidfuzz.fuzz, whichever provides fuzz.ratio here

c = 'A'
# Keep only the companies whose name starts with the current letter.
df1 = df1[df1.companyName.str[0] == c].copy()
df2 = df2[df2.companyName.str[0] == c].copy()
# Constant key used to force a full cross merge between the two frames.
df1['Join'] = 1
df2['Join'] = 1
df3 = pd.merge(df1, df2, left_on='Join', right_on='Join')
# Score every pair, keep the best-scoring match per df1 name, then filter by score.
df3['Fuzz'] = df3.apply(lambda x: fuzz.ratio(x['companyName_x'], x['companyName_y']), axis=1)
df3.sort_values(['companyName_x', 'Fuzz'], ascending=False, inplace=True)
df4 = df3.groupby('companyName_x', as_index=False).first()
df5 = df4[df4.Fuzz >= 95]
You started down the right path by chunking records based on a shared attribute (the first letter). In the record-linkage literature this concept is called blocking, and it's critical to reducing the number of comparisons to something tractable.
The way forward is to find even better blocking rules: maybe the first five characters, or a whole word in common.
The dedupe library can help you find good blocking rules. (I'm a core dev for this library.)
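As an illustration of tighter blocking (this is plain pandas, not the dedupe API), here is a hedged sketch that blocks on the first five characters of the name and loops over the blocks, which also covers the A-to-Z automation you asked about. It starts from the full df1 and df2 before any per-letter filtering; companyName and fuzz come from your question, everything else is an assumption:
import pandas as pd
from fuzzywuzzy import fuzz  # or rapidfuzz.fuzz, whichever provides fuzz.ratio for you

# Illustrative blocking key: the first five (lower-cased) characters of the name.
df1['block'] = df1['companyName'].str[:5].str.lower()
df2['block'] = df2['companyName'].str[:5].str.lower()

matches = []
for block, left in df1.groupby('block'):
    right = df2[df2['block'] == block]
    if right.empty:
        continue
    # Merging on the constant 'block' key yields the cross product within this block only,
    # which keeps peak memory far below a full cartesian merge of the two frames.
    pairs = left.merge(right, on='block', suffixes=('_x', '_y'))
    pairs['Fuzz'] = pairs.apply(
        lambda r: fuzz.ratio(r['companyName_x'], r['companyName_y']), axis=1)
    best = (pairs.sort_values('Fuzz', ascending=False)
                 .groupby('companyName_x', as_index=False).first())
    matches.append(best[best['Fuzz'] >= 95])

result = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()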
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the CSV file as the first version. df2 is me reading those first 100 games back the second time around and comparing them against df1 (the new data, i.e. the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop-duplicates step. It gives me an 'unhashable list' error; I assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull game 100 again, drop that one and add only 101-150, then write it all to my CSV file. If I run it again, pull 150-200, but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the rows that are unique to df1:
df_diff = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
This checks whether each row of df1 exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, i.e. row-wise).
This solution is admittedly slow because it compares each row of df1 against all rows of df2, so its time complexity is O(n²).
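For reference, here is a minimal self-contained sketch of this tuple-based anti-join on made-up data (the column names are purely illustrative):
import pandas as pd

# Hypothetical frames standing in for the API pull and the csv read-back.
df1 = pd.DataFrame({'game_id': [1, 2, 3], 'score': [10, 20, 30]})
df2 = pd.DataFrame({'game_id': [2, 3], 'score': [20, 30]})

# Keep only the df1 rows whose full tuple of values does not appear anywhere in df2.
df_diff = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
print(df_diff)  # only the game_id 1 row remains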
If you want a more optimized version, try pandas' built-in compare method (note that compare requires the two frames to be identically labeled):
df1.compare(df2)
I'm relatively new to PySpark and I'm currently trying to implement the SVD algorithm for predicting user-item ratings. The input is a matrix with columns - user_id, item_id and rating. In the first step I initialize the biases (bu, bi) and the factor matrices (pu, qi) for each user and each item. So I start the algorithm with the following dataframe:
[screenshot: initial dataframe]
In the current case the number of partitions is 7 and counting all the rows takes 0.7 seconds. The number of rows is 2.5 million.
[screenshot: partitions and count time]
In the next step I add a column to my dataframe, error. I use a UDF which calculates the error for each row from all the other columns (I don't think the equation is relevant). Afterwards, the count takes about the same amount of time.
Now comes the tricky part. I have to create two new dataframes. In the first I group together all the users (named userGroup) and in the second I group together all the items (named itemGroup). I have another UDF that updates the biases (update_b) and one that updates the factor matrices (update_factor_F). The userGroup dataframe has 1.5 million rows and the itemGroup has 72000 rows.
[screenshot: updated biases and factors for each user]
I then take the initial dataframe and join it, first by user: I take user_id, item_id and rating from the initial dataframe and the biases and factors from the userGroup dataframe. I repeat the same process with the itemGroup.
train = train.join(userGroup, train.u_id == userGroup.u_id_ug, 'outer') \
             .select(train.u_id, train.i_id, train.rating, userGroup.bu, userGroup.pu)
I end up with a dataframe of the same size as the initial one. However, if I do a .count() it now takes around 8 seconds. I have to repeat the above steps iteratively, and each iteration makes the .count() action even slower.
I know the issue lies in the joins and I have searched for solutions. So far I have tried different combinations of partitioning (I used .repartition(7, "u_id") on the userGroup dataframe) to try to match the number of partitions. I also tried repartitioning the final dataframe, but the .count() remains slow.
My goal is not to lose performance after each iteration.
Since some of your dataframes are used multiple times, you will want to cache them so that they are not re-evaluated every time you need them. To do this you can rely on the cache() or persist() operations.
Also, the logical plan of your dataframe grows as your iterative algorithm progresses, which increases the computation cost exponentially across iterations. To cope with this, you will need to rely on the checkpoint() operation to regularly break the lineage of your dataframes.
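A minimal sketch of how that could look around your training loop; spark, train and the iteration count are assumptions based on your description, and DataFrame.checkpoint() needs Spark 2.1+ plus a configured checkpoint directory:
# Assumed setup: 'spark' is the active SparkSession and 'train' is the ratings dataframe.
spark.sparkContext.setCheckpointDir("/tmp/svd_checkpoints")  # any HDFS/local path works

num_iterations = 20  # placeholder
for i in range(num_iterations):
    # ... recompute userGroup / itemGroup and join them back onto train, as in the question ...

    # Cache so that every action in this iteration (count, further joins) reuses the same result.
    train = train.cache()

    # Break the lineage every few iterations; otherwise the logical plan keeps growing
    # and each count() gets slower than the last.
    if i % 5 == 0:
        train = train.checkpoint()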
Let's say we have several dataframes that contain relevant information that needs to be compiled into one single dataframe. There are several conditions involved in choosing which pieces of data can be brought over to the results dataframe.
Here are 3 dataframes (columns only) that we need to pull and compile data from:
df1 = ["Date","Order#","Line#","ProductID","Quantity","Sale Amount"]
df2 = ["Date","PurchaseOrderID","ProductID","Quantity","Cost"]
df3 = ["ProductID","Quantity","Location","Cost"]
df3 is the only table in this set that actually contains a unique, non-repeating key, "ProductID". The other two dataframes have keys, but they can repeat; the only way to establish uniqueness is to refer to the date and the other foreign keys.
Now, we'd like the desired result to show all products, grouped by product, where df1.date is after date x, df2.quantity < 5, and df3.quantity > 0. Ideally the result would show df3.quantity and df.cost (summing both in the grouping), the most recent purchase date from df2.date, and the total number of sales per part from df1 (a count), where all of the above criteria are met.
This is the quickest example I could come up with for this issue. I'm able to accomplish it in VBA with only one problem... it's EXCRUCIATINGLY slow. I understand that list comprehensions and perhaps other approaches would be faster than VBA (maybe?), but it would still take a while with all of the logic and decision-making that happens behind the scenes.
This example doesn't exactly show the complexities, but any advice or direction you have to offer may help me and others understand how to treat these kinds of problems in Python. Any expert opinion, advice, or direction is very much appreciated.
If I understand correctly:
You simply need to apply the conditions as filters on each dataframe, then group by ProductID and put it together.
# Filter by date, then aggregate sales per product.
df1 = df1[df1.Date > x].groupby('ProductID').agg({'Quantity': 'sum', 'Sale Amount': 'sum'})
# Aggregate purchases per product, then keep products with purchased quantity < 5.
df2 = df2.groupby('ProductID').agg({'Date': 'max', 'Quantity': 'sum', 'Cost': 'sum'})
df2 = df2[df2.Quantity < 5].copy()
# Keep in-stock products; index by ProductID so it lines up with the grouped frames.
df3 = df3[df3.Quantity > 0].set_index('ProductID')
Once you have all of those, you can do something like:
g = [i for i in df3.index if i in df2.index and i in df1.index]
df = df3.loc[g]  # use df3 as the base frame, keeping only the needed product IDs
I am not sure what you want to pull from df1 and df2 - but it will look something like:
df = df.join(df2['col_needed'])
You may need to rename columns to avoid overlap.
This avoids inefficient looping and should be orders of magnitude faster than a loop in VBA.
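To make the assembly concrete, here is a hedged end-to-end sketch under the question's column names (the renames are only there to avoid column overlap, as noted above, and x is the cutoff date; this is one possible arrangement, not a definitive implementation):
import pandas as pd

# Assumed: df1, df2, df3 already exist with the columns listed in the question.
sales = (df1[df1['Date'] > x]
         .groupby('ProductID')
         .agg({'Quantity': 'sum', 'Sale Amount': 'sum'})
         .rename(columns={'Quantity': 'QtySold', 'Sale Amount': 'TotalSales'}))

purchases = (df2.groupby('ProductID')
             .agg({'Date': 'max', 'Quantity': 'sum', 'Cost': 'sum'})
             .rename(columns={'Date': 'LastPurchase', 'Quantity': 'QtyPurchased', 'Cost': 'PurchaseCost'}))
purchases = purchases[purchases['QtyPurchased'] < 5]

stock = (df3[df3['Quantity'] > 0]
         .set_index('ProductID')
         .rename(columns={'Quantity': 'QtyOnHand', 'Cost': 'UnitCost'}))

# Keep only products present in all three frames, then join the columns side by side.
common = stock.index.intersection(purchases.index).intersection(sales.index)
result = stock.loc[common].join(purchases).join(sales)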
I am new to Python and pandas, and I was wondering whether pandas can filter out information within a dataframe that is otherwise inconsistent. For example, imagine that I have a dataframe with two columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times, and there would be several different product codes. I would like to filter out the product codes for which there is more than one unit of measurement for the same product code. Ideally, when this happens, the filter would bring back all instances of that product code, not just the instance in which the unit of measurement is different. To give more color to my request, the real objective here is to identify the product codes that have inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances.
Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either upload this, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for a product code is the correct one. You could get this by doing:
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can add a column holding the 'correct' unit of measurement:
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
Try this:
Sample df:
df12 = pd.DataFrame({'Product Code': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
                     'Unit of Measurement': ['x', 'x', 'y', 'z', 'w', 'w', 'q', 'r', 'a', 'c']})
Group by and count the occurrences of each (Product Code, Unit of Measurement) pair:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Drop all rows where the Product Code is repeated
new.drop_duplicates(subset=['Product Code'], keep=False)
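If, as the question asks, you want to pull back every row of a product code that has more than one distinct unit of measurement (rather than the pair counts), a small follow-up sketch on the same sample frame might look like this:
# Number of distinct units per Product Code, broadcast back onto every row of df12.
n_units = df12.groupby('Product Code')['Unit of Measurement'].transform('nunique')
df12[n_units > 1]  # all rows of the inconsistent codes (A and C in this sample)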
Imagine a large dataset (>40GB parquet file) containing value observations of thousands of variables as triples (variable, timestamp, value).
Now think of a query in which you are only interested in a subset of 500 variables, and you want to retrieve the observations (values, i.e. time series) for those variables for specific observation windows (timeframes), each having a start and end time.
Without distributed computing (Spark), you could code it like this:
for var_ in variables_of_interest:
    for incident in incidents:
        var_df = df_all.filter(
            (df_all.Variable == var_)
            & (df_all.Time > incident.startTime)
            & (df_all.Time < incident.endTime))
My question is: how do I do that with Spark/PySpark? I was thinking of one of the following:
joining the incidents somehow with the variables and filtering the dataframe afterwards;
broadcasting the incident dataframe and using it within a map function when filtering the variable observations (df_all);
using RDD.cartesian or RDD.mapPartitions somehow (remark: the parquet file was saved partitioned by variable).
The expected output should be:
incident1 --> dataframe 1
incident2 --> dataframe 2
...
Where dataframe 1 contains all variables and their observed values within the timeframe of incident 1 and dataframe 2 those values within the timeframe of incident 2.
I hope you got the idea.
UPDATE
I tried to code a solution based on idea #1 and the code from the answer given by zero323. It works quite well, but I wonder how to aggregate/group it by incident in the final step. I tried adding a sequential number to each incident, but then I got errors in the last step. It would be cool if you could review and/or complete the code, so I have uploaded sample data and the scripts. The environment is Spark 1.4 (PySpark):
Incidents: incidents.csv
Variable value observation data (77MB): parameters_sample.csv (put it to HDFS)
Jupyter Notebook: nested_for_loop_optimized.ipynb
Python Script: nested_for_loop_optimized.py
PDF export of Script: nested_for_loop_optimized.pdf
Generally speaking, only the first approach looks sensible to me. The exact joining strategy depends on the number of records and their distribution, but you can either create a top-level data frame:
ref = sc.parallelize([(var_, incident)
                      for var_ in variables_of_interest
                      for incident in incidents
                      ]).toDF(["var_", "incident"])
and simply join
same_var = col("Variable") == col("var_")
same_time = col("Time").between(
col("incident.startTime"),
col("incident.endTime")
)
ref.join(df.alias("df"), same_var & same_time)
or perform joins against particular partitions:
incidents_ = sc.parallelize([
    (incident, ) for incident in incidents
]).toDF(["incident"])

for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    df.join(incidents_, same_time)
optionally marking one side as small enough to be broadcasted.
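For example, a short sketch of the broadcast hint, assuming incidents_ is small enough to fit on every executor:
from pyspark.sql.functions import broadcast

# Ship the small incidents_ frame to every executor instead of shuffling the big side.
df.join(broadcast(incidents_), same_time)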