I’m trying to find duplicates in a single CSV file with Python. Through my search I found dedupe.io, a platform that uses Python and machine learning algorithms to detect duplicate records, but it’s not a free tool. I also don’t want to use the traditional method in which the columns to compare have to be specified; I would like to detect duplicates with high accuracy. So, is there any tool or Python library for finding duplicates in text datasets?
Here is an example which could clarify that:
Title, Authors, Venue, Year
1- Clustering validity checking methods: part II, Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
2- Cluster validity methods: part I, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
3- Book reviews, Karl Aberer, ACM SIGMOD Record, 2003
4- Book review column, Karl Aberer, ACM SIGMOD Record, 2003
5- Book reviews, Leonid Libkin, ACM SIGMOD Record, 2003
So, we can decide that records 1 and 2 are not duplicates, even though they contain almost the same data and differ only slightly in the Title column. Records 3 and 4 are duplicates, while record 5 does not refer to the same entity.
Pandas provides a very straightforward way to achieve this: pandas.DataFrame.drop_duplicates.
Given the following file (data.csv) stored in the current working directory:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
The following script can be used to remove duplicate records, writing the processed data to a csv file in the current working directory (processed_data.csv).
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates()
df.to_csv("processed_data.csv", index=False)
The resultant output in this example looks like:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Louise Jones,25,50000
pandas.DataFrame.drop_duplicates also allows dropping duplicates based on specific columns (instead of only duplicates of entire rows); the column names are specified using the subset argument.
e.g.
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates(subset=["age"])
df.to_csv("processed_data.csv", index=False)
This will remove all rows whose value in the age column duplicates that of an earlier record, keeping only the first record containing each age value.
In this example case the output would be:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Thanks #JPI93 for your answer, but some duplicates still exist and weren't removed. I think this method works for exact duplicates; if that's the case, it's not what I'm looking for. I want to apply record linkage, which identifies the records that refer to the same entity so that they can then be removed.
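For anyone reading later, here is a minimal sketch of that idea using the open-source recordlinkage package, assuming a hypothetical citations.csv containing the Title, Authors, Venue and Year columns from the example above. The similarity thresholds are guesses and would need tuning on real data, for instance so that pairs like records 1 and 2 above stay below the threshold:
import pandas as pd
import recordlinkage

# hypothetical file containing the Title, Authors, Venue, Year columns shown above
df = pd.read_csv("citations.csv")

# only compare records that share Venue and Year, to keep the number of candidate pairs small
indexer = recordlinkage.Index()
indexer.block(["Venue", "Year"])
pairs = indexer.index(df)

# fuzzy-compare Title and Authors within each block
compare = recordlinkage.Compare()
compare.string("Title", "Title", method="jarowinkler", threshold=0.9, label="title")
compare.string("Authors", "Authors", method="jarowinkler", threshold=0.9, label="authors")
features = compare.compute(pairs, df)

# treat pairs where both comparisons pass as duplicates and drop the second record of each pair
dupes = features[(features["title"] == 1) & (features["authors"] == 1)]
deduped = df.drop(index=dupes.index.get_level_values(1))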
I have 2 datasets. One contains a column of company names, and the other contains a column of news headlines. The aim is to find all the news whose headline contains a company from the other dataset. Basically the two datasets look like this, and I want to select the news with specific company names.
I have tried to use a for loop to achieve this, but I think it takes too much time, and I think pandas or some other library can do it in an easier way.
I am a beginner in Python.
If I understand correctly, you have 2 datasets with different columns. First, you need to loop through the dataset that contains the company names and search for each one in the headlines; you could use the string method .find("search") to look for matches between the two datasets.
Also, if the data is stored in CSV format, you could use the split() function to extract only the column you want to use.
Supposing that you have saved your company names in a pd.Series called company and your headlines in a pd.DataFrame called df, this will be what you are looking for:
# it will add a column called "company" to your initial df
for org in company:
    for headline in df['headline']:
        if org in headline:
            df.loc[df['headline'] == headline, 'company'] = org
You should pay attention to lower and upper case letters, as this will only find the corresponding company if the exact same word appears in the headline.
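If the loop above turns out to be slow on a large DataFrame, one possible vectorized variant (same assumed company Series and df with a 'headline' column; the sample data here is made up) also handles the case-sensitivity issue:
import pandas as pd

# made-up inputs standing in for the real data
company = pd.Series(["Acme Corp", "Globex"])
df = pd.DataFrame({"headline": ["Acme Corp posts record profits",
                                "Weather update for the weekend",
                                "GLOBEX expands overseas"]})

for org in company:
    # case-insensitive, literal (non-regex) substring match against every headline at once
    mask = df["headline"].str.contains(org, case=False, regex=False)
    df.loc[mask, "company"] = org

print(df)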
Let's say I have a CSV which is generated yearly by my business. Each year my business decides there is a new type of data we want to collect. So Year2002.csv looks like this:
Age,Gender,Address
A,B,C
Then year2003.csv adds a new column
Age,Gender,Address,Location
A,B,C,D
By the time we get to year 2021, my CSV now has 7 columns and looks like this:
Age,Gender,Address,Location,Height,Weight,Race
A,B,C,D,E,F,G
My business wants to create a single CSV which contains all of the data recorded. Where data is not available (for example, Address data is not recorded in the 2002 CSV), there can be a 0, a NaN, or an empty cell.
What is the best method available to merge the CSVs into a single CSV? It may be worth saying that I have 15,000 CSV files which need to be merged, ranging from 2002 to 2021. In 2002 the CSV starts off with three columns, but by 2020 the CSV has 10 columns. I want to create one 'master' spreadsheet which contains all of the data.
Just a little extra context... I am doing this because I will then be using Python to replace the empty values using the new data, e.g. calculate an average and replace empty CSV values with that average.
Hope this makes sense. I am just looking for some direction on how best to approach this. I have been playing around with Excel, Power BI and Python but I cannot figure out the best way to do this.
With pandas you can use pandas.read_csv() to create a DataFrame for each file, and then merge them using pandas.concat().
import pandas as pd
data1 = pd.read_csv(csv1)
data2 = pd.read_csv(csv2)
data = pd.concat([data1, data2])
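Since there are thousands of yearly files, a loop-plus-concat sketch along these lines may be more practical (the folder name and pattern here are assumptions); concat aligns columns by name, so years that lack a column simply get NaN in it:
import glob
import pandas as pd

# assumed folder holding all the yearly exports
files = sorted(glob.glob("yearly_data/*.csv"))

# read each file and let concat line the differing columns up by name
frames = [pd.read_csv(path) for path in files]
master = pd.concat(frames, ignore_index=True, sort=False)

master.to_csv("master.csv", index=False)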
You should take a look at the Python csv module.
A good place to start: https://www.geeksforgeeks.org/working-csv-files-python/
It is simple and useful for reading CSVs and creating new ones.
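If you'd rather stay in the standard library, a rough sketch of the same merge with the csv module (same assumed folder of yearly files) is to collect the union of all headers in a first pass, then write every row against that combined header, leaving missing fields empty:
import csv
import glob

files = sorted(glob.glob("yearly_data/*.csv"))  # assumed folder of yearly files

# first pass: collect the union of all column names, preserving first-seen order
fieldnames = []
for path in files:
    with open(path, newline="") as f:
        for name in (csv.DictReader(f).fieldnames or []):
            if name not in fieldnames:
                fieldnames.append(name)

# second pass: write every row against the combined header; missing fields are left empty
with open("master.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for path in files:
        with open(path, newline="") as f:
            writer.writerows(csv.DictReader(f))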
I'm a newbie in Python, and I decided to analyse electoral data from local elections in Brazil.
Our electoral superior court has pretty decent data on the subject (https://www.tse.jus.br/eleicoes/estatisticas/repositorio-de-dados-eleitorais-1). However, they aggregate it at the state level, even for municipal elections.
Is there any way to drop all cities except the ones I'm interested in? I'm trying to conduct some exploratory analysis on city councillors for the state capital, Fortaleza. However, the data I have brings information on all 184 state municipalities.
I've been trying to use pandas Groupby(), without success so far.
Any ideas?
Is there a field named city or something like that? In pandas you can drop rows based on a mask like this:
new_df = df.drop(df[df['city'] != 'Fortaleza'].index)
Notice that by default drop does not work in place, so it returns a new DataFrame, and you'll most likely want to call reset_index afterwards to fix the indices of the new DataFrame.
Pursuant to the answer given by Thiago, I believe you could condense your DataFrame based on a list of cities you wish to look at using the .isin() method:
df = df[df['city'].isin(cities)]
Of course, prior to this you must set cities to a list of whatever cities you are interested in, with 'city' being the header of the column containing the city names; I can't tell from the provided picture due to the language.
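For example, a quick self-contained sketch (the column name and values are made up, since the real field name isn't visible in the screenshot):
import pandas as pd

# made-up frame standing in for the real election data
df = pd.DataFrame({"city": ["Fortaleza", "Sobral", "Caucaia"],
                   "votes": [1200, 300, 450]})

cities = ["Fortaleza"]  # the municipalities you actually want to keep
df = df[df["city"].isin(cities)].reset_index(drop=True)
print(df)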
I'm somewhat new to Pandas and the Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", in order to create a third CSV consisting only of matches, concatenating all columns from CSV_1 and CSV_2 regardless of overlap in order to preserve the original records, e.g.
CSV_1                          CSV_2
Name      City   Date          Name_of_thing      City_of_Origin   Time
Examp.    Bton   7/11          THE EXAMPLE, LLC   Bton, USA        7/11/2020 00:00
Nomatch   Cton   10/10         huh, inc.          Lton, AMERICA    9/8/2020 00:00
Would output
CSV_3
Name      City   Date    Name_of_thing      City_of_Origin   Time
Examp.    Bton   7/11    THE EXAMPLE, LLC   Bton, USA        7/11/2020 00:00
The data is not well structured, and CSV_2 has many more columns than CSV_1, which is why I have been attempting to find fuzzy matches based on the name column, with the city column as an index block. I'm having trouble getting the matching stage to even execute, never mind efficiently, and haven't even tackled the concatenation step. Any help on how to tackle this?
Edit: The files are each very large (both ~1M lines with 8-20 columns, 80-200 MB); even loading single columns with pandas is troublesome. For context, this is a data project for a job application which indicated a preference for a 'passing familiarity with Python or R'. Under normal circumstances this title requires no coding knowledge whatsoever, which is why I found it so strange that the company decided to assign this complex data problem. Parameters are: a single Python file running locally in a low-memory environment (think 2013 Dell Inspiron) without modification (i.e. no increasing page file size).
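As a rough illustration only, a minimal sketch of the blocking and comparison stages with the Record Linkage Toolkit might look like the following. The file names are assumptions, and it presumes the city values have already been normalised to a shared form, since 'Bton' vs 'Bton, USA' would never block together as-is:
import pandas as pd
import recordlinkage

csv_1 = pd.read_csv("csv_1.csv")  # columns: Name, City, Date
csv_2 = pd.read_csv("csv_2.csv")  # columns: Name_of_thing, City_of_Origin, Time

# block on city so only records from the same city are ever compared
indexer = recordlinkage.Index()
indexer.block(left_on="City", right_on="City_of_Origin")
candidate_links = indexer.index(csv_1, csv_2)

# fuzzy-compare the name columns within each block
compare = recordlinkage.Compare()
compare.string("Name", "Name_of_thing", method="jarowinkler",
               threshold=0.85, label="name_match")
features = compare.compute(candidate_links, csv_1, csv_2)

# keep the pairs whose names matched, then concatenate the original rows side by side
matches = features[features["name_match"] == 1]
left = csv_1.loc[matches.index.get_level_values(0)].reset_index(drop=True)
right = csv_2.loc[matches.index.get_level_values(1)].reset_index(drop=True)
pd.concat([left, right], axis=1).to_csv("csv_3.csv", index=False)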
For your problem statement, and considering the size of the data involved, I recommend loading your data into a database. I would then use the following SQL to solve your problem and read the result back into my local Python environment / pandas DataFrame:
select *
from csv_1
inner join csv_2
on csv_1.city = csv_2.city_of_origin
where STRPOS( lower(csv_1.name) , lower(csv_2.name_of_thing) )>0
or STRPOS( lower(csv_2.name_of_thing) , lower(csv_1.name) )>0
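If you want to stay on the local machine, one possibility is SQLite, which ships with Python. Note that SQLite has no STRPOS, so the sketch below (file and column names are assumptions taken from the example above) uses its instr() equivalent and loads the CSVs in chunks to respect the memory limit:
import sqlite3
import pandas as pd

conn = sqlite3.connect("linkage.db")

# load both CSVs into SQLite in chunks so neither file sits fully in memory
for table, path in [("csv_1", "csv_1.csv"), ("csv_2", "csv_2.csv")]:
    for chunk in pd.read_csv(path, chunksize=50_000):
        chunk.to_sql(table, conn, if_exists="append", index=False)

query = """
select *
from csv_1
inner join csv_2
  on csv_1.City = csv_2.City_of_Origin
where instr(lower(csv_1.Name), lower(csv_2.Name_of_thing)) > 0
   or instr(lower(csv_2.Name_of_thing), lower(csv_1.Name)) > 0
"""
result = pd.read_sql_query(query, conn)
result.to_csv("csv_3.csv", index=False)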
EDIT: Using advanced search in Excel (under the Data tab) I have been able to create a list of unique company names, and am now able to SUMIF based on the cell containing the company's name!
Disclaimer: Any Python solutions would be greatly appreciated as well, pandas specifically!
I have 60,000 rows of data containing information about grants awarded to companies.
I am planning on creating a Python dictionary to store each unique company name, with its total grant $ given (agreemen_2) and location coordinates. Then, I want to display this using Dash (Plotly) on a live Mapbox map of Canada.
First things first, how do I calculate and store the total value that was awarded to each company?
I have seen SUMIF in other solutions, but am unsure how to output this to a new column, if that makes sense.
One potential solution I thought of was to create a new column of unique company names and, next to it, SUMIF all the appropriate cells in column D.
PYTHON STUFF SO FAR
So with the below code, I take a much messier-looking spreadsheet, drop duplicates, sort based on company name, and create a new pandas DataFrame with the relevant data columns:
corp_df is the cleaned-up new DataFrame that I want to work with,
and recipien_4 is the company's unique ID number; as you can see, it repeats with each grant awarded. Folia Biotech in the screenshot shows a duplicate grant, as proven by a column I did not include in the screenshot. There are quite a few duplicates, as seen in the screenshot.
import pandas as pd

in_file = '2019-20 Grants and Contributions.csv'

# create dataframe
df = pd.read_csv(in_file)

# sort by company name (recipien_2)
df.sort_values("recipien_2", inplace=True)

# remove duplicate grant agreements
df.drop_duplicates(subset='agreemen_1', keep='first', inplace=True)

# full name, id, grant $, longitude, latitude
corp_df = df[['recipien_2', 'recipien_4', 'agreemen_2', 'longitude', 'latitude']]

# creates an empty dict with only 1 copy of all corporation names, all values of 0
corp_dict = {}
for name in corp_df['recipien_2']:
    if name not in corp_dict:
        corp_dict[name] = 0
Any tips or tricks would be greatly appreciated; .itertuples() didn't seem like a good solution as I am unsure how to filter and compare the data, or whether datatypes are preserved. But feel free to prove me wrong haha.
I thought perhaps there was a better way to tackle this problem straight in Excel vs. iterating through the rows of a pandas DataFrame. This is a pretty open question, so thank you for any help or direction you think is best!
I can see that you are using pandas to read the csv file, so you can use the following method:
Group by
You can create a new DataFrame by grouping on the name of the company and summing the grant amounts, like this:
dfnew = df.groupby('recipien_2', as_index=False)['agreemen_2'].sum()
Then dfnew holds the total grant value per company.
Documentation Pandas Group by:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
The use of groupby followed by a sum may be the best option for you:
corp_df = df.groupby(by=['recipien_2', 'longitude', 'latitude'])['agreemen_2'].sum()
# if you want to turn the index back into columns you can add this afterwards:
corp_df = corp_df.reset_index()
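If you then need the dictionary described in the question (company name mapped to total grant plus coordinates), one possible follow-up, assuming the column names above and that reset_index has been applied:
# one entry per company: total grant value plus its coordinates
corp_dict = {
    row.recipien_2: {"total_grant": row.agreemen_2,
                     "longitude": row.longitude,
                     "latitude": row.latitude}
    for row in corp_df.itertuples(index=False)
}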