Record linking two large CSVs in Python?

I'm somewhat new to Pandas and Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", in order to create a third CSV consisting only of matches that concatenates all columns from CSV_1 and CSV_2 regardless of overlap in order to preserve the original record, e.g.
CSV_1:
Name    | City | Date
Examp.  | Bton | 7/11
Nomatch | Cton | 10/10
CSV_2:
Name_of_thing    | City_of_Origin | Time
THE EXAMPLE, LLC | Bton, USA      | 7/11/2020 00:00
huh, inc.        | Lton, AMERICA  | 9/8/2020 00:00
Would output
CSV_3:
Name   | City | Date | Name_of_thing    | City_of_Origin | Time
Examp. | Bton | 7/11 | THE EXAMPLE, LLC | Bton, USA      | 7/11/2020 00:00
The data is not well structured, and CSV_2 has many more columns than CSV_1, which is why I have been attempting to find fuzzy matches on the name column, using the city column as a blocking index. I'm having trouble getting the matching stage to execute at all, never mind efficiently, and I haven't yet tackled the concatenation step. Any help on how to approach this?
Edit: The files are each very large (both ~1M lines with 8-20 columns, 80-200 MB); even loading single columns with pandas is troublesome. For context, this is a data project for a job application which indicated a preference for a 'passing familiarity with Python or R'. Under normal circumstances this title requires no coding knowledge whatsoever, which is why I found it so strange that the company decided to assign this complex data problem. The parameters are: a single Python file running locally in a low-memory environment (think 2013 Dell Inspiron) without modification (i.e. no increasing the page file size).

Given your problem statement and the size of the data involved, I recommend loading the data into a database. I would then run the following SQL and read the result into a local Python environment / pandas DataFrame (STRPOS is the PostgreSQL spelling; other databases name this function differently, e.g. INSTR in SQLite):
select *
from csv_1
inner join csv_2
on csv_1.city = csv_2.city_of_origin
where STRPOS( lower(csv_1.name) , lower(csv_2.name_of_thing) )>0
or STRPOS( lower(csv_2.name_of_thing) , lower(csv_1.name) )>0
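For what it's worth, here is a minimal sketch of the same idea using only Python's built-in sqlite3 module, which avoids installing a database server. SQLite has no STRPOS, so the sketch uses its instr() function instead. The norm_name and norm_city helpers and the name_key / city_key columns are my own assumptions, not part of the question; they are there because 'Bton' never equals 'Bton, USA' in an exact join. The three-column tables mirror the toy example only.

```python
import sqlite3

def norm_name(value):
    # Hypothetical normalization: lowercase, keep only letters and digits.
    return "".join(ch for ch in value.lower() if ch.isalnum())

def norm_city(value):
    # Hypothetical normalization: keep the part before the first comma.
    return value.split(",")[0].strip().lower()

def link(csv_1_rows, csv_2_rows, db_path=":memory:"):
    """Join rows on the city key and keep pairs where one name contains the other."""
    con = sqlite3.connect(db_path)  # pass a file path to keep the data on disk
    con.execute("CREATE TABLE csv_1 (name, city, date, name_key, city_key)")
    con.execute("CREATE TABLE csv_2 (name_of_thing, city_of_origin, time, name_key, city_key)")
    con.executemany(
        "INSERT INTO csv_1 VALUES (?, ?, ?, ?, ?)",
        ((n, c, d, norm_name(n), norm_city(c)) for n, c, d in csv_1_rows),
    )
    con.executemany(
        "INSERT INTO csv_2 VALUES (?, ?, ?, ?, ?)",
        ((n, c, t, norm_name(n), norm_city(c)) for n, c, t in csv_2_rows),
    )
    # The index lets SQLite look up candidates by city instead of scanning csv_2.
    con.execute("CREATE INDEX idx_csv_2_city ON csv_2 (city_key)")
    rows = con.execute(
        """
        SELECT a.name, a.city, a.date, b.name_of_thing, b.city_of_origin, b.time
        FROM csv_1 AS a
        INNER JOIN csv_2 AS b ON a.city_key = b.city_key
        WHERE instr(a.name_key, b.name_key) > 0
           OR instr(b.name_key, a.name_key) > 0
        """
    ).fetchall()
    con.close()
    return rows
```

For the real files you would pass rows from csv.reader (which streams line by line) rather than lists, and use an on-disk database so that neither CSV has to fit in memory.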

Related

Tools or python libraries to detect records duplicate

I'm trying to find duplicates in a single CSV file with Python. Through my search I found dedupe.io, a platform that uses Python and machine learning algorithms to detect duplicate records, but it's not a free tool. However, I don't want to use the traditional method in which the columns to compare must be specified. I would like a way to detect duplicates with high accuracy. Is there any tool or Python library for finding duplicates in text datasets?
Here is an example which could clarify that:
Title, Authors, Venue, Year
1- Clustering validity checking methods: part II, Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
2- Cluster validity methods: part I, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
3- Book reviews, Karl Aberer, ACM SIGMOD Record, 2003
4- Book review column, Karl Aberer, ACM SIGMOD Record, 2003
5- Book reviews, Leonid Libkin, ACM SIGMOD Record, 2003
So we can decide that records 1 and 2 are not duplicates, even though they contain almost the same data and differ only slightly in the Title column. Records 3 and 4 are duplicates, but record 5 does not refer to the same entity.

Pandas provides a very straightforward way to achieve this: pandas.DataFrame.drop_duplicates.
Given the following file (data.csv) stored in the current working directory.
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
The following script can be used to remove duplicate records, writing the processed data to a csv file in the current working directory (processed_data.csv).
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates()
df.to_csv("processed_data.csv", index=False)
The resultant output in this example looks like:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Louise Jones,25,50000
pandas.DataFrame.drop_duplicates also allows dropping rows that are duplicated only in specific columns (instead of only duplicates of entire rows); the column names are specified using the subset argument.
For example:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates(subset=["age"])
df.to_csv("processed_data.csv", index=False)
This removes every row whose age value has already appeared, keeping only the first record for each distinct age.
In this example case the output would be:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000

Thanks @JPI93 for your answer, but some duplicates still exist and were not removed. I think this method only works for exact duplicates; if so, it's not what I'm looking for. I want to apply record linkage, which identifies the records that refer to the same entity so that they can be removed.
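To make the difference from exact matching concrete, here is a rough sketch of near-duplicate detection using only difflib from the standard library. It is not how dedupe.io or the Python Record Linkage Toolkit work internally; the rule (authors, venue and year must match exactly, titles must be similar) and the 0.7 threshold are assumptions chosen to fit the five sample records, and comparing every pair is only practical for small files.

```python
from difflib import SequenceMatcher
from itertools import combinations

def title_similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the lowercased titles are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(records, threshold=0.7):
    """Return id pairs with identical authors/venue/year and similar titles."""
    pairs = []
    for r1, r2 in combinations(records, 2):
        same_context = (
            (r1["authors"], r1["venue"], r1["year"])
            == (r2["authors"], r2["venue"], r2["year"])
        )
        if same_context and title_similarity(r1["title"], r2["title"]) >= threshold:
            pairs.append((r1["id"], r2["id"]))
    return pairs

records = [
    {"id": 1, "title": "Clustering validity checking methods: part II",
     "authors": "Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis",
     "venue": "ACM SIGMOD Record", "year": 2002},
    {"id": 2, "title": "Cluster validity methods: part I",
     "authors": "Yannis Batistakis, Michalis Vazirgiannis",
     "venue": "ACM SIGMOD Record", "year": 2002},
    {"id": 3, "title": "Book reviews", "authors": "Karl Aberer",
     "venue": "ACM SIGMOD Record", "year": 2003},
    {"id": 4, "title": "Book review column", "authors": "Karl Aberer",
     "venue": "ACM SIGMOD Record", "year": 2003},
    {"id": 5, "title": "Book reviews", "authors": "Leonid Libkin",
     "venue": "ACM SIGMOD Record", "year": 2003},
]

print(likely_duplicates(records))  # [(3, 4)]
```

On real data the exact-match requirement on authors is usually too strict, which is where a dedicated record linkage library with blocking and per-field comparison scores becomes worthwhile.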

Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process real-time DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out the data) and Python, which processes it and prints it back to Excel as a front-end 'GUI'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum ticks passed at same time (same h:m:sec)
check if DataFrame contains any value in a static array (eg. large quantities)
write output on the same excel file (different sheet), used as front-end output 'gui'.
I imported 'xlwings' library and use it to read data from one sheet, calculate needed values in python and then print out results in another sheet of the same file. I want to have Excel open and visible so to function as 'output dashboard'. This function is run in an infinite loop reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1,5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns
try:
global ttt #this is used as temporary global pandas.df
global tttout #this is used as output global pandas.df copy
#they are global as they can be zeroed with another function
ttt = pd.concat([ttt, tickdf], ignore_index=False)  # DataFrame.append was removed in pandas 2.0
#at each loop, newly read ticks are added as rows to the end of ttt global.df.
ttt.drop_duplicates(inplace=True)
tttout = ttt.copy()
#to prevent outputting incomplete data,for extra-safety, I use a copy of the ttt as DF to be printed out on excel file. I find this as an extra-safety step
tttout = tttout.groupby(['time','symb'], as_index=False).agg({'all-tick':'sum', 'price':'first'})
tttout = tttout.set_index('time')
#sort it by time/name and set time as index
tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
#find matching values comparing an array of a dozen values
tttout = tttout.sort_values(by = ['time', 'symb'], ascending = [False, True])
xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
I run this on an i5 @ 4.2 GHz, and this function, together with some other small pieces of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order; each row is the 'last tick' passed on the market for that specific stock and looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to investigate whether specific quantities are traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use Pandas/Xlwings/ and 'isin' method. Is there a more efficient approach that might improve my script performance?

It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.

@Tony Roberts, thank you.
I have one doubt and one observation.
DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible for a UDF function to be called that often? Would it be lean? I have little experience with this.
OBSERVATION: PyXLL is certainly extremely powerful, well made and well maintained, but IMHO, at $25/month, it goes against the free nature of the Python language. I do understand, though, that quality has a price.

Advanced processing of multiple DataFrames in Python

I have several (15) data frames. They contain values based on one map, but each holds only a fragment of it.
List of samples looks like A1 - 3k records, A2 - 6k records. B1 - 12k records, B2- 1k records, B3 - 3k records. C1... etc.
All files have the same format, which looks like this:
name sample position position_ID
String1 String1 num1 num1
String2 String2 num2 num2
...
All files come from a variety of biological microarrays. Different companies have different matrices, hence the scatter in the size of files. But each of them is based on one common, whole database. Just some of the data from the main database is selected. Therefore, individual records can be repeated between files. I want to see if they are compatible.
What do I want to achieve in this task?
I want to check that every record with a given name has the same position and position_ID values in all files.
If a record with the same name differs in these values in any file, it must be written to error.csv.
If it is the same everywhere, it goes to result.csv.
To be honest, I do not know how to approach this, so I am hoping someone here can give me good advice. I want to do it in Python.
I have two ideas.
Load all files into pandas as one data frame and try to write a function that filters the whole DF record by record (a for loop with if statements?).
Open each file separately with Python's file reading and add unique rows to a new list; when the read function encounters the same record name again, it checks it against the previous one. If all the other values are the same, it passes over it without writing; if not, the record is written to error.csv.
I am afraid, however, that these may not be the best methods, so I am asking for advice and for pointers to something better. I have read about numpy but have not studied it yet; maybe it is worth learning in the context of this task? Maybe a function for this already exists and I do not know about it?
Can someone suggest a more sensible (maybe easier) solution?

I think I have a rough idea of where you are going. This is how I would approach it
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df1["filename"] ="file1.csv"
df2["filename"] ="file2.csv"
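df_total = pd.concat([df1, df2], axis=0, ignore_index=True) # stacks them vertically
# drops rows that repeat the same values in another file (the filename column is ignored)
df_total_no_dupes = df_total.drop_duplicates(subset=["name", "position", "position_ID"])
# this gives you the cases where a name occurs more than once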
name_counts = df_total_no_dupes.groupby("name").size().reset_index(name='counts')
names_which_appear_more_than_once = name_counts[name_counts["counts"] > 1]["name"].unique()
filter_condition = df_total_no_dupes["name"].isin(names_which_appear_more_than_once)
# this should be your dataframe where there are at least two rows with same name but different values.
print(df_total_no_dupes[filter_condition].sort_values("name"))
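If pandas feels heavy for this, the same check can be sketched with the standard library alone. This is only an illustration: the function names are made up, and read_rows assumes comma-separated files with a header row containing name, position and position_ID, which the question does not confirm.

```python
import csv
from collections import defaultdict

def split_by_consistency(rows):
    """Group rows by name and separate consistent names from conflicting ones.

    rows: iterable of dicts with 'name', 'position' and 'position_ID' keys.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row["name"]].add((row["position"], row["position_ID"]))
    # A name is consistent when every file agrees on one (position, position_ID).
    consistent = {name: next(iter(vals)) for name, vals in seen.items() if len(vals) == 1}
    conflicting = {name: sorted(vals) for name, vals in seen.items() if len(vals) > 1}
    return consistent, conflicting

def read_rows(paths):
    # Assumption: comma-separated files with a header row.
    for path in paths:
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)
```

Calling split_by_consistency(read_rows(list_of_paths)) gives two dictionaries; the first can be written to result.csv and the second to error.csv with csv.writer.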

Streaming Large (5gb) CSV’s FAST (in parallel?) on 16-core Machine?

I have 30 CSV files (saved as .txt files) ranging from 2GB to 11GB each on a server machine with 16 cores.
Each row of each CSV contains a date, a time, and an ID.
I need to construct a dense matrix of size datetime x ID (roughly 35,000 x 2000), where each cell is count of rows that had this datetime and ID (so each CSV row’s datetime and ID are used as matrix indices to update this matrix). Each file contains a unique range of datetimes, so this job is embarrassingly parallel across files.
Question: What is a faster/fastest way to accomplish this & (possibly) parallelize it? I am partial to Python, but could work in C++ if there is a better solution there. Should I re-write with MapReduce or MPI? Look into Dask or Pandas? Compile my python script somehow? Something else entirely?
My current approach (which I would happily discard for something faster):
Currently, I am doing this serially (one CSV at a time) in Python and saving the output matrix in h5 format. I stream a CSV line-by-line from the command line using:
cat one_csv.txt | my_script.py > outputfile.h5
And my python script works like:
# initialize matrix
…
for line in sys.stdin:
    # Split the line into data columns
    split = line.replace('\n','').split(',')
    ...(extract & process datetime; extract ID)...
    # Update matrix
    matrix[datetime, ID] = matrix[datetime, ID] +1
EDIT Below are a few example lines from one of the CSV's. The only relevant columns are 'dateYMDD' (formatted so that '80101' means jan. 1 2008), 'time', and 'ID'. So for example, the code should use the first row of the CSV below to add 1 to the matrix cell corresponding to (Jan_1_2008_00_00_00, 12).
Also: There are many more unique times than unique ID's, and the CSV's are time-sorted.
Type|Number|dateYMDD|time|ID
2|519275|80101|0:00:00|12
5|525491|80101|0:05:00|25
2|624094|80101|0:12:00|75
5|623044|80102|0:01:00|75
6|658787|80102|0:03:00|4

First of all, you should probably profile your script to make sure the bottleneck is actually where you think.
That said, Python's Global Interpreter Lock will make parallelizing it difficult, unless you use multiprocessing, and I expect it will be faster to simply process them separately and merge the results: feed each Python script one CSV and output to one table, then merge the tables. If the tables are much smaller than the CSVs (as one would expect if the cells have high values) then this should be relatively efficient.
I don't think that will get you all-caps full-throttle FAST like you mentioned, though. If that doesn't meet your expectations I would think of writing it in C++, Rust or Cython.
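As a rough sketch of the "one process per file, then merge" idea, assuming the pipe-delimited layout from the question's sample (Type|Number|dateYMDD|time|ID). The function names are mine, and counting into a Counter rather than a dense matrix is a simplification; whether this is actually faster should be checked by profiling.

```python
from collections import Counter
from multiprocessing import Pool

def count_rows(lines):
    """Count rows per (dateYMDD, time, ID) key from pipe-delimited lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 5 or fields[0] == "Type":
            continue  # skip the header row and malformed lines
        _type, _number, date, time, id_ = fields
        counts[(date, time, id_)] += 1
    return counts

def count_file(path):
    # Runs in a worker process; streams the file line by line.
    with open(path) as handle:
        return count_rows(handle)

def count_files(paths, processes=16):
    """Count each file in its own worker process and merge the partial counts."""
    total = Counter()
    with Pool(processes) as pool:
        for partial in pool.imap_unordered(count_file, paths):
            total.update(partial)  # Counter.update adds counts together
    return total
```

count_files should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Turning the merged counts into the dense datetime x ID matrix is a separate, much cheaper step.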

Sorting in pandas for large datasets

I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work or rather works for only small datasets.
data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)
Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

In the past, I've used Linux's pair of venerable sort and split utilities, to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.), for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it numerically by the 3rd column, in descending order (drop the r from -nrk3 for ascending order):
sort --parallel=8 -t , -nrk3 stuff.csv
Note that the number of threads here is set to 8.
The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of length at most 100000 lines.
Now you would sort each file individually, as above. Finally, you would merge the sorted parts, again through (wait for it...) sort, passing the same key options so that the merge compares lines the same way:
sort -m -t , -nrk3 sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
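If the shell tools are not available (on Windows, say), the same split / sort / merge idea can be sketched in plain Python with heapq.merge. This is an illustration rather than a tuned implementation: the function names are made up, it assumes a header-less, comma-separated file with a numeric sort column, and it sorts in ascending order.

```python
import heapq
import itertools
import os
import tempfile

def sort_key(line, column=2):
    # Numeric value of the chosen column (0-based, so 2 is the 3rd column).
    return float(line.rstrip("\n").split(",")[column])

def external_sort(src, dst, chunk_lines=100000, column=2):
    """Sort src into dst while holding at most chunk_lines lines in memory."""
    key = lambda line: sort_key(line, column)
    chunk_paths = []
    with open(src) as handle:
        while True:
            chunk = list(itertools.islice(handle, chunk_lines))
            if not chunk:
                break
            # Make sure a final line without a newline does not get glued to its neighbour.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort(key=key)
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv")
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)
    files = [open(path) for path in chunk_paths]
    try:
        with open(dst, "w") as out:
            # heapq.merge lazily merges the already sorted chunk files.
            out.writelines(heapq.merge(*files, key=key))
    finally:
        for fh in files:
            fh.close()
        for path in chunk_paths:
            os.remove(path)
```

For example, external_sort("stuff.csv", "final_sorted_stuff.csv") would play the role of the split, per-part sort and merge commands together.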

As I referred in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored-DataFrame.
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.
In the docs, it says:
sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
Another approach consists in using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This makes it possible to solve the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying here the relevant parts.
Introduction
This notebook explores a 3.9Gb CSV file.
This notebook is a primer on out-of-memory data analysis with
pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook: An interface for writing and sharing python code, text, and plots.
SQLite: A self-contained, server-less database that's easy to set up and query from Pandas.
Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
Load the CSV, chunk-by-chunk, into a DataFrame
Process the data a bit, strip out uninteresting columns
Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
Which 10 cities have the most complaints?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
Possibly related and useful links
Pandas: in memory sorting hdf5 files
ptrepack sortby needs 'full' index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html
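Applied to the original sorting question, the database route might look like the sketch below, which uses the standard-library sqlite3 module instead of pandas and SQLAlchemy. The two-column layout (col_1, p_value) and the function name are assumptions for illustration. Rows are inserted in chunks, and the sorted result is streamed back one row at a time, so the full table never has to be held in Python's memory.

```python
import itertools
import sqlite3

def load_and_sort(rows, db_path=":memory:", chunk_rows=20000):
    """Insert (col_1, p_value) rows in chunks, then yield them ordered by p_value."""
    con = sqlite3.connect(db_path)  # pass a file path to keep the data on disk
    try:
        # REAL affinity makes SQLite store numeric-looking text as numbers.
        con.execute("CREATE TABLE data (col_1 TEXT, p_value REAL)")
        rows = iter(rows)
        while True:
            chunk = list(itertools.islice(rows, chunk_rows))
            if not chunk:
                break
            con.executemany("INSERT INTO data VALUES (?, ?)", chunk)
            con.commit()
        # The index lets SQLite return rows in p_value order without a separate sort pass.
        con.execute("CREATE INDEX idx_data_p_value ON data (p_value)")
        yield from con.execute("SELECT col_1, p_value FROM data ORDER BY p_value")
    finally:
        con.close()
```

With a real file, rows would typically be a csv.reader over the open CSV (after skipping the header with next(reader)), and the loop consuming load_and_sort would write each row to the output file.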

Blaze might be the tool for you with the ability to work with pandas and csv files out of core.
http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, load it into a database first which blaze can control. But if this is a one off and you have some time then the posted code should do it.

If your CSV file contains only structured data, I would suggest an approach using only Linux commands.
Assume csv file contains two columns, COL_1 and P_VALUE:
map.py:
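#!/usr/bin/env python3
import sys
# Move p_value to the front of each line so that sort orders by it (assumes no header row)
for line in sys.stdin:
    col_1, p_value = line.rstrip('\n').split(',')
    print("%f,%s" % (float(p_value), col_1))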
Then the following Linux command will generate a CSV file sorted by p_value:
cat input.csv | ./map.py | sort > output.csv
If you're familiar with Hadoop, the above map.py plus a simple reduce.py will generate the sorted CSV file via the Hadoop streaming system.

Here is my honest suggestion: three options you can try.
I like pandas for its rich documentation and features, but I have been advised to use NumPy, as it feels comparatively faster for larger datasets. You can also consider other tools to make the job easier.
If you are using Python 3, you can break your big data chunk into sets and process them with concurrent threads. I am too lazy for this and it does not look elegant; as far as I know, pandas, NumPy and SciPy are built with hardware design in mind to enable multithreading.
I prefer this one, as it is the easy and lazy technique in my view. Check the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use the 'kind' parameter in the pandas sort function you are using.
Godspeed my friend.
