Removing duplicates in a huge .csv file

I have a csv file of this format
testname unitname time data
test1 1 20131211220159 123123
test1 1 20131211220159 12345
test1 1 20131211230180 1234
I am trying to remove all old data from this file and retain only the data with the latest timestamp. (The first two rows above should be deleted because the last timestamp is greater than the first two timestamps.) I want to keep all test data unless the same test and same unit was repeated at a later time. The input file is sorted by time (so older data goes down below).
The file is about 15 MB (output_temp.csv). I copied it as output_temp2.csv.
This is what I have:
file1=open("output_temp.csv","r")
file2=open("output_temp2.csv","r")
file3=open("output.csv","w")
flag=0
linecounter=0
for line in file1:
    testname=line[0]
    vid=line[1]
    tstamp=line[2]
    file2.seek(0) #reset
    for i in range(linecounter):
        file2.readline() #came down to the line #
    for line2 in file2:
        if testname==line2.split(",")[0] and vid==line2.split(",")[1] and tstamp!=line2.split(",")[2]:
            flag==1
            print line
        if flag==1:
            break
    if flag==0:
        file3.write(line)
    linecounter=linecounter+1 #going down is ok dont go up.
    flag=0
This is taking really long to process. I think the logic might be OK, but it's literally taking 10 minutes per 100 KB and I have a long way to go.

The main reason this is slow is that you're reading the entire file (or, rather, a duplicate copy of it) for each line in the file. So, if there are 10,000 lines, you're reading 10,000 lines 10,000 times, meaning 100,000,000 total line reads!
If you have enough memory to save the lines read so far, there's a really easy solution: Store the lines seen so far in a set. (Or, rather, for each line, store the tuple of the three keys that count for being a duplicate.) For each line, if it's already in the set, skip it; otherwise, process it and add it to the set.
For example:
seen = set()
for line in infile:
    testname, vid, tstamp = line.split(",", 3)[:3]
    if (testname, vid, tstamp) in seen:
        continue
    seen.add((testname, vid, tstamp))
    outfile.write(line)
The itertools recipes in the docs have a function unique_everseen that lets you wrap this up even more nicely:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])
for line in unique_everseen(infile, key=keyfunc):
    outfile.write(line)
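Note that unique_everseen cannot be imported from itertools itself; it has to be copied from the recipes section of the docs (or taken from a third-party package such as more-itertools). A simplified sketch of the idea, written for Python 3 and not the exact recipe text, with the question's sample rows assumed to be comma-separated:

```python
def unique_everseen(iterable, key=None):
    """Yield items in order, skipping any whose key has been seen before."""
    seen = set()
    for item in iterable:
        k = item if key is None else key(item)
        if k not in seen:
            seen.add(k)
            yield item


def keyfunc(line):
    # First three comma-separated fields: testname, vid, tstamp.
    return tuple(line.split(",", 3)[:3])


# Sample rows from the question, assumed here to be comma-separated.
lines = [
    "test1,1,20131211220159,123123\n",
    "test1,1,20131211220159,12345\n",
    "test1,1,20131211230180,1234\n",
]
unique_lines = list(unique_everseen(lines, key=keyfunc))
```

Only the first row for each (testname, vid, tstamp) key survives, so the second sample row is dropped.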
If the set takes too much memory, you can always fake a set on top of a dict, and you can fake a dict on top of a database by using the dbm module, which will do a pretty good job of keeping enough in memory to make things fast but not enough to cause a problem. The only problem is that dbm keys have to be strings, not tuples of three strings… but you can always just keep them joined up (or re-join them) instead of splitting, and then you've got a string.
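A rough sketch of the dbm idea, assuming Python 3 and comma-separated rows. The function name dedupe_with_dbm and the database path are made up for illustration; the keys are re-joined and encoded to bytes because dbm cannot store tuples:

```python
import dbm
import os
import tempfile


def dedupe_with_dbm(lines, db_path):
    """Yield lines whose first three comma-separated fields were not seen before.

    The "seen" set lives in a dbm database at db_path instead of in memory.
    """
    with dbm.open(db_path, "n") as seen:  # "n": always create a new, empty database
        for line in lines:
            # Re-join the three key fields into one string, then encode to bytes.
            key = ",".join(line.split(",", 3)[:3]).encode("utf-8")
            if key in seen:
                continue
            seen[key] = b""  # the value is irrelevant; only the key matters
            yield line


sample = [
    "test1,1,20131211220159,123123\n",
    "test1,1,20131211220159,12345\n",
    "test1,1,20131211230180,1234\n",
]
with tempfile.TemporaryDirectory() as tmp:
    kept = list(dedupe_with_dbm(sample, os.path.join(tmp, "seen")))
```

Every membership test now goes through the database layer, so this will usually be slower than an in-memory set; it only pays off when the set genuinely does not fit in memory.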
I'm assuming that when you say the file is "sorted", you mean in terms of the timestamp, not in terms of the key columns. That is, there's no guarantee that two rows that are duplicates will be right next to each other. If there were, this is even easier. It may not seem easier if you use the itertools recipes; you're just replacing everseen with justseen:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])
for line in unique_justseen(infile, key=keyfunc):
    outfile.write(line)
But under the covers, this is only keeping track of the last line, rather than a set of all lines. Which is not only faster, it also saves a lot of memory.
Now that (I think) I understand your requirements better, what you actually want to get rid of is not all but the first line with the same testname, vid, and tstamp, but rather all lines with the same testname and vid except the one with the highest tstamp. And since the file is sorted in ascending order of tstamp, that means you can ignore the tstamp entirely; you just want the last match for each.
This means the everseen trick won't work—we can't skip the first one, because we don't yet know there's a later one.
If we just iterated the file backward, that would solve the problem. It would also double your memory usage (because, in addition to the set, you're also keeping a list so you can stack up all of those lines in reverse order). But if that's acceptable, it's easy:
def keyfunc(line):
    return tuple(line.split(",", 2)[:2])
for line in reversed(list(unique_everseen(reversed(list(infile)), key=keyfunc))):
    outfile.write(line)
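A different technique from the double reversal, offered only as a sketch: keep a dict that maps each (testname, vid) pair to the last line seen for it, then emit the survivors in their original relative order. The function name keep_latest is made up, and it assumes Python 3 and comma-separated rows. It holds one line per key in memory rather than the whole file:

```python
def keep_latest(lines):
    """Keep only the last line seen for each (testname, vid) pair."""
    latest = {}  # (testname, vid) -> (position of last occurrence, line)
    for pos, line in enumerate(lines):
        key = tuple(line.split(",", 2)[:2])
        latest[key] = (pos, line)  # a later line overwrites an earlier one
    # Positions are unique, so sorting the (pos, line) pairs restores file order.
    return [line for pos, line in sorted(latest.values())]


rows = [
    "test1,1,20131211220159,123123\n",
    "test2,1,20131211220159,777\n",
    "test1,1,20131211230180,1234\n",
]
result = keep_latest(rows)
```

Like the reversed version, this relies on the file being in ascending timestamp order, so that "last seen" means "latest".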
If turning those lazy iterators into lists so we can reverse them takes too much memory, it's probably fastest to do multiple passes: reverse the file on disk, then filter the reversed file, then reverse it again. It does mean two extra file writes, but that can be a lot better than, say, your OS's virtual memory swapping to and from disk hundreds of times (or your program just failing with a MemoryError).
If you're willing to do the work, it wouldn't be that hard to write a reverse file iterator, which reads buffers from the end and splits on newlines and yields the same way the file/io.Whatever object does. But I wouldn't bother unless you turn out to need it.
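For what it's worth, a minimal sketch of such a reverse iterator in Python 3 follows. The name reversed_lines and the bufsize parameter are made up for illustration; it works on bytes, yields lines without their trailing newline, and assumes "\n" line endings:

```python
import os
import tempfile


def reversed_lines(path, bufsize=8192):
    """Yield the lines of a file last-to-first, reading blocks from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        if pos == 0:
            return  # empty file: nothing to yield
        tail = b""  # start of a line whose beginning lies in an earlier block
        at_end = True
        while pos > 0:
            step = min(bufsize, pos)
            pos -= step
            f.seek(pos)
            block = f.read(step) + tail
            if at_end and block.endswith(b"\n"):
                block = block[:-1]  # ignore the file's final newline
            at_end = False
            pieces = block.split(b"\n")
            tail = pieces.pop(0)  # may be incomplete; finish it with the next block
            for piece in reversed(pieces):
                yield piece
        yield tail  # the first line of the file


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as out:
        out.write(b"a\nbb\nccc\n")
    result = list(reversed_lines(demo_path, bufsize=2))
```

The tiny bufsize in the demo is only there to exercise lines that straddle block boundaries; a real run would use a much larger buffer.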
If you ever do need to repeatedly read particular line numbers out of a file, the linecache module will usually speed things up a lot. Nowhere near as fast as not re-reading at all, of course, but a lot better than reading and parsing thousands of newlines.
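A small illustration of linecache in Python 3, using a throwaway temporary file. Keep in mind that linecache reads and caches the whole file in memory, so it trades memory for speed:

```python
import linecache
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as f:
        f.write("test1,1,20131211220159,123123\n"
                "test1,1,20131211220159,12345\n"
                "test1,1,20131211230180,1234\n")
    third = linecache.getline(path, 3)     # line numbers start at 1
    missing = linecache.getline(path, 99)  # out of range gives "" instead of raising
    linecache.clearcache()                 # drop the cached copy of the file
```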
You're also wasting time repeating some work in the inner loop. For example, you call line2.split(",") three times, instead of just splitting it once and stashing the value in a variable, which would be three times as fast. A 3x constant gain is nowhere near as important as a quadratic to linear gain, but when it comes for free by making your code simpler and more readable, might as well take it.

For a file of this size (~15 MB), pandas would be an excellent choice.
Like this:
import pandas as pd
raw_data = pd.read_csv('/path/to/raw_csv.csv')  # placeholder path to the input file
clean_data = raw_data.drop_duplicates()
clean_data.to_csv('/path/to/clean_csv.csv')
I was able to process a CSV file of about 151 MB containing more than 5.9 million rows in less than a second with the above snippet.
Note that duplicate checking can be restricted to a subset of columns, and you can choose which duplicate to keep, e.g. drop_duplicates(subset=['testname', 'unitname'], keep='last').
pandas provides a lot of these features out of the box; see the drop_duplicates documentation.

Related

slurp/csv/loop a file to create a list of dictionaries

I have a large file (1.6 gigs) with millions of rows that has columns delimited with:
[||]
I have tried to use the csv module but it says I can only use a single character as a delimiter. So Here is what I have:
fileHandle = open('test.txt', 'r', encoding="UTF-16")
thelist = []
for line in fileHandle:
    fields = line.split('[||]')
    therow = {
        'dea_reg_nbr':fields[0],
        'bus_actvty_cd':fields[1],
        'drug_schd':fields[3],
        #50 more columns like this
    }
    thelist.append(therow)
fileHandle.close()
#now I have thelist which is what I want
And boom, now I have a list of dictionaries and it works. I want a list because I care about the order, and the dictionary because downstream it's being expected. This just feels like I should be taking advantage of something more efficient. I don't think this scales well with over a million rows and so much data. So, my question is as follows:
What would be the more efficient way of taking a multi-character delimited text file (UTF-16 encoded) and creating a list of dictionaries?
Any thoughts would be appreciated!

One way to make it scale better is to use a generator instead of loading all million rows into memory at once. This may or may not be possible depending on your use-case; it will work best if you only need to make one pass over the full data set. Multiple passes will require you to either store all the data in memory in some form or another or to read it from the file multiple times.
Anyway, here's an example of how you could use a generator for this problem:
def file_records():
    with open('test.txt', 'r', encoding='UTF-16') as fileHandle:
        for line in fileHandle:
            fields = line.split('[||]')
            therow = {
                'dea_reg_nbr':fields[0],
                'bus_actvty_cd':fields[1],
                'drug_schd':fields[3],
                #50 more columns like this
            }
            yield therow
for record in file_records():
    # do work on one record
The function file_records is a generator function because of the yield keyword. When this function is called, it returns an iterator that you can iterate over exactly like a list. The records will be returned in order, and each one will be a dictionary.
If you're unfamiliar with generators, this is a good place to start reading about them.
The thing that makes this scale so well is that you will only ever have one therow in memory at a time. Essentially what's happening is that at the beginning of every iteration of the loop, the file_records function is reading the next line of the file and returning the computed record. It will wait until the next row is needed before doing the work, and the previous record won't linger in memory unless it's needed (such as if it's referenced in whatever data structure you build in # do work on one record).
Note also that I moved the open call to a with statement. This will ensure that the file gets closed and all related resources are freed once the iteration is done or an exception is raised. This is much simpler than trying to catch all those cases yourself and calling fileHandle.close().
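With 50-odd columns, the dictionary literal could also be replaced by a list of field names zipped against the split fields. This is only a sketch under an assumption: the names in FIELD_NAMES (shortened and hypothetical here) are listed in exact column order with no gaps, which is not quite what the code above does, since it maps 'drug_schd' to fields[3]. This version takes any iterable of lines, such as an open file object:

```python
FIELD_NAMES = ["dea_reg_nbr", "bus_actvty_cd", "drug_schd"]  # hypothetical, shortened
DELIMITER = "[||]"


def file_records(lines):
    """Lazily turn delimited lines into dictionaries, one record at a time."""
    for line in lines:
        fields = line.rstrip("\n").split(DELIMITER)
        # zip pairs names with values by position; extra fields are ignored.
        yield dict(zip(FIELD_NAMES, fields))


sample = ["A1[||]C[||]2\n", "B2[||]D[||]3\n"]
records = list(file_records(sample))
```

In real use you would iterate over file_records(fileHandle) directly rather than calling list() on it, otherwise the memory benefit of the generator is lost.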

How to find same lines in two large text files?

I'd like to compare two large text files (200M) to find the lines they have in common.
How can I do that in Python?

Since they are just 200M, allocate enough memory, read them, sort the lines of each in ascending order, then iterate through both collections of lines in parallel, as in a merge operation, and delete those that occur in only one set.
Preserve line numbers in the collections and sort by line number afterwards if you want to output the lines in their original order.
Merge operation: keep one index for each collection. If the lines at both indexes match, increment both indexes; otherwise delete the smaller line and increment just that index. If either index is past the last line, delete all remaining lines in the other collection.
Optimization: use a hash to speed up comparisons a little; compute the hash during the initial read.
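The merge step described above could look roughly like this in Python 3. It is a sketch only: the function name common_lines is made up, it collects the shared lines instead of deleting the others, reports each shared line once, and returns them in sorted order rather than original order (the line-number bookkeeping is left out):

```python
def common_lines(lines_a, lines_b):
    """Return the lines present in both inputs, using a sort-and-merge walk."""
    a = sorted(lines_a)
    b = sorted(lines_b)
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            if not common or common[-1] != a[i]:  # report each shared line once
                common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more; skip it
        else:
            j += 1  # b[j] cannot appear in a any more; skip it
    return common


first = ["x\n", "b\n", "a\n", "b\n"]
second = ["b\n", "c\n", "x\n", "b\n"]
shared = common_lines(first, second)
```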

Disclaimer: I really have no idea how efficient this will be for 200Mb but it's worth the try I guess:
I have tried the following for two ~80mb files and the result was around 2.7 seconds in a 3GB Ram intel i3 machine.
f1 = open("one")
f2 = open("two")
print set(f1).intersection(f2)

You may be able to use the standard difflib module. The module offers several ways of creating difference deltas from various kinds of input.

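Here's an example adapted from the docs (context_diff expects sequences of lines, so the files are read with readlines() first):
>>> from difflib import context_diff
>>> fromlines = open('before.py').readlines()
>>> tolines = open('after.py').readlines()
>>> for line in context_diff(fromlines, tolines, fromfile='before.py', tofile='after.py'):
...     print line,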

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files - one containing a list of UUIDs, the other containing a large amount of log entries - each line containing one of the UUIDs from the other file. The purpose of the program is to create a list of the UUIDS from file1, then for each time that UUID is found in the log file, increment the associated value for each time a match is found.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key, and 'hits' as the value. Then another loop which iterates over each line of the log file, and checking if the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should, but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a test file of ~200MB was negligible. If anyone has any other methods I would be very grateful :)

Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
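If only the UUIDs from the first file should be counted, one possible modification is sketched below for Python 3. The regular expression and the function name count_hits are assumptions: the sketch takes the first UUID-looking token on each log line, matches case-sensitively, and expects the log to hold UUIDs in the usual 8-4-4-4-12 hex form:

```python
import collections
import re

# Assumed log format: each line contains a UUID in 8-4-4-4-12 hex form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def count_hits(known_uuids, log_lines):
    """Count log lines per known UUID with one dict lookup per line."""
    matches = (UUID_RE.search(line) for line in log_lines)
    counts = collections.Counter(m.group(0) for m in matches if m)
    # A Counter returns 0 for missing keys, so unseen UUIDs get a count of 0.
    return {uid: counts[uid] for uid in known_uuids}


u1 = "11111111-1111-1111-1111-111111111111"
u2 = "22222222-2222-2222-2222-222222222222"
u3 = "33333333-3333-3333-3333-333333333333"
log = [
    "allow rule=" + u1 + " src=10.0.0.1\n",
    "deny rule=" + u3 + " src=10.0.0.2\n",
    "allow rule=" + u1 + " src=10.0.0.3\n",
]
hits = count_hits([u1, u2], log)
```

The key difference from the original loop is that each log line is examined once, instead of once per known UUID.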

This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.

Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c

Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure if you'll have performance gain since I've not yet processed 10GB of data before using it, though you might explore this framework.

Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
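A minimal way to do that with the standard library's cProfile, sketched in Python 3. The function count_matches is a made-up stand-in for the real log-processing loop:

```python
import cProfile
import io
import pstats


def count_matches(lines, needle):
    """Stand-in workload: count the lines that contain needle."""
    return sum(1 for line in lines if needle in line)


profiler = cProfile.Profile()
profiler.enable()
total = count_matches(["abc", "xbz", "nope"] * 1000, "b")
profiler.disable()

# Write the five most expensive entries (by cumulative time) into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
summary = report.getvalue()
```

For a whole script, running python -m cProfile -s cumulative yourscript.py from the command line gives a similar report without changing the code.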
Where best to optimise will depend on the nature of your data: if the list of UUIDs isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.

Line by line comparison of large text files with python

I am working on some large (several million line) bioinformatics data sets with the general format:
chromosomeNumber locusStart locusStop sequence moreData
I have other files in this format:
chromosomeNumber locusStart locusStop moreData
What I need to be able to do is read one of each type of file into memory and if the locusStart of a line of the upper file is between the start and stop of any of the lines in the lower file, print the line to output file 1. If the locusStart of that line is not between the start and stop of any lines in the bottom file, then print it to output file 2.
I am currently reading the files in, converting them into dictionaries keyed on chromosome with the corresponding lines as values. I then split each value line into a string, and then do comparisons with the strings. This takes an incredibly long time, and I would like to know if there is a more efficient way to do it.
Thanks.

It seems that for the lower file (which I assume has the second format), the only field you are concerned about is 'locusStart'. Since, from your description, you do not necessarily care about the other data, you could make a set of all of the locusStart values:
locusStart_list = set()
with open(lower_file, 'r') as f:
    for line in f:
        tmp_list = line.strip().split()
        locusStart_list.add(tmp_list[1])
This removes all of the unnecessary line manipulation you do for the bottom file. Then, you can easily compare the locusStart of a field to the set built from the lower file. The set would also remove duplicates, making it a bit faster than using a list.

It sounds like you are going to be doing lots of greater than/less than comparisons, as such I don't think loading your data into dictionaries is going to enhance the speed of code at all--based on what you've explained it sounds like you're still looping through every element in one file or the other.
What you need is a different data structure to load your data into and run comparison operations with. Check out the Python bisect module; I think it may provide what you need to run your comparison operations much more efficiently.
If you can more precisely describe what exactly you're trying to accomplish, we'll be able to help you get started writing your code.
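To give an idea of how bisect could be applied here, the sketch below (Python 3) merges overlapping (start, stop) intervals from the lower file and then answers "is this locusStart inside any interval?" with a binary search. The function names are made up, bounds are treated as inclusive, and the intervals are assumed to belong to a single chromosome; in practice you would build one lookup per chromosome:

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping (start, stop) pairs into sorted, disjoint intervals."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], stop)  # extend the previous interval
        else:
            merged.append([start, stop])
    return merged


def make_lookup(intervals):
    """Return a function that tests whether a position lies in any interval."""
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]

    def contains(pos):
        # Index of the last interval whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        return i >= 0 and merged[i][0] <= pos <= merged[i][1]

    return contains


contains = make_lookup([(10, 20), (15, 30), (50, 60)])
answers = [contains(p) for p in (5, 10, 25, 40, 60)]
```

Each lookup costs O(log n) once the intervals are sorted, instead of a scan over every line of the lower file.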

Using a dictionary of the chromosome number is a good idea, as long as you can fit both files into memory.
You then want to sort both lists by locusStart (split the string, convert locusStart to a number--see instructions on sorting if you're unsure how to sort on locusStart alone).
Now you can just walk through your lists: if the lower locusStart is less than the first upper locusStart, put the line in file 2 and go on to the next one. If the lower locusStart is greater than the first upper locusStart, then:
  - While it is also greater than locusEnd, throw away the beginning of the upper list.
  - If you find a case where it's greater than locusStart and less than locusEnd, put it in file 1.
  - Otherwise, put it in file 2.
This should replace what is now probably an O(n^2) algorithm with an O(n log n) one.

list index out of range error received

Hey guys, beginner here. I have written a program that outputs files to .txt's and am using another to read them and use them. I have used a list to store these values (len(..) gives me 100 for all files). However, whenever I run this:
for w in range(1,20): # i want files file01-file20 excluding file00
    for x in range(100):
        c=c+1 #counter to keep list position on f=0
        exec "f=open('file%02d.txt','r').readlines()"%w #stores data from file00,file01,file02...
        f00=open('file00.txt','r').readlines() #same as ^ but from file00
        for y in range(100):
            xvp=float(f[c].rstrip('\n')) #the error is on this line; the file are stored in vertical order
            pvp=float(f00[y].rstrip('\n')) #maybe even this one
            #and i do stuff with those values...
I get in line 12,
xvp=float(f[c].rstrip('\n'))
IndexError: list index out of range
note: there are 100 numbers stored on separate lines in the .txt's
please, if there is any way to help you help me, let me know
thanks

You seem to be incrementing c two thousand times (20 times 100 -- actually only 1900 times, since range(1,20) will not reach the value 20, as you seem to desire in a comment) -- so of course you're going out of range if you use it to index a list of 100! The whole code is rather a mess and I suggest refactoring it radically, to avoid exec and do things the Python way. Assuming Python 2.6 or better (in 2.5, you need a from __future__ import with_statement at the start of your module):
f00 = open('file00.txt').readlines()
for w in range(1, 21):
    for x in range(100):
        with open('file%02d.txt' % w) as f:
            for line in f:
                xvp = float(line)
                for line00 in f00:
                    rvp = float(line00)
                    do_stuff(xvp, rvp)
I don't know if this is the logic you want -- coupling every line of file00.txt with each line from the 20 other files -- but at least this makes it clear which lines are coupled up with which;-). If what you want is to only couple the first line of file00.txt with the first line from each of the others, then second line with second lines, etc, then add import itertools at the start of your module and change the contents of the with into:
for line00, line in itertools.izip(f00, f):
    rvp = float(line00)
    xvp = float(line)
    do_stuff(xvp, rvp)
and so forth.
Note that I'm reading all of file00.txt in memory once and for all (into the f00 list of lines) because apparently you need to loop on those contents more than once, but that's not needed for the other files.
An obvious optimization is to convert file00.txt's lines to floats only once, replacing the f00 = statement with
with open('file00.txt') as f:
    rvps = [float(line) for line in f]
then use rvps directly instead of repeating the conversion every time on the strings in f00 -- for example, in the second version (the one using itertools.izip):
for rvp, line in itertools.izip(rvps, f):
    xvp = float(line)
    do_stuff(xvp, rvp)
Edit: I see I've done a number of tiny enhancements while hardly realizing I was doing so, maybe I'd better explain them;-). No need to pass 'r' when opening a file for reading (can't hurt, but it's quite idiomatic to omit it). No need to strip trailing (or for that matter leading) whitespace from a string before calling float on it -- float happily skips all such leading and trailing whitespace itself. I did fix what apparently was another bug (you'd never deal with file20.txt) by fixing the applicable range to range(1, 21).
The with open(...) as f: statements do the opening, bind name f to the open file object, and, as soon as the block of statements they control is finished, guarantee that the file is properly closed -- it should almost invariably be used in preference to a stand-alone open, because ensuring all files are closed ASAP is really very good practice (the with statement has many other excellent use cases, but this is the single most frequent one, and the only one that happens to be necessary for this functionality).
Looping directly on an open file object f (provided the file is opened in text mode, as is the default and applies throughout here), for line in f:, provides one after the other the lines of f (without ever needing to keep them all in memory at once) and is an extremely popular and good Pythonic idiom.
The construct rvps = [float(line) for line in f], which I use in my recommended optimization, is known as a "list comprehension" and it's a nicely speedy and compact alternative to a loop that builds a new list.
itertools.izip, given a number of iterables, provides a single iterable whose items are tuples made by the items of the other iterables "walked in lockstep". The built-in zip is similar, but (in Python 2) it builds a list in memory, which itertools.izip avoids, so it's good practice to learn to use the itertools version to avoid wasting memory (not really important for small files like the ones you have, but good habits are best learned and "just applied" rather than having to reflect on them every single time -- just as one doesn't start every morning pondering whether one should brush one's teeth, but just goes and does so as a matter of good habit;-).
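For readers on Python 3: itertools.izip no longer exists there, and the built-in zip is itself lazy, so the lockstep loop can use zip directly. A small self-contained sketch, with io.StringIO standing in for an open file and made-up numbers:

```python
import io

rvps = [1.5, 2.5, 3.5]           # stands in for the floats read from file00.txt
f = io.StringIO("10\n20\n30\n")  # stands in for one of the other open files

pairs = []
for rvp, line in zip(rvps, f):   # Python 3: zip is lazy, like Python 2's izip
    xvp = float(line)            # float() ignores the trailing newline
    pairs.append((xvp, rvp))
```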
I'm sure there's more, but this is what comes to mind off-hand - feel free to ask if I can be of further assistance!

there are 100 numbers stored on separate lines in the .txt's
but in
for w in range(1,20): # i want files file01-file20 excluding file00
    for x in range(100):
        c=c+1 #counter to keep list position on f=0
you are incrementing c 20*100 = 2000 times.
Maybe you need to reset c = 0 inside the "w" loop, or just use x instead of c?

Based on how you describe your files, you are indexing into them incorrectly: c is incremented on every iteration of the second loop, so it will reach values of up to 2000. Using x seems to be the logical choice.
#restructured for efficiency
file = open('file00.txt','r')
f00 = file.readlines() #no need to reopen the file for every iteration
file.close() #always close the file when done with
for w in range(1,20):
    file = open('file%02d.txt'%w,'r')
    f = file.readlines() #only open once per iteration
    file.close()
    for x in range(100):
        xvp = float(f[x].rstrip('\n'))
        for y in range(100):
            pvp = float(f00[y].rstrip('\n'))
            #do stuff
