Most efficient way in Python to iterate over a large file (10GB+) - python

I'm working on a Python script that goes through two files - one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the first file. The purpose of the program is to build a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key and 'hits' as the value. Then another loop iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should, but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading the file in chunks rather than line by line would improve performance by reducing disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)

Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple of comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this (see the sketch below).
Your code is slow because you are looping over every UUID for every line instead of doing a single lookup; a dict (or a Counter) keyed by UUID is what you want here.
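To address the first point, a rough sketch of how the counting could be restricted to the UUIDs from file1 (the uuid() extraction, the delimiter/field constants, and the file names are assumptions about your format, not part of the original question):
from collections import Counter

SPLIT_CHAR = " "   # assumed log field delimiter
UUID_FIELD = 0     # assumed position of the UUID in each log line

def uuid(line):
    return line.split(SPLIT_CHAR)[UUID_FIELD]

wanted = set(line.strip() for line in open("uuids.txt"))   # hypothetical file1
log_uuids = (uuid(line) for line in open("log.txt"))
uidHits = Counter(u for u in log_uuids if u in wanted)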

This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.

Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c

Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure you'll get a performance gain, since I've not processed 10GB of data with it before, but you might explore this framework.

Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of UUIDs isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)" check. If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
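For illustration, a minimal sketch of that last suggestion, assuming the UUID sits in a known field of each log line (SPLIT_CHAR and UUID_FIELD as in the generator answer above):
for logLine in logHandle:
    uid = logLine.split(SPLIT_CHAR)[UUID_FIELD]   # extract the id first
    if uid in uidHits:                            # single dict lookup per line
        uidHits[uid] += 1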

Related

Removing duplicates in a huge .csv file

I have a csv file of this format
testname unitname time data
test1 1 20131211220159 123123
test1 1 20131211220159 12345
test1 1 20131211230180 1234
I am trying to remove all old data from this file and retain only the data with the latest timestamp (the first two rows above should be deleted because the last timestamp is greater than the first two). I want to keep all test data unless the same test on the same unit was repeated at a later time. The input file is sorted by time (so older data goes down below).
The file is about 15 MB (output_temp.csv). I copied it as output_temp2.csv.
This is what I have:
file1 = open("output_temp.csv", "r")
file2 = open("output_temp2.csv", "r")
file3 = open("output.csv", "w")
flag = 0
linecounter = 0
for line in file1:
    testname = line[0]
    vid = line[1]
    tstamp = line[2]
    file2.seek(0) #reset
    for i in range(linecounter):
        file2.readline() #came down to the line #
    for line2 in file2:
        if testname == line2.split(",")[0] and vid == line2.split(",")[1] and tstamp != line2.split(",")[2]:
            flag == 1
            print line
        if flag == 1:
            break
    if flag == 0:
        file3.write(line)
    linecounter = linecounter + 1 #going down is ok dont go up.
    flag = 0
This is taking really long to process; I think the logic might be OK, but it's literally taking 10 minutes per 100 KB and I have a long way to go.
The main reason this is slow is that you're reading the entire file (or, rather, a duplicate copy of it) for each line in the file. So, if there are 10000 lines, you're reading 10000 lines 10000 times, meaning 100,000,000 total line reads!
If you have enough memory to save the lines read so far, there's a really easy solution: Store the lines seen so far in a set. (Or, rather, for each line, store the tuple of the three keys that count for being a duplicate.) For each line, if it's already in the set, skip it; otherwise, process it and add it to the set.
For example:
seen = set()
for line in infile:
    testname, vid, tstamp = line.split(",", 3)[:3]
    if (testname, vid, tstamp) in seen:
        continue
    seen.add((testname, vid, tstamp))
    outfile.write(line)
The itertools recipes in the docs have a function unique_everseen that lets you wrap this up even more nicely:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_everseen(infile, key=keyfunc):
    outfile.write(line)
If the set takes too much memory, you can always fake a set on top of a dict, and you can fake a dict on top of a database by using the dbm module, which will do a pretty good job of keeping enough in memory to make things fast but not enough to cause a problem. The only problem is that dbm keys have to be strings, not tuples of three strings… but you can always just keep them joined up (or re-join them) instead of splitting, and then you've got a string.
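A minimal sketch of that idea, using Python 3's dbm module (anydbm on Python 2) and joining the three key fields into one string; the database file name is an assumption:
import dbm

with dbm.open("seen_keys", "c") as seen, \
        open("output_temp.csv") as infile, \
        open("output.csv", "w") as outfile:
    for line in infile:
        key = ",".join(line.split(",", 3)[:3])  # dbm keys must be strings
        if key in seen:
            continue
        seen[key] = ""
        outfile.write(line)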
I'm assuming that when you say the file is "sorted", you mean in terms of the timestamp, not in terms of the key columns. That is, there's no guarantee that two rows that are duplicates will be right next to each other. If there were such a guarantee, this would be even easier. It may not look much easier if you use the itertools recipes; you're just replacing everseen with justseen:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_justseen(infile, key=keyfunc):
    outfile.write(line)
But under the covers, this is only keeping track of the last line, rather than a set of all lines. Which is not only faster, it also saves a lot of memory.
Now that (I think) I understand your requirements better, what you actually want to get rid of is not all but the first line with the same testname, vid, and tstamp, but rather all lines with the same testname and vid except the one with the highest tstamp. And since the file is sorted in ascending order of tstamp, that means you can ignore the tstamp entirely; you just want the last match for each.
This means the everseen trick won't work—we can't skip the first one, because we don't yet know there's a later one.
If we just iterated the file backward, that would solve the problem. It would also double your memory usage (because, in addition to the set, you're also keeping a list so you can stack up all of those lines in reverse order). But if that's acceptable, it's easy:
def keyfunc(line):
    return tuple(line.split(",", 2)[:2])

for line in reversed(list(unique_everseen(reversed(list(infile)), key=keyfunc))):
    outfile.write(line)
If turning those lazy iterators into lists so we can reverse them takes too much memory, it's probably fastest to do multiple passes: reverse the file on disk, then filter the reversed file, then reverse it again. It does mean two extra file writes, but that can be a lot better than, say, your OS's virtual memory swapping to and from disk hundreds of times (or your program just failing with a MemoryError).
If you're willing to do the work, it wouldn't be that hard to write a reverse file iterator, which reads buffers from the end and splits on newlines and yields the same way the file/io.Whatever object does. But I wouldn't bother unless you turn out to need it.
If you ever do need to repeatedly read particular line numbers out of a file, the linecache module will usually speed things up a lot. Nowhere near as fast as not re-reading at all, of course, but a lot better than reading and parsing thousands of newlines.
You're also wasting time repeating some work in the inner loop. For example, you call line2.split(",") three times, instead of just splitting it once and stashing the value in a variable, which would be three times as fast. A 3x constant gain is nowhere near as important as a quadratic to linear gain, but when it comes for free by making your code simpler and more readable, might as well take it.
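Concretely, the inner-loop test from the question could split once and reuse the pieces, e.g.:
fields = line2.split(",")   # split once instead of three times
if testname == fields[0] and vid == fields[1] and tstamp != fields[2]:
    flag = 1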
For a file of this size (~15 MB), Pandas would be an excellent choice.
Like this:
import pandas as pd
raw_data = pd.read_csv('output_temp.csv')
clean_data = raw_data.drop_duplicates()
clean_data.to_csv('/path/to/clean_csv.csv')
I was able to process a CSV file of about 151 MB, containing more than 5.9 million rows, in less than a second with the above snippet.
Please note that the duplicate check can be made conditional, or restricted to a subset of fields to be matched for duplicates.
Pandas provides a lot of these features out of the box; see its documentation.
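For this particular question the duplicate check needs to ignore the timestamp and keep only the newest row per test/unit, which drop_duplicates supports via its subset and keep parameters. A sketch, assuming the column names from the sample data above:
import pandas as pd

raw_data = pd.read_csv("output_temp.csv")
# keep only the last row per (testname, unitname) pair; since the file is
# sorted by time, the last row is the one with the newest timestamp
clean_data = raw_data.drop_duplicates(subset=["testname", "unitname"], keep="last")
clean_data.to_csv("output.csv", index=False)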

Alternative to Python Multiprocessing Manager dict for large read only store

I'm using multiprocessing with a large (~5G) read-only dict used by the processes. I started by passing the whole dict to each process, but ran into memory constraints, so I changed to using a multiprocessing Manager dict (after reading this: How to share a dictionary between multiple processes in python without locking).
Since the change, performance has dived. What alternatives are there for a faster shared data store? The dict has 40-character string keys and two-element tuples of small strings as values.
Use a memory mapped file. While this might sound insane (performance wise), it might not be if you use some clever tricks:
Sort the keys so you can use binary search in the file to locate a record
Try to make each line of the file the same length ("fixed width records")
If you can't use fixed width records, use this pseudo code:
Read 1KB in the middle (or enough to be sure the longest line fits *twice*)
Find the first new line character
Find the next new line character
Get a line as a substring between the two positions
Check the key (first 40 bytes)
If the key is too big, repeat with a 1KB block in the first half of the search range, else in the upper half of the search range
If the performance isn't good enough, consider writing an extension in C.
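For what it's worth, a rough sketch of the fixed-width-record variant in Python, with an assumed file name and record length (each record: 40-byte key plus a padded value field and a newline):
import mmap

KEY_LEN = 40
REC_LEN = 64   # assumed total record length in bytes, including the newline

def lookup(mm, key):
    """Binary search over sorted, fixed-width records in a memory-mapped file."""
    key = key.encode()
    lo, hi = 0, mm.size() // REC_LEN - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rec = mm[mid * REC_LEN:(mid + 1) * REC_LEN]
        if rec[:KEY_LEN] == key:
            return rec[KEY_LEN:].rstrip().decode()
        elif rec[:KEY_LEN] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

with open("store.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, "0123456789abcdef0123456789abcdef01234567"))
Each worker process can open and mmap the same file; the operating system shares the mapped pages between processes, which is what keeps the memory usage down.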

Checking information in a dataset in Python

I currently have a requirement to compare strings containing MAC addresses (e.g. "11:22:33:AA:BB:CC") using Python 2.7. At present, I have a preconfigured set containing MAC addresses, and my script iterates through the set comparing each new MAC address to those in the list. This works great, but as the set grows, the script massively slows down. With only 100 or so, you can notice a massive difference.
Does anybody have any advice on speeding up this process? Is storing them in a set the best way to compare or is it better to store them in a CSV / DB for example?
Sample of the code...
def Detect(p):
    stamgmtstypes = (0,2,4)
    if p.haslayer(Dot11):
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients:
                # This is the set with location_mutex:
                detection = p.addr2 + "\t" + str(datetime.now())
                print type(p.addr2)
                print detection, last_location
                observedclients.append(p.addr2)
First, you need to profile your code to understand where exactly the bottleneck is...
Also, as a generic recommendation, consider psyco, although there are a few times when psyco doesn't help
Once you find a bottleneck, cython may be useful, but you need to be sure that you declare all your variables in the cython source.
Try using a set. To declare an empty set, use set(), not [] (because the latter declares an empty list).
Lookup in a list has O(n) complexity; that is what happens in your case as the list grows (the cost grows linearly with n).
Lookup in a set has O(1) complexity on average.
http://wiki.python.org/moin/TimeComplexity
Also, you will need to change some part of your code. There is no append method in set, so you will need to use something like observedclients.add(address).
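Applied to the snippet in the question, a minimal sketch might look like this (the location_mutex/last_location details are omitted):
from datetime import datetime
from scapy.all import Dot11

observedclients = set()   # O(1) average membership tests

def Detect(p):
    stamgmtstypes = (0, 2, 4)
    if p.haslayer(Dot11) and p.type == 0 and p.subtype in stamgmtstypes:
        if p.addr2 not in observedclients:
            detection = p.addr2 + "\t" + str(datetime.now())
            print(detection)
            observedclients.add(p.addr2)   # add(), not append()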
The post mentions "the script iterates through the set comparing each new MAC address to those in the list."
To take full advantage of sets, don't loop over them doing one-by-one comparisons. Instead use set operations like union(), intersection(), and difference():
s = set(list_of_strings_containing_mac_addresses)
t = set(preconfigured_set_of_mac_addresses)
print s - t, 'addresses in the list but not preconfigured'

Improve speed of reading and converting from binary file?

I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read on a dual-core with 2Mb RAM, and my files typically have 20+ MB, which gives a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": bad-arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or just normal for Python.
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(of course) If anyone suggests some modification to this code, preferably based on previous experience with similar operations.
Thanks for reading
(I have already posted a few questions about this job of mine; I hope they are all conceptually unrelated, and I also hope I'm not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by @eumiro (removed the typoed brackets).
Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I very gladly thank everyone who kindly answered.
Final Form after going with array.fromfile() once, and then alternately extending one array for each channel via slicing the big array:
fullsamples = array('h')
fullsamples.fromfile(f, os.path.getsize(f.filename)/fullsamples.itemsize - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position+samples])
        position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is 40x faster than struct.unpack from @samplebias's answer.
If the files are only 20-30M, why not read the entire file, decode the nums in a single call to unpack and then distribute them among your channels by iterating over the array:
data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) / 2), data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
A single array.fromfile call is definitely fastest, but it won't work if the data series is interleaved with other value types.
In such cases, another big speed increase that can be combined with the previous struct answers is to precompile a struct.Struct object with the format for each chunk, instead of calling the unpack function multiple times. From the docs:
Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
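A more complete sketch of that idea, with an assumed file name and the same interleaved "hf" layout, falling back to single-record unpacking for the short final chunk:
import struct

chunksize = 1000
record = struct.Struct("hf")
chunk = struct.Struct("hf" * chunksize)

values = []
with open("data.bin", "rb") as fileobj:
    while True:
        chunkdata = fileobj.read(chunk.size)
        if len(chunkdata) == chunk.size:
            values.extend(chunk.unpack(chunkdata))
        else:
            # last, shorter read: unpack the remaining whole records one by one
            for offset in range(0, len(chunkdata), record.size):
                values.extend(record.unpack_from(chunkdata, offset))
            break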
extend() accepts any iterable; that is to say, instead of .extend([...]) you can write .extend(...). This is likely to speed up the program because extend() will then consume a generator instead of a fully built list.
There is an inconsistency in your code: you first define channel_content = {}, and afterwards you perform channel_content[channel]['recording'].extend(...), which requires the prior existence of a key channel and a subkey 'recording' with a list as its value in order to have something to extend.
What is the nature of self.channel_content[channel]['nsamples'], such that it can be passed to the int() function?
Where does number_of_intervals come from? What is the nature of the intervals?
In the for rec in xrange(number_of_intervals): loop, rec is never used again, so it seems to me that you are repeating the same for channel in channel_names: loop as many times as the number expressed by number_of_intervals. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f?
I read in the doc:
class struct.Struct(format)
Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility of using dict() with a generator as its argument.
EDIT
I propose:
channel_content = {}
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(self.channel_content[channel]['nsamples'])
        channel_content[channel]['recording'].extend(
            struct.unpack(str(N) + "h", f.read(2 * N)))
I don't know how to take J.F. Sebastian's suggestion to use array into account.
Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time like:
s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)

Comparing file contents in Python

I have two files, say source and target. I compare each element in source to check if it also exists in target. If it does not exist in target, I print it (the end goal is to have 0 differences). Here is the code I have written.
def finddefaulters(source, target):
    f = open(source, 'r')
    g = open(target, 'r')
    reference = f.readlines()
    done = g.readlines()
    for i in reference:
        if i not in done:
            print i,
I need help with
How would this code be rated on a scale of 1-10
How can I make it better and optimal if the file sizes are huge.
Another question - when I read all the lines as list elements, they are interpreted as 'element\n', so for correct comparison I have to add a newline at the end of each file. Is there a way to strip the newlines so I do not have to add a newline at the end of the files? I tried rstrip, but it did not work.
Thanks in advance.
Regarding efficiency: the method you show has an asymptotic runtime complexity of O(m*n), where m and n are the number of elements in reference and done, i.e. if you double the size of both lists, the algorithm will run 4 times longer (times a fixed constant that is uninteresting to theoretical computer scientists). If m and n are very large, you will probably want to choose a faster algorithm, e.g. sort the two lists first using .sort() (runtime complexity: O(n * log(n))) and then go through the lists just once (runtime complexity: O(n)). That algorithm has a worst-case runtime complexity of O(n * log(n)), which is already a big improvement. However, you trade readability and simplicity of the code for efficiency, so I would only advise you to do this if absolutely necessary.
Regarding coding style: you do not .close() the file handles, which you should. Instead of opening and closing the file handles explicitly, you could use Python's with statement. Also, if you like the functional style, you could replace the for loop with a list comprehension:
for i in reference:
    if i not in done:
        print i,
then becomes:
items = [i.strip() for i in reference if i not in done]
print ' '.join(items)
However, this way you will not see any progress while the list is being composed.
As joaquin already mentions, you can loop over f directly instead of f.readlines() as file handles support the iterator protocol.
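Putting those style points together (with blocks, iterating the handles directly, and stripping the trailing newlines the last sub-question asks about), a sketch might look like:
def finddefaulters(source, target):
    with open(target) as g:
        done = [line.rstrip("\n") for line in g]
    with open(source) as f:
        for line in f:
            if line.rstrip("\n") not in done:
                print(line.rstrip("\n"))
Note that this only tidies up the file handling; the quadratic lookup cost discussed in the first answer is still there.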
Some ideas:
1) use with to open files safely:
with open(source) as f:
.............
The with statement is used to wrap the execution of a block with methods defined by a context manager. This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.
2) you can iterate over the lines of a file instead of using readlines:
for line in f:
..........
3) Although for this short snippet it could be enough, try to use more informative names for your variables. One-letter names are not recommended.
4) If you want to take advantage of the standard library, try the functions in the difflib module. For example:
make_file(fromlines, tolines[, fromdesc][, todesc][, context][, numlines])
Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML file containing a table showing line by line differences with inter-line and intra-line changes highlighted.
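make_file is a method of difflib.HtmlDiff, so a small usage sketch (file names assumed) would be:
import difflib

with open("source.txt") as f, open("target.txt") as g:
    fromlines = f.readlines()
    tolines = g.readlines()

html = difflib.HtmlDiff().make_file(fromlines, tolines, "source", "target")
with open("diff.html", "w") as out:
    out.write(html)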
