For an upcoming programming competition I solved a few of the tasks from former competitions.
Each task looks like this: we get a bunch of in-files (each containing one line of numbers and strings, e.g. "2 15 test 23 ..."), and we have to build a program that returns some computed values.
These in-files can be quite large: for instance 10 MB.
My code is the following:
with open(filename) as f:
    input_data = f.read().split()
This is quite slow, I guess mostly because of the split method. Is there a faster way?
What you have already looks like the best way for plain text IO on a one-line file.
10 MB of plain text is fairly large. If you need more speedup, you could consider pickling the data in a binary format instead of plain text, or, if the data is very repetitive, storing it compressed.
If one of your input files contains independent tasks (that is, you can work on a couple of tokens of the line at a time, without knowing tokens further ahead), you can do reading and processing in lockstep, by simply not reading the whole file at once.
def read_groups(f):
    chunksize = 4096  # how many bytes to read from the file at once
    buf = f.read(chunksize)
    while buf:
        if entire_group_inside(buf):  # placeholder: checks whether buf holds a complete group
            i = next_group_index(buf)  # placeholder: returns the index where the next group starts
            group, buf = buf[:i], buf[i:]
            yield group
        else:
            more = f.read(chunksize)
            if not more:  # EOF reached with a partial group left over
                yield buf
                break
            buf += more
with open(filename) as f:
    for data in read_groups(f):
        # do something with data
        pass
This has some advantages:
You don't need to read the whole file into memory (which, for a 10 MB file on a desktop, probably doesn't matter much).
If you do a lot of processing on each group of tokens, it may lead to better performance, as you'll have alternating I/O-bound and CPU-bound work. Modern OSs use sequential file prefetching to optimize linear file access, so, in practice, if you interleave I/O and CPU work, your I/O will end up being executed in parallel by the OS. Even if your OS has no such functionality, a modern disk will probably cache sequential access to blocks.
If you don't have much processing, though, your task is fundamentally I/O bound, and, as wim said, there isn't much you can do to speed it up as it stands, other than rethinking your input data format.
Related
I understand that generators in Python can help with reading and processing large files when specific transformations or outputs are needed (e.g. reading a specific column or computing an aggregation).
However, it's not clear to me whether there is any benefit to using generators when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful just to read the entire data without any data processing?
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory-intensive function, so while reading a file whose DataFrame will end up consuming 10 GB of memory, your actual peak memory usage may reach 30-40 GB. In cases like this, reading the file in smaller chunks might be the only option.
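As a minimal sketch of the chunked approach (assuming a hypothetical 'value' column you want to aggregate), only one chunk lives in memory at a time:
import pandas as pd
total = 0
# Each iteration yields a DataFrame of at most `chunksize` rows, so peak
# memory stays around one chunk plus the running result.
for chunk in pd.read_csv('sample_file.csv', chunksize=100_000):
    total += chunk['value'].sum()  # 'value' is a hypothetical column name
print(total)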
I have around 60 files, each containing around 900,000 lines, where each line is 17 tab-separated float numbers. For each line I need to do some calculation using the corresponding lines from all 60 files, but because of their huge size (each file is about 400 MB) and limited computation resources, it takes a very long time. I would like to know whether there is any way to do this faster.
It depends on how you process them. If you have enough memory, you can read all the files first and convert them to Python data structures, then do the calculations.
If your files don't fit into memory, probably the easiest way is to use some distributed computing mechanism (Hadoop or other, lighter alternatives).
Another, smaller improvement could be to use the Linux fadvise function call (posix_fadvise) to tell the operating system how you will be using the file (sequential reading or random access), so it can optimize file access accordingly (a minimal sketch follows after these suggestions).
If the calculations fit into common libraries like numpy or numexpr, which have a lot of optimizations, use them (this can help if your computations currently use unoptimized algorithms).
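A minimal sketch of the fadvise hint, assuming Linux and Python 3.3+ (os.posix_fadvise is not available on all platforms) and a hypothetical file name:
import os
with open("data_00.txt", "rb") as f:  # hypothetical file name
    # Tell the kernel we will read this file sequentially, so it can prefetch aggressively.
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    for line in f:
        pass  # process the line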
If "corresponding lines" means "first lines of all files, then second lines of all files etc", you can use `itertools.izip:
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>
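Note that itertools.izip is gone in Python 3; the built-in zip is already lazy there, so a rough equivalent (assuming the same two files) would be:
files = [open(name) for name in ("f1.txt", "f2.txt")]
for lines in zip(*files):
    print(lines)
for f in files:
    f.close()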
A few options:
1. Just use the memory
You have 17 x 900,000 = 15.3 M floats per file. Storing these as doubles (as numpy usually does) will take roughly 120 MB of memory per file. You can reduce this by storing the floats as float32, so that each file takes roughly 60 MB. With 60 files at 60 MB each, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
2. Do it row-by-row
If you can do it row-by-row, just read each file one row at a time. It is quite easy to have 60 open files; that won't cause any problems. This is probably the most efficient method if you process the files sequentially: the memory usage is next to nothing, and the operating system takes care of reading the files (see the sketch below).
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
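A minimal sketch of the row-by-row approach, assuming hypothetical file names and using contextlib.ExitStack to keep all 60 files open at once:
from contextlib import ExitStack
filenames = ["data_%02d.txt" % i for i in range(60)]  # hypothetical names
with ExitStack() as stack:
    files = [stack.enter_context(open(name)) for name in filenames]
    for rows in zip(*files):  # one corresponding line from each file
        # parse the 17 tab-separated floats from each file's current line
        values = [[float(x) for x in row.split()] for row in rows]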
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not in CSV but in a binary format. That way each row will take exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.memmap to map the files into arrays of shape (N, 17). You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution if your data access is not sequential. In that case mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so it uses the optimal amount of memory.
Behind the scenes this is a close relative of solution #1, with the exception that nothing is loaded into memory until required. The same limitation applies to 32-bit Python: it cannot do this, as it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.
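A minimal sketch of the conversion and the memory-mapped access, assuming whitespace-separated floats and a hypothetical file name:
import numpy as np
# One-time conversion: parse the text file and dump raw float32 binary.
data = np.loadtxt("data_00.txt", dtype=np.float32)  # shape (N, 17); hypothetical file name
data.tofile("data_00.bin")
# Later runs: map the binary file; only the accessed blocks are read from disk.
n_rows = data.shape[0]
arr = np.memmap("data_00.bin", dtype=np.float32, mode="r", shape=(n_rows, 17))
row = arr[12345]  # behaves like a normal numpy array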
I have a 3.3 GB file containing one long line. The values in the file are comma-separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF5 files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two per file access. In addition, the HDF5 libraries support MPI and concurrent access.
This means that, assuming you can reformat your original one-line file as an appropriately hierarchical HDF5 file (e.g. make a group for every "large" segment of data), you can use the built-in capabilities of HDF5 to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how MPI works in conjunction with HDF5, but it will speed things up considerably.
Of course, all of this depends on putting the data into an HDF5 file in a way that allows you to take advantage of MPI, so it may not be the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
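A rough sketch of the one-time conversion and the fast reload (reusing the file name from the question):
import numpy as np
# One-time conversion: parse the text once, then store it in numpy's binary .npy format.
arr = np.fromfile('distance_matrix.tmp', sep=',')
np.save('distance_matrix.npy', arr)
# Every later run: load without any text parsing.
arr = np.load('distance_matrix.npy')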
If you can't change the file type (it's not the result of one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary that does the parsing and then dumps the result in a binary format. I don't have any links to examples for this one.
I am making a program that reads multiple files and writes a summary of each file to an output file. The size of the output file is rather big, so keeping it in memory is not a good idea. I am trying to develop a multiprocessing way of doing it. So far, the simplest way I was able to come up with is:
import glob
from multiprocessing import Pool
pool = Pool(processes=4)
it = pool.imap_unordered(do, glob.iglob(aglob))
for summary in it:
    writer.writerows(summary)
do is the function that summarizes the file. writer is a csv.writer object
But the truth is that I still do not understand multiprocessing.imap completely. Does this mean that 4 summaries are calculated in parallel and that, when I read one of them, the 5th starts being calculated?
Is there a better way of doing this?
Thanks.
processes=4 means that multiprocessing will start a pool with four worker processes and send the work items to them. Ideally, if your system supports it (i.e. you either have four cores or the workers are not totally CPU-bound), 4 work items will be processed in parallel.
I don't know the implementation of multiprocessing, but I think that the results of do will be cached internally even before you read them out, i.e. the 5th item will be computed once any process is done with an item from the first wave.
Whether there is a better way depends on the type of your data: how many files need processing in total, how large the summary objects are, etc. If you have many files (say, more than 10k), batching them might be an option, via
it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100)
This way, a work item is not one file, but 100 files, and results are also reported in batches of 100. If you have many work items, chunking lowers the overhead of pickling and unpickling the result objects.
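Putting the pieces together, a minimal sketch of the whole pattern (with a hypothetical do that just counts lines and a hypothetical glob pattern), assuming Python 3:
import csv
import glob
from multiprocessing import Pool
def do(path):
    # Hypothetical summary: the file name plus its line count.
    with open(path) as f:
        return [[path, sum(1 for _ in f)]]
if __name__ == "__main__":
    with open("summary.csv", "w", newline="") as out:
        writer = csv.writer(out)
        with Pool(processes=4) as pool:
            for summary in pool.imap_unordered(do, glob.iglob("data/*.txt"), chunksize=100):
                writer.writerows(summary)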
Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print lc, sys.argv[1]
Running it on my 6 GB text file, it completes in ~2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting tricks (like keeping line-count metadata in the file).
PS: I tagged this question "linux" because I'm interested only in Linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no, you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If it's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN and does the appropriate work. Each child should probably write a simple disk file with its final (summarized, reduced) results. Later, the results in those files can be consolidated.
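A rough sketch of the fan-out pattern, assuming Python 3.7+ and a hypothetical worker.py script that reads rows from its STDIN, summarizes them, and writes the given result file:
import subprocess
import sys
N_WORKERS = 4  # assumed number of children
workers = [
    subprocess.Popen(
        [sys.executable, "worker.py", "result_%d.txt" % i],  # hypothetical worker script
        stdin=subprocess.PIPE, text=True,
    )
    for i in range(N_WORKERS)
]
with open("myfile.txt") as f:
    for i, line in enumerate(f):
        workers[i % N_WORKERS].stdin.write(line)  # deal rows out round-robin
for w in workers:
    w.stdin.close()  # closing STDIN signals EOF so each child can finish
    w.wait()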
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader (available in Python 2.6+); a sketch follows below.
Do the newline counting in another thread, in parallel.
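A minimal sketch of the first tip (big-buffer reads), without the second thread:
def count_newlines(path, bufsize=1024 * 1024):
    # Read in 1 MB chunks instead of line by line.
    count = 0
    with open(path, "rb", buffering=bufsize) as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count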
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
As others have said: "no".
Almost all of your time is spent waiting for I/O. If this is something that you need to do more than once, and you have a machine with tons of RAM, you could keep the file in memory. If your machine has 16 GB of RAM, you'll have 8 GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the file among the machines, have each of them count its portion's newlines, and add the results.
2 minutes sounds about right to read an entire 6 GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what you are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then putting some metadata at the start might be helpful.
PyPy provides optimised input/output that can be up to 7 times faster.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data and then using CPU time to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best case than plain I/O-bound reading. It's a bit outside the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())
with open("big_comp.csv", "wb") as out:
    out.write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())
    for line in big.split(b"\n"):  # the decompressed data is bytes, so split on b"\n"
        line = line[::-1]  # placeholder processing: reverse each line