Consider this Python program:
import sys
lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print lc, sys.argv[1]
Running it on my 6GB text file, it completes in about 2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting tricks (like keeping line-count metadata in the file).
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So no, you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
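The same arithmetic as a small Python helper. This is only a sketch: the function name `transfer_time_seconds` is made up for illustration, and the rates are the ones quoted above, not measurements.

```python
def transfer_time_seconds(size_mb, rate_mb_per_s):
    """Lower bound on the read time if the disk is the only bottleneck."""
    return size_mb / rate_mb_per_s


for rate in (60, 125):
    print(rate, "MB/s ->", transfer_time_seconds(6000, rate), "seconds")
```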
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If it's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6GB file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.
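A rough sketch of the fan-out idea in Python 3, assuming the per-row work is just counting (the child one-liner in `CHILD_CODE`, the function name `fan_out_count` and the round-robin dealing are illustrative choices, not something prescribed above). Here each child reports its result on STDOUT instead of writing a summary file, to keep the example short.

```python
import subprocess
import sys

# Hypothetical child program: counts the rows it receives on STDIN
# and prints the total when STDIN is closed.
CHILD_CODE = "import sys; print(sum(1 for _ in sys.stdin.buffer))"


def fan_out_count(path, workers=4):
    """Deal the rows of `path` round-robin to child processes and sum their counts."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        for _ in range(workers)
    ]
    # The parent does as little as possible: read a row, hand it to a child.
    with open(path, "rb") as f:
        for i, line in enumerate(f):
            children[i % workers].stdin.write(line)
    total = 0
    for child in children:
        out, _ = child.communicate()  # flushes and closes STDIN, then reads STDOUT
        total += int(out)
    return total
```

With work this trivial the pipes cost more than they save; the pattern only pays off when each child does real CPU work per row.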
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader (available in Python 2.6+).
Do the newline counting in another thread, in parallel.
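The two tips combined might look roughly like this in Python 3. It is a sketch under assumptions: the 8 MB chunk size and the queue depth are arbitrary, and `count_lines_threaded` is a name invented here. Blocking file reads release the GIL, so the reader can fetch the next chunk while the other thread counts.

```python
import queue
import threading


def count_lines_threaded(path, chunk_size=8 * 1024 * 1024):
    """Read big binary chunks in the main thread, count newlines in a second thread."""
    chunks = queue.Queue(maxsize=4)  # bounded, so the reader cannot run far ahead
    result = []

    def counter():
        total = 0
        while True:
            chunk = chunks.get()
            if chunk is None:  # sentinel: no more data
                break
            total += chunk.count(b"\n")
        result.append(total)

    worker = threading.Thread(target=counter)
    worker.start()
    try:
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunks.put(chunk)
    finally:
        chunks.put(None)  # always release the counting thread
        worker.join()
    return result[0]
```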
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
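For completeness, the mmap variant could be sketched like this in Python 3 (the name `count_lines_mmap` and the 16 MB window are arbitrary choices; note that `mmap.mmap` raises `ValueError` on an empty file):

```python
import mmap


def count_lines_mmap(path, window=16 * 1024 * 1024):
    """Map the file read-only and count newlines one window at a time."""
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), window):
                # Slicing copies the window into a bytes object; .count runs in C.
                count += mm[start:start + window].count(b"\n")
    return count
```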
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
as others have said - "no"
Almost all of your time is spent waiting for I/O. If this is something that you need to do more than once, and you have a machine with tons of RAM, you could keep the file in memory. If your machine has 16GB of RAM, you'll typically have 8GB available at /dev/shm to play with (tmpfs defaults to half of physical RAM).
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the file among multiple machines, have each of them count its newlines, and add the results.
2 minutes sounds about right to read an entire 6GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what you are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.
PyPy provides optimised input/output that can be up to 7 times faster.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best case than using native I/O-bound tasks. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())
with open("big_comp.csv", "wb") as out:
    out.write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())
for line in big.split(b"\n"):  # the decompressed data is bytes
    line = line[::-1]  # placeholder for the real per-line work
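The "stream object" variant mentioned above can be sketched with `zlib.decompressobj`, so that neither the compressed nor the decompressed file has to fit in memory at once. The function name `iter_decompressed_lines` and the 1 MB chunk size are assumptions for illustration (Python 3):

```python
import zlib


def iter_decompressed_lines(path, chunk_size=1024 * 1024):
    """Yield lines (as bytes, without the newline) from a zlib-compressed file."""
    decomp = zlib.decompressobj()
    tail = b""  # incomplete last line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_size)
            if not raw:
                break
            lines = (tail + decomp.decompress(raw)).split(b"\n")
            tail = lines.pop()  # may be empty or a partial line
            for line in lines:
                yield line
    rest = tail + decomp.flush()
    if rest:
        lines = rest.split(b"\n")
        if lines[-1] == b"":
            lines.pop()  # the data ended with a newline
        for line in lines:
            yield line
```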
I'm trying to load a large CSV file into a pandas dataframe. The CSV is rather large: a few GB.
The code is working, but rather slowly; slower than I would expect, even. If I take only 1/10th of the CSV, the job is done in about 10 seconds. If I try to load the whole file, it takes more than 15 minutes. I would expect this to just take roughly 10 times as long, not ~100 times.
The amount of RAM used by Python never goes above exactly 1,930.8 MB (there is 16GB in my system).
It seems to be capped at this, making me think that there is some sort of limit on how much RAM python is allowed to use. However, I never set such a limit and online everyone says "Python has no RAM limit".
Could it be that the RAM Python is allowed to use is limited somewhere? And if so, how do I remove that limit?
The problem is not just how much RAM it can use, but also how fast your CPU is. Loading a very large CSV file is very time-consuming if you just use plain pandas. Here are a few options:
You can try other libraries that are made to work with big data. This tutorial shows some libraries. I like dask. Its API is like pandas.
If you have a GPU, you can use RAPIDS (which is also mentioned in the link). Man, RAPIDS is really a game changer. Computation on the GPU is often significantly faster. One drawback is that not all pandas features are implemented yet, but that only matters if you need them.
The last solution, although not recommended, is you can process your file in batches, e.g., use a for loop, load only the first 100K rows, process them, save, then continue doing so until the file ends. This is still very time-consuming but that's the most naive way.
I hope it helps.
How can I load two large CSV files (5GB each) into a Jupyter Notebook on my local system using Python pandas? Please suggest any configuration for handling big CSV files for data analysis.
Local System Configuration:
OS: Windows 10
RAM: 16 GB
Processor: Intel-Core-i7
Code:
import pandas as pd
dpath = 'p_flg_tmp1.csv'
pdf = pd.read_csv(dpath, sep="|")
Error:
MemoryError: Unable to allocate array
or
pd.read_csv(po_cust_data, sep="|", low_memory=False)
Error:
ParserError: Error tokenizing data. C error: out of memory
How can I handle two large CSV files on a local system for data analysis? Please suggest a better configuration, if possible, for a local system using Python pandas.
If you do not need to process everything at once you can use chunks:
import pandas as pd
reader = pd.read_csv('tmp.csv', sep='|', chunksize=4000)
for chunk in reader:
    print(chunk)
see the Documentation of Pandas for further information.
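Printing each chunk is only a demonstration. In practice you would reduce every chunk to something small and combine the partial results. A minimal sketch, assuming the goal is the sum and row count of a single column (the function name `sum_column_in_chunks` and the chunk size are made up here):

```python
import pandas as pd


def sum_column_in_chunks(path_or_buffer, column, sep="|", chunksize=100_000):
    """Sum one column of a large CSV without holding the whole file in memory."""
    total = 0.0
    rows = 0
    reader = pd.read_csv(
        path_or_buffer,
        sep=sep,
        usecols=[column],  # parse only the column that is needed
        chunksize=chunksize,
    )
    for chunk in reader:
        total += float(chunk[column].sum())
        rows += len(chunk)
    return total, rows
```

Reading only the needed columns with `usecols` already cuts memory use considerably, even before chunking.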
If you need to process everything at once and chunking really isn't an option, you have only two options left:
Increase RAM of your system
Switch to another data storage type
A CSV file takes an enormous amount of memory once loaded into RAM. See this article for more information; even though it is about different software, it gives a good idea of the problem:
Memory Usage
You can estimate the memory usage of your CSV file with this simple
formula:
memory = 25 * R * C + F
where R is the number of rows, C the number of columns and F the file size in bytes.
One of my test files is 524 MB large, contains 10 columns in 4.4
million rows. Using the formula from above the RAM usage will be about
1.6 GB:
memory = 25 * 4,400,000 * 10 + 524,000,000 = 1,624,000,000 bytes
While this file is opened in Tablecruncher the Activity Monitor
reports 1.4 GB RAM used, so the formula represents a rather accurate
guess.
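The quoted formula is easy to turn into a quick check before loading a file. This is a sketch of the article's rule of thumb only (the function name `estimate_csv_memory` is invented here, and the constant 25 comes from the quote, not from pandas):

```python
def estimate_csv_memory(rows, columns, file_size_bytes):
    """Rule-of-thumb RAM estimate from the quoted formula: 25 * R * C + F."""
    return 25 * rows * columns + file_size_bytes


# The test file from the quote: 4.4 million rows, 10 columns, 524 MB on disk.
print(estimate_csv_memory(4_400_000, 10, 524_000_000))
```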
Use chunks to read the data partially.
import pandas as pd
dpath = 'p_flg_tmp1.csv'
for pdf in pd.read_csv(dpath, sep="|", chunksize=1000):
    pass  # do something with each chunk here
What's your overall goal here though? People are providing help with how to read it, but then what? You want to do a join/merge? You are gonna need more tricks to get through that.
But then what? Is the rest of your algorithm also chunkable? Are you going to have enough RAM left to process anything? And what about CPU performance? Is one little i7 enough? Do you plan on waiting hours or days for results? This might all be acceptable for your use case, sure, but we don't know that.
At a certain point, if you want to use big data, you need big computer(s). Do you really have to do this locally? Even if you aren't ready for distributed computing over clusters, you could just get an adequately sized VM instance. Your company will pay for it. They pay for themselves. It's much cheaper to give you a better computer than to pay you to wait around for a small one to finish. In India, the price ratio between labor and AWS costs is lower than in the US, sure, but it's still well worth it. Be like hey boss, do you want this to take 3 days or 3 weeks?
Realistically, your small computer problems are only going to get worse after reading in the CSV. I mean I don't know your use case, but that seems likely. You could spend a long time trying to engineer your way out of these problems, but much cheaper to just spin up an EC2 instance.
For an upcoming programming competition I solved a few of the tasks from former competitions.
Each task looks like this: We get a bunch of in-files (each containing 1 line of numbers and strings, e.g. "2 15 test 23 ..."), and we have to build a program and return some computed values.
These in-files can be quite large: for instance 10 MB.
My code is the following:
with open(filename) as f:
    input_data = f.read().split()
This is quite slow. I guess mostly because of the split method. Is there a faster way?
What you have already looks like the best way for plain text IO on a one-line file.
10 MB of plain text is fairly large. If you need some more speedup, you could consider pickling the data in a binary format instead of a plain text format. Or, if it is very repetitive data, you could store it compressed.
If one of your input files contains independent tasks (that is, you can work on a couple of tokens of the line at a time, without knowing tokens further ahead), you can do reading and processing in lockstep, by simply not reading the whole file at once.
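# entire_group_inside() and next_group_index() depend on your input format
# and are left for you to write.
def read_groups(f):
    chunksize = 4096  # how many characters to read from the file at once
    buf = f.read(chunksize)
    while buf:
        if entire_group_inside(buf):  # checks if buf holds enough data to process
            i = next_group_index(buf)  # returns the index of the next group of tokens
            group, buf = buf[:i], buf[i:]
            yield group
        else:
            more = f.read(chunksize)
            if not more:  # end of file: emit what is left and stop
                yield buf
                break
            buf += more

with open(filename) as f:
    for data in read_groups(f):
        pass  # do something with data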
This has some advantages:
You don't need to read the whole file into memory (which, for 10 MB on a desktop, probably doesn't matter much)
if you do a lot of processing on each group of tokens, it may lead to better performance, as you'll have alternating I/O and CPU bound tasks. Modern OSs use sequential file prefetching to optimize file linear access, so, in practice, if you lockstep I/O and CPU, your I/O will end up being executed in parallel by the OS. Even if your OS has no such functionality, if you have a modern disk, it'll probably cache sequential access to blocks.
If you don't have much processing, though, your task is fundamentally I/O bound, and there isn't much you can do to speed it up as it stands, as wim said, other than rethinking your input data format.
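As one concrete instance of the lockstep idea, here is a sketch for the simplest case, where a "group" is a single whitespace-separated token. The name `iter_tokens` is made up, and the file is assumed to be opened in text mode (Python 3):

```python
def iter_tokens(f, chunksize=4096):
    """Yield whitespace-separated tokens from a text file, one chunk at a time."""
    tail = ""  # a token that was cut off at the end of the previous chunk
    while True:
        buf = f.read(chunksize)
        if not buf:
            break
        buf = tail + buf
        tokens = buf.split()
        if tokens and not buf[-1].isspace():
            tail = tokens.pop()  # the last token may continue in the next chunk
        else:
            tail = ""
        for token in tokens:
            yield token
    if tail:
        yield tail
```

Usage would be `for token in iter_tokens(open(filename)): ...`, processing each token as soon as it is available instead of after the whole file has been split.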
I have around 60 files, each containing around 900,000 lines, where each line is 17 tab-separated float numbers. For each line I need to do some calculation using all the corresponding lines from all 60 files, but because of their huge size (each file is 400 MB) and limited computation resources, it takes a very long time. Is there any way to do this faster?
It depends on how you process them. If you have enough memory you can read all the files first and change them to python data structures. Then you can do calculations.
If your files don't fit into memory probably the easiest way is to use some distributed computing mechanism (hadoop or other lighter alternatives).
Another, smaller improvement could be to use the Linux posix_fadvise call to declare how you will be using the file (sequential reading or random access); it tells the operating system how to optimize file access.
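In Python 3 on Linux the call is exposed as `os.posix_fadvise`. A minimal sketch (the wrapper name `open_sequential` is invented here, and the hint is skipped on platforms without the call; whether it helps measurably depends on the kernel and the disk):

```python
import os


def open_sequential(path):
    """Open a file for reading and hint that it will be read front to back."""
    f = open(path, "rb")
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        # offset 0 and length 0 mean "the whole file"
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return f
```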
If the calculations fit into common libraries like numpy or numexpr, which have a lot of optimizations, you can use them (this can help if your computations currently use non-optimized algorithms).
If "corresponding lines" means "first lines of all files, then second lines of all files, etc.", you can use itertools.izip (in Python 3, the built-in zip is already lazy):
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
...     print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>
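A Python 3 version of the same idea, which also parses the floats and closes the files afterwards. It is a sketch: the function name `iter_corresponding_rows` is made up, and the tab separator is taken from the question.

```python
from contextlib import ExitStack


def iter_corresponding_rows(paths, sep="\t"):
    """For each line number, yield the parsed rows from every file."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        # zip is lazy in Python 3, so only one line per file is held at a time
        for lines in zip(*files):
            yield [[float(x) for x in line.split(sep)] for line in lines]
```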
A few options:
1. Just use the memory
You have 17 x 900000 = 15.3 M floats per file. Storing this as doubles (as numpy usually does) will take roughly 120 MB of memory per file. You can reduce this by storing the floats as float32, so that each file will take roughly 60 MB. If you have 60 files and 60 MB/file, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
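The estimate above, spelled out (a sketch; the constants are the figures from the question and the helper names are invented):

```python
ROWS_PER_FILE = 900_000
FLOATS_PER_ROW = 17
FILES = 60


def megabytes_per_file(bytes_per_float):
    """Raw array size of one file in MB (8 bytes for float64, 4 for float32)."""
    return ROWS_PER_FILE * FLOATS_PER_ROW * bytes_per_float / 1e6


def total_gigabytes(bytes_per_float):
    """Raw array size of all files together in GB."""
    return FILES * megabytes_per_file(bytes_per_float) / 1e3


print(megabytes_per_file(8), megabytes_per_file(4), total_gigabytes(4))
```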
2. Do it row-by-row
If you can do it row-by-row, just read each file one row at a time. It is quite easy to have 60 open files, that'll not cause any problems. This is probably the most efficient method, if you process the files sequentially. The memory usage is next to nothing, and the operating system will take the trouble of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not in CSV but in a binary format. That way each row will take exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.memmap to map the files into arrays of [N, 17] shape. You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution, if your data access is not sequential. Then mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so that it uses the optimal amount of memory.
Behind the scenes this is a close relative to solution #1 with the exception that nothing is loaded into memory until required. The same limitations about 32-bit python apply; it is not able to do this as it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.
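Roughly, the conversion and the mapping could look like this. It is a sketch assuming tab-separated input with 17 floats per row and float32 storage; the function names are made up. `np.loadtxt` parses one whole file in memory, which is fine at 900,000 rows.

```python
import numpy as np

COLUMNS = 17  # floats per row, as in the question


def text_to_binary(text_path, binary_path, dtype=np.float32):
    """One-off conversion: parse the tab-separated text and dump raw floats."""
    data = np.loadtxt(text_path, delimiter="\t", dtype=dtype, ndmin=2)
    data.tofile(binary_path)


def open_as_array(binary_path, dtype=np.float32):
    """Map the binary file as a read-only [N, 17] array without loading it."""
    return np.memmap(binary_path, dtype=dtype, mode="r").reshape(-1, COLUMNS)
```

After that, `open_as_array(path)[i]` touches only the disk blocks that hold row `i`.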
I have a 3.3GB file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between the cores.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course, all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
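A sketch of that one-off conversion for a single long comma-separated line, reading the text in chunks so the whole line never has to be split at once. The function names and the 1 MB chunk size are assumptions, and the values are assumed to contain no empty fields.

```python
import numpy as np


def read_comma_separated(path, chunk_size=1024 * 1024):
    """Parse one long line of comma-separated numbers chunk by chunk."""
    parts = []
    tail = ""  # a value that was cut off at the end of the previous chunk
    with open(path) as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            # Everything before the last comma is complete; keep the rest for later.
            head, sep, tail = (tail + buf).rpartition(",")
            if sep:
                parts.append(np.array(head.split(","), dtype=np.float64))
    if tail.strip():
        parts.append(np.array([float(tail)]))
    return np.concatenate(parts) if parts else np.empty(0)


def convert_to_npy(text_path, npy_path):
    """Pay the parsing cost once, then store the array in numpy's binary format."""
    np.save(npy_path, read_comma_separated(text_path))
```

Later runs can then use `np.load(npy_path)`, or `np.load(npy_path, mmap_mode="r")` to avoid loading everything up front.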
If you can't change the file type (because it is not the output of one of your own programs) then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.