I'm trying to load a large CSV file into a pandas dataframe. The CSV is rather large: a few GB.
The code is working, but rather slowly. Slower than I would expect it to even. If I take only 1/10th of the CSV, the job is done in about 10 seconds. If I try to load the whole file, it takes more than 15 minutes. I would expect this to just take roughly 10 times as long, not ~100 times.
The amount of RAM used by python is never above exactly 1,930.8 MB (there is 16GB in my system):
enter image description here
It seems to be capped at this, making me think that there is some sort of limit on how much RAM python is allowed to use. However, I never set such a limit and online everyone says "Python has no RAM limit".
Could it be that the RAM python is allowed to use is limit somewhere? And if so, how do I remove that limit?
The problem is not just how much RAM it can use, but how fast is your CPU. Loading very large csv file is very time-consuming if you just use plain pandas. Here are a few options:
You can try other libraries that are made to work with big data. This tutorial shows some libraries. I like dask. Its API is like pandas.
If you have GPU, you can use rapids (which is also mentioned in the link). Man, rapids is really a game changer. Any computation on GPU is just significantly faster. One drawback is that not all features in pandas are yet implemented, but that's if you need them.
The last solution, although not recommended, is you can process your file in batches, e.g., use a for loop, load only the first 100K rows, process them, save, then continue doing so until the file ends. This is still very time-consuming but that's the most naive way.
I hope it helps.
I understand that generators in Python can help for reading and processing large files when specific transformations or outputs are needed from the file (i.e. such as reading a specific column or computing an aggregation).
However, for me it's not clear if there is any benefit in using generators in Python when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful just to read the entire data without any data processing?
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory intensive function so while reading a file that will end up consuming 10 gigabytes of memory your actual memory usage may exceed 30-40 gigabytes. In cases like this, reading the file in smaller chunks might be the only option.
I have some data stored in a tree in memory and I regularly store the tree into disk using pickle.
Recently I noticed that the program using a large memory, then I checked saved pickle file, it is around 600M, then I wrote an other small test program loading the tree back into memory, and I found that it would take nearly 10 times memory(5G) than the size on disk, is that normal? And what's the best way to avoid that?
No it's not normal. I suspect your tree is bigger than you think. Write some code to walk it and add up all the space used (and count the nodes).
See memory size of Python data structure
Also what exactly are you asking? Are you surprised that a 600M data structure on disk is 5G in memory. That's not particularly surprising. Pickle compresses the data so you expect it to be smaller on disk. It's smaller by a factor of 10 (roughly) which is pretty good.
If you're surprised by the size of your own data that's another thing.
In Python 3, for disk space (not speed) of saving Pandas objects, is HDF5 or Pickle better? Preferably, also, how much better?
Prior research:
I searched for an article comparing storage methods, and here is the most popular one. Unfortunately, it only talks about speed, which is less important to me.
I tested an example object myself, and found that HDF5 and Pickle produced basically the same file size for my example object. But I don't want to just trust my own result, because maybe my result is due to the specific structure/type of data I happened to test with.
Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
lc = lc + 1
print lc, sys.argv[1]
Running it on my 6GB text file, it completes in ~ 2minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so, I suspect the anwer to my quesion is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If It's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader ( available in python2.6+ ).
Do the newline counting in another thread, in parallel.
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the it among multiple machines, each of them count their newlines, and add the results.
2 minutes sounds about right to read an entire 6gb file. Theres not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what your are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.
PyPy provides optimised input/output faster up to 7 times.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best best case than using native I/O bound tasks. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
compressed = zlib.compress(f.read())
open("big_comp.csv", "wb").write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
big = zlib.decompress(f.read())
for line in big.split("\n"):
line = reversed(line)