Would anyone be able to tell me, in simple terms, how dask works for larger-than-memory datasets? For example, I have a 6GB dataset and 4GB of RAM with 2 cores. How would dask go about loading the data and doing a simple calculation such as the sum of a column?
Does dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces, and then, once asked to compute, bring the chunks into memory one by one and do the computation using each of the available cores? Am I right on this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern in loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
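The load, reduce, discard cycle described above can be sketched with the standard library alone. This is only an illustration of the idea, not Dask's implementation: it runs sequentially (Dask would run one such loop per worker), and the function name, the tiny in-memory CSV and the `rows_per_chunk` parameter are all made up for the example.

```python
import csv
import io
from itertools import islice


def column_sum_in_chunks(fileobj, column, rows_per_chunk):
    """Sum one CSV column while holding only one chunk of rows in memory."""
    reader = csv.DictReader(fileobj)
    partial_sums = []
    while True:
        # Load the next partition; the previous one is no longer referenced.
        chunk = list(islice(reader, rows_per_chunk))
        if not chunk:
            break
        # Keep only a single number per partition.
        partial_sums.append(sum(float(row[column]) for row in chunk))
    # Final step: add the per-partition sums together.
    return sum(partial_sums)


# Hypothetical stand-in for a large CSV file.
data = io.StringIO("a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n")
total = column_sum_in_chunks(data, "b", rows_per_chunk=2)
print(total)
```

Only `partial_sums` grows with the size of the input, and it holds one number per partition.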
Thus, each partition should not be too big, so that it fits comfortably into memory, but also not too small: the time it takes to load and process each one should be much longer than the overhead of scheduling the task to run on a worker (the latter is <1ms).
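As a rough back-of-the-envelope check of that trade-off, the numbers from the question (6GB file, 2 workers) and a 100MB chunk size can be plugged into a small helper. The function is hypothetical, it uses 1ms as an upper bound for the per-task overhead, and it counts on-disk bytes only; the parsed data in memory is usually larger than the raw CSV bytes.

```python
def partition_plan(total_bytes, blocksize_bytes, workers, overhead_s=0.001):
    """Estimate partition count, raw bytes loaded at once, and scheduling cost."""
    n_partitions = -(-total_bytes // blocksize_bytes)  # ceiling division
    return {
        "partitions": n_partitions,
        # One partition per worker is loaded at any time.
        "bytes_in_memory": workers * blocksize_bytes,
        "scheduling_overhead_s": n_partitions * overhead_s,
    }


plan = partition_plan(6 * 10**9, 100 * 10**6, workers=2)
print(plan)
```

With these inputs the scheduling overhead is a small fraction of a second in total, which is negligible next to the time needed to read 6GB from disk.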
Hi, I have a Python script that uses the dask library to handle a very large dataframe, larger than the physical memory. I notice that the job gets killed in the middle of a run if the computer's memory usage stays at 100% for some time.
Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.
Is there a way to limit its total memory usage? Thanks
EDIT:
I also tried:
dask.set_options(available_memory=12e9)
It did not work. It did not seem to limit the memory usage. Again, when memory usage reaches 100%, the job gets killed.
The line
ddf = ddf.set_index("sort_col").compute()
is actually pulling the whole dataframe into memory and converting to pandas. You want to remove the .compute(), and apply whatever logic (filtering, groupby/aggregations, etc.) that you want first, before calling compute to produce a result that is small enough.
The important thing to remember is that the resulting output must be able to fit into memory, and each chunk that is being processed by each worker (plus overheads) also needs to be able to fit into memory.
Try going through the data in chunks with:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
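A slightly fuller sketch of the same loop, assuming pandas is installed: here the per-chunk processing is a running sum and row count, so only two numbers survive from each chunk. The function name, the column name and the in-memory CSV are placeholders for illustration.

```python
import io

import pandas as pd


def mean_in_chunks(source, column, chunksize):
    """Compute a column mean while reading the CSV chunk by chunk."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
        total += float(chunk[column].sum())
        count += len(chunk)
    return total / count


# Hypothetical stand-in for a large file on disk.
csv_text = io.StringIO("a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_b = mean_in_chunks(csv_text, "b", chunksize=2)
print(mean_b)
```

This works for any aggregation that can be built from small per-chunk results; a sort or a set_index over the whole dataset does not fit this pattern.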
I have a Python application that processes in-memory-data. It provides <1 second response, by querying ~1 million records and then aggregating the result set.
What would be the best Python framework(s) to make this application more scalable ?
Here are more details :
Data is loaded from a single table on disk which is loaded into memory as numpy arrays and custom indexes using dictionaries.
This application starts breaching the 1 second time limit when the number of records grows beyond 5 million. The search part (locating the indexes) takes only 100 ms. I see that a lot of time (900 to 2000 milliseconds) is spent just summing up the result set.
I can also see that the CPU cores and RAM are not used to their full capacity. Each core is used only up to 20% and plenty of memory is free.
I just read a long list of python frameworks on distributed computing. Looking for specific solutions for real time responses, by:
making a better usage of available CPU & RAM in single machine through parallel processing to stay within < 1 second response time.
later by extending it beyond a single machine, to support even ~100 million records. This data is in a single table/file that can be horizontally partitioned across many machines, and these machines can work independently on their own data.
Suggestions based on what you have "seen" working in your past experience are greatly appreciated.
I am making a program that reads multiple files and writes a summary of each file to an output file. The size of the output file is rather big, so keeping it in memory is not a good idea. I am trying to develop a multiprocessing way of doing it. So far, the simplest way I was able to come up with is:
pool = Pool(processes=4)
it = pool.imap_unordered(do, glob.iglob(aglob))
for summary in it:
    writer.writerows(summary)
do is the function that summarizes the file. writer is a csv.writer object.
But the truth is that I still do not understand multiprocessing.imap completely. Does this mean that 4 summaries are calculated in parallel and that when I read one of them, the 5th starts to be calculated?
Is there a better way of doing this?
Thanks.
processes=4 means that multiprocessing will start a pool with four worker processes and send the work items to them. Ideally, if your system supports it, i.e. either you have four cores or the workers are not totally CPU-bound, 4 work items will be processed in parallel.
I don't know the implementation of multiprocessing, but I think that the results of do will be cached internally even before you read them out, i.e. the 5th item will be computed once any process is done with an item from the first wave.
Whether there is a better way depends on the type of your data: how many files there are in total that need processing, how large the summary objects are, etc. If you have many files (say, more than 10k), batching them might be an option, via
it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100)
This way, a work item is not one file but 100 files, and the results are also sent back to the parent in batches of 100. If you have many work items, chunking lowers the overhead of pickling and unpickling the result objects.
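Put together, the approach might look roughly like the sketch below. `summarize` is a hypothetical stand-in for the asker's `do` function (here it just reports the size and line count of a file), and `write_summaries` is an illustrative name, not part of any library.

```python
import csv
import multiprocessing


def summarize(path):
    """Hypothetical stand-in for `do`: return the summary rows for one file."""
    with open(path, "rb") as f:
        data = f.read()
    return [(path, len(data), data.count(b"\n"))]


def write_summaries(paths, out, processes=4, chunksize=1):
    """Summarize files in worker processes and stream the rows to a CSV file."""
    writer = csv.writer(out)
    with multiprocessing.Pool(processes=processes) as pool:
        # Results arrive in completion order, one summary per input file.
        for rows in pool.imap_unordered(summarize, paths, chunksize=chunksize):
            # Rows are written as soon as they arrive, so the full output
            # is never held in memory.
            writer.writerows(rows)
```

A call such as `write_summaries(glob.iglob(aglob), open("summary.csv", "w", newline=""), chunksize=100)` should be placed under an `if __name__ == "__main__":` guard, because on some platforms the worker processes re-import the main module.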
Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print("%d %s" % (lc, sys.argv[1]))
Running it on my 6GB text file, it completes in ~2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no, you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
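The same arithmetic as a small helper, in case you want to plug in other drive speeds. The function name is just for illustration, and the rates are the sustained sequential transfer rates assumed above.

```python
def transfer_time_s(size_mb, rate_mb_per_s):
    """Time in seconds to read size_mb at a sustained rate_mb_per_s."""
    return size_mb / rate_mb_per_s


for rate in (60, 125):
    print("%d MB/s -> %.0f seconds" % (rate, transfer_time_s(6000, rate)))
```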
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If it's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6GB file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.
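A minimal sketch of the fan-out pattern using subprocess pipes. To keep it short, the child program here is a hypothetical one-liner that only counts the rows it receives, and it reports its result on STDOUT rather than writing a summary file as suggested above; a real child would do the actual per-row work.

```python
import subprocess
import sys

# Hypothetical child program: count the rows received on STDIN.
CHILD_CODE = "import sys; print(sum(1 for _ in sys.stdin))"


def fan_out(lines, n_children=2):
    """Deal lines round-robin to child processes and collect their results."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n_children)
    ]
    # The parent does as little as possible: it only deals rows out.
    for i, line in enumerate(lines):
        children[i % n_children].stdin.write(line)
    results = []
    for child in children:
        child.stdin.close()  # signals end of input to the child
        results.append(int(child.stdout.read()))
        child.wait()
    return results


if __name__ == "__main__":
    with open(sys.argv[1]) as big_file:
        print(sum(fan_out(big_file, n_children=4)))
```

Each line passed in must keep its trailing newline, which is the case when iterating over a file object.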
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or done simply by using io.BufferedReader (available in Python 2.6+).
Do the newline counting in another thread, in parallel.
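The first tip, coded "manually", could look like the sketch below. The 16 MiB buffer size is an arbitrary assumption to tune for your disk, and the second tip (counting in a separate thread) is not shown.

```python
def count_newlines(path, buffer_size=16 * 1024 * 1024):
    """Count newline bytes by reading the file in large binary blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(buffer_size)
            if not block:  # an empty read means end of file
                break
            count += block.count(b"\n")
    return count
```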
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
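For reference, the mmap variant might look like this. It is a sketch only; as said, it is unlikely to beat a plain read by much when the disk is the bottleneck. The function name is made up for the example.

```python
import mmap


def count_newlines_mmap(path):
    """Count newline bytes through a read-only memory map of the file."""
    with open(path, "rb") as f:
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
```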
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of RAM, you could keep the file in memory. If your machine has 16GB of RAM, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the file among multiple machines, have each of them count its newlines, and add the results.
2 minutes sounds about right to read an entire 6GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what you are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.
PyPy provides optimised input/output that can be up to 7 times faster.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best case than plain I/O-bound reading of the uncompressed file. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
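compress.py
import zlib
with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())
# Write the compressed bytes and close the file when done
with open("big_comp.csv", "wb") as out:
    out.write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())
# The decompressed data is bytes, so split on a bytes newline
for line in big.split(b"\n"):
    line = line[::-1]  # placeholder per-line work: reverse the line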
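The example above decompresses the whole file in one go. The "stream object" variant mentioned earlier could be sketched with zlib.decompressobj, so that only one chunk of compressed data plus the current partial line is held at a time. The function name and chunk size are assumptions for illustration.

```python
import zlib


def iter_decompressed_lines(path, chunk_size=1 << 20):
    """Yield lines (as bytes, without the newline) from a zlib-compressed file."""
    decomp = zlib.decompressobj()
    pending = b""  # bytes after the last newline seen so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pending += decomp.decompress(chunk)
            # Everything before the last newline is complete lines.
            *lines, pending = pending.split(b"\n")
            for line in lines:
                yield line
    tail = pending + decomp.flush()
    if tail:
        if tail.endswith(b"\n"):
            tail = tail[:-1]
        for line in tail.split(b"\n"):
            yield line
```

Usage would be `for line in iter_decompressed_lines("big_comp.csv"): ...`, with the per-line work in the loop body.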