Hi, I have a Python script that uses the dask library to handle a very large dataframe, larger than the physical memory. I notice that the job gets killed in the middle of a run if memory usage stays at 100% for some time.
Is this expected? I would have thought the data would be spilled to disk, and there is plenty of disk space left.
Is there a way to limit its total memory usage? Thanks
EDIT:
I also tried:
dask.set_options(available_memory=12e9)
It did not work. It did not seem to limit memory usage; again, when memory usage reaches 100%, the job gets killed.
The line
ddf = ddf.set_index("sort_col").compute()
is actually pulling the whole dataframe into memory and converting it to pandas. You want to remove the .compute(), and apply whatever logic you want first (filtering, groupby/aggregations, etc.), before calling compute to produce a result that is small enough.
The important thing to remember is that the resulting output must fit into memory, and each chunk being processed by a worker (plus overheads) also needs to fit into memory.
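As a rough sketch of that idea ("sort_col" is from your snippet; the file name and the "category"/"value" columns below are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("large_input_*.csv")   # lazy, nothing is loaded yet
ddf = ddf.set_index("sort_col")          # still lazy

# apply the filtering / aggregation while everything is still lazy ...
result = ddf[ddf["value"] > 0].groupby("category")["value"].sum()

# ... and only call compute() on something small enough to fit in memory
result = result.compute()                # a pandas Series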
Try going through the data in chunks with:
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # replace process() with your own per-chunk logic
I'm trying to load a large CSV file into a pandas dataframe. The CSV is rather large: a few GB.
The code works, but rather slowly; slower than I would expect, even. If I take only 1/10th of the CSV, the job is done in about 10 seconds. If I try to load the whole file, it takes more than 15 minutes. I would expect this to take roughly 10 times as long, not ~100 times.
The amount of RAM used by Python never goes above exactly 1,930.8 MB (there is 16 GB in my system):
[screenshot: memory usage plateauing at 1,930.8 MB]
It seems to be capped there, making me think that there is some sort of limit on how much RAM Python is allowed to use. However, I never set such a limit, and online everyone says "Python has no RAM limit".
Could it be that the RAM Python is allowed to use is limited somewhere? And if so, how do I remove that limit?
The problem is not just how much RAM it can use, but how fast your CPU is. Loading a very large CSV file is very time-consuming if you just use plain pandas. Here are a few options:
You can try other libraries that are made to work with big data. This tutorial shows some libraries. I like dask; its API is like pandas (there is a short sketch after these options).
If you have a GPU, you can use RAPIDS (also mentioned in the link). RAPIDS is really a game changer: any computation on the GPU is significantly faster. One drawback is that not all pandas features are implemented yet, but that only matters if you need them.
The last option, although not recommended, is to process your file in batches: for example, use a for loop, load only the first 100K rows, process them, save the result, then continue until the file ends. This is still very time-consuming, but it is the most straightforward approach.
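A minimal sketch of the dask option (the file name and the "city"/"price" columns are made up for illustration):

import dask.dataframe as dd

ddf = dd.read_csv("big_file.csv")                 # lazy, read in chunks
mean_price = ddf.groupby("city")["price"].mean()  # pandas-like API, still lazy
print(mean_price.compute())                       # triggers the actual work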
I hope this helps.
I have approximately 60,000 small CSV files of varying sizes, 1 MB to several hundred MB, that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB, which is larger than the memory of the server I am using (678 GB available).
Since all the CSVs have the same fields, I concatenated them into a single large file and tried to process that file with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent Dask from running out of memory when the job was split over multiple workers.
What happens is that Dask does eventually run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?
You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
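For example, something along these lines (a sketch; the directory name is hypothetical and the 256MB blocksize is just a smaller value to experiment with):

import dask.dataframe as dd

# read the original small CSVs directly with a glob, using smaller partitions
ddf = dd.read_csv("csv_dir/*.csv", blocksize="256MB")
ddf.to_parquet("large.pqt")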
In general, I recommend using Dask's dashboard to watch the computation and see what is taking up your memory. This might help you find a good solution: https://docs.dask.org/en/latest/diagnostics-distributed.html
Would anyone be able to tell me, in simple terms, how dask works for larger-than-memory datasets? For example, say I have a 6 GB dataset and 4 GB of RAM with 2 cores. How would dask go about loading the data and doing a simple calculation such as the sum of a column?
Does dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces? Then, once asked to compute, does it bring the chunks into memory one by one and do the computation using each of the available cores? Am I right about this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern when loading the data.
In both cases, each worker will load one chunk at a time and calculate the column sum you asked for. The loaded data is then discarded to make space for the next chunk, keeping only the result of the sum in memory (a single number per partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the partial sums are added together.
Thus, each partition should comfortably fit into memory (not be too big), but the time it takes to load and process each one should be much longer than the overhead of scheduling the task on a worker, which is under 1 ms (so partitions should not be too small either).
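A minimal sketch of that column-sum example (the file name and the "amount" column are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("data.csv", blocksize="100MB")  # roughly 100MB of CSV per partition
total = ddf["amount"].sum().compute()             # each partition is summed as it is loaded,
                                                  # then the partial sums are combined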
Background: In Hadoop Streaming, each reduce job writes to the hdfs as it finishes, thus clearing the way for the Hadoop cluster to execute the next reduce.
I am having trouble mapping this paradigm to (Py)Spark.
As an example,
df = spark.read.load('path')
df.rdd.reduceByKey(my_func).toDF().write.save('output_path')
When I run this, the cluster collects all of the data in the dataframe before it writes anything to disk. At least, this is what appears to be happening as I watch the job progress.
My problem is that my data is much bigger than my cluster memory, so I run out of memory before any data is written. In Hadoop Streaming, we don't have this problem because the output data is streamed to disk to make room for the subsequent batches of data.
I have considered something like this:
for i in range(100):
    (df.filter(df.loop_index == i)
       .rdd
       .reduceByKey(my_func)
       .toDF()
       .write.mode('append')
       .save('output_path'))
where I only process a subset of my data in each iteration. But this seems kludgy, mainly because I have to either persist df, which isn't possible because of memory constraints, or re-read from the input HDFS source in each iteration.
One way to make the loop work is to partition the source folders by day or some other subset of the data. But for the sake of the question, let's assume that isn't possible.
Questions: How do I run a job like this in PySpark? Do I just need a much bigger cluster? If so, what are the common practices for sizing a cluster before processing the data?
It might help to repartition your data into a large number of partitions. The example below would be similar to your for loop, although you may want to try fewer partitions first:
df = spark.read.load('path').repartition(100)
You should also review the number of executors you are currently using (--num-executors). Reducing this number should also reduce your memory footprint.
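A sketch of how that could fit together (the paths come from the question, and my_func here is just a stand-in for your actual reduce function):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def my_func(a, b):
    # stand-in for the reduce function used in the question
    return a + b

df = spark.read.load('path').repartition(100)   # many smaller partitions per task

(df.rdd
   .reduceByKey(my_func)
   .toDF()
   .write.mode('overwrite')
   .save('output_path'))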
I am reading a chunk of data from a pytables.Table (version 3.1.1) using the read_where method on a big HDF5 file. The resulting numpy array is about 420 MB; however, the memory consumption of my Python process goes up by 1.6 GB during the read_where call, and the memory is not released after the call finishes. Even deleting the array, closing the file, and deleting the HDF5 file handle does not free the memory.
How can I free this memory again?
The huge memory consumption is due to the fact that Python builds a lot of machinery around the data to facilitate its manipulation.
There is a good explanation of why the memory usage persists here and there (found on this question). A good workaround would be to open and manipulate your table in a subprocess with the multiprocessing module, so that the memory is returned to the operating system when the subprocess exits.
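A sketch of that workaround (the file name, node path, and query condition below are hypothetical):

import multiprocessing as mp
import tables

def read_chunk(filename, condition, queue):
    # everything allocated here lives only in the child process
    with tables.open_file(filename, mode="r") as h5:
        table = h5.get_node("/mytable")
        queue.put(table.read_where(condition))

if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=read_chunk, args=("data.h5", "col > 0", queue))
    proc.start()
    data = queue.get()   # the numpy array, sent back to the parent process
    proc.join()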
We would need more context on the details of your Table object, such as how large it is and the chunk size. How HDF5 handles chunking is probably one of the main culprits for hogging memory in this case.
My advice is to have a thorough read of this: http://pytables.github.io/usersguide/optimization.html#understanding-chunking and to experiment with different chunk sizes (typically making them larger).