Dask distributed memory and constant pickling/unpickling of results - python

dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)
While this process is efficient in terms of network usage, it will still result in frequent pickling and unpickling of data. I assume this is not zero-copy pickling, like the memory-mapping that Plasma or joblib can do for NumPy arrays.
It's clear that when a calculation result is needed on another host or another worker, it has to be pickled and sent over. This can be avoided by relying on threaded parallelism within a single host, since all calculations access the same memory (--nthreads=8). Dask uses this trick and simply accesses the results of calculations instead of pickling them.
When we use process-based parallelization instead of threading on a single host (--nprocs=8 --nthreads=1), I'd expect Dask to pickle and unpickle again in order to send data to the other worker. Is this correct?

If the same process needs the calculation result (continuing from the intermediate result), is Dask smart enough to keep a cache in that process and reuse the result there?
Yes. Dask keeps data in memory on workers until that data is no longer needed. https://distributed.dask.org/en/latest/memory.html
Edit: When Dask moves data between processes on the same host then yes, it serializes data and moves it across a local socket and then deserializes it. It doesn't necessarily use pickle to serialize. It depends on the type.
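This behaviour can be seen with a small local-cluster experiment. The following is a minimal sketch of my own (not from the original thread), assuming a process-based local cluster started through the Client constructor; load() and process() are illustrative stand-ins for real work:

import the client interface for a local dask.distributed cluster
from dask.distributed import Client

def load(i):
    # stand-in for a real data-loading step
    return list(range(i * 1000, (i + 1) * 1000))

def process(chunk):
    # stand-in for a real computation on the loaded data
    return sum(chunk)

if __name__ == "__main__":
    # processes=True mirrors the --nprocs style of workers; one thread each
    client = Client(n_workers=8, threads_per_worker=1, processes=True)

    loaded = client.map(load, range(8))    # futures; the data stays on the workers
    summed = client.map(process, loaded)   # usually scheduled on the worker that already
                                           # holds the input, so no serialization is needed
    total = client.submit(sum, summed)     # gathering the pieces onto one worker forces
                                           # serialization over local sockets
    print(total.result())                  # moving the final result to the client does too
    client.close()

Intermediate futures stay on the workers that computed them; only when a result is needed somewhere else (here, the final sum and the client-side result()) does Dask serialize it and move it across a local socket.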

Related

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128-CPU EC2 instance. I need to run a program that accepts a 10-million-row CSV and sends a request to a DB for each row in that CSV to augment the existing data. In order to speed this up, I use:
import concurrent.futures

executor = concurrent.futures.ProcessPoolExecutor(len(chunks))     # one worker process per chunk
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]  # <func_name> processes a single chunk
successes = concurrent.futures.wait(futures)                       # block until every chunk is done
I chunk up the 10-million-row CSV into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so 129 total). Each process takes a chunk, retrieves the records for its chunk, and writes the output to a file. At the end, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!
Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example, if you are CPU-limited and the algorithm can't be made more efficient, that's probably a hard limit. If you're limited by storage bandwidth and you're already using efficient read/write caching (typically handled by the OS or by low-level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming Python is running on a single physical machine and you mean individual cores of one CPU.) Yes, Python's multiprocessing.Process creates a new OS-level process with a single thread, which the OS scheduler then assigns to run on a physical core for a given amount of time. Scheduling algorithms are typically quite good, so if you have as many busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. CPython is not thread-safe, so it only allows a single thread per process to execute Python bytecode at a time (this is the global interpreter lock, the GIL). There are specific exceptions when a function is written in C or C++ and calls the Python macro Py_BEGIN_ALLOW_THREADS, though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently, and they will have less overhead than processes. Threads also share memory, making it easier to pass results back after completion (threads can simply modify some shared state rather than passing results via a queue or similar); see the small contrast sketched below.
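A tiny illustration of that last point (my own, not from the answer): threads in one process can write into an ordinary Python object, while processes have to send their results back, for example as the pickled return value of Pool.map:

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

results = []                       # plain Python object, visible to all threads

def work_thread(x):
    results.append(x * x)          # threads share the interpreter's memory

def work_process(x):
    return x * x                   # processes must return (pickle) their result

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as tp:
        list(tp.map(work_thread, range(8)))
    print(sorted(results))         # results collected via shared state

    with Pool(4) as pool:
        print(pool.map(work_process, range(8)))   # results collected via pickled return values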
multithreading on each CPU?
Again, what you probably have is a single machine whose 128 CPUs are really 128 cores. The OS scheduler decides which threads run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example, running 128 processes with 8 threads each would give 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to profile it. Profiling parallel code is more challenging, and profiling a remote or virtualized machine can be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database, because most database software has had a great deal of optimization work put into it, but you must use it in the right way to get the most speed out of it. Batched requests come to mind, rather than accessing a single row at a time; a rough sketch follows.
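For example, a rough sketch of the batching idea (my own, not from the original post): group the CSV rows and let each worker issue one request per batch instead of one per row. query_db_batch() below is a placeholder for whatever bulk lookup your database driver actually supports:

import concurrent.futures
import csv

BATCH_SIZE = 1000

def query_db_batch(batch):
    # placeholder for the real multi-row DB lookup; replace with your driver's
    # batched query (an IN (...) clause, a bulk endpoint, etc.)
    return batch

def batches(rows, size):
    # yield successive lists of `size` rows
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def augment_batch(batch):
    # one round trip for the whole batch instead of len(batch) round trips
    return query_db_batch(batch)

if __name__ == "__main__":
    with open("input.csv", newline="") as f:
        rows = list(csv.reader(f))
    with concurrent.futures.ProcessPoolExecutor(max_workers=128) as executor:
        results = list(executor.map(augment_batch, batches(rows, BATCH_SIZE)))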

How to make sure child process finishes copying data into shared memory before join() is called?

I am using multiprocessing.Process to load some images and store them in shared memory as explained here. The problem is, sometimes my code crashes due to a huge memory spike at completely random times. I just had an idea of what might be causing this: the process has not had enough time to copy the contents of the image into the shared memory in RAM by the time join() is called. To test my hypothesis I added time.sleep(0.015) after calling join() on each of my processes, and this has already reduced the number of memory spikes by about 90% or more. However, I'm still not 100% certain whether the reason I get fewer memory spikes is that this little bit of time lets the data get fully transferred to the shared memory.
So I wonder, is there a way to make sure a child process has finished copying data to shared memory before .join() is called? I do not want to use a fixed number when calling time.sleep(). It would be great to know when the data has been fully transferred to the shared memory and then call join().
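One common way to do this is to have the child signal completion explicitly. The following is a minimal sketch of my own (not tied to whatever shared-memory mechanism the linked post uses, and assuming Python 3.8+ for multiprocessing.shared_memory): the child sets a multiprocessing.Event once the copy is complete, and the parent waits on that event before calling join():

import numpy as np
from multiprocessing import Process, Event
from multiprocessing import shared_memory

SHAPE, DTYPE = (480, 640, 3), np.uint8   # assumed image shape for the example

def loader(shm_name, done):
    shm = shared_memory.SharedMemory(name=shm_name)
    buf = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    buf[:] = np.zeros(SHAPE, dtype=DTYPE)   # stand-in for loading a real image
    shm.close()
    done.set()                               # signal: the copy is complete

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)))
    done = Event()
    p = Process(target=loader, args=(shm.name, done))
    p.start()
    done.wait()        # block until the child reports that the data is in place
    p.join()
    image = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf).copy()
    shm.close()
    shm.unlink()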

How to read / process large files in parallel with Python

I have a large file, almost 20 GB with more than 20 million lines, and each line is a separate serialized JSON document.
Reading the file line by line in a regular loop and performing manipulation on each line's data takes a lot of time.
Is there any state of art approach or best practices for reading large files in parallel with smaller chunks in order to make processing faster?
I'm using Python 3.6.X
Unfortunately, no. Reading in files and operating on the lines read (such as JSON parsing or computation) is a CPU-bound operation, so there are no clever asyncio tactics to speed it up. In theory one could use multiprocessing and multiple cores to read and process in parallel, but having multiple threads reading the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which could then open up safer doors to parallelism with multiple cores. Sorry there isn't a better answer AFAIK.
There are several possibilities, but first profile your code to find the bottlenecks. Maybe your processing does something slow that can be sped up, which would be vastly preferable to multiprocessing.
If that does not help, you could try:
Use another file format. Parsing serialized JSON from text is not the fastest operation in the world, so you could store your data in a different format (for example HDF5), which could speed up processing.
Implement multiple worker processes which each read a portion of the file (worker 1 reads lines 0 to 1 million, worker 2 reads lines 1 million to 2 million, etc.); a sketch follows below. You can orchestrate that with joblib or celery, depending on your needs. Integrating the results is the challenge; there you have to see what your needs are (map-reduce style?). Because Python has no real threading, this is more difficult than in other languages, so maybe you could switch languages for that part.
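A rough sketch of that second option (my own, assuming a newline-delimited JSON file named data.jsonl): split the file into byte ranges, have each worker seek to its range and realign to the next full line so no line is split between workers, then parse only the lines it owns:

import json
import os
from multiprocessing import Pool

FILENAME = "data.jsonl"   # assumed input: one serialized JSON document per line
N_WORKERS = 8

def process_range(byte_range):
    start, end = byte_range
    count = 0
    with open(FILENAME, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # advance to the start of the next full line
        while f.tell() < end:     # own every line that starts inside [start, end)
            line = f.readline()
            if not line:
                break
            record = json.loads(line)   # stand-in for the real per-line work
            count += 1
    return count

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // N_WORKERS
    ranges = [(i * step, size if i == N_WORKERS - 1 else (i + 1) * step)
              for i in range(N_WORKERS)]
    with Pool(N_WORKERS) as pool:
        print(sum(pool.map(process_range, ranges)))   # map-reduce style combine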
The main bottleneck you need to be aware of here is neither disk nor CPU, it is memory: not how much you have, but how the hardware and OS work together to prefetch pages from RAM into the L1/L2/L3 caches.
The parallel approach will have worse performance than the serial approach if you use readline() to load one line at a time. The reason has to do with the hardware and the OS, not software. When the CPU requires a memory read, a certain amount of extra data is fetched into the CPU's caches in anticipation that this data might be needed soon. With the serial approach, this extra data is in fact used while it is still in the cache. But when parallel readline() calls are happening, the extra data is evicted before there is a chance to use it, so it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the performance of the serial readline() approach is to have your parallel processes read more than one line at a time into memory. Use read() instead of readline(). How many bytes should you read? Approximately the size of your L1 cache, about 64K. With this, several pages of contiguous memory can be loaded into that cache.
However, if you replace serial readline() with serial read(), the serial version will still outperform parallel read(). Why? Because although each core has its own L1 cache, the cores share the higher-level caches, so you run into the same problem. Process 1 will prefetch data to populate the caches, but before it has time to process it all, Process 2 takes over and replaces the contents of the cache. Therefore Process 1 will have to fetch the same data again later.
The performance difference between serial and parallel can be easily seen:
First, make sure the whole file is in the page cache (check free -m, the buff/cache column). By doing this you remove the disk/block-device layer from the equation entirely; the whole file is in RAM.
Then run a simple serial code and time it.
Then run a simple parallel code using Process() and time it. On commodity systems, your serial reads will outperform your parallel reads.
This is contrary to what you expect, right? But you have to think about the assumptions you were making. Your assumptions didn't include the fact that there are caches and memory buses being shared between multiple cores. This is precisely where the bottleneck is located. We know disk isn't the bottleneck because we have loaded the entire file into page cache in RAM. We know CPU isn't the bottleneck because we are using multiprocessing.Process() which ensures simultaneous execution, one process per core (vmstat 1, top -d 1 %1).
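A bare-bones version of that experiment (my own sketch, assuming a newline-delimited file that is already in the page cache): time one serial pass over the whole file, then time several Process() workers each reading their own byte range, and compare the wall-clock times:

import os
import time
from multiprocessing import Process

FILENAME = "data.jsonl"   # assumed: newline-delimited file already in the page cache

def read_range(start, end):
    # read lines covering roughly the byte range [start, end)
    with open(FILENAME, "rb") as f:
        f.seek(start)
        while f.tell() < end and f.readline():
            pass

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)

    t0 = time.time()
    read_range(0, size)                      # serial: one process reads the whole file
    print("serial  :", time.time() - t0)

    n = 4
    step = size // n
    procs = [Process(target=read_range,
                     args=(i * step, size if i == n - 1 else (i + 1) * step))
             for i in range(n)]
    t0 = time.time()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("parallel:", time.time() - t0)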
The only way to get better performance from the parallel approach is to run on hardware with separate NUMA nodes, where each node has its own memory bus and its own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, where you put a 100% utilization on each core at the same time. This is done using multiprocessing.Process().
Thread pools will not work for this because the threads are restricted by Python's GIL, so only one of them can execute Python bytecode at a time. But multiprocessing.Process() is not restricted by the GIL and certainly does work.
Another side note: it is bad practice to try to load your entire input file onto the heap. You never know if the file is larger than available RAM, which will cause an out-of-memory kill. So don't do this in an attempt to optimize the performance of your code.

Python Manager Dictionary Efficiency

I have an object-oriented Python program where I am doing certain data operations in each object using multiprocessing. I am storing each object in a shared manager dictionary. When I want to update an object, first I retrieve the object from the dictionary, and after the update I put it back. My class structure is like:
from src.data_element import Data_element
from multiprocessing import freeze_support, Process, Manager
import pandas as pd

class Data_Obj(Data_element):
    def __init__(self, dataset_name, name_wo_fields, fields):
        Data_element.__init__(self, dataset_name, name_wo_fields, fields)
        self.depends = ['data_1', 'data_2']

    def calc(self, obj_dict_manager):
        data_1 = obj_dict_manager['data_1']
        data_2 = obj_dict_manager['data_2']
        self.df = pd.merge(
            data_1.df,
            data_2.df,
            on='week',
            suffixes=('', '_y')
        )[['week', 'val']]

def calculate(obj_dict_manager, data):
    data_obj = obj_dict_manager[data]
    data_obj.calc(obj_dict_manager)
    obj_dict_manager[data] = data_obj

if __name__ == '__main__':
    freeze_support()
    manager = Manager()
    obj_dict_manager = manager.dict()
    obj_dict_manager = create_empty_objects(obj_dict_manager)

    joblist = []
    for data in obj_dict_manager.keys():
        p = Process(target=calculate, args=(obj_dict_manager, data))
        joblist.append(p)
        p.start()

    for job in joblist:
        job.join()
During these operations, a significant amount of time is spent on
data_1=obj_dict_manager['data_1']
data_2=obj_dict_manager['data_2']
i.e., about 1 second is spent retrieving the objects from the manager dictionary, and the rest of the calculation takes another 1 second.
Is there any way that I can reduce the time spent here? I will be doing thousands of such operations and performance is critical for me.
An Important Note
You're doing something potentially dangerous: as you iterate over the keys in obj_dict_manager, you're launching processes that modify that very same dictionary. You should never modify something while you're iterating over it, and doing the modifications asynchronously from a subprocess could introduce especially strange results.
Possible Causes of your Issue
1) I can't tell how many objects are actually stored in your shared dictionary (because we don't have the code for create_empty_objects()), but if it is a significant number, your subprocesses may be competing for access to the shared dictionary. In particular, since you both read from and write to the dictionary, it's going to be locked by one process or another a lot of the time.
2) Since we can't see how many keys are in your shared dictionary, we also can't see how many processes are being launched. If you're creating more processes than cores on your system, you may be subjecting your CPU to a lot of context switching, which is going to slow everything down.
3) A combination of #1 & #2 - This could be especially problematic if the manager grants a lock to one process, then that process gets put to sleep because you have dozens of processes competing for CPU time on an 8-core machine, and now everyone has to wait until that process wakes up and releases the lock.
How to Fix It
1) If your issue is skewed towards #1, consider splitting up your dictionary instead of using a shared one: pass a chunk of the dictionary to each subprocess, let them do whatever they need, have them return the resulting dictionary, then recombine all the returned dictionaries as the processes complete (a sketch follows after these suggestions). Something like multiprocessing.Pool.map_async() may work better for you if you can divide the dictionary up.
2) In most cases, try to limit the number of processes you spawn to the number of cores on your system, sometimes even fewer if you have a lot of other things running at the same time. An exception is when you're doing a lot of parallel processing AND you expect the subprocesses to block a lot, such as when doing I/O in parallel.
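A rough sketch of suggestion #1 (my own, reusing the poster's create_empty_objects() and Data_Obj.calc(), and assuming each chunk contains, or is given, the objects its members depend on): split a plain dict into chunks, process each chunk in a Pool worker, then merge the returned chunks:

from multiprocessing import Pool

def chunked(d, n_chunks):
    # split a plain dict into roughly n_chunks smaller dicts
    items = list(d.items())
    size = max(1, len(items) // n_chunks)
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]

def calculate_chunk(chunk):
    # work on ordinary local objects; nothing goes through a Manager proxy or lock
    for data_obj in chunk.values():
        data_obj.calc(chunk)   # assumes shared inputs such as data_1/data_2 are
                               # included in (or passed alongside) each chunk
    return chunk

if __name__ == '__main__':
    objects = create_empty_objects({})   # the poster's helper, assumed to fill a plain dict
    with Pool(processes=8) as pool:
        pieces = pool.map(calculate_chunk, chunked(objects, 8))
    merged = {}
    for piece in pieces:
        merged.update(piece)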

High memory usage only when multiprocessing

I am trying to use python's multiprocessing library to hopefully gain some performance. Specifically I am using its map function. Now, for some reason when I swap it out with its single processed counterpart I don't get high memory usage. But using the multiprocessing version of map causes my memory to go through the roof. For the record I am doing something which can easily hog up loads of memory, but what would the difference be between the two to cause such a stark difference?
You realize that multiprocessing does not use threads, yes? I say this because you mention a "single threaded counterpart".
Are you sending a lot of data through multiprocessing's map? A likely cause is the serialization multiprocessing has to do in many cases. multiprocessing uses pickle, and the pickled representation typically takes up more memory than the data being pickled. (In some cases, specifically on systems with fork() where new processes are created when you call the map method, it can avoid the serialization, but whenever it needs to send new data to an existing process it cannot avoid it.)
Since with multiprocessing all of the actual work is done in separate processes, the memory of your main process should not be affected by the operations you perform. The total memory use does go up by quite a bit, however, because each worker process has a copy of the data you sent across. On systems with fork() this is sometimes copy-on-write memory (in the same cases where serialization is avoided), but CPython's memory behaviour, such as reference counts stored inside the objects themselves, means those pages quickly get written to and therefore copied.
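A minimal sketch of my own (not from the answer) of two ways to keep the memory overhead down on fork()-based systems such as Linux: create the large data before the Pool so workers inherit it copy-on-write instead of having it pickled to them, and consume results incrementally with imap() instead of building one huge output list:

from multiprocessing import Pool

# created before the Pool, so forked workers inherit it copy-on-write
# instead of receiving a pickled copy through map()
BIG_DATA = [list(range(1000)) for _ in range(10_000)]   # stand-in for the real data

def work(index):
    return sum(BIG_DATA[index])     # stand-in for the real per-item computation

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # only small integer indices cross the process boundary, and imap()
        # yields results a chunk at a time instead of all at once
        for result in pool.imap(work, range(len(BIG_DATA)), chunksize=64):
            pass   # consume or aggregate each result here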
