High memory usage only when multiprocessing - python

I am trying to use python's multiprocessing library to hopefully gain some performance. Specifically I am using its map function. Now, for some reason when I swap it out with its single processed counterpart I don't get high memory usage. But using the multiprocessing version of map causes my memory to go through the roof. For the record I am doing something which can easily hog up loads of memory, but what would the difference be between the two to cause such a stark difference?

You realize that multiprocessing does not use threads, yes? I say this because you mention a "single threaded counterpart".
Are you sending a lot of data through multiprocessing's map? A likely cause is the serialization multiprocessing has to do in many cases. multiprocessing uses pickle, which typically takes up more memory than the data it's pickling. (In some cases, specifically on systems with fork() where new processes are created when you call the map method, it can avoid the serialization, but whenever it needs to send new data to an existing process it cannot do so.)
Since with multiprocessing all of the actual work is done in separate processes, the memory of your main process should not be affected by the actual operations you perform. The total use of memory does go up by quite a bit, however, because each worker process has a copy of the data you sent across. This is sometimes copy-on-write memory (in the same cases as not serializing) on systems that have CoW, but Python's use of memory is such that this quickly becomes written to, and thus copied.
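One rough way to gauge how much data map has to ship to the workers is to pickle the arguments yourself and look at the byte counts. This is only a sketch with made-up data (the nested list below is an assumption, not the asker's workload), and it measures the serialized payload only, not the extra memory each worker needs once it unpickles its copy.

```python
import pickle


def pickled_size(obj):
    """Return the number of bytes pickle produces for obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Hypothetical input: 100 work items, each a list of 1000 integers.
data = [list(range(1000)) for _ in range(100)]

# Each item passed to map() is pickled before it is sent to a worker.
per_item = [pickled_size(item) for item in data]
total = sum(per_item)

print("items:", len(per_item), "total pickled bytes:", total)
```

If the total is large compared with the result you get back, passing something small (a file name, an index range) and letting each worker load its own data may be worth trying.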

Related

How to speed up nested loops in python with concurrency?

I have the following code:
def multiple_invoice_matches(payment_regex, invoice_regex):
    multiple_invoice_payment_matches = []
    for p in payment_regex:
        if p["match_count"] > 1:
            for k in p["matches"]:
                for i in invoice_regex:
                    if i["rechnung_nr"] == k:
                        multiple_invoice_payment_matches.append({"fuzzy_ratio": 100, "type": 2, "m_match": 0, "invoice": i, "payment": p})
    return multiple_invoice_payment_matches
The sizes of payment_regex and invoice_regex are really huge, so the code snippet given above takes too much time to return the result. How can I speed up the running time of this code?
You could take a look at the numba library. If your data lends itself to parallelization, rewriting your function with numba could speed up your code considerably.
Without the dimensions of size and how your data is structured it's kind of hard to give a general approach to optimize your function.
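Separately from any parallelism, it may be worth checking whether the innermost loop can become a dictionary lookup. The sketch below is not what the answers here propose. It is a different technique (indexing the invoices by rechnung_nr first) and it assumes the rechnung_nr values are hashable. The function name multiple_invoice_matches_indexed and the sample data are made up for illustration.

```python
from collections import defaultdict


def multiple_invoice_matches_indexed(payment_regex, invoice_regex):
    # Build the index once: rechnung_nr -> invoices with that number,
    # kept in their original order.
    by_number = defaultdict(list)
    for invoice in invoice_regex:
        by_number[invoice["rechnung_nr"]].append(invoice)

    matches = []
    for p in payment_regex:
        if p["match_count"] > 1:
            for k in p["matches"]:
                # Dictionary lookup instead of scanning every invoice.
                for i in by_number.get(k, ()):
                    matches.append({"fuzzy_ratio": 100, "type": 2, "m_match": 0,
                                    "invoice": i, "payment": p})
    return matches


# Hypothetical sample data, shaped like the question's inputs.
payments = [
    {"match_count": 2, "matches": ["A1", "B2"]},
    {"match_count": 1, "matches": ["A1"]},
]
invoices = [{"rechnung_nr": "A1"}, {"rechnung_nr": "C3"}, {"rechnung_nr": "B2"}]

result = multiple_invoice_matches_indexed(payments, invoices)
print([m["invoice"]["rechnung_nr"] for m in result])
```

The results come out in the same order as in the original triple loop, because the index preserves invoice order within each number.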
I could say partition your data into multiple ranges (either by payment_regex, or by invoice_regex, or both) and then add those partitions to a work queue that is processed by multiple threads. Wait for those threads to finish (i.e.: join them), and then construct your final list based on the partial results you got for each partition.
This works well in other programming languages but unfortunately not in Python, because of the GIL, Python's Global Interpreter Lock.
If you don't know much about GIL here's a decent article, saying:
The Python Global Interpreter Lock or GIL, in simple words,
is a mutex (or a lock) that allows only one thread to hold
the control of the Python interpreter.
[...]
The impact of the GIL isn’t visible to developers who execute
single-threaded programs, but it can be a performance bottleneck
in CPU-bound and multi-threaded code.
To evade GIL you basically have two options:
(1) spawn multiple Python processes and use shared memory for backing up your data => concurrency will now rely on the OS for switching between processes (e.g.: use numpy and shared memory, see here)
(2) use a Python package that can manipulate your data and implements the multi-threading model in C, where GIL is not effective (e.g.: use numba)
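As a minimal sketch of the shared-memory half of option (1), the standard library's multiprocessing.shared_memory module (Python 3.8 and later) lets a second handle attach to a block by name. To keep the example short, both handles live in one process here; that is an assumption for illustration, and in practice the attach step would run inside a worker process. numpy is left out entirely.

```python
from multiprocessing import shared_memory

# The "parent" creates a named block of shared memory and writes into it.
shm = shared_memory.SharedMemory(create=True, size=16)
try:
    shm.buf[:5] = b"hello"

    # A worker would only need the block's name to attach to the same memory;
    # no pickling of the payload is involved.
    attached = shared_memory.SharedMemory(name=shm.name)
    seen_by_worker = bytes(attached.buf[:5])
    attached.close()
finally:
    # Every handle is closed, and the creator unlinks the block exactly once.
    shm.close()
    shm.unlink()

print(seen_by_worker)
```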
You may ask yourself then why Python supports multi-threading in the first place?
Multi-threading in Python is mostly useful when the threads are blocked by IO operations (reading or writing files, sockets, etc.) or by other system calls that put the thread to sleep. That's where Python releases the GIL, so other threads can run while some are asleep.
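A small sketch of that effect, using time.sleep as a stand-in for a blocking IO call (the fake_io function and the 0.2 second delay are made up for illustration): four sleeping tasks on four threads should finish in roughly the time of one, because a sleeping thread does not hold the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    # time.sleep releases the GIL, much like a blocking read would.
    time.sleep(delay)
    return delay


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, [0.2] * 4))
elapsed = time.perf_counter() - start

# Run one after another, the four calls would need about 0.8 seconds.
print("elapsed: %.2f s" % elapsed)
```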

Dask distributed memory and constant pickling/unpickling of results

dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)
While this process is efficient in terms of network usage, it will still result in frequent pickling and unpickling of data. I assume this is not a zero copy pickling, like memmapping would do for plasma or joblib in case of numpy arrays.
It's clear that when a calculation result is needed on another host or by another worker, it has to be pickled and sent over. This can be avoided when relying on threaded parallelism inside the same host, as all calculations access the same memory (--nthreads=8). dask uses this trick and simply accesses the result of calculations instead of pickling.
When we're using process based parallelization instead of threading in a single host (--nproc=8 --nthreads=1), I'd expect that dask pickles and unpickles again to send data through to the other worker. Is this correct?
In case the same process needs the calculation result (continuing based on the temporary result), is it smart enough to keep a cache in the given process and reuse results there?
Yes. Dask keeps data in memory on workers until that data is no longer needed. https://distributed.dask.org/en/latest/memory.html
Edit: When Dask moves data between processes on the same host then yes, it serializes data and moves it across a local socket and then deserializes it. It doesn't necessarily use pickle to serialize. It depends on the type.
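The difference between the two cases can be illustrated without Dask at all. The sketch below uses only the standard library and is not Dask's actual transport: a result handed between threads is the very same object, while a pickle round trip (standing in for serialization between processes) produces an equal but separate copy.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def identity(obj):
    return obj


data = list(range(1000))

# Threads share one address space: the worker returns the same object.
with ThreadPoolExecutor(max_workers=1) as pool:
    via_thread = pool.submit(identity, data).result()

# Crossing a process boundary means serialize, send, deserialize.
via_pickle = pickle.loads(pickle.dumps(data))

shared_between_threads = via_thread is data
copied_by_pickle = (via_pickle is not data) and (via_pickle == data)
print(shared_between_threads, copied_by_pickle)
```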

Python multithreading in IO-bound tasks

It is recommended to use Python multi-threading only in IO-bound tasks because Python has a global interpreter lock (GIL) that only allows one thread to hold the control of the Python interpreter. However, "Does multithreading make sense for IO-bound operations?" says that, in general, multithreading in disk IO-bound tasks only makes sense if you are accessing more than one disk, given that the bottleneck is the disk.
Given that, if I have several tasks that access a database in a single local disk running simultaneously, is there any advantage in using multithreading, as the bottleneck will be the disk?
Does the answer change if the database is stored in a single remote disk? I guess that possibly yes, given that there is another variable which may be the bottleneck: the round-trip time between me and the server.
CPython and Pypy both have problems with threading CPU-bound tasks. Others, like Jython and IronPython do not.
Sometimes it makes sense to use multithreading or multiprocessing with I/O bound tasks, because a disk seek is an eon to the CPU, so if you can get some CPU work out of the way while you wait for a disk response, you've done a good thing.
If you write your code to have a tunable amount of parallelism, you can experimentally deduce a good number for your workload.
If you write your code to use the new concurrent.futures API, you can (mostly) easily flip between threads and processes using the similar:
concurrent.futures.ThreadPoolExecutor
concurrent.futures.ProcessPoolExecutor
This API is available in CPython 3.2 and up, as well as Tauthon 2.8.
Here's an example program: http://stromberg.dnsalias.org/~strombrg/coordinate/
HTH.
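A minimal sketch of the flip between the two executors, with a tunable worker count. The names square and run_all are made up for illustration; only the threaded variant is actually run here.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(n):
    return n * n


def run_all(executor_cls, items, max_workers=4):
    # The executor class is the only thing that changes between
    # thread-based and process-based execution.
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(square, items))


results = run_all(ThreadPoolExecutor, range(10))
print(results)

# To try processes instead, call run_all(ProcessPoolExecutor, range(10)) from
# inside an `if __name__ == "__main__":` block. The worker function must be
# defined at module level so that it can be pickled.
```

Timing run_all with a few different max_workers values is one way to find a good setting for a given workload.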

How to read / process large files in parallel with Python

I have a large file almost 20GB, more than 20 mln lines and each line represents separate serialized JSON.
Reading file line by line as a regular loop and performing manipulation on line data takes a lot of time.
Is there any state of art approach or best practices for reading large files in parallel with smaller chunks in order to make processing faster?
I'm using Python 3.6.X
Unfortunately, no. Reading in files and operating on the lines read (such as json parsing or computation) is a CPU-bound operation, so there's no clever asyncio tactics to speed it up. In theory one could utilize multiprocessing and multiple cores to read and process in parallel, but having multiple threads reading the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which could then open up safer doors to parallelism with multiple cores. Sorry there isn't a better answer AFAIK.
There are several possibilities, but first profile your code to find the bottlenecks. Maybe your processing does some slow things which can be sped up, which would be vastly preferable to multiprocessing.
If that does not help, you could try:
Use another file format. Reading serialized json from text is not the fastest operation in the world. So you could store your data (for example in hdf5) which could speed up processing.
Implement multiple worker processes which each read a portion of the file (worker1 reads lines 0 - 1 million, worker2 reads 1 million - 2 million, etc.). You can orchestrate that with joblib or celery, depending on your needs. Integrating the results is the challenge; there you have to see what your needs are (map-reduce style?). This is more difficult in Python than in other languages because it has no real threading, so maybe you could switch languages for that.
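One possible way to carve the file into portions is by byte range rather than by line number, so that no worker has to count lines first. The sketch below is an assumption about how that could look, not what joblib or celery do: chunk_offsets and process_chunk are made-up names, and the chunks are processed one after another here, whereas in real use each (start, end) pair would be handed to a separate worker process.

```python
import json
import os
import tempfile


def chunk_offsets(path, n_chunks):
    """Split a file into at most n_chunks byte ranges ending on line boundaries."""
    size = os.path.getsize(path)
    offsets = []
    with open(path, "rb") as f:
        start = 0
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            target = size * i // n_chunks
            if target <= start:
                continue
            f.seek(target)
            f.readline()  # move on to the end of the line containing `target`
            end = min(f.tell(), size)
            offsets.append((start, end))
            start = end
    return offsets


def process_chunk(path, start, end):
    """Parse every JSON line in the byte range [start, end); return the count."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            json.loads(line)
            count += 1
    return count


# Hypothetical input: 1000 small JSON documents, one per line.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for n in range(1000):
        tmp.write(json.dumps({"id": n}) + "\n")
    path = tmp.name

file_size = os.path.getsize(path)
offsets = chunk_offsets(path, 4)
counts = [process_chunk(path, s, e) for s, e in offsets]
total = sum(counts)
os.remove(path)
print(offsets, counts, total)
```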
The main bottleneck you need to be aware of here is neither disk nor CPU, it is memory, not how much you have, but how the hardware and OS work together to pre-fetch pages from RAM into the L{1,2,3} caches.
The parallel approach will have worse performance than the serial approach if you use readline() to load one line at a time. The reason has to do with hardware+OS, not software. When the cpu requires a memory read, a certain amount of extra data is fetched in to the Lx caches of the CPU in anticipation that this data might be required later. When you employ the serial approach, this extra data is in fact used while it is still in the cache. But when parallel readlines() are happening, the extra data is preempted before there is a chance to use it. Hence it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the performance of the serial readline() approach is to have your parallel processes read more than one line at a time into memory. Use read() instead of readline(). How many bytes should you read? Approximately the size of your L1 cache, about 64K. With this, several pages of contiguous memory can be loaded into that cache.
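A sketch of the read() idea, assuming the input is opened in binary mode and uses "\n" line endings. The helper name iter_lines_blockwise is made up, and the in-memory sample stands in for a real file. It reads a fixed-size block at a time and carries any incomplete trailing line over to the next block.

```python
import io


def iter_lines_blockwise(f, block_size=64 * 1024):
    """Yield complete lines from a binary file object, reading block_size bytes at a time."""
    leftover = b""
    while True:
        block = f.read(block_size)
        if not block:
            break
        lines = (leftover + block).split(b"\n")
        leftover = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            yield line
    if leftover:
        yield leftover  # final line without a trailing newline


# Hypothetical sample data; a small block size forces lines to span blocks.
sample = b"".join(b"line %d\n" % i for i in range(5000))
lines = list(iter_lines_blockwise(io.BytesIO(sample), block_size=1024))
print(len(lines), lines[0], lines[-1])
```

Whether 64K is actually the best block size on a given machine is something to measure rather than assume.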
However, if you replace serial readline() with serial read() this will outperform parallel read(). Why? Because although each core has its own L1 cache, the cores are sharing the other L caches and therefore you're running into the same problem. Process 1 will pre-fetch to populate the caches, but before it has time to process it all, Process 2 takes over and replaces the contents of the cache. Therefore Process 1 will have to fetch the same data again later.
The performance difference between serial and parallel can be easily seen:
First make sure the whole file is in the page cache (free -m: buff/cache). By doing this you are totally removing the disk/block device layer from the equation. The whole file is in RAM.
Then run a simple serial code and time it.
Then run a simple parallel code using Process() and time it. On commodity systems, your serial reads will outperform your parallel reads.
This is contrary to what you expect, right? But you have to think about the assumptions you were making. Your assumptions didn't include the fact that there are caches and memory buses being shared between multiple cores. This is precisely where the bottleneck is located. We know disk isn't the bottleneck because we have loaded the entire file into page cache in RAM. We know CPU isn't the bottleneck because we are using multiprocessing.Process() which ensures simultaneous execution, one process per core (vmstat 1, top -d 1 %1).
The only way you can have better performance from the parallel approach is to make sure that you are running hardware with separate NUMA nodes where each core has its own memory bus and its own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, where you put a 100% utilization on each core at the same time. This is done using multiprocessing.Process().
Thread pools will not work for this because only one thread can execute Python bytecode at a time under Python's GIL, so the threads effectively share a single core. But multiprocessing.Process() is not restricted by the GIL and certainly does work.
Another side note: it is bad practice to try to load your entire input file into your heap. You never know if the file is larger than can fit into RAM, which will cause an OOM. So don't try doing this in an attempt to optimize the performance of your code.

multiprocess or threading in python?

I have a Python application that grabs a collection of data and performs a task for each piece of data in that collection. The task takes some time to complete as there is a delay involved. Because of this delay, I don't want the tasks to run one after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble, often some of the tasks would never actually fire.
If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
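A minimal sketch of handing work to threads through a Queue. The worker function, the doubling "task", and the choice of four threads are all made up for illustration. Each thread pulls items until it sees a None sentinel, and the main thread joins every worker before reading the results, so no task is silently dropped.

```python
import queue
import threading


def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this thread
            break
        results.put(item * 2)  # stand-in for the real, slow task


tasks = queue.Queue()
results = queue.Queue()

threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()

for item in range(20):
    tasks.put(item)
for _ in threads:
    tasks.put(None)  # one sentinel per worker thread

for t in threads:
    t.join()  # wait until every worker has finished

collected = sorted(results.get() for _ in range(20))
print(collected)
```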
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has portability issues as indicated above. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It is taking an approach different to PyPy, which may yield better (or just different) speedups.
Tasks run sequentially, but you get the illusion that they run in parallel. Tasks are good for file or connection I/O because they are lightweight.
Multiprocessing with Pool may be the right solution for you: processes run in parallel, so they are very good for intensive computing because each process runs on one CPU (or core).
Setting up multiprocessing can be very easy:
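from multiprocessing import Pool

def worker(input_item):
    output = do_some_work(input_item)  # do_some_work is a placeholder for your own function
    return output

if __name__ == "__main__":  # needed so that worker processes can import this module safely
    # Pool() makes one process for each CPU (or core) of your PC.
    # Use "Pool(4)" to force it to use 4 processes, for example.
    with Pool() as pool:
        list_of_results = pool.map(worker, input_list)  # launches all the work automatically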
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
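A sketch of that pattern, with two simplifications that are assumptions of this example: the child is an inline script run with the current interpreter, and each child prints its result to stdout instead of writing an output file. The pieces of data are passed as command-line arguments.

```python
import subprocess
import sys

# Inline child program: sums the integers given on its command line.
CHILD_CODE = "import sys; print(sum(int(x) for x in sys.argv[1:]))"

# Hypothetical small collection, already split into pieces.
pieces = [[1, 2, 3], [4, 5], [6]]

# Start every subprocess first so that they all run at the same time.
procs = [
    subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE] + [str(x) for x in piece],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    for piece in pieces
]

# Then collect and merge the outputs, with a timeout per child.
partial_sums = []
for proc in procs:
    out, _ = proc.communicate(timeout=30)
    partial_sums.append(int(out))

total = sum(partial_sums)
print(partial_sums, total)
```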
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
Using CPython's threading model will not give you any performance improvement for CPU-bound work, because the threads are not actually executed in parallel: the GIL, which protects CPython's reference-counting memory management, lets only one thread run Python code at a time. Multiprocessing allows parallel execution. Obviously in this case you have to have multiple cores available to farm out your parallel jobs to.
There is much more information available in this related question.
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally, and feed them to several processes of your program. (i.e. several processes instead of threads)
IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends stackless python, I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (still runs in one physical thread, so cannot scale to multiple cores.) It's merely an alternative way to structure asynchronous (but still single-threaded, non-parallel) application.
You may want to look at Twisted. It is designed for asynchronous network tasks.
