Pandas DataFrame Multithreading No Performance Gain - python

I have an in-memory dictionary with ~10,000 keys, where each key represents a stock ticker and the value is a pandas DataFrame holding the time series of daily stock prices. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time (~3 hours) to iterate through all C(10000, 2) combinations. I tried the multiprocessing.dummy package but saw no performance gain at all (it was actually slower as the number of workers increased).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    # pseudo code here
    t1, t2 = pair  # Python 2's tuple-parameter syntax is invalid in Python 3
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
tickers = list(data.keys())
for idx, t1 in enumerate(tickers):
    for t2 in tickers[idx:]:  # only the matrix top triangle
        todos.append((t1, t2))
pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
All the data has been loaded into memory, so the job should not be I/O intensive. Is there any reason why there is no performance gain at all?

When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multi-processing instead to parallelize your code in Python. So, if you change your import from
from multiprocessing.dummy import Pool
to
from multiprocessing import Pool
This should substantially improve your performance.
The above will fix your problem, but if you want to know why it happened, please continue reading.
Multi-threaded Python is constrained by the Global Interpreter Lock (GIL), which prevents two threads in the same process from executing Python bytecode at the same time. If you had a lot of disk I/O happening, multi-threading would have helped, because threads release the GIL while they wait on blocking I/O. Likewise, if your Python code called out to a separate application or library that does its work outside the interpreter, multi-threading could help. Multi-processing, on the other hand, uses separate processes, each with its own interpreter and its own GIL, so it can use all the cores of your CPU. In a CPU-bound Python application such as yours, using multi-processing instead of multi-threading lets the work run on several cores in parallel, which will boost the performance of your application.

Related

Parallel Processing Python : Why parallel processing is slower than serial processing? [duplicate]

from multiprocessing import Pool
import time

def op1(data):
    return [data[elem] + 1 for elem in range(len(data))]

data = [[elem for elem in range(20)] for elem in range(500000)]

start_time = time.time()
re = []
for data_ in data:
    re.append(op1(data_))
print('--- %s seconds ---' % (time.time() - start_time))

start_time = time.time()
pool = Pool(processes=4)
data = pool.map(op1, data)
print('--- %s seconds ---' % (time.time() - start_time))
I get a much slower run time with pool than I get with for loop. But isn't pool supposed to be using 4 processors to do the computation in parallel?
Short answer: Yes, the operations will usually be done on (a subset of) the available cores. But the communication overhead is large. In your example the workload is too small compared to the overhead.
When you construct a pool, a number of workers are created. When you then instruct the pool to map over a given input, the following happens:
the data will be split: every worker gets an approximately fair share;
the data will be communicated to the workers;
every worker will process their share of work;
the result is communicated back to the main process; and
the main process groups the results together.
Now splitting, communicating and joining the data are all steps carried out by the main process. They cannot be parallelized. Since the operation itself is fast (O(n) with input size n), the overhead has the same time complexity as the work.
So, complexity-wise, even if you had millions of cores it would not make much difference, because communicating the list is probably already more expensive than computing the results.
That's why you should parallelize computationally expensive tasks, not trivial ones. The amount of processing should be large compared to the amount of communication.
In your example, the work is trivial: you add 1 to all the elements. Serializing, however, is less trivial: you have to encode the lists you send to the workers.
There are a couple of potential trouble spots with your code, but primarily it's too simple.
The multiprocessing module works by creating different processes, and communicating among them. For each process created, you have to pay the operating system's process startup cost, as well as the python startup cost. Those costs can be high, or low, but they're non-zero in any case.
Once you pay those startup costs, you then pool.map the worker function across all the processes, which basically adds 1 to a few numbers. This is not a significant load, as your tests prove.
What's worse, you're using .map() which is implicitly ordered (compare with .imap_unordered()), so there's synchronization going on - leaving even less freedom for the various CPU cores to give you speed.
If there's a problem here, it's a "design of experiment" problem - you haven't created a sufficiently difficult problem for multiprocessing to be able to help you.
As others have noted, the overhead that you pay to facilitate multiprocessing is more than the time-savings gained by parallelizing across multiple cores. In other words, your function op1() does not require enough CPU resources to see performance gain from parallelizing.
In the multiprocessing.Pool class, the majority of this overhead is spent serializing and deserializing data before the data is shuttled between the parent process (which creates the Pool) and the child "worker" processes.
This blog post explores, in greater detail, how expensive pickling (serializing) can be when using the multiprocessing.Pool class.
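You can gauge that serialization cost directly, since everything sent to and from Pool workers goes through pickle. A small sketch using the question's own input:

```python
import pickle

# The question's input: 500,000 sublists of 20 small integers.
payload = [list(range(20)) for _ in range(500000)]

# This is (roughly) the work the parent process must do before any
# worker computes anything, and again in reverse for the results.
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
print("%.1f MB serialized before any work happens" % (len(blob) / 1e6))
```

Running pickle.dumps on the input yourself is a quick sanity check on whether a workload is worth distributing at all.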


Python multithreading performance

I have to process, say, thousands of records in an array. I did the normal for loop like this:
for record in records:
    results = processFile(record)
    write_output_record(o, results)
The script above took 427.270612955 seconds!
As there is no dependency between these records, I used Python's multiprocessing thread pool in the hope of speeding up the process. Below is my implementation:
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(processes=threads)
results = pool.map(processFile, records)
pool.close()
pool.join()
write_output(o, results)
My computer has 8 CPUs, and it takes 852.153398991 seconds.
Can somebody help me as in what am I doing wrong?
PS: the processFile function has no I/O; it mostly processes the record and sends back the updated record.
Try using vmstat and verify whether it's a memory issue. Sometimes, using multithreading can slow your system down if each thread pushes up the RAM usage by a significant amount.
Usually people encounter three types of issues: CPU bound (Constraint on CPU computations), Memory bound (Constraint on RAM) and I/O bound (Network & hard drive I/O constraints).

Python ThreadPool from multiprocessing.pool cannot ultilize all CPUs

I have some string-processing work in Python, and I wish to speed it up by using a thread pool. The string-processing jobs have no dependency on each other. The results will be stored in a MongoDB database.
I wrote my code as follows:
import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()
I tried running the code on a Linux machine with 8 CPU cores, and it turns out that the maximum CPU usage only reaches around 130% (read from top) when the job runs for a few minutes.
Is my approach correct to use a thread pool? Is there any better way to do so?
You might check using multiple processes instead of multiple threads. Here is a good comparison of both options. One of the comments points out that Python cannot use multiple CPUs while working with multiple threads (due to the Global Interpreter Lock). So instead of using a thread pool you should use a process pool to take full advantage of your machine.
Perhaps _process isn't CPU bound; it might be slowed by the file system or network if you're writing to a database. You could see if the CPU usage rises if you make your process truly CPU bound, for example:
def _process(s):
    for i in range(100000000):  # xrange in the original Python 2 code
        j = i * i

Does Python support multithreading? Can it speed up execution time?

I'm slightly confused about whether multithreading works in Python or not.
I know there have been a lot of questions about this, and I've read many of them, but I'm still confused. I know from my own experience, and have seen others post their own answers and examples here on Stack Overflow, that multithreading is indeed possible in Python. So why does everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say threads are still useful, because they work simultaneously and thus get the combined workload done faster. Why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it, multithreading will only run in parallel for some I/O tasks, while CPU-bound tasks run one thread at a time even on a multi-core machine.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: any task that tries to get a speed boost from parallel execution using pure Python code will not see a speed-up, because threaded Python code is locked to one thread executing at a time. If you mix in C extensions or I/O, however (such as PIL or numpy operations), any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code; stick to the multiprocessing module for such tasks, or delegate to a dedicated external library.
Yes. :)
You have the low-level thread module and the higher-level threading module. But if you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Threading is allowed in Python; the only problem is that the GIL will make sure just one thread is executed at a time (no parallelism).
So basically, if you multi-thread code to speed up a calculation, it won't get faster, because just one thread executes at a time; but if you use threads to interact with a database, for example, it will help.
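The database case can be sketched like this: time.sleep stands in for a blocking database round-trip, and because sleeping (like real I/O) releases the GIL, the four waits overlap instead of running back to back.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_db_call(i):
    time.sleep(0.2)  # pretend network/database latency; releases the GIL
    return i

start = time.time()
with ThreadPool(4) as pool:
    results = pool.map(fake_db_call, range(4))
elapsed = time.time() - start  # roughly one 0.2 s wait, not four in a row
```

Run serially, the four calls would take about 0.8 seconds; with four threads the waits overlap, which is exactly the I/O-bound case where threading pays off.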
I feel for the poster, because the answer is invariably "it depends on what you want to do". However, parallel speed-up in Python has always been terrible in my experience, even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
import time
import multiprocessing as mp
import numpy as np

# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2, 4, 8, 16]
for task in tasks:
    tic = time.perf_counter()
    pool = mp.Pool(task)
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    toc = time.perf_counter()
    time1 = toc - tic
    print(f"parallel {time1} tasks {task}")
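For what it's worth, if the tutorial's howmany_within_range_rowonly counts, per row, the values falling within a range such as [4, 8] (an assumption about that function, which isn't shown here), a vectorized NumPy version sidesteps inter-process communication entirely:

```python
import numpy as np

# Smaller array than the benchmark above, purely for illustration.
arr = np.random.randint(0, 10, size=[1000, 600])

# Assumed semantics: count, for each row, how many values fall in [4, 8].
# The boolean mask and row-wise sum run in C, with no pickling or IPC.
counts = ((arr >= 4) & (arr <= 8)).sum(axis=1)
```

Shipping each row through pool.map pays serialization costs per row, which is a large part of why the timings above scale so poorly.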
