Python multithreading performance - python

I have to process several thousand records stored in an array. I started with a plain for loop like this:
for record in records:
    results = processFile(record)
    write_output_record(o, results)
The script above took 427.270612955 seconds!
There is no dependency between the records, so I used Python's multithreading module hoping to speed up the processing. Below is my implementation:
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(processes=threads)
results = pool.map(processFile, records)
pool.close()
pool.join()
write_output(o, results)
My computer has 8 CPUs, yet this version takes 852.153398991 seconds.
Can somebody tell me what I am doing wrong?
PS: processFile does no I/O; it mostly processes the record and returns the updated record.

Try using vmstat to verify whether it's a memory issue. Multithreading can sometimes slow a system down if each thread pushes RAM usage up by a significant amount.
People usually run into three kinds of bottlenecks: CPU bound (limited by CPU computation), memory bound (limited by RAM), and I/O bound (limited by network and disk I/O).
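If vmstat shows no memory pressure, then a workload like the one above (pure computation, no I/O) is almost certainly CPU bound, and a thread pool from multiprocessing.dummy cannot run Python code on more than one core at a time because of the GIL. Here is a minimal sketch of the process-based alternative, reusing the names from the question (processFile, records, write_output, o) and assuming the records and results are picklable:
import multiprocessing

if __name__ == "__main__":
    # worker processes each have their own interpreter and GIL
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = pool.map(processFile, records)
    pool.close()
    pool.join()
    write_output(o, results)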

Related

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128-CPU EC2 instance. I need to run a program that accepts a 10 million row CSV and sends a request to a DB for each row in that CSV to augment the existing data. In order to speed this up, I use:
import concurrent.futures

executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk the 10 million row CSV into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so 129 total). Each process takes a chunk, retrieves the records for its chunk, and writes the output to a file. At the end, I merge all the files together and voila.
I have a few questions about this.
Is this the most efficient way to do this?
By creating 128 subprocesses, am I really using the 128 CPUs of the machine?
Would multithreading be better/more efficient?
Can I multithread on each CPU?
Any advice on what to read up on?
Thanks in advance!
Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example if you are cpu limited, and the algorithm can't be made more efficient, that's probably a hard limit. If you're storage bandwidth limited, and you're already using efficient read/write caching (typically handled by the OS or by low level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming Python is running on a single physical machine, and you mean individual cores of one CPU.) Yes: Python's multiprocessing.Process creates a new OS-level process with a single thread, which the OS scheduler then assigns to run on a physical core for a given amount of time. Scheduling algorithms are typically quite good, so if you have as many busy threads as logical cores, the OS will keep all the cores busy.
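A tiny experiment (not from the answer above, just an illustration) makes this visible: start one busy process per logical core and top/htop should report roughly 100% usage on every core.
import multiprocessing
import time

def spin(seconds=10):
    # pure CPU work, no I/O, so the process never yields voluntarily
    end = time.time() + seconds
    while time.time() < end:
        pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=spin)
             for _ in range(multiprocessing.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()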
Would threads be better?
Not likely. CPython is not thread safe, so its Global Interpreter Lock (GIL) only allows a single thread per process to execute Python code at a time. There are specific exceptions when a function is written in C or C++ and wraps its work in the Py_BEGIN_ALLOW_THREADS macro, though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently and will have less overhead than processes. Threads also share memory, which makes passing results back after completion easier (threads can simply modify some shared state rather than passing results via a queue or similar).
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores. The OS scheduler decides which threads run on each core at any given time. Unless the threads release the GIL, only one thread from each process can run at a time. For example, running 128 processes with 8 threads each would give 1024 threads, but still only 128 of them could ever run at once, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to profile it. Profiling parallel code is more challenging, and profiling on a remote or virtualized machine can be challenging as well. It is not always obvious what makes a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using, in particular the database: most database software has had a great deal of work put into optimization, but you must use it in the right way to get the most speed out of it. Batched requests come to mind, rather than accessing a single row at a time; see the sketch below.
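As a purely hypothetical sketch of the batching idea (fetch_augmentation_batch and the "id" field are made-up stand-ins for whatever bulk lookup your database actually supports, such as an IN (...) query or executemany):
def process_chunk(rows, batch_size=500):
    # Issue one database round trip per batch of rows instead of one per row.
    augmented = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        keys = [row["id"] for row in batch]          # hypothetical key column
        extra = fetch_augmentation_batch(keys)       # hypothetical bulk lookup: {id: fields}
        for row in batch:
            row.update(extra.get(row["id"], {}))
            augmented.append(row)
    return augmented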

Pandas DataFrame Multithreading No Performance Gain

I have an in-memory dictionary with ~10,000 keys, where each key represents a stock ticker and the value stores a pandas DataFrame of the daily stock price time series. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time (~3 hours) to iterate through all C(10000, 2) combinations. I tried the multiprocessing.dummy package but saw no performance gain AT ALL (it was actually slower as the number of workers increased).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation((t1, t2)):  # Python 2 tuple-parameter unpacking
    # pseudo code here
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
for idx, t1 in enumerate(list(data.keys())):
    for t2 in list(data.keys())[idx:]:  # only the top triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
All the data has been loaded into memory, so it should not be I/O intensive. Is there any reason why there is no performance gain at all?
When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multi-processing instead to parallelize your code. So, if you change your code from
from multiprocessing.dummy import Pool
to
from multiprocessing import Pool
This should substantially improve your performance.
The above will fix your problem, but if you want to know why this happened, please continue reading:
Multi-threading in Python is constrained by the Global Interpreter Lock (GIL), which prevents two threads in the same process from executing Python code at the same time. If you had a lot of disk I/O happening, multi-threading would have helped, because threads release the GIL while they wait on I/O. Similarly, if your Python code called into a separate application or library that releases the GIL, multi-threading would have helped. Multi-processing, on the other hand, runs separate processes, each with its own interpreter and GIL, so it can use all the cores of your CPU. In a CPU-bound Python application such as yours, using multi-processing instead of multi-threading lets the work run on several cores in parallel, which will boost the performance of your application.
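For completeness, here is a sketch of what the process-based version of the code above might look like, assuming a Unix fork start method so the worker processes inherit the in-memory data dict (with the spawn start method, e.g. on Windows, you would have to pass the series to the workers explicitly):
from itertools import combinations
from multiprocessing import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

if __name__ == '__main__':
    # combinations() skips the self-pairs, whose correlation is 1 anyway
    todos = list(combinations(data.keys(), 2))
    with Pool(4) as pool:
        # a large chunksize keeps inter-process messaging overhead down
        results = pool.map(calculate_correlation, todos, chunksize=1000)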

Maximum pool size when using ThreadPool Python

I am using ThreadPool to achieve multiprocessing. When using multiprocessing, the pool size should be limited to the number of CPU cores.
My question: when using ThreadPool, should the pool size also be limited to the number of CPU cores?
This is my code
from multiprocessing.pool import ThreadPool as Pool

class Subject():
    def __init__(self, url):
        # rest of the code
        pass

    def func1(self):
        # returns something
        pass

if __name__ == "__main__":
    pool_size = 11
    pool = Pool(pool_size)
    objects = [Subject(url) for url in all_my_urls]
    for obj in objects:
        pool.apply_async(obj.func1, ())
    pool.close()
    pool.join()
What should the maximum pool size be?
Thanks in advance.
You cannot use threads to achieve multiprocessing; you can only achieve multithreading. Multiple threads cannot run Python code concurrently in a single process because of the GIL, so multithreading is only useful when the threads do I/O-heavy work (e.g. talking to the Internet), where they spend a lot of time waiting, rather than CPU-heavy work (e.g. maths), which constantly occupies a core.
So if you have many I/O-heavy tasks running at once, having that many threads is useful, even if it's more than the number of CPU cores. A very large number of threads will eventually have a negative impact on performance, but until you actually measure a problem, don't worry; something like 100 threads should be fine.
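As a rough illustration of the kind of I/O-heavy workload where a large thread pool pays off (the URL list, thread count, and fetch function below are placeholders, not taken from the question):
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen

def fetch(url):
    # the thread blocks here waiting on the network, releasing the GIL
    with urlopen(url, timeout=10) as resp:
        return url, resp.status

urls = ["https://example.com/item/%d" % i for i in range(1000)]
pool = ThreadPool(100)   # far more threads than cores is fine for waiting work
results = pool.map(fetch, urls)
pool.close()
pool.join()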
No, you needn't limit the thread pool size to the number of CPU cores. In a high-throughput I/O situation, you can keep raising the pool size until adding more threads no longer increases I/O throughput, and use that as your limit.
(I found that the maximum number of threads I could set for a thread pool was only around 9000; any higher and Python 3.6 reported an error, which is how I ended up at your question.)

Why can a single Python process's CPU usage be more than 100 percent?

Because of the GIL, I thought a multi-threaded Python process could only have one thread running at a time, so its CPU usage could not exceed 100 percent.
But I found that the code below occupies 950% CPU usage in top.
import threading
import time

def f():
    while 1:
        pass

for i in range(10):
    t = threading.Thread(target=f)
    t.setDaemon(True)
    t.start()

time.sleep(60)
This is not the same question as "Python interpreters uses up to 130% of my CPU. How is that possible?". In that question, the OP said he was doing I/O-intensive load testing, which may release the GIL. But in my program there is no I/O operation.
Tests run on CPython 2.6.6.
I think in this case the CPU is busy doing thread switching instead of actual work. In other words, the thread switching is what keeps all the CPUs busy, while the Python loop body runs too fast to account for observable CPU usage on its own. I tried adding some real calculation as below, and the CPU usage dropped to around 200%. If you add more calculation, I believe the CPU usage will get very close to 100%.
def f():
    x = 1
    while 1:
        y = x * 2
One reason could be the method you're using to arrive at 950%. There's a number called (average) load which is perhaps not what one would expect before reading the documentation.
The load is the (average) number of threads that are either in the running or the runnable state (queued for CPU time). If, as in your example, you have ten busy-looping threads, then while one thread is running the other nine are in the runnable state (queued for a time slot).
The load is an indication of how many cores you could have made use of, or how much CPU power your program wants to use (not necessarily the CPU power it actually gets to use).
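A quick way to relate that load figure to the core count from within Python (a minimal sketch; os.getloadavg is only available on Unix-like systems):
import os
import multiprocessing

load_1m, load_5m, load_15m = os.getloadavg()
cores = multiprocessing.cpu_count()
print("1-minute load %.2f on %d cores (%.0f%% of capacity wanted)"
      % (load_1m, cores, 100.0 * load_1m / cores))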

Python ThreadPool from multiprocessing.pool cannot utilize all CPUs

I have a string-processing job in Python, and I wish to speed it up by using a thread pool. The string-processing tasks have no dependency on each other. The results are stored in a MongoDB database.
I wrote my code as follows:
import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()
I ran the code on a Linux machine with 8 CPU cores, and it turns out that the maximum CPU usage is only around 130% (read from top) after the job has been running for a few minutes.
Is my approach to using a thread pool correct? Is there a better way to do this?
You might check using multiple processes instead of multiple threads. Here is a good comparison of both options. One of the comments there states that Python is not able to use multiple CPUs while working with multiple threads (due to the Global Interpreter Lock). So instead of a thread pool you should use a process pool to take full advantage of your machine, as sketched below.
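A minimal sketch of that switch, reusing the names from the question (string_list, _process) and assuming _process is defined at module level so it can be pickled, with each worker process creating its own MongoDB connection rather than sharing one across the fork:
import multiprocessing

if __name__ == '__main__':
    # one worker process per core; each process has its own interpreter and GIL
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for single_string in string_list:
        pool.apply_async(_process, [single_string])
    pool.close()
    pool.join()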
Perhaps _process isn't CPU bound; it might be slowed down by the file system or network while writing to the database. You could see whether the CPU usage rises if you make the worker truly CPU bound, for example:
def _process(s):
    for i in xrange(100000000):
        j = i * i
