Optimizing number of threads in Python

I am writing a program that analyses csv files in a directory, initially one file at a time. This could be several hundred files, but all of them are relatively small. My main runtime limitation was I/O, so I turned to multithreading using the threading library, which is a first for me.
I created a thread for each function call, following this guide, where each function call opens a csv in the desired directory. As a result, I have a list of threads, one per file (i.e. hundreds of threads). However, my program still ran slowly, with the bulk of its time spent in the 'acquire' method of '_thread.lock' objects, according to cProfile. I believe that this is because the large number of threads results in lots of threads waiting for others to finish their tasks - is this correct?
How would you recommend I resolve this? My current idea is to split my list of files into equally sized chunks and to assign a thread to each chunk, rather than a thread to each file, and for each thread to iterate through the files in each chunk.

Python has something called the Global Interpreter Lock, which seriously hurts your performance with that many threads, as each one is waiting to hold the "interpreter lock." I would recommend using Processes, which, if I remember correctly, are similar to Python thread objects in their use but do not suffer the same performance penalty of waiting for a lock. A thread and a process are different, but for your application it sounds like it should not matter.
It is worth noting that the GIL can be released when performing I/O such as reading from a file, and therefore using threads might be fine - you just need to use fewer of them. In fact, with the number of threads/processes you are looking to create it might be a better idea to use a fixed pool of workers.
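For example, here is a minimal sketch using a fixed pool of worker threads; the pool size, the file list, and the body of process_file are placeholders rather than anything from your program:

from concurrent.futures import ThreadPoolExecutor
import csv

def process_file(path):
    # Placeholder analysis: count the rows in one CSV file.
    with open(path, newline="") as f:
        return path, sum(1 for _ in csv.reader(f))

csv_paths = ["a.csv", "b.csv"]                    # hypothetical file list
with ThreadPoolExecutor(max_workers=8) as pool:   # small, fixed number of workers
    for path, n_rows in pool.map(process_file, csv_paths):
        print(path, n_rows)

With only eight files in flight at once, the workers spend far less time contending for locks than hundreds of threads would.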

Related

Fast IO operations in a separate thread

I have a set of instruments my program communicates with and I'd like to put the communication in a separate thread.
The IO is pretty slow (~100 ms per item per instrument), and I need to record the resulting values from them in a shared array (of the last N values) and save them to a file, with repeated measurements taken as quickly as possible. Some instruments are slow to 'formulate' a response, so some readings could be done concurrently, but the readings need to be approximately synchronised (i.e. 1 timestamp per row of readings).
I'd like this all to be done in a separate thread (or threads) so the timing can't be interrupted by computation etc. happening in the main thread, but the main thread should be able to access the array.
Ideally I should be able to run some daq.start() and it gets going without further interaction.
What's the 'pythonic' way to do this? I've been reading about asyncio, threading, and multiprocessing and it's unclear to me which is appropriate.
In C++ I would start thread1, which would just record measurements sequentially into a cache array. thread2 would flush that cache into the main shared array whenever it could get a lock, and at the same time write it to the output file. By keeping track of the index ranges being read, lock conflicts would be rare (and, importantly, wouldn't interrupt the DAQ when they happen).
The correct answer here is threads; this is not an opinion.
Communicating with hardware is done using drivers implemented in DLLs, and when Python calls into a DLL it drops the GIL, so threads can execute in the background with as little overhead as possible on the Python interpreter itself.
Proper synchronization is still needed, including thread locks if the threads themselves write to the file; but writing to files also drops the GIL, so the threads will still have little to no overhead on your Python interpreter.
Neither of the above is the case for asyncio, which is designed for asynchronous networking, not hardware.
For the implementation, a thread pool is usually the most Pythonic way to go about this: you just spawn as many workers as the number of instruments you connect to and let them do their work.
Since you are not using any asyncio features, you should use multiprocessing.pool.ThreadPool with apply_async or imap_unordered, plus the locks from the threading module for writing to disk; there is also a Barrier if you want to synchronize each frame across all threads.
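A rough sketch of that layout, with a dummy read_instrument function standing in for the real driver calls; the output file name and instrument ids are placeholders:

from multiprocessing.pool import ThreadPool
import threading
import time

write_lock = threading.Lock()            # serialises writes to the output file

def read_instrument(instrument_id):
    time.sleep(0.1)                      # stand-in for a ~100 ms driver call
    value = 42.0                         # dummy reading
    with write_lock:
        with open("readings.csv", "a") as f:
            f.write(f"{time.time()},{instrument_id},{value}\n")
    return instrument_id, value

instruments = [0, 1, 2, 3]               # one worker per instrument
with ThreadPool(len(instruments)) as pool:
    async_results = [pool.apply_async(read_instrument, (i,)) for i in instruments]
    readings = [res.get() for res in async_results]

Each worker holds the lock only while appending its line, so the slow driver calls still overlap.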

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128 CPU ec2 instance. I need to run a program that accepts a 10 million row csv, and sends a request to a DB for each row in that csv to augment the existing data in the csv. In order to speed this up, I use:
import concurrent.futures

executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk up the 10 million row csv into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so total 129). Each process takes a chunk, and retrieves the records for its chunk and spits the output into a file. At the end of the process, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!
Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example, if you are CPU limited and the algorithm can't be made more efficient, that's probably a hard limit. If you're limited by storage bandwidth and you're already using efficient read/write caching (typically handled by the OS or by low-level drivers), that's probably a hard limit too.
Are all cores of the machine actually used?
(Assuming Python is running on a single physical machine, and you mean individual cores of one CPU.) Yes, Python's mp.Process creates a new OS-level process with a single thread, which is then assigned to execute for a given amount of time on a physical core by the OS's scheduler. Scheduling algorithms are typically quite good, so if you have as many busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. Python is not thread safe, so it must allow only a single thread per process to run at a time. There are specific exceptions to this when a function is written in C or C++ and calls the Python macro Py_BEGIN_ALLOW_THREADS, though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently and will have less overhead compared to processes. Threads also share memory, making it easier to pass results back after completion (threads can simply modify some global state rather than passing results via a queue or similar).
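As a small illustration of that shared-memory convenience (the worker computation below is a placeholder):

import threading

results = []                       # shared state, visible to every thread
lock = threading.Lock()

def worker(n):
    value = n * n                  # placeholder computation
    with lock:
        results.append(value)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)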
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores. The OS scheduler decides which threads should run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example, running 128 processes each with 8 threads would result in 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to be profiling. Profiling for parallel processing is more challenging, and profiling for a remote / virtualized computer can sometimes be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database you're using, because most database software has had a great deal of work put into optimization, but you must use it in the correct way to get the most speed out of it. Batched requests come to mind rather than accessing a single row at a time.
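As an illustration of batching, here is a sketch that groups rows and issues one query per batch instead of one per row; the connection object, table, and column names are hypothetical DB-API-style placeholders:

from itertools import islice

def batched(rows, size):
    # Yield lists of up to `size` rows at a time.
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

def augment(conn, rows):
    # One round trip per 1000 ids instead of one per row.
    results = []
    cur = conn.cursor()
    for batch in batched(rows, 1000):
        ids = [row[0] for row in batch]
        placeholders = ",".join(["%s"] * len(ids))
        cur.execute(f"SELECT id, extra FROM lookup WHERE id IN ({placeholders})", ids)
        results.extend(cur.fetchall())
    return results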

Threads are not happening at the same time?

I have a program that fetches data via an API. I created a function that takes only the target data as an argument, and with a for-loop I run this function 10 times.
The program takes quite some time to display the data because each function call only starts once the previous one has finished.
I want to use Threads to make it all happen quicker. However, I'm confused. On realpython.org I read this:
A thread is a separate flow of execution. This means that your program will have two things happening at once. But for most Python 3 implementations the different threads do not actually execute at the same time: they merely appear to. It’s tempting to think of threading as having two (or more) different processors running on your program, each one doing an independent task at the same time. That’s almost right. The threads may be running on different processors, but they will only be running one at a time.
First they say: "This means that your program will have two things happening at once" and then they say "but they will only be running one at a time". So my threads are not done simultaneously?
I want to make a decision on whether to use Threads or Multiprocessing but I can't figure it out.
Can somebody help?
With both Threads and Multiprocessing you must assume that execution of your program could jump from one thread/process to another at any time. The difference is that with Threads, code is never really executed at the same time, which means there is always only one CPU core doing your work. With Multiprocessing, your code runs on multiple cores at the same time. So only Multiprocessing can make your computation roughly N times faster with N processes. (There will be some overhead, of course.) If you are not doing any heavy computation but need to create the illusion of things running in parallel, use threads. This is especially useful for GUIs.
The confusing part is that IO (copying files or loading something from the web, for example) is not CPU bound, as it does not require a lot of CPU instructions to happen. So always use threads for this. To understand it a bit more, you should realise that when a thread is waiting for an IO operation to finish, it is actually in a blocked state. This allows other threads to run. So if you use threads to fetch data, the first thread will start loading it and then block. This makes room for the second thread to do the same, and so on. When one of the threads has the data ready, it will unblock, run the rest of its code and finish.
(Note that when multiple threads are running they can pause randomly and give room for other threads to run for a while and then carry on. (See first sentence of this answer.))
Generally always use threads unless you need to do something CPU heavy in parallel. Multiprocessing has a lot of limitations when it comes to how it works internally and using it is more complicated and heavy.
This only applies to some implementations of Python though, for example the most commonly used "official" implementation, CPython. In other languages or less common Python implementations, threads are often able to execute instructions on multiple cores at the same time.
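To make that concrete, here is a small sketch that fetches several URLs with a pool of threads; the URLs stand in for your API calls:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # Blocks on network I/O; other threads run while this one waits.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

urls = [f"https://example.com/api/item/{i}" for i in range(10)]   # placeholder URLs
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))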

Python: Interruptable threading in wx

My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
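Here is a minimal sketch of that pattern. It uses threading.Event as the stop flag, which is a common variant of the plain variable plus Condition described above; the paths and the generation step are placeholders:

import threading

stop_flag = threading.Event()

def generate_thumbnails(paths):
    for path in paths:
        if stop_flag.is_set():     # checked once per thumbnail
            return
        # ... generate and cache the thumbnail for `path` here ...

paths = ["a.jpg", "b.jpg"]         # hypothetical folder contents
worker = threading.Thread(target=generate_thumbnails, args=(paths,), daemon=True)
worker.start()
# Later, e.g. when the user switches folders:
stop_flag.set()
worker.join()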
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
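For instance, here is a sketch of one such mechanism, handing a Manager-backed Event to the workers through the pool initializer; make_thumbnail is just a stand-in:

import multiprocessing as mp

_stop = None

def _init_worker(stop_event):
    global _stop                   # each worker keeps a handle to the shared event
    _stop = stop_event

def make_thumbnail(path):
    if _stop.is_set():             # cooperative cancellation
        return None
    # ... load the image, scale it, and write the thumbnail here ...
    return path

if __name__ == "__main__":
    manager = mp.Manager()
    stop_event = manager.Event()   # shared across processes
    with mp.Pool(initializer=_init_worker, initargs=(stop_event,)) as pool:
        done = pool.map(make_thumbnail, ["a.jpg", "b.jpg"])   # placeholder paths
    # Calling stop_event.set() from the GUI asks the workers to skip remaining files.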
You can actually just kill the subprocesses. However, you may not want to do this. First, unless you've written your code carefully, it may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, if you want this to be efficient on Windows, creating the subprocesses takes some time (not as in "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time a user clicks a new folder), so you probably want to create the pool before you need it, and keep it for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.

python program choice

My program is ICAPServer (similar to an HTTP server); its main job is to receive data from clients and save the data to a DB.
There are two main steps and two threads:
ICAPServer receives data from clients and puts the data in a queue (50 KB, <1 ms);
another thread pops data from the queue and writes it to the DB.
So, if the second step is too slow, the queue will fill up memory with that data.
Wondering if anyone has any suggestions...
It is hard to say for sure, but perhaps using two processes instead of threads will help in this situation. Since Python has the Global Interpreter Lock (GIL), only one thread can execute Python instructions at any time.
Having a system designed around processes might have the following advantages:
Higher concurrency, especially on multiprocessor machines
Greater throughput, since you can probably spawn multiple queue consumers / DB writer processes to spread out the work. Although, the impact of this might be minimal if it is really the DB that is the bottleneck and not the process writing to the DB.
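A rough sketch of that layout, with a bounded multiprocessing.Queue feeding several DB-writer processes; the writer body and the payloads are placeholders:

import multiprocessing as mp

def db_writer(q):
    while True:
        item = q.get()
        if item is None:           # sentinel: shut down
            break
        # ... write `item` to the DB here ...

if __name__ == "__main__":
    q = mp.Queue(maxsize=1000)     # bounded, so producers block instead of filling RAM
    writers = [mp.Process(target=db_writer, args=(q,)) for _ in range(4)]
    for w in writers:
        w.start()
    for payload in (b"request-1", b"request-2"):   # stand-in for client data
        q.put(payload)
    for _ in writers:
        q.put(None)                # one sentinel per writer
    for w in writers:
        w.join()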
One note: before going for optimizations, it is very important to get some good measurements and profiling.
That said, I would bet the slow part in the second step is database communication; you could try to analyze the SQL statement and its execution plan, and then optimize it (this is one of the features of SQLAlchemy); if it is still too slow, look into database optimizations.
Of course, it is possible the bottleneck is in a completely different place; in that case, you still have a chance to optimize using C code, a dedicated network, or more threads - just to give three examples of completely different kinds of optimizations.
Another point: as I/O operations usually release the GIL, you could also try to improve performance just by adding another reader thread - and I think this could be a much cheaper solution.
Put an upper limit on the amount of data in the queue?
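For example, a bounded queue.Queue makes put() block once the limit is reached, which applies back-pressure to the receiving thread instead of filling memory; the limit and the payloads below are placeholders:

import queue
import threading

data_queue = queue.Queue(maxsize=1000)   # limit chosen only for illustration

def db_writer():
    while True:
        item = data_queue.get()
        if item is None:                 # sentinel: shut down
            break
        # ... write `item` to the DB here ...
        data_queue.task_done()

writer = threading.Thread(target=db_writer, daemon=True)
writer.start()

for payload in (b"request-1", b"request-2"):   # stand-in for client data
    data_queue.put(payload)                    # blocks if the writer falls behind

data_queue.put(None)                           # ask the writer to stop
writer.join()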
