Threading in python to build wordcloud

Threading in python to build wordcloud - python

I am building a simple word frequency counter application in python. The document has close to 1.6 million words. I divide the work (sentences) equally among threads. Ideally one would except the running time to decrease as number of threads increase (till some threshold) but this doesn't seems to be the case in my testing. Single thread is considerably faster than multithreaded implementation every time for some reason. I was earlier using locks to write into a global hash table but I found that to be way inefficient for multithreaded environment (at least 2-3 times slower). Then I started writing into individual hash tables for each thread and merging them in the end. The single threaded approach now takes time ~4.5 seconds but the multithreaded way is at least a second or two slower. Any thoughts on what I am doing wrong maybe?

The reason is called Global Interpreter Lock. This mechanism makes it possible for only one thread to be executed at any given time.
You may notice another anomaly, the more cores you have, slower you code will run. If all threads are on the single core, OS can schedule their execution without contention for GIL, however if threads are split between multiple cores than there will be contention for the GIL which can be very noticeable.
If you wish to do parallel processing, than you should consider approach with multiple processes, not threads, that is the preferred approach in python.
You could still use threads for blocking IO operations, although it's better to write non-blocking code in this scenario, twisted is popular framework for this.

Related

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128 CPU ec2 instance. I need to run a program that accepts a 10 million row csv, and sends a request to a DB for each row in that csv to augment the existing data in the csv. In order to speed this up, I use:
executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk up the 10 million row csv into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so total 129). Each process takes a chunk, and retrieves the records for its chunk and spits the output into a file. At the end of the process, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!

Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example if you are cpu limited, and the algorithm can't be made more efficient, that's probably a hard limit. If you're storage bandwidth limited, and you're already using efficient read/write caching (typically handled by the OS or by low level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming python is running on a single physical machine, and you mean individual cores of one cpu) Yes, python's mp.Process creates a new OS level process with a single thread which is then assigned to execute for a given amount of time on a physical core by the OS's scheduler. Scheduling algorithms are typically quite good, so if you have an equal number of busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. Python is not thread safe, so it must only allow a single thread per process run at a time. There are specific exceptions to this when a function is written in c or c++, and calls the python macro: Py_BEGIN_ALLOW_THREADS though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently, and will have less overhead compared to processes. Threads also share memory, making passing results back after completion easier (threads can simply modify some global state rather than passing results via a queue or similar).
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores.. The OS scheduler decides which threads should run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example running 128 processes each with 8 threads would result in 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to be profiling. Profiling for parallel processing is more challenging, and profiling for a remote / virtualized computer can sometimes be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database you're using, because most database software has had a great deal of work put into optimization, but you must use it in the correct way to get the most speed out of it. Batched requests come to mind rather than accessing a single row at a time.

Thread are not happening at the same time?

I have a program that fetches data via an API. I created a function that only takes the target data as an argument and with a for-loop I run this method 10 times.
The programm takes quite some time to display the data because the next function call only happens when the function before has done its work.
I want to use Threads to make it all happen quicker. However, I'm confused. On realpython.org I read this:
A thread is a separate flow of execution. This means that your program will have two things happening at once. But for most Python 3 implementations the different threads do not actually execute at the same time: they merely appear to. It’s tempting to think of threading as having two (or more) different processors running on your program, each one doing an independent task at the same time. That’s almost right. The threads may be running on different processors, but they will only be running one at a time.
First they say: "This means that your program will have two things happening at once" and then they say "but they will only be running one at a time". So my threads are not done simultaneously?
I want to make a decision on whether to use Threads or Multiprocessing but I can't figure it out.
Can somebody help?

With both Threads or Multiprocessing you must assume that execution of your program could jump from one thread/process to another randomly. The difference is that with Threads, code is never really executed at the same time. That means there is always only one CPU core doing your work. With Multiprocessing, your code runs on multiple cores at the same time. So only Multiprocessing will solve your computation N times faster with N processes. (There will be some overhead of course.) If you are not doing any heavy computation, but need to create the illusion of things running in parallel, use threads. This is especially useful for GUIs.
The confusing part is that IO (copying files or loading something from the web for example) is not CPU bound, as it does not require a lot of CPU instructions to happen. So always use threads for this. To understand it a bit more, you should realise that when a thread is waiting for an IO operation to finish, it is actually in a blocked state. This allows other threads to run. So if you use threads to fetch data the first thread will start loading it and then block. This makes room for the the second thread to do the same and so on. When one of the threads has the data ready, it will unblock, run the rest of its code and finish.
(Note that when multiple threads are running they can pause randomly and give room for other threads to run for a while and then carry on. (See first sentence of this answer.))
Generally always use threads unless you need to do something CPU heavy in parallel. Multiprocessing has a lot of limitations when it comes to how it works internally and using it is more complicated and heavy.
This only applies to some implementations of Python tough, for example the most commonly used "official" implementation, CPython. In other languages or less common Python implementations threads are often able to execute instructions on multiple cores at the same time.

Why python does (not) use more CPUs? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Does Python support multithreading? Can it speed up execution time?

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

How does threading or Multiprocessing work with recursions?

Background
I'm a bit new to developing and had a general python/programming question. If you have a method that is a recursion, what is involved to enabling multiple threads or multiprocessing? I've done some light reading and a few examples but they seem to be applying the syntax for new code(and not very cpu intensive tasks), I'm more wondering how do I re-design existing code to do this?
Say I have something thats cpu intensive(basically keeps adding to itself until limit is hit):
def adderExample(sum, number):
if sum > 1000:
print 'sum is larger than 10. Stoping'
else:
sum = sum + number
print sum
number = number + 1
adderExample(sum, number)
adderExample(0,0)
Question(s)/Though process
How would I approach this to make it run faster assuming I have multiple cores available(I want it to eventually want it span machines but I think thats a sperate issue with hadoop so I'll keep this example to only one system with multiple cpu's)? It seems threading it isn't the best choice(because of the time it takes to spawn new threads), if thats true should I only focus on multiprocessing? If so, can recursions be split to different cpu's(vai queues I assume and then rejoin after its done)? Can I create multiple threads for each process than split those processes over multiple cpu's? Lastly, is recursion depth limits an overall limit or is it based on threads/proceses, if so does multiprocessing/threading get around it?
Another question(related) how do those guys trying to codes(rsa, wireless keys,etc) via brute force overcome this problem? I assume they are scaling their mathematical processes over multiple cpu somehow. This or any example to build my understanding would be great.
Any tips/suggestions would be great
Thanks!

Such a loop wouldn't benefit much at all from threading. Consider that you're doing a series of additions, whose intermediate values depend on the previous iterations. This can't be parallelized, because the threads would be stomping on each other's values and overwriting things. You can lock the data so only one thread works on it at a time, but then you lose any benefit of having multiple threads working on that data.
Threads work best when they have independent data sets. e.g. a graphics renderer is a perfect example. Each thread renders a subset of the larger image - they may share common data sources for texture/vertex/color/etc... data, but each thread has its own little section of the total image to work one, and doesn't touch other areas of the image. Whatever thread #1 does on its little section of pixels won't affect what thread #2 is doing elsewhere in the image.
For your related question, password cracking is another example where threading/multiprocessing makes sense. Each thread goes off on its own testing multiple possible passwords against one common "to be cracked" list. What one thread is doing doesn't affect any of the other cracker threads, unless you get a match, which may mean all threads abort since the job is "done".
Once threads become interdependent on each other, you lose a lot of the benefits of having multiple threads. They'll spend more time waiting for the other to finish than they'll spend on doing actual work. Of course, this doesn't say you should never use threads. Sometimes it does makes sense to have multiple threads, even if they are interdependent. E.g. a graphics thread + sound effects thread + action processor thread + A.I. calculations thread, etc... in a game. each one is nominally dependent on each other, but while the sound thread is busy generating the bang+ricochet audio for the gun the player just shot, the a.i. thread is off calculating what the game's mobs are doing, the graphics thread is drawing some clouds in the background, etc...

Threading kinda sorta implies multiple stacks, recursion single stacks. That said, if you get to the recurse-left, recurse-right part and decide to spawn threads for the sub-problems if the current count of threads is "low" and do straight recursion otherwise you can combine the concepts.
But regular Python is not a good language for this pattern. Python threads all run on the same interpreter hardware thread, so you won't actually pick up any multiprocessing goodness.

Phunctor is correct that the threading library is a poor choice for parallelizing this type of problem, due to the "Global Interpreter Lock" that prevents multiple threads from executing Python code in parallel.
Where the threading library can be highly useful, though, is when each thread's code spends a lot of time waiting for I/O to happen. So, for example, if you're implementing a server that has to hit the disk or wait on a network response, servicing a request in each thread can be very efficient, since the threading library can favor the ones that are not waiting on I/O and thus maximize use of the Python interpreter. (In a single thread, you'd have to use a tight loop checking the statuses of your I/O requests, which would tend to be wasteful as load got high.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.