Parallel threading python GIL vs Java

Parallel threading python GIL vs Java - python

I know that python has a GIL that make threads not be able to run at the same time therefore threading is just context switching.
Why is java different?
Threads on the same CPU in every language cannot run parallel.
Is creating new thread in java utilizes cores in multi core machine?
python can only spawn threads on the same CPU, in contrast to java?
If 1. Is the case, when using more threads than CPUs even in java it comes back to context switching again for several of them?
If 1. Is the case then how is it differ from multiprocessing? Because utilizing multiple cores isn't guaranteed?
Isn't the whole point of threading is being able to use the same memory space? If java does run some of them in multiple threads for perallelism, how do they really share memory?
Thank you

Why is java different?
Because it is able to effectively use multiple cores at the same time.
Does creating a new thread in java utilizes cores in multi core machine?
Yes.
Python can only spawn threads on the same CPU, in contrast to Java?
Java can spawn multiple threads which will on different CPUs. Java is not responsible for the actual thread scheduling. That is handled by the OS. And the OS may reschedule a thread to a different CPU to the one that it started on.
I am not sure about the precise details for Python, but I think the GIL is an implementation detail rather than something that it intrinsic to the language itself1. But either way, in a Python implementation, the GIL means that you would get little performance benefit in spawning threads on multiple cores. As this page says:
"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter."
If 1. is the case, when using more threads than CPUs does it come back to context switching in Java?
It depends. When switching a CPU between threads belonging to different processes, a full context switch is involved. But when switching between threads in the same process, only the (user) registers need to be switched. (The virtual memory registers and caches don't need to be switched / flushed because the threads share the same virtual address space.)
If 1. is the case then how is it differ from multiprocessing? Because utilizing multiple cores isn't guaranteed?
The key difference between multi-threading and multi-processing is that processes do not share any memory. By contrast, one thread in a process can see the memory of all of the others ... modulo issues of when changes are visible.
This difference has a variety of consequences.
Isn't the whole point of threading is being able to use the same memory space?
Yes, that is the main point ... when you compare multi-threading with multi-processing.
If Java does run some of them in multiple threads for parallelism ...
Java supports threads for many reasons. Parallelism is only one of those reasons. Others include multiplexing I/O and simplifying certain kinds of programming problem. These other reasons are also relevant to Python.
... how do [Java threads] really share memory?
The hardware deals with the issues of making the physical memory visible to all of the threads, and propagation of changes via the memory caches. It is complicated.
In Java the onus is on the programmer to "do the right thing" when threads make use of shared variables / objects. You need to use volatile variables, or synchronized blocks / methods, or something else that ensures that there is a happens before chain between a write and subsequent read. (Otherwise you can get issues with changes not being visible.)
This transfer of responsibility to the programmer allows the compiler to generate code with fewer main memory operations ... and hence that is faster. The downside is that if an application doesn't obey the rules, it is liable to behave in unexpected ways.
By contrast, in Python the memory model is unspecified, but there is an expectation (by millions of Python programmers) that it will behave in an intuitive fashion; i.e. a shared variable write performed by one thread will immediately be visible to other threads. This is hard to achieve efficiently while also allowing Python threads to run in parallel.
1 - While the GIL is not formally part of the Python spec, the influence of GIL on the (unspecified!) Python memory model and Python programmers assumptions make it more than merely an implementation detail. It remains to be seen if Python can successfully evolve into a language where multi-threading can use multiple cores effectively.

Not a complete answer here, but just adding a couple of things that Stephen C didn't already say:
Python can only spawn threads on the same CPU, in contrast to java?
That would be an optimization, not an essential fact. There's no reason in principle why Python could not simply allow the OS to schedule its threads on whatever CPU happened to be available at any given time.
OTOH, given that no two Python threads can do significant work at the same time, it potentially could improve performance if the threads all had affinity for the same CPU. (See what Stephen C said about "full context switch" vs. "only the (user) registers."
Giving user-mode processes control over processor affinity is a relatively new feature in some operating systems. I have no idea of whether or not any Python version actually uses that feature.
If java does run...multiple threads for parallelism...?
Java doesn't "run multiple threads for parallelism." Your Java program creates multiple threads for whatever reason you happen to want them. Most modern OSs provide threads. Java simply makes that ability available to application programmers in a way that is tightly integrated with the language itself. You are free to use them (or not) however you see fit.

Related

Python multiprocessing and concurency vs paralellism

I have created and application. In this application I use multiprocessing library. In that application I do spin two processes (instances of the same class) to consume data from Kafka and put into Python Queue.
This is the library I used:
Python multiprocessing
Q1. Is it concurrency or is it parallelism?
Q2. Is it multithreading or is it multiprocessing?
Q3. How does Python maps Processes to CPUs? (does this question make sense?)
I understand in order to speak about multithreading I need to use separate / multiple CPUs (so separate threads are mapped to separate CPU threads).
I understand in order to speak about multiprocessing I need to use separate memory space for both processes? Is it correct?
I assume if I spin two processes within one Application instance => we talk about concurrency.
If I spin multiple instances of above application then we would talk about parallelism? (multiple CPUs, separate memory spaces used)?
I see that Python library defines it as follows: Python multiprocessing library
The multiprocessing package offers both local and remote concurrency
...
Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
...
A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).

First, separate threads are not mapped to separate CPU-s. That's optional, and in python due to the GIL, all threads in a process will run on the same CPU
1) It's both concurrency, in that the order of execution is not set, and parallelism, since the multiprocessing package can run on multiple processors, bypassing the GIL limitations.
2) Since the threading package is another story, then it's definitely multiprocessing
3) I may be speaking out of line, but python , IMO does NOT map processes to CPU-s, it leaves this detail to the OS

Q1: It is at least concurrency, can be parallelism as well (terms intended as defined in the answer to this question). Clearly, if you have one processor only, true parallellism cannot be achieved, becuse only one process can use the CPU at a single time. In that case, however, the muliprocessing library still allows you to define multiple tasks, that run in separate processes. It will be the OS's scheduler to decide which process runs when.
Q2: Multiprocessing (...which is kind of implied by the library name). Due to the Global Interpreter Lock present in most Python interpreter implementations, parallelism with threads is impossible. Multiprocessing offers a threading-like interface that makes use of processes under the hood.
Q3: It doesn't. Python spawns processes, the OS scheduler decided who runs where and when. There are some ways to execute processes on specific CPUs, but this is not the default behaviour of multiprocessing (and I'm not aware of any way to force the library to pin processes to CPUs).

Is the max thread limit actually a non-relevant issue for Python / Linux?

The current Python application that I'm working on has a need to utilize 1000+ threads (Pythons threading module). Not that any single thread is working at max cpu cycles, this is just a web server load test app I'm creating. I.E. emulate 200 firefox clients all longing into web server and downloading small web components, basically emulating humans that operate in seconds as opposed to microseconds.
So, I was reading through the various topics such as "how many threads does python support on Linux / windows, etc, and I saw a lot of varied answers. One users said its all about memory and the Linux kernel by default only sets aside 8Meg for threads, if it exceeds that then threads start being killed by the Kernel.
One guy stated this is a non issue for CPython because only 1 thread is running at a time anyway (because of the GIL) so we can specify a gazillion threads??? What's the actual truth on this?

"One thread is running at a time because of the GIL." Well, sort of. The GIL means that only one thread can be executing Python code at a time. However, any number of threads could be doing IO, various other syscalls, or other code that doesn't hold the GIL.
It sounds like your threads will be doing mostly network I/O, and any number of threads can do I/O simultaneously. The GIL competition might be pretty fierce with 1000 threads, but you can always create multiple Python processes and divide the I/O threads between them (i.e., fork a couple times before you start).
"The Linux kernel by default only sets aside 8Meg for threads." I'm not sure where you heard that. Maybe what you actually heard was "On Linux, the default stack size is often 8 MiB," which is true. Each thread will use up 8 MiB of address space for stack (no problem on 64-bit) plus kernel resources for the additional memory maps and the thread process itself. You can change the stack size using the threading.stack_size library function, which helps if you have a lot of threads that don't make deep calls.
>>> import threading
>>> threading.stack_size()
0 # platform default, probably 8 MiB
>>> threading.stack_size(64*1024) # 64 KiB stack size for future threads
Others in this thread have suggested using an asynchronous / nonblocking framework. Well, you can do that. However, on the modern Linux kernel, a multithreaded model is competitive with asynchronous (select/poll/epoll) I/O multiplexing techniques. Rewriting your code to use an asynchronous model is a non-trivial amount of work, so I'd only do it if I couldn't get the required performance from a threaded model. If your threads are really trying to simulate human latency (e.g., spend most of their time sleeping), there are a lot of scenarios in which the asynchronous approach is actually slower. I'm not sure if this applies to Python, where the reduced GIL contention alone might merit the switch.

Both of those are partially true:
Each thread does have a stack, and you can run out of address space for the stack if you create enough threads.
Python also does have something called a GIL, which only allows one Python thread to run at a time. However, once Python code calls into C code, that C code can run while a different Python thread runs. However, threads in Python are still physical, and there is still the stack space limit.
If you're planning on having many connections, rather than using many threads, consider using an asynchronous design. Twisted would probably work well here.

Would limiting a GILed Python program to a single CPU boost performance?

Following up on David Beazley's paper regarding Python and GIL, would it be a good practice to limit a Python program (CPython with GIL and all) to a single CPU in a Windows based multi-core system?
Would it boost performance?
UPDATE: Assume multiple threads are used (not sure if it makes a difference)

The paper does indeed imply that limiting a program to a single core would improve performance in that particular case. However, there are a number of concerns that you would need to deal with:
His tests are mainly for compute intensive threads rather than IO bound threads. If the threads you are using often block voluntarily (such as in a web server waiting for a client) then you don't run into GIL issues at all.
The GIL issues deal specifically with threads and not processes. I may be reading your question wrong, but you seem to be asking about restricting all Python programs to a single core. Programs using processes for parallelism don't suffer from GIL issues and restricting them to a single core will make them slower.
The GIL is drastically different in Python 3.2 (as David mentions in this video. The GIL was changed explicitly to deal with such issues. While it still has problems, it no longer has this problem.
In summary, the only time you want to complicate your life by forcing the OS to restrict the program to a single core is when you are running a:
Multithreaded
Compute Intensive
Lower than Python 3.2
program on a multicore machine.

Bias : For parallel computing involving heavy CPU processing, I much
prefer message passing and cooperating processes to thread programming
(of course, it depends on the problem)
You shouldn't limit your programs to one core. Beazley was just demonstrating a specific problem that performed poorly under those unque conditions (those conditions being an IO bound thread and a CPU bound thread competing against each other). Ideally you want to avoid those conditions by using a different method (import multiprocessing).
I think the best solution is to put your CPU bound tasks in other processes using the multiprocessing module so that they utilize their own cores, and IO bound tasks in threads (or microthreads/coroutines, if you read his interesting paper on that: http://www.dabeaz.com/coroutines/) since the GIL is best suited for those types of tasks.
Conclusion: Python threads are best suited for IO bound tasks, NOT CPU bound.

Python Global Interpreter Lock (GIL) workaround on multi-core systems using taskset on Linux?

So I just finished watching this talk on the Python Global Interpreter Lock (GIL) http://blip.tv/file/2232410.
The gist of it is that the GIL is a pretty good design for single core systems (Python essentially leaves the thread handling/scheduling up to the operating system). But that this can seriously backfire on multi-core systems and you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.
So since the GIL limits us to basically executing a Python program on one CPU my thought is why not accept this and simply use taskset on Linux to set the affinity of the program to a certain core/cpu on the system (especially in a situation with multiple Python apps running on a multi-core system)?
So ultimately my question is this: has anyone tried using taskset on Linux with Python applications (especially when running multiple applications on a Linux system so that multiple cores can be used with one or two Python applications bound to a specific core) and if so what were the results? is it worth doing? Does it make things worse for certain workloads? I plan to do this and test it out (basically see if the program takes more or less time to run) but would love to hear from others as to your experiences.
Addition: David Beazley (the guy giving the talk in the linked video) pointed out that some C/C++ extensions manually release the GIL lock and if these extensions are optimized for multi-core (i.e. scientific or numeric data analysis/etc.) then rather than getting the benefits of multi-core for number crunching the extension would be effectively crippled in that it is limited to a single core (thus potentially slowing your program down significantly). On the other hand if you aren't using extensions such as this
The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box since a thread fires off an HTTP request and then since it's waiting on I/O gives up the GIL and another thread can do it's thing, so that part of the program can easily run 100+ threads without hurting the CPU much and let me actually use the network bandwidth that is available. As for stackless Python/etc I'm not overly interested in rewriting the program or replacing my Python stack (availability would also be a concern).
[*] Only the main thread can receive signals so if you send a ctrl-C the Python interpreter basically tries to get the main thread to run so it can handle the signal, but since it doesn't directly control which thread is run (this is left to the operating system) it basically tells the OS to keep switching threads until it eventually hits the main thread (which if you are unlucky may take a while).

Another solution is:
http://docs.python.org/library/multiprocessing.html
Note 1: This is not a limitation of the Python language, but of CPython implementation.
Note 2: With regard to affinity, your OS shouldn't have a problem doing that itself.

I have never heard of anyone using taskset for a performance gain with Python. Doesn't mean it can't happen in your case, but definitely publish your results so others can critique your benchmarking methods and provide validation.
Personally though, I would decouple your I/O threads from the CPU bound threads using a message queue. That way your front end is now completely network I/O bound (some with HTTP interface, some with message queue interface) and ideal for your threading situation. Then the CPU intense processes can either use multiprocessing or just be individual processes waiting for work to arrive on the message queue.
In the longer term you might also want to consider replacing your threaded I/O front-end with Twisted or some thing like eventlets because, even if they won't help performance they should improve scalability. Your back-end is now already scalable because you can run your message queue over any number of machines+cpus as needed.

An interesting solution is the experiment reported by Ryan Kelly on his blog: http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2/
The results seems very satisfactory.

I've found the following rule of thumb sufficient over the years: If the workers are dependent on some shared state, I use one multiprocessing process per core (CPU bound), and per core a fix pool of worker threads (I/O bound). The OS will take care of assigining the different Python processes to the cores.

The Python GIL is per Python interpreter. That means the only to avoid problems with it while doing multiprocessing is simply starting multiple interpreters (i.e. using seperate processes instead of threads for concurrency) and then using some other IPC primitive for communication between the processes (such as sockets). That being said, the GIL is not a problem when using threads with blocking I/O calls.
The main problem of the GIL as mentioned earlier is that you can't execute 2 different python code threads at the same time. A thread blocking on a blocking I/O call is blocked and hence not executin python code. This means it is not blocking the GIL. If you have two CPU intensive tasks in seperate python threads, that's where the GIL kills multi-processing in Python (only the CPython implementation, as pointed out earlier). Because the GIL stops CPU #1 from executing a python thread while CPU #0 is busy executing the other python thread.

Until such time as the GIL is removed from Python, co-routines may be used in place of threads. I have it on good authority that this strategy has been implemented by two successful start-ups, using greenlets in at least one case.

This is a pretty old question but since everytime I search about information related to python and performance on multi-core systems this post is always on the result list, I would not let this past before me an do not share my thoughts.
You can use the multiprocessing module that rather than create threads for each task, it creates another process of cpython compier interpreting your code.
It would make your application to take advantage of multicore systems.
The only problem that I see on this approach is that you will have a considerable overhead by creating an entire new process stack on memory. (http://en.wikipedia.org/wiki/Thread_(computing)#How_threads_differ_from_processes)
Python Multiprocessing module:
http://docs.python.org/dev/library/multiprocessing.html
"The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box..."
About this, I guess that you can have also a pool of process too: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers
Att,
Leo

Stackless python and multicores?

So, I'm toying around with Stackless Python and a question popped up in my head, maybe this is "assumed" or "common" knowledge, but I couldn't find it actually written anywhere on the stackless site.
Does Stackless Python take advantage of multicore CPUs? In normal Python you have the GIL being constantly present and to make (true) use of multiple cores you need to use several processes, is this true for Stackless also?

Stackless python does not make use of any kind of multi-core environment it runs on.
This is a common misconception about Stackless, as it allows the programmer to take advantage of thread-based programming. For many people these two are closely intertwined, but are, in fact two separate things.
Internally Stackless uses a round-robin scheduler to schedule every tasklet (micro threads), but no tasklet can be run concurrent with another one. This means that if one tasklet is busy, the others must wait until that tasklet relinquishes control. By default the scheduler will not stop a tasklet and give processor time to another. It is the tasklet's responsibility to schedule itself back in the end of the schedule queue using Stackless.schedule(), or by finishing its calculations.
all tasklets are thus executed in a sequential manner, even when multiplpe cores are available.
The reason why Stackless does not have multi-core support is because this makes threads a whole lot easier. And this is just what stackless is all about:
from the official stackless website
Stackless Python is an enhanced
version of the Python programming
language. It allows programmers to
reap the benefits of thread-based
programming without the performance
and complexity problems associated
with conventional threads. The
microthreads that Stackless adds to
Python are a cheap and lightweight
convenience which can if used
properly, give the following benefits:
Improved program structure.
More readable code.
Increased programmer productivity.
Here is a link to some more information about multiple cores and stackless.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.