I am running a program which attempts to open 1000 threads using Python's ThreadPoolExecutor, which I have configured to allow a maximum of 1000 threads. On a Windows machine with 4GB of memory, I am able to start ~870 threads before I get a RuntimeError: can't start new thread. With 16GB of memory, I am also only able to start ~870 threads, though the same error occurs two minutes later. All threads run a while loop, which means they will never complete their tasks. This is the intention.
Why is PyCharm/Windows/Python, whichever may be the culprit, failing to start more than 870 of the 1000 threads I am attempting to start, with that number staying the same despite a significant change in RAM? This leads me to conclude that hardware limitations are not the problem, which leaves me completely and utterly confused.
What could be causing this, and how do I fix it?
It is very hard to say without all the details of your configuration and your code, but my guess is that it's Windows being starved for certain kinds of memory. I suggest looking into the details in this article:
I attempted to duplicate your issue with PyCharm and Python 3.8 on my Linux box, and I was able to make 10000 threads with the code below. Note that I have every thread sleep for quite a while upon creation; otherwise the thread-creation process slows way down, as the main thread of execution, which is trying to make the threads, becomes CPU-starved. I have 32GB of RAM, but I am able to make 10000 threads with a ThreadPoolExecutor on Linux.
from concurrent.futures import ThreadPoolExecutor
import time

def runForever():
    # Sleep first so the main thread is not starved while it is still
    # submitting work and spawning the remaining worker threads.
    time.sleep(10)
    while True:
        for i in range(100):
            a = 10

t = ThreadPoolExecutor(max_workers=10000)
for i in range(10000):
    t.submit(runForever)
    print(len(t._threads))  # running count of worker threads created so far
print(len(t._threads))  # final count
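If the hard ~870 ceiling on Windows is address space rather than RAM (which would fit a 32-bit Python, where each thread reserves stack address space no matter how much physical memory is installed), one thing worth trying is shrinking the per-thread stack before the executor starts any workers. This is only a guess, sketched with threading.stack_size:

import threading
from concurrent.futures import ThreadPoolExecutor

# Guess: if the ceiling is per-thread stack address space, reserving a
# smaller stack may raise it. Must be set before any worker thread starts.
threading.stack_size(256 * 1024)  # 256 KiB instead of the platform default

t = ThreadPoolExecutor(max_workers=1000)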
I got the following code, and when I run it, one thread prints done after 5 seconds and the other after 10 seconds, not at 15. Logically speaking, this means they both run at the same time, yet everyone says Python threading is not truly parallel. Can someone shed some light on what's happening in the background, please?
import threading
import time

def dummy(param):
    # Each thread sleeps for `param` seconds, then prints.
    time.sleep(param)
    print('done')

param1 = 10
param2 = 5

thread1 = threading.Thread(target=dummy, args=(param1,))
thread2 = threading.Thread(target=dummy, args=(param2,))

thread1.start()
thread2.start()

thread1.join()
thread2.join()
I don't think this is a good test. Sleeping gives up the CPU as far as I know, so the first sleep releases the CPU to the second thread, which then starts sleeping as well. You aren't doing work on multiple threads at once; both threads are sleeping, not running and doing work.
People say that you can't use multithreading to run code in parallel because in CPython, the Global Interpreter Lock (GIL) prevents multiple bytecode instructions from running simultaneously in different threads in the same process. That means that two threads can't do work at precisely the same time.
You can however have I/O tasks running in parallel, since, for example, waiting for a socket to return data doesn't require heavy work on the CPU. I believe for the purposes of the explanation here, sleeping the thread can be thought of as closer to waiting on long-running I/O than having the CPU do work. That means that yes, the two sleeps can happen in parallel.
Carcigenicate gave a good answer covering much of the 'what is happening in the background' asked in the question. I'll try to open it up a bit more.
Threads are started in the background of your main execution, regardless of whether there are multiple cores and whether the GIL is active.
Your thread.start() calls return immediately and run one right after the other, practically at the same time in your example. So after 10 seconds both are done. Threads always work like that.
If there is only one core, the operating system gives each thread a time slice almost all the time, maybe every millisecond or so. If you use Python with the GIL (the default official build from python.org, called CPython), multiple cores are not used at the exact same time for Python code that holds the lock. The lock can be released in C code, and AFAIK libraries for reading from disk or the network, for example, do that. For your Python code, it may run one line from one thread, then one from the other, but it's still practically simultaneous on your gigahertz-range processor.
Now, if you want to test the performance benefit of running multiple threads, for example a worker thread per core, you must test with a function that does some work, even if it just counts numbers. Then, if you run many in parallel vs. sequentially, you'll see differences depending on the number of cores and on whether the GIL is there or not. I thought PyPy doesn't have a GIL, but apparently it does: https://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-gil-why . IronPython and Jython do not; I'd test IronPython for a GIL-free Python: https://ironpython.net/
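As a rough sketch of the kind of test meant here (the work function and counts are arbitrary placeholders, not anything from the question): on CPython, the threaded run should take about as long as the sequential one, while a GIL-free implementation can spread the four counters over four cores.

import threading
import time

def count(n=5_000_000):
    # Pure-Python CPU work; under the GIL, threads cannot overlap on this.
    total = 0
    for i in range(n):
        total += i
    return total

def run_sequential():
    for _ in range(4):
        count()

def run_threaded():
    threads = [threading.Thread(target=count) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

for label, fn in (("sequential", run_sequential), ("threaded", run_threaded)):
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start)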
I have two programs, one written in C and one written in Python. I want to pass a few arguments to the C program from Python and do it many times in parallel, because I have about 1 million such C calls.
Essentially I did it like this:
from subprocess import check_call
import multiprocessing as mp
from itertools import combinations

def run_parallel(f1, f2):
    check_call(f"./c_compiled {f1} {f2} &", cwd='.', shell=True)

if __name__ == '__main__':
    pairs = combinations(fns, 2)  # fns: the input files (definition not shown)
    pool = mp.Pool(processes=32)
    pool.starmap(run_parallel, pairs)
    pool.close()
However, sometimes I get the following error (though the main process is still running):
/bin/sh: fork: retry: No child processes
Moreover, sometimes the whole Python program fails with:
BlockingIOError: [Errno 11] Resource temporarily unavailable
I found that while it's still running, a lot of processes are spawned for my user (around 500), while I have at most 512 available.
This does not happen all the time (it depends on the arguments), but it does happen often. How can I avoid these problems?
I'd wager you're running up against a process/file descriptor/... limit there.
You can "save" one process per invocation by not using shell=True:
check_call(["./c_compiled", f1, f2], cwd='.')
But it'd be better still to make that C code callable from Python instead of creating processes to do so. By far the easiest way to interface "random" C code with Python is Cython.
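For a flavor of what the in-process route looks like, here is a minimal sketch using ctypes from the standard library rather than Cython. It assumes the C code has been rebuilt as a shared library exposing a hypothetical entry point run_pair; both the library name and the function are illustrative, not from the question.

import ctypes

# Assumption: the C program was rebuilt as a shared library, e.g.
#   gcc -shared -fPIC -o c_compiled.so ...
# exposing a hypothetical function: int run_pair(const char *f1, const char *f2);
lib = ctypes.CDLL("./c_compiled.so")
lib.run_pair.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.run_pair.restype = ctypes.c_int

def run_parallel(f1, f2):
    # Runs inside the current process: no fork, no shell, no extra PID.
    return lib.run_pair(f1.encode(), f2.encode())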
"many times in parallel" you can certainly do, for reasonable values of "many", but "about 1 million of such C calls" all running at the same time on the same individual machine is almost surely out of the question.
You can lighten the load by running the jobs without interposing a shell, as discussed in AKX's answer, but that's not enough to bring your objective into range. Better would be to queue up the jobs so as to run only a few at a time: once you reach that number of jobs, start a new one only when a previous one has finished. The exact number you should try to keep running concurrently depends on your machine and on the details of the computation, but something around the number of CPU cores might be a good first guess.
Note in particular that it is counterproductive to have more jobs at any one time than the machine has resources to run concurrently. If your processes do little or no I/O then the number of cores in your machine puts a cap on that, for only the processes that are scheduled on a core at any given time (at most one per core) will make progress while the others wait. Switching among many processes so as to attempt to avoid starving any of them will add overhead. If your processes do a lot of I/O then they will probably spend a fair proportion of their time blocked on I/O, and therefore not (directly) requiring a core, but in this case your I/O devices may well create a bottleneck, which might prove even worse than the limitation from number of cores.
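A sketch of that queueing idea, combined with the shell-free invocation from the other answer. The worker cap of os.cpu_count() is just the first-guess heuristic mentioned above, and fns stands in for the question's list of input files:

import os
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from subprocess import check_call

def run_pair(pair):
    f1, f2 = pair
    # List form, no shell, no trailing '&': check_call blocks until the
    # C program exits, so the pool size caps how many run at once.
    check_call(["./c_compiled", f1, f2], cwd='.')

if __name__ == '__main__':
    fns = []  # the question's input files (definition not shown)
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        pool.map(run_pair, combinations(fns, 2))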
Because of the GIL, I thought a multi-threaded Python process could only have one thread running at a time, and thus the CPU usage could not be more than 100 percent.
But I found that the code below can occupy 950% CPU usage in top.
import threading
import time

def f():
    # Busy-loop forever without doing any useful work.
    while 1:
        pass

for i in range(10):
    t = threading.Thread(target=f)
    t.setDaemon(True)
    t.start()

time.sleep(60)
This is not the same question as 'Python interpreters uses up to 130% of my CPU. How is that possible?'. In that question, the OP said he was doing I/O-intensive load testing, which may release the GIL. But in my program, there is no I/O operation.
Tests run on CPython 2.6.6.
I think in this case the CPU is busy doing thread switches instead of actual work. In other words, the thread switching is using all the CPUs to do its job, while the Python loop body is too trivial to cause observable CPU usage by itself. I tried adding some real calculation as below, and the CPU usage dropped to around 200%. And if you add more calculation, I believe the CPU usage will get very close to 100%.
def f():
    x = 1
    while 1:
        y = x * 2
One reason could be the method you're using to arrive at 950%. There's a number called (average) load which is perhaps not what one would expect before reading the documentation.
The load is the (average) number of threads that are either in the running or runnable state (in the queue for CPU time). If, as in your example, you have ten busy-looping threads, then while one thread is running the other nine are in the runnable state (in the queue for a time slot).
The load is an indication of how many cores you could have made use of, or of how much CPU power your program wants to use (not necessarily the actual CPU power it gets to use).
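For reference, the same figure is available from within Python (a trivial sketch; os.getloadavg is Unix-only):

import os

# The 1-, 5- and 15-minute load averages: the average number of threads
# that are running or waiting in the run queue.
one, five, fifteen = os.getloadavg()
print(one, five, fifteen)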
I have to call around 100 long-running C++ programs (many times), which can be both CPU- and I/O-heavy.
I'm doing the calling from Python using Popen. With my current naive approach:
for exe in all_exe:
    processes.append(Popen(exe))

while not completed(processes):
    sleep(1.0)
my system's resources get exhausted very quickly. Is there a Python library that would let me specify how many of the 100 workers I want to run at once (since running them all at once is bad), and that would start a new worker as soon as an old one completes? Maybe also with some options like nice and ionice, so that I can make the most of my system's resources without experiencing slowdowns.
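For illustration, one common way to shape this with just the standard library (a sketch, not a specific library recommendation; all_exe is the question's list, and both the worker cap and the niceness value are arbitrary):

import os
from concurrent.futures import ThreadPoolExecutor
from subprocess import run

all_exe = []  # the question's list of programs (definition not shown)

def launch(exe):
    # run() blocks until the program exits, so the pool size caps how
    # many of the ~100 programs are alive at any moment. preexec_fn
    # lowers the child's priority in the spirit of nice (POSIX only);
    # ionice has no stdlib equivalent.
    run(exe, preexec_fn=lambda: os.nice(10))

with ThreadPoolExecutor(max_workers=8) as pool:  # 8 is an arbitrary cap
    pool.map(launch, all_exe)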
The current Python application that I'm working on needs to utilize 1000+ threads (Python's threading module). Not that any single thread is working at max CPU cycles; this is just a web server load-test app I'm creating, i.e. emulating 200 Firefox clients all logging into a web server and downloading small web components, basically emulating humans that operate in seconds as opposed to microseconds.
So, I was reading through various topics such as 'how many threads does Python support on Linux / Windows, etc.', and I saw a lot of varied answers. One user said it's all about memory and that the Linux kernel by default only sets aside 8MB for threads; if that is exceeded, threads start being killed by the kernel.
One guy stated this is a non-issue for CPython because only one thread is running at a time anyway (because of the GIL), so we can specify a gazillion threads??? What's the actual truth on this?
"One thread is running at a time because of the GIL." Well, sort of. The GIL means that only one thread can be executing Python code at a time. However, any number of threads could be doing IO, various other syscalls, or other code that doesn't hold the GIL.
It sounds like your threads will be doing mostly network I/O, and any number of threads can do I/O simultaneously. The GIL competition might be pretty fierce with 1000 threads, but you can always create multiple Python processes and divide the I/O threads between them (i.e., fork a couple times before you start).
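A minimal sketch of that divide-between-processes idea (the client function, thread count, and process count here are placeholders, not anything from the question):

import multiprocessing
import threading
import time

def client():
    # Placeholder for one simulated client doing network I/O.
    time.sleep(1)

def run_group(n_threads):
    # Each process has its own GIL, so its threads only contend locally.
    threads = [threading.Thread(target=client) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    # 4 processes x 250 threads = 1000 simulated clients.
    procs = [multiprocessing.Process(target=run_group, args=(250,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()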
"The Linux kernel by default only sets aside 8Meg for threads." I'm not sure where you heard that. Maybe what you actually heard was "On Linux, the default stack size is often 8 MiB," which is true. Each thread will use up 8 MiB of address space for stack (no problem on 64-bit) plus kernel resources for the additional memory maps and the thread process itself. You can change the stack size using the threading.stack_size library function, which helps if you have a lot of threads that don't make deep calls.
>>> import threading
>>> threading.stack_size()
0  # 0 means the platform default, probably 8 MiB
>>> threading.stack_size(64 * 1024)  # use 64 KiB stacks for future threads
0
Others in this thread have suggested using an asynchronous / nonblocking framework. Well, you can do that. However, on the modern Linux kernel, a multithreaded model is competitive with asynchronous (select/poll/epoll) I/O multiplexing techniques. Rewriting your code to use an asynchronous model is a non-trivial amount of work, so I'd only do it if I couldn't get the required performance from a threaded model. If your threads are really trying to simulate human latency (e.g., spend most of their time sleeping), there are a lot of scenarios in which the asynchronous approach is actually slower. I'm not sure if this applies to Python, where the reduced GIL contention alone might merit the switch.
Both of those are partially true:
Each thread does have a stack, and you can run out of address space for the stack if you create enough threads.
Python also does have something called the GIL, which allows only one Python thread to run at a time. However, once Python code calls into C code, that C code can run while a different Python thread runs. Still, threads in Python are real OS threads, and the stack space limit is still there.
If you're planning on having many connections, rather than using many threads, consider using an asynchronous design. Twisted would probably work well here.