I'm new to multiprocessing - and I may be interpreting this wrong - but as I run my programs I notice that the more processes I spawn, the more 'sy' goes up on my Linux computer. For example:
Cpu(s): 14.0%us, 24.1%sy, 0.0%ni, 58.8%id, 0.0%wa, 2.2%hi, 0.0%si, 0.8%st
The more processes I spawn, the higher the sy value goes, and the actual processes just get halved (so a process that was at 20%/cpu before drops to 10%/cpu), while idle CPU remains the same (almost 60%). I'm not sure if this is a Linux question or a Python question, but is there anything I can do to reduce this number and allow my programs to use more of the available CPU?
The system CPU time is time used by processes inside the kernel. If you have such a big ratio of system CPU to user CPU, it probably means that your process is doing a lot of system calls.
Don't think that this is lost time: the kernel is doing something useful for your process.
You might try to lower the rate of system calls, e.g. by notably increasing your buffer sizes. Or maybe your processes are using too many synchronization primitives.
You might use strace to find out about system calls done by your processes.
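As a minimal sketch of the buffer-size suggestion (the file name and buffer sizes here are made-up examples, not taken from the question): reading in large, explicitly buffered chunks keeps the number of read() system calls low.

# Hypothetical example: fewer read() syscalls thanks to a large buffer.
with open('data.bin', 'rb', 1024 * 1024) as f:   # 1 MiB user-space buffer
    while True:
        chunk = f.read(1024 * 1024)              # roughly one read() per MiB
        if not chunk:
            break
        # ... process(chunk) here ...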
It's more likely a hardware question.
Some key things:
How much RAM is free?
Are you using Swap space?
How many CPUs do you have?
Is your app heavy on calculations?
How large are the shared variable(s)?
Does your app have any I/O?
If your app has a lot of output, you may want to look into a database option and insert the values into a table. This will add caching and control traffic flow between processes. There is then no need to share a variable, which may eventually cause other issues as the result set grows over time.
There may be some other tweaks you can do to Linux's memory to help.
Number of open files may be one. I can check which proc settings you can optimize if needed. It will help a little, but I think you may be running into a hardware wall.
Another option is to set up the manager to spawn onto other servers and then run the processes there. You will need to SSH to a machine and pass an argument indicating whether the process is master or slave. It can be done by adding an init override in the manager to redirect the processes.
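A minimal sketch of that remote-manager idea using multiprocessing.managers.BaseManager (the port, authkey, and host name are placeholders, and this assumes Python 2's Queue module):

from multiprocessing.managers import BaseManager
from Queue import Queue  # on Python 3 this is the "queue" module

task_queue = Queue()

class QueueManager(BaseManager):
    pass

# --- master: expose a shared task queue over the network ---
QueueManager.register('get_queue', callable=lambda: task_queue)
manager = QueueManager(address=('', 50000), authkey=b'change-me')
server = manager.get_server()
server.serve_forever()

# --- slave (started on another machine, e.g. via ssh): connect and consume ---
# QueueManager.register('get_queue')
# manager = QueueManager(address=('master-host', 50000), authkey=b'change-me')
# manager.connect()
# queue = manager.get_queue()
# work_item = queue.get()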
Hope this helps
Rich
Problem
We run several calculations on geographical data from user input (called a "system"). Sometimes one system needs 10 locations to do calculations for, sometimes 1000+. One location takes approximately 1 second to calculate, hopefully we can speed this up in the future. We currently do this by using a multiprocessing Pool (from billiard) from within a Celery worker. This works in that it utilises all cores 100%, but there are two problems:
There are lingering connections (pipes, probably to the child procs) that cause the worker to hang when reaching the max open file limit (investigated, but haven't found a solution after more than a day of work)
We can't spread the calculations over multiple machines.
To solve these problems, I could run each calculation as a separate Celery task. However, we also want to schedule these calculations "fairly" for our users, so that:
Users working on small systems (say <50 locations) don't have to wait until a large system (>1000 locations) is finished. The larger the system, the less the increased waiting time matters to the user (they are doing something else anyway, and can get a notification). So this would be something akin to Weighted fair queueing.
I have not been able to find a distributed task runner that implements this possibility of prioritisation. Did I miss one? I looked at Celery, RQ, Huey, MRQ, Pulsar Queue and some more, as well as into data processing pipelines like Luigi and Pinball, but none seem to easily enable this.
Most of these suggest creating priority by adding more workers for higher-priority queues. However, that wouldn't work, as the workers would start fighting for CPU time. (RQ handles it differently by emptying the complete first queue passed in before moving on to the next.)
Proposed architecture
What I imagine would work is running a multiprocessing program, with a process per CPU, that fetches, in a WFQ fashion, from multiple Redis lists, each being a certain queue.
Would this be the right approach? Of course there is quite some work to be done on making the queue configuration dynamic (for example, also storing it in Redis and reloading it after every couple of processed tasks), and on getting event monitoring so we can keep insight into what is happening.
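A very rough sketch of that idea, assuming redis-py; the queue names, weights, and run_calculation are placeholders for illustration, and the random weighting is a crude stand-in for a real WFQ scheduler:

import multiprocessing
import random

import redis

# Hypothetical queue names and weights; in practice these would be stored in
# Redis and reloaded every couple of processed tasks.
QUEUES = {'small_systems': 8, 'medium_systems': 3, 'large_systems': 1}

def run_calculation(task):
    pass  # placeholder for the per-location calculation

def worker():
    conn = redis.Redis()
    names = list(QUEUES)
    while True:
        # Crude weighted ordering: higher-weight queues tend to be tried
        # first, so small systems rarely wait behind large ones.
        order = sorted(names, key=lambda n: -QUEUES[n] * random.random())
        for name in order:
            task = conn.lpop(name)
            if task is not None:
                run_calculation(task)
                break
        else:
            # Nothing queued anywhere: block briefly on all queues at once.
            item = conn.blpop(names, timeout=1)
            if item is not None:
                run_calculation(item[1])

if __name__ == '__main__':
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=worker).start()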
Additional thoughts:
Each task needs around 3MB of data, coming from Postgres, which is the same for each location in the system (or at least per a couple of hundred locations). With the current approach, this resides in shared memory, and each process can access it quickly. I'll probably have to set up a local Redis instance on each machine to cache this data, so that not every process fetches it over and over again.
I keep hitting up on ZeroMQ, and it has a lot of enticing possibilities, but besides maybe the monitoring, it doesn't seem to be a good fit. Or am I wrong?
What would make more sense: running each worker as a separate program, and managing it with something like supervisor, or starting a single program, that forks a child for each CPU (no CPU count config necessary), and maybe also monitors its children for stuck processes?
We already run both RabbitMQ and Redis, so I could also use RMQ for the queues. It seems to me the only thing gained by using RMQ is the possibility of not losing tasks on worker crash by using acknowledgements, at the cost of using a more difficult library/complicated protocol.
Any other advice?
The current Python application that I'm working on has a need to utilize 1000+ threads (Python's threading module). Not that any single thread is working at max CPU cycles; this is just a web server load-test app I'm creating, i.e. emulating 200 Firefox clients all logging into a web server and downloading small web components, basically emulating humans that operate in seconds as opposed to microseconds.
So, I was reading through various topics such as "how many threads does Python support on Linux / Windows, etc.", and I saw a lot of varied answers. One user said it's all about memory and that the Linux kernel by default only sets aside 8 MB for threads; if that is exceeded, threads start being killed by the kernel.
Another person stated this is a non-issue for CPython because only one thread is running at a time anyway (because of the GIL), so we can specify a gazillion threads??? What's the actual truth on this?
"One thread is running at a time because of the GIL." Well, sort of. The GIL means that only one thread can be executing Python code at a time. However, any number of threads could be doing IO, various other syscalls, or other code that doesn't hold the GIL.
It sounds like your threads will be doing mostly network I/O, and any number of threads can do I/O simultaneously. The GIL competition might be pretty fierce with 1000 threads, but you can always create multiple Python processes and divide the I/O threads between them (i.e., fork a couple times before you start).
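A minimal sketch of that divide-the-threads-between-processes idea (the URL, process count, and thread count are placeholders; this assumes Python 2's urllib2):

import multiprocessing
import threading
import urllib2  # on Python 3 use urllib.request

def fetch(url):
    # One simulated client: the GIL is released while waiting on network I/O.
    urllib2.urlopen(url).read()

def run_threads(url, thread_count):
    threads = [threading.Thread(target=fetch, args=(url,))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    # e.g. 4 processes x 250 threads = 1000 simulated clients, so each
    # process's GIL is only contended by 250 threads instead of 1000.
    procs = [multiprocessing.Process(target=run_threads,
                                     args=('http://localhost/', 250))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()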
"The Linux kernel by default only sets aside 8Meg for threads." I'm not sure where you heard that. Maybe what you actually heard was "On Linux, the default stack size is often 8 MiB," which is true. Each thread will use up 8 MiB of address space for stack (no problem on 64-bit) plus kernel resources for the additional memory maps and the thread process itself. You can change the stack size using the threading.stack_size library function, which helps if you have a lot of threads that don't make deep calls.
>>> import threading
>>> threading.stack_size()
0        # 0 means "platform default", often 8 MiB
>>> threading.stack_size(64 * 1024)   # use a 64 KiB stack for future threads
0        # returns the previous setting
Others in this thread have suggested using an asynchronous / nonblocking framework. Well, you can do that. However, on the modern Linux kernel, a multithreaded model is competitive with asynchronous (select/poll/epoll) I/O multiplexing techniques. Rewriting your code to use an asynchronous model is a non-trivial amount of work, so I'd only do it if I couldn't get the required performance from a threaded model. If your threads are really trying to simulate human latency (e.g., spend most of their time sleeping), there are a lot of scenarios in which the asynchronous approach is actually slower. I'm not sure if this applies to Python, where the reduced GIL contention alone might merit the switch.
Both of those are partially true:
Each thread does have a stack, and you can run out of address space for the stack if you create enough threads.
Python also has something called the GIL, which allows only one Python thread to run at a time. However, once Python code calls into C code, that C code can release the GIL and run while a different Python thread runs. Threads in Python are still real OS threads, though, so the stack-space limit still applies.
If you're planning on having many connections, rather than using many threads, consider using an asynchronous design. Twisted would probably work well here.
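A minimal Twisted sketch of that asynchronous design (the URL and connection count are placeholders; this is an illustration, not a tested load-test client):

from twisted.internet import reactor
from twisted.web.client import Agent

agent = Agent(reactor)

def fetch(url):
    # One logical "client" per request; no dedicated thread is needed.
    d = agent.request('GET', url)
    d.addCallback(lambda response: response.code)
    return d

# Kick off many concurrent requests from a single thread.
for _ in range(1000):
    fetch('http://localhost/')
reactor.run()  # runs until stopped, e.g. with reactor.stop()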
We have about 500 GB of images in various directories that we need to process. Each image is about 4 MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process depending on size.
We have at our disposal a 2.2Ghz quad core processor and 16GB of RAM on a GNU/Linux OS. The current script is utilizing only one processor. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but not sure how I can utilize it.
Will starting multiple Python processes to run the script take advantage of the other cores?
Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.
That might work. Also, have a look at the Python binding for ZeroMQ, it makes distributed processing pretty easy.
I've taken a look at the multiprocessing library but not sure how I can utilize it.
Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then
import multiprocessing

# One worker process per core; process() and directories are as defined above.
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))
will process all the directories in parallel. You can also do the parallelism at the file-level if you want; that needs just a bit more tinkering.
Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
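If you do want the file-level parallelism, here is a minimal sketch (process_file and the root directory are placeholders for the per-image work):

import os
import multiprocessing

def process_file(path):
    # Placeholder: read one image's metadata and store it in the database.
    return True

def all_image_paths(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # chunksize keeps per-task overhead low when there are many small files.
    results = pool.imap_unordered(process_file,
                                  all_image_paths('/data/images'),
                                  chunksize=64)
    success = all(results)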
Starting independent Python processes is ideal. There will be no lock contentions between the processes, and the OS will schedule them to run concurrently.
You may want to experiment to see what the ideal number of instances is - it may be more or less than the number of cores. There will be contention for the disk and cache memory, but on the other hand you may get one process to run while another is waiting for I/O.
You can use multiprocessing's Pool to create processes and increase performance. Let's say you have a function handle_file for processing an image. If you simply iterate, you can only use at most 100% of one of your cores. To utilize multiple cores, multiprocessing's Pool creates subprocesses for you and distributes your tasks to them. Here is an example:
import os
import multiprocessing

def handle_file(path):
    print 'Do something to handle file ...', path
    return True  # return a value so the all() below is meaningful

def run_multiprocess():
    tasks = []
    for filename in os.listdir('.'):
        tasks.append(filename)
        print 'Create task', filename
    pool = multiprocessing.Pool(8)
    result = all(pool.imap_unordered(handle_file, tasks))
    print 'Finished, result=', result

def run_one_process():
    for filename in os.listdir('.'):
        handle_file(filename)

if __name__ == '__main__':
    # run_one_process()  # single-core version, for comparison
    run_multiprocess()
run_one_process is the single-core way to process the data: simple, but slow. On the other hand, run_multiprocess creates 8 worker processes and distributes tasks to them. It would be about 8 times faster if you have 8 cores. I suggest you set the worker count to either double the number of your cores or exactly the number of your cores; try both and see which configuration is faster.
For advanced distributed computing, you can use ZeroMQ as larsmans mentioned. It's hard to understand at first. But once you understand it, you can design a very efficient distributed system to process your data. In your case, I think one REQ with multiple REP would be good enough.
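For what it's worth, the classic ZeroMQ work-distribution pattern is PUSH/PULL rather than REQ/REP; a minimal pyzmq sketch assuming Python 2 (host name, port, and directory paths are placeholders):

import zmq

# --- ventilator: the machine that knows the list of directories ---
context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind('tcp://*:5557')
for directory in ['/data/images/a', '/data/images/b']:
    sender.send(directory)

# --- worker (run one or more per machine): pull a directory and process it ---
# context = zmq.Context()
# receiver = context.socket(zmq.PULL)
# receiver.connect('tcp://ventilator-host:5557')
# while True:
#     directory = receiver.recv()
#     process(directory)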
Hope this would be helpful.
See the answer to this question.
If the app can process ranges of input data, then you can launch 4 instances of the app with different ranges of input data to process, and then combine the results after they are all done.
Even though that question looks to be Windows-specific, it applies to single-threaded programs on all operating systems.
WARNING: Beware that this process will be I/O bound and too much concurrent access to your hard drive will actually cause the processes as a group to execute slower than sequential processing because of contention for the I/O resource.
If you are reading a large number of files and saving metadata to a database, your program does not need more cores.
Your process is likely I/O-bound, not CPU-bound. Using Twisted with proper Deferreds and callbacks would likely outperform any solution that sought to enlist 4 cores.
I think in this scenario it would make perfect sense to use Celery.
I am developing some Python code for Windows. One criterion is that it uses less than 1% of the CPU. I understand that it is impossible to guarantee this all the time due to things like garbage collection, but what would be the best practice to get as close as possible? My current solution is to spread a lot of time.sleep(0.1) calls around the code, especially in loops. There are, however, obvious problems with this approach.
What other approaches could be taken?
I should also mention that the application has lots of threads in it using the threading library.
EDIT: Setting the process priority is not what I am after.
It is the job of the operating system to schedule CPU time. Use your operating system's built-in process-limits mechanisms (hopefully they exist on Windows) to restrict your process to <1% CPU.
This style of sprinkling unnecessary sleeps every few lines in the code will make the code terrible to create and extend and maintain, not to mention incredibly inelegant. (Rate-limiting yourself may be useful in very small, limited, critical sections -- for example your program is queuing lots of IO requests and you don't wish to inundate the operating system, you might wish to put a single sleep-until-[condition] in each critical loop which has the potential to inundate the system, but otherwise use extremely sparingly.)
Ideally you would call an API to the appropriate OS mechanisms from within your program when you start up, telling the OS to throttle you appropriately.
If the goal is to not bother the user then "below 1% CPU" is the wrong approach. What you really want is "don't take time away from other processes but still complete as fast as possible" - that's what "below normal" process priority is for. See http://code.activestate.com/recipes/496767-set-process-priority-in-windows/ for an example of how process priority can be changed for the current process (calling that function with default parameters will do).
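A minimal ctypes-only sketch of the same idea, with no extra dependencies (the constant is the documented value of Windows' BELOW_NORMAL_PRIORITY_CLASS):

import ctypes

BELOW_NORMAL_PRIORITY_CLASS = 0x00004000  # documented Windows constant

def lower_own_priority():
    # GetCurrentProcess() returns a pseudo-handle for the calling process.
    handle = ctypes.windll.kernel32.GetCurrentProcess()
    ctypes.windll.kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS)

lower_own_priority()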
For the sales pitch you can show the task manager while the computer is idle ("See? 99%, my application gets lots of work done") and then start some CPU-intensive application ("Almost all CPU time is spent in the application the user is working with, my application simply went into background").
If the box used for the demonstration is a Windows Server, it can use Windows System Resource Manager for restricting CPU usage below the desired threshold. Trying to force this behavior by code is impossible, unless a Windows API exposes this capability explicitly.
I use Python 2.5.4. My computer: CPU AMD Phenom X3 720BE, Mainboard 780G, 4GB RAM, Windows 7 32 bit.
I use Python threading but cannot make every python.exe process consume 100% CPU. Why are they using only about 33-34% on average?
I wish to direct all available computer resources toward these large calculations so as to complete them as quickly as possible.
EDIT:
Thanks everybody. Now I'm using Parallel Python and everything works well. My CPU now always at 100%. Thanks all!
It appears that you have a 3-core CPU. If you want to use more than one CPU core in native Python code, you have to spawn multiple processes. (Two or more Python threads cannot run concurrently on different CPUs)
As R. Pate said, Python's multiprocessing module is one way. However, I would suggest looking at Parallel Python instead. It takes care of distributing tasks and message-passing. You can even run tasks on many separate computers with little change to your code.
Using it is quite simple:
import pp

def parallel_function(arg):
    return arg

job_server = pp.Server()

# Define your jobs
job1 = job_server.submit(parallel_function, ("foo",))
job2 = job_server.submit(parallel_function, ("bar",))

# Compute and retrieve answers for the jobs.
print job1()
print job2()
Try the multiprocessing module; Python, while it has real, native threads, restricts their concurrent use while the GIL is held. Another alternative, and something you should look at if you need real speed, is writing a C extension module and calling functions in it from Python. You can release the GIL in those C functions.
Also see David Beazley's Mindblowing GIL.
Global Interpreter Lock
The reasons for employing such a lock include:
* increased speed of single-threaded programs (no necessity to acquire or release locks on all data structures separately)
* easy integration of C libraries that usually are not thread-safe.
Applications written in languages with a GIL have to use separate processes (i.e. interpreters) to achieve full concurrency, as each interpreter has its own GIL.
From the CPU usage it looks like you're still running on a single core. Try running a trivial calculation with 3 or more threads with the same threading code and see if it utilizes all cores. If it doesn't, something might be wrong with your threading code.
What about Stackless Python?
Your bottleneck is probably somewhere else, like the hard drive (paging) or memory access.
You should perform some operating system and Python monitoring to determine where the bottleneck is.
Here is some info for Windows 7:
Performance Monitor: You can use Windows Performance Monitor to examine how programs you run affect your computer’s performance, both in real time and by collecting log data for later analysis. (Control Panel -> All Control Panel Items -> Performance Information and Tools -> Advanced Tools -> View Performance Monitor)
Resource Monitor: Windows Resource Monitor is a system tool that allows you to view information about the use of hardware (CPU, memory, disk, and network) and software (file handles and modules) resources in real time. You can use Resource Monitor to start, stop, suspend, and resume processes and services. (Control Panel -> All Control Panel Items -> Performance Information and Tools -> Advanced Tools -> View Resource Monitor)
I solved the problem that led me to this post by running a second script manually. This post helped me run multiple Python scripts at the same time.
I managed to execute the second script by typing a command in the newly opened terminal window. Not as convenient as Shift+Enter, but it does the job.