How to run batch jobs in Python with a pool of resources?

I have to call around 100 C++ long-running programs (many times), which can be both CPU and IO heavy.
I'm doing the calling from Python using Popen. With my current naive approach:
for exe in all_exe:
    processes.append(Popen(exe))
while not completed(processes):
    sleep(1.0)
my system's resources get exhausted very quickly. Is there a Python library that would let me specify how many of the 100 workers I want to run at once (since running them all at once is bad) and that would start a new worker as soon as an old one completes?
Maybe also with options like nice and ionice, so that I can make the most use of my system's resources without experiencing slowdowns.
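A minimal sketch of one way to get this behavior with just the standard library: a fixed-size pool of threads, each of which blocks on one subprocess at a time. The worker count, the placeholder command list, and the nice/ionice prefix are illustrative assumptions, not something taken from the question.

import subprocess
from concurrent.futures import ThreadPoolExecutor

all_exe = ["./prog_a", "./prog_b"]  # placeholder list of executables

def run_one(exe):
    # nice/ionice lower CPU and IO priority on Linux; drop the prefix if unavailable
    cmd = ["nice", "-n", "10", "ionice", "-c", "2", "-n", "7", exe]
    return subprocess.run(cmd).returncode

# Only 8 subprocesses run at once; a new one starts as soon as a slot frees up.
with ThreadPoolExecutor(max_workers=8) as pool:
    return_codes = list(pool.map(run_one, all_exe))
print(return_codes)

multiprocessing.Pool or a semaphore around Popen would work equally well; the key point is bounding the number of live subprocesses.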

Related

PyCharm Python Threading

I am running a program which is attempting to open 1000 threads using Python's ThreadPoolExecutor, which I have configured to allow a maximum of 1000 threads. On a Windows machine with 4GB of memory, I am able to start ~870 threads before I get a runtime error: can't start new thread. With 16GB of memory, I am able to start ~870 threads as well, though the runtime error, can't start new thread, occurs two minutes later. All threads are running a while loop, which means that they will never complete their tasks. This is the intention.
Why is PyCharm/Windows/Python, whichever may be the culprit, failing to start more than 870 of the 1000 threads which I am attempting to start, with that number staying the same despite a significant change in RAM? This leads me to conclude that hardware limitations are not the problem, which leaves me completely and utterly confused.
What could be causing this, and how do I fix it?
It is very hard to say without all the details of your configuration and your code, but my guess is that it's Windows being starved for certain kinds of memory. I suggest looking into the details in this article.
I attempted to duplicate your issue with PyCharm and Python 3.8 on my Linux box (32GB of RAM) and I was able to create 10000 threads with a ThreadPoolExecutor using the code below. Note that I have every thread sleep for quite a while upon creation; otherwise thread creation slows way down, because the main thread of execution, which is trying to create the threads, becomes CPU-starved.
from concurrent.futures import ThreadPoolExecutor
import time

def runForever():
    time.sleep(10)  # sleep on creation so the main thread is not CPU-starved
    while True:
        for i in range(100):
            a = 10

t = ThreadPoolExecutor(max_workers=10000)
for i in range(10000):
    t.submit(runForever)
    print(len(t._threads))
print(len(t._threads))

Fastest way to process large files in Python

We have about 500GB of images in various directories we need to process. Each image is about 4MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process depending on size.
We have at our disposal a 2.2Ghz quad core processor and 16GB of RAM on a GNU/Linux OS. The current script is utilizing only one processor. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.
Will starting multiple Python processes to run the script take advantage of the other cores?
Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.
That might work. Also, have a look at the Python binding for ZeroMQ, it makes distributed processing pretty easy.
I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.
Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then
import multiprocessing
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))
will process all the directories in parallel. You can also do the parallelism at the file-level if you want; that needs just a bit more tinkering.
Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
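For the file-level variant mentioned above, a rough sketch along the same lines; the directory root and the process_file helper are assumptions standing in for the real metadata-extraction code:

import os
import multiprocessing

def process_file(path):
    # read this image's metadata and store it in the database (assumed helper)
    return True

if __name__ == "__main__":
    files = [os.path.join(root, name)
             for root, _, names in os.walk("/data/images")
             for name in names]
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # chunksize batches files per worker to cut down on IPC overhead
    success = all(pool.imap_unordered(process_file, files, chunksize=64))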
Starting independent Python processes is ideal. There will be no lock contention between the processes, and the OS will schedule them to run concurrently.
You may want to experiment to see what the ideal number of instances is - it may be more or less than the number of cores. There will be contention for the disk and cache memory, but on the other hand you may get one process to run while another is waiting for I/O.
You can use multiprocessing's Pool to create worker processes and increase performance. Let's say you have a function handle_file which processes an image. If you use a plain loop, it can only use at most 100% of one core. To utilize multiple cores, multiprocessing.Pool creates subprocesses for you and distributes your tasks to them. Here is an example:
import os
import multiprocessing

def handle_file(path):
    print 'Do something to handle file ...', path
    return True  # report success so all() below evaluates correctly

def run_multiprocess():
    tasks = []
    for filename in os.listdir('.'):
        tasks.append(filename)
        print 'Create task', filename
    pool = multiprocessing.Pool(8)
    result = all(pool.imap_unordered(handle_file, tasks))
    print 'Finished, result=', result

def run_one_process():
    for filename in os.listdir('.'):
        handle_file(filename)

if __name__ == '__main__':
    # run_one_process()  # single-core version, for comparison
    run_multiprocess()
run_one_process is the single-core way to process the data: simple, but slow. run_multiprocess, on the other hand, creates 8 worker processes and distributes the tasks to them. It would be about 8 times faster if you have 8 cores. I suggest setting the worker count to the number of cores you have, or to double that; try both and see which configuration is faster.
For advanced distributed computing, you can use ZeroMQ, as larsmans mentioned. It's hard to understand at first, but once you understand it, you can design a very efficient distributed system to process your data. In your case, I think one REQ with multiple REP would be good enough.
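To make the ZeroMQ idea concrete, here is a rough pyzmq sketch of a closely related arrangement (one REP dispatcher handing out paths to REQ workers, rather than the one-REQ/multiple-REP topology mentioned above); the port, message format, and STOP token are made up:

import zmq

def dispatcher(paths, n_workers, port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:%d" % port)
    # Hand out one path per request, then a STOP token to each worker.
    for item in list(paths) + ["STOP"] * n_workers:
        sock.recv()              # a worker says it is ready
        sock.send_string(item)

def worker(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://localhost:%d" % port)
    while True:
        sock.send(b"ready")
        path = sock.recv_string()
        if path == "STOP":
            break
        # handle_file(path)      # process one image here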
Hope this would be helpful.
See the answer to this question.
If the app can process ranges of input data, then you can launch 4 instances of the app with different ranges of input data to process, and then combine the results after they are all done.
Even though that question looks to be Windows specific, it applies to single-threaded programs on all operating systems.
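A hedged sketch of that range-splitting idea, assuming the processing script (called process_images.py here, a made-up name) accepts start/end indices on its command line:

import subprocess

total = 100000            # total number of items to process (assumed)
instances = 4
step = total // instances

procs = []
for i in range(instances):
    start = i * step
    end = (i + 1) * step if i < instances - 1 else total
    procs.append(subprocess.Popen(
        ["python", "process_images.py", "--start", str(start), "--end", str(end)]))

for p in procs:           # wait for all instances, then combine their results
    p.wait()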
WARNING: this workload will be I/O bound, and too much concurrent access to your hard drive can actually make the processes as a group run slower than sequential processing, because of contention for the I/O resource.
If you are reading a large number of files and saving metadata to a database, your program does not need more cores.
Your process is likely IO-bound, not CPU-bound. Using Twisted with proper Deferreds and callbacks would likely outperform any solution that sought to enlist 4 cores.
I think in this scenario it would make perfect sense to use Celery.
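As a minimal illustration of the Celery route (the broker URL and the handle_file body are assumptions):

# tasks.py -- run workers with: celery -A tasks worker
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def handle_file(path):
    # read this image's metadata and store it in the database (assumed)
    return path

# Enqueue work from another script or shell:
#   from tasks import handle_file
#   handle_file.delay("/data/images/img0001.jpg")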

Python multiprocessing: What are ways I can reduce 'sy' process overhead?

I'm new to multiprocessing, and I may be interpreting this wrong, but as I run my programs I notice that the more processes I spawn, the more 'sy' goes up on my Linux computer. For example:
Cpu(s): 14.0%us, 24.1%sy, 0.0%ni, 58.8%id, 0.0%wa, 2.2%hi, 0.0%si, 0.8%st
The more processes I spawn, the higher 'sy' goes, while the per-process CPU usage just gets halved (so a process that was at 20% CPU drops to 10% CPU) and idle CPU stays about the same (almost 60%). I'm not sure if this is a Linux question or a Python question, but is there anything I can do to reduce this number and allow my programs to use more of the available CPU?
The system CPU time is time used by processes inside the kernel. If you have such a big ratio of system CPU to user CPU, it probably means that your process is doing a lot of system calls.
Don't think that this is lost time: the kernel is doing something useful for your process.
You might try to lower the rate of system calls, e.g. by notably increasing your buffer sizes. Or maybe your processes are using too many synchronization primitives.
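As one hedged illustration of the buffer-size point: reading in large chunks (or opening the file with a bigger buffer) issues far fewer read() system calls than many small reads; the file name and chunk size below are placeholders:

CHUNK = 8 * 1024 * 1024   # 8 MiB per read instead of the default small buffer

with open("big_input.bin", "rb", buffering=CHUNK) as f:
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        # ... process block ...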
You might use strace to find out about system calls done by your processes.
It's more likely a hardware question.
Some key things:
How much RAM is free?
Are you using Swap space?
How many CPUs do you have?
Is your app heavy on calculations?
How large are the shared variable(s)?
Does your app have any I/O?
If your app has a lot of output, you may want to look into a database option and insert the values into a table. This will add caching and control traffic flow between processes. No need to share a variable which may eventually cause other issues when the result set increases over time.
There may be some other tweaks you can do to Linux's memory to help.
Number of open files may be one. I can check which proc settings you can optimize if needed. It will help a little, but I think you may be running into a hardware wall.
Another option is to set up the manager to spawn workers on other servers, and then run processes there. You will need to ssh to a machine and pass an argument indicating whether the process is master or slave. It can be done by adding an init override in the manager to redirect the processes.
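If you do spread workers over several machines with multiprocessing, one standard-library shape for it is a shared queue served by multiprocessing.managers.BaseManager; the address, authkey, and queue name below are assumptions:

from multiprocessing.managers import BaseManager
from queue import Queue

task_queue = Queue()

class QueueManager(BaseManager):
    pass

# On the master: expose the queue so remote workers can reach it.
QueueManager.register("get_tasks", callable=lambda: task_queue)
manager = QueueManager(address=("", 50000), authkey=b"secret")
server = manager.get_server()
# server.serve_forever()   # blocks; run this on the master machine

# On a worker machine: connect and pull tasks.
#   QueueManager.register("get_tasks")
#   m = QueueManager(address=("master-host", 50000), authkey=b"secret")
#   m.connect()
#   item = m.get_tasks().get()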
Hope this helps
Rich

Task queue processing in Python

The task:
I have a task queue stored in a database, and it grows. I need to process the tasks with a Python script when I have the resources for it. I see two ways:
A Python script that runs all the time. But I don't like that (possible memory leaks).
A Python script called by cron that does a small part of the work each run. But then I need to make sure only one active script is running at a time (to prevent the number of active scripts from growing). What is the best way to implement this in Python?
Any ideas to solve this problem at all?
You can use a lockfile to prevent multiple scripts from running out of cron. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything that you need to make sure won't have multiple instances running, actually, so you should look into it even if you do have the script running constantly, which I do suggest.
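A minimal lockfile sketch using only the standard library; the lock path is arbitrary, and on Windows you would need a mechanism other than fcntl:

import fcntl
import sys

LOCK_PATH = "/tmp/task_runner.lock"   # arbitrary location for the lockfile

lock_file = open(LOCK_PATH, "w")
try:
    # Non-blocking exclusive lock; fails if another instance already holds it.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("Another instance is already running; exiting.")
    sys.exit(0)

# ... process a batch of tasks from the queue here ...
# The lock is released automatically when this process exits.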
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
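A sketch of that loop-plus-worker-processes shape; fetch_next_task and run_task are stand-ins for the real database polling and task code:

import time
import multiprocessing

def run_task(task):
    # do the actual work; the process exits afterwards, returning its memory
    pass

def fetch_next_task():
    # query the database for an unclaimed task, or return None (assumed helper)
    return None

if __name__ == "__main__":
    MAX_WORKERS = 4
    workers = []
    while True:
        workers = [w for w in workers if w.is_alive()]
        task = fetch_next_task() if len(workers) < MAX_WORKERS else None
        if task is not None:
            p = multiprocessing.Process(target=run_task, args=(task,))
            p.start()
            workers.append(p)
        else:
            time.sleep(5)   # nothing to do (or the pool is full): poll again later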
This is a bit of a vague question. One thing to remember is that it is very difficult to leak memory in Python because of the automatic garbage collection. Croning a Python script to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.

Why does my Python program average only 33% CPU per process? How can I make Python use all available CPU?

I use Python 2.5.4. My computer: CPU AMD Phenom X3 720BE, Mainboard 780G, 4GB RAM, Windows 7 32 bit.
I use Python threading but cannot make every python.exe process consume 100% CPU. Why are they using only about 33-34% on average?
I wish to direct all available computer resources toward these large calculations so as to complete them as quickly as possible.
EDIT:
Thanks everybody. Now I'm using Parallel Python and everything works well. My CPU is now always at 100%. Thanks all!
It appears that you have a 3-core CPU. If you want to use more than one CPU core in native Python code, you have to spawn multiple processes. (Two or more Python threads cannot execute Python bytecode concurrently on different CPUs.)
As R. Pate said, Python's multiprocessing module is one way. However, I would suggest looking at Parallel Python instead. It takes care of distributing tasks and message-passing. You can even run tasks on many separate computers with little change to your code.
Using it is quite simple:
import pp

def parallel_function(arg):
    return arg

job_server = pp.Server()

# Define your jobs
job1 = job_server.submit(parallel_function, ("foo",))
job2 = job_server.submit(parallel_function, ("bar",))

# Compute and retrieve answers for the jobs.
print job1()
print job2()
Try the multiprocessing module: Python has real, native threads, but their concurrent execution is restricted while the GIL is held. Another alternative, and something you should look at if you need real speed, is writing a C extension module and calling its functions from Python. You can release the GIL in those C functions.
Also see David Beazley's Mindblowing GIL.
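A small sketch of the multiprocessing route for a CPU-bound calculation; the heavy_calculation body and its inputs are placeholders:

import multiprocessing

def heavy_calculation(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [10 ** 6] * 12
    with multiprocessing.Pool() as pool:   # one worker process per core by default
        results = pool.map(heavy_calculation, inputs)
    print(results)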
Global Interpreter Lock
The reasons of employing such a lock include:
* increased speed of single-threaded programs (no necessity to acquire or release locks on all data structures separately)
* easy integration of C libraries that usually are not thread-safe.
Applications written in languages with a GIL have to use separate processes (i.e. interpreters) to achieve full concurrency, as each interpreter has its own GIL.
From the CPU usage it looks like you're still running on a single core. Try running a trivial calculation with 3 or more threads using the same threading code and see if it utilizes all cores. If it doesn't, something might be wrong with your threading code.
What about Stackless Python?
Your bottleneck is probably somewhere else, like the hard drive (paging) or memory access.
You should perform some operating system and Python monitoring to determine where the bottleneck is.
Here is some info for windows 7:
Performance Monitor: You can use Windows Performance Monitor to examine how programs you run affect your computer’s performance, both in real time and by collecting log data for later analysis. (Control Panel-> All Control Panel Items->Performance Information and Tools-> Advanced Tools- > View Performance Monitor)
Resource Monitor: Windows Resource Monitor is a system tool that allows you to view information about the use of hardware (CPU, memory, disk, and network) and software (file handles and modules) resources in real time. You can use Resource Monitor to start, stop, suspend, and resume processes and services. (Control Panel-> All Control Panel Items->Performance Information and Tools-> Advanced Tools- > View Resource Monitor)
I solved the problem that led me to this post by running a second script manually. This post helped me run multiple Python scripts at the same time.
I managed to execute it by typing a command in the newly-opened terminal window. Not as convenient as Shift+Enter, but it does the job.
