I have some Python code that performs a task with no shared state, so no race conditions can occur from parallelism. I'm merely trying to speed up processing: I have four files, and rather than reading each of them one at a time, I'd like to open all four and read/edit data from them simultaneously.
I've read a few questions on here explaining that true multi-threaded parallelism isn't possible in Python due to the Global Interpreter Lock, but that multiprocessing gets around this. For the record, my code does exactly what it's meant to when I simply run it four times in four separate terminals - I'm guessing that's effectively "multiprocessing" - but I'd like a cleaner programmatic solution.
The data-sets are large, so it can be assumed that as soon as a "process" is given to the interpreter, it is essentially busy for a long time period:
Example:
import multiprocessing
def worker():
    while True:
        # do some stuff
        return

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
The issue I'm seeing is that the above appears to execute the first process and never realistically start the second until execution finishes on the first, making them run one after the other.
Is there a way I can effectively run worker X times, either by starting them all at the same time or by preventing any of them from running until they have all started? My OS has access to 8 cores.
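For reference, the pattern I've been trying, reduced to a minimal sketch with a placeholder worker (start every process first, then join them all):

import multiprocessing

def worker(n):
    print("worker", n, "running")   # placeholder for the real per-file work

if __name__ == '__main__':
    jobs = []
    for i in range(4):              # one process per file
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()                   # returns immediately; the child runs concurrently
    for p in jobs:
        p.join()                    # block until every worker has finished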
Related
I wanted to speed up a Python script I have that iterates over 300 records. So I figured I'd try to use threading. My non-threaded version takes just under 1 minute to execute; my threaded version is only about 1 second faster. Here are the pertinent parts of the threaded version of the script:
... other imports ...
import threading
import concurrent.futures

# global vars
threads = []
check_records = []
default_max_problems = 5
problems_found = 0
lock = threading.Lock()

... some functions ...

def check_host(rec):
    with lock:
        global problems_found
        global max_problems
        if problems_found >= max_problems:
            # I'd prefer to stop all threads and stop new ones from starting,
            # but I don't know how to do that.
            return
        ... bunch of function calls that do network stuff ...
        check_records.append(rec)
        if not (reachable and dns_ready):
            problems_found += 1
            logging.debug(f"check_host problems_found is {problems_found}.")

if __name__ == '__main__':
    ... handle command line args ...
    try:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for ip in get_ips():
                req_rec = find_dns_req_record(ip, dns_record_reqs)
                executor.submit(check_host, req_rec)
Why is the performance of my threaded script almost the same as my non-threaded version?
The kind of work you are performing is important to answer the question. If you are performing many IO-bound tasks (network calls, disk reads, etc.), then using Python's multi-threading should provide a good speed increase, since you can now have multiple threads waiting for multiple IO calls.
However, if you are performing raw computation, then multi-threading won't help you, because of Python's GIL (Global Interpreter Lock), which basically only allows one thread to run Python bytecode at a time. To speed up non-IO-bound computation, you will need to use the multiprocessing module and spin up multiple Python processes. One of the disadvantages of multiple processes vs. multiple threads is that it is harder to share data/memory between processes (because they have separate address spaces), whereas threads share memory because they are part of the same process.
Another thing that is important to consider is how you are using locks. If you put too much code under a lock, then threads won't be able to concurrently execute that code. You should try to have the smallest amount of code possible under any given lock, and only in places where shared data is accessed. If your entire thread function body is under a lock then you eliminate the potential for speed improvement via multi-threading.
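For instance, here is a hedged sketch of what a narrower lock might look like for a function shaped like check_host (do_network_checks is a hypothetical stand-in for the question's network calls; lock, max_problems, check_records, and logging come from the question's code):

def check_host(rec):
    global problems_found
    with lock:                        # quick check of shared state
        if problems_found >= max_problems:
            return
    # the slow, independent work happens OUTSIDE the lock so threads overlap
    reachable, dns_ready = do_network_checks(rec)   # hypothetical stand-in
    with lock:                        # touch shared data only briefly
        check_records.append(rec)
        if not (reachable and dns_ready):
            problems_found += 1
            logging.debug(f"check_host problems_found is {problems_found}.")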
import multiprocessing
import threading
import time

def multiprocess_function():
    run = 0
    while run == 0:
        for i in range(100):
            # this creates and starts 100 threads
            threading.Thread(target=sum, args=((i, 0),)).start()  # sum takes an iterable
        time.sleep(10)

p1 = multiprocessing.Process(target=multiprocess_function)
p1.start()   # start is a method call: start(), not start
In the above code snippet, I am starting a new process (on a separate core, say #2) that runs an infinite loop. Within this function, I launch 100 threads. Will the threads run on that same core #2, or will they run on the core the main Python process uses?
Also, how many threads can you run on one core?
All threads run within the process they started from. In this case, the process p1.
With regard to how many threads you can run in a process, you have to keep in mind that in CPython, only one thread at a time can be executing Python bytecode. This is enforced by the Global Interpreter Lock ("GIL"). So for jobs that require a lot of calculations it is generally better to use processes and not threads.
If you look at the documentation for concurrent.futures.ThreadPoolExecutor, the default number of worker threads is derived from the processor count: five times the number of processors in Python 3.5 through 3.7, and min(32, os.cpu_count() + 4) since Python 3.8. That seems to be a reasonable amount for the kinds of mostly IO-bound workloads that ThreadPoolExecutor is meant for.
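A small experiment to see the GIL effect for yourself (a sketch; the absolute timings will vary by machine, but the process pool should finish the CPU-bound batch noticeably faster on a multi-core box):

import concurrent.futures
import os
import time

def cpu_task(n):
    # pure computation: under the GIL only one thread makes progress at a time
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8
    for pool_cls in (concurrent.futures.ThreadPoolExecutor,
                     concurrent.futures.ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls() as ex:
            list(ex.map(cpu_task, work))
        print(pool_cls.__name__, round(time.perf_counter() - start, 2))
    print("cores:", os.cpu_count())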
The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then sometimes fewer and fewer CPUs run, until only one CPU is left running; only when that last one finishes its task do all the CPUs start running again, each with a new task. It shouldn't need to wait for any "task batch" like this.
My (simplified) code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]
jsons = [...] is a list of about 1000 JSONs already loaded to memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic so that, hopefully, the longest tasks are first in the list and thus start first.
The json2features function also prints when a task is finished and how long it took. It all runs on an Ubuntu server with 48 cores, and like I said above, it starts out great, using all 47 cores. Then as the tasks get completed, fewer and fewer cores run, which would sound perfectly OK, were it not that after the last core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again for the actual end of the list.
Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead)
There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc..), nor any dependency between json2features calls (no global state or anything) except for them using the same terminal for their print.
I was suspicious that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]
And it does print all 1000 numbers, even though get isn't called.
I have run out of ideas as to what the problem might be.
Is this really the normal behavior of Pool?
Thanks a lot!
The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.
Usually on Unix, the default pipe size ranges from 4 to 64 KiB. If the JSONs you are delivering are large, you might get the pipe clogged at any given point in time.
This means that, while one of the workers is busy reading the large JSON from the pipe, all the other workers will starve.
It is generally a bad practice to share large data via IPC as it leads to bad performance. This is even underlined in the multiprocessing programming guidelines.
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
Instead of reading the JSON files in the main process, just send the workers their file names and let them open and read the content. You will surely notice an improvement in performance because you are moving the JSON loading phase into the concurrent domain as well.
Note that the same is true also for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe then you will get all the processes waiting for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading on the main process to quickly read back the results from the files.
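A hedged sketch of that refactoring, assuming the JSONs live in individual files (the question loads them up front, so the path handling and the json2features call here are adaptations, not the asker's actual code):

import json
from multiprocessing import Pool

def json2features_from_path(path):
    # load inside the worker, so the large JSON never crosses the task pipe
    with open(path) as f:
        obj = json.load(f)
    feats = json2features(obj)      # the CPU-heavy function from the question
    out_path = path + '.feats'
    with open(out_path, 'w') as f:  # large results go to disk, not the result pipe
        json.dump(feats, f)
    return out_path                 # only a short string travels back over IPC

if __name__ == '__main__':
    json_paths = [...]              # file names instead of parsed objects
    with Pool(47) as pool:
        result_paths = pool.map(json2features_from_path, json_paths)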
In python, I'm writing a script that runs an external process. This external process does the following steps:
1. Fetch a value from a config file, taking into account other running processes.
2. Run another process, using the value from step 1.
Step 1 can be bypassed by passing in a value to use. Trying to use the same value concurrently is an error, but using it sequentially is valid. (Think of it as a pool of "pids", with no more than 10 available.) Other processes (e.g. a user logging in) can also use one of these "pids".
The external process takes a few hours to run, and multiple independent copies must be run. Running them sequentially works, but takes too long.
I'm changing the script to run these processes concurrently using the multiprocessing module. A simplified version of my code is:
from multiprocessing import Pool
import subprocess
def longRunningTask(n):
    subprocess.call(["ls", "-l"])  # real code uses a process with no screen I/O

if __name__ == '__main__':
    myArray = [1, 2, 3, 4, 5]
    pool = Pool(processes=3)
    pool.map(longRunningTask, myArray)
Using this code fails, because it uses the same "pid" for every process started.
The solutions I've come up with are:

1. If the call fails, have a random delay and try again. This could end up busy-waiting for hours if enough "pids" are in use.
2. Create a Queue of the available "pids", get() an item from it before starting the process, and put() it back when the process completes. This would still need to wait if the "pid" was in use, the same as number 1.
3. Use a Manager to hold an array of "pids" that are in use (starting empty). Before starting the process, get a "pid", check whether it's in the array (start again if it is), add it to the array, and remove it when done.
Are there problems with approach 3, or is there a different way to do it?
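For comparison, here is a minimal sketch of what approach 2 might look like with a Manager queue (names hypothetical; a blocking get() means a worker sleeps until a "pid" is free instead of busy-waiting):

from multiprocessing import Pool, Manager
import subprocess

def longRunningTask(args):
    n, pid_queue = args
    pid = pid_queue.get()              # blocks until some "pid" is free
    try:
        subprocess.call(["ls", "-l"])  # real code would hand pid to the external process
    finally:
        pid_queue.put(pid)             # always return the "pid" for reuse

if __name__ == '__main__':
    manager = Manager()
    pid_queue = manager.Queue()        # Manager proxies can be passed to pool workers
    for pid in ["pid1", "pid2", "pid3"]:   # the pool of available "pids"
        pid_queue.put(pid)
    myArray = [1, 2, 3, 4, 5]
    pool = Pool(processes=3)
    pool.map(longRunningTask, [(n, pid_queue) for n in myArray])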
I'm new to Python and making some headway with threading. I'm doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
import threading
import subprocess

class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                  stdin=decode.stdout)
        encode.communicate()

# some other code puts these threads with various src/dest pairs in a list

for proc in threads:  # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next on the list until it is finished, then continue with the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command line decoder (flac) piped to stdout which is fed into a command line encoder (lame/mp3).
If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire() and a waiting thread will only continue once another thread goes through threadLimiter.release().
"Each of my threads is using subprocess.Popen to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory -- not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.
If you're using the default "cpython" version then this won't help you, because only one thread can execute at a time; look up Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6 -- it makes parallel programming a cinch. You can create a Pool object with 2*num_threads processes, and give it a bunch of tasks to do. It will execute up to 2*num_threads tasks at a time, until all are done.
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.
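A hedged sketch of that suggestion applied to the flac-to-lame pipeline from the question (file names are placeholders; each pool worker just shepherds one decode/encode pair):

import multiprocessing
import subprocess

def encode(job):
    src, dest = job
    decoder = subprocess.Popen(["flac", "--decode", "--stdout", src],
                               stdout=subprocess.PIPE)
    encoder = subprocess.Popen(["lame", "--quiet", "-", dest],
                               stdin=decoder.stdout)
    decoder.stdout.close()   # let lame see EOF when flac exits
    encoder.communicate()

if __name__ == '__main__':
    jobs = [("a.flac", "a.mp3"), ("b.flac", "b.mp3")]   # placeholder src/dest pairs
    procs = 2 * multiprocessing.cpu_count()             # the "2*num_threads" suggestion
    with multiprocessing.Pool(processes=procs) as pool:
        pool.map(encode, jobs)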
It looks to me like what you want is a pool of some sort, and in that pool you would like to have n threads, where n == the number of processors on your system. You would then have another thread whose only job is to feed jobs into a queue which the worker threads pick up and process as they become free (so for a dual core machine, you'd have three threads, but the main thread would be doing very little).
As you are new to Python though, I'll assume you don't know about the GIL and its side-effects with regard to threading. If you read the article I linked, you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It side-steps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions about how you share data (you are working in different memory spaces), but actually this is no bad thing: the restrictions just encourage good practice, such as minimising the contact points between threads (or processes in this case).
In your case you are probably interested in using a pool as specified here.
Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods, check for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past I've written a version (before the subprocess module was added to Python) which used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions, with my own pipe and file descriptor plumbing, etc.
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account etc). It completed the job in just over two hours and had no issues. (There were about 60 of them that were timeouts due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly).
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact (it's a moderately old FreeBSD box). I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
I am not an expert in this, but I have read something about "Lock"s; this article might help you out. Hope this helps!
I would like to add something, just as a reference for others looking to do something similar, but who might have coded things differently from the OP. This question was the first one I came across when searching, and the chosen answer pointed me in the right direction. Just trying to give something back.
import threading
import time

maximumNumberOfThreads = 2
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

def simulateThread(a, b):
    threadLimiter.acquire()
    try:
        # do some stuff
        c = a + b
        print('a + b = ', c)
        time.sleep(3)
    except NameError:  # or some other type of error
        print('some error')
    finally:
        # Release exactly once, whether or not an error occurred.
        # Releasing a BoundedSemaphore more often than it was acquired
        # raises ValueError, so the except block must not also release.
        threadLimiter.release()

threads = []
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(sample)):
    thread = threading.Thread(target=simulateThread, args=(sample[i], 2))
    thread.daemon = True
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
This basically follows what you will find on this site:
https://www.kite.com/python/docs/threading.BoundedSemaphore