Optimizing Multiprocessing in Python (Follow Up: using Queues)

Optimizing Multiprocessing in Python (Follow Up: using Queues) - python

This is a followup question to this. User Will suggested using a queue, I tried to implement that solution below. The solution works just fine with j=1000, however, it hangs as I try to scale to larger numbers. I am stuck here and cannot determine why it hangs. Any suggestions would be appreciated. Also, the code is starting to get ugly as I keep messing with it, I apologize for all the nested functions.
def run4(j):
"""
a multicore approach using queues
"""
from multiprocessing import Process, Queue, cpu_count
import os
def bazinga(uncrunched_queue, crunched_queue):
"""
Pulls the next item off queue, generates its collatz
length and
"""
num = uncrunched_queue.get()
while num != 'STOP': #Signal that there are no more numbers
length = len(generateChain(num, []) )
crunched_queue.put([num , length])
num = uncrunched_queue.get()
def consumer(crunched_queue):
"""
A process to pull data off the queue and evaluate it
"""
maxChain = 0
biggest = 0
while not crunched_queue.empty():
a, b = crunched_queue.get()
if b > maxChain:
biggest = a
maxChain = b
print('%d has a chain of length %d' % (biggest, maxChain))
uncrunched_queue = Queue()
crunched_queue = Queue()
numProcs = cpu_count()
for i in range(1, j): #Load up the queue with our numbers
uncrunched_queue.put(i)
for i in range(numProcs): #put sufficient stops at the end of the queue
uncrunched_queue.put('STOP')
ps = []
for i in range(numProcs):
p = Process(target=bazinga, args=(uncrunched_queue, crunched_queue))
p.start()
ps.append(p)
p = Process(target=consumer, args=(crunched_queue, ))
p.start()
ps.append(p)
for p in ps: p.join()

You're putting 'STOP' poison pills into your uncrunched_queue (as you should), and having your producers shut down accordingly; on the other hand your consumer only checks for emptiness of the crunched queue:
while not crunched_queue.empty():
(this working at all depends on a race condition, btw, which is not good)
When you start throwing non-trivial work units at your bazinga producers, they take longer. If all of them take long enough, your crunched_queue dries up, and your consumer dies. I think you may be misidentifying what's happening - your program doesn't "hang", it just stops outputting stuff because your consumer is dead.
You need to implement a smarter methodology for shutting down your consumer. Either look for n poison pills, where n is the number of producers (who accordingly each toss one in the crunched_queue when they shut down), or use something like a Semaphore that counts up for each live producer and down when one shuts down.

Related

Understanding the speed difference in threading

This is the script both threading functions call:
def searchBadWireless( hub ):
host = f'xxx.xxx.xxx.{hub}'
results = {}
try:
netConnect = ConnectHandler( device_type=platform, ip=host, username=cisco_username, password=cisco_password, )
output = netConnect.send_command( 'sh int status | i 298|299' )
netConnect.disconnect()
results[ int( hub ) ] = output
except:
print( f'{host} - Failed to connect' )
Now the first threading function I have completes in around 7 seconds:
def threadingProcess( execFunction ):
switchList = getSwitchIPs()
start = perf_counter()
threads = []
for ip in switchList:
thread = threading.Thread(target=execFunction, args=( [ip[ 0 ]] ) )
threads.append( thread )
for t in threads:
t.start()
for c in threads:
c.join()
finish = perf_counter()
print(f"It took {finish-start} second(s) to finish.")
But the second one I have runs at around 32 seconds:
def newThreadProcess():
switchList = getSwitchIPs()
start = perf_counter()
with ThreadPoolExecutor() as executor:
results = executor.map(searchBadWireless, switchList)
# for result in results:
# print(result)
finish = perf_counter()
print(f"It took {finish-start} second(s) to finish.")
From what I have read online the better approach is the second function but why does it take so much longer to complete than the first, is there a way of speeding it up to be as fast as the first function?

The first function is faster for the simple reason that all threads are started immediately. If your work items are of number N, you are lunching N threads in parallel. If your machine can handle that load, it will be fast.
For the second function, the ThreadPoolExecutor, by default, limits the number of threads by using a pool of threads. In order to specify the pool size, you need to set the max_workers arguments to the target number of threads.
Doc:
Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
So it seems that the host had a low number of CPUs, thus limiting the number of threads in the pool. Theoretically, if the number of max_workers was equal to N (number of work items), the throughput of both functions would be the same.

random.random() generates same number in multiprocessing

I'm working on an optimization problem, and you can see a simplified version of my code posted below (the origin code is too complicated for asking such a question, and I hope my simplified code has simulated the original one as much as possible).
My purpose:
use the function foo in the function optimization, but foo can take very long time due to some hard situations. So I use multiprocessing to set a time limit for execution of the function (proc.join(iter_time), the method is from an anwser from this question; How to limit execution time of a function call?).
My problem:
In the while loop, every time the generated values for extra are the same.
The list lst's length is always 1, which means in every iteration in the while loop it starts from an empty list.
My guess: possible reason can be each time I create a process the random seed is counting from the beginning, and each time the process is terminated, there could be some garbage collection mechanism to clean the memory the processused, so the list is cleared.
My question
Anyone know the real reason of such problems?
if not using multiprocessing, is there anyway else that I can realize my purpose while generate different random numbers? btw I have tried func_timeout but it has other problems that I cannot handle...
random.seed(123)
lst = [] # a global list for logging data
def foo(epoch):
...
extra = random.random()
lst.append(epoch + extra)
...
def optimization(loop_time, iter_time):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = multiprocessing.Process(target=foo, args=(epoch,))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
if __name__ == '__main__':
optimization(300, 2)

You need to use shared memory if you want to share variables across processes. This is because child processes do not share their memory space with the parent. Simplest way to do this here would be to use managed lists and delete the line where you set a number seed. This is what is causing same number to be generated because all child processes will take the same seed to generate the random numbers. To get different random numbers either don't set a seed, or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process
def foo(epoch, lst):
extra = random.random()
lst.append(epoch + extra)
def optimization(loop_time, iter_time, lst):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = Process(target=foo, args=(epoch, lst))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
print(lst)
if __name__ == '__main__':
manager = Manager()
lst = manager.list()
optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect performance of your code. Alternate to this, you could also use multiprocessing.Array, which is faster than managers but is less flexible in what data it can store, or Queues as well.

Factorial calculation with threading takes too long to finish, How to fix it?

I am trying to calculate whether a given number is prime or not with the formula :
(n-1)! mod n =? (n-1)
I must calculate the factorial with different threads and make them work simultaneously and control if they're all finished and if so then join them. By doing so I will be calculating factorial with different threads and then be able to take the modulo. However even though my code works fine with the small prime numbers it is taking too long to execute when the number is too big. I searched my code and couldn't really find alternative that can slow down the execution time. Here is my code :
import threading
import time
# GLOBAL VARIABLE
result = 1
# worker class in order to multiply on threads
class Worker:
# initiating the worker class
def __init__(self):
super().__init__()
self.jobs = []
# the function that does the actual multiplying
def multiplier(self,beg,end):
global result
for i in range(beg,end+1):
result*= i
#print("\tresult updated with *{}:".format(i),result)
#print("Calculating from {} to {}".format(beg,end)," : ",result)
# appending threads to the object
def append_job(self,job):
self.jobs.append(job)
# function that is to see the threads
def see_jobs(self):
return self.jobs
# initiating the threads
def initiate(self):
for j in self.jobs:
j.start()
# finalizing and joining the threads
def finalize(self):
for j in self.jobs:
j.join()
# controlling the threads by blocking them untill all threads are asleep
def work(self):
while True:
if 0 == len([t for t in self.jobs if t.is_alive()]):
self.finalize()
break
# this is the function to split the factorial into several threads
def splitUp(n,t):
# defining the remainder and the whole
remainder, whole = (n-1) % t, (n-1) // t
# deciding to tuple count
tuple_count = whole if remainder == 0 else whole + 1
# empty result list
result = []
# iterating
beginning = 1
end = (n-1) // t
for i in range(1,tuple_count+1):
if i == tuple_count:
result.append((beginning,n-1)) # if we are at the end, just append all to end
else:
result.append((beginning,end*i))
beginning = end*i + 1
return result
if __name__ == "__main__":
threads = 64
number = 743
splitted = splitUp(number,threads)
worker = Worker()
#print(worker.see_jobs())
s = time.time()
# creating the threads
for arg in splitted:
thread = threading.Thread(target=worker.multiplier(arg[0],arg[1]))
worker.append_job(thread)
worker.initiate()
worker.work()
e = time.time()
print("result found with {} threads in {} secs\n".format(threads,e-s))
if result % number == number-1:
print("PRIME")
else:
print("NOT PRIME")
"""
-------------------- REPORT ------------------------
result found with 2 threads in 6.162530899047852 secs
result found with 4 threads in 0.29897499084472656 secs
result found with 16 threads in 0.009003162384033203 secs
result found with 32 threads in 0.0060007572174072266 secs
result found with 64 threads in 0.0029952526092529297 secs
note that: these results may differ from machine to machine
-------------------------------------------------------
"""
Thanks in advance.

First and foremost, you have a critical error in your code that you haven't reported or tried to trace:
======================== RESTART: ========================
result found with 2 threads in 5.800899267196655 secs
NOT PRIME
>>>
======================== RESTART: ========================
result found with 64 threads in 0.002002716064453125 secs
PRIME
>>>
As the old saying goes, "if the program doesn't work, it doesn't matter how fast it is".
The only test case you've given is 743; if you want help to diagnose the logic error, please determine the minimal test and parallelism that causes the error, and post a separate question.
I suspect that it's with your multplier function, as you're working with an ill-advised global variable in parallel processing, and your multiply operation is not thread-safe.
In assembly terms, you have an unguarded region:
LOAD result
MUL i
STORE result
If this is interleaved with the same work from another process, the result is virtually guaranteed to be wrong. You have to make this a critical region.
Once you fix that, you still have your speed problem. Factorial is the poster-child for recursion acceleration. I see two obvious accelerants:
Instead of that horridly slow multiplication loop, use functools.reduce to blast through your multiplication series.
If you're going to loop the program with a series of inputs, then short-cut most of the calculations with memoization. The example on the linked page benefits greatly from multiple-recursion; since factorial is linear, you'd need repeated application to take advantage of the technique.

Multiprocessing Running Slower than a Single Process

I'm attempting to use multiprocessing to run many simulations across multiple processes; however, the code I have written only uses 1 of the processes as far as I can tell.
Updated
I've gotten all the processes to work (I think) thanks to #PaulBecotte ; however, the multiprocessing seems to run significantly slower than its non-multiprocessing counterpart.
For instance, not including the function and class declarations/implementations and imports, I have:
def monty_hall_sim(num_trial, player_type='AlwaysSwitchPlayer'):
if player_type == 'NeverSwitchPlayer':
player = NeverSwitchPlayer('Never Switch Player')
else:
player = AlwaysSwitchPlayer('Always Switch Player')
return (MontyHallGame().play_game(player) for trial in xrange(num_trial))
def do_work(in_queue, out_queue):
while True:
try:
f, args = in_queue.get()
ret = f(*args)
for result in ret:
out_queue.put(result)
except:
break
def main():
logging.getLogger().setLevel(logging.ERROR)
always_switch_input_queue = multiprocessing.Queue()
always_switch_output_queue = multiprocessing.Queue()
total_sims = 20
num_processes = 5
process_sims = total_sims/num_processes
with Timer(timer_name='Always Switch Timer'):
for i in xrange(num_processes):
always_switch_input_queue.put((monty_hall_sim, (process_sims, 'AlwaysSwitchPlayer')))
procs = [multiprocessing.Process(target=do_work, args=(always_switch_input_queue, always_switch_output_queue)) for i in range(num_processes)]
for proc in procs:
proc.start()
always_switch_res = []
while len(always_switch_res) != total_sims:
always_switch_res.append(always_switch_output_queue.get())
always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))
print '\tLength of Always Switch Result List: {alw_sw_len}'.format(alw_sw_len=len(always_switch_res))
print '\tThe success average of switching doors was: {alw_sw_prob}'.format(alw_sw_prob=always_switch_success)
which yields:
Time Elapsed: 1.32399988174 seconds
Length: 20
The success average: 0.6
However, I am attempting to use this for total_sims = 10,000,000 over num_processes = 5, and doing so has taken significantly longer than using 1 process (1 process returned in ~3 minutes). The non-multiprocessing counterpart I'm comparing it to is:
def main():
logging.getLogger().setLevel(logging.ERROR)
with Timer(timer_name='Always Switch Monty Hall Timer'):
always_switch_res = [MontyHallGame().play_game(AlwaysSwitchPlayer('Monty Hall')) for x in xrange(10000000)]
always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))
print '\n\tThe success average of not switching doors was: {not_switching}' \
'\n\tThe success average of switching doors was: {switching}'.format(not_switching=never_switch_success,
switching=always_switch_success)

You could try import “process “ under some if statements

EDIT- you changed some stuff, let me try and explain a bit better.
Each message you put into the input queue will cause the monty_hall_sim function to get called and send num_trial messages to the output queue.
So your original implementation was right- to get 20 output messages, send in 5 input messages.
However, your function is slightly wrong.
for trial in xrange(num_trial):
res = MontyHallGame().play_game(player)
yield res
This will turn the function into a generator that will provide a new value on each next() call- great! The problem is here
while True:
try:
f, args = in_queue.get(timeout=1)
ret = f(*args)
out_queue.put(ret.next())
except:
break
Here, on each pass through the loop you create a NEW generator with a NEW message. The old one is thrown away. So here, each input message only adds a single output message to the queue before you throw it away and get another one. The correct way to write this is-
while True:
try:
f, args = in_queue.get(timeout=1)
ret = f(*args)
for result in ret:
out_queue.put(ret.next())
except:
break
Doing it this way will continue to yield output messages from the generator until it finishes (after yielding 4 messages in this case)

I was able to get my code to run significantly faster by changing monty_hall_sim's return to a list comprehension, having do_work add the lists to the output queue, and then extend the results list of main with the lists returned by the output queue. Made it run in ~13 seconds.

Set the nr of executions per second using Python's multiprocessing

I wrote a script in Python 3.6 initially using a for loop which called an API, then putting all results into a pandas dataframe and writing them to a SQL database. (approximately 9,000 calls are made to that API every time the script runs).
Realising the calls inside the for loop were processed one-by-one, I decided to use the multiprocessing module to speed things up.
Therefore, I created a module level function called parallel_requests and now I call that instead of having the for loop:
list_of_lists = multiprocessing.Pool(processes=4).starmap(parallel_requests, zip(....))
Side note: I use starmap instead of map only because my parallel_requests function takes multiple arguments which I need to zip.
The good: this approach works and is much faster.
The bad: this approach works but is too fast. By using 4 processes (I tried that because I have 4 cores), parallel_requests is getting executed too fast. More than 15 calls per second are made to the API, and I'm getting blocked by the API itself.
In fact, it only works if I use 1 or 2 processes, otherwise it's too damn fast.
Essentially what I want is to keep using 4 processes, but also to limit the execution of my parallel_requests function to only 15 times per second overall.
Is there any parameter of multiprocessing.Pool that would help with this, or it's more complicated than that?

For this case I'd use a leaky bucket. You can have one process that fills a queue at the proscribed rate, with a maximum size that indicates how many requests you can "bank" if you don't make them at the maximum rate; the worker processes then just need to get from the queue before doing its work.
import time
def make_api_request(this, that, rate_queue):
rate_queue.get()
print("DEBUG: doing some work at {}".format(time.time()))
return this * that
def throttler(rate_queue, interval):
try:
while True:
if not rate_queue.full(): # avoid blocking
rate_queue.put(0)
time.sleep(interval)
except BrokenPipeError:
# main process is done
return
if __name__ == '__main__':
from multiprocessing import Pool, Manager, Process
from itertools import repeat
rq = Manager().Queue(maxsize=15) # conservative; no banking
pool = Pool(4)
Process(target=throttler, args=(rq, 1/15.)).start()
pool.starmap(make_api_request, zip(range(100), range(100, 200), repeat(rq)))

I'll look at the ideas posted here, but in the meantime I've just used a simple approach of opening and closing a Pool of 4 processes for every 15 requests and appending all the results in a list_of_lists.
Admittedly, not the best approach, since it takes time/resources to open/close a Pool, but it was the most handy solution for now.
# define a generator for use below
def chunks(l, n):
"""Yield successive n-sized chunks from l."""
for i in range(0, len(l), n):
yield l[i:i + n]
list_of_lists = []
for current_chunk in chunks(all_data, 15): # 15 is the API's limit of requests per second
pool = multiprocessing.Pool(processes=4)
res = pool.starmap(parallel_requests, zip(current_chunk, [to_symbol]*len(current_chunk), [query]*len(current_chunk), [start]*len(current_chunk), [stop]*len(current_chunk)) )
sleep(1) # Sleep for 1 second after every 15 API requests
list_of_lists.extend(res)
pool.close()
flatten_list = [item for sublist in list_of_lists for item in sublist] # use this to construct a `pandas` dataframe
PS: This solution is really not at all that fast due to the multiple opening/closing of pools. Thanks Nathan Vērzemnieks for suggesting to open just one pool, it's much faster, plus your processor won't look like it's running a stress test.

One way to do is to use Queue, which can share details about api-call timestamps with other processes.
Below is an example how this could work. It takes the oldest entry in queue, and if it is younger than one second, sleep functions is called for the duration of the difference.
from multiprocessing import Pool, Manager, queues
from random import randint
import time
MAX_CONNECTIONS = 10
PROCESS_COUNT = 4
def api_request(a, b):
time.sleep(randint(1, 9) * 0.03) # simulate request
return a, b, time.time()
def parallel_requests(a, b, the_queue):
try:
oldest = the_queue.get()
time_difference = time.time() - oldest
except queues.Empty:
time_difference = float("-inf")
if 0 < time_difference < 1:
time.sleep(1-time_difference)
else:
time_difference = 0
print("Current time: ", time.time(), "...after sleeping:", time_difference)
the_queue.put(time.time())
return api_request(a, b)
if __name__ == "__main__":
m = Manager()
q = m.Queue(maxsize=MAX_CONNECTIONS)
for _ in range(0, MAX_CONNECTIONS): # Fill the queue with zeroes
q.put(0)
p = Pool(PROCESS_COUNT)
# Create example data
data_length = 100
data1 = range(0, data_length) # Just some dummy-data
data2 = range(100, data_length+100) # Just some dummy-data
queue_iterable = [q] * (data_length+1) # required for starmap -function
list_of_lists = p.starmap(parallel_requests, zip(data1, data2, queue_iterable))
print(list_of_lists)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimizing Multiprocessing in Python (Follow Up: using Queues) - python

Related

Understanding the speed difference in threading

random.random() generates same number in multiprocessing

Factorial calculation with threading takes too long to finish, How to fix it?

Multiprocessing Running Slower than a Single Process

Set the nr of executions per second using Python's multiprocessing

Categories

Resources