I am trying to implement multiprocessing in a Python program where I need to run some CPU intensive code. In my test code the multiprocessing Queue and the multiprocessing Pool are both slower than a normal loop with no multiprocessing. During the Pool section of my code, I can see that the CPU usage is maxed out. However, it is still slower than the normal loop! Is there an issue with my code?
import time
from multiprocessing import Process
from multiprocessing import Queue
from multiprocessing import Pool
import random

def run_sims(iterations):
    sim_list = []
    for i in range(iterations):
        sim_list.append(random.uniform(0,1))
    print(iterations, "count", sum(sim_list)/len(sim_list))
    return (sum(sim_list)/len(sim_list))

def worker(queue):
    i = 0
    while not queue.empty():
        task = queue.get()
        run_sims(task)
        i = i + 1

if __name__ == '__main__':
    queue = Queue()
    iterations_list = [30000000, 30000000, 30000000, 30000000, 30000000]
    it_len = len(iterations_list)

    ## Queue ##
    print("#STARTING QUEUE#")
    start_t = time.perf_counter()
    for i in range(it_len):
        iterations = iterations_list[i]
        queue.put(iterations)
    process = Process(target=worker, args=(queue, ))
    process.start()
    process.join()
    end_t = time.perf_counter()
    print("Queue time: ", end_t - start_t)

    ## Pool ##
    print("#STARTING POOL#")
    start_t = time.perf_counter()
    with Pool() as pool:
        results = pool.imap_unordered(run_sims, iterations_list)
        for res in results:
            res
    end_t = time.perf_counter()
    print("Pool time: ", end_t - start_t)

    ## No Multiprocessing - Normal Loop
    print("#STARTING NORMAL LOOP#")
    start_t = time.perf_counter()
    for i in iterations_list:
        run_sims(i)
    end_t = time.perf_counter()
    print("Normal time: ", end_t - start_t)
I've tried the above code but the multiprocessing sections are slower than the normal loop:
Queue Time: 59 seconds
Pool Time: 83 seconds
Normal Loop Time: 55 seconds
My expectation is that Queue and Pool would be significantly faster than the normal loop.
I added worker processes to the queue code so that it performs about the same as the pool. On my machine, both the queue and the pool versions were significantly faster than the sequential loop. I have 4 cores and 8 logical CPUs. Since this is a CPU-bound task, performance will vary depending on the number of available CPUs and whatever other work is going on on the machine.
This script keeps the number of workers below the CPU count. If these were network-bound tasks, a larger pool could potentially perform faster. Disk-bound tasks would likely not benefit from a larger pool.
import time
from multiprocessing import Process
from multiprocessing import Queue
from multiprocessing import Pool
from multiprocessing import cpu_count
import random

def run_sims(iterations):
    sim_list = []
    for i in range(iterations):
        sim_list.append(random.uniform(0,1))
    print(iterations, "count", sum(sim_list)/len(sim_list))
    return (sum(sim_list)/len(sim_list))

def worker(queue):
    i = 0
    while not queue.empty():
        task = queue.get()
        run_sims(task)
        i = i + 1

if __name__ == '__main__':
    iteration_count = 5

    queue = Queue()
    iterations_list = [30000000] * iteration_count
    it_len = len(iterations_list)

    # guess a parallel execution size. CPU bound, and we want some
    # room for other processes.
    pool_size = max(min(cpu_count() - 2, len(iterations_list)), 2)
    print("Pool size", pool_size)

    ## Queue ##
    print("#STARTING QUEUE#")
    start_t = time.perf_counter()
    for iterations in iterations_list:
        queue.put(iterations)
    processes = []
    for i in range(pool_size):
        processes.append(Process(target=worker, args=(queue, )))
        processes[-1].start()
    for process in processes:
        process.join()
    end_t = time.perf_counter()
    print("Queue time: ", end_t - start_t)

    ## Pool ##
    print("#STARTING POOL#")
    start_t = time.perf_counter()
    with Pool(pool_size) as pool:
        results = pool.imap_unordered(run_sims, iterations_list)
        for res in results:
            res
    end_t = time.perf_counter()
    print("Pool time: ", end_t - start_t)

    ## No Multiprocessing - Normal Loop
    print("#STARTING NORMAL LOOP#")
    start_t = time.perf_counter()
    for i in iterations_list:
        run_sims(i)
    end_t = time.perf_counter()
    print("Normal time: ", end_t - start_t)
Related
In the following program, running my_function in a subprocess via run_process_timeout_wrapper leads to a timeout (over 160 s), while running it "normally" takes less than a second.
import multiprocessing
from multiprocessing import Process, Queue
from queue import Empty  # Queue.get(timeout=...) raises queue.Empty
import time
import numpy as np
import xgboost

def run_process_timeout_wrapper(function, args, timeout):

    def foo(n, out_q):
        res = function(*n)
        out_q.put(res)  # to get result back from thread target

    result_q = Queue()
    p = Process(target=foo, args=(args, result_q))
    p.start()

    try:
        x = result_q.get(timeout=timeout)
    except Empty as e:
        p.terminate()
        raise multiprocessing.TimeoutError("Timed out after waiting for {}s".format(timeout))

    p.terminate()
    return x

def my_function(fun):
    print("Started")
    t1 = time.time()
    pol = xgboost.XGBRegressor()
    pol.fit(np.random.rand(5,1500), np.random.rand(50,1))
    print("Took ", time.time() - t1)
    pol.predict(np.random.rand(2,1500))
    return 5

if __name__ == '__main__':
    t1 = time.time()
    pol = xgboost.XGBRegressor()
    pol.fit(np.random.rand(50,150000), np.random.rand(50,1))
    print("Took ", time.time() - t1)

    my_function(None)

    t1 = time.time()
    res = run_process_timeout_wrapper(my_function, (None,), 160)
    print("Res ", res, " Time ", time.time() - t1)
I am running this on Linux. Since it has come up, I have also added a print at the beginning of my_function to show that the function is at least reached.
From this issue I gathered that forking a multi-threaded application is problematic. One possible solution is to add:
import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method('spawn')
However, this may lead to other issues.
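If changing the start method globally is too invasive, a narrower alternative is to build the Process and Queue from a per-call 'spawn' context. The sketch below is illustrative (the wrapper and helper names are mine, not the original code); note that with 'spawn' the process target must be picklable, so the helper has to live at module level:
import multiprocessing as mp
from queue import Empty

def _call_and_put(function, args, out_q):
    # module-level helper: 'spawn' requires the Process target to be picklable
    out_q.put(function(*args))

def run_with_spawn(function, args, timeout):
    ctx = mp.get_context("spawn")   # per-call context; the global default stays untouched
    result_q = ctx.Queue()
    p = ctx.Process(target=_call_and_put, args=(function, args, result_q))
    p.start()
    try:
        return result_q.get(timeout=timeout)
    except Empty:
        raise mp.TimeoutError("Timed out after waiting for {}s".format(timeout))
    finally:
        p.terminate()
The rest of the program keeps the platform's default start method, which avoids the "other issues" that a global set_start_method('spawn') can introduce.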
Using the code below I start two processes: write_process writes to a queue and read_process reads from the queue:
import time
from multiprocessing import Process, Queue, Pool

class QueueFun():
    def writing_queue(self, work_tasks):
        while True:
            print("Writing to queue")
            work_tasks.put(1)
            time.sleep(1)

    def read_queue(self, work_tasks):
        while True:
            print('Reading from queue')
            work_tasks.get()
            time.sleep(2)

if __name__ == '__main__':
    q = QueueFun()
    work_tasks = Queue()

    write_process = Process(target=q.writing_queue,
                            args=(work_tasks,))
    write_process.start()

    read_process = Process(target=q.read_queue,
                           args=(work_tasks,))
    read_process.start()

    write_process.join()
    read_process.join()
Running the above code prints:
Writing to queue
Reading from queue
Writing to queue
Reading from queue
Writing to queue
Writing to queue
Reading from queue
Writing to queue
How do I start N processes to read from the queue?
I tried starting 3 processes using the code below, but only 1 process is started. Is this because the .join() prevents the second process from starting?
for i in range(0, 3):
    read_process = Process(target=q.read_queue,
                           args=(work_tasks,))
    print('Starting read_process', i)
    read_process.start()
    read_process.join()
I also considered using a Pool as described in https://docs.python.org/2/library/multiprocessing.html, but this seems relevant only for transforming an existing collection:
print pool.map(f, range(10))
How do I start N processes where each one reads from a shared queue?
You can just put them in a list and join them outside of the creation loop:
if __name__ == '__main__':
    q = QueueFun()
    work_tasks = Queue()

    write_process = Process(target=q.writing_queue,
                            args=(work_tasks,))
    write_process.start()

    processes = []
    for i in range(0, 5):
        processes.append(Process(target=q.read_queue,
                                 args=(work_tasks,)))
    for p in processes:
        p.start()

    write_process.join()
    for p in processes:
        p.join()
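One caveat: since read_queue and writing_queue loop forever, the join() calls above never return. A common way to let the readers exit cleanly (a sketch that is not part of the original answer; the sentinel value and counts are illustrative) is to have the writer put one sentinel per reader when it is done:
from multiprocessing import Process, Queue

NUM_READERS = 5
SENTINEL = None  # illustrative "no more work" marker

def writer(work_tasks):
    for i in range(20):
        work_tasks.put(i)
    for _ in range(NUM_READERS):
        work_tasks.put(SENTINEL)   # one sentinel per reader

def reader(work_tasks):
    while True:
        item = work_tasks.get()
        if item is SENTINEL:       # no more work: exit so join() can return
            break
        print("Reading from queue:", item)

if __name__ == '__main__':
    work_tasks = Queue()
    readers = [Process(target=reader, args=(work_tasks,)) for _ in range(NUM_READERS)]
    for p in readers:
        p.start()
    write_process = Process(target=writer, args=(work_tasks,))
    write_process.start()
    write_process.join()
    for p in readers:
        p.join()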
I have code that stresses CPU cores. I want to run some of the cores at 100% while the rest run at 0%. The logic I've used to run cores at 100% is:
# Pass the CPU core number as affinity
def loop(conn, affinity):
    proc = psutil.Process()
    proc_info = proc.pid
    msg = "Process ID: " + str(proc_info) + " CPU: " + str(affinity[0])
    conn.send(msg)
    conn.close()
    proc.cpu_affinity(affinity)  # Allocate a certain CPU core for this process
    while True:
        1*1
The cores executing this code run at 100%.
I wrote another function and attach the remaining cores to processes executing it:
def rest_cores(affinity, exec_time):
    proc = psutil.Process()
    proc.cpu_affinity(affinity)
    time.sleep(exec_time)
According to this logic, the cores should suspend execution for the exec_time and be at 0%. But the cores run at a higher percentage. How do I ensure that all the remaining cores are running at 0%?
Here is the full logic:
from multiprocessing import Process, Pipe
import os
import signal
import sys
import time
import psutil

def loop(conn, affinity):
    proc = psutil.Process()
    proc_info = proc.pid
    msg = "Process ID: " + str(proc_info) + " CPU: " + str(affinity[0])
    conn.send(msg)
    conn.close()
    proc.cpu_affinity(affinity)
    while True:
        1*1

def rest_cores(affinity, exec_time):
    proc = psutil.Process()
    proc.cpu_affinity(affinity)
    time.sleep(exec_time)

def cpu_stress():
    procs = []
    conns = []
    n_cpu = psutil.cpu_count(logical=True)
    proc_num = n_cpu // 2  # Half the cores will run at 100%

    for i in range(proc_num):  # Initial half of the total cores
        parent_conn, child_conn = Pipe()
        p = Process(target=loop, args=(child_conn, [i]))
        p.start()
        procs.append(p)
        conns.append(parent_conn)

    for i in range(proc_num + 1, n_cpu):  # Final half of total cores
        parent_conn, child_conn = Pipe()
        p = Process(target=rest_cores, args=([i], exec_time))
        p.start()
        procs.append(p)

    for conn in conns:
        try:
            print(conn.recv())
        except EOFError:
            continue

    time.sleep(exec_time)

    for p in procs:
        p.terminate()

cpu_stress()
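One way to check which cores are actually idle while this runs (a sketch that is not from the original post) is to sample per-core utilization with psutil; cores pinned to the sleeping rest_cores processes should read close to 0%, and anything above that is other system activity rather than this script:
import psutil

if __name__ == '__main__':
    # one utilization sample per logical CPU over a 1-second window
    per_core = psutil.cpu_percent(interval=1, percpu=True)
    for core, pct in enumerate(per_core):
        print("core", core, ":", pct, "%")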
I have found several other questions that touch on this topic but none that are quite like my situation.
I have several very large text files (3+ gigabytes in size).
I would like to process them (say 2 documents) in parallel using multiprocessing. As part of my processing (within a single process) I need to make an API call, and because of this I would like each process to have its own threads running asynchronously.
I have come up with a simplified example (I have commented the code to try to explain what I think it should be doing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time

def process_huge_file(*, file_, batch_size=250, num_threads=4):
    # create an APICaller instance for each process, each with its own Queue
    api_call = APICaller()
    batch = []
    # create threads that will run asynchronously to make API calls
    # I expect these to immediately block since there is nothing in the Queue
    # (which is what api_call.run depends on to make a call)
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    ####
    # start processing the file line by line
    for line in file_:
        # if we are at our batch size, add the batch to the api_call to let
        # the threads do their api calling
        if i % batch_size == 0:
            api_call.queue.put(batch)
        else:
            # add fake line to batch
            batch.append(fake_line)

class APICaller:
    def __init__(self):
        # thread-safe queue to feed the threads that point at instances
        # of these APICaller objects
        self.queue = Queue()

    def run(self):
        print("waiting for something to do")
        self.queue.get()
        print("processing item in queue")
        time.sleep(0.1)
        print("finished processing item in queue")

if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 1000 for i in range(2)]
    ####
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
As the code is now, "waiting for something to do" prints 8 times (which makes sense: 4 threads per process), and then it stops or "deadlocks", which is not what I expect. I expect it to start sharing time with the threads as soon as I start putting items in the Queue, but the code does not appear to make it that far. I would ordinarily step through the code to find the hang-up, but I still don't have a solid understanding of how best to debug threads (another topic for another day).
In the meantime, can someone help me figure out why my code is not doing what it should be doing?
I have made a few adjustments and additions and the code appears to do what it is supposed to now. The main adjustments are: adding a CloseableQueue class (from Brett Slatkin's Effective Python, Item 55), and ensuring that I call close and join on the queue so that the threads exit properly. Full code with these changes below:
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time

from concurrency_utils import CloseableQueue

def sync_process_huge_file(*, file_, batch_size=250):
    batch = []
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            time.sleep(0.1)
            batch = []
            # api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)

def process_huge_file(*, file_, batch_size=250, num_threads=4):
    api_call = APICaller()
    batch = []

    # api call threads
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()

    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)

    for _ in threads:
        api_call.queue.close()
    api_call.queue.join()
    for thread in threads:
        thread.join()

class APICaller:
    def __init__(self):
        self.queue = CloseableQueue()

    def run(self):
        for item in self.queue:
            print("waiting for something to do")
            print("processing item in queue")
            time.sleep(0.1)
            print("finished processing item in queue")
        print("exiting run")

if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 10000
    fake_docs = [[fake_line] * 10000 for i in range(2)]
    ####
    time_s = time.time()

    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()

    time_e = time.time()
    print(f"took {time_e - time_s} ")

class CloseableQueue(Queue):
    SENTINEL = object()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # exit thread
                yield item
            finally:
                self.task_done()
As expected, this is a great speedup over running synchronously: 120 seconds vs. 50 seconds.
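For reference, here is a minimal standalone sketch of the shutdown pattern used above, assuming the same CloseableQueue (imported from concurrency_utils as in the full program): calling close() once per worker puts one SENTINEL per worker, so every thread's for-loop terminates, and join() waits until every item has been marked task_done:
from threading import Thread

from concurrency_utils import CloseableQueue  # same helper as in the full program

def worker(q, name):
    for item in q:                 # __iter__ stops when it sees the SENTINEL
        print(name, "processed", item)

if __name__ == "__main__":
    q = CloseableQueue()
    threads = [Thread(target=worker, args=(q, f"t{i}")) for i in range(2)]
    for t in threads:
        t.start()
    for i in range(6):
        q.put(i)                   # enqueue some work
    for _ in threads:
        q.close()                  # one sentinel per worker thread
    q.join()                       # wait until every item is task_done()
    for t in threads:
        t.join()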
I'm doing a simple multiprocessing test and something seems off. I'm running this on an i5-6200U at 2.3 GHz with Turbo Boost.
from multiprocessing import Process, Queue
import time

def multiply(a, b, que):  # add an argument to the function for assigning a queue
    que.put(a*b)  # we're putting the return value into the queue

if __name__ == '__main__':
    queue1 = Queue()  # create a queue object
    jobs = []
    start_time = time.time()

    ##### PARALLEL ####################################
    for i in range(0, 400):
        p = Process(target=multiply, args=(5, i, queue1))
        jobs.append(p)
        p.start()

    for j in jobs:
        j.join()

    print("PARALLEL %s seconds ---" % (time.time() - start_time))

    ##### SERIAL ################################
    start_time = time.time()
    for i in range(0, 400):
        multiply(5, i, queue1)
    print("SERIAL %s seconds ---" % (time.time() - start_time))
Output:
PARALLEL 22.12951421737671 seconds ---
SERIAL 0.004009723663330078 seconds ---
Help is much appreciated.
Here's a brief example of (silly) code that gets a nice speedup. As already covered in comments, it doesn't create an absurd number of processes, and the work done per remote function invocation is high compared to interprocess communication overheads.
import multiprocessing as mp
import time

def factor(n):
    for i in range(n):
        pass
    return n

if __name__ == "__main__":
    ns = range(100000, 110000)

    s = time.time()
    p = mp.Pool(4)
    got = p.map(factor, ns)
    print(time.time() - s)
    assert got == list(ns)

    s = time.time()
    got = [factor(n) for n in ns]
    print(time.time() - s)
    assert got == list(ns)
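For tiny tasks like the multiply example in the question, the same idea can be sketched with a small Pool instead of 400 short-lived Processes (this illustration is mine, not part of the original answer); chunksize lets map batch many multiplications per message so the interprocess communication is amortized:
import multiprocessing as mp
import time

def multiply(args):
    a, b = args
    return a * b

if __name__ == "__main__":
    start = time.time()
    with mp.Pool(4) as pool:
        # chunksize sends the work in batches instead of one task per message
        results = pool.map(multiply, [(5, i) for i in range(400)], chunksize=50)
    print("POOL %s seconds ---" % (time.time() - start))
    assert results == [5 * i for i in range(400)]
Even so, for work this small the serial loop will usually still win; the point of the answer above is that the per-call work has to dominate the IPC and process startup cost before multiprocessing pays off.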