I have an iterator which contains a lot of data (larger than memory) and I want to be able to perform some actions on this data. To do this quickly I am using the multiprocessing module.
def __init__(self, poolSize, spaceTimeTweetCollection=None):
    super().__init__()
    self.tagFreq = {}
    if spaceTimeTweetCollection is not None:
        q = Queue()
        processes = [Process(target=self.worker, args=(q,)) for i in range(poolSize)]
        for p in processes:
            p.start()
        for tweet in spaceTimeTweetCollection:
            q.put(tweet)
        for p in processes:
            p.join()
The aim is that I create some processes which listen on the queue:
def worker(self, queue):
    tweet = queue.get()
    self.append(tweet)  # performs some actions on data
I then loop over the iterator and add the data to the queue. As the queue.get() in the worker method is blocking, the workers should start performing actions on the data as they receive it from the queue.
However, instead each worker on each processor is run once and that's it! So if poolSize is 8 it will read the first 8 items in the queue, perform the actions on 8 different processes, and then it will finish! Does anyone know why this is happening? I am running this on Windows.
Edit
I wanted to mention that even though this is all being done in a class, the class is called in __main__ like so:
if __name__ == '__main__':
    tweetDatabase = Database()
    dataSet = tweetDatabase.read2dBoundingBox(boundaryBox)
    freq = TweetCounter(8, dataSet)  # this is where the multiprocessing is done
Your worker is to blame, I believe. It just does one thing and then dies. Try:
def worker(self, queue):
    while True:
        tweet = queue.get()
        self.append(tweet)
(I'd take a look at Pool though)
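For the Pool route, a minimal sketch (assuming the per-tweet work can be expressed as a plain function, here the hypothetical process_tweet, with spaceTimeTweetCollection being the iterable from the question and 8 mirroring poolSize):

from multiprocessing import Pool

def process_tweet(tweet):
    # hypothetical stand-in for whatever self.append() does to one tweet
    return tweet

if __name__ == '__main__':
    with Pool(8) as pool:
        # imap consumes the iterator lazily, so the whole collection
        # never has to sit in memory at once
        for result in pool.imap(process_tweet, spaceTimeTweetCollection, chunksize=100):
            pass  # aggregate results (e.g. update tagFreq) here in the parent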
Related
I'm trying to write code in which there is a single queue and many workers (producer_consumer in the example) that process objects in the queue. I need to use multiprocessing since the code the workers are going to execute is CPU-bound. The setup is the following:
The queue is initialized by the parent process with some initial values (names in the example), then it starts the workers.
Workers start getting elements from the queue, and after processing an element each worker may produce a new object to be inserted (...and then processed by someone else) into the queue.
All this goes on until the queue is empty. When this happens I would like all the workers to stop and control to be given back to the parent to conclude the execution.
I wrote this example in which workers correctly process elements and produce new objects into the queue, but the problem is that the execution hangs when the queue is empty. Any suggestions?
Thanks in advance
import time
import os
import random
import string
from multiprocessing import Process, Queue, Lock
# Produces and consumes names in and from the Queue.
def producer_consumer(queue, lock):
    # Synchronize access to the console
    with lock:
        print('Starting consumer => {}'.format(os.getpid()))
    while not queue.empty():
        time.sleep(random.randint(0, 10))
        # If the queue is empty, queue.get() will block until the queue has data
        name = queue.get()
        if random.random() < 0.7:
            product = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
            queue.put(product)
        else:
            product = 'nothing'
        # Synchronize access to the console
        with lock:
            print('{} got {}, produced {}'.format(os.getpid(), name, product))
if __name__ == '__main__':
    # Create the Queue object
    queue = Queue()
    # Create a lock object to synchronize resource access
    lock = Lock()
    producer_consumers = []
    names = ['Mario', 'Peppino', 'Francesco', 'Carlo', 'Ermenegildo']
    for name in names:
        queue.put(name)
    for _ in range(5):
        producer_consumers.append(Process(target=producer_consumer, args=(queue, lock)))
    for process in producer_consumers:
        process.start()
    for p in producer_consumers:
        p.join()
    print('Parent process exiting...')
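For illustration, a minimal sketch of one common way out of that hang: a JoinableQueue so the parent knows when every element (including the ones produced by workers) has been processed, plus one None sentinel per worker so each process knows when to stop. Both are assumptions layered on top of the code above, not part of it.

import os
import random
import string
from multiprocessing import Process, JoinableQueue

def producer_consumer(queue):
    while True:
        name = queue.get()
        if name is None:          # sentinel: no more work for this worker
            queue.task_done()
            break
        if random.random() < 0.7:
            product = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
            queue.put(product)    # new work is registered before task_done() runs
        queue.task_done()
        print('{} processed {}'.format(os.getpid(), name))

if __name__ == '__main__':
    queue = JoinableQueue()
    for name in ['Mario', 'Peppino', 'Francesco', 'Carlo', 'Ermenegildo']:
        queue.put(name)
    workers = [Process(target=producer_consumer, args=(queue,)) for _ in range(5)]
    for w in workers:
        w.start()
    queue.join()                  # returns once every put() has a matching task_done()
    for _ in workers:
        queue.put(None)           # one sentinel per worker
    for w in workers:
        w.join()
    print('Parent process exiting...')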
I'm trying to launch a function (my_function) and stop its execution after a certain time is reached.
So I turned to the multiprocessing library, and everything works well. Here is the code, where my_function() has been changed to only create a dummy message.
from multiprocessing import Queue, Process
from multiprocessing.queues import Empty
import time
timeout=1
# timeout=3
def my_function(something):
    time.sleep(2)
    return f'my message: {something}'

def wrapper(something, queue):
    message = "too late..."
    try:
        message = my_function(something)
        return message
    finally:
        queue.put(message)

try:
    queue = Queue()
    params = ("hello", queue)
    child_process = Process(target=wrapper, args=params)
    child_process.start()
    output = queue.get(timeout=timeout)
    print(f"ok: {output}")
except Empty:
    timeout_message = f"Timeout {timeout}s reached"
    print(timeout_message)
finally:
    if 'child_process' in locals():
        child_process.kill()
You can test and verify that, depending on timeout=1 or timeout=3, I can trigger the timeout error or not.
My main problem is that the real my_function() is a torch model inference for which I would like to limit the number of threads (to 4, let's say).
One can easily do so if my_function is run in the main process, but in my example I tried a lot of tricks to limit it in the child process without any success (using threadpoolctl.threadpool_limits(4), torch.set_num_threads(4), os.environ["OMP_NUM_THREADS"]=4, os.environ["MKL_NUM_THREADS"]=4).
I'm completely open to other solutions that can monitor the execution time of a function while limiting the number of threads used by that function.
thanks
Regards
You can limit simultaneous processes with Pool. (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool)
You can also set the maximum number of tasks done per child. Check it out.
Here is a sample from superfastpython by Jason Brownlee:
# SuperFastPython.com
# example of limiting the number of tasks per child in the process pool
from time import sleep
from multiprocessing.pool import Pool
from multiprocessing import current_process

# task executed in a worker process
def task(value):
    # get the current process
    process = current_process()
    # report a message
    print(f'Worker is {process.name} with {value}', flush=True)
    # block for a moment
    sleep(1)

# protect the entry point
if __name__ == '__main__':
    # create and configure the process pool
    with Pool(2, maxtasksperchild=3) as pool:
        # issue tasks to the process pool
        for i in range(10):
            pool.apply_async(task, args=(i,))
        # close the process pool
        pool.close()
        # wait for all tasks to complete
        pool.join()
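On the thread-limit part of the question: the limit generally has to be applied inside the child process, before the first inference runs, for example through the pool's initializer (or at the top of the wrapper from the question). A sketch under that assumption; the value 4 and the limit_threads helper are placeholders, not part of the original code:

import os
from multiprocessing.pool import Pool

def limit_threads(n):
    # runs once in each worker process, before any task is executed
    os.environ["OMP_NUM_THREADS"] = str(n)   # env values must be strings
    os.environ["MKL_NUM_THREADS"] = str(n)
    import torch                             # import after the env vars are set
    torch.set_num_threads(n)                 # caps torch intra-op threads

if __name__ == '__main__':
    with Pool(2, initializer=limit_threads, initargs=(4,)) as pool:
        # my_function is the inference function from the question
        results = pool.map(my_function, ["hello", "world"])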
I'm trying to code a kind of task manager in Python. It's based on a job queue; the main thread is in charge of adding jobs to this queue. I have made this class to handle the queued jobs, limit the number of concurrent processes, and handle the output of the finished processes.
Here comes the problem: in the _check_jobs_status function I don't get an updated returncode value for each process, regardless of its status (running, finished...). job.returncode is always None, therefore I can't run the if statement and remove jobs from the processing jobs list.
I know it can be done with process.communicate() or process.wait(), but I don't want to block the thread that launches the processes. Is there any other way to do it, maybe using a ProcessPoolExecutor? The queue can receive jobs at any time and I need to be able to handle them.
Thank you all for your time and support :)
from queue import Queue
import subprocess
from threading import Thread
from time import sleep
class JobQueueManager(Queue):
    def __init__(self, maxsize: int):
        super().__init__(maxsize)
        self.processing_jobs = []
        self.process = None
        self.jobs_launcher = Thread(target=self._worker_job)
        self.processing_jobs_checker = Thread(target=self._check_jobs_status)
        self.jobs_launcher.start()
        self.processing_jobs_checker.start()

    def _worker_job(self):
        while True:
            # Run at max 3 jobs concurrently
            if self.not_empty and len(self.processing_jobs) < 3:
                # Get job from queue
                job = self.get()
                # Execute a task without blocking the thread
                self.process = subprocess.Popen(job)
                self.processing_jobs.append(self.process)
                # useful if queue.join() is used to block the queue
                self.task_done()
            else:
                print("Waiting 4s for jobs")
                sleep(4)

    def _check_jobs_status(self):
        while True:
            # Check if jobs are finished
            for job in self.processing_jobs:
                # Successfully completed
                if job.returncode == 0:
                    self.processing_jobs.remove(job)
            # Wait 4 seconds and repeat
            sleep(4)

def main():
    q = JobQueueManager(100)
    task = ["stress", "--cpu", "1", "--timeout", "20"]
    for i in range(10):  # put 10 tasks in the queue
        q.put(task)
    q.join()  # block until all tasks are done

if __name__ == "__main__":
    main()
I answer myself; I have come up with a working solution. The JobExecutor class handles the pool of processes in a custom way. The watch_completed_tasks function tries to watch and handle the output of the tasks when they are done. This way everything is done with only two threads, and the main thread is not blocked when submitting processes.
import subprocess
from threading import Timer
from concurrent.futures import ProcessPoolExecutor, as_completed
import logging
def launch_job(job):
    process = subprocess.Popen(job, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(f"launching {process.pid}")
    return [process.pid, process.stdout.read(), process.stderr.read()]

class JobExecutor(ProcessPoolExecutor):
    def __init__(self, max_workers: int):
        super().__init__(max_workers)
        self.futures = []
        self.watch_completed_tasks()

    def submit(self, command):
        future = super().submit(launch_job, command)
        self.futures.append(future)
        return future

    def watch_completed_tasks(self):
        # Manage tasks completion
        for completed_task in as_completed(self.futures):
            print(f"FINISHED task with PID {completed_task.result()[0]}")
            self.futures.remove(completed_task)
        # call this function every 5 seconds
        timer_thread = Timer(5.0, self.watch_completed_tasks)
        timer_thread.setName("TasksWatcher")
        timer_thread.start()

def main():
    executor = JobExecutor(max_workers=5)
    for i in range(10):
        task = ["stress",
                "--cpu", "1",
                "--timeout", str(i + 5)]
        executor.submit(task)
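For completeness, the returncode problem in the original class can also be solved without an executor: Popen.poll() checks whether the process has exited and updates returncode without blocking. A sketch of just the checker method under that assumption, with the rest of the class kept as posted:

def _check_jobs_status(self):
    while True:
        # iterate over a copy, since finished jobs are removed from the list
        for job in list(self.processing_jobs):
            if job.poll() is not None:  # poll() sets returncode once the process exits
                print(f"Job {job.pid} finished with return code {job.returncode}")
                self.processing_jobs.remove(job)
        # Wait 4 seconds and repeat
        sleep(4)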
Here is a simple example:
from collections import deque
from multiprocessing import Process
global_dequeue = deque([])
def push():
    global_dequeue.append('message')

p = Process(target=push)
p.start()

def pull():
    print(global_dequeue)

pull()
The output is deque([]).
If I were to call the push function directly, not as a separate process, the output would be deque(['message']).
How can I get the message into the deque, but still run the push function in a separate process?
You can share data by using the multiprocessing Queue object, which is designed to share data between processes:
from multiprocessing import Process, Queue
import time
def push(q):  # send Queue to function as argument
    for i in range(10):
        q.put(str(i))  # put element in Queue
        time.sleep(0.2)
    q.put("STOP")  # put poison pill to stop taking elements from Queue in master

if __name__ == "__main__":
    q = Queue()  # create Queue instance
    p = Process(target=push, args=(q,),)  # create Process
    p.start()  # start it
    while True:
        x = q.get()
        if x == "STOP":
            break
        print(x)
    p.join()  # join process to our master process and continue master run
    print("Finish")
Let me know if it helped, feel free to ask questions.
You can also use Managers to achieve this.
Python 2: https://docs.python.org/2/library/multiprocessing.html#managers
Python 3: https://docs.python.org/3.8/library/multiprocessing.html#managers
Example of usage:
https://pymotw.com/2/multiprocessing/communication.html#managing-shared-state
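A minimal sketch of the Manager route, adapted to the deque example above (a managed list stands in for the module-level deque):

from multiprocessing import Process, Manager

def push(shared_list):
    shared_list.append('message')  # the proxy forwards the call to the manager process

if __name__ == "__main__":
    with Manager() as manager:
        shared_list = manager.list()                   # shared list proxy
        p = Process(target=push, args=(shared_list,))
        p.start()
        p.join()
        print(list(shared_list))                       # prints ['message']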
I wrote a little multi-host online scanner. So my question is: is the code correct? I mean, the program does what it should, but if I look at the top command in the terminal, it shows me a lot of zombie threads from time to time.
The code for the scan and threader functions:
def scan(queue):
    conf.verb = 0
    while True:
        try:
            host = queue.get()
            if host is None:
                sys.exit(1)
            icmp = sr1(IP(dst=host)/ICMP(), timeout=3)
            if icmp:
                with Print_lock:
                    print("Host: {} online".format(host))
                saveTofile(RngFile, host)
        except Empty:
            print("Done")
        queue.task_done()

def threader(queue, hostlist):
    threads = []
    max_threads = 223
    for i in range(max_threads):
        thread = Thread(target=scan, args=(queue,))
        thread.start()
        threads.append(thread)
    for ip in hostlist:
        queue.put(ip)
    queue.join()
    for i in range(max_threads):
        queue.put(None)
    for thread in threads:
        thread.join()
P.S. Sorry for my terrible English
If you have a lot more threads than you have cores then you aren't really getting any benefit from spawning them, and even worse, Python has the global interpreter lock, so you don't get real multithreading unless you use multiple processes. Use multiprocessing and set max_threads to multiprocessing.cpu_count().
Even better, you could use a pool.
from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    # change scan to take a host directly instead of reading from a queue
    results = p.map(scan, hostlist)
And that's it: no messing with queues or filling them with None to make sure you kill all your processes.
I should add, make sure you create your process pool inside the main module! Your code in its entirety should look like this:
from multiprocessing import Pool, cpu_count

def scan(host):
    # whatever
    ...

if __name__ == "__main__":
    with Pool(cpu_count()) as p:
        results = p.map(scan, hostlist)
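For illustration, the queue-free scan might look roughly like this (returning the host instead of writing to a file is an assumption; the original saved results with saveTofile, and the hostlist here is a placeholder):

from multiprocessing import Pool, cpu_count
from scapy.all import sr1, IP, ICMP, conf

def scan(host):
    conf.verb = 0
    icmp = sr1(IP(dst=host)/ICMP(), timeout=3)  # same probe as the original worker
    if icmp:
        print("Host: {} online".format(host))
        return host
    return None

if __name__ == "__main__":
    hostlist = ["192.168.1.1", "192.168.1.2"]   # placeholder hosts
    with Pool(cpu_count()) as p:
        online = [h for h in p.map(scan, hostlist) if h is not None]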