My question is about queues and ThreadPoolExecutor. If I understand the Python docs for queue.Queue correctly, I can have code somewhat like the following and not have to worry about needing another lock in class B to control which thread adds items to the queue, since Queue implements multi-producer, multi-consumer semantics with its own internal locking. Is that right?
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

class A:
    def __init__(self, max_worker=1):
        self.pool = ThreadPoolExecutor(max_worker)
        self.buffer = {}                 # maps record id -> B instance
        self._lock = threading.RLock()

    def add_record_id(self, id, item):
        with self._lock:
            self.buffer[id].add(id, item, self.pool)

class B:
    def __init__(self):
        self.q = queue.Queue()

    def add(self, id, item, pool):
        if id >= 0:
            self.q.put(item)
            pool.submit(self.background_remover)   # background_remover defined elsewhere
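As a sanity check of that assumption, here is a minimal standalone sketch (the names are mine, not from the code above) with several producer threads calling put() on one queue.Queue and no external lock:

import queue
import threading

q = queue.Queue()

def producer(worker_id):
    # Queue.put() is internally synchronized, so no external lock is needed
    for n in range(3):
        q.put((worker_id, n))

threads = [threading.Thread(target=producer, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(q.qsize())  # 12 items, none lost to races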
Related
I have two threads with while loops in them. The first produces data that the second needs to process in parallel, and I need to share a variable between them.
Let's introduce some dummy input:
data = iter([1,2,3,4,5,6,7,8,9])
My first Thread subclass:
import threading
from queue import Queue
import time

class Thread1(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self._download = {}

    def run(self):
        i = 0
        while True:
            item = next(data)           # take the next value once
            self._download[i] = item    # remember it in the dictionary
            self.queue.put(item)        # and hand the same value to the consumer
            time.sleep(1)
            i += 1
My second Thread subclass:
class Thread2(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            self.queue.get()
            time.sleep(3)
with the main method:
q = Queue(maxsize=10)
t = Thread1(q)
s = Thread2(q)
t.start()
s.start()
I have illustrated the two pieces of the setup. I can access the queue variable from the second thread, but I also want the second thread to be able to access the dictionary.
What can I do so that Thread2 can access the dictionary as well, and which approach should I opt for?
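For what it's worth, one common way to give the second thread access to the dictionary is simply to pass the same dict object into both constructors. A minimal sketch under that assumption (Producer/Consumer are illustrative names, not the classes above):

import threading
import time
from queue import Queue

shared_download = {}            # one dict object shared by both threads
q = Queue(maxsize=10)

class Producer(threading.Thread):
    def __init__(self, queue, download):
        threading.Thread.__init__(self)
        self.queue = queue
        self.download = download

    def run(self):
        for i, value in enumerate([1, 2, 3, 4, 5]):
            self.download[i] = value     # write to the shared dict
            self.queue.put(value)
            time.sleep(0.1)

class Consumer(threading.Thread):
    def __init__(self, queue, download):
        threading.Thread.__init__(self)
        self.queue = queue
        self.download = download

    def run(self):
        for _ in range(5):
            item = self.queue.get()
            # the same dict is visible here; add a lock if both sides mutate it
            print(item, len(self.download))

t = Producer(q, shared_download)
s = Consumer(q, shared_download)
t.start(); s.start()
t.join(); s.join()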
After encountering some probable memory leaks in a long-running multi-threaded script, I found out about maxtasksperchild, which can be used in a multiprocessing Pool like this:
import multiprocessing

with multiprocessing.Pool(processes=32, maxtasksperchild=x) as pool:
    pool.imap(function, stuff)
Is something similar possible for the Threadpool (multiprocessing.pool.ThreadPool)?
As the answer by noxdafox says, there is no such parameter in the parent class. You can, however, use the threading module directly to control the maximum number of tasks each thread handles. Since multiprocessing.pool.ThreadPool and the threading module are similar in spirit, something like this works:
import threading

def split_processing(yourlist, num_splits=4):
    '''
    yourlist   = the list whose items you want to process in threads.
    num_splits = the number of chunks (and therefore threads) to create.
    '''
    split_size = len(yourlist) // num_splits
    threads = []
    for i in range(num_splits):
        # each thread works on the slice [start:end) of the list
        start = i * split_size
        end = len(yourlist) if i + 1 == num_splits else (i + 1) * split_size
        # `function` is your own worker; it must accept (yourlist, start, end)
        threads.append(threading.Thread(target=function, args=(yourlist, start, end)))
        threads[-1].start()

    # wait for all threads to finish
    for t in threads:
        t.join()
Let's say yourlist has 100 items. Then:
if num_splits = 10, you get 10 threads, each handling 10 items;
if num_splits = 5, you get 5 threads, each handling 20 items;
if num_splits = 50, you get 50 threads, each handling 2 items;
and so on.
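The snippet above assumes you supply your own function with the signature (yourlist, start, end). A hypothetical worker, purely for illustration, might look like:

def function(yourlist, start, end):
    # process only the slice assigned to this thread
    for index in range(start, end):
        yourlist[index] = yourlist[index] * 2   # placeholder for the real per-item work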
Looking at the multiprocessing.pool.ThreadPool implementation, it becomes evident that the maxtasksperchild parameter is not propagated to the parent multiprocessing.Pool class. The multiprocessing.pool.ThreadPool implementation has never been completed, hence it lacks a few features (as well as tests and documentation).
The pebble package implements a ThreadPool which supports restarting workers after a given number of tasks have been processed.
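A minimal sketch of what that could look like, assuming pebble's ThreadPool takes max_workers/max_tasks arguments and exposes a schedule() method returning a concurrent.futures-style future (check the pebble docs for the exact API):

from pebble import ThreadPool

def work(x):
    return x * 2

# max_tasks=10: each worker thread is replaced after it has processed 10 tasks
with ThreadPool(max_workers=4, max_tasks=10) as pool:
    futures = [pool.schedule(work, args=(item,)) for item in range(100)]
    results = [f.result() for f in futures]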
I wanted a ThreadPool that will run a new task as soon as another task in the pool completes (i.e. maxtasksperchild=1). I decided to write a small "ThreadPool" class that creates a new thread for every task. As soon as a task in the pool completes, another thread is created for the next value in the iterable passed to the map method. The map method blocks until all values in the passed iterable have been processed and their threads have returned.
import threading


class ThreadPool():
    def __init__(self, processes=20):
        self.processes = processes
        self.threads = [Thread() for _ in range(processes)]

    def get_dead_threads(self):
        dead = []
        for thread in self.threads:
            if not thread.is_alive():
                dead.append(thread)
        return dead

    def is_thread_running(self):
        return len(self.get_dead_threads()) < self.processes

    def map(self, func, values):
        attempted_count = 0
        values_iter = iter(values)
        # loop until all values have been attempted to be processed and
        # all threads are finished running
        while (attempted_count < len(values) or self.is_thread_running()):
            for thread in self.get_dead_threads():
                try:
                    # run thread with the next value
                    value = next(values_iter)
                    attempted_count += 1
                    thread.run(func, value)
                except StopIteration:
                    break

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        pass


class Thread():
    def __init__(self):
        self.thread = None

    def run(self, target, *args, **kwargs):
        self.thread = threading.Thread(target=target,
                                       args=args,
                                       kwargs=kwargs)
        self.thread.start()

    def is_alive(self):
        if self.thread:
            return self.thread.is_alive()
        else:
            return False
You can use it like this:
def run_job(value, mp_queue=None):
    # do something with the value
    value += 1

with ThreadPool(processes=2) as pool:
    pool.map(run_job, [1, 2, 3, 4, 5])
I have a list of objects, and I want to execute a method on each object in parallel. The method modifies the attributes of the objects. For example:
class Object:
    def __init__(self, a):
        self.a = a

    def aplus(self):
        self.a += 1

object_list = [Object(1), Object(2), Object(3)]

# I want to execute this in parallel
for i in range(len(object_list)):
    object_list[i].aplus()
I tried the following:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
executor = ProcessPoolExecutor(max_workers=3)
res = executor.map([obj.aplus for obj in object_list])
Which does not work, leaving the objects unaltered. I assume it's because the objects can only be copied, and not accessed, with multiprocessing. Any idea?
Thanks a lot!
EDIT: Suppose the objects are very big, so it would be preferable to avoid copying them to each process. Suppose also that the methods are very CPU-intensive, so multiple processes rather than threads should be used. Under these conditions I believe there is no solution, as multiprocessing cannot share memory and threads cannot use multiple CPUs. I would like to be shown wrong, though.
Here is a working example using Pool.map:
import multiprocessing

class Object:
    def __init__(self, a):
        self.a = a

    def aplus(self):
        self.a += 1

    def __str__(self):
        return str(self.a)

def worker(obj):
    obj.aplus()
    return obj

if __name__ == "__main__":
    object_list = [Object(1), Object(2), Object(3)]

    try:
        processes = multiprocessing.cpu_count()
    except NotImplementedError:
        processes = 2
    pool = multiprocessing.Pool(processes=processes)

    modified_object_list = pool.map(worker, object_list)
    for obj in modified_object_list:
        print(obj)
Prints:
2
3
4
Here is my answer, using threading:
from threading import Thread

class Object:
    def __init__(self, a):
        self.a = a

    def aplus(self):
        self.a += 1

object_list = [Object(1), Object(2), Object(3)]

# A list containing all threads we will create
threads = []

# Create a thread for every object
for obj in object_list:
    thread = Thread(target=obj.aplus)
    thread.daemon = True
    thread.start()
    threads.append(thread)

# Wait for all threads to finish before continuing
for thread in threads:
    thread.join()

# print the results
for obj in object_list:
    print(obj.a)
I assume it's because the objects can only be copied, and not
accessed, with multiprocessing.
This is exactly right, and is half the answer. Because the processes are isolated they each have their own copy of the object_list. One solution here is to use ThreadPoolExecutor (the threads all share the same object_list).
The syntax to use it is a bit different from what you are trying to use, but this works as intended:
executor = ThreadPoolExecutor(max_workers=3)
res = executor.map(Object.aplus, object_list)
If you really want to use ProcessPoolExecutor then you'll need to get the data back from the processes somehow. The easiest way is to use functions which return values:
from concurrent.futures import ProcessPoolExecutor

class Object:
    def __init__(self, a):
        self.a = a

    def aplus(self):
        self.a += 1
        return self.a

if __name__ == '__main__':
    object_list = [Object(1), Object(2), Object(3)]
    executor = ProcessPoolExecutor(max_workers=3)
    for result in executor.map(Object.aplus, object_list):
        print("I got: " + str(result))
You can even have the function you are mapping return self, and put those returned objects back into your object_list at the end. So the full multiprocessing solution would look like:
from concurrent.futures import ProcessPoolExecutor

class Object:
    def __init__(self, a):
        self.a = a

    def aplus(self):
        self.a += 1
        return self

if __name__ == '__main__':
    object_list = [Object(1), Object(2), Object(3)]
    executor = ProcessPoolExecutor(max_workers=3)
    object_list = list(executor.map(Object.aplus, object_list))
Is there any way to have a pub/sub pattern using multiprocessing data structures? In other words, I would like to have something like a queue, except that the publisher can send a single command to multiple workers simultaneously.
You can create your own data structure to implement a simple pub/sub pattern using a wrapper around multiprocessing.Queue:
import os
import multiprocessing
from functools import wraps

def ensure_parent(func):
    @wraps(func)
    def inner(self, *args, **kwargs):
        if os.getpid() != self._creator_pid:
            raise RuntimeError("{} can only be called in the "
                               "parent.".format(func.__name__))
        return func(self, *args, **kwargs)
    return inner

class PublishQueue(object):
    def __init__(self):
        self._queues = []
        self._creator_pid = os.getpid()

    def __getstate__(self):
        self_dict = self.__dict__.copy()  # copy so the parent's queue list is untouched
        self_dict['_queues'] = []
        return self_dict

    def __setstate__(self, state):
        self.__dict__.update(state)

    @ensure_parent
    def register(self):
        q = multiprocessing.Queue()
        self._queues.append(q)
        return q

    @ensure_parent
    def publish(self, val):
        for q in self._queues:
            q.put(val)
def worker(q):
    for item in iter(q.get, None):
        print("got item {} in process {}".format(item, os.getpid()))

if __name__ == "__main__":
    q = PublishQueue()
    processes = []
    for _ in range(3):
        p = multiprocessing.Process(target=worker, args=(q.register(),))
        p.start()
        processes.append(p)

    q.publish('1')
    q.publish(2)
    q.publish(None)  # Shut down workers
    for p in processes:
        p.join()
Output:
got item 1 in process 4383
got item 2 in process 4383
got item 1 in process 4381
got item 2 in process 4381
got item 1 in process 4382
got item 2 in process 4382
This pattern will work well as long as the parent process is the only one doing the publishing, and you register a subscription queue for each worker in the parent, and then pass that subscription queue to the worker process using its multiprocessing.Process constructor. These limitations are due to multiprocessing.Queue being unpicklable. If you need to pass the subscription queue to an already running worker, you'll need to tweak the implementation to use a multiprocessing.Manager.Queue instead.
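As a rough illustration of that tweak (ManagedPublishQueue is a hypothetical name of mine, and this is only a sketch, not a tested drop-in), the change amounts to handing out manager queue proxies, which can be pickled and sent to workers that are already running:

import multiprocessing

class ManagedPublishQueue(PublishQueue):
    def __init__(self):
        super().__init__()
        self._manager = multiprocessing.Manager()

    @ensure_parent
    def register(self):
        # a manager queue proxy is picklable, so it can be sent to an
        # already running worker over an existing connection or queue
        q = self._manager.Queue()
        self._queues.append(q)
        return q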
I am trying to find a way to compare different objects (inherited from the Thread class) while keeping parallelism (real-time processing).
Every worker has three fields (message, count, n). I update count on every iteration. Let's say I have three worker threads. In my server, how can I access and compare Worker.count across all the workers, based on the count field, in a way that keeps everything running in parallel?
from Queue import Queue
from threading import Thread
import time

class Worker(Thread):
    def __init__(self, message, n):
        Thread.__init__(self)
        self.message = message
        self.count = 0
        self.n = n

    def run(self):
        while True:
            print(self.message)
            self.count += 1
            time.sleep(self.n)

class Comparator(Thread):
    def __init__(self, message, n):
        Thread.__init__(self)
        self.message = message
        self.n = n

    def run(self):
        while True:
            max_count = max([x.count for x in threads])  # how can I access the other threads?
            print("max", max_count)
            time.sleep(self.n)

thread1 = Worker("Test-1", 1)
thread2 = Worker("Test-2", 3)

s = Comparator("Test-3", 2)
s.start()
s.join()

threads = [thread1, thread2]
for g in threads:
    g.start()

for worker in threads:
    # wait for the workers
    worker.join()
NOTE: Using a shared object is not a good solution for me, and using a Queue(), for example, is not what I want. I need to do the comparison based on a field in the object that I keep updating on the go (for simplicity, I use max()).
You can pass the threads list to the Comparator __init__() method:
[...]

class Comparator(Thread):
    def __init__(self, message, n, threads):
        Thread.__init__(self)
        self.message = message
        self.n = n
        self.threads = threads

    def run(self):
        while True:
            max_count = max([x.count for x in self.threads])
            print("max", max_count)
            time.sleep(self.n)

[...]

threads = [thread1, thread2]
s = Comparator("Test-3", 2, threads)
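One wiring detail worth adding (my assumption about the intended flow, not part of the original answer): build and start the workers before starting the comparator, and only join at the very end, otherwise the early s.join() blocks forever before the workers ever run:

thread1 = Worker("Test-1", 1)
thread2 = Worker("Test-2", 3)
threads = [thread1, thread2]

for g in threads:
    g.start()

s = Comparator("Test-3", 2, threads)
s.start()

# the workers loop forever here, so a real program would add a stop
# condition before these joins
for worker in threads:
    worker.join()
s.join()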