Why might my multiprocessing queue be "losing" items?

I've got some code where I want to share objects between processes using queues. I've got a parent:
processing_manager = mp.Manager()
to_cacher = processing_manager.Queue()
fetchers = get_fetchers()
fetcher_process = mp.Process(target=fetch_news, args=(to_cacher, fetchers))
fetcher_process.start()
while 1:
    print(to_cacher.get())
And a child:
def fetch_news(pass_to: Queue, fetchers: List[Fetcher]):
    def put_news_to_query(pass_to: Queue, fetchers: List[Fetcher]):
        for fet in fetchers:
            for news in fet.latest_news():
                print(news)
                pass_to.put(news)
        print("----------------")
    put_news_to_query(pass_to, fetchers)
I'm expecting to see N objects printed in put_news_to_query, then a line, and then the same objects printed in the while loop in the parent. The problem is that objects appear to go missing: if I get, say, 8 objects printed in put_news_to_query, I only get 2-3 objects printed in the while loop. What am I doing wrong here?

This is not the answer, unless the answer is that the code is already working. I've just modified the code to make it a runnable example of the same technique. The data gets from the child to the parent without loss.
import multiprocessing as mp
import time
import random

def worker(pass_to):
    for i in range(10):
        time.sleep(random.randint(0,10)/1000)
        print('child', i)
        pass_to.put(i)
    print("---------------------")
    pass_to.put(None)

def main():
    manager = mp.Manager()
    to_cacher = manager.Queue()
    fetcher = mp.Process(target=worker, args=(to_cacher,))
    fetcher.start()
    while 1:
        msg = to_cacher.get()
        if msg is None:
            break
        print(msg)

if __name__ == "__main__":
    main()

So, apparently, it was related to the order in which the put and get statements were executed: some of the objects printed by the parent appeared before the separator line, which made it look like items were lost. If you struggle with something like this, I'd recommend adding something to distinguish the prints, like this:
print(f"Worker: {news}")
print(f"Main: {to_cacher.get()}")

Related

Multiprocessing event-queue not updating

So I'm writing a program with an event system.
I have a list of events to be handled.
One process is supposed to push new events onto the to-handle list.
This part seems to work: when I print out the to-handle list after pushing an event, it gets longer and longer. However, when I print out the to-handle list in the handle_event method, it is empty all the time.
Here is my event_handler code:
class Event_Handler:
    def __init__(self):
        self._to_handle_list = [deque() for _ in range(Event_Prio.get_num_prios())]
        self._controll_handler = None
        self._process_lock = Lock()

    def init(self, controll_EV_handler):
        self._controll_handler = controll_EV_handler

    def new_event(self, event):  # adds a new event to the list
        with self._process_lock:
            self._to_handle_list[event.get_Prio()].append(event)  # this list grows

    def handle_event(self):  # deals with the to_handle_list
        self._process_lock.acquire()
        for i in range(Event_Prio.get_num_prios()):  # here I keep seeing a list of empty deques
            print(self._to_handle_list)
            if (self._to_handle_list[i]):  # checks if the to-do list is non-empty; it never gets here
                self._process_lock.release()
                self._controll_handler.controll_event(self._to_handle_list[i].popleft())
                return
        self._process_lock.release()

    def create_Event(self, prio, type):
        return Event(prio, type)
I tried everything. I checked that the event-handler id is the same for both processes (plus the lock works).
I even checked that the to-handle-list id is the same for both methods; yes, it is.
Still, the list in one process grows while the other stays empty.
Can someone please tell me why the one list is empty?
Edit: It works just fine if I throw an event through the system with only one process, so it has to have something to do with multiprocessing.
Edit: Because someone asked, here is a simple use case for it (I only included the essentials):
class EV_Main():
    def __init__(self):
        self.e_h = Event_Handler()
        self.e_controll = None  # the controller doesnt even matter because the controll-function never gets called....list is always empty

    def run(self):
        self.e_h.init(self.e_controll)
        process1 = Process(target=self.create_events)
        process2 = Process(target=self.handle_events)
        process1.start()
        process2.start()

    def create_events(self):
        while True:
            self.e_h.new_event(self.e_h.create_Event(0, 3))  # eEvent_Type.S_TOUCH_EVENT
            time.sleep(0.3)

    def handle_events(self):
        while True:
            self.e_h.handle_event()
            time.sleep(0.1)
To have a shareable set of deque instances, you could create a special class DequeArray that holds an internal list of deque instances and exposes whatever methods you might need. I would then turn this into a shareable, managed object. When the manager creates an instance of this class, what is returned is a proxy to the actual instance, which resides in the manager's address space. Any method calls you make on this proxy are shipped off to the manager's process using pickle, and any results are returned the same way. Since the individual deque instances are not themselves shareable, managed objects, do not add a method that returns one of these deque instances: the caller would get a copy, and modifying it would not modify the version of the deque in the manager's address space.
Individual operations on a deque are serialized. But if you are doing some operation on a deque that consists of multiple method calls on the deque and you require atomicity, then that sequence is a critical section that needs to be done under control of a lock, as in the left_rotate function below.
from multiprocessing import Process, Lock
from multiprocessing.managers import BaseManager
from collections import deque

# Add methods to this as required:
class DequeArray:
    def __init__(self, array_size):
        self._deques = [deque() for _ in range(array_size)]

    def __repr__(self):
        l = []
        l.append('DequeArray [')
        for d in self._deques:
            l.append('    ' + str(d))
        l.append(']')
        return '\n'.join(l)

    def __len__(self):
        """
        Return our length (i.e. the number of deque
        instances we have).
        """
        return len(self._deques)

    def append(self, i, value):
        """
        Append value to the ith deque.
        """
        self._deques[i].append(value)

    def popleft(self, i):
        """
        Execute a popleft operation on the ith deque
        and return the result.
        """
        return self._deques[i].popleft()

    def length(self, i):
        """
        Return the length of the ith deque.
        """
        return len(self._deques[i])

class DequeArrayManager(BaseManager):
    pass

DequeArrayManager.register('DequeArray', DequeArray)

# Demonstrate how to use a sharable DequeArray
def left_rotate(deque_array, lock, i):
    # Rotate the first element to be the last element.
    # This is not an atomic operation, so do it under control of a lock:
    with lock:
        deque_array.append(i, deque_array.popleft(i))

# Required for Windows:
if __name__ == '__main__':
    # This starts the manager process:
    with DequeArrayManager() as manager:
        # Two deques:
        deque_array = manager.DequeArray(2)
        # Initialize with some values:
        deque_array.append(0, 0)
        deque_array.append(0, 1)
        deque_array.append(0, 2)
        # Same values in the second deque:
        deque_array.append(1, 0)
        deque_array.append(1, 1)
        deque_array.append(1, 2)
        print(deque_array)
        # Both processes will be modifying the same deque in a
        # non-atomic way, so we definitely need to be doing this under
        # control of a lock. We don't care which process acquires the
        # lock first because the results will be the same regardless.
        lock = Lock()
        p1 = Process(target=left_rotate, args=(deque_array, lock, 0))
        p2 = Process(target=left_rotate, args=(deque_array, lock, 0))
        p1.start()
        p2.start()
        p1.join()
        p2.join()
        print(deque_array)
Prints:
DequeArray [
deque([0, 1, 2])
deque([0, 1, 2])
]
DequeArray [
deque([2, 0, 1])
deque([0, 1, 2])
]
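Applied back to the question, a minimal sketch of the same idea would register Event_Handler with a BaseManager so both child processes receive proxies to one shared instance rather than their own copies. This assumes the asker's Event_Handler (and its Event / Event_Prio helpers) can be imported from a module, that Event objects are picklable, and it uses a hypothetical stand-in controller:
from multiprocessing import Process
from multiprocessing.managers import BaseManager
import time

from events import Event_Handler   # hypothetical module holding the asker's classes

class EventManager(BaseManager):
    pass

# The real Event_Handler instance will live in the manager's process.
EventManager.register('Event_Handler', Event_Handler)

class PrintController:
    """Hypothetical stand-in for the asker's controller; just prints events."""
    def controll_event(self, event):
        print('handled', event)

def create_events(e_h):
    for _ in range(10):
        e_h.new_event(e_h.create_Event(0, 3))
        time.sleep(0.3)

def handle_events(e_h):
    for _ in range(40):
        e_h.handle_event()
        time.sleep(0.1)

if __name__ == '__main__':
    with EventManager() as manager:
        e_h = manager.Event_Handler()    # proxy to the single shared instance
        e_h.init(PrintController())
        p1 = Process(target=create_events, args=(e_h,))
        p2 = Process(target=handle_events, args=(e_h,))
        p1.start()
        p2.start()
        p1.join()
        p2.join()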

multiprocessing a function with parameters that are iterated through

I'm trying to improve the speed of my program and I decided to use multiprocessing!
The problem is I can't seem to find any way to use the Pool function (I think this is what I need) with my function.
Here is the code that I am dealing with:
def dataLoading(output):
    name = ""
    link = ""
    upCheck = ""
    isSuccess = ""
    for i in os.listdir():
        with open(i) as currentFile:
            data = json.loads(currentFile.read())
            try:
                name = data["name"]
                link = data["link"]
                upCheck = data["upCheck"]
                isSuccess = data["isSuccess"]
            except:
                print("error in loading data from config: improper naming or formating used")
            output[name] = [link, upCheck, isSuccess]

# working
def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    html = requests.get(link, headers=headers)
    page_source = html.text
    count = page_source.count(isSuccess)
    if count > 0:
        return True
    else:
        return False
I have a parent function to run these two together, but I don't think I need to show the whole thing, just the part that gets the data iteratively:
for i in configData:
    data = configData[i]
    link = data[0]
    print(link)
    upCheck = data[1]  # just for future use
    isSuccess = data[2]
    if userCheck(link, username, isSuccess) == True:
        good.append(i)
You can see how I enter all of the data in there; how would I be able to use multiprocessing to do this when I am iterating through the dictionary to collect multiple parameters?
I like to use mp.Pool().map. I think it is the easiest and most straightforward approach and handles most multiprocessing cases. So how does map work? To start, keep in mind that mp creates workers; each worker receives a copy of the namespace (yes, the whole thing), then each worker works on what it is assigned and returns. Hence, doing something like "updating a global variable" while they work doesn't work, since each worker gets its own copy of the global variable and the workers are not communicating with each other. (If you want communicating workers you need to use mp.Queues and such; it gets complicated.) Anyway, here is how to use map:
from multiprocessing import Pool

t = 'abcd'

def func(s):
    return t[int(s)]

results = Pool().map(func,range(4))
Each worker received a copy of t, func, and the portion of range(4) they were assigned. They are then automatically tracked and everything is cleaned up in the end by Pool.
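For the toy example above, each worker applies func to its share of range(4), and results comes back in order as ['a', 'b', 'c', 'd'].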
Something like your dataLoading won't work very well; we need to modify it. I also cleaned the code up a little.
def loadfromfile(file):
    data = json.loads(open(file).read())
    items = [data.get(k, "") for k in ['name', 'link', 'upCheck', 'isSuccess']]
    return items[0], items[1:]

output = dict(Pool().map(loadfromfile, os.listdir()))
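The same idea extends to the userCheck loop from the question, which needs several parameters per call. A minimal sketch (assuming userCheck, configData, username, and good as defined in the question) using Pool().starmap, which unpacks each argument tuple for you:
from multiprocessing import Pool

# (run this under the usual if __name__ == '__main__': guard)
# Build one argument tuple per entry: (link, user, isSuccess).
args = [(configData[i][0], username, configData[i][2]) for i in configData]

with Pool() as pool:
    results = pool.starmap(userCheck, args)

# Keep the keys whose check succeeded, mirroring the original loop.
good = [i for i, ok in zip(configData, results) if ok]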

What's the best way to parallelize this process

I've been trying to parallelize a process inside a class method. When I try using Pool() from multiprocessing I get pickling errors. When I use Pool() from multiprocessing.dummy my execution is slower than serial execution.
I've attempted several variations of my code below, using Stackoverflow posts as a guide, but none of them were a successful workaround for the problem outlined above.
One example: if I move process_function above the class definition (globalizing it), it doesn't work because I can't access my object's attributes.
Anyway, my code is similar to:
from multiprocessing.dummy import Pool as ThreadPool
from my_other_module import other_module_class

class myClass:
    def __init__(self, some_list, number_iterations):
        self.my_interface = other_module_class
        self.relevant_list = []
        self.some_list = some_list
        self.number_iterations = number_iterations
        # self.other_attributes = stuff from import statements

    def load_relevant_data(self):
        self.relevant_list = self.interface.other_function

    def compute_foo(self, relevant_list_member_value):
        # math involving class attributes
        return foo_scalar

    def higher_function(self):
        self.relevant_list = self.load_relevant_data
        np.random.seed(0)
        pool = ThreadPool()  # I've tried different args here, no help
        pool.map(self.process_function, self.relevant_list)

    def process_function(self, dict_from_relevant_list):
        foo_bar = self.compute_foo(dict_from_relevant_list['key'])
        a = 0
        for i in some_other_list:
            # do other stuff involving class attributes and foo_bar
            # a = some of that
            pass
        dict_from_relevant_list['other_key'] = a

if __name__ == '__main__':
    import time
    import pprint as pp
    some_list = blah
    number_of_iterations = 10**4
    my_obj = myClass(some_list, number_of_iterations)
    my_obj.load_third_parties()
    start = time.time()
    my_obj.higher_function()
    execution_time = time.time() - start
    print()
    print("Execution time for %s simulation runs: %s" % (number_of_iterations, execution_time))
    print()
    pp.pprint(my_obj.relevant_list[0:5])
I have a few hundred dictionaries inside relevant_list. I just want to populate each of those dictionaries' 'other_key' field from a computationally expensive simulation in my innermost loop, which yields a scalar value, like a above. It seems like there should be a simple way to do this, since in Matlab I could just write parfor and it's done automatically. Maybe that instinct is wrong for Python.
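For reference, a minimal sketch of the parfor-like pattern described here, assuming the expensive per-dictionary work can be expressed as a module-level function (so it pickles cleanly for multiprocessing.Pool); all names below are hypothetical:
from multiprocessing import Pool

def expensive_simulation(item):
    # Hypothetical stand-in for the per-dictionary computation;
    # receives one dict and returns the scalar to store.
    return sum(v * v for v in item.get('values', []))

def populate_other_key(relevant_list):
    # Compute all scalars in parallel, then write them back in the parent,
    # since worker processes cannot mutate the parent's dictionaries.
    with Pool() as pool:
        scalars = pool.map(expensive_simulation, relevant_list)
    for item, a in zip(relevant_list, scalars):
        item['other_key'] = a
    return relevant_list

if __name__ == '__main__':
    data = [{'values': [1, 2, 3]}, {'values': [4, 5]}]
    print(populate_other_key(data))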

Issue with sharing data between Python processes with multiprocessing

I've seen several posts about this, so I know it is fairly straightforward to do, but I seem to be coming up short. I'm not sure if I need to create a worker pool, or use the Queue class. Basically, I want to be able to create several processes that each act autonomously (which is why they inherit from the Agent superclass).
At random ticks of my main loop I want to update each Agent. I'm using time.sleep with different values in the main loop and the Agent's run loop to simulate different processor speeds.
Here is my Agent superclass:
# Generic class to handle mpc of each agent
class Agent(mpc.Process):
    # initialize agent parameters
    def __init__(self,):
        # init mpc
        mpc.Process.__init__(self)
        self.exit = mpc.Event()

    # an agent's main loop...generally should be overridden
    def run(self):
        while not self.exit.is_set():
            pass
        print "You exited!"

    # safely shutdown an agent
    def shutdown(self):
        print "Shutdown initiated"
        self.exit.set()

    # safely communicate values to this agent
    def communicate(self, value):
        print value
A specific agent's subclass (simulating an HVAC system):
class HVAC(Agent):
    def __init__(self, dt=70, dh=50.0):
        super(Agent, self).__init__()
        self.exit = mpc.Event()
        self.__pref_heating = True
        self.__pref_cooling = True
        self.__desired_temperature = dt
        self.__desired_humidity = dh
        self.__meas_temperature = 0
        self.__meas_humidity = 0.0
        self.__hvac_status = ""  # heating, cooling, off
        self.start()

    def run(self):  # handle AC or heater on
        while not self.exit.is_set():
            ctemp = self.measureTemp()
            chum = self.measureHumidity()
            if (ctemp < self.__desired_temperature):
                self.__hvac_status = 'heating'
                self.__meas_temperature += 1
            elif (ctemp > self.__desired_temperature):
                self.__hvac_status = 'cooling'
                self.__meas_temperature += 1
            else:
                self.__hvac_status = 'off'
            print self.__hvac_status, self.__meas_temperature
            time.sleep(0.5)
        print "HVAC EXITED"

    def measureTemp(self):
        return self.__meas_temperature

    def measureHumidity(self):
        return self.__meas_humidity

    def communicate(self, updates):
        self.__meas_temperature = updates['temp']
        self.__meas_humidity = updates['humidity']
        print "Measured [%d] [%f]" % (self.__meas_temperature, self.__meas_humidity)
And my main loop:
if __name__ == "__main__":
print "Initializing subsystems"
agents = {}
agents['HVAC'] = HVAC()
# Run simulation
timestep = 0
while timestep < args.timesteps:
print "Timestep %d" % timestep
if timestep % 10 == 0:
curr_temp = random.randrange(68,72)
curr_humidity = random.uniform(40.0,60.0)
agents['HVAC'].communicate({'temp':curr_temp, 'humidity':curr_humidity})
time.sleep(1)
timestep += 1
agents['HVAC'].shutdown()
print "HVAC process state: %d" % agents['HVAC'].is_alive()
So the issue is that, whenever I run agents['HVAC'].communicate(x) within the main loop, I can see the value being passed into the HVAC subclass in its run loop (so it prints the received value correctly). However, the value is never successfully stored.
So typical output looks like this:
Initializing subsystems
Timestep 0
Measured [68] [56.948675]
heating 1
heating 2
Timestep 1
heating 3
heating 4
Timestep 2
heating 5
heating 6
When in reality, as soon as Measured [68] appears, the internal stored value should be updated to output 68 (not heating 1, heating 2, etc.). So effectively, the HVAC's self.__meas_temperature is not being properly updated.
Edit: After a bit of research, I realized that I didn't necessarily understand what is happening behind the scenes. Each subprocess operates with its own virtual chunk of memory and is completely abstracted away from any data being shared this way, so passing the value in isn't going to work. My new issue is that I'm not necessarily sure how to share a global value with multiple processes.
I was looking at the Queue and JoinableQueue classes, but I'm not entirely sure how to pass a Queue into the kind of superclass setup that I have (especially with the mpc.Process.__init__(self) call).
A side concern is whether multiple agents can read a value from the queue without removing it from the queue. For instance, if I wanted to share a temperature value with multiple agents, would a Queue work for this?
Pipe v Queue
Here's a suggested solution assuming that you want the following:
a centralized manager / main process which controls lifetimes of the workers
worker processes to do something self-contained and then report results to the manager and other processes
Before I show it, though, for the record I want to say that in general, unless you are CPU bound, multiprocessing is not really the right fit, mainly because of the added complexity, and you'd probably be better off using a different high-level asynchronous framework. Also, you should use Python 3, it's so much better!
That said, multiprocessing.Manager makes this pretty easy to do. I've written this in Python 3, but I don't see anything that shouldn't "just work" in Python 2; I haven't checked, though.
from ctypes import c_bool
from multiprocessing import Manager, Process, Array, Value
from pprint import pprint
from time import sleep, time

class Agent(Process):
    def __init__(self, name, shared_dictionary, delay=0.5):
        """My take on your Agent.
        Key difference is that I've commonized the run-loop and used
        a shared value to signal when to stop, to demonstrate it.
        """
        super(Agent, self).__init__()
        self.name = name
        # This is going to be how we communicate between processes.
        self.shared_dictionary = shared_dictionary
        # Create a silo for us to use.
        shared_dictionary[name] = []
        self.should_stop = Value(c_bool, False)
        # Primarily for testing purposes, and for simulating
        # slower agents.
        self.delay = delay

    def get_next_results(self):
        # In the real world I'd use abc.ABCMeta as the metaclass to do
        # this properly.
        raise RuntimeError('Subclasses must implement this')

    def run(self):
        ii = 0
        while not self.should_stop.value:
            ii += 1
            # debugging / monitoring
            print('%s %s run loop execution %d' % (
                type(self).__name__, self.name, ii))
            next_results = self.get_next_results()
            # Add the results, along with a timestamp.
            self.shared_dictionary[self.name] += [(time(), next_results)]
            sleep(self.delay)

    def stop(self):
        self.should_stop.value = True
        print('%s %s stopped' % (type(self).__name__, self.name))

class HVACAgent(Agent):
    def get_next_results(self):
        # This is where you do your work, but for the sake of
        # the example just return a constant dictionary.
        return {'temperature': 5, 'pressure': 7, 'humidity': 9}

class DumbReadingAgent(Agent):
    """A dumb agent to demonstrate workers reading other worker values."""
    def get_next_results(self):
        # get hvac 1 results:
        hvac1_results = self.shared_dictionary.get('hvac 1')
        if hvac1_results is None:
            return None
        return hvac1_results[-1][1]['temperature']

# Script starts.
results = {}

# The "with" ensures we terminate the manager at the end.
with Manager() as manager:
    # the manager is a subprocess in its own right. We can ask
    # it to manage a dictionary (or other python types) for us
    # to be shared among the other children.
    shared_info = manager.dict()

    hvac_agent1 = HVACAgent('hvac 1', shared_info)
    hvac_agent2 = HVACAgent('hvac 2', shared_info, delay=0.1)
    dumb_agent = DumbReadingAgent('dumb hvac1 reader', shared_info)

    agents = (hvac_agent1, hvac_agent2, dumb_agent)
    list(map(lambda a: a.start(), agents))
    sleep(1)
    list(map(lambda a: a.stop(), agents))
    list(map(lambda a: a.join(), agents))

    # Not quite sure what happens to the shared dictionary after
    # the manager dies, so for safety make a local copy.
    results = dict(shared_info)

pprint(results)

Python: How to return values in a multiprocessing situation?

Let's say I have a collection of Process-es, a[0] through a[m].
These processes will then send a job, via a queue, to another collection of Process-es, b[0] through b[n], where m > n
Or, to diagram:
a[0], a[1], ..., a[m] ---Queue---> b[0], b[1], ..., b[n]
Now, how do I return the result of the b processes to the relevant a process?
My first guess was using multiprocessing.Pipe()
So, I've tried doing the following:
## On the 'a' side
pipe = multiprocessing.Pipe()
job['pipe'] = pipe
queue.put(job)
rslt = pipe[0].recv()
## On the 'b' side
job = queue.get()
... process the job ...
pipe = job['pipe']
pipe.send(result)
and it doesn't work with the error: Required argument 'handle' (pos 1) not found
Reading many docs, I came up with:
## On the 'a' side
pipe = multiprocessing.Pipe()
job['pipe'] = multiprocessing.reduction.reduce_connection(pipe[1])
queue.put(job)
rslt = pipe[0].recv()
## On the 'b' side
job = queue.get()
... process the job ...
pipe = multiprocessing.reduction.rebuild_connection(job['pipe'], True, True)
pipe.send(result)
Now I get a different error: ValueError: need more than 2 values to unpack.
I've tried searching and searching and still can't find how to properly use the reduce_ and rebuild_ methods.
Please help so I can return the value from b to a.
I would recommend avoiding this passing around of Pipes and file descriptors (the last time I tried, it was not very standard and not very well documented). Having to deal with it was a pain; I do not recommend it :-/
I would suggest a different approach: let the main process manage the connections. Keep a work queue, but send the responses along a different path. This means that you need some kind of identifier for the processes. I will provide a toy implementation to illustrate my proposal:
#!/usr/bin/env python
import multiprocessing
import random

def fib(n):
    "Slow fibonacci implementation because why not"
    if n < 2:
        return n
    return fib(n-2) + fib(n-1)

def process_b(queue_in, queue_out):
    print "Starting process B"
    while True:
        j = queue_in.get()
        print "Job: %d" % j["val"]
        j["result"] = fib(j["val"])
        queue_out.put(j)

def process_a(index, pipe_end, queue):
    print "Starting process A"
    value = random.randint(5, 50)
    j = {
        "a_id": index,
        "val": value,
    }
    queue.put(j)
    r = pipe_end.recv()
    print "Process A sent value %d and received: %s" % (value, r)

def main():
    print "Starting main"
    a_pipes = list()
    jobs = multiprocessing.Queue()
    done_jobs = multiprocessing.Queue()
    for i in range(5):
        multiprocessing.Process(target=process_b, args=(jobs, done_jobs,)).start()
    for i in range(10):
        receiver, sender = multiprocessing.Pipe(duplex=False)
        a_pipes.append(sender)
        multiprocessing.Process(target=process_a, args=(i, receiver, jobs)).start()
    while True:
        j = done_jobs.get()
        a_pipes[j["a_id"]].send(j["result"])

if __name__ == "__main__":
    main()
Note that the queue of jobs is connected directly between the a and b processes. Each a process is responsible for putting its identifier into the job (which the "master" should know about), and the b processes use a different queue for the finished work. I used the same job dictionary, but a typical implementation should use a more tailored data structure. The response carries the identifier of the a process so that the master can send the result to that specific process.
I assume that there is some way to use it with your approach, which I don't dislike at all (it would have been my first approach). But having to deal with file descriptors and the reduce_ and rebuild_ methods is not nice. Not at all.
So, as @MariusSiuram explained in this post, trying to pass a Connection object is an exercise in frustration.
I finally resorted to using a DictProxy to return values from B to A.
This is the concept:
### This is in the main process
...
jobs_queue = multiprocessing.Queue()
manager = multiprocessing.Manager()
ret_dict = manager.dict()
...
# Somewhere during Process initialization, jobs_queue and ret_dict got passed to
# the workers' constructor
...

### This is in the "A" (left-side) workers
...
self.ret_dict.pop(self.pid, None)  # Remove our identifier if it exists
self.jobs_queue.put({
    'request': parameters_to_be_used_by_B,
    'requester': self.pid
})
while self.pid not in self.ret_dict:
    time.sleep(0.1)  # Or any sane value
result = self.ret_dict[self.pid]
...

### This is in the "B" (right-side) workers
...
while True:
    job = self.jobs_queue.get()
    if job is None:
        break
    result = self.do_something(job['request'])
    self.ret_dict[job['requester']] = result
...
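A minimal runnable sketch of that concept (names hypothetical; plain functions stand in for the worker classes, and the requesters poll the managed dict for their results):
import multiprocessing
import time

def worker_a(index, jobs_queue, ret_dict):
    # Send a request tagged with our identifier, then poll for the result.
    jobs_queue.put({'request': index * 10, 'requester': index})
    while index not in ret_dict:
        time.sleep(0.1)
    print('A[%d] got result: %s' % (index, ret_dict[index]))

def worker_b(jobs_queue, ret_dict):
    while True:
        job = jobs_queue.get()
        if job is None:
            break
        # Stand-in for real work: just double the request value.
        ret_dict[job['requester']] = job['request'] * 2

if __name__ == '__main__':
    jobs_queue = multiprocessing.Queue()
    manager = multiprocessing.Manager()
    ret_dict = manager.dict()

    b = multiprocessing.Process(target=worker_b, args=(jobs_queue, ret_dict))
    b.start()
    a_procs = [multiprocessing.Process(target=worker_a, args=(i, jobs_queue, ret_dict))
               for i in range(3)]
    for p in a_procs:
        p.start()
    for p in a_procs:
        p.join()

    jobs_queue.put(None)   # sentinel to stop B
    b.join()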
