python multiprocessing and thread safety in shared objects - python

I am a bit uncertain regarding thread safety and multiprocessing.
From what I can tell, multiprocessing.Pool.map pickles the calling function or object but leaves members passed by reference intact.
This seems like it could be beneficial since it saves memory but I haven't found any information about the thread safety in those objects.
In my case I am trying to read numpy data from disk; however, I want to be able to modify the source without changing the implementation, so I've broken out the reading part into its own classes.
I roughly have the following situation:
import numpy as np
from pathlib import Path
from multiprocessing import Pool

class NpReader():
    def read_row(self, row_index):
        pass

class NpReaderSingleFile(NpReader):
    def read_row(self, row_index):
        return np.load(self._filename_from_row(row_index))

    def _filename_from_row(self, row_index):
        return Path(str(row_index)).with_suffix('.npy')

class NpReaderBatch(NpReader):
    def __init__(self, batch_file, mmap_mode=None):
        self.batch = np.load(batch_file, mmap_mode=mmap_mode)

    def read_row(self, row_index):
        read_index = row_index
        return self.batch[read_index]

class ProcessRow():
    def __init__(self, reader):
        self.reader = reader

    def __call__(self, row_index):
        return self.reader.read_row(row_index).shape

readers = [
    NpReaderSingleFile(),
    NpReaderBatch('batch.npy'),
    NpReaderBatch('batch.npy', mmap_mode='r')
]

res = []
for reader in readers:
    with Pool(12) as mp:
        res.append(mp.map(ProcessRow(reader), range(100000)))
It seems to me that there are a lot of things that could go wrong here, but unfortunately I don't have the knowledge to determine what to test for.
Are there any obvious problems with the above approach?
Some things that occurred to me are:
np.load: it seems to work well for small single files, but how can I test that it is safe?
Is NpReaderBatch safe or can read_index be modified at the same time by different processes?
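Not part of the original question, but a quick experiment can show what each worker actually receives. Under the default fork start method on Linux, every worker gets its own copy of module-level state, so mutations inside a worker never reach the parent or its siblings; the Counter class below is hypothetical, used only for illustration:

import os
from multiprocessing import Pool

class Counter:
    def __init__(self):
        self.count = 0

counter = Counter()

def bump(i):
    # Each worker mutates its *own* copy of the forked object.
    counter.count += 1
    return os.getpid(), counter.count

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(bump, range(8))
    print(results)        # several distinct pids; counts restart per process
    print(counter.count)  # still 0 in the parent

If the pids differ and the parent's count stays at 0, the readers are per-process copies, which is why read_index cannot be modified concurrently across processes.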

Related

How to periodically call instance method from a separate process

I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The Python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time

def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        with lock:
            buffer._refresh_data()
        time.sleep(10)

class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data

if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, they did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process, and queue.get in the main process. The data being sent must be serializable with Python's pickle module to pass between processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which will cause an exception to be raised if it times out), or using self.queue.get_nowait (which will raise an exception immediately if the queue is empty).
from __future__ import annotations

import multiprocessing as mp
import random
import time

class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()

if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized. The others will all be provided by the helper process running update.
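For reference, here is a sketch of the timeout and get_nowait variants mentioned above, written as a drop-in replacement for get_next_data in the Buffer class; the None fallback is my assumption, not part of the original answer:

from queue import Empty  # multiprocessing.Queue raises queue.Empty on timeout

def get_next_data(self, timeout=None):
    """Oldest queued item, or None if nothing arrives within `timeout` seconds."""
    try:
        # get(timeout=None) blocks forever; self.queue.get_nowait() would
        # raise Empty immediately instead of waiting.
        return self.queue.get(timeout=timeout)
    except Empty:
        return None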
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want updating the data not to block checking the data, you cannot use threading, since the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time in any given process. Instead, you need to use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects existed in 3.7 (see the 3.7 documentation for both).
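As a minimal sketch of that idea (not from the original answer; the refresher function and sleep intervals are assumptions), a child process can write into a multiprocessing.Value that the parent reads without blocking:

import multiprocessing as mp
import time

def refresher(shared):
    # Child process writes into shared memory visible to the parent.
    for i in range(5):
        with shared.get_lock():   # Value carries its own lock by default
            shared.value = i
        time.sleep(0.5)

if __name__ == '__main__':
    shared = mp.Value('i', -1)    # 'i' = C int, initial value -1
    proc = mp.Process(target=refresher, args=(shared,), daemon=True)
    proc.start()
    time.sleep(1.1)
    print(shared.value)           # reflects the child's latest write
    proc.join()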
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!

Avoid global variables for unpicklable shared state among multiprocessing.Pool workers

I frequently find myself writing programs in Python that construct a large (megabytes) read-only data structure and then use that data structure to analyze a very large (hundreds of megabytes in total) list of small records. Each of the records can be analyzed in parallel, so a natural pattern is to set up the read-only data structure and assign it to a global variable, then create a multiprocessing.Pool (which implicitly copies the data structure into each worker process, via fork) and then use imap_unordered to crunch the records in parallel. The skeleton of this pattern tends to look like this:
classifier = None

def classify_row(row):
    return classifier.classify(row)

def classify(classifier_spec, data_file):
    global classifier
    try:
        classifier = Classifier(classifier_spec)
        with open(data_file, "rt") as fp, \
             multiprocessing.Pool() as pool:
            rd = csv.DictReader(fp)
            yield from pool.imap_unordered(classify_row, rd)
    finally:
        classifier = None
I'm not happy with this because of the global variable and the implicit coupling between classify and classify_row. Ideally, I would like to write
def classify(classifier_spec, data_file):
    classifier = Classifier(classifier_spec)
    with open(data_file, "rt") as fp, \
         multiprocessing.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)
but this does not work, because the Classifier object usually contains objects which cannot be pickled (because they are defined by extension modules whose authors didn't care about that); I have also read that it would be really slow if it did work, because the Classifier object would get copied into the worker processes on every invocation of the bound method.
Is there a better alternative? I only care about 3.x.
This was surprisingly tricky. The key here is to preserve read-access to variables that are available at fork-time without serialization. Most solutions to sharing memory in multiprocessing end up serializing. I tried using a weakref.proxy to pass in a classifier without serialization, but that didn't work because both dill and pickle will try to follow and serialize the referent. However, a module-ref works.
This organization gets us close:
import multiprocessing as mp
import csv

def classify(classifier, data_file):
    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

def orchestrate(classifier_spec, data_file):
    # construct a classifier from the spec; note that we can
    # even dynamically import modules here, using config values
    # from the spec
    import classifier_module
    classifier_module.init(classifier_spec)
    return classify(classifier_module, data_file)

if __name__ == '__main__':
    list(orchestrate(None, 'data.txt'))
A few changes to note here:
we add an orchestrate method for some DI goodness; orchestrate figures out how to construct/initialize a classifier, and hands it to classify, decoupling the two
classify only needs to assume that the classifier parameter has a classify method; it doesn't care if it's an instance or a module
For this Proof of Concept, we provide a Classifier that is obviously not serializable:
# classifier_module.py
def _create_classifier(spec):
    # obviously not pickle-able because it's inside a function
    class Classifier():
        def __init__(self, spec):
            pass
        def classify(self, x):
            print(x)
            return x
    return Classifier(spec)

def init(spec):
    global __classifier
    __classifier = _create_classifier(spec)

def classify(x):
    return __classifier.classify(x)
Unfortunately, there's still a global in here, but it's now nicely encapsulated inside a module as a private variable, and the module exports a tight interface composed of the classify and init functions.
This design unlocks some possibilities:
orchestrate can import and init different classifier modules, based on what it sees in classifier_spec
one could also pass an instance of some Classifier class to classify, as long as this instance is serializable and has a classify method of the same signature
If you want to use forking, I don't see a way around using a global. But I also don't see a reason why you would have to feel bad about using a global in this case; you're not manipulating a global list with multi-threading or the like.
It's possible to cope with the ugliness in your example, though. You want to pass classifier.classify directly, but the Classifier object contains objects which cannot be pickled.
import os
import csv
import uuid

from threading import Lock
from multiprocessing import Pool
from weakref import WeakValueDictionary

class Classifier:
    def __init__(self, spec):
        self.lock = Lock()  # unpickleable
        self.spec = spec

    def classify(self, row):
        return f'classified by pid: {os.getpid()} with spec: {self.spec}', row
I suggest we subclass Classifier and define __getstate__ and __setstate__ to enable pickling. Since you're using forking anyway, the only state it has to pickle is information about how to get a reference to a forked global instance. Then we'll just update the pickled object's __dict__ with the __dict__ of the forked instance (which hasn't gone through the reduction of pickling) and your instance is complete again.
To achieve this without additional boilerplate, the subclassed Classifier instance has to generate a name for itself and register it as a global variable. This first reference will be a weak reference, so the instance can be garbage collected when the user expects it. The second reference is created by the user when they assign classifier = Classifier(classifier_spec). This one doesn't have to be global.
The generated name in the example below is created with the help of the standard library's uuid module. The UUID is converted to a string and edited into a valid identifier (it wouldn't have to be, but it's convenient for debugging in interactive mode).
class SubClassifier(Classifier):
    def __init__(self, spec):
        super().__init__(spec)
        self.uuid = self._generate_uuid_string()
        self.pid = os.getpid()
        self._register_global()

    def __getstate__(self):
        """Define pickled content."""
        return {'uuid': self.uuid}

    def __setstate__(self, state):
        """Set state in child process."""
        self.__dict__ = state
        self.__dict__.update(self._get_instance().__dict__)

    def _get_instance(self):
        """Get reference to instance."""
        return globals()[self.uuid][self.uuid]

    @staticmethod
    def _generate_uuid_string():
        """Generate id as valid identifier."""
        # return 'uuid_' + '123'  # for testing
        return 'uuid_' + str(uuid.uuid4()).replace('-', '_')

    def _register_global(self):
        """Register global reference to instance."""
        weakd = WeakValueDictionary({self.uuid: self})
        globals().update({self.uuid: weakd})

    def __del__(self):
        """Clean up globals when deleted in parent."""
        if os.getpid() == self.pid:
            globals().pop(self.uuid)
The sweet thing here is that the boilerplate is totally gone. You don't have to mess manually with declaring and deleting globals, since the instance manages everything itself in the background:
def classify(classifier_spec, data_file, n_workers):
    classifier = SubClassifier(classifier_spec)
    # assert globals()['uuid_123']['uuid_123']  # for testing
    with open(data_file, "rt") as fh, Pool(n_workers) as pool:
        rd = csv.DictReader(fh)
        yield from pool.imap_unordered(classifier.classify, rd)

if __name__ == '__main__':
    PATHFILE = 'data.csv'
    N_WORKERS = 4

    g = classify(classifier_spec='spec1', data_file=PATHFILE, n_workers=N_WORKERS)

    for record in g:
        print(record)

    # assert 'uuid_123' not in globals()  # no reference left
The multiprocessing.sharedctypes module provides functions for allocating ctypes objects from shared memory which can be inherited by child processes, i.e., parent and children can access the shared memory.
You could use
1. multiprocessing.sharedctypes.RawArray to allocate a ctypes array from the shared memory.
2. multiprocessing.sharedctypes.RawValue to allocate a ctypes object from the shared memory.
Dr Mianzhi Wang has written a very detailed document on this. You could share multiple multiprocessing.sharedctypes objects.
You may find the solution here useful.
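A minimal sketch of the RawArray approach (not from the original answer; the array size and worker function are assumptions, and it relies on the fork start method, the Linux default, so workers inherit the array):

import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray

# Allocated before the fork, so every worker inherits the same shared block.
shared = RawArray('d', 1000)   # 1000 C doubles, zero-initialized

def square(i):
    # Each worker reads/writes the inherited shared memory directly.
    shared[i] = float(i) ** 2
    return shared[i]

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        pool.map(square, range(1000))
    print(shared[10])  # 100.0 in the parent, written by a worker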

Should I access these attributes directly or rather use proxy methods?

My client API encapsulates connections to the server in a class ServerConnection that stores an asyncio.StreamReader/-Writer pair. (For simplicity, I will not use yield from or any other async technique in the code below because it doesn't matter for my question.)
Now, I wonder whether it's better to expose both streams as part of ServerConnection's public interface than to introduce read()/write() methods to ServerConnection which forward calls to the respective stream's methods. Example:
class ServerConnection:
    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer
vs.
class ServerConnection:
    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def read(self, size):
        return self._reader.read(size)

    def readline(self):
        return self._reader.readline()

    def write(self, data):
        return self._writer.write(data)

    def close(self):
        return self._writer.close()

    # …further proxy methods…
Pros/cons that I'm aware of so far:
Pros: ServerConnection is definitely some kind of bidirectional stream so it'd make sense to have it inherit from a StreamRWPair. Furthermore, it's much more convenient to use server_conn.read() than server_conn.reader.read().
Cons: I would need to define a whole lot of proxy methods. (In my case, I actually use some kind of buffered reader that has additional methods like peek() and so on.) So I thought about creating a base class StreamRWPair similar to Python's io.BufferedRWPair that my ServerConnection can then inherit from. The Python implementation of the io module, however, says the following about BufferedRWPair in the comments:
XXX The usefulness of this (compared to having two separate IO
objects) is questionable.
On top of that, my ServerConnection class has several other methods and I'm afraid that, by proxying, I will clutter its attribute namespace and make its usage less obvious.
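One common way to cut the proxy boilerplate, sketched here as a possibility rather than a recommendation, is to delegate unknown attribute lookups with __getattr__ (the two-stream lookup order is an assumption):

class ServerConnection:
    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Skip private
        # names to avoid recursion before _reader/_writer are set.
        if name.startswith('_'):
            raise AttributeError(name)
        # Try the reader first, then the writer, so read()/peek()
        # and write()/close() all resolve without explicit proxies.
        for stream in (self._reader, self._writer):
            if hasattr(stream, name):
                return getattr(stream, name)
        raise AttributeError(name)

This keeps the class namespace small, at the cost of making the delegated interface implicit rather than documented in the class body.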

Python Multiprocessing - apply class method to a list of objects

Is there a simple way to use Multiprocessing to do the equivalent of this?
for sim in sim_list:
    sim.run()
where the elements of sim_list are "simulation" objects and run() is a method of the simulation class which does modify the attributes of the objects. E.g.:
import subprocess

class simulation:
    def __init__(self):
        self.state = {'done': False}
        self.cmd = "program"
    def run(self):
        subprocess.call(self.cmd)
        self.state['done'] = True
All the sim in sim_list are independent, so the strategy does not have to be thread safe.
I tried the following, which is obviously flawed because the argument is passed by deepcopy and is not modified in-place.
from multiprocessing import Process

for sim in sim_list:
    b = Process(target=simulation.run, args=[sim])
    b.start()
    b.join()
One way to do what you want is to have your computing class (simulation in your case) be a subclass of Process. When initialized properly, instances of this class will run in separate processes and you can set off a group of them from a list just like you wanted.
Here's an example, building on what you wrote above:
import multiprocessing
import os
import random
import sys

class simulation(multiprocessing.Process):
    def __init__(self, name):
        # must call this before anything else
        multiprocessing.Process.__init__(self)
        # then any other initialization
        self.name = name
        self.number = 0.0
        sys.stdout.write('[%s] created: %f\n' % (self.name, self.number))

    def run(self):
        sys.stdout.write('[%s] running ... process id: %s\n'
                         % (self.name, os.getpid()))
        self.number = random.uniform(0.0, 10.0)
        sys.stdout.write('[%s] completed: %f\n' % (self.name, self.number))
Then just make a list of objects and start each one with a loop:
sim_list = []
sim_list.append(simulation('foo'))
sim_list.append(simulation('bar'))

for sim in sim_list:
    sim.start()
When you run this you should see each object run in its own process. Don't forget to call Process.__init__(self) as the very first thing in your class initialization, before anything else.
Obviously I've not included any interprocess communication in this example; you'll have to add that if your situation requires it (it wasn't clear from your question whether you needed it or not).
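If results do need to flow back to the parent, one common pattern, sketched below under the assumption that each simulation produces a single result, is to hand every Process a shared multiprocessing.Queue:

import multiprocessing
import random

class simulation(multiprocessing.Process):
    def __init__(self, name, results):
        multiprocessing.Process.__init__(self)
        self.name = name
        self.results = results  # shared queue for sending results back

    def run(self):
        # Compute in the child, then ship the result to the parent.
        self.results.put((self.name, random.uniform(0.0, 10.0)))

if __name__ == '__main__':
    results = multiprocessing.Queue()
    sims = [simulation(n, results) for n in ('foo', 'bar')]
    for s in sims:
        s.start()
    for _ in sims:
        print(results.get())  # drain one result per simulation before joining
    for s in sims:
        s.join()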
This approach works well for me, and I'm not aware of any drawbacks. If anyone knows of hidden dangers which I've overlooked, please let me know.
I hope this helps.
For those who will be working with large data sets, an iterable-based pool method would be the solution here:

import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    # imap is lazy, so the results must be consumed; note that each
    # worker operates on a pickled copy of the simulation object
    list(pool.imap(simulation.run, sim_list))

How to share object tree with process fork?

I don't have much experience with multithreading, and I'm trying to get something like the below working:
from multiprocessing import Process

class Node:
    def __init__(self):
        self.children = {}

class Test(Process):
    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def run(self):
        # infinite loop which does stuff to the tree
        self.tree.children[1] = Node()
        self.tree.children[2] = Node()

x = Node()
t = Test(x)
t.start()

print(x.children)  # random access to tree
I realize this shouldn't (and doesn't) work for a variety of very sensible reasons, but I'm not sure how to get it to work. Referring to the documentation, it seems that I need to do something with Managers and Proxies, but I honestly have no idea where to start, or whether that is actually what I'm looking for. Could someone provide an example of the above that works?
multiprocessing has limited support for implicitly shared objects, which can even share lists and dicts.
In general, multiprocessing is shared-nothing (after the initial fork) and relies on explicit communication between the processes. This adds overhead (how much really depends on the kind of interaction between the processes), but neatly avoids a lot of the pitfalls of multithreaded programming. The high-level building blocks of multiprocessing favor master/slave models (esp. the Pool class), with masters handing out work items, and slaves operating on them, returning results.
Keeping state in sync across several processes may, depending how often they change, incur a prohibitive overhead.
TL;DR: It can be done, but probably shouldn't.
import time, multiprocessing

class Test(multiprocessing.Process):
    def __init__(self, manager):
        super().__init__()
        self.quit = manager.Event()
        self.dict = manager.dict()

    def stop(self):
        self.quit.set()
        self.join()

    def run(self):
        self.dict['item'] = 0
        while not self.quit.is_set():
            time.sleep(1)
            self.dict['item'] += 1

m = multiprocessing.Manager()
t = Test(m)
t.start()

for x in range(10):
    time.sleep(1.2)
    print(t.dict)

t.stop()
The multiprocessing examples show how to create proxies for more complicated objects, which should allow you to implement the tree structure in your question.
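As a hedged sketch of that direction (the wholesale-reassignment scheme is my assumption, not the original poster's code), a manager dict can hold the tree, with the caveat that nested plain dicts must be reassigned rather than mutated in place for changes to propagate:

import multiprocessing

def grow(tree):
    # Assignments through the proxy are forwarded to the manager process,
    # so the parent sees them. In-place mutation of a nested plain dict
    # would NOT propagate; reassign whole subtrees instead.
    tree[1] = {'children': {}}
    tree[2] = {'children': {}}

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    tree = manager.dict()
    p = multiprocessing.Process(target=grow, args=(tree,))
    p.start()
    p.join()
    print(dict(tree))  # {1: {'children': {}}, 2: {'children': {}}}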
It seems to me that what you want is actual multithreading, rather than multiprocessing. With threads rather than processes, you can do precisely that, since threads run in the same process, sharing all memory and therefore data with each other.
