I am trying to divide a job that takes a 20k-item array, periodically processes each element of the array, and fills a global dict with the current results of that processing, into multiple processes.
My issue is the result dict: I need it to live in a single place so that, later on and periodically, I can send it in its entirety via an HTTP call.
My plan was to keep the dict in the main process/thread and divide the 20k items into chunks over 4 processes, each running about 500 threads, with each thread processing a number of items. But it seems I can't just pass a global variable to all processes and have it be filled: each process creates its own empty copy, and I get nothing in my global variable.
I had the idea of making each process send its results via HTTP to a server that would buffer them and then send the entire dict to the destination, but that would introduce huge latency, which is not desirable.
How can I achieve this division? Is there any way I can buffer the results coming from the multiple processes with as little latency as possible? The global variable must be a dict.
I believe you cannot share variables between subprocesses, or at least, it is not particularly easy. I am also not sure why you would need this for this problem.
Have you considered using Python's multiprocessing.Pool functionality? Its official documentation with examples can be found here.
Each worker process could handle a subset of your input. After execution, multiprocessing.Pool returns a list containing the output of each submitted task. You can merge those partial outputs into a single dict() once all the processes have finished, so all the data ends up in one place. Would that work?
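For instance, a rough sketch of that approach; process_item here is a stand-in for whatever the real per-item computation is:

import multiprocessing

def process_item(item):
    # placeholder for the real per-item computation
    return item, item * 2

def process_chunk(chunk):
    # each worker builds and returns a partial dict for its chunk
    return dict(process_item(item) for item in chunk)

if __name__ == '__main__':
    items = list(range(20000))
    chunk_size = 5000
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

    with multiprocessing.Pool(processes=4) as pool:
        partial_dicts = pool.map(process_chunk, chunks)

    # merge the partial results into the single dict held by the main process
    results = {}
    for partial in partial_dicts:
        results.update(partial)

The merged results dict then lives only in the main process, which is where you would fire the periodic HTTP call from.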
(Language is Python 3)
I am writing a program with the multiprocessing module and using Pool. I need a variable that is shared between all of the processes. The parent process will initialize this variable and pass it as an argument to p.map(), and I want the child processes to change it.
The reason is that the first part of each child process's work should be done in parallel (computational work that doesn't need any other process's data), but the second part has to be done in order, one process after another, because the processes are writing to a file and the contents of that file should be in order. I want each process to wait until the others are done before moving on.
I will record the "progress" of the entire program with the variable; e.g., when the first process is done writing to the file, it will increment the variable by one, which is the signal for the next process in line to begin writing. But I need some sort of waituntil() function to make the processes wait until the Value variable indicates that it is their "turn" to write to the file.
Here are my two problems:
I need a variable that the child processes can edit and whose current value they can actually read. What type of variable should I use? Should I use Value, Manager, or something else?
I need the processes to wait until the variable described above equals a certain value, signaling that it is their turn to write to the file. Is there any sort of waituntil() function that I can use?
What you are looking for is called Synchronization.
There are multitudes of different synchronization primitives to choose from.
You should never attempt to write synchronization primitives on your own, as it is non-trivial to do correctly!
In your case either an Event or a Condition might be suitable.
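As a rough sketch of how those primitives fit the "it's my turn" pattern, here is a multiprocessing.Value used as the turn counter and a multiprocessing.Condition used as the waituntil() mechanism. Plain Process objects are used for clarity; with a Pool you would hand the primitives to the workers via an initializer rather than through map():

import multiprocessing

def worker(worker_id, turn, condition):
    # first part: independent, parallel computation
    result = worker_id * worker_id

    # second part: wait until it is this worker's turn to write
    with condition:
        condition.wait_for(lambda: turn.value == worker_id)
        with open('output.txt', 'a') as f:
            f.write('worker %d: %d\n' % (worker_id, result))
        turn.value += 1          # hand the turn to the next worker
        condition.notify_all()

if __name__ == '__main__':
    turn = multiprocessing.Value('i', 0)
    condition = multiprocessing.Condition()
    procs = [multiprocessing.Process(target=worker, args=(i, turn, condition))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()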
def content_generator(applications, dict):
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'r') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
    do some aggregation on result
I have a really large dictionary whose keys are strings (application names) and whose values are information concerning each application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but all the Python processes were killed when the dictionary was too big. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they all occupy the same amount of resident memory (RES): 3.4 GB. This confuses me, since it seems the whole dictionary has been copied into each spawned process. I thought I had broken up the dictionary and was passing only what is relevant to each spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?
The comments are becoming impossible to follow, so I'm pasting in my important comment here:
On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.
Then some answers to questions:
So in Python 2.7, there is no way to spawn a new process?
On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
The big dictionary is passed in to a function as an argument and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?
As simple as this: make these two lines the first two lines in your program:
import multiprocessing
pool = multiprocessing.Pool()
You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
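Concretely, a sketch of that ordering for the code in the question (Python 2.7, as above; func_process_application also has to exist before the Pool is created so the forked workers can find it):

import multiprocessing
import pickle

def func_process_application(arg):
    app, info = arg
    # ... process one application ...
    return app

# create the pool first, while the parent's address space is still small
pool = multiprocessing.Pool()

# only now load the giant dict; the workers never inherit it
with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)

for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
    pass  # do some aggregation on result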
ANOTHER SUGGESTION
If you're not mutating the dict after it's created, try using this instead:
def content_generator(dict):
    for app in dict:
        yield app, dict[app]
That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
I have a Python function that creates an object instance and stores it in a global list, and this function is called by a thread. While the thread runs, the list is filled up as it should be, but when the thread exits the list is empty, and I have no idea why. Any help would be appreciated.
simulationResults = []

def run(width1, height1, seed1, prob1):
    global simulationResults
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    simulationResults.append(instance)
This is called in my main by:
for i in range(1, nsims + 1):
    simulations.append(multiprocessing.Process(target=run, args=(width, height, seed, prob)))
    simulations[(len(simulations) - 1)].start()

for i in simulations:
    i.join()
multiprocessing is based on processes, not threads. The important difference: Each process has a separate memory space, while threads share a common memory space. When first created, a process may (depending on OS, spawn method, etc.) be able to read the same values the parent process has, but if it writes to them, only the local values are changed, not the parent's copy. Only threads can rely on being able to access an arbitrary single shared global variable and have it behave as expected.
I'd suggest looking at either multiprocessing.Pool and its various methods to dispatch tasks and retrieve their results later, or if you must use raw Processes, look at the various ways to exchange data between processes; you can't just assign to a global variable, because globals stop being shared when the new Process is forked/spawned.
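A minimal sketch of the Pool version, reusing the names from the question and assuming Life instances can be pickled so they can travel back to the parent:

import multiprocessing

def run(args):
    width1, height1, seed1, prob1 = args
    instance = Life(width1, height1, seed1, prob1)
    instance.run()
    return instance                       # returned to the parent, not appended to a global

if __name__ == '__main__':
    jobs = [(width, height, seed, prob) for _ in range(nsims)]
    with multiprocessing.Pool() as pool:
        simulationResults = pool.map(run, jobs)   # one Life instance per simulation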
In your code you are creating new processes rather than threads. When a process is created, the new process gets its own copies of the variables in the main process, and those copies are independent of each other. I think it makes sense in your case to use processes rather than threads, because processes allow you to utilise multiple cores, as opposed to threads, which are limited to a single core due to the GIL.
You will have to use inter-process communication techniques to communicate between processes. But since in your case the processes are not persistent daemons, it would make sense to have each process write its simulationResults to a different, unique file and read them back from the main process.
If I have a list of, say, 4 items in Python and each thread accesses only 1 item in the list (for example, thread 1 is assigned to the first item, thread 2 to the 2nd item, etc.), will I have race conditions? Do I need to use mutexes?
Edit: I meant modify, sorry: each thread will modify 1 item in the list. I mentioned that I was using Python just for context.
Edit: @David Heffernan: The list is a global because I need to send it somewhere every 1 second, so I have 4 threads doing the required modification on each single item, and every 1 second I send it somewhere over HTTP in the main thread.
You won't have race conditions. CPython uses a global interpreter lock (GIL) to ensure atomicity of basic operations on data structures. However, the GIL also prevents Python code from really running concurrently - if you are looking to implement parallel execution, you probably want to explore multiprocessing.
So long as each thread operates on different items in the list, then there will be no data races.
However, if you can partition the data off so cleanly, wouldn't it make more sense for each thread to operate on its own private data? Why have a list, visible outside the worker threads, containing items that are private to each thread? It would be cleaner, and easier to reason about the code, if the data were private to the threads.
On the other hand your question edit suggests that you do have multiple threads accessing the same item of the list simultaneously. In which case there could well be a data race. Ultimately, without more detail nobody can say for sure whether or not your code is safe. In principle, so long as data is not shared between threads, then code will be threadsafe. But it's not possible to be sure that is the case with the problem description as stated.
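For what it's worth, the pattern described in the question's edits can be written so that each thread owns exactly one slot and only the main thread reads the whole list; compute_new_value and send_via_http below are hypothetical stand-ins for the real per-item work and the real HTTP call:

import threading
import time

items = [0, 0, 0, 0]                      # one slot per worker thread

def worker(index):
    while True:
        # each worker only ever writes to its own slot
        items[index] = compute_new_value(index)   # hypothetical per-item work
        time.sleep(0.1)

def sender():
    while True:
        snapshot = list(items)            # copy taken in the main thread
        send_via_http(snapshot)           # hypothetical HTTP call
        time.sleep(1)

for i in range(4):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

sender()

In CPython each items[index] = ... assignment is atomic, so the snapshot the sender takes always contains whole values, even without an explicit lock.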
A producer thread queries a data store and puts objects into a queue. Consumer threads then pull an object off the shared queue and make a very long call to an external service. When the call returns, the consumer marks the object as having been completed.
My problem is that I basically have to wait until the queue is empty before the producer can add to it again, or else I risk getting duplicates being sent through.
[edit] Someone asked a good question over IRC and I figured I would add the answer here. The question was, "Why do your producers produce duplicates?" The answer is basically that the producer produces duplicates because we don't track a "sending" state of each object, only "sent" or "unsent".
Is there a way that I can check for duplicates in the queue?
It seems to me like it's not really a problem to have duplicate objects in the queue; you just want to make sure you only do the processing once per object.
EDIT: I originally suggested using a set or OrderedDict to keep track of the objects, but Python has a perfect solution: functools.lru_cache
Use functools.lru_cache as a decorator on your worker function, and it will manage the cache for you. You can set a maximum size, and the cache will not grow beyond that size. If you use an ordinary set and don't manage it, it could grow to a very large size and slow down your workers.
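A sketch of the decorator approach, assuming the argument you pass (an object or its unique ID) is hashable; call_external_service is a hypothetical stand-in for the real long-running call:

import functools

@functools.lru_cache(maxsize=4096)
def process_object(object_id):
    # repeated calls with an object_id already in the cache return immediately,
    # so duplicates pulled off the queue are effectively ignored
    return call_external_service(object_id)   # hypothetical external call

Because the consumer threads live in the same process, they all share this one cache.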
If you are using multiple worker processes instead of threads, you would need a solution that works across processes. Instead of a set or an lru_cache you could use a shared dict where the key is the unique ID value you use to detect duplicates, and the value is a timestamp for when the object went into the dict; then from time to time you could delete the really old entries in the dict. Here's a StackOverflow answer about shared dict objects:
multiprocessing: How do I share a dict among multiple processes?
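A sketch of that cross-process variant, using multiprocessing.Manager; the 60-second expiry is an arbitrary choice for illustration, and obj.id stands for whatever unique ID you already have:

import multiprocessing
import time

def handle(seen, lock, obj):
    with lock:
        if obj.id in seen:
            return                        # duplicate, skip it
        seen[obj.id] = time.time()        # remember when we first saw it
    # ... do the long processing of obj here ...

def purge_old(seen, lock, max_age=60):
    # call this from time to time to drop the really old entries
    cutoff = time.time() - max_age
    with lock:
        for obj_id, stamp in list(seen.items()):
            if stamp < cutoff:
                del seen[obj_id]

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    seen = manager.dict()
    lock = manager.Lock()
    # pass seen and lock to each worker process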
And the rest of my original answer follows:
If so, I suggest you have the consumer thread(s) use a set to keep track of objects that have been seen. If an object is not in the set, add it and process it; if it is in the set, ignore it as a duplicate.
If this will be a long-running system, instead of a set, use an OrderedDict to track seen objects. Then from time to time clean out the oldest entries in the OrderedDict.
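A sketch of that bookkeeping, with an arbitrary cap of 10,000 remembered IDs; if several consumer threads share it, wrap the calls in a threading.Lock:

from collections import OrderedDict

seen = OrderedDict()
MAX_SEEN = 10000                          # arbitrary cap for illustration

def is_duplicate(obj_id):
    if obj_id in seen:
        return True
    seen[obj_id] = True
    while len(seen) > MAX_SEEN:
        seen.popitem(last=False)          # evict the oldest entry
    return False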
If you are talking about the classes in the Queue module: according to the API, there is no way to detect whether a queue contains a given object.
What do you mean by marking the object as having been completed? Do you leave the object in the queue and change a flag, or do you mean you mark the object as completed in the data store? If the former, how does the queue ever become empty? If the latter, why not remove the object from the queue before you start processing?
Assuming you want to be able to handle cases where the processing fails without losing data, one approach would be to create a separate work queue and processing queue. Then, when a consumer pulls a job from the work queue, they move it to the processing queue and start the long running call to an external service. When that returns, it can mark the data complete and remove it from the processing queue. If you add a field for when the data was put into the processing queue, you could potentially run a periodic job that checks for processing jobs that exceed a certain time and attempt to reprocess them (updating the timestamp before restarting).
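A sketch of that arrangement, with a dict standing in for the "processing queue" so entries can be removed by ID when they finish; job.id, call_external_service and mark_complete_in_datastore are illustrative names:

import queue
import threading
import time

work_queue = queue.Queue()
processing = {}                           # job_id -> (job, time it entered processing)
processing_lock = threading.Lock()

def consumer():
    while True:
        job = work_queue.get()
        with processing_lock:
            processing[job.id] = (job, time.time())
        call_external_service(job)        # the long-running external call
        mark_complete_in_datastore(job)
        with processing_lock:
            processing.pop(job.id, None)
        work_queue.task_done()

def requeue_stale(max_age=300):
    # run periodically: re-submit jobs that have been "processing" too long
    cutoff = time.time() - max_age
    with processing_lock:
        for job, started in list(processing.values()):
            if started < cutoff:
                processing[job.id] = (job, time.time())   # update the timestamp before restarting
                work_queue.put(job)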