Which functions and methods can I use inside a multiprocessing process? - python

I am using the multiprocessing module because I have two programs, one of which requires information from the other. With this module I can run them both simultaneously and keep feeding information between them.
With some basic code the multiprocessing worked just fine. However, copying my own programs into the process, which essentially keep building up information in matrix form, seems to be a problem: the code stops at the line with the append method. Is it possible that this method cannot be used inside a multiprocessing.Process?
If so, what can I use instead? Is there a list of functions and methods that are usable in these processes?
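For what it's worth, a minimal, hedged sketch of the setup described (build_matrix and the use of a Queue here are illustrative assumptions, not the original programs): append itself works fine inside a multiprocessing.Process; the result just has to be handed back to the parent explicitly.

import multiprocessing

def build_matrix(results):
    matrix = []
    for i in range(3):
        matrix.append([i, i + 1, i + 2])   # append is an ordinary method call here
    results.put(matrix)                    # hand the finished matrix back to the parent

if __name__ == "__main__":
    results = multiprocessing.Queue()
    p = multiprocessing.Process(target=build_matrix, args=(results,))
    p.start()
    print(results.get())                   # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
    p.join()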

Related

How can I save dynamically generated modules and reimport them from file?

I have an application that dynamically generates a lot of Python modules with class factories, to eliminate redundant boilerplate that makes the code hard to debug across similar implementations. It works well, except that the dynamic generation of the classes across the modules (hundreds of them) takes more time at load than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset), then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later? I already know that pickling and exporting as a JSON object won't work, because they make use of thread locks and other dynamic state variables, and the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge of how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global = (lambda: open(os.devnull, 'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]
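To make the "generate a .py and byte-compile it" route above concrete, here is a hedged sketch (write_pyc is an illustrative name; the 16-byte header layout assumed here is the one CPython 3.7+ uses):

import importlib.util
import marshal
import struct

def write_pyc(source, pyc_path):
    # Byte-compile generated source text and write a .pyc the import machinery can load.
    code = compile(source, "<generated>", "exec")
    with open(pyc_path, "wb") as f:
        f.write(importlib.util.MAGIC_NUMBER)                    # 4-byte magic for this interpreter
        f.write(struct.pack("<I", 0))                           # bit field: 0 = timestamp-based pyc
        f.write(struct.pack("<I", 0))                           # source mtime (no .py to compare against)
        f.write(struct.pack("<I", len(source.encode("utf-8")))) # source size in bytes
        marshal.dump(code, f)                                   # marshal does the heavy lifting

A .pyc written this way can be imported sourcelessly if it is placed where the .py would normally live, which is what removes the regeneration cost from startup.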

function seems to malfunction when called concurrently

I am facing a very weird problem (at least I think it is weird).
Basically I have a very simple function which is called from multiple threads concurrently. While calling the function from a single sequential thread seems to work correctly, when I call the function concurrently it seems to mess up data from different callers.
The actual code is hard to replicate standalone; I just wondered what the lifetime/scope of Python data structures is. If I call the same function multiple times concurrently, their internal variables/arguments are independent from one another, right?
I seem to remember I ran into a similar problem when I was using recursive functions, where a function's data would persist from call to call. Could this be the case here?
Thanks
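For reference, a hedged illustration (not the asker's actual code) of the difference between per-call locals and the one classic way data does persist from call to call, namely a mutable default argument:

import threading

def independent(x):
    local = []            # a fresh list on every call: concurrent calls cannot interfere
    local.append(x)
    return local

def shared(x, acc=[]):    # the default list is created once and reused by every call
    acc.append(x)
    return acc

threads = [threading.Thread(target=shared, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared(99))         # accumulates across calls and threads, e.g. [0, 1, 2, 99]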

Why do we need gevent.queue?

My understanding of Gevent is that it's merely concurrency and not parallelism. My understanding of concurrency mechanisms like Gevent and AsyncIO is that nothing in the Python application is ever executing at the same time.
The closest you get is calling a non-blocking IO method, and while waiting for that call to return, other methods within the Python application are able to execute. Again, no two methods within the Python application ever actually execute Python code at the same time.
With that said, why is there a need for gevent.queue? It sounds to me like the Python application doesn't really need to worry about more than one Python method accessing a queue instance at a time.
I'm sure there's a scenario that I'm not seeing that gevent.queue fixes, I'm just curious what that is.
Although you are right that no two statements execute at the same time within a single Python process, you might want to ensure that a series of statements executes atomically, or you might want to impose an order on the execution of certain statements, and in those cases things like gevent.queue become useful.
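A minimal sketch of that ordering role, assuming gevent is installed: the queue lets one greenlet block on get() until another greenlet has put something, even though only one greenlet ever runs at a time.

import gevent
from gevent.queue import Queue, Empty

tasks = Queue()

def producer():
    for n in range(3):
        tasks.put(n)               # wakes any consumer blocked on get()
        gevent.sleep(0)            # yield to the hub so the consumer gets a turn

def consumer():
    while True:
        try:
            n = tasks.get(timeout=1)   # blocks this greenlet only, not the whole process
        except Empty:
            break
        print("consumed", n)

gevent.joinall([gevent.spawn(producer), gevent.spawn(consumer)])

Without the queue you would have to hand-roll the same blocking and wake-up logic with events or ad-hoc flags; the queue packages that ordering (and the atomicity of each put/get) for you.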

How do I use Google Storage in Flask?

The example code makes use of this oauth2_client which it immediately locks. The script does not work without these lines. What's the correct way to integrate this into a Flask app? Do I have to manage these locks? Does it matter if my web server spawns multiple threads? Or if I'm using gunicorn+gevent? Is there documentation on this anywhere?
It's not actually locking; it's just instantiating a lock object inside the module. The lock is acquired and released internally by oauth2_client; you don't need to manage it yourself. You can see this by looking at the source code here: https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/third_party/oauth2_plugin/oauth2_client.py
In fact, based on the source code linked above, you should be able to simply call oauth2_client.InitializeMultiprocessingVariables() instead of the try/except block, since that is ultimately doing almost the exact same thing.
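If that holds, the Flask side should reduce to a one-time call at startup. A hedged sketch, assuming the import path matches the linked gsutil tree (it may differ in your installation):

from flask import Flask
from gslib.third_party.oauth2_plugin import oauth2_client   # path assumed from the linked source

oauth2_client.InitializeMultiprocessingVariables()  # replaces the manual try/except lock setup

app = Flask(__name__)

@app.route("/")
def index():
    # oauth2_client acquires and releases its own locks internally from here on
    return "ok"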

How to convert Python threading code to multiprocessing code?

I need to convert a threading application to a multiprocessing application for multiple reasons (GIL, memory leaks). Fortunately the threads are quite isolated and only communicate via Queue.Queues. This primitive is also available in multiprocessing so everything looks fine. Now before I enter this minefield I'd like to get some advice on the upcoming problems:
How to ensure that my objects can be transferred via the Queue? Do I need to provide some __setstate__?
Can I rely on put returning instantly (like with threading Queues)?
General hints/tips?
Anything worthwhile to read apart from the Python documentation?
Answer to part 1:
Everything that has to pass through a multiprocessing.Queue (or Pipe or whatever) has to be picklable. This includes basic types such as tuples, lists and dicts. Classes are also supported if they are top-level and not too complicated (check the pickle documentation for the details). Trying to pass lambdas around will fail, however.
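On the __setstate__ question, a minimal sketch (LogWriter is an invented example, assuming an instance that holds something unpicklable such as an open file handle): strip the offending attribute in __getstate__ and rebuild it in __setstate__ so instances survive the trip through a multiprocessing.Queue.

class LogWriter:
    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")        # open file handles cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["fh"]                  # drop the unpicklable attribute before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.fh = open(self.path, "a")   # recreate it on the receiving side

The unmodified class would raise a pickling error as soon as an instance is put on the queue; with these two methods the instance goes through and the file handle is reopened in the receiving process.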
Answer to part 2:
A put consists of two parts: it takes a semaphore to modify the queue, and it optionally starts a feeder thread. So if no other Process tries to put to the same Queue at the same time (for instance because there is only one Process writing to it), it should be fast. For me it turned out to be fast enough for all practical purposes.
Partial answer to part 3:
The plain multiprocessing.Queue lacks a task_done method, so it cannot be used as a drop-in replacement directly. (The JoinableQueue subclass provides it; see the sketch after this answer.)
The old processing.queue.Queue lacked a qsize method, and the qsize of the newer multiprocessing version is inaccurate (just keep this in mind).
Since file descriptors are normally inherited on fork, care needs to be taken about closing them in the right processes.
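As a hedged sketch of the subclass mentioned in the first point: multiprocessing.JoinableQueue does provide task_done() and join(), so it can stand in for a threading queue used in a producer/worker pattern (worker and the None sentinel below are illustrative choices, not part of any API):

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        if item is None:                 # sentinel: tell the worker to stop
            q.task_done()
            break
        print("processed", item)
        q.task_done()

if __name__ == "__main__":
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for i in range(5):
        q.put(i)
    q.put(None)
    q.join()                             # blocks until every item put has been task_done'd
    p.join()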
