Multiprocess a function that calls another function in Python

I'm trying to speed up a Python program. I noticed that there is a thread that is always running, scanning input from an external resource; when it gets something, it calls another function that parses the input data and returns understandable information (the parsing function also uses other functions).
A simple model of the scanning() function:
def scanning(x):
    alpha = GetSomething(x)
    if alpha != 0:
        print(Parsing(alpha))
So my idea is to convert this thread into a process that will run in parallel with the main process, and when it gets something it will send it through a Queue to the main process, which should then call the parsing function.
My questions are: is it possible to keep the scanning() function as it is and use it inside a process (even if it calls other functions)?
If not, what modifications to the structure of the scanning() function are required so it can be used conveniently with the multiprocessing module?
What is the proper way to multiprocess a function that calls other functions in Python?

Short answer: yes, it is possible.
To understand why, you need to understand one thing about multiprocessing: it does not move the invoked function into a separate process by itself; it creates a full replica of your entire process, including its code, loaded modules, and any global data that was initialized before you forked your processes.
So if your code has some sub-functions defined, they will be available to your function after it has been split off into a separate process, along with any data that was pre-initialized. Any modifications to values, functions and namespaces of your main process after forking will not affect the forked process at all; you need to use special tools to communicate between processes.
So, let's suppose you have the following abstract code:
import SomeModule
define SomeFunction()
assign SomeValue

define ChildProcess():
    call SomeFunction()
    increase SomeValue
    do ChildProcessStuff

start ChildProcess()
decrease SomeValue
do MainProcessStuff
For both the main and the spawned process, your code executes identically until the line start ChildProcess(). After this line your process splits into two, which are fully identical at first but have different points of execution. The main process goes past this line and proceeds straight to do MainProcessStuff, while the child process will never reach that line. Instead, it gets a replica of the entire namespace and starts executing ChildProcess() as if it were called like a normal function followed by an exit().
Note how both the main and the child process have access to SomeValue. Also note how their changes to it are independent, as they make them in different namespaces (and therefore to different SomeValues). This wouldn't be the case with the threading module, which does not split the namespaces, and it's an important distinction.
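To make that concrete, here is a minimal runnable sketch of the abstract example above (the names are illustrative, not from the question):
import multiprocessing

some_value = 10   # plays the role of SomeValue

def child_process():
    global some_value
    some_value += 1                                   # changes only the child's copy
    print("child sees some_value =", some_value)      # prints 11

if __name__ == '__main__':
    proc = multiprocessing.Process(target=child_process)
    proc.start()          # under fork the namespace is copied here; under spawn it is rebuilt by re-import
    some_value -= 1                                   # changes only the parent's copy
    proc.join()
    print("parent sees some_value =", some_value)     # prints 9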
Also note that the main process never executes the code in ChildProcess, but it retains a reference to it, which can be used to track its progress, terminate it prematurely, etc.
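Applied to the question, a minimal sketch might look like the following. GetSomething and Parsing here are placeholder stand-ins for the question's own functions; the real ones (and anything they call) will be available in the child process for the reasons described above:
import multiprocessing
import random
import time

def GetSomething(x):
    # Placeholder for the real input-scanning helper from the question.
    time.sleep(0.5)
    return random.choice([0, x])

def Parsing(alpha):
    # Placeholder for the real parsing function (which may call other functions).
    return "parsed: {}".format(alpha)

def scanning(x, queue):
    # Runs unchanged in its own process; whatever it finds is handed back
    # to the main process through the queue instead of being parsed here.
    while True:
        alpha = GetSomething(x)
        if alpha != 0:
            queue.put(alpha)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    scanner = multiprocessing.Process(target=scanning, args=(42, queue), daemon=True)
    scanner.start()
    for _ in range(3):                 # the main process parses whatever arrives
        print(Parsing(queue.get()))
    scanner.terminate()                # the retained reference lets us stop it
    scanner.join()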
You might also be interested in more in-depth information about Python threads and processes here.

Related

How to make cmds.duplicate() execute immediately when called in maya

How can I make cmds.duplicate execute immediately when called in Maya, instead of waiting for the entire script to run and then executing everything in a batch? For example, with the script below, all of the results appear only after the whole script has finished executing:
import time
import maya.cmds as cmds
import pymel.core as pm

for i in range(1, 6):
    pm.select("pSphere{}".format(i))
    time.sleep(0.5)
    cmds.duplicate()
I have tried to use Python multithreading, like this:
import threading
import time
import maya.cmds as cmds

def test():
    for i in range(50):
        cmds.duplicate('pSphere1')
        time.sleep(0.1)

thread = threading.Thread(target=test)
thread.start()
# thread.join()
Sometimes this succeeds, but sometimes it crashes Maya. If the main thread joins, it does not achieve the desired effect. When I do a large number of cmds.duplicate calls, it results in very high memory consumption and the program runs more and more slowly. In addition, all duplicate results appear together only after the entire Python script runs, so I suspect that when I call cmds.duplicate, Maya does not finish executing and outputting the command, but temporarily puts the results in a container of variable capacity. As my calls increase, the dynamic expansion of that container makes the program slower and slower, and memory consumption increases dramatically. Since I have seen other plug-ins show command execution results in real time, I assume there is a proper way to do this that I just haven't found yet.
Your assumptions are not correct. Maya does not need to display anything to complete a tool. If you want to see the results in between, you can try to use:
pm.refresh()
but this will not change the behaviour in general. I suppose your memory problems have a different source. You could check if it helps to turn off history or the undo queue temporarily.
And of course Ennakard is right with the answer that most Maya commands are not thread-safe unless mentioned in the docs. Every node creation and modification has to be done in the main thread.
The simple answer is you don't: Maya commands in general, and most interaction with Maya, are not thread-safe.
Threading is usually used for data manipulation before it gets used to manipulate anything in Maya, but once you start creating nodes, setting attributes, or making any other Maya modification, no threading.
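If part of the work really has to happen on a background thread, one commonly suggested pattern is to do only the thread-safe preparation there and hand the actual scene modification back to Maya's main thread, for example via maya.utils.executeDeferred. A rough sketch, assuming a normal interactive Maya session:
import threading
import maya.utils
import maya.cmds as cmds

def duplicate_on_main_thread(name):
    # Node creation/modification must run on Maya's main thread.
    cmds.duplicate(name)

def background_work():
    for _ in range(50):
        # Thread-safe preparation could happen here; the Maya call itself is
        # deferred to the main thread's idle queue.
        maya.utils.executeDeferred(duplicate_on_main_thread, 'pSphere1')

threading.Thread(target=background_work).start()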

Make two competing functions and kill the slow one

In Python, I have to fetch crypto data from Binance every minute and do some calculations. For fetching data I have two functions, func_a() and func_b(). They both do the same thing but in wildly different ways. Sometimes func_a is faster and sometimes func_b is faster. I want to run both functions in parallel, and whichever returns a result first, I want to kill the other one and move on (because they are both going to bring back the same result).
How can I achieve this in Python? Please note that I do not want to replace these functions or their mechanics.
Python threads aren't very suitable for this purpose for two reasons:
The Python GIL means that if you spawn two CPU-bound threads, each of the two threads will run at half its normal speed (because only one thread is actually running at any given instant; the other is waiting to acquire the interpreter lock)
There is no reliable way to unilaterally kill a thread, because if you do that, any resources it had allocated will be leaked, causing major problems.
If you really want to be able to cancel a function-in-progress, then, you have two options:
Modify the function to periodically check a "please_quit" boolean variable (or whatever) and return immediately if that boolean's state has changed to True. Then your main thread can set the please_quit variable and then call join() on the thread, and rest assured that the thread will quit ASAP. (This does require that you have the ability to modify the function's implementation)
Spawn child processes instead of child threads. A child process takes more resources to launch, but it can run truly in parallel (since it has its own separate Python interpreter) and it is safe (usually) to unilaterally kill it, because the OS will automatically clean up all of the process's held resources when the process is killed.
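A minimal sketch of this second option, with placeholder stand-ins for the question's func_a and func_b:
import multiprocessing
import time

def func_a():
    # Placeholder for the first fetcher from the question.
    time.sleep(1.0)
    return "data from func_a"

def func_b():
    # Placeholder for the second fetcher from the question.
    time.sleep(2.0)
    return "data from func_b"

def run_and_report(func, queue):
    # Wrapper so each child process can push its result onto the shared queue.
    queue.put(func())

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=run_and_report, args=(f, queue))
               for f in (func_a, func_b)]
    for w in workers:
        w.start()
    result = queue.get()        # blocks until the faster process delivers
    for w in workers:
        w.terminate()           # kill the slower one; the OS reclaims its resources
        w.join()
    print(result)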

multiprocessing in python - what gets inherited by forkserver process from parent process?

I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in worker processes.
I am using Python 3.6.4, but the documentation should be the same; from https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods it says that:
The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Also, it says:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So apparently a key object that my worker processes need to work on did not get inherited by the server process and then passed on to the workers. Why did that happen? I wonder what exactly gets inherited by the forkserver process from the parent process?
Here is what my code looks like:
import multiprocessing
# import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
        return [item, info]

if __name__ == '__main__':
    result = []
    largeObject = ...  # This is my large object; it's read-only and no modification will be made to it.
    nameList = ...     # A list of items; I need to get info for each of them from largeObject.
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_context())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)
Thank you!
Theory
Below is an excerpt from Bojan Nikolic's blog:
Modern Python versions (on Linux) provide three ways of starting the separate processes:
Fork()-ing the parent process and continuing with the same process image in both parent and child. This method is fast, but potentially unreliable when the parent state is complex.
Spawning the child processes, i.e., fork()-ing and then execv to replace the process image with a new Python process. This method is reliable but slow, as the process image is reloaded afresh.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
Forkserver
The third method, forkserver, works as described below. Note that children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust it through the multiprocessing API via the set_forkserver_preload() method.
Practice
Thus, if you want something to be inherited by child processes from the parent, it must be specified in the forkserver state by means of set_forkserver_preload(modules_names), which sets the list of module names to try to load in the forkserver process. An example is given below:
# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}
# main.py
import multiprocessing
import os
from time import sleep
from inherited import large_obj

def worker_func(key: str):
    print('PID={}, obj id={}'.format(os.getpid(), id(large_obj)))
    sleep(1)
    return large_obj[key]

if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)
Output:
# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024
And if we don't use preloading
...
ctx_in_main = multiprocessing.get_context('forkserver')
# ctx_in_main.set_forkserver_preload(['inherited'])
cores = ctx_in_main.cpu_count()
...
# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912
So, after an inspiring discussion with Alex, I think I have sufficient info to address my question: what exactly gets inherited by the forkserver process from the parent process?
Basically, when the server process starts, it imports your main module and everything before if __name__ == '__main__' is executed. That's why my code doesn't work: large_object is nowhere to be found in the server process, nor in any of the worker processes forked from the server process.
Alex's solution works because large_object now gets imported into both the main and the server process, so every worker forked from the server also gets large_object. If combined with set_forkserver_preload(modules_names), all workers might even get the same large_object, from what I saw. The reason for using forkserver is explicitly explained in the Python documentation and in Bojan's blog:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
So the concern here is more about being on the safe side.
On a side note, if you use fork as the start method you don't need to import anything, since every child process gets a copy of the parent process's memory (or a reference, if the system uses COW, copy-on-write; please correct me if I am wrong). In this case, using global large_object gives you access to large_object in worker_func directly.
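A minimal sketch of that fork-based variant; large_object here is a small stand-in for the real object, and the fork start method is assumed to be available (e.g. on Linux):
import multiprocessing

large_object = {"one": 1, "two": 2, "three": 3}   # stand-in for the real, expensive object

def worker_func(key):
    # With fork, the child inherits a (copy-on-write) copy of the parent's
    # memory, so the module-level large_object is directly visible here.
    return key, large_object[key]

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool(processes=3) as pool:
        for key, value in pool.imap_unordered(worker_func, ["one", "two", "three"]):
            print(key, value)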
The forkserver approach might not be suitable for me, because the issue I am facing is memory overhead. All the operations that get me large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.
If I put all those calculations directly into inherited.py as Alex suggested, they will be executed twice (once when I import the module in main and once when the server imports it; maybe even more when the worker processes are born?). This is suitable if I just want a single-threaded, safe process for workers to fork from, but since I am trying to get workers to inherit only large_object and no unnecessary resources, it won't work.
And putting those calculations inside if __name__ == '__main__' in inherited.py won't work either, since now none of the processes will execute them, including main and server.
So, in conclusion, if the goal is for workers to inherit minimal resources, I am better off breaking my code in two: run calculation.py first, pickle the large_object, exit the interpreter, and start a fresh one to load the pickled large_object. Then I can just go nuts with either fork or forkserver.
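A minimal sketch of that two-step approach; the file names and the trivial build step are illustrative only:
# calculation.py -- run first: build large_object, pickle it, then exit.
import pickle

def expensive_build():
    # Stand-in for the memory-hungry work that produces large_object.
    return {"one": 1, "two": 2, "three": 3}

if __name__ == '__main__':
    with open("large_object.pkl", "wb") as f:
        pickle.dump(expensive_build(), f)

# main.py -- run in a fresh interpreter: load the pickle, then start workers.
import multiprocessing
import pickle

with open("large_object.pkl", "rb") as f:
    large_object = pickle.load(f)     # loaded once, before any worker exists

def worker_func(key):
    return key, large_object[key]

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')    # workers inherit the loaded object
    with ctx.Pool(processes=3) as pool:
        for key, value in pool.imap_unordered(worker_func, ["one", "two", "three"]):
            print(key, value)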

How can I ensure that only one process is running a function in python multiprocess?

I have a function that is invoked by potentially multiple processes created with multiprocessing. I want to ensure not serialization, but single execution by the original process; that is, only the main process should perform some logic and the others should do nothing.
One option is to use an RLock with blocking=False, but this does not guarantee that the main process will perform the execution. I don't want to differentiate on current_process().name because it just doesn't feel right, and as far as I understand the name is arbitrary and not necessarily unique anyway.
Is there a more elegant way to ensure this? In MPI I used to do it with the id.
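For reference, the non-blocking lock idea mentioned above looks roughly like the sketch below; as noted, whichever process happens to acquire the lock first performs the logic, and that is not guaranteed to be the main process:
import multiprocessing

def maybe_run_once(lock):
    # acquire(block=False) returns immediately; only the first process to get
    # the lock performs the one-off logic, the others skip it.
    name = multiprocessing.current_process().name
    if lock.acquire(block=False):
        print(name, "runs the one-off logic")
    else:
        print(name, "skips it")

if __name__ == '__main__':
    lock = multiprocessing.RLock()
    procs = [multiprocessing.Process(target=maybe_run_once, args=(lock,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()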

Get Data from Other Processes using Multiprocessing

(Language is Python 3)
I am writing a program with the multiprocessing module and using Pool. I need a variable that is shared between all of the processes. The parent process will initialize this variable and pass it as an argument to p.map(), and I want the child processes to change it. The reason is that the first part of the child processes' work should be done in parallel (computational work that doesn't need any other process's data), but the second part needs to be done in order, one process after another, because they are writing to a file and the contents of that file should be in order. I want each process to wait until the ones before it are done. I will record the "progress" of the entire program with the variable; e.g., when the first process is done writing to the file, it will increment the variable by one, and I want this to be the signal for the next process in line to begin writing to the file. But I need some sort of waituntil() function to make the processes wait until the Value variable indicates that it is their "turn" to write to the file.
Here are my two problems:
I need a variable that the child processes can edit, and the child processes can actually get the value of that variable. What type of variable should I use? Should I use Value, Manager, or something else?
I need the processes to wait until the variable described above equals a certain value, signaling that it is their turn to write to the file. Is there any sort of waituntil() function that I can use?
What you are looking for is called Synchronization.
There are multitudes of different synchronization primitives to choose from.
You should never attempt to write synchronization primitives on your own, as it is non-trivial to do correctly!
In your case either an Event or a Condition might be suitable.
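For the turn-taking described in the question, a Condition paired with a shared Value gives a waituntil()-like behaviour; a minimal sketch, with the worker bodies as placeholders:
import multiprocessing

def worker(rank, progress, cond):
    # ... the parallel part of the computation would go here ...
    with cond:
        # Block until it is this worker's turn (progress counts finished writers).
        cond.wait_for(lambda: progress.value == rank)
        with open("output.txt", "a") as f:
            f.write("result from worker {}\n".format(rank))
        progress.value += 1          # hand the turn to the next worker
        cond.notify_all()

if __name__ == '__main__':
    progress = multiprocessing.Value('i', 0)    # shared, process-safe counter
    cond = multiprocessing.Condition()
    procs = [multiprocessing.Process(target=worker, args=(r, progress, cond))
             for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()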
