I am having problems with ProcessPoolExecutor from concurrent.futures - python

I have a big program that takes a while to run its calculations. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used. After getting no improvement with multithreading, I decided to try multiprocessing, but whenever I use it I get a lot of errors, even on very simple code.
This is the code I tested after I started having problems with my big, calculation-heavy code:
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
and the error message that I am having says:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
This is not the whole message because it is very long, but I think this part may be helpful. Pretty much everything else in the error message is just lines like "error at line ... in ...".
If it may be helpful, the big code is at https://github.com/nobody48sheldor/fuseeinator2.0 (it might not be the latest version).

I updated your code to show main being called. This is an issue with operating systems that spawn new processes, like Windows. To test on my Linux machine I had to add a bit of code, but with that added, this crashes on my machine too:
# Test code to make Linux spawn like Windows and generate the error. This
# code is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')

# test script
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
On a spawning system, Python can't simply fork into a new execution context. Instead, it starts a new instance of the Python interpreter, imports the module, and pickles/unpickles enough state to create the child execution environment. This can be a very heavy operation.
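To see the re-import in action, here is a small experiment of my own (file name and prints are mine, not from the original post): under 'spawn', the module-level print fires once in the parent and once more in every worker, because each worker re-imports the module.

# spawn_demo.py -- illustrative sketch, assuming the 'spawn' start method
import os
from concurrent.futures import ProcessPoolExecutor

print(f"module imported in pid {os.getpid()}")  # runs in the parent AND in each worker

def func():
    print(f"task ran in pid {os.getpid()}")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as ex:
        ex.submit(func).result()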
But your script is not import-safe. Since main() is called at module level, the import in the child would run main() again. That would create a grandchild subprocess which runs main() again (and so on, until your machine hangs). Python detects this would-be infinite loop and displays the message above instead.
Top-level scripts are always named "__main__". Put all of the code that should run only at script level inside an if; then, if the module is imported, nothing harmful is run.
if __name__ == "__main__":
    main()
and the script will work.
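For completeness, a minimal sketch of the whole test script with the guard in place (the with-block and the .result() calls are my additions, so the parent waits for the workers and shuts the pool down cleanly):

from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    # the with-block joins the workers and releases resources on exit
    with ProcessPoolExecutor(max_workers=3) as executor:
        p1 = executor.submit(func)
        p2 = executor.submit(func_)
        p1.result()  # wait, so the "done" prints actually appear
        p2.result()

if __name__ == "__main__":
    main()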
There are code analyzers out there that import modules to extract doc strings, or other useful stuff. Your code shouldn't fire the missiles just because some tool did an import.
Another way to solve the problem is to move everything multiprocessing-related out of the script and into a module. Suppose I had a module with your code in it:
whatever.py

from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py

#!/usr/bin/env python3
import whatever

whatever.main()
Now, since the pool is already in an imported module that doesn't do this crazy restart-itself thing, no if __name__ == "__main__": is necessary. It's a good idea to put it in myscript.py anyway, but it's not required.

Related

multiprocessing pool example does not work and freeze the kernel

I'm trying to parallelize a script, but for an unknown reason the kernel just freezes without any errors being thrown.
Minimal working example:
from multiprocessing import Pool

def f(x):
    return x*x

p = Pool(6)
print(p.map(f, range(10)))
Interestingly, everything works fine if I define my function in another file and then import it. How can I make it work without needing another file?
I work with Spyder (Anaconda), and I get the same result if I run my code from the Windows command line.
This happens because you didn't protect the "procedural" part of your code from re-execution when your child processes import f.
They need to import f because Windows doesn't support fork as the start method for new processes (only spawn). A new Python process has to be started from scratch and f imported, and this import will also trigger another Pool to be created in all child processes ... and their child processes, and their child processes...
To prevent this recursion, you have to insert an if __name__ == '__main__': line between the upper part, which should run on import, and the lower part, which should run only when your script is executed as the main script (only the case for the parent).
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':  # protect your program's entry point
    p = Pool(6)
    print(p.map(f, range(10)))
Separating your code like that is mandatory for multiprocessing on Windows, and on Unix-y systems whenever the 'spawn' or 'forkserver' start method is used instead of the default 'fork'. In general, the start method can be changed with multiprocessing.set_start_method(method).
Since Python 3.8, macOS also uses 'spawn' instead of 'fork' as default.
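If you want to experiment with start methods yourself, here is a small sketch (assuming Python 3.4+, where these APIs exist):

import multiprocessing as mp

def f(x):
    return x*x

if __name__ == '__main__':
    print(mp.get_start_method())   # e.g. 'spawn' on Windows/macOS, 'fork' on Linux
    # get_context creates an isolated context, so you can force 'spawn'
    # without changing the global default:
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as p:
        print(p.map(f, range(4)))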
It's generally good practice to separate any script into an upper "definitions" part and a lower "execution as main" part, so the code is importable without unnecessarily executing parts that are only relevant when it runs as the top-level script. Last but not least, it is easier to follow your program's control flow when you don't intermix definitions and executions.
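One possible shape for that split, sketched out (the main() wrapper is just a common convention, not a requirement):

from multiprocessing import Pool

# ---- definitions: safe to run on import ----
def f(x):
    return x*x

def main():
    with Pool(6) as p:
        print(p.map(f, range(10)))

# ---- execution as main: runs only in the parent ----
if __name__ == '__main__':
    main()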

Is it possible to use multiprocessing in a module with windows?

I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script with the if __name__ == '__main__' safeguard. However, this particular program was meant to be distributed and imported as a module, so it'd be kind of clunky to have the user copy and paste that safeguard, and that is something I'd really like to avoid.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?
For everyone still searching:
inside the module (data.py, given the import below):
from multiprocessing import Process

def printing(a):
    print(a)

def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            var["process"][i] = Process(target=printing, args=(str(i),))  # args must be a tuple
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()  # join needs the parentheses to actually be called
inside main.py
import data

name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in comments, perhaps you could do something like
# client_main.py
from mylib.mpSentinel import MPSentinel

# client logic
if __name__ == "__main__":
    MPSentinel.As_master()
# mpsentinel.py
class MPSentinel(object):
    _is_master = False

    @classmethod
    def As_master(cls):
        cls._is_master = True

    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it works around Windows' lack of fork. You could still use MPSentinel.Is_master() to use multiprocessing optionally, and it should prevent Windows from process-bombing.
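For illustration, library code could then consult the sentinel before spawning; a hypothetical sketch building on the class above (compute and f are invented names):

from multiprocessing import Pool
from mylib.mpSentinel import MPSentinel

def f(x):
    return x*x

def compute(values):
    if MPSentinel.Is_master():
        with Pool(4) as p:   # only the real entry point ever gets here
            return p.map(f, values)
    return [f(v) for v in values]  # child imports fall back to serial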
On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of doing that is in the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method

def foo():
    print('hello')

if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process

def foo():
    print('hello')

def bar():
    p = Process(target=foo)
    p.start()
When this module is imported, it will define two new functions, not run them.
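A caller would then invoke bar() from a guarded entry point, for example (the module name procs is assumed here):

import procs  # the module defined above, saved as procs.py

if __name__ == '__main__':
    procs.bar()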
I've been developing an Instagram image scraper, so in order to make the download & save operations run faster I've implemented multiprocessing in an auxiliary module. Note that this code lives inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it's actually working!
import multiprocessing
import requests

def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__':  # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved

def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True)  # Download
    open(f"scraped_data\\{profile}\\{profile}-{img_saved}.jpg", 'wb').write(file.content)  # Save

multiprocessing.Pool in jupyter notebook works on linux but not windows

I'm trying to run a few independent computations (though reading from the same data). My code works when I run it on Ubuntu, but not on Windows (Windows Server 2012 R2), where I get the error:
'module' object has no attribute ...
when I try to use multiprocessing.Pool (it appears in the kernel console, not as output in the notebook itself)
(And I've already made the mistake of defining the function AFTER creating the pool, and I've also corrected it, that's not the problem).
This happens even on the simplest of examples:
from multiprocessing import Pool

def f(x):
    return x**2

pool = Pool(4)
for res in pool.map(f, range(20)):
    print(res)
I know that it needs to be able to import the module (and I have no idea how this works when working in the notebook), and I've heard of IPython.Parallel, but I have been unable to find any documentation or examples.
Any solutions/alternatives would be most welcome.
I would post this as a comment since I don't have a full answer, but I'll amend as I figure out what is going on.
from multiprocessing import Pool

def f(x):
    return x**2

if __name__ == '__main__':
    pool = Pool(4)
    for res in pool.map(f, range(20)):
        print(res)
This works. I believe the answer to this question is here. In short, the subprocesses do not know they are subprocesses and attempt to run the main script recursively.
This is the error I am given, which gives us the same solution:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.

Using python multiprocessing Pool in the terminal and in code modules for Django or Flask

When using multiprocessing.Pool in Python with the following code, there is some bizarre behavior.
from multiprocessing import Pool
p = Pool(3)

def f(x): return x

threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
I get the following error three times (one for each thread in the pool), and it prints "3" through "19":
AttributeError: 'module' object has no attribute 'f'
The first three apply_async calls never return.
Meanwhile, if I try:
from multiprocessing import Pool
p = Pool(3)

def f(x): print(x)

p.map(f, range(20))
I get the AttributeError 3 times, the shell prints "6" through "19", and then hangs and cannot be killed by [Ctrl] + [C]
The multiprocessing docs have the following to say:
Functionality within this package requires that the main module be
importable by the children.
What does this mean?
To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?
Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.
Understanding the docs
Functionality within this package requires that the main module be importable by the children.
What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:
Initialize Pools inside functions whenever possible (a minimal sketch follows these guidelines). If you have to initialize them in the global scope, do so at the bottom of the module.
Do not call the methods of a Pool in the global scope.
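A minimal sketch of the first guideline, assuming a trivial worker function:

from multiprocessing import Pool

def f(x):
    return x*x

def compute():
    # the pool is created (and torn down) inside a function,
    # not at import time, so importing this module is harmless
    with Pool(3) as p:
        return p.map(f, range(20))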
Alternatively, if you only need better parallelism for I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves the scarcely documented:
from multiprocessing.pool import ThreadPool
Its interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that process pools do. The only downside is that you don't get true parallelism of code execution, just parallelism in blocking I/O.
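As a sketch, the earlier example with that drop-in swap (fetch here just stands in for some blocking I/O call):

from multiprocessing.pool import ThreadPool

def fetch(x):
    return x*2  # stand-in for a blocking I/O operation

pool = ThreadPool(3)               # safe even in the global scope:
print(pool.map(fetch, range(20)))  # threads share the interpreter, no re-import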
Pools must be initialized after the definitions of functions to be run on them
The inscrutable text from the Python docs means that, at the time the pool is defined, the surrounding module is imported by the workers in the pool. In the case of the Python terminal, this means all and only the code you have run so far.
So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:
from multiprocessing import Pool

def f(x): return x  # FIRST

p = Pool(3)  # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
Or
from multiprocessing import Pool

def f(x): print(x)  # FIRST

p = Pool(3)  # SECOND
p.map(f, range(20))
By fine, I mean fine on Unix. Windows has its own problems, which I'm not going into here.
Using pools in modules
But wait, there's more (to using pools in modules that you want to import elsewhere)!
If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the module, not the top. Though this goes against most good code style, it is necessary for functionality. The way to use a pool declared at the top of a module is to use it only with functions imported from other modules, like so:
from multiprocessing import Pool
from other_module import f

p = Pool(3)
p.map(f, range(20))
Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:
### module.py ###
from multiprocessing import Pool
POOL = Pool(5)

### module2.py ###
def f(x):
    ...  # some function

from module import POOL
POOL.map(f, range(10))
And second, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. i.e. this doesn't work:
### module.py ###
from multiprocessing import Pool

def f(x): return x

p = Pool(1)
print(p.map(f, range(5)))

### module2.py ###
import module
This, however, does work, as long as nothing imports module2:
### module.py ###
from multiprocessing import Pool

def f(x): return x

p = Pool(1)

def run_pool(): print(p.map(f, range(5)))

### module2.py ###
import module
module.run_pool()
Now, the reasons behind this are only more bizarre, and are likely related to the reason that the code in the question only raises an AttributeError once per worker and after that appears to execute code properly. It also appears that pool workers (at least with some reliability) reload the code in the module after executing.
The function you want to execute on a pool must already be defined when you create the pool.
This should work:
from multiprocessing import Pool

def f(x): print(x)

if __name__ == '__main__':
    p = Pool(3)
    p.map(f, range(20))
The reason is that (at least on Unix-based systems, which have fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.
On Windows it's a bit different, as Windows doesn't have fork. There, new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with an if __name__ == '__main__'. Otherwise each new worker will re-execute the code and therefore spawn new processes infinitely, crashing the program (or the system).
There is another possible source for this error. I got this error when running the example code.
The source was that, despite having installed multiprocessing correctly, the C++ compiler was not installed on my system, something pip informed me of when trying to update multiprocessing. So it might be worth checking that the compiler is installed.

importing and using a module that uses multiprocessing without causing infinite loop on Windows

I have a module named multi.py. If I simply wanted to execute multi.py as a script, then the workaround to avoid crashing on Windows (spawning an infinite number of processes) is to put the multiprocessing code under:
if __name__ == '__main__':
However, I am trying to import it as a module from another script and call multi.start(). How can this be accomplished?
# multi.py
import multiprocessing

def test(x):
    x **= 2

def start():
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()-2)
    pool.map(test, (i for i in range(1000*1000)))
    pool.terminate()
    print('done.')

if __name__ == '__main__':
    print('runs as a script,', __name__)
else:
    print('runs as imported module,', __name__)
This is my test.py I run:
# test.py
import multi

multi.start()
I don't quite get what you're asking. You don't need to do anything to prevent this from spawning infinitely many processes. I just ran it on Windows XP --- imported the file and ran multi.start() --- and it completed fine in a couple seconds.
The reason you have to do the if __name__=="__main__" protection is that, on Windows, multiprocessing has to import the main script in order to run the target function, which means top-level module code in that file will be executed. The problem only arises if that top-level module code itself tries to spawn a new process. In your example, the top level module code doesn't use multiprocessing, so there's no infinite process chain.
Edit: Now I get what you're asking. You don't need to protect multi.py. You need to protect your main script, whatever it is. If you're getting a crash, it's because in your main script you are doing multi.start() in the top level module code. Your script needs to look like this:
import multi

if __name__ == "__main__":
    multi.start()
The "protection" is always needed in the main script.
