I'm pretty new to Python, and this question probably shows that. I'm working on the multiprocessing part of my script and couldn't find a definitive answer to my problem.
I'm struggling with one thing. When using multiprocessing, part of the code has to be guarded with if __name__ == "__main__". I get that, and my pool is working great. But I would love to import that whole script (making it one big function that returns a value would be best). And here is the problem. First, how can I import something if part of it will only run when launched from the main/source file because of that guard? Secondly, if I manage to work it out and the whole script ends up in one big function, pickle can't handle that; will using "multiprocessing on dill" or "pathos" fix it?
Thanks!
You are probably confused about the concept. The if __name__ == "__main__" guard in Python exists precisely so that all Python files can be importable.
Without the guard, a file, once imported, would have the same behavior as if it were the "root" program, and it would require a lot of boilerplate and inter-process communication (like writing a "PID" file at a fixed filesystem location) to coordinate imports of the same code, including for multiprocessing.
Just leave under the guard whatever code needs to run for the root process. Everything else you move into functions that you can call from the importing code.
If "all" of the script were to run on import, even the part setting up the multiprocessing workers would run, and any simple job would create more workers exponentially until all machine resources were exhausted (i.e. it would crash hard and fast, potentially leaving the machine unresponsive).
So, this is a good pattern: the "dothejob" function can call all the other functions you need, so you just need to import and call it, either from a master process or from any other project importing your file as a Python module.
import multiprocessing

...

def dothejob():
    ...

def start():
    # code to setup and start multiprocessing workers:
    # like:
    worker1 = multiprocessing.Process(target=dothejob)
    ...
    worker1.start()
    ...
    worker1.join()

if __name__ == "__main__":
    start()
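For example, a separate script that imports this file (here hypothetically saved as myjobs.py; the name is just for illustration) can reuse the functions without triggering the worker setup:

# another_script.py -- a sketch; "myjobs" is a hypothetical name for the file above
from myjobs import dothejob, start

# call the work function directly, in this process, with no extra workers...
dothejob()

# ...or, behind this file's own guard, start the worker processes:
if __name__ == "__main__":
    start()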
I have attempted to perform Pool.starmap in a few different ways. I have tried various suggestions and answers, to no avail. Below is a sample of the code I am trying to run; however, it hangs and never terminates. What am I doing wrong here?
Side note: I am on Python version 3.9.8.
if __name__ == '__main__':
    with get_context("spawn").Pool() as p:
        tasks = [(1,1),(2,2),(3,3)]
        print(p.starmap(add,tasks))
        p.close()
        p.join()
Multiprocessing in Python has some complexity you should be aware of: its behavior depends on how you run your script, in addition to what OS and Python version you're using.
One of the big issues I see very often is the fact that Jupyter and other "notebook"-style Python environments don't always play nice with multiprocessing. There are technically some ways around this, but I typically just suggest executing your code from a more normal system terminal. The common thread is that "interactive" interpreters don't work very well, because there needs to be a "main" file, and in interactive mode there's no file; it just waits for user input.
I can't know exactly what your issue is here, as you haven't provided all your code or said what OS and IDE you're using, but I can at least leave you with a working (on my setup) example. (Windows 10; Python 3.9; Spyder IDE with run settings -> execute in an external system terminal)
import multiprocessing as mp

def add(a, b):  # I'm assuming your "add" function looks a bit like this...
    return a + b

if __name__ == "__main__":
    # This guard is critical when using "spawn" so code doesn't run when the file is imported.
    # You should only define functions, classes, and static data (constants) outside it;
    # most critically, it shouldn't be possible for a new child process to start outside it.
    ctx = mp.get_context("spawn")
    # "spawn" is the only context available on Windows, and the default on macOS since Python 3.8.
    # Contexts are an important topic somewhat unique to Python multiprocessing, and you should
    # absolutely do some additional reading about "spawn" vs "fork". tl;dr: "spawn" starts a new
    # process with no knowledge of the old one, and must `import` everything from __main__.
    # "fork", on the other hand, copies the existing process and all its memory before branching.
    # This is faster than restarting the interpreter and re-importing everything, but sometimes
    # things get copied that shouldn't, and other things that should get copied don't.
    with ctx.Pool() as p:
        # Using `with` automatically shuts down the pool (forcibly) at the end of the block, so you
        # don't have to call `close` or `join`. It was also pointed out that, due to the forcible
        # shutdown, async calls like `map_async` may not finish unless you wait for the results
        # before the end of the `with` block. `starmap` already waits for the results, however,
        # so no extra waiting is needed here.
        tasks = [(1, 1), (2, 2), (3, 3)]
        print(p.starmap(add, tasks))
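As a side note on the async point in the comments above: if you did want an async call inside the `with` block, a minimal sketch (reusing ctx and add from the example above) is to keep the result object and wait on it before the block ends:

# this goes inside the same `if __name__ == "__main__":` block as above
with ctx.Pool() as p:
    tasks = [(1, 1), (2, 2), (3, 3)]
    async_result = p.starmap_async(add, tasks)
    print(async_result.get())  # .get() blocks until the results are ready, before the pool shuts down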
I am sorry for this rather long question, but since it is my first question on Stack Overflow, I wanted to be thorough in describing my problem and what I have already tried.
I am doing simulations of stochastic processes and thought it would be a good idea to use multiprocessing in order to increase the speed of my simulations. Since the individual processes have no need to share information with each other, this is really a trivial application of multiprocessing; unfortunately, I struggle with calling my script from the console.
My code for a test function looks like this:
#myscript.py
from multiprocessing import Pool

def testFunc(inputs):
    print(inputs)

def multi():
    print('Test2')
    pool = Pool()
    pool.map_async(testFunc, range(10))

if __name__ == '__main__':
    print('Test1')
    multi()
This works absolutely fine as long as I run the code from within my Spyder IDE. As the next step, I want to execute my script on my university's cluster, which I access via a Slurm script; therefore, I need to be able to execute my Python script via a bash script. Here I got some unexpected results.
What I tried, on my MacBook Pro with macOS 10.15.7 and a workstation with Ubuntu 18.04.5, are the following console inputs: python myscript.py and python -c "from myscript import multi; multi()".
In each case my only output is Test1 and Test2, and testFunc never seems to be called. Following this answer, Using python multiprocessing Pool in the terminal and in code modules for Django or Flask, I also tried various versions of omitting the if __name__ == '__main__' guard and importing the relevant functions into another module. For example, I tried:
#myscript.py
from multiprocessing import Pool

def testFunc(inputs):
    print(inputs)

pool = Pool()
pool.map_async(testFunc, range(10))
But all to no avail. To confuse me even further, I then found out that first opening the Python interpreter from the console by simply typing python, pressing enter, and then executing
from myscript import multi
multi()
inside the Python interpreter does work.
As I said, I am very confused by this, since I thought this to be equivalent to python -c "from myscript import multi; multi()" and I really don't understand why one works and the other doesn't. Trying to reproduce this success I also tried executing the following bash script
python - <<'END_SCRIPT'
from multiTest import multi
multi()
END_SCRIPT
but, alas, this doesn't work either.
As a last "discovery", I found out that all those problems only arise when using map_async instead of just map; however, I think that for my application asynchronous processing is preferable.
I would be really grateful if someone could shed light on this mystery (at least for me it is a mystery).
Also, as I said, this is my first question on Stack Overflow, so I apologize if I forgot relevant information or accidentally did not follow the formatting guidelines. All comments or edits helping me improve my questions (and answers) in the future are also much appreciated!
You aren't waiting for the pool to finish what it's doing before your program exits.
def multi():
    print('Test2')
    with Pool() as pool:
        result = pool.map_async(testFunc, range(10))
        result.wait()
If the order in which the subprocesses process things isn't relevant, I'd suggest
with Pool() as pool:
    for result in pool.imap_unordered(testFunc, range(10), 5):
        pass
(Change 5, the chunksize parameter, to taste.)
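Putting the pieces together, a minimal corrected version of the whole script (same names as in the question, just a sketch) would be:

# myscript.py
from multiprocessing import Pool

def testFunc(inputs):
    print(inputs)

def multi():
    print('Test2')
    with Pool() as pool:
        result = pool.map_async(testFunc, range(10))
        result.wait()  # block until every task has actually run before the pool is torn down

if __name__ == '__main__':
    print('Test1')
    multi()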
I'm developing a personal app that involves web scraping, and inside its folders there are some files that need multiprocessing tools to make things more efficient, as in any other web scraping software. I know that Windows lacks some features available on Linux, such as fork, and thus when we want to instantiate a process pool in our program we need to make sure that the pool-creating code is enclosed by if __name__ == "__main__":.
But something is bothering me. Here's the scenario (there are several of them currently in my program). In my main file I'm calling import web_scraping_x, which contains a multiprocessing pool. The question is: how do I protect my web_scraping_x code without messing up everything inside of it?
Here's some pseudo-code (for ease of comprehension):
"__main__"
from web_scraping_x import scraping_manager
instructions = ["some list or dict"]
data = scraping_manager(instructions)
and
"web_scraping_x"
def scraping_manager(instructions):
    if instructions is True:
        scraping_a(instructions)

def scraping_a(instructions):
    "do a lot of work ur proletariat"
I'm trying not to come up with a solution as bad as the problem on Windows (such as importing my module with a changed __name__ ...), even though that may not be possible, so any tips are really appreciated. If needed, I could show my code, but I think it would just make things more difficult to share with you.
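For what it's worth, the pattern from the answers above applies here too. A minimal sketch (reusing the pseudo-code names, plus a hypothetical scrape_one worker function) keeps the pool inside a function of web_scraping_x, so the guard is only needed in the file you actually run:

# web_scraping_x.py -- a sketch, not the real module
from multiprocessing import Pool

def scrape_one(instruction):
    # hypothetical worker; stands in for the real scraping code
    return "scraped: " + str(instruction)

def scraping_manager(instructions):
    # the pool is only created when this function is called,
    # so merely importing web_scraping_x never spawns processes
    with Pool() as pool:
        return pool.map(scrape_one, instructions)

# main.py (the file you run) stays as in the pseudo-code, just guarded:
#     from web_scraping_x import scraping_manager
#     if __name__ == "__main__":
#         data = scraping_manager(["some list or dict"])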
I am playing around with a library for my beginner students, and I'm using the multiprocessing module in Python. I ran into this problem: importing and using a module that uses multiprocessing without causing an infinite loop on Windows.
As an example, suppose I have a module mylibrary.py:
# mylibrary.py
from multiprocessing import Process

class MyProcess(Process):
    def run(self):
        print "Hello from the new process"

def foo():
    p = MyProcess()
    p.start()
And a main program that calls this library:
# main.py
import mylibrary
mylibrary.foo()
If I run main.py on Windows, it tries to import main.py into the new process, meaning the code is executed again, which results in an infinite loop of process generation. I can fix it like so:
import mylibrary

if __name__ == "__main__":
    mylibrary.foo()
But, this is pretty confusing for beginners, and moreover it seems like it shouldn't be necessary. The new process is being created in mylibrary, so why doesn't the new process just import mylibrary? Is there a way to work around this issue without having to change main.py?
I am using Python 2.7, by the way.
Windows doesn't have fork, so there's no way to make a new process just like the existing one. So the child process has to run your code again, but now you need a way to distinguish between the parent process and the child process, and __main__ is it.
This is covered in the docs here: http://docs.python.org/2/library/multiprocessing.html#windows
I don't know of another way to structure the code to avoid the fork bomb effect.
Google has a Python tutorial, and they describe boilerplate code as "unfortunate" and provide this example:
#!/usr/bin/python
# import modules used here -- sys is a very standard one
import sys

# Gather our code in a main() function
def main():
    print 'Hello there', sys.argv[1]
    # Command line args are in sys.argv[1], sys.argv[2] ..
    # sys.argv[0] is the script name itself and can be ignored

# Standard boilerplate to call the main() function to begin
# the program.
if __name__ == '__main__':
    main()
Now, I've heard boilerplate code being described as "seemingly repetitive code that shows up again and again in order to get some result that seems like it ought to be much simpler" (example).
Anyways, in Python, the part considered "boilerplate" code of the example above was:
if __name__ == '__main__':
    main()
Now, my questions are as follows:
1) Does boilerplate code in Python (like the example provided) take on the same definition as the definition I provided? If so, why?
2) Is this code even necessary? It seems to me like the code runs whether or not there's a main method. What makes using this code better? Is it even better?
3) Why do we use that code and what service does it provide?
4) Does this occur throughout Python? Are there other examples of "boilerplate code"?
Oh, and just an off-topic question: do you need to import sys to use command line arguments in Python? How does it handle such arguments if it's not there?
It is repetitive in the sense that it's repeated for each script that you might execute from the command line.
If you put your main code in a function like this, you can import the module without executing it. This is sometimes useful. It also keeps things organized a bit more.
Same as #2 as far as I can tell
Python is generally pretty good at avoiding boilerplate. It's flexible enough that in most situations you can write code to produce the boilerplate rather than writing boilerplate code.
Off topic question:
If you don't write code to check the arguments, they are ignored.
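As a small sketch of that off-topic point: the arguments are always available in sys.argv; you only need import sys if you actually want to read them.

import sys

# sys.argv always holds the command-line arguments; without code like this they are simply ignored
print(sys.argv[0])   # the script name itself
print(sys.argv[1:])  # the actual arguments, as a list of strings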
The reason that the if __name__ == "__main__": block is called boilerplate in this case is that it replicates a functionality that is automatic in many other languages. In Java or C++, among many others, when you run your code it will look for a main() method and run it, and even complain if it's not there. Python runs whatever code is in your file, so you need to tell it to run the main() method; a simple alternative would be to make running the main() method the default functionality.
So, if __name__ == "__main__": is a common pattern that could be shorter. There's no reason you couldn't do something different, like:
if __name__ == "__main__":
    print "Hello, Stack Overflow!"
    for i in range(3):
        print i
    exit(0)
This will work just fine; although my example is a little silly, you can see that you can put whatever you like there. The Python designers chose this behavior over automatically running the main() method (which may well not exist), presumably because Python is a "scripting" language; so you can write some commands directly into a file, run it, and your commands execute. I personally prefer it the Python way because it makes starting up in Python easier for beginners, and it's always nice to have a language where Hello World is one line.
The reason you use an "if main" check is so you can have a module that runs some part of its code at toplevel (to create the things – constants, functions, or classes – it exports), and some part only when executed as a script (e.g. unit tests for its functionality).
The reason the latter code should be wrapped in a function is because local variables of the main() block would leak into the module's scope.
Now, an alternate design could be that a file executed as a script would have to declare a function named, say, __main__(), but that would mean adding a new magic function name to the language, while the __name__ mechanism is already there. (And couldn't be removed, because every module has to have a __name__, and a module executed as a script has to have a "special" name because of how module names are assigned.) Introducing two mechanisms to do the same thing just to get rid of two lines of boilerplate – and usually two lines of boilerplate per application – just doesn't seem worth it.
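A small sketch of the "leaking locals" point above, showing two alternative versions side by side: anything assigned at top level under the guard becomes a module-level global, while locals of a main() function stay contained.

# version 1: no main() -- `total` becomes a global of this module
if __name__ == "__main__":
    total = sum(range(10))
    print(total)

# version 2: with main() -- `total` stays local and never pollutes the module namespace
def main():
    total = sum(range(10))
    print(total)

if __name__ == "__main__":
    main()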
You don't need to add an if __name__ == '__main__' guard for one-off scripts that aren't intended to be part of a larger project. See here for a great explanation. You only need it if you want to run the file by itself AND include it as a module alongside other Python files.
If you just want to run one file, you can have zero boilerplate:
print 1
and run it with $ python your_file.py
Adding the shebang line #!/usr/bin/python and running chmod +x print_one.py gets you the ability to run with
./print_one.py
Finally, # coding: utf-8 allows you to add unicode to your file if you want to put ❤'s all over the place.
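Putting those pieces together, a complete zero-boilerplate file (a sketch, reusing the print_one.py name from above) could be:

#!/usr/bin/python
# coding: utf-8
# print_one.py -- no "if __name__" guard at all, which is fine for a one-off script
print(1)
print(u"❤")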
1) main boilerplate is common, but cannot be any simpler
2) main() is not called without the boilerplate
3) the boilerplate allows module usage both as a standalone script, and as a library in other programs
4) it’s very common. doctest is another one.
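For example, the usual doctest boilerplate (a standard pattern, shown here with a made-up double() function for illustration) looks like this:

def double(n):
    """Return twice the value of n.

    >>> double(3)
    6
    """
    return 2 * n

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the examples in the docstrings when the file is executed directly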
Train to become a Python guru…and good luck with the thesis! ;-)
Let’s take a moment to see what happened when you called import sys:
Python looks through its list of module locations and brings in the sys module
Your code can then look up sys.argv, the list of command-line arguments, and use it
So, what’s happening here?
A function written elsewhere is being used to perform certain operations within the scope of the current program. Programming in this fashion has a lot of benefits. It separates the logic from the actual labour.
Now, as far as the boilerplate is concerned, there are two parts:
the program itself (the logic), defined under main, and
the call part, which checks whether the module is being run as the main program and, if so, calls main()
You essentially write your program under main, using all the functions you defined just before defining main (or elsewhere), and let the guarded call at the bottom invoke main() when the file is run directly.
I am equally confused by what the tutorial means by "boilerplate code": does it mean that this code can be avoided in a simple script? Or is it a criticism of Python features that force the use of this syntax? Or even an invitation to use this "boilerplate" code?
I don't know; however, after many years of Python programming, I am at least clear about what the different syntaxes do, even if I am probably still not sure what the best way of doing it is.
Often you want to put code for tests, or code that you want to execute, at the end of the script, but this has some implications/side effects:
the code gets executed even when the script is imported, which is rarely what is wanted;
variables and values in the code are defined and exported into the module's namespace, visible to whoever imports it;
the code at the end of the script can be executed by calling the script (python script.py) or by running it from an IPython shell (%run script.py), but there is no way to run it from other scripts.
The most basic mechanism to avoid to execute following code in all conditions, is the syntax:
if __name__ == '__main__':
which makes the code run only if the script is called or run, avoiding problem 1. The other two points still hold.
The "boilerplate" code with a separate main() function adds a further step, also excluding points 2 and 3 above, so, for example, you can call a number of tests from different scripts. Sometimes this can take another level (e.g. a number of functions, one for each test, so they can be individually called from outside, and a main() that calls all the test functions, without the outside needing to know which ones they are).
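A small sketch of that structure (the test names are hypothetical placeholders): each test is its own importable function, and main() only orchestrates them:

# tests_script.py -- a sketch
def test_addition():
    assert 1 + 1 == 2
    print("test_addition passed")

def test_strings():
    assert "py" + "thon" == "python"
    print("test_strings passed")

def main():
    # importers can run individual tests; running the file directly runs them all,
    # and no locals leak into the module namespace
    test_addition()
    test_strings()

if __name__ == '__main__':
    main()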
I would add that the main reason I often find this structure unsatisfying, apart from its complexity, is that sometimes I would like to keep point 2, and I lose this possibility if the code is moved to a separate function.