Run methods in parallel


I have a program that, among other things, parses some big files, and I would like to have this done in parallel to save time.
The code flow looks something like this:
if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)
    obj.field1 = parse_file_one(f1)
    obj.field2 = parse_file_two(f2)
    obj.do_some_more_stuff()
I tried running the file parsing methods in separate processes like this:
p_1 = multiprocessing.Process(target=parse_file_one, args=(f1))
p_2 = multiprocessing.Process(target=parse_file_two, args=(f2))
p_1.start()
p_2.start()
p_1.join()
p_2.join()
There are two problems here. One is how to have the separate process modify the field, but more importantly, forking the process duplicates my whole main! I get an exception regarding argv when executing
do_so_some_stuff(argv)
a second time. That really is not what I wanted. It even happened when I ran only one of the processes.
How could I get just the file parsing methods to run in parallel to each other, and then continue back with main process like before?

Try putting the parsing methods in a separate module.

First, I guess instead of:
obj = program_object()
program_object.do_so_some_stuff(argv)
you mean:
obj = program_object()
obj.do_so_some_stuff(argv)
Second, try using threading like this:
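#!/usr/bin/python
# Python 2 code; the thread module is named _thread in Python 3
import thread
if __name__ == '__main__':
    try:
        # args must be a tuple, hence the trailing comma
        thread.start_new_thread(parse_file_one, (f1,))
        thread.start_new_thread(parse_file_two, (f2,))
    except:
        print "Error: unable to start thread"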
But, as pointed out by Wooble, depending on the implementation of your parsing functions, this might not be a solution that executes truly in parallel, because of the GIL.
In that case, you should check the Python multiprocessing module that will do true concurrent execution:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
Due to this, the multiprocessing module allows the programmer to fully
leverage multiple processors on a given machine.
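
As a rough sketch of the multiprocessing route for the original question (not the asker's actual parsers): the two parse functions below are placeholder stand-ins, and parse_both is a hypothetical helper name. The results come back to the parent as return values, so the parent can assign them to obj.field1 and obj.field2 itself instead of having the child processes modify the object.

```python
import os
import tempfile
from multiprocessing import Pool


def parse_file_one(path):
    # Placeholder parser (assumption): count the lines in the file.
    with open(path) as handle:
        return sum(1 for _ in handle)


def parse_file_two(path):
    # Placeholder parser (assumption): collect the distinct words in the file.
    with open(path) as handle:
        return sorted(set(handle.read().split()))


def parse_both(f1, f2):
    """Run both parsers in separate worker processes and return both results."""
    with Pool(processes=2) as pool:
        pending_one = pool.apply_async(parse_file_one, (f1,))
        pending_two = pool.apply_async(parse_file_two, (f2,))
        # get() blocks until the worker is done and re-raises any worker error.
        return pending_one.get(), pending_two.get()


if __name__ == '__main__':
    # Code that must run only once (argv handling etc.) stays under this guard.
    with tempfile.TemporaryDirectory() as tmp:
        f1 = os.path.join(tmp, 'one.txt')
        f2 = os.path.join(tmp, 'two.txt')
        with open(f1, 'w') as handle:
            handle.write('a\nb\nc\n')
        with open(f2, 'w') as handle:
            handle.write('spam eggs spam\n')
        field1, field2 = parse_both(f1, f2)
        print(field1, field2)
```

Keeping the parse functions at module level (or in a separate module, as suggested above) matters because the worker processes have to be able to import them.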

Related

How do I run two looping functions parallel to each other? [duplicate]

Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A
# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?

If you want concurrency, here's a very simple example:
from multiprocessing import Process
def loop_a():
    while 1:
        print("a")
def loop_b():
    while 1:
        print("b")
if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to create, but they offer true parallelism.

There are many possible options for what you wanted:
use loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):
    # use xrange instead of range
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB must be done after taskA, or vice versa. They can't run simultaneously.
multiprocessing
Another thought would be to run two processes at the same time. Python provides the multiprocessing library; the following is a simple example:
from multiprocessing import Process
# args is a tuple of positional arguments, kwargs a dict of keyword arguments
p1 = Process(target=taskA, args=args, kwargs=kwargs)
p2 = Process(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Merits: tasks can run simultaneously in the background, you can control them (terminate them, etc.), they can exchange data, and they can be synchronized if they compete for the same resources.
Drawbacks: too heavy! The OS will frequently switch between them, and each has its own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
threading
Threads are like processes, just lightweight. Check out this post. Their usage is quite similar:
import threading
p1 = threading.Thread(target=taskA, args=args, kwargs=kwargs)
p2 = threading.Thread(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
coroutines
Libraries like greenlet and gevent provide something called coroutines, which are supposed to be faster than threads. No examples provided; please google how to use them if you're interested.
merits: more flexible and lightweight
drawbacks: extra library needed, learning curve.
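
No greenlet or gevent example is given above. Purely as an illustration of the coroutine idea, here is a sketch that swaps in the standard-library asyncio module (Python 3.7+), which is a different coroutine implementation from greenlet/gevent. The names task_a, task_b and run_both are made up for this example.

```python
import asyncio


async def task_a(log):
    for i in range(3):
        log.append(('A', i))
        # Hand control back to the event loop so the other coroutine can run.
        await asyncio.sleep(0)


async def task_b(log):
    for i in range(3):
        log.append(('B', i))
        await asyncio.sleep(0)


async def run_both():
    log = []
    # gather() runs both coroutines concurrently on a single thread.
    await asyncio.gather(task_a(log), task_b(log))
    return log


if __name__ == '__main__':
    print(asyncio.run(run_both()))
```

Coroutines take turns on one thread, so this helps when the tasks spend their time waiting (I/O), not when they are CPU-bound.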

Why do you want to run the two processes at the same time? Is it because you think they will go faster? (There is a good chance that they won't.) Why not run the tasks in the same loop, e.g.
for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads - see the python threading module. However, threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the python multiprocessing module. If both tasks are CPU intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP etc., but without knowing more about Task A and Task B and why they need to be run at the same time it is impossible to give a more specific answer.

from threading import Thread
def loopA():
    for i in range(10000):
        pass  # Do task A
def loopB():
    for i in range(10000):
        pass  # Do task B
threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
# start() runs the target in a new thread; run() would execute it in the calling thread
threadA.start()
threadB.start()
# Do work independent of loopA and loopB
threadA.join()
threadB.join()

You could use threading or multiprocessing.

How about: A loop for i in range(10000): Do Task A, Do Task B? Without more information I don't have a better answer.

I find that using the "pool" submodule within "multiprocessing" works amazingly for executing multiple processes at once within a Python Script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool
def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array # "for example"
result_array = np.zeros("some shape") # This is normally populated by 1 loop, lets try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...) # Arguments to be passed into desired function.
multiple_results = []
for i in range(processes): # Executes each pool w/ option (1-4 in this case).
    multiple_results.append(pool.apply_async(desired_function, (i+1,)+args)) # Launches each asynchronously.
results = np.array([res.get() for res in multiple_results]) # Retrieves results after
# every pool is finished!
for i in range(processes):
    result_array = result_array + results[i] # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!

How to share variables in multiprocessing [duplicate]

The following does not work
one.py
import shared
shared.value = 'Hello'
raw_input('A cheap way to keep process alive..')
two.py
import shared
print shared.value
run on two command lines as:
>>python one.py
>>python two.py
(the second one gets an attribute error, rightly so).
Is there a way to accomplish this, that is, share a variable between two scripts?

Hope it's OK to jot down my notes about this issue here.
First of all, I appreciate the example in the OP a lot, because that is where I started as well - although it made me think shared is some built-in Python module, until I found a complete example at [Tutor] Global Variables between Modules ??.
However, when I looked for "sharing variables between scripts" (or processes) - besides the case when a Python script needs to use variables defined in other Python source files (but not necessarily running processes) - I mostly stumbled upon two other use cases:
A script forks itself into multiple child processes, which then run in parallel (possibly on multiple processors) on the same PC
A script spawns multiple other child processes, which then run in parallel (possibly on multiple processors) on the same PC
As such, most hits regarding "shared variables" and "interprocess communication" (IPC) discuss cases like these two; however, in both of these cases one can observe a "parent", to which the "children" usually have a reference.
What I am interested in, however, is running multiple invocations of the same script, run independently, and sharing data between those (as in Python: how to share an object instance across multiple invocations of a script), in a singleton/single instance mode. That kind of problem is not really addressed by the above two cases - instead, it essentially reduces to the example in OP (sharing variables across two scripts).
Now, when dealing with this problem in Perl, there is IPC::Shareable; which "allows you to tie a variable to shared memory", using "an integer number or 4 character string[1] that serves as a common identifier for data across process space". Thus, there are no temporary files, nor networking setups - which I find great for my use case; so I was looking for the same in Python.
However, as the accepted answer by @Drewfer notes: "You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter"; or in other words: either you have to use a networking/socket setup - or you have to use temporary files (ergo, no shared RAM for "totally separate python sessions").
Now, even with these considerations, it is kinda difficult to find working examples (except for pickle) - also in the docs for mmap and multiprocessing. I have managed to find some other examples - which also describe some pitfalls that the docs do not mention:
Usage of mmap: working code in two different scripts at Sharing Python data between processes using mmap | schmichael's blog
Demonstrates how both scripts change the shared value
Note that here a temporary file is created as storage for saved data - mmap is just a special interface for accessing this temporary file
Usage of multiprocessing: working code at:
Python multiprocessing RemoteManager under a multiprocessing.Process - working example of SyncManager (via manager.start()) with shared Queue; server(s) writes, clients read (shared data)
Comparison of the multiprocessing module and pyro? - working example of BaseManager (via server.serve_forever()) with shared custom class; server writes, client reads and writes
How to synchronize a python dict with multiprocessing - this answer has a great explanation of multiprocessing pitfalls, and is a working example of SyncManager (via manager.start()) with shared dict; server does nothing, client reads and writes
Thanks to these examples, I came up with an example, which essentially does the same as the mmap example, with approaches from the "synchronize a python dict" example - using BaseManager (via manager.start() through file path address) with shared list; both server and client read and write (pasted below). Note that:
multiprocessing managers can be started either via manager.start() or server.serve_forever()
serve_forever() locks - start() doesn't
There is an auto-logging facility in multiprocessing: it seems to work fine with start()ed processes - but seems to ignore the ones that serve_forever()
The address specification in multiprocessing can be IP (socket) or temporary file (possibly a pipe?) path; in multiprocessing docs:
Most examples use multiprocessing.Manager() - this is just a function (not class instantiation) which returns a SyncManager, which is a special subclass of BaseManager; and uses start() - but not for IPC between independently run scripts; here a file path is used
A few other examples use the serve_forever() approach for IPC between independently run scripts; here an IP/socket address is used
If an address is not specified, then a temp file path is used automatically (see 16.6.2.12. Logging for an example of how to see this)
In addition to all the pitfalls in the "synchronize a python dict" post, there are additional ones in case of a list. That post notes:
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
The workaround to dict['key'] getting and setting, is the use of the dict public methods get and update. The problem is that there are no such public methods as alternative for list[index]; thus, for a shared list, in addition we have to register __getitem__ and __setitem__ methods (which are private for list) as exposed, which means we also have to re-register all the public methods for list as well :/
Well, I think those were the most critical things; these are the two scripts - they can just be run in separate terminals (server first); note: developed on Linux with Python 2.7:
a.py (server):
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
    pass
syncarr = []
def get_arr():
    return syncarr
def main():
    # print dir([]) # cannot do `exposed = dir([])`!! manually:
    MyListManager.register("syncarr", get_arr, exposed=['__getitem__', '__setitem__', '__str__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'])
    manager = MyListManager(address=('/tmp/mypipe'), authkey='')
    manager.start()
    # we don't use the same name as `syncarr` here (although we could);
    # just to see that `syncarr_tmp` is actually <AutoProxy[syncarr] object>
    # so we also have to expose `__str__` method in order to print its list values!
    syncarr_tmp = manager.syncarr()
    print("syncarr (master):", syncarr, "syncarr_tmp:", syncarr_tmp)
    print("syncarr initial:", syncarr_tmp.__str__())
    syncarr_tmp.append(140)
    syncarr_tmp.append("hello")
    print("syncarr set:", str(syncarr_tmp))
    raw_input('Now run b.py and press ENTER')
    print
    print 'Changing [0]'
    syncarr_tmp.__setitem__(0, 250)
    print 'Changing [1]'
    syncarr_tmp.__setitem__(1, "foo")
    new_i = raw_input('Enter a new int value for [0]: ')
    syncarr_tmp.__setitem__(0, int(new_i))
    raw_input("Press any key (NOT Ctrl-C!) to kill server (but kill client first)".center(50, "-"))
    manager.shutdown()
if __name__ == '__main__':
    main()
b.py (client)
import time
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
    pass
MyListManager.register("syncarr")
def main():
    manager = MyListManager(address=('/tmp/mypipe'), authkey='')
    manager.connect()
    syncarr = manager.syncarr()
    print "arr = %s" % (dir(syncarr))
    # note here we need not bother with __str__
    # syncarr can be printed as a list without a problem:
    print "List at start:", syncarr
    print "Changing from client"
    syncarr.append(30)
    print "List now:", syncarr
    o0 = None
    o1 = None
    while 1:
        new_0 = syncarr.__getitem__(0) # syncarr[0]
        new_1 = syncarr.__getitem__(1) # syncarr[1]
        if o0 != new_0 or o1 != new_1:
            print 'o0: %s => %s' % (str(o0), str(new_0))
            print 'o1: %s => %s' % (str(o1), str(new_1))
            print "List is:", syncarr
            print 'Press Ctrl-C to exit'
            o0 = new_0
            o1 = new_1
        time.sleep(1)
if __name__ == '__main__':
    main()
As a final remark, on Linux /tmp/mypipe is created - but is 0 bytes, and has attributes srwxr-xr-x (for a socket); I guess this makes me happy, as I neither have to worry about network ports, nor about temporary files as such :)
Other related questions:
Python: Possible to share in-memory data between 2 separate processes (very good explanation)
Efficient Python to Python IPC
Python: Sending a variable to another script

You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter.
If it's just simple variables you want, you can easily dump a python dict to a file with the pickle module in script one and then re-load it in script two.
Example:
one.py
import pickle
shared = {"Foo":"Bar", "Parrot":"Dead"}
fp = open("shared.pkl","wb")  # pickle data is binary
pickle.dump(shared, fp)
fp.close()  # make sure the data is flushed to disk
two.py
import pickle
fp = open("shared.pkl","rb")
shared = pickle.load(fp)
print shared["Foo"]

sudo apt-get install memcached python-memcache
one.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
shared.set('Value', 'Hello')
two.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
print shared.get('Value')

What you're trying to do here (store a shared state in a Python module over separate python interpreters) won't work.
A value in a module can be updated by one module and then read by another module, but this must be within the same Python interpreter. What you seem to be doing here is actually a sort of interprocess communication; this could be accomplished via socket communication between the two processes, but it is significantly less trivial than what you are expecting to have work here.

You can use a relatively simple mmap file.
You can use shared.py to store the common constants. The following code will work across different Python interpreters, scripts, and processes.
shared.py:
MMAP_SIZE = 16*1024
MMAP_NAME = 'Global\\SHARED_MMAP_NAME'
* The "Global" is windows syntax for global names
one.py:
import mmap
import sys
from shared import MMAP_SIZE,MMAP_NAME
def write_to_mmap():
    map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_WRITE)
    map_file.seek(0)
    map_file.write('hello\n')
    ret = map_file.flush() != 0
    if sys.platform.startswith('win'):
        assert(ret != 0)
    else:
        assert(ret == 0)
two.py:
import mmap
from shared import MMAP_SIZE,MMAP_NAME
def read_from_mmap():
    map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_READ)
    map_file.seek(0)
    data = map_file.readline().rstrip('\n')
    map_file.close()
    print data
*This code was written for Windows (tagname is a Windows-only parameter); Linux will need some adjustments
more info at - https://docs.python.org/2/library/mmap.html

Share a dynamic variable by Redis:
script_one.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
shared_var = 1
while True:
    cli.set('share_place', shared_var)
    shared_var += 1
    sleep(1)
Run script_one in a terminal (a process):
$ python script_one.py
script_two.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
while True:
    print(int(cli.get('share_place')))
    sleep(1)
Run script_two in another terminal (another process):
$ python script_two.py
Out:
1
2
3
4
5
...
Dependencies:
$ pip install redis
$ apt-get install redis-server

I'd advise that you use the multiprocessing module. You can't run two scripts from the commandline, but you can have two separate processes easily speak to each other.
From the doc's examples:
from multiprocessing import Process, Queue
def f(q):
    q.put([42, None, 'hello'])
if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get() # prints "[42, None, 'hello']"
    p.join()

You need to store the variable in some sort of persistent file. There are several modules to do this, depending on your exact need.
The pickle and cPickle module can save and load most python objects to file.
The shelve module can store python objects in a dictionary-like structure (using pickle behind the scenes).
The dbm/bsddb/dbhash/gdbm modules can store string variables in a dictionary-like structure.
The sqlite3 module can store data in a lightweight SQL database.
The biggest problem with most of these is that they are not synchronised across different processes - if one process reads a value while another is writing to the datastore then you may get incorrect data or data corruption. To get round this you will need to write your own file locking mechanism or use a full-blown database.
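
As one way to avoid hand-written file locking, here is a minimal sketch of a tiny key-value store built on sqlite3, which handles locking between processes itself. The table name shared and the helper names open_store, set_value and get_value are assumptions made for illustration, and values are stored as text only.

```python
import sqlite3


def open_store(path):
    # timeout: how long to wait if another process holds the write lock.
    conn = sqlite3.connect(path, timeout=10)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shared (name TEXT PRIMARY KEY, value TEXT)"
    )
    conn.commit()
    return conn


def set_value(conn, name, value):
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO shared (name, value) VALUES (?, ?)",
            (name, value),
        )


def get_value(conn, name, default=None):
    row = conn.execute(
        "SELECT value FROM shared WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else default
```

One script would call set_value(open_store("shared.db"), "value", "Hello") and the other get_value(open_store("shared.db"), "value"); the file name is arbitrary.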

If you wanna read and modify shared data between 2 scripts which run separately, a good solution would be to take advantage of python multiprocessing module and use a Pipe() or a Queue() (see differences here). This way you get to sync scripts and avoid problems regarding concurrency and global variables (like what happens if both scripts wanna modify a variable at the same time).
The best part about using pipes/queues is that you can pass python objects through them.
There are also methods to avoid waiting for data that hasn't been passed yet (queue.empty() and pipeConn.poll()).
See an example using Queue() below:
# main.py
from multiprocessing import Process, Queue
from stage1 import Stage1
from stage2 import Stage2
s1= Stage1()
s2= Stage2()
# S1 to S2 communication
queueS1 = Queue() # s1.stage1() writes to queueS1
# S2 to S1 communication
queueS2 = Queue() # s2.stage2() writes to queueS2
# start s2 as another process
s2 = Process(target=s2.stage2, args=(queueS1, queueS2))
s2.daemon = True
s2.start() # Launch the stage2 process
s1.stage1(queueS1, queueS2) # start sending stuff from s1 to s2
s2.join() # wait till s2 daemon finishes
# stage1.py
import time
import random
class Stage1:
    def stage1(self, queueS1, queueS2):
        print("stage1")
        lala = []
        lis = [1, 2, 3, 4, 5]
        for i in range(len(lis)):
            # to avoid unnecessary waiting
            if not queueS2.empty():
                msg = queueS2.get() # get msg from s2
                print("! ! ! stage1 RECEIVED from s2:", msg)
                lala = [6, 7, 8] # now that a msg was received, further msgs will be different
            time.sleep(1) # work
            random.shuffle(lis)
            queueS1.put(lis + lala)
        queueS1.put('s1 is DONE')
# stage2.py
import time
class Stage2:
    def stage2(self, queueS1, queueS2):
        print("stage2")
        while True:
            msg = queueS1.get() # wait till there is a msg from s1
            print("- - - stage2 RECEIVED from s1:", msg)
            if msg == 's1 is DONE': # must match the sentinel sent by stage1 exactly
                break # ends loop
            time.sleep(1) # work
            queueS2.put("update lists")
EDIT: just found that you can use queue.get(False) to avoid blocking when receiving data. This way there's no need to check first if the queue is empty. This is not possible if you use pipes.
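
A small sketch of the non-blocking get mentioned in the edit: when nothing is available, get(False) raises queue.Empty, which is also the exception a multiprocessing.Queue raises. The helper name drain is made up for this example.

```python
import queue  # multiprocessing.Queue raises queue.Empty as well


def drain(q):
    """Collect every message currently available without ever blocking."""
    messages = []
    while True:
        try:
            messages.append(q.get(False))  # same as q.get_nowait()
        except queue.Empty:
            return messages
```

Inside the loop of stage1 this could replace the empty() check, e.g. for msg in drain(queueS2): ...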

Use text files or environment variables. Since the two run separately, you can't really do what you are trying to do.

In your example, the first script runs to completion, and then the second script runs. That means you need some sort of persistent state. Other answers have suggested using text files or Python's pickle module. Personally I am lazy, and I wouldn't use a text file when I could use pickle; why should I write a parser to parse my own text file format?
Instead of pickle you could also use the json module to store it as JSON. This might be preferable if you want to share the data to non-Python programs, as JSON is a simple and common standard. If your Python doesn't have json, get simplejson.
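
A minimal sketch of the JSON variant, assuming the shared state is a plain dict of JSON-compatible values. The helper names save_shared and load_shared are made up; writing to a temporary file and renaming it is a precaution so that a reader never sees a half-written file.

```python
import json
import os
import tempfile


def save_shared(path, data):
    # Write to a temp file in the same directory, then swap it into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w') as handle:
        json.dump(data, handle)
    os.replace(tmp_path, path)


def load_shared(path):
    with open(path) as handle:
        return json.load(handle)
```

The first script would call save_shared("shared.json", {"Foo": "Bar"}) and the second load_shared("shared.json")["Foo"].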
If your needs go beyond pickle or json -- say you actually want to have two Python programs executing at the same time and updating the persistent state variables in real time -- I suggest you use the SQLite database. Use an ORM to abstract the database away, and it's super easy. For SQLite and Python, I recommend Autumn ORM.

This method seems straightforward to me:
class SharedClass:
    def __init__(self):
        self.data = {}
    def set_data(self, name, value):
        self.data[name] = value
    def get_data(self, name):
        try:
            return self.data[name]
        except:
            return "none"
    def reset_data(self):
        self.data = {}
sharedClass = SharedClass()
PS: you can set data with a parameter name and a value, and read the value back with the get_data method. For example:
to set the data
example 1:
sharedClass.set_data("name","Jon Snow")
example 2:
sharedClass.set_data("email","jon@got.com")
to get the data
sharedClass.get_data("email")
to reset the entire state simply use
sharedClass.reset_data()
It's kind of like accessing data from a JSON object (a dict in this case).
Hope this helps....

You could use the basic from and import functions in python to import the variable into two.py. For example:
from filename import variable
That should import the variable from the file.
(Of course you should replace filename with one, the module name of one.py, and replace variable with the variable you want to share with two.py.)

You can also solve this problem by making the variable global
first.py
class Temp:
    def __init__(self):
        self.first = None
global var1
var1 = Temp()
var1.first = 1
print(var1.first)
second.py
import first as One
print(One.var1.first)

Using python multiprocessing Pool in the terminal and in code modules for Django or Flask

When using multiprocessing.Pool in python with the following code, there is some bizarre behavior.
from multiprocessing import Pool
p = Pool(3)
def f(x): return x
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
I get the following error three times (one for each thread in the pool), and it prints "3" through "19":
AttributeError: 'module' object has no attribute 'f'
The first three apply_async calls never return.
Meanwhile, if I try:
from multiprocessing import Pool
p = Pool(3)
def f(x): print(x)
p.map(f, range(20))
I get the AttributeError 3 times, the shell prints "6" through "19", and then hangs and cannot be killed by [Ctrl] + [C]
The multiprocessing docs have the following to say:
Functionality within this package requires that the main module be
importable by the children.
What does this mean?
To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?

Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.
Understanding the docs
Functionality within this package requires that the main module be importable by the children.
What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:
Initialize Pools inside functions whenever possible. If you have to initialize them in the global scope, do so at the bottom of the module.
Do not call the methods of a Pool in the global scope.
Alternatively, if you only need better parallelism on I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves the completely undocumented:
from multiprocessing.pool import ThreadPool
Its interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that using process pools does, with the only downside being you don't get true parallelism of code execution, just parallelism in blocking I/O.
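
A short sketch of ThreadPool in use, assuming the work is dominated by blocking I/O. Here fake_io_call is a stand-in for a database or network call, and both function names are made up for this example.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_io_call(item):
    # Stand-in for a blocking database or network call (assumption).
    time.sleep(0.1)
    return item * 2


def run_in_threads(items, workers=4):
    with ThreadPool(workers) as pool:
        # map() blocks until every call has returned and keeps the input order.
        return pool.map(fake_io_call, items)


if __name__ == '__main__':
    print(run_in_threads(range(8)))
```

Because the workers are threads in the same process, the function does not need to be importable or picklable, which is why the ordering rules described below do not apply.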
Pools must be initialized after the definitions of functions to be run on them
The inscrutable text from the python docs means that at the time the pool is defined, the surrounding module is imported by the threads in the pool. In the case of the python terminal, this means all and only code you have run so far.
So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:
from multiprocessing import Pool
def f(x): return x # FIRST
p = Pool(3) # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
Or
from multiprocessing import Pool
def f(x): print(x) # FIRST
p = Pool(3) # SECOND
p.map(f, range(20))
By fine, I mean fine on Unix. Windows has its own problems, which I'm not going into here.
Using pools in modules
But wait, there's more (to using pools in modules that you want to import elsewhere)!
If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the page, not the top. Though this goes against most good code style, it is necessary for functionality. The way to use a pool declared at the top of a page is to only use it with functions imported from other modules, like so:
from multiprocessing import Pool
from other_module import f
p = Pool(3)
p.map(f, range(20))
Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:
### module.py ###
from multiprocessing import Pool
POOL = Pool(5)
### module2.py ###
def f(x):
    # Some function
    pass
from module import POOL
POOL.map(f, range(10))
And second, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. i.e. this doesn't work:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
print(p.map(f, range(5)))
### module2.py ###
import module
This, however, does work, as long as nothing imports module2:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
def run_pool(): print(p.map(f, range(5)))
### module2.py ###
import module
module.run_pool()
Now, the reasons behind this are only more bizarre, and likely related to the reason that the code in the question raises an AttributeError only once per worker and after that appears to execute properly. It also appears that pool workers (at least with some reliability) reload the code in the module after executing.

The function you want to execute on a pool of worker processes must already be defined when you create the pool.
This should work:
from multiprocessing import Pool
def f(x): print(x)
if __name__ == '__main__':
    p = Pool(3)
    p.map(f, range(20))
The reason is that (at least on Unix-based systems, which have fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.
On Windows it's a bit different, as Windows doesn't have fork. Here new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with an if __name__ == '__main__'. Otherwise each new worker will re-execute the code and therefore spawn new processes infinitely, crashing the program (or the system).

There is another possible source for this error. I got this error when running the example code.
The source was that despite having installed multiprocessing correctly, the C++ compiler was not installed on my system, something pip informed me of when trying to update multiprocessing. So it might be worth checking that the compiler is installed.

How can I run a list of async processes when some items are dependent on each other?

I have a list of dynamically generated processes (command line libraries with arguments) which I need to run.
I know that some of them are dependent on each other. I already have some objects which contain this information. For example, standalone_exec_item contains process_data.process_id and also dependent_on_process_ids (which is a list of process ids.)
Currently I am thinking of using the multiprocessing library to run the list of processes asynchronously sort of like this:
from multiprocessing import Process
import shlex
import subprocess
def execute_standalone_exec_items(standalone_exec_items):
    standalones = []
    def run_command(command):
        output = subprocess.check_output(shlex.split(command))
        return output
    for standalone_exec_item in standalone_exec_items:
        standalone_command = generate_command(standalone_exec_item.process_data)
        standalone = Process(
            target=run_command,
            args=(standalone_command,)
        )
        standalones.append(standalone)
    for standalone in standalones:
        standalone.start()
    # poll until none of the processes is alive any more
    while True:
        flag = True
        for standalone in standalones:
            if standalone.is_alive():
                flag = False
        if flag:
            break
However I want to know if there's a nicer way of waiting for the asynchronous processes to run before running the dependent processes. Can I use callbacks? I've heard about Twisted's deferred, can I use this?
What's the best practice?
Edit:
Is it correct that Popen is non-blocking and I don't need to use multiprocessing? Or do I need to use fcntl()?

I would use a message queue, where some processes publish messages to which the process to be called subscribes.
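If a full message queue is more than you need, a different technique is to let the standard library order the work: graphlib.TopologicalSorter (Python 3.9+) hands out items whose dependencies are finished, and a thread pool runs the blocking subprocess calls. This is a rough sketch, not the approach from the answer above. The function name and the two dict layouts are assumptions for illustration, and it assumes every id listed as a dependency also appears in commands.

```python
import subprocess
import sys
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+

def run_with_dependencies(commands, depends_on, max_workers=4):
    """Run each command once everything it depends on has finished.

    commands:   {process_id: argv list}               (hypothetical layout)
    depends_on: {process_id: iterable of process_ids} (hypothetical layout)
    Returns {process_id: captured stdout as bytes}.
    """
    sorter = TopologicalSorter({pid: depends_on.get(pid, ()) for pid in commands})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    outputs, running = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while sorter.is_active():
            # Start every command whose dependencies are already done.
            for pid in sorter.get_ready():
                future = executor.submit(subprocess.check_output, commands[pid])
                running[future] = pid
            # Block until at least one running command finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                pid = running.pop(future)
                outputs[pid] = future.result()  # re-raises if the command failed
                sorter.done(pid)  # unlocks the commands waiting on this one
    return outputs

if __name__ == "__main__":
    py = [sys.executable, "-c"]
    commands = {
        "a": py + ["print('a')"],
        "b": py + ["print('b')"],
        "c": py + ["print('c')"],  # must wait for a and b
    }
    print(run_with_dependencies(commands, {"c": ["a", "b"]}))
```

Threads are enough here because each one only waits on an external process, so there is no need for multiprocessing on top of subprocess.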

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
Not using threads in Python as it essentially comes to running a single process (due to GIL).
Hence using multiprocessing module. i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good, now I need a shared object which all the sub-processes have access to. I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks I guess. With this setup when I run, I do not get any error (so the parent process seems fine), it just stalls. When I press ctrl-C I see a traceback (one for each sub-process). Also no output is written to the output file. Here's code (note that everything runs fine without multi-processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool
data_file = open('out.txt', 'w+')
def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return
def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP') # so that worker processes know when to stop
    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4) #Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])
        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else: # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
What am I doing wrong? I also tried passing the open file object to each of the processes at the time of spawning, e.g. Process(target=worker, args=[task_queue, data_file]), but this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file object is not getting replicated (at the time of spawn) or there is some other quirk... Anybody got an idea?
EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?

Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a class called 'Pool' which is perfect for my needs.
It sizes itself to the number of cores my system has, i.e. by default only as many processes are spawned as there are cores. Of course this is customizable. So here's the code. It might help someone later:
import glob
import os
from multiprocessing import Pool
# DATA_DIR and file_ptr are assumed to be defined elsewhere
def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()
def mine_page(filepath):
    # do whatever it is that you want to do in a separate process.
    return data
def save_data(data):
    # data is an object. Store it in a file, mysql or...
    return
Still going through this huge module. I'm not sure whether save_data() is executed by the parent process or by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience with this module, I'd appreciate more knowledge here...
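On the question of where save_data() runs: Pool invokes apply_async callbacks in the parent process, on an internal result-handling thread, not in the workers. Here is a minimal sketch to check this yourself by recording process IDs. mine_page is a hypothetical stand-in that only returns its argument and the worker's PID, and the file names are made up.

```python
import os
from multiprocessing import Pool

results = []

def mine_page(filepath):
    # Runs in a worker process; stands in for the real parsing.
    return (filepath, os.getpid())

def save_data(data):
    # Called by the pool with mine_page's return value.
    # Records which process the callback itself ran in.
    results.append((data, os.getpid()))

def main(paths):
    results.clear()
    with Pool(2) as pool:
        for path in paths:
            pool.apply_async(mine_page, (path,), callback=save_data)
        pool.close()
        pool.join()  # all callbacks have run once join() returns
    return list(results)

if __name__ == "__main__":
    for (path, worker_pid), callback_pid in main(["a.csv", "b.csv"]):
        print(path, "worker:", worker_pid,
              "callback:", callback_pid, "parent:", os.getpid())
```

The callback PID should match the parent PID and differ from the worker PIDs. Because the callbacks run one after another on that single thread, writing to one file from save_data() does not need a cross-process lock, but a slow callback holds up result handling for the whole pool.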

The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put their results on it, and the main code gets the results and writes them to the file.
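The result-queue idea could look roughly like this. It is a sketch under assumptions, not a drop-in fix: parse is a hypothetical stand-in for the real per-file parsing, and run, STOP and n_workers are names chosen for the example.

```python
import sys
from multiprocessing import Process, Queue

STOP = "STOP"

def parse(name):
    # Hypothetical stand-in for the real per-file parsing.
    return name.upper()

def worker(task_queue, result_queue):
    # Pull tasks until this worker sees its own STOP sentinel.
    for name in iter(task_queue.get, STOP):
        result_queue.put(parse(name))
    result_queue.put(None)  # tell the parent this worker is finished

def run(names, out, n_workers=4):
    task_queue, result_queue = Queue(), Queue()
    for name in names:
        task_queue.put(name)
    for _ in range(n_workers):
        # One sentinel per worker. With a single 'STOP', only one worker
        # ever leaves its loop and the rest block on get() forever.
        task_queue.put(STOP)
    procs = [Process(target=worker, args=(task_queue, result_queue))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    finished = 0
    while finished < n_workers:  # only the parent writes to the output
        item = result_queue.get()
        if item is None:
            finished += 1
        else:
            out.write(repr(item) + "\n")
    for p in procs:  # join after draining the result queue
        p.join()

if __name__ == "__main__":
    run(["a.csv", "b.csv", "c.csv"], sys.stdout, n_workers=2)
```

Since only the parent touches the output, no lock is needed, and the order of the lines depends on which worker finishes first.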
