Python execute a function in parallel in loop


I am trying to improve the execution time of a script which imports data from CSV files into a Graphite/Go-Carbon time-series DB.
This is the loop which goes through all the zip files and processes each one in a function (execute_run).
I tried this code but I got an error:
for idx4, Lst_f in enumerate(full_csvfile_paths):
    if lst_metrics in Lst_f:
        zip_file = Lst_f
        with zipfile.ZipFile(zip_file) as zipobj:
            print("Using ZipFile:",zipobj.filename)
            #execute_run(zipobj.filename, confcsv_path, storage_type, serial)
            output = subprocess.run(execute_run(zipobj.filename, confcsv_path, storage_type, serial),stdout=subprocess.PIPE)
            print ("Return code: %i" % output.returncode)
            print ("Output data: %s" % output.stdout)
Error:
Traceback (most recent call last):
File "./02-pickle-client.py", line 451, in <module>
main()
File "./02-pickle-client.py", line 361, in main
output = subprocess.run(execute_run(zipobj.filename, confcsv_path, storage_type, serial),stdout=subprocess.PIPE)
File "/usr/lib64/python3.6/subprocess.py", line 423, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1240, in _execute_child
args = list(args)
TypeError: 'NoneType' object is not iterable
Is there a way to execute the function execute_run X times in parallel and check that each run completed correctly?
Many thanks for your help.

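The traceback shows the cause: execute_run(...) is called first and returns None, and that None is then passed to subprocess.run, which expects a command (a list of arguments) to launch as an external program. subprocess.run cannot run a Python function. Instead of subprocess.run, I would recommend using multiprocessing.Pool and its starmap method, as described in the multiprocessing docs.
This could look something like this: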
import multiprocessing as mp

# Step 1: Create a multiprocessing.Pool and specify the number of worker processes (here I use 4).
pool = mp.Pool(4)
# Step 2: Use pool.starmap, which takes an iterable of argument tuples and unpacks each tuple into one call
results = pool.starmap(My_Function, [(item, variable2, variable3) for item in data])
# Step 3: Don't forget to close the pool and wait for the workers to finish
pool.close()
pool.join()
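Applied to the loop in the question, a minimal sketch could look like the one below. It assumes that execute_run is a top-level function (so it can be pickled) and that it returns a value the parent can check. The body of execute_run, the helper names build_tasks and run_all, and the file names are all made up for illustration.

```python
import multiprocessing as mp


def execute_run(zip_path, confcsv_path, storage_type, serial):
    # Hypothetical stand-in for the real import function from the question.
    # Returning a value lets the parent check that each call completed.
    return (zip_path, "ok")


def build_tasks(zip_paths, confcsv_path, storage_type, serial):
    # One argument tuple per zip file; starmap unpacks each tuple into a call.
    return [(path, confcsv_path, storage_type, serial) for path in zip_paths]


def run_all(zip_paths, confcsv_path, storage_type, serial, workers=4):
    tasks = build_tasks(zip_paths, confcsv_path, storage_type, serial)
    # starmap blocks until every task is done; the context manager then
    # shuts the pool down.
    with mp.Pool(processes=workers) as pool:
        return pool.starmap(execute_run, tasks)


if __name__ == "__main__":
    # Made-up file names and settings, for illustration only.
    paths = ["metrics_a.zip", "metrics_b.zip", "metrics_c.zip"]
    for zip_path, status in run_all(paths, "conf.csv", "graphite", "SN001"):
        print(zip_path, status)
```

If a worker raises an exception, starmap re-raises it in the parent, so a failed run does not pass silently. The `if __name__ == "__main__":` guard matters on platforms where worker processes re-import the main module.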

Related

Why does multiprocessing not work when opening a file?

As I am trying out the multiprocessing pool module, I noticed that it does not work when I am loading / opening any kind of file. The code below works as expected. When I uncomment lines 8-9, the script skips the pool.apply_async method, and loopingTest never runs.
import time
from multiprocessing import Pool


class MultiClass:
    def __init__(self):
        file = 'test.txt'
        # with open(file, 'r') as f: # This is the culprit
        #     self.d = f
        self.n = 50000000
        self.cases = ['1st time', '2nd time']
        self.multiProc(self.cases)
        print("It's done")

    def loopingTest(self, cases):
        print(f"looping start for {cases}")
        n = self.n
        while n > 0:
            n -= 1
        print(f"looping done for {cases}")

    def multiProc(self, cases):
        test = False
        pool = Pool(processes=2)
        if not test:
            for i in cases:
                pool.apply_async(self.loopingTest, (i,))
        pool.close()
        pool.join()


if __name__ == '__main__':
    start = time.time()
    w = MultiClass()
    end = time.time()
    print(f'Script finished in {end - start} seconds')

You see this behavior because the task submitted by apply_async fails when you save the open file object (self.d) on your instance. When you call apply_async(self.loopingTest, ...), Python needs to pickle self.loopingTest to send it to the worker process, which also requires pickling self. When an open file object is stored as an attribute of self, the pickling fails, because open file objects can't be pickled. apply_async does not raise that error by itself; it only surfaces when you ask for the result. You'll see this for yourself if you use apply instead of apply_async in your sample code, or call get() on the object that apply_async returns. You'll get an error like this:
Traceback (most recent call last):
File "a.py", line 36, in <module>
w = MultiClass()
File "a.py", line 12, in __init__
self.multiProc(self.cases)
File "a.py", line 28, in multiProc
out.get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
You need to change your code, either by not saving the file object on self and only opening the file in the worker method (if that's where you need to use it), or by using the tools Python provides to control the pickle/unpickle process for your class (__getstate__ and __setstate__). Depending on the use-case, you can also turn the method you're passing to apply_async into a top-level function, so that self doesn't need to be pickled at all.
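As a rough illustration of the pickle-control option, the sketch below drops the open file object in __getstate__ and reopens the file lazily on the other side. The class name FileHolder, its methods and the demo function are invented for this example and are not part of the question's code.

```python
import os
import pickle
import tempfile


class FileHolder:
    def __init__(self, path):
        self.path = path
        self.n = 3
        self._fh = None  # opened lazily, never sent to another process

    def open(self):
        # Open the file only where it is actually needed.
        if self._fh is None:
            self._fh = open(self.path, "r")
        return self._fh

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable file object.
        state = self.__dict__.copy()
        state["_fh"] = None
        return state


def demo():
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("hello\n")
    try:
        holder = FileHolder(tmp.name)
        holder.open()  # the instance now holds an open file object
        # Without __getstate__ this round trip would raise TypeError.
        clone = pickle.loads(pickle.dumps(holder))
        first_line = clone.open().readline()
        clone.close()
        holder.close()
        return clone.n, first_line
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(demo())
```

The same round trip through pickle is what multiprocessing performs when a bound method such as self.loopingTest is sent to a worker, so an instance that survives pickle.dumps can also be used with apply_async.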

Error pickling a `matlab` object in joblib `Parallel` context

I'm running some Matlab code in parallel from inside a Python context (I know, but that's what's going on), and I'm hitting an import error involving matlab.double. The same code works fine in a multiprocessing.Pool, so I am having trouble figuring out what the problem is. Here's a minimal reproducing test case.
import matlab
from multiprocessing import Pool
from joblib import Parallel, delayed

# A global object that I would like to be available in the parallel subroutine
x = matlab.double([[0.0]])

def f(i):
    print(i, x)

with Pool(4) as p:
    p.map(f, range(10))
# This prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

for _ in Parallel(4, backend='multiprocessing')(delayed(f)(i) for i in range(10)):
    pass
# This also prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

# Now run with default `backend='loky'`
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
    pass
# ^ this crashes.
So, the only problematic one is the one using the 'loky' backend.
The full traceback is:
exception calling callback for <Future at 0x7f63b5a57358 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 20, in <module>
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
self.retrieve()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Looking at the traceback, it seems like the root cause is an issue importing the matlab package in the child process.
It's probably worth noting that this all runs just fine if instead I had defined x = np.array([[0.0]]) (after importing numpy as np). And of course the main process has no problem with any matlab imports, so I am not sure why the child process would.
I'm not sure if this error has anything in particular to do with the matlab package, or if it's something to do with global variables and cloudpickle or loky. In my application it would help to stick with loky, so I'd appreciate any insight!
I should also note that I'm using the official Matlab engine for Python: https://www.mathworks.com/help/matlab/matlab-engine-for-python.html. I suppose that might make it hard for others to try out the test cases, so I wish I could reproduce this error with a type other than matlab.double, but I haven't found another yet.
Digging around more, I've noticed that the process of importing the matlab package is more circular than I would expect, and I'm speculating that this could be part of the problem? The issue is that when import matlab is run by loky's _ForkingPickler, first some file matlab/mlarray.py is imported, which imports some other files, one of which contains import matlab, and this causes matlab/__init__.py to be run, which internally has from mlarray import double, single, uint8, ... which is the line that causes the crash.
Could this circularity be the issue? If so, why can I import this module in the main process but not in the loky backend?

The error is caused by incorrect loading order of global objects in the child processes. It can be seen clearly in the traceback
_ForkingPickler.loads(res) -> ... -> import matlab -> from mlarray import ...
that matlab is not yet imported when the global variable x is loaded by cloudpickle.
joblib with loky seems to treat modules as normal global objects and send them dynamically to the child processes. joblib doesn't record the order in which those objects/modules were defined. Therefore they are loaded (initialized) in a random order in the child processes.
A simple workaround is to manually pickle the matlab object and load it after importing matlab inside your function.
import matlab
import pickle

px = pickle.dumps(matlab.double([[0.0]]))

def f(i):
    import matlab
    x = pickle.loads(px)
    print(i, x)
Of course you can also use joblib.dump and joblib.load (with an in-memory buffer such as io.BytesIO) to serialize the objects.
Use initializer
Thanks to the suggestion of @Aaron, you can also use an initializer (for loky) to import Matlab before loading x.
Currently there's no simple API to specify initializer. So I wrote a simple function:
def with_initializer(self, f_init):
    # Overwrite initializer hook in the Loky ProcessPoolExecutor
    # https://github.com/tomMoral/loky/blob/f4739e123acb711781e46581d5ed31ed8201c7a9/loky/process_executor.py#L850
    hasattr(self._backend, '_workers') or self.__enter__()
    origin_init = self._backend._workers._initializer
    def new_init():
        origin_init()
        f_init()
    self._backend._workers._initializer = new_init if callable(origin_init) else f_init
    return self
It is a little bit hacky but works well with the current version of joblib and loky.
Then you can use it like:
import matlab
from joblib import Parallel, delayed

x = matlab.double([[0.0]])

def f(i):
    print(i, x)

def _init_matlab():
    import matlab

with Parallel(4) as p:
    for _ in with_initializer(p, _init_matlab)(delayed(f)(i) for i in range(10)):
        pass
I hope the developers of joblib will add initializer argument to the constructor of Parallel in the future.

Error using subprocess.check_output in Python

I am at a very basic level in Python, and I'm trying to learn how to use the subprocess module. I have a simple calculator program called x.py that takes in a number, multiplies it by 2, and returns the result. I am trying to execute that simple program from IDLE with the following two lines of code, but I get errors. The number 5 is the number I'm trying to feed into x.py to get a result. Would someone mind helping me understand what I'm doing wrong and help me get it right? Thanks!
import subprocess
result = subprocess.check_output(["C:\\Users\\Kyle\\Desktop\\x.py",5])
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
result = subprocess.check_output(["C:\\Users\\Kyle\\Desktop\\x.py",5])
File "C:\Python27\lib\subprocess.py", line 537, in check_output
process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "C:\Python27\lib\subprocess.py", line 679, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 855, in _execute_child
args = list2cmdline(args)
File "C:\Python27\lib\subprocess.py", line 587, in list2cmdline
needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

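Pass 5 as a string, since every item in the argument list must be a string. It also helps to launch the script through the Python interpreter (sys.executable) rather than relying on the .py file association:
import subprocess
import sys
result = subprocess.check_output([sys.executable, "C:\\Users\\Kyle\\Desktop\\x.py", '5'])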
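To try this without the file on the Desktop, here is a small self-contained sketch in Python 3. The inline SCRIPT is a hypothetical stand-in for x.py (it doubles its first argument), and double_via_subprocess is a name invented for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for x.py: doubles the number given on the command line.
SCRIPT = "import sys; print(int(sys.argv[1]) * 2)"


def double_via_subprocess(n):
    # Every element of the argument list must be a string, hence str(n).
    out = subprocess.check_output([sys.executable, "-c", SCRIPT, str(n)])
    # check_output returns bytes in Python 3; decode and strip the newline.
    return int(out.decode().strip())


if __name__ == "__main__":
    print(double_via_subprocess(5))
```

Note that the child process sends its result back as text on stdout, so the caller has to convert it to a number again.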

Python Pool Map() Gives Either Pickle Error Or Does Not Iterate List Properly

I have a function that takes a list of urls and adds a header to each url. The url_list can be about 25,000 items long, so I want to use multiprocessing. I have tried 2 ways, both of which failed:
First way: the url_list is not passed correctly. The function only gets the first letter 'h' of a url from url_list:
headers = {}
header_token = {}

def do_it(url_list):
    for i in url_list:
        print "adding header to: \n" + i
        requests.post(i, headers=headers)
    print "done!"

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(do_it, url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "head.py", line 95, in <module>
pool.map(do_it, url_list)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
requests.exceptions.MissingSchema: Invalid URL u'h': No schema supplied
The second way...the way I prefer since I don't have to make headers dictionary global. But I get a pickle error:
def wrapper(headers):
    def do_it(url_list):
        for i in url_list:
            print "adding header to: \n" + i
            requests.post(i, headers=headers)
        print "done!"
    return do_it

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(wrapper(headers), url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 761, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 761, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

If you are looking to use your second implementation, then I think you should be able to use dill to serialize your wrapper function. Dill can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails. Dill has the same interface as python's pickle, but also provides some additional methods. If you want to use dill for serialization with multiprocessing, all you have to do is:
>>> import dill
>>> # your code goes here (as above)
And, if that doesn't work for some reason, you could swap out multiprocessing with pathos... which is built to do multiprocessing using dill -- and provides a multi-*args map function (exactly like the standard python map).
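A standard-library alternative, sketched here in Python 3, is to keep the worker at module level and bind the headers with functools.partial. This also addresses the first attempt: pool.map calls the function once per list element, so the worker receives a single URL string, and looping over that string yields its characters (hence the 'h'). The names add_header and make_task are invented for this example, and add_header only stands in for the requests.post call.

```python
import pickle
from functools import partial


def add_header(url, headers):
    # Hypothetical stand-in for requests.post(url, headers=headers).
    # It receives ONE url per call, which is how pool.map hands out work.
    return (url, sorted(headers))


def make_task(headers):
    # A partial wrapping a top-level function is picklable, unlike the
    # closure returned by wrapper(headers) in the question.
    return partial(add_header, headers=headers)


if __name__ == "__main__":
    import multiprocessing

    urls = ["http://example.com/a", "http://example.com/b"]
    task = make_task({"X-Token": "abc"})
    pickle.dumps(task)  # succeeds, so the pool can send it to workers
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(task, urls))
```

This keeps the headers dictionary out of the global namespace, which was the reason given for preferring the second approach.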

You need to use a Queue from the multiprocessing package. The datatype that you're pulling from or adding to needs to be thread and process safe; a multiprocessing Queue is both.
http://docs.python.org/2/library/queue.html
http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes

Python multiprocessing pool.map raises IndexError

I've developed a utility using python/cython that sorts CSV files and generates stats for a client, but invoking pool.map seems to raise an exception before my mapped function has a chance to execute. Sorting a small number of files seems to function as expected, but as the number of files grows to say 10, I get the below IndexError after calling pool.map. Does anyone happen to recognize the below error? Any help is greatly appreciated.
While the code is under NDA, the use-case is fairly simple:
Code Sample:
def sort_files(csv_files):
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    sorted_dicts = pool.map(sort_file, csv_files, 1)
    return sorted_dicts

def sort_file(csv_file):
    print 'sorting %s...' % csv_file
    # sort code
Output:
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range

The IndexError is an error you get somewhere in sort_file(), i.e. in a subprocess. It is re-raised by the parent process. Apparently multiprocessing doesn't make any attempt to inform us about where the error really comes from (e.g. on which lines it occurred) or even just what argument to sort_file() caused it. I hate multiprocessing even more :-(

Check further up in the command output.
In Python 3.4 at least, multiprocessing.pool will helpfully print a RemoteTraceback above the parent process traceback. You'll see something like:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/path/to/your/code/here.py", line 80, in sort_file
something = row[index]
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range
In the case above, the code raising the error is at "/path/to/your/code/here.py", line 80.
See also: debugging errors in python multiprocessing
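On Python versions that do not print the remote traceback, one possible workaround is to catch the exception inside the worker and hand the failing argument and the formatted traceback back to the parent. The sketch below is written in Python 3; sort_file here is a made-up stand-in that fails on one input, and safe_sort_file is a name invented for this example.

```python
import traceback


def sort_file(csv_file):
    # Hypothetical stand-in for the real worker: the string plays the role
    # of a file's content, and indexing the split result mimics row[index].
    row = csv_file.split(",")
    return row[1]  # IndexError when the input has no comma


def safe_sort_file(csv_file):
    # Catch the error inside the worker and return the argument together
    # with the formatted traceback, so the parent can see where it failed.
    try:
        return ("ok", csv_file, sort_file(csv_file))
    except Exception:
        return ("error", csv_file, traceback.format_exc())


if __name__ == "__main__":
    import multiprocessing

    with multiprocessing.Pool(processes=2) as pool:
        for status, name, detail in pool.map(safe_sort_file, ["a,b", "broken"], 1):
            print(status, name, detail)
```

The parent then decides what to do with failed items, and the traceback text still points at the line inside sort_file that raised.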
