I'm trying to parallelize a loop over pairs of numbers together with some other fixed data. That is, I'm iterating over pairs like (1,2), (6,4), (4,3)... while the other dataset stays fixed.
import multiprocessing
from functools import partial

def myfunc(iterate, datasets):
    # doing stuff
    ...

if __name__ == '__main__':
    some_data = read_data()  # a lot of data
    iterates = [[1, 2], [3, 8], [7, 9], [12, 5]]  # the values mean nothing
    pool = multiprocessing.Pool(processes=4)
    partial_myfunc = partial(myfunc, datasets=some_data)
    results = pool.map(partial_myfunc, iterates)
myfunc does its work with two inputs: one iterate (a pair from the list) and the fixed datasets argument.
The list of iterates looks like [[1,2],[3,4],[5,6],...],
and datasets contains various kinds of data (dates, floats, dataframes, etc.), but it is fixed.
And when I run the main program above, Python says:
Traceback (most recent call last):
File "path/main.py", line 117, in <module>
results = pool.map(partial_myfunc, iterates)
File "lib\multiprocessing\pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "lib\multiprocessing\pool.py", line 644, in get
raise self._value
File "lib\multiprocessing\pool.py", line 424, in _handle_tasks
put(task)
File "lib\multiprocessing\connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle module objects
Can anybody tell me what's going on? I really need some help, guys.
Could you adjust your code so that the pool is created as follows, please:
...
max_threads = multiprocessing.cpu_count()
...
#create pool with max_threads - feel free to do max_threads-1 so your device is still usable
p = multiprocessing.Pool(max_threads)
results = p.map(partial_myfunc, iterates)
Also, are you using Python 2 or 3? The more information you provide, the easier it is to figure out :)
Let me know if it makes any difference!
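In the meantime, "can't pickle module objects" usually means one of the items inside some_data is (or contains) a module object, and the pool cannot send it to the workers. Here is a small diagnostic you could run before building the pool (this assumes some_data is a flat list, which is my reading of your description):

import pickle

# Probe each element of the fixed dataset; the one that refuses to pickle
# is what the pool is choking on.
for i, item in enumerate(some_data):
    try:
        pickle.dumps(item)
    except Exception as e:
        print("%d: %r -> %s" % (i, type(item), e))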
Related
I am trying to run a simple multiprocessing example in Python 3.6 in a Zeppelin notebook (on Windows), but I am not able to execute it. Below is the code that I used:
from multiprocessing import Pool

def sqrt(x):
    return x ** 0.5

numbers = [i for i in range(1000000)]

with Pool() as pool:
    sqrt_ls = pool.map(sqrt, numbers)
After running this code I am getting the following error:
Traceback (most recent call last):
File "/tmp/zeppelin_python-3196160128578820301.py", line 315, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 6, in <module>
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sqrt at 0x7f6f84f1a620>: attribute lookup sqrt on __main__ failed
I am not sure if it's just me who is facing this issue, as I have seen so many articles where people run this code easily. If you know the solution, please help.
Thanks
From the multiprocessing documentation:
Note: Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
Notebooks run Python interactive interpreters behind the scenes, which is probably why you get this error. You can try to run your code from within an if __name__ == '__main__': block.
A Zeppelin notebook does not emulate a normal module well enough to support the pickling that is used to identify the correct operation to another process. You can put all the functions you want to call into a proper module that you import in the usual fashion.
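For example, a minimal sketch of that approach (the module name mymp.py is just an illustration):

# mymp.py -- an ordinary module that the child processes can import
def sqrt(x):
    return x ** 0.5

# in the notebook paragraph
from multiprocessing import Pool
from mymp import sqrt  # picklable by reference as "mymp.sqrt"

if __name__ == '__main__':
    numbers = [i for i in range(1000000)]
    with Pool() as pool:
        sqrt_ls = pool.map(sqrt, numbers)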
I am trying to map a function call to a pool. I am using the fixes from other similar threads involving calling functions with multiple arguments. I cannot include a more comprehensive example as the train methods for the strategy classes are very long and complicated and the data sets are large:
import multiprocessing as mp
from functools import partial
from numpy import array_split

def call_train(signals, args):
    return args[0].train(signals, args[1])

pool = mp.Pool()
chunks = array_split(data.train_signals, pool._processes)
res = pool.map(partial(call_train, [strat, data.train_md]), chunks)
In the above, strat is a python object and both data.train_signals and data.train_md are pandas dataframes.
The error is as follows:
File "/home/jason/PycharmProjects/backtester/core/backtester.py", line 51, in evaluate
res = pool.map(partial(call_train, [strat, data.train_md]), chunks)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
put(task)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle generator objects
Update:
Nested three objects deep under my strat class, I had a generator I had forgotten about.
Setting use_multiprocessing=False in fit_generator somehow works around the problem.
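Coming back to the original "can't pickle generator objects" error: a rough way to locate which attribute is hiding the generator is to walk an object's __dict__ and try to pickle each value. A sketch (find_unpicklable is just an illustrative helper, not a library function):

import pickle

def find_unpicklable(obj, path="obj", depth=0, max_depth=4):
    # Try to pickle every attribute of obj and report the ones that fail,
    # descending into offending attributes that have a __dict__ of their own.
    if depth > max_depth:
        return
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as e:
            print("%s.%s -> %s" % (path, name, e))
            if hasattr(value, "__dict__"):
                find_unpicklable(value, "%s.%s" % (path, name), depth + 1, max_depth)

# e.g. find_unpicklable(strat)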
I wrote a script on a Linux platform using Python's multiprocessing module. When I tried running the program on Windows it did not work directly, which I found out is related to how child processes are created on Windows: it seems to be crucial that the objects which are used can be pickled.
My main problem is that I am using large numpy arrays. It seems that beyond a certain size they are not picklable any more. To break it down to a simple script, I want to do something like this:
### Import modules
import numpy as np
import multiprocessing as mp

number_of_processes = 4

if __name__ == '__main__':
    def reverse_np_array(arr):
        arr = arr + 1
        return arr

    a = np.ndarray((200, 1024, 1280), dtype=np.uint16)

    def put_into_queue(_Queue, arr):
        _Queue.put(reverse_np_array(arr))

    Queue_list = []
    Process_list = []
    list_of_arrays = []

    for i in range(number_of_processes):
        Queue_list.append(mp.Queue())

    for i in range(number_of_processes):
        Process_list.append(mp.Process(target=put_into_queue, args=(Queue_list[i], a)))

    for i in range(number_of_processes):
        Process_list[i].start()

    for i in range(number_of_processes):
        list_of_arrays.append(Queue_list[i].get())

    for i in range(number_of_processes):
        Process_list[i].join()
I get the following error message:
Traceback (most recent call last):
File "Windows_multi.py", line 34, in <module>
Process_list[i].start()
File "C:\Program Files\Anaconda32\lib\multiprocessing\process.py", line 130, i
n start
self._popen = Popen(self)
File "C:\Program Files\Anaconda32\lib\multiprocessing\forking.py", line 277, i
n __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Program Files\Anaconda32\lib\multiprocessing\forking.py", line 199, i
n dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Program Files\Anaconda32\lib\pickle.py", line 224, in dump
self.save(obj)
File "C:\Program Files\Anaconda32\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Program Files\Anaconda32\lib\pickle.py", line 419, in save_reduce
save(state)
File "C:\Program Files\Anaconda32\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Program Files\Anaconda32\lib\pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
So I am basically creating a large array which I need in all processes, to do calculations with it and return the result.
One important thing seems to be to write the function definitions before the if __name__ == '__main__': statement.
The whole thing is working if I reduce the array to (50,1024,1280).
However, even when 4 processes are started and 4 cores are working, it is slower than running the code without multiprocessing on one core only (on Windows). So I think I have another problem here.
The function in my real program later on is in a cython module.
I am using the anaconda package with python 32-bit since I could not get my cython package compiled with the 64-bit version (I'll ask about that in a different thread).
Any help is welcome!!
Thanks!
Philipp
UPDATE:
The first mistake I made was having the put_into_queue function definition inside __main__.
Then I introduced shared arrays as suggested; however, this uses a lot of memory, and the memory used scales with the number of processes (which should of course not be the case).
Any ideas what I am doing wrong here? It does not seem to matter where I place the definition of the shared array (inside or outside __main__), though I think it should be in __main__. I got this from this post: Is shared readonly data copied to different processes for Python multiprocessing?
import numpy as np
import multiprocessing as mp
import ctypes

shared_array_base = mp.Array(ctypes.c_uint, 1280*1024*20)
shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
#print shared_array
shared_array = shared_array.reshape(20, 1024, 1280)

number_of_processes = 4

def put_into_queue(_Queue, arr):
    _Queue.put(reverse_np_array(arr))

def reverse_np_array(arr):
    arr = arr + 1 + np.random.rand()
    return arr

if __name__ == '__main__':
    #print shared_array
    #a = np.ndarray((50,1024,1280),dtype=np.uint16)
    Queue_list = []
    Process_list = []
    list_of_arrays = []

    for i in range(number_of_processes):
        Queue_list.append(mp.Queue())

    for i in range(number_of_processes):
        Process_list.append(mp.Process(target=put_into_queue, args=(Queue_list[i], shared_array)))

    for i in range(number_of_processes):
        Process_list[i].start()

    for i in range(number_of_processes):
        list_of_arrays.append(Queue_list[i].get())

    for i in range(number_of_processes):
        Process_list[i].join()
You didn't include the full traceback; the end is most important. On my 32-bit Python I get the same traceback that finally ends in
File "C:\Python27\lib\pickle.py", line 486, in save_string
self.write(BINSTRING + pack("<i", n) + obj)
MemoryError
MemoryError is the exception and it says you ran out of memory.
64-bit Python would get around this, but sending large amounts of data between processes can easily become a serious bottleneck in multiprocessing.
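If the workers really all need the same large array, one pattern that avoids pushing it through pickling at all is to create a shared mp.Array before the pool, hand it to the workers via the pool initializer, and only pass small indices through map; each worker then reads and writes its block of the shared buffer in place. A rough sketch of that idea (the shape, dtype, and worker function are illustrative, not the asker's exact code):

import ctypes
import numpy as np
import multiprocessing as mp

def init_worker(shared_base_):
    # Runs once in each worker process; stores the shared buffer in a global
    # so the mapped function can see it without pickling the data itself.
    global shared_base
    shared_base = shared_base_

def worker(block_index):
    # Re-wrap the shared buffer as a numpy view and modify one block in place;
    # only the small integer block_index crosses the process boundary.
    arr = np.frombuffer(shared_base, dtype=np.uint16).reshape(20, 1024, 1280)
    arr[block_index] += 1

if __name__ == '__main__':
    # Flat shared buffer; lock=False because each worker touches its own block.
    shared = mp.Array(ctypes.c_uint16, 20 * 1024 * 1280, lock=False)
    pool = mp.Pool(processes=4, initializer=init_worker, initargs=(shared,))
    pool.map(worker, range(20))
    pool.close()
    pool.join()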
I have a function that takes a list of URLs and adds a header to each URL. The url_list can be around 25,000 URLs long, so I want to use multiprocessing. I have tried two ways, both with failure:
First way: the url_list is not passed correctly... the function only gets the first letter 'h' of the URL:
import multiprocessing
import requests

headers = {}
header_token = {}

def do_it(url_list):
    for i in url_list:
        print "adding header to: \n" + i
        requests.post(i, headers=headers)
    print "done!"

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(do_it, url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "head.py", line 95, in <module>
pool.map(do_it, url_list)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
requests.exceptions.MissingSchema: Invalid URL u'h': No schema supplied
The second way... the way I prefer, since I don't have to make the headers dictionary global. But I get a pickle error:
def wrapper(headers):
    def do_it(url_list):
        for i in url_list:
            print "adding header to: \n" + i
            requests.post(i, headers=headers)
        print "done!"
    return do_it

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(wrapper(headers), url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 761, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
If you are looking to use your second implementation, then I think you should be able to use dill to serialize your wrapper function. Dill can serialize almost anything in Python, and it has some good tools for helping you understand what is causing your pickling to fail. Dill has the same interface as Python's pickle, but also provides some additional methods. If you want to use dill for serialization with multiprocessing, all you have to do is:
>>> import dill
>>> # your code goes here (as above)
And, if that doesn't work for some reason, you could swap out multiprocessing with pathos... which is built to do multiprocessing using dill -- and provides a multi-*args map function (exactly like the standard python map).
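For instance, a minimal sketch of the pathos route, reusing the wrapper, headers, and url_list names from the question (this assumes pathos is installed):

from pathos.multiprocessing import ProcessingPool as Pool

# pathos serializes with dill, so the closure returned by wrapper(headers)
# can be shipped to the worker processes even though pickle can't handle it.
pool = Pool(8)
results = pool.map(wrapper(headers), url_list)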
You need to use a Queue from the multiprocessing package. The data structure that you're pulling from or adding to needs to be thread- and process-safe; a Queue is both.
http://docs.python.org/2/library/queue.html
http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
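As a side note on the first traceback: pool.map hands each worker one element of url_list at a time, so do_it receives a single URL string and then loops over its characters, which is where the lone 'h' comes from. A rough Queue-based sketch instead (url_list and headers as in the question; the worker count and the None sentinel are arbitrary choices):

import multiprocessing
import requests

headers = {}

def consume(queue):
    # Pull whole URLs off the queue until the None sentinel arrives.
    while True:
        url = queue.get()
        if url is None:
            break
        requests.post(url, headers=headers)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=consume, args=(queue,))
               for _ in range(8)]
    for w in workers:
        w.start()
    for url in url_list:
        queue.put(url)
    for _ in workers:
        queue.put(None)  # one sentinel per worker so every loop can exit
    for w in workers:
        w.join()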
I've developed a utility using Python/Cython that sorts CSV files and generates stats for a client, but invoking pool.map seems to raise an exception before my mapped function has a chance to execute. Sorting a small number of files works as expected, but as the number of files grows to, say, 10, I get the IndexError below after calling pool.map. Does anyone recognize this error? Any help is greatly appreciated.
While the code is under NDA, the use-case is fairly simple:
Code Sample:
import multiprocessing

def sort_files(csv_files):
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    sorted_dicts = pool.map(sort_file, csv_files, 1)
    return sorted_dicts

def sort_file(csv_file):
    print 'sorting %s...' % csv_file
    # sort code
Output:
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range
The IndexError is an error you get somewhere in sort_file(), i.e. in a subprocess. It is re-raised by the parent process. Apparently multiprocessing doesn't make any attempt to inform us about where the error really comes from (e.g. on which lines it occurred) or even just what argument to sort_file() caused it. I hate multiprocessing even more :-(
Check further up in the command output.
In Python 3.4 at least, multiprocessing.pool will helpfully print a RemoteTraceback above the parent process traceback. You'll see something like:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/path/to/your/code/here.py", line 80, in sort_file
something = row[index]
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range
In the case above, the code raising the error is at /path/to/your/code/here.py", line 80
see also debugging errors in python multiprocessing
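On Python 2.7, where there is no RemoteTraceback, a common workaround is to catch the exception inside the worker and re-raise it with the formatted traceback attached, so the parent at least prints where it happened. A sketch wrapping the sort_file from the question:

import traceback

def sort_file_traced(csv_file):
    # Run the real worker; if it fails, attach the child's traceback text
    # to the re-raised exception so the parent process can show it.
    try:
        return sort_file(csv_file)
    except Exception:
        raise Exception(traceback.format_exc())

# then: sorted_dicts = pool.map(sort_file_traced, csv_files, 1)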