First of all, I know there are already quite a few threads about multiprocessing in Python, but none of them seems to solve my problem.
Here is my problem:
I want to implement the random forest algorithm, and a naive way to do so would be like this:
def random_tree(Data):
    tree = calculation(Data)
    forest.append(tree)

forest = list()
for i in range(300):
    random_tree(Data)
And the forest with 300 "trees" inside would be my final result. In this case, how do I turn this code into a multiprocessing version?
Update:
I just tried Mukund M K's method in a very simplified script:
from multiprocessing import Pool
import numpy as np  # needed for np.array below

def f(x):
    return 2*x

data = np.array([1, 2, 5])
pool = Pool(processes=4)
forest = pool.map(f, (data for i in range(4)))
# I use range() instead of xrange() because I am using Python 3.4
And now the script runs forever. I opened a Python shell, entered the script line by line, and these are the messages I got:
Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
    self.run()
  File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
    return ForkingPickler.loads(res)
AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>
(the same traceback, interleaved with this one, is repeated for SpawnPoolWorker-2 through SpawnPoolWorker-4)
Update: I edited my sample code following some other example code, like this:
from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

if __name__ == '__main__':
    data = np.array([1, 2, 3])
    with Pool(5) as p:
        result = p.map(f, (data for i in range(300)))
And it works now. What I need to do next is fill this in with a more sophisticated algorithm.
Yet another question on my mind: why does this code work, while the previous version couldn't?
You can do it with multiprocessing this way:
from multiprocessing import Pool

def random_tree(Data):
    return calculation(Data)

pool = Pool(processes=4)
forest = pool.map(random_tree, (Data for i in range(300)))
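Not part of the original answer, but it addresses the follow-up question above: with the spawn start method (the default on Windows), each worker re-imports the main module, so the pool should be created under an if __name__ == '__main__': guard and the mapped function must live at module level; otherwise the workers fail with the AttributeError shown earlier. A minimal guarded sketch, with calculation() as a stand-in for the real tree-building code:

from multiprocessing import Pool

def calculation(data):
    # stand-in for the real tree-building work
    return sum(data)

def random_tree(data):
    return calculation(data)

if __name__ == '__main__':
    data = [1, 2, 3]
    with Pool(processes=4) as pool:
        forest = pool.map(random_tree, (data for i in range(300)))
    print(len(forest))  # 300

With spawn, unguarded module-level code is also re-executed in every worker, which is part of why the unguarded script appeared to run forever.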
The processing package might help you. Check it out here.
Related
I am trying to run a simple command that guesses gender by name using multiprocessing. This code worked on a previous machine so perhaps my setup had something to do with it.
Below is my multiprocessing code:
import sys
import gender_guesser.detector as gender
import multiprocessing
import time
import pandas as pd  # needed for pd.concat below

d = gender.Detector()

def guess_gender(name):
    n = name.title()  # make first letter upper case and the rest lower case
    g = d.get_gender(n)  # guess gender
    return g

ls = ['john','joe','amamda','derick','peter','ashley','john','joe','amamda','derick','peter','ashley']

t = time.time()
results = []

def callBack(x):
    results.append(x)

pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()-1, maxtasksperchild=1)
for n in ls:
    print(n)
    pool.apply_async(guess_gender, args=[n], callback=callBack)
pool.close()
pool.join()
results = pd.concat(results)
print(time.time()-t)
It simply runs and doesn't do anything. In my cmd window, I see the following at the end of an error message:
AttributeError: Can't get attribute 'guess_gender' on <module '__main__' (built-in)>
I am running Python 3.6.1 on Anaconda:
import sys
print(sys.version)
3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Update: I still cannot get it to work. Below is the entire cmd log from when I ran the code provided. I appreciate any thoughts you may have!
C:\Users\ywu\Google Drive>jupyter notebook
[I 10:13:43.954 NotebookApp] Serving notebooks from local directory: C:\Users\ywu\Google Drive
[I 10:13:43.954 NotebookApp] 0 active kernels
[I 10:13:43.955 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=255a5c0c9af337a1c2187feb63f1c426fb903e5929a0b2f0
[I 10:13:43.956 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 10:13:43.959 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=255a5c0c9af337a1c2187feb63f1c426fb903e5929a0b2f0
[I 10:13:44.264 NotebookApp] Accepting one-time-token-authenticated connection from ::1
[W 10:13:44.319 NotebookApp] 404 GET /api/kernels/aceb78ee-73e4-4481-9993-63e5ee8f72cb/channels?session_id=AEA3C6B2B0A440FC84FF3BAF5F5CB615 (127.0.0.1): Kernel does not exist: aceb78ee-73e4-4481-9993-63e5ee8f72cb
[W 10:13:44.328 NotebookApp] 404 GET /api/kernels/aceb78ee-73e4-4481-9993-63e5ee8f72cb/channels?session_id=AEA3C6B2B0A440FC84FF3BAF5F5CB615 (127.0.0.1) 20.07ms referer=None
[I 10:13:54.740 NotebookApp] Creating new notebook in /code/python
[I 10:13:55.241 NotebookApp] Kernel started: 45ab2da6-7466-408c-aa5a-98f7db54e711
[W 10:14:00.341 NotebookApp] Replacing stale connection: aceb78ee-73e4-4481-9993-63e5ee8f72cb:AEA3C6B2B0A440FC84FF3BAF5F5CB615
Process SpawnPoolWorker-2:
Traceback (most recent call last):
  File "C:\Users\ywu\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "C:\Users\ywu\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\ywu\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "C:\Users\ywu\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\queues.py", line 345, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'guess_gender' on <module '__main__' (built-in)>
(the same traceback, interleaved, is repeated for the other SpawnPoolWorker processes, up through SpawnPoolWorker-14)
[W 10:14:15.043 NotebookApp] 404 GET /api/kernels/c1224db6-69c6-470e-b74c-4c7b94fb48fe/channels?session_id=D8DC8A440B044EED8EBCA374EBEAF7C6 (127.0.0.1): Kernel does not exist: c1224db6-69c6-470e-b74c-4c7b94fb48fe
[W 10:14:15.046 NotebookApp] 404 GET /api/kernels/c1224db6-69c6-470e-b74c-4c7b94fb48fe/channels?session_id=D8DC8A440B044EED8EBCA374EBEAF7C6 (127.0.0.1) 7.48ms referer=None
I got multiprocessing to work from within a Jupyter notebook on Windows by saving my function in a separate .py file and including that file in my notebook.
Example:
f.py:
def f(name, output):
    output.put('hello {0}'.format(name))
    return
Code in the Jupyter notebook:
from multiprocessing import Process, Queue

# Having the function definition here results in
# AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>
# The solution seems to be importing the function from a separate file.
import f

# Also, the original version of f only had a print statement in it.
# That doesn't work with Process - in the sense that it prints to the console
# instead of the notebook.
# The trick is to let f write the string to print into an output queue.
# When the Process is done, the result is retrieved from the queue and printed.

if __name__ == '__main__':
    # Define an output queue
    output = Queue()
    # Set up the process that we want to run
    p = Process(target=f.f, args=('Bob', output))
    # Run the process
    p.start()
    # Exit the completed process
    p.join()
    # Get the process result from the output queue
    result = output.get()
    print(result)
I'm a Python newbie and I may have missed all sorts of details, but this works for me.
After much research, it appears that multiprocessing is not an option in a notebook on Windows. I am closing this, but please reopen if you have a solution. I will switch over to pathos.
This problem is a headache for people using Jupyter on Windows; the same code runs fine on a Linux system.
To run the code on Windows:
Put the function definition in a separate ipynb file.
Import the file using from ipynb.fs.full.functions import func (make sure you pip install ipynb first).
This should solve it; a short sketch follows.
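For illustration, a minimal sketch of that workaround, assuming a helper notebook named functions.ipynb that defines func in a code cell (the names here are hypothetical):

# functions.ipynb would contain a cell with:
#     def func(name):
#         return name.title()

# Main notebook cell:
from multiprocessing import Pool
from ipynb.fs.full.functions import func  # requires: pip install ipynb

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.map(func, ['john', 'ashley']))  # ['John', 'Ashley']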
How about this:
Code:
#!/usr/bin/env python3
import sys
import time
import gender_guesser.detector as gender
import pandas as pd
import multiprocessing as mp

d = gender.Detector()

def guess_gender(name):
    n = name.title()
    g = d.get_gender(n)
    return g

def run():
    ls = ['john', 'joe', 'amamda', 'derick', 'peter', 'ashley', 'john',
          'joe', 'amamda', 'derick', 'peter', 'ashley']
    num_cpus = mp.cpu_count() - 1
    pool = mp.Pool(processes=num_cpus)
    result = pool.map(guess_gender, ls)
    df = pd.DataFrame(result, columns=["gender"])
    print("\ntook {} secs to classify\n".format(str(time.time() - st)))
    print(df)  # or you could save the dataframe using .to_csv()

st = time.time()

if __name__ == "__main__":
    run()
Output:
took 0.0150408744812 secs to classify
           gender
0            male
1            male
2         unknown
3            male
4            male
5   mostly_female
6            male
7            male
8         unknown
9            male
10           male
11  mostly_female
I'm using the ProcessPoolExecutor from concurrent.futures to distribute a task across a number of processes.
The processes return results, which I collect into a list in the main process. However, I get an InvalidStateError (and a BrokenProcessPool error) when iterating over these results, and I don't understand how to avoid this.
Here's the relevant code:
from concurrent.futures import ProcessPoolExecutor as Pool  # requires python 3.8
# ...
with Pool() as pool:
    result = pool.map(self.run_sample, dataset)
    # This is the line that seems to cause the error:
    for i, sample in enumerate(result):
        # ...
# ...
def run_sample(self, sample: DataSample):
    # Function run in separate processes
    # Do something with sample
    # ...
    return sample
When I iterate over that list of results, I sometimes (i.e. every ~30 000 samples or so) get the following error. Note that the error seems to be caused by the iteration in for i, sample in enumerate(result):
Exception in thread QueueManagerThread:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 394, in _queue_management_worker
    work_item.future.set_exception(bpe)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 547, in set_exception
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7f40f279b250 state=cancelled>

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Traceback (most recent call last):
  File "src/run_nonrigid_displacement.py", line 78, in <module>
    pipeline.run( dataset )
  File "/home/me/Projects/Deformation/nonrigid-data-generation-pipeline2/src/core/pipeline.py", line 170, in run
    for i, sample in enumerate(result):
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
How should I (safely and cleanly) aggregate and process results from the ProcessPoolExecutor?
I am using Python 3.8, and pip list | grep future returns "0.18.2".
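One possible approach, sketched under the assumption that a per-task failure is the root cause: submit the samples individually and collect them with as_completed(), so each task's exception can be caught where it is raised. run_sample here is a stand-in for the method in the question:

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_sample(sample):
    # stand-in for the real per-sample processing
    return sample

if __name__ == '__main__':
    dataset = range(100)
    results = []
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(run_sample, s) for s in dataset]
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # a task that died (e.g. in a killed worker) surfaces here
                print('sample failed:', exc)
    print(len(results))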
I've started using Python and Netmiko to connect to Cisco devices and scrape CLI commands into data. This is working great, but I need to bring down the time scripts take to complete when there is a large number of devices.
I can get multiprocessing to work as long as I just print out the output in the function I'm using. But when I try to use a shared list with multiprocessing, it fails.
For this example, I am generating a list of dictionaries called "device_list" in what I call "devices.py":
device_list = []
ip = 211
for x in range(5):
    device_list.append({'host': '192.168.0.' + str(ip),
                        'username': 'myusername',
                        'password': 'mypassword',
                        'device_type': 'cisco_ios'})
    ip = ip + 1
This list of dictionaries will be used by the Netmiko module to connect to the devices in my test lab.
Here is the main program:
import multiprocessing as mp
from time import time
from netmiko import Netmiko
from devices import device_list

def connect_to_dev(output, **device):
    net_connect = Netmiko(**device)
    show_ver = net_connect.send_command('show version', use_textfsm=True)
    net_connect.disconnect()
    hostname = show_ver[0]['hostname']
    image = show_ver[0]['running_image'][28:62]
    result = '{}\nHostname: {}, version {}'.format(line, hostname, image)
    output.append(result)

if __name__ == '__main__':
    line = '*'*70
    start = time()
    mgr = mp.Manager()
    output = mgr.list()
    processes = []
    for device in device_list:
        p = mp.Process(target=connect_to_dev, args=output, kwargs=device)
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    end = time()
    runtime = round((end - start), 2)
    print(output)
    print('{}\nTime elapsed: {} seconds\n{}'.format(line, runtime, line))
I am trying to get each "result" generated by the "connect_to_dev" function appended to the "output" list. I'm getting this error:
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: connect_to_dev() missing 1 required positional argument: 'output'
(the same traceback is repeated for Process-3 through Process-6)
[]
**********************************************************************
Time elapsed: 0.05 seconds
**********************************************************************
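Note that mp.Process expects args to be a tuple of positional arguments; args=output unpacks the (empty) managed list into zero arguments, which matches the "missing 1 required positional argument" error above. A minimal sketch of the 1-tuple form, with a hypothetical worker in place of the Netmiko code:

import multiprocessing as mp

def worker(output, **device):
    # hypothetical stand-in for connect_to_dev()
    output.append(device.get('host'))

if __name__ == '__main__':
    mgr = mp.Manager()
    output = mgr.list()
    device = {'host': '192.168.0.211'}
    # note the trailing comma: (output,) is a 1-tuple
    p = mp.Process(target=worker, args=(output,), kwargs=device)
    p.start()
    p.join()
    print(list(output))  # ['192.168.0.211']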
I have a list of 80,000 strings that I am running through a discourse parser, and in order to increase the speed of this process I have been trying to use the Python multiprocessing package.
The parser code requires Python 2.7, and I am currently running it on a 2-core Ubuntu machine using a subset of the strings. For short lists, e.g. 20 strings, the process runs without an issue on both cores, but if I run a list of about 100 strings, both workers will freeze at different points (so in some cases worker 1 won't stop until a few minutes after worker 2). This happens before all the strings are finished and anything is returned. Each time, the cores stop at the same point when the same mapping function is used, but these points are different if I try a different mapping function, e.g. map vs map_async vs imap.
I have tried removing the strings at those indices, which did not have any effect, and those strings run fine in a shorter list. Based on print statements I included, when the process appears to freeze, the current iteration seems to finish for the current string and it just does not move on to the next string. It takes about an hour of run time to reach the spot where both workers freeze, and I have not been able to reproduce the issue in less time. The code involving the multiprocessing commands is:
import sys
import csv
import multiprocessing
import pandas as pd

def main(initial_file, chunksize=2):
    entered_file = pd.read_csv(initial_file)
    entered_file = entered_file.ix[:, 0].tolist()
    pool = multiprocessing.Pool()
    result = pool.map_async(discourse_process, entered_file, chunksize=chunksize)
    pool.close()
    pool.join()
    with open("final_results.csv", 'w') as file:
        writer = csv.writer(file)
        for listitem in result.get():
            writer.writerow([listitem[0], listitem[1]])

if __name__ == '__main__':
    main(sys.argv[1])
When I stop the process with Ctrl-C (which does not always work), the error message I receive is:
^CTraceback (most recent call last):
  File "Combined_Script.py", line 94, in <module>
    main(sys.argv[1])
  File "Combined_Script.py", line 85, in main
    pool.join()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 474, in join
    p.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 117, in worker
    put((job, i, result))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
    wacquire()
KeyboardInterrupt
^CProcess PoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 117, in worker
    put((job, i, result))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 392, in put
    return send(obj)
KeyboardInterrupt
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 305, in _exit_function
    _run_finalizers(0)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
    finalizer()
  File "/usr/lib/python2.7/multiprocessing/util.py", line 207, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 500, in _terminate_pool
    outqueue.put(None)  # sentinel
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
    wacquire()
KeyboardInterrupt
Error in sys.exitfunc:
(the same traceback as in atexit._run_exitfuncs, ending in KeyboardInterrupt)
When I look at the memory in another command window using htop, memory is at <3% once the workers freeze. This is my first attempt at parallel processing, and I am not sure what else I might be missing.
I was not able to solve the issue with the multiprocessing pool; however, I came across the loky package and was able to use it to run my code with the following lines:
executor = loky.get_reusable_executor(timeout = 200, kill_workers = True)
results = executor.map(discourse_process, entered_file)
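For context, a self-contained sketch of that loky pattern; discourse_process here is a stand-in for the real parser:

import loky

def discourse_process(text):
    # stand-in for the real discourse parsing
    return (text, len(text))

if __name__ == '__main__':
    entered_file = ['first string', 'second string']
    executor = loky.get_reusable_executor(timeout=200, kill_workers=True)
    results = list(executor.map(discourse_process, entered_file))
    print(results)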
You could set a timeout for your process to return a result; otherwise it raises an error:
import multiprocessing

try:
    # result is the AsyncResult returned by pool.map_async
    result.get(timeout=1)
except multiprocessing.TimeoutError:
    print("Error while retrieving the result")
You could also verify whether your process was successful with:
import time

while True:
    try:
        result.successful()  # raises if the result is not ready yet
        break
    except Exception:
        print("Result is not yet successful")
        time.sleep(1)
Finally, checking out https://docs.python.org/2/library/multiprocessing.html is helpful.
I have the following code:
from multiprocessing import Process, Manager, Event

manager = Manager()
shared_Queue = manager.Queue(10)
ev = Event()

def do_this(shared_queue, ev):
    while not ev.is_set():
        if not shared_queue.empty():
            item = shared_queue.get()
            print item
    print 'released!'

subprocs = []
for i in xrange(10):
    subproc = Process(target=do_this, args=(shared_Queue, ev, ))
    subprocs.append(subproc)
    subproc.start()
Now, if I run this and ask whether these processes are alive:
for subproc in subprocs: print subproc.is_alive()
of course I get all Trues.
After doing a couple of these (note: there is no error if I don't do this!):
shared_Queue.put(3)
shared_Queue.put(5)
Now I want to set the Event to kill all of them using:
ev.set()
But then, instead of seeing 'released!' 10 times, I get a varying number of these prints, and after about 2 to 5 seconds I get a bunch of errors:
released!
released!
released!
released!
released!
released!
released!
Process Process-10:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "<input>", line 10, in do_this
  File "<string>", line 2, in get
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
EOFError
(the same traceback is repeated for Process-5 and Process-7)
Why is it that some processes are unable to recognize the Event set and show up as errors later? Is there a better way to signal them to die?
Thanks for the comment, stovfl; you are right that ev.set() does not kill anything, I was using the word carelessly.
As for the issue I was having, I learned that the multiprocessing Queue is process and thread safe, meaning my process will block just before writing something into the Queue if the Queue is already full.
If I try to set the event while some of the processes are still waiting to write something to the full Queue, they will not recognize the event set.
The key was to empty the Queue, let the blocked processes finish writing to it, and get back to the first line of the loop, where they can check on the event! A sketch of that shutdown order follows.
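Not from the original post, but a minimal sketch of that shutdown order, assuming the shared_Queue, ev, and subprocs names from the code above (Python 2, like the question):

from Queue import Empty  # in Python 3: from queue import Empty

# 1. Drain the queue so no process is stuck blocked on a full queue
while True:
    try:
        shared_Queue.get_nowait()
    except Empty:
        break

# 2. Every worker can now loop back to the top of its loop and see the event
ev.set()

# 3. Wait for the workers to print 'released!' and exit cleanly
for subproc in subprocs:
    subproc.join()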