Apologies if the title is not descriptive; I struggled to find a good title for this question.
My question involves Python, CouchDB (to a lesser degree), multiprocessing, and networking. It started when I was trying to debug a co-worker's program that uses Python's multiprocessing module to parallelize requests to a CouchDB database via couchdb-python. I created a minimal program to exhibit the bug and eventually solved the issue, but the solution raised another question that I have not been able to answer. I'm hoping experts on SO can help me with this, so here it goes.
The premise of the problem is pretty simple. We have n resources, all of which can be retrieved concurrently. Instead of making n serial requests, my co-worker is using the multiprocessing module to fetch all n resources in parallel. Here's a program I wrote to demonstrate the issue:
The Script (bug.py)
import couchdb
import multiprocessing

SERVER = 'http://localhost:5984'  # CouchDB's default port; proxied later in the post

server = couchdb.Server(SERVER)
try:
    database = server.create('test')
except:
    server.delete('test')
    database = server.create('test')

database.save({'_id': '1', 'type': 'dog', 'name': 'chase'})
database.save({'_id': '2', 'type': 'dog', 'name': 'rubble'})
database.save({'_id': '3', 'type': 'cat', 'name': 'kali'})

def query_id(id):
    print(dict(database[id]))

def main():
    args = [
        ['dog', 'chase'],
        ['dog', 'rubble'],
        ['cat', 'kali'],
    ]
    print('-' * 80)
    processes = []
    for id_ in ['1', '2', '3']:
        proc = multiprocessing.Process(target=query_id, args=(id_,))
        processes.append(proc)
        proc.start()
    for proc in processes:
        proc.join()

if __name__ == '__main__':
    main()
Pretty innocent code, right? Well, running it against the latest CouchDB and couchdb-python gives the following error:
The output
--------------------------------------------------------------------------------
Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
File "bug.py", line 25, in query_id
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
return Document(data)
TypeError: 'ResponseBody' object is not iterable
return Document(data)
TypeError: 'ResponseBody' object is not iterable
After some digging, I finally found out that couchdb-python's implementation of ConnectionPool is not multiprocess-safe. See this PR for more details. Basically, all processes share the same ConnectionPool object and are handed the same httplib.HTTPConnection object, and when they all try to read from the connection simultaneously, the string being returned is garbled, hence the bug. You can see evidence of this if you put print(os.getpid(), line) inside the httplib.HTTPResponse._read_status method. Here's a sample output after the print statement is added:
(26490, 'TP1.120 O\r\n')
(26489, 'T/ 0KServer: CouchDB/1.6.1 (Erlang OTP/17)\r\n')
Process Process-2:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
self._target(*self._args, **self._kwargs)
File "bug.py", line 25, in query_id
print(dict(database[id]))
File "/home/kevin/src/couchdb-python/couchdb/client.py", line 418, in __getitem__
return Document(data)
TypeError: 'ResponseBody' object is not iterable
return Document(data)
TypeError: 'ResponseBody' object is not iterable
As seen here, the first lines read by the sub-processes are only partial, indicating a race condition. If I further inspect the HTTPConnection object, I can see that all three processes share the same connection object, the same socket to the server, and the same file descriptor from the socket that's used for reading.
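As an aside, one way to sidestep the shared pool entirely (a minimal sketch, not the fix in the PR above) is to construct the couchdb.Server inside each child process, so every process gets its own ConnectionPool and its own socket:

import couchdb
import multiprocessing

SERVER = 'http://localhost:5984'  # same CouchDB URL as in bug.py

def query_id(id_):
    # Creating the Server inside the child means each process builds its own
    # ConnectionPool and HTTPConnection, so there is no shared socket to race on.
    server = couchdb.Server(SERVER)
    database = server['test']
    print(dict(database[id_]))

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=query_id, args=(id_,))
                 for id_ in ['1', '2', '3']]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()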
Puzzle
So far so good. I've identified the root cause of the problem and put together a fix. However, a complication arises when I put the CouchDB instance behind a reverse proxy. In this case, I'm using haproxy. Here's a sample config:
global
    ...
defaults
    ...
listen couchdb
    bind *:9999
    mode http
    stats enable
    option httpclose
    option forwardfor
    server couchdb-1 127.0.0.1:5984 check
I pointed the CouchDB server URL to http://localhost:9999 in the bug script, reran it, and everything was fine! I also inspected the connection object, the socket, and the file descriptor, and they too were shared among all processes.
This got me puzzled, so I brought up mitmproxy and inspected what's going on in the two cases: with and without haproxy.
Without haproxy
When the parallel requests are made without haproxy, the mitmproxy details tab (shown for a single request; the timing sequence is the same for all 3 concurrent requests) shows an event sequence that suggests a blocking, synchronous request.
With haproxy
With haproxy, the sequence is different: the request is considered complete before the server connection is even initiated.
Question
I'm not used to working at this low level, so I know my knowledge here is pretty lacking. I want to understand what difference putting haproxy in front of CouchDB made that sidestepped the multiprocessing bug in couchdb-python. haproxy is event-based, so I suspect that has something to do with it, but I would really appreciate someone explaining the difference!
Thanks a bunch in advance!
I know this question looks like a duplicate, but the existing questions didn't apply to my case. I don't have any file named multiprocess.pool in my directory, and I still get this error when trying to run the traffic generator.
Traceback (most recent call last):
File "run.py", line 1, in <module>
import generator
File "/home/alaa/synthetic_traffic_generator-master/synthetic_traffic_generator-master/generator.py", line 13, in <module>
from multiprocessing.pool import Pool, Process
ImportError: cannot import name 'Process' from 'multiprocessing.pool' (/usr/lib/python3.8/multiprocessing/pool.py)
This is the part of the code where it uses Process:
def generate_synthethic_users_and_traffic(number_of_users=NUMBER_OF_SYNTHETIC_USERS):
    # Makes the random numbers predictable, to make the experiment reproducible.
    seed(1234)
    user_generator = UserDistribution(number_of_users)
    # generate_and_write_synthetic_traffic is a very light method, thus it
    # is not necessary to create a pool of processes and join them later on
    for user in user_generator.users():
        Process(target=user.generate_and_write_synthetic_traffic).start()
I believe this part needs to be updated, but I have no idea how. Any help with this issue is appreciated.
Thanks in advance.
EDIT:
I followed the first answer and now the error changed to this:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/alaa/synthetic_traffic_generator-master/synthetic_traffic_generator-master/generator.py", line 348, in generate_and_write_synthetic_traffic
self.generate_synthetic_traffic()
File "/home/alaa/synthetic_traffic_generator-master/synthetic_traffic_generator-master/generator.py", line 265, in generate_synthetic_traffic
for hour, traffic_model in self.traffic_model_per_hour.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'
EDIT 2:
I followed this question to solve the second issue and now it works.
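For anyone hitting the same error: the change amounts to replacing the removed Python 2 method dict.iteritems() with dict.items(), roughly like this (placeholder dict shown, not the generator's real data):

# dict.iteritems() was removed in Python 3; dict.items() is the replacement.
traffic_model_per_hour = {0: 'model_a', 1: 'model_b'}  # placeholder values

for hour, traffic_model in traffic_model_per_hour.items():
    print(hour, traffic_model)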
There is no multiprocessing.pool.Process. The GitHub repository you are following is 7 years old and hasn't been updated since, so it isn't compatible with the current version of Python; errors like this are to be expected. But:
You can try replacing the import on line 13 of generator.py, which is from multiprocessing.pool import Pool, Process. Delete that line and add the following:
from multiprocessing import Pool, Process
Hey guys, I'm working on a prototype for a project at my school (I'm a research assistant, so this isn't a graded project). I'm running Celery on a server cluster (with 48 workers/cores), which is already set up and working. In a nutshell, we want to use Celery for some number crunching over a rather large number of files/tasks.
Because of this, it is very important that we save results to an actual file: we have gigs upon gigs of data, and it WON'T fit in RAM with the traditional task queue/result backend.
Anyways...
My prototype (with a trivial multiplication task):
task.py
from celery import Celery

app = Celery()

@app.task
def mult(x, y):
    return x * y
And this works great when I execute: $ celery worker -A task -l info
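For context, the way I'd then call the task from another Python session is roughly this (and retrieving the result with .get() is exactly the part that needs a result backend):

from task import mult

# Send the task to a worker; this returns immediately with an AsyncResult.
result = mult.delay(6, 7)

# Blocks until the worker finishes; this only works once a result backend
# is configured, which is what this question is about.
print(result.get(timeout=10))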
But if I try and add a new backend:
from celery import Celery

app = Celery()
app.conf.update(CELERY_RESULT_BACKEND='file://~/Documents/results')

@app.task
def mult(x, y):
    return x * y
I get a rather large error:
[2017-08-04 13:22:18,133: CRITICAL/MainProcess] Unrecoverable error:
AttributeError("'NoneType' object has no attribute 'encode'",)
Traceback (most recent call last):
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/kombu/utils/objects.py", line 42, in __get__
return obj.__dict__[self.__name__]
KeyError: 'backend'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/bartolucci/anaconda3/lib/python3.6/site- packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/bootsteps.py", line 115, in start
self.on_start()
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/apps/worker.py", line 143, in on_start
self.emit_banner()
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/apps/worker.py", line 158, in emit_banner
' \n', self.startup_info(artlines=not use_image))),
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/apps/worker.py", line 221, in startup_info
results=self.app.backend.as_uri(),
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/kombu/utils/objects.py", line 44, in __get__
value = obj.__dict__[self.__name__] = self.__get(obj)
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/app/base.py", line 1183, in backend
return self._get_backend()
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/app/base.py", line 902, in _get_backend
return backend(app=self, url=url)
File "/home/bartolucci/anaconda3/lib/python3.6/site-packages/celery/backends/filesystem.py", line 45, in __init__
self.path = path.encode(encoding)
AttributeError: 'NoneType' object has no attribute 'encode'
I am only 2 days into this project and have never worked with Celery (or a similar library) before (I come from the algorithmic, mathy side of the fence). I'm currently wrangling with Celery's user guide docs, but they're honestly pretty sparse on this detail.
Any help is much appreciated and thank you.
Looking at the Celery code for the filesystem-backed result backend here:
https://github.com/celery/celery/blob/master/celery/backends/filesystem.py#L54
Your path needs to start with file:/// (three slashes).
Your settings have it starting with file:// (two slashes).
You might also want to use the absolute path instead of the ~.
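Putting both points together, the backend setting might look like this (a sketch; the absolute results path is an assumption for your machine, and the directory should already exist):

from celery import Celery

app = Celery()

# Three slashes after "file:", followed by an absolute path (no "~").
# '/home/bartolucci/Documents/results' is assumed here; adjust to your setup.
app.conf.update(CELERY_RESULT_BACKEND='file:///home/bartolucci/Documents/results')

@app.task
def mult(x, y):
    return x * y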
I want to iterate over a list with two functions using multiprocessing: one function iterates over the sample list (g) from the front and the other from the back. Each time a function takes an element from the sample list, it should append it to the main list, until one of them finds an element that is already in the main list; at that point I want to terminate both processes and have each return the elements it has seen.
I expect the first process to return:
['a', 'b', 'c', 'd', 'e', 'f']
And the second to return:
['l', 'k', 'j', 'i', 'h', 'g']
This is my code, which raises an error:
from multiprocessing import Process, Manager

manager = Manager()
d = manager.list()

# Fn definitions and such
def a(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'a'
        if i in main_path:
            return l
        main_path.append(i)

def b(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'b'
        if i in main_path:
            return l
        main_path.append(i)

g = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
g2 = g[::-1]

p1 = Process(target=a, args=(d, g))
p2 = Process(target=b, args=(d, g2))
p1.start()
p2.start()
And this is the Traceback:
a
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/bluebird/Desktop/persiantext.py", line 17, in a
if i in main_path:
File "<string>", line 2, in __contains__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
b
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/bluebird/Desktop/persiantext.py", line 27, in b
if i in main_path:
File "<string>", line 2, in __contains__
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
Note that I have no idea how to terminate both processes once one of them finds a duplicated element!
There are all kinds of other problems in your code, but since I already explained them on your other question, I won't get into them here.
The new problem is that you're not joining your child processes. In your threaded version, this wasn't an issue just because your main thread accidentally had a "block forever" before the end. But here, you don't have that, so the main process reaches the end of the script while the background processes are still running.
When this happens, it's not entirely defined what your code will do.* But basically, you're destroying the manager object, which shuts down the manager server while the background processes are still using it, so they're going to raise exceptions the next time they try to access a managed object.
The solution is to add p1.join() and p2.join() to the end of your script.
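In other words, the tail of the script should look roughly like this:

p1 = Process(target=a, args=(d, g))
p2 = Process(target=b, args=(d, g2))
p1.start()
p2.start()

# Wait for both children to finish before the script exits; otherwise the
# Manager (and its server process) is torn down while they're still using it.
p1.join()
p2.join()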
But that really only gets you back to the same situation as your threaded code (except not blocking forever at the end). You've still got code that's completely serialized, and a big race condition, and so on.
If you're curious why this happens:
At the end of the script, all of your module's globals go out of scope.** Since those variables are the only reference you have to the manager and process objects, those objects get garbage-collected, and their destructors get called.
For a manager object, the destructor shuts down the server.
For a process object, I'm not entirely sure, but I think the destructor does nothing (rather than join it and/or interrupt it). Instead, there's an atexit function, that runs after all of the destructors, that joins any still-running processes.***
So, first the manager goes away, then the main process starts waiting for the children to finish; the next time each one tries to access a managed object, it fails and exits. Once all of them do that, the main process finishes waiting and exits.
* The multiprocessing changes in 3.2 and the shutdown changes in 3.4 make things a lot cleaner, so if we weren't talking about 2.7, there would be less "here's what usually happens but not always" and "here's what happens in one particular implementation on one particular platform".
** This isn't actually guaranteed by 2.7, and garbage-collecting all of the modules' globals doesn't always happen. But in this particular simple case, I'm pretty sure it will always work this way, at least in CPython, although I don't want to try to explain why.
*** That's definitely how it works with threads, at least on CPython 2.7 on Unix… again, this isn't at all documented in 2.x, so you can only tell by reading the source or experimenting on the platforms/implementations/versions that matter to you… And I don't want to track this through the source unless there's likely to be something puzzling or interesting to find.
I've never used the multiprocessing library before, so all advice is welcome.
I've got a Python program that uses the multiprocessing library to do some memory-intensive tasks in multiple processes, and it occasionally runs out of memory (I'm working on optimizations, but that's not what this question is about). Sometimes an out-of-memory error gets thrown in a way that I can't seem to catch (output below), and then the program hangs on pool.join() (I'm using multiprocessing.Pool). How can I make the program do something other than wait indefinitely when this problem occurs?
Ideally, the memory error would be propagated back to the main process, which would then die.
Here's the memory error:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 325, in _handle_workers
pool._maintain_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 229, in _maintain_pool
self._repopulate_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
w.start()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
And here's where I manage the multiprocessing:
mp_pool = mp.Pool(processes=num_processes)
mp_results = list()
for datum in input_data:
    data_args = {
        'value': 0  # actually some other simple dict key/values
    }
    mp_results.append(mp_pool.apply_async(_process_data, args=(common_args, data_args)))
mp_pool.close()
mp_pool.join()  # hangs here when that thread dies..
for result_async in mp_results:
    result = result_async.get()
    # do stuff to collect results
# rest of the code
When I interrupt the hanging program, I get:
Process process_003:
Traceback (most recent call last):
File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/pool.py", line 102, in worker
task = get()
File "/opt/rh/python27/root/usr/lib64/python2.7/multiprocessing/queues.py", line 374, in get
return recv()
racquire()
KeyboardInterrupt
This is actually a known bug in Python's multiprocessing module, fixed in Python 3 (here's a summarizing blog post I found). There's a patch attached to Python issue 22393, but that hasn't been officially applied.
Basically, if one of a multiprocessing pool's sub-processes dies unexpectedly (out of memory, killed externally, etc.), the pool will wait indefinitely.
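Until you can move to Python 3, one possible mitigation (a sketch, assuming a bounded per-task runtime is acceptable; some_task and the 600-second timeout are placeholders, not part of your code) is to skip the bare pool.join() and collect each result with a timeout, so a dead worker surfaces as an exception instead of an indefinite hang:

import multiprocessing as mp

def some_task(x):  # placeholder for the real per-datum work
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    async_results = [pool.apply_async(some_task, args=(x,)) for x in range(10)]
    pool.close()

    collected = []
    for res in async_results:
        try:
            # Raises multiprocessing.TimeoutError if the task never finishes,
            # e.g. because its worker died and was never repopulated.
            collected.append(res.get(timeout=600))
        except mp.TimeoutError:
            pool.terminate()  # give up instead of hanging on join() forever
            raise

    pool.join()
    print(collected)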
In a Django view I spawn a thread (class of threading.Thread) which in turn creates a multiprocessing pool of 5 workers.
Yes, I know using a task queue like Celery is usually the accepted way of doing things, but in this case we needed threads/multiprocessing.
Both the Thread and each of the Multiprocess Workers access items in the database. However, doing any call to a Django Model in the Thread or Worker causes a "django.core.exceptions.AppRegistryNotReady: Models aren't loaded yet" exception.
Here is the full stack trace:
Process SpawnPoolWorker-2:
Traceback (most recent call last):
File "C:\Python34\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\Python34\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Python34\lib\multiprocessing\pool.py", line 108, in worker
task = get()
File "C:\Python34\lib\multiprocessing\queues.py", line 357, in get
return ForkingPickler.loads(res)
File "d:\bryan\Documents\Projects\spreadmodel_3.4_venv\lib\site-packages\djang
o\db\models\fields\__init__.py", line 59, in _load_field
return apps.get_model(app_label, model_name)._meta.get_field_by_name(field_n
ame)[0]
File "d:\bryan\Documents\Projects\spreadmodel_3.4_venv\lib\site-packages\djang
o\apps\registry.py", line 199, in get_model
self.check_models_ready()
File "d:\bryan\Documents\Projects\spreadmodel_3.4_venv\lib\site-packages\djang
o\apps\registry.py", line 131, in check_models_ready
raise AppRegistryNotReady("Models aren't loaded yet.")
django.core.exceptions.AppRegistryNotReady: Models aren't loaded yet.
It is odd to me that I don't see any part of my code in the stack trace.
I've tried calling django.setup() in the Thread's __init__ and at the beginning of the method the Workers run, still with no success.
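For reference, by "at the beginning of the method the Workers run" I mean roughly this kind of sketch (the settings module name and the worker body are placeholders, not my actual code):

import os
import django
from multiprocessing import Pool

def do_work(pk):
    # Run django.setup() (with an assumed settings module name) before
    # touching any models in this spawned worker process.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'SpreadModel.settings')
    django.setup()
    from django.contrib.auth.models import User
    return User.objects.filter(pk=pk).exists()

if __name__ == '__main__':
    pool = Pool(processes=5)
    print(pool.map(do_work, [1, 2, 3]))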
At no point in my models do I try to read anything from the database like the common issue of doing a foreign key to the user model.
EDIT:
I can get the database queries to work in the Thread by putting django.setup() just below the Simulation class definition instead of having it in the __init__ method.
But I'm still having issues with the queries in the Workers.
Edit2:
If I modify Python's queue.Queue file and put the django.setup() call in the get function, everything works great.
However, this is not a valid solution. Any ideas?
Edit3:
If I run the tests inside PyCharm, the test associated with this problem passes. Running the test from the normal command line outside of PyCharm (or serving the view from a server [the Django test server or CherryPy]) results in the above error.
If it helps, here is a link to the views.py on GitHub.
https://github.com/NAVADMC/SpreadModel/blob/b4bbbcf7020a3e4df0d021942ddcc5039234bd88/Results/views.py
For future reference (after we fix the bug), you can see the odd behaviour on commit b4bbbcf7 (linked above).