We have been working with Celery for the last year, with ~15 workers, each configured with a concurrency between 1 and 4.
Recently we upgraded Celery from v3.1 to v4.1.
Now we are seeing the following error in every worker's log. Any idea what can cause such an error?
2017-08-21 18:33:19,780 94794 ERROR Control command error: error(104, 'Connection reset by peer') [file: pidbox.py, line: 46]
Traceback (most recent call last):
File "/srv/dy/venv/lib/python2.7/site-packages/celery/worker/pidbox.py", line 42, in on_message
self.node.handle_message(body, message)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 129, in handle_message
return self.dispatch(**body)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 112, in dispatch
ticket=ticket)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 135, in reply
serializer=self.mailbox.serializer)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 265, in _publish_reply
**opts
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/channel.py", line 1748, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 64, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 178, in write_frame
write(view[:offset])
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/transport.py", line 272, in write
self._write(s)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
BTW, our tasks are of the form:

@app.task(name='EXAMPLE_TASK',
          bind=True,
          base=ConnectionHolderTask)
def example_task(self, arg1, arg2, **kwargs):
    # task code
We are also having massive issues with celery... I spend 20% of my time just dancing around weird idle-hang/crash issues with our workers sigh
We had a similar case that was caused by high concurrency combined with a high worker_prefetch_multiplier; as it turns out, fetching thousands of tasks is a good way to frack the connection.
If that's not the case, try disabling the broker pool by setting broker_pool_limit to None.
Just some quick ideas that might (hopefully) help :-)
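For reference, a minimal config sketch showing both settings mentioned above, assuming Celery 4.x lowercase setting names (the app name and broker URL are placeholders, not from the question):

from celery import Celery

app = Celery('myapp', broker='amqp://guest@localhost//')  # placeholder broker URL

app.conf.update(
    # Limit how many messages each worker process prefetches at a time;
    # the default multiplier is 4, and a high value combined with high
    # concurrency means thousands of tasks held on a single connection.
    worker_prefetch_multiplier=1,
    # Disable the broker connection pool so connections are established
    # and closed on demand instead of being reused from a pool.
    broker_pool_limit=None,
)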
Related
In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does "after timeout" indicate that the task has taken too long, or is there some other timeout triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly what code failed. Ideally it points to a line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
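One concrete step from that guide, as a minimal sketch (the toy bag computation stands in for your upload pipeline and is an assumption, not your code): rerun the failing computation with the single-threaded scheduler so exceptions are raised locally with an ordinary Python traceback and pdb can be used.

import dask
import dask.bag as db

def upload(buf):
    # stand-in for the real upload-to-cloud-store step
    return len(buf)

# Bypass the distributed scheduler entirely; everything runs in this
# process, so a failure raises right here instead of on a worker.
with dask.config.set(scheduler='synchronous'):
    bag = db.from_sequence([b'abc', b'defgh'], npartitions=2)
    print(bag.map(upload).compute())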
I am using celery to run tasks one by one with a Redis broker, but when I run 2 tasks, then after the first one completes, Redis raises a socket timeout error for the second task, so the second task fails.
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/result.py", line 194, in get
on_message=on_message,
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/backends/async.py", line 189, in wait_for_pending
for _ in self._wait_for_pending(result, **kwargs):
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/backends/async.py", line 256, in _wait_for_pending
on_interval=on_interval):
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/backends/async.py", line 57, in drain_events_until
yield self.wait_for(p, wait, timeout=1)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/backends/async.py", line 66, in wait_for
wait(timeout=timeout)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/celery/backends/redis.py", line 69, in drain_events
m = self._pubsub.get_message(timeout=timeout)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/client.py", line 2513, in get_message
response = self.parse_response(block=False, timeout=timeout)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/client.py", line 2430, in parse_response
return self._execute(connection, connection.read_response)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/client.py", line 2408, in _execute
return command(*args)
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/home/ubuntu/.virtualenvs/aide_venv/local/lib/python2.7/site-packages/redis/connection.py", line 187, in _read_from_socket
raise TimeoutError("Timeout reading from socket")
TimeoutError: Timeout reading from socket
I am running celery with this command:
celery -A flask_application.celery worker --loglevel=info --max-tasks-per-child=1 --concurrency=1
I am calling the celery task using the .delay() function:
celery_response = run_algo.run_pipeline.delay(request.get_json())
Getting the output using the .get() function:
output_file_path = celery_response.get()
There's a warning in the docs of AsyncResult.get saying that calling it within an async task can cause a deadlock, which may be what's happening here, though it's hard to tell without more context about where these things are being called.
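If the .get() call does turn out to live inside another task, one way to avoid the deadlock is to chain the follow-up work instead of blocking on the result. A minimal sketch, where post_process and the Redis URLs are assumptions rather than anything from the question:

from celery import Celery, chain

app = Celery('flask_application',
             broker='redis://localhost:6379/0',   # assumed broker URL
             backend='redis://localhost:6379/0')  # assumed result backend

@app.task
def run_pipeline(payload):
    # stand-in for the real pipeline task
    return {'status': 'done', 'input': payload}

@app.task
def post_process(result):
    # hypothetical follow-up step that would otherwise call .get()
    return result

# Instead of run_pipeline.delay(...).get() inside a worker (which can
# deadlock, especially with --concurrency=1), hand the result to the
# next task asynchronously:
chain(run_pipeline.s({'key': 'value'}), post_process.s()).apply_async()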
After my celery service has been running for 7-10 days, I receive this exception out of nowhere, and it causes my tasks not to be processed. Restarting celery fixes the problem.
INTERNAL ERROR: RuntimeError('Acquire on closed pool',)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 253, in trace_task
I, R, state, retval = on_error(task_request, exc, uuid)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 201, in on_error
R = I.handle_error_state(task, eager=eager)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 85, in handle_error_state
}[self.state](task, store_errors=store_errors)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 118, in handle_failure
req.id, exc, einfo.traceback, request=req,
File "/usr/lib/python2.7/dist-packages/celery/backends/base.py", line 121, in mark_as_failure
traceback=traceback, request=request)
File "/usr/lib/python2.7/dist-packages/celery/backends/amqp.py", line 124, in store_result
with self.app.amqp.producer_pool.acquire(block=True) as producer:
File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 868, in acquire
R = self.prepare(R)
File "/usr/lib/python2.7/dist-packages/kombu/pools.py", line 63, in prepare
conn = self._acquire_connection()
File "/usr/lib/python2.7/dist-packages/kombu/pools.py", line 38, in _acquire_connection
return self.connections.acquire(block=True)
File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 859, in acquire
raise RuntimeError('Acquire on closed pool')
RuntimeError: Acquire on closed pool
Software versions
software -> celery:3.1.20 (Cipater) kombu:3.0.35 py:2.7.6
billiard:3.3.0.22 py-amqp:1.4.9
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqp results:amqp
CELERY_ACCEPT_CONTENT: ['json', 'pickle', 'yaml']
CELERY_ENABLE_UTC: True
CELERY_IGNORE_RESULT: False
CELERY_IMPORTS:
('catalogue.app.voice.cluster.deploy_cluster',
'catalogue.app.common.install_uc',
'hypervisor.app.deploy_esx',
'hypervisor.app.vm_operations',
'tools.deploy_tools')
CELERYD_CHDIR: '/usr/local/src/imbue/application/app'
CELERY_TASK_RESULT_EXPIRES: 18000
CELERY_RESULT_PERSISTENT: True
CELERY_TIMEZONE: 'US/Eastern'
BROKER_URL: 'amqp://******:********#rabbitmq:5672//'
CELERY_RESULT_BACKEND: 'amqp'
Only workaround now is to restart.
Ubuntu 14.04 2 GB RAM/2 CPU/40 GB HDD
This looks like a bug in celery. Asksol fixed this a few days back.
You can install celery from source and try it. If it is still causing problems, please create a new issue on GitHub.
I recently upgraded to Celery 3.0.1 from 2.3.0 and all the tasks run fine. Unfortunately, I'm getting a "Framing Error" exception pretty frequently. I'm also running supervisor to restart the threads, but since these are never really killed, supervisor has no way of knowing that celery needs to be restarted. Has anyone seen this before?
[2012-07-13 18:53:59,004: ERROR/MainProcess] Unrecoverable error: Exception('Framing Error, received 0x00 while expecting 0xce',)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/__init__.py", line 350, in start
component.start()
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 360, in start
self.consume_messages()
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 445, in consume_messages
drain_nowait()
File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 175, in drain_nowait
self.drain_events(timeout=0)
File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 171, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/amqplib.py", line 262, in drain_events
return connection.drain_events(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/amqplib.py", line 97, in drain_events
chanmap, None, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/amqplib.py", line 155, in _wait_multiple
channel, method_sig, args, content = read_timeout(timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/amqplib.py", line 129, in read_timeout
return self.method_reader.read_method()
File "/usr/local/lib/python2.7/dist-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
raise m
Exception: Framing Error, received 0x00 while expecting 0xce
While I am not sure why this actually happens, switching from amqplib to librabbitmq helped me overcome the problem.
I haven't changed anything in the configuration, just:
pip uninstall amqplib
pip install librabbitmq
And restarted celery workers.
Got this idea from https://github.com/celery/celery/issues/922
I am using xmlrpc to contact a local server. On the client side, the following socket timeout error sometimes happens, and it's not consistent.
Why is it happening? What could be the reason for the socket timeout?
<class 'socket.timeout'>: timed out
args = ('timed out',)
errno = None
filename = None
message = 'timed out'
strerror = None
The traceback on the server side is as follows:
Exception happened during processing of request from ('127.0.0.1', 34855)
Traceback (most recent call last):
File "/usr/lib/python2.4/SocketServer.py", line 222, in handle_request
self.process_request(request, client_address)
File "/usr/lib/python2.4/SocketServer.py", line 241, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python2.4/SocketServer.py", line 254, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python2.4/SocketServer.py", line 521, in __init__
self.handle()
File "/usr/lib/python2.4/BaseHTTPServer.py", line 314, in handle
self.handle_one_request()
File "/usr/lib/python2.4/BaseHTTPServer.py", line 308, in handle_one_request
method()
File "/usr/lib/python2.4/SimpleXMLRPCServer.py", line 441, in do_POST
self.send_response(200)
File "/usr/lib/python2.4/BaseHTTPServer.py", line 367, in send_response
self.send_header('Server', self.version_string())
File "/usr/lib/python2.4/BaseHTTPServer.py", line 373, in send_header
self.wfile.write("%s: %s\r\n" % (keyword, value))
File "/usr/lib/python2.4/socket.py", line 256, in write
self.flush()
File "/usr/lib/python2.4/socket.py", line 243, in flush
self._sock.sendall(buffer)
error: (32, 'Broken pipe')
I killed the server and restarted it. It's working fine now.
What could be the reason?
My machine's RAM was filled up by a process last night and came back to normal this morning.
Could this error be caused by some swapping of processes?
It looks like the client socket is timing out while waiting for the server to respond. Is it possible that your server sometimes takes a long time to respond? Also, if the server is causing the machine to go into swap, that would slow it down enough to make a timeout possible.
If I remember right, a socket timeout is not set by xmlrpc in Python. Are you calling socket.setdefaulttimeout somewhere in your code?
If it is expected that your server will take time once in a while, you could set a higher timeout value using the above.
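A minimal sketch of raising the client-side default timeout (the server URL and the 300-second value are assumptions):

import socket
import xmlrpclib  # Python 2; use xmlrpc.client on Python 3

# ServerProxy in Python 2's xmlrpclib takes no timeout argument, so the
# process-wide default socket timeout is the simplest knob to turn.
socket.setdefaulttimeout(300)

proxy = xmlrpclib.ServerProxy('http://127.0.0.1:8000')  # assumed local server
print(proxy.some_method())  # hypothetical remote method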
HTH