celery .delay hangs (recent, not an auth problem) - python

I am running Celery 2.2.4/django-celery 2.2.4, using RabbitMQ 2.1.1 as the backend. I recently brought two new Celery servers online -- I had been running 2 workers across two machines with a total of ~18 threads, and on my new souped-up boxes (36 GB RAM + dual hyper-threaded quad-core) I am running 10 workers with 8 threads each, for a total of 180 threads -- my tasks are all pretty small, so this should be fine.
The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here:
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
(20, 41), # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
s = self.sock.recv(65536)
I've checked the RabbitMQ logs, and I see the process trying to connect:
=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569
I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs except that two of the workers can't connect to the broker:
[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...
All of the other nodes are able to connect without issue.
I know there was a similar posting last year ( RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info ), but I'm pretty certain this is different. Could the sheer number of workers be creating some sort of race condition in amqplib? I found this thread which seems to indicate that amqplib is not thread-safe, but I'm not sure whether that matters for Celery.
EDIT: I've tried celeryctl purge on both nodes -- on one it succeeds, but on the other it fails with the following AMQP error:
AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.exceptions.AMQPConnectionException:
(530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX'
with different type, durable or autodelete value", (40, 10), 'Channel.exchange_declare')
On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.
EDIT2: I was able to delete the offending exchange using exchange.delete from camqadm and now the second node hangs too :(.
EDIT3: One thing that also recently changed is that I added an additional vhost to rabbitmq, which my staging node connects to.

Hopefully this will save somebody a lot of time...though it certainly does not save me any embarrassment:
/var was full on the server running RabbitMQ. With all of the nodes I added, RabbitMQ was doing a lot more logging and filled up /var -- nothing could be written to /var/lib/rabbitmq, so no messages were going through.
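If you hit similar symptoms, a quick programmatic check of free space on the broker's data partition can rule this out early. Here is a minimal sketch using os.statvfs (works on Python 2 as well); the path is an assumption -- point it at wherever your RabbitMQ installation keeps its data and logs:
import os

# Hypothetical path; adjust to your RabbitMQ data/log location.
path = "/var/lib/rabbitmq"

st = os.statvfs(path)
free_mb = st.f_bavail * st.f_frsize / (1024.0 * 1024.0)
print("free space on %s: %.1f MB" % (path, free_mb))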

I had the same symptoms but not the same cause; for anyone else who stumbles upon this, mine was solved by https://stackoverflow.com/a/63591450/284164 -- I wasn't importing the celery app at the project level, and .delay() was hanging until I added that.
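For reference, the usual pattern from the Celery documentation is to re-export the app from the project package's __init__.py so it is loaded whenever the project starts; the package and module names below are assumptions -- substitute your own project name:
# proj/__init__.py  (hypothetical project package)
from .celery import app as celery_app

__all__ = ('celery_app',)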

Related

Python Hive Metastore partition timeout

We have ETL jobs in Python (Luigi). They all connect to the Hive Metastore to get partition info.
Code:
from hive_metastore import ThriftHiveMetastore

# `protocol` is a Thrift binary protocol already connected to the metastore (setup omitted)
client = ThriftHiveMetastore.Client(protocol)
partitions = client.get_partition_names('sales', 'salesdetail', -1)
Here -1 is max_parts (the maximum number of partitions returned).
It randomly times out like this:
File "/opt/conda/envs/etl/lib/python2.7/site-packages/luigi/contrib/hive.py", line 210, in _existing_partitions
partition_strings = client.get_partition_names(database, table, -1)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1703, in get_partition_names
return self.recv_get_partition_names()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1716, in recv_get_partition_names
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
buff = self.trans.readAll(4)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz - have)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 159, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 105, in read
buff = self.handle.recv(sz)
timeout: timed out
This error happens occasionally.
There is a 15-minute timeout on the Hive Metastore.
When I investigate by running get_partition_names separately, it returns data within a few seconds.
Even when I set socket.timeout to 1 or 2 seconds, the query completes.
There is no record of a socket-close message in the Hive Metastore logs (cat /var/log/hive/..log.out).
The tables it usually times out on have a large number of partitions (~10K+). But as mentioned before, they only time out randomly, and they return partition metadata quickly when that portion of code is tested alone.
Any ideas why it times out randomly, or how to catch these timeout errors in the metastore logs, or how to fix them?
The issue was thread overlap in Luigi.
We used a singleton to implement a poor man's connection pool, but Luigi's worker threads stepped on each other, causing strange behavior when one thread's get_partition_names call conflicted with another's.
We fixed the issue by ensuring each thread's connection object gets its own key in the connection pool (instead of all threads sharing the process-id key).
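A minimal sketch of that idea, keying the pool on the thread identifier rather than the process id; the host and port are assumptions (9083 is just the usual metastore default):
import threading
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore

_clients = {}
_lock = threading.Lock()

def get_metastore_client(host='metastore-host', port=9083):
    # One client per thread: key the pool on the current thread's ident
    key = threading.current_thread().ident
    with _lock:
        if key not in _clients:
            transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
            transport.open()
            _clients[key] = ThriftHiveMetastore.Client(
                TBinaryProtocol.TBinaryProtocol(transport))
        return _clients[key]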

pjsua.error, error = address already in use

I am trying to make calls using the PJSIP module in Python. To set up the SIP transport, I am doing this:
import pjsua as pj

# `lib` is an initialized pj.Lib() instance; `inputs` holds local settings
trans_cfg = pj.TransportConfig()
# port for VoIP communication
trans_cfg.port = 5060
# local system address
trans_cfg.bound_addr = inputs.client_addr
transport = lib.create_transport(pj.TransportType.UDP, trans_cfg)
When I finish the call, I clear the transport by setting transport = None.
I am able to make calls by running my program, but every time I restart my PC, I get an error when I run my Python program:
File "pjsuatrail_all.py", line 225, in <module>
main()
File "pjsuatrail_all.py", line 169, in main
transport = transport_setup()
File "pjsuatrail_all.py", line 54, in transport_setup
transport = lib.create_transport(pj.TransportType.UDP,trans_cfg)
File "/usr/local/lib/python2.7/dist-packages/pjsua.py", line 2304, in
create_transport
self._err_check("create_transport()", self, err)
File "/usr/local/lib/python2.7/dist-packages/pjsua.py", line 2723, in _err_check
raise Error(op_name, obj, err_code, err_msg)
pjsua.Error: Object: Lib, operation=create_transport(), error=Address already in use
Exception AttributeError: "'NoneType' object has no attribute 'destroy'" in <bound method Lib.__del__ of <pjsua.Lib instance at 0x7f8a4bbb6170>> ignored
Currently I work around this like so:
$sudo lsof -t -i:5060
>> 1137
$sudo kill 1137
Then when I run my code it works fine.
From the error I can tell that somewhere I am not closing my transport properly. Can anyone help in this regard?
Reference code used
From the details you give, this does not look like a problem with the pjsip wrapper; the transport configuration looks fine.
Looking at the 'create_transport' error, the program cannot create the connection because port 5060 is already occupied by some other program.
You kill that process and can then run the program without any error, and you say it happens only after a restart, so on system startup some program is grabbing the port.
You can check which program it is like this:
sudo netstat -nlp|grep 5060
in your case it will give like
1137/ProgramName
Find 'ProgramName' in your startup configuration and change it so that it does not grab the port.
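If you want your script to fail fast (or choose another port) instead of crashing, a simple pre-flight check along these lines can help; this is just an illustrative sketch, not part of the pjsua API:
import socket

def udp_port_is_free(addr, port=5060):
    # Try to bind the UDP port pjsua is about to use; if the bind fails,
    # something else already owns it.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.bind((addr, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()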

redis.exceptions.ConnectionError after approximately one day celery running

This is my full trace:
Traceback (most recent call last):
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/app/trace.py", line 283, in trace_task
uuid, retval, SUCCESS, request=task_request,
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 256, in store_result
request=request, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 490, in _store_result
self.set(self.get_key_for_task(task_id), self.encode(meta))
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 160, in set
return self.ensure(self._set, (key, value), **retry_policy)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 149, in ensure
**retry_policy
File "/home/server/backend/venv/lib/python3.4/site-packages/kombu/utils/__init__.py", line 243, in retry_over_time
return fun(*args, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 169, in _set
pipe.execute()
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2593, in execute
return execute(conn, stack, raise_on_error)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2447, in _execute_transaction
connection.send_packed_command(all_cmds)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 532, in send_packed_command
self.connect()
File "/home/pserver/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 436, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 0 connecting to localhost:6379. Error.
[2016-09-21 10:47:18,814: WARNING/Worker-747] Data collector is not contactable. This can be because of a network issue or because of the data collector being restarted. In the event that contact cannot be made after a period of time then please report this problem to New Relic support for further investigation. The error raised was ConnectionError(ProtocolError('Connection aborted.', BlockingIOError(11, 'Resource temporarily unavailable')),).
I searched extensively for this ConnectionError but found no problem matching mine.
My platform is Ubuntu 14.04. This is part of my Redis config (I can share the whole redis.conf file if you need it; by the way, all parameters in the LIMITS section are left commented out).
# By default Redis listens for connections from all the network interfaces
# available on the server. It is possible to listen to just one or multiple
# interfaces using the "bind" configuration directive, followed by one or
# more IP addresses.
#
# Examples:
#
# bind 192.168.1.100 10.0.0.1
bind 127.0.0.1
# Specify the path for the unix socket that will be used to listen for
# incoming connections. There is no default, so Redis will not listen
# on a unix socket when not specified.
#
# unixsocket /var/run/redis/redis.sock
# unixsocketperm 755
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
# TCP keepalive.
#
# If non-zero, use SO_KEEPALIVE to send TCP ACKs to clients in absence
# of communication. This is useful for two reasons:
#
# 1) Detect dead peers.
# 2) Take the connection alive from the point of view of network
# equipment in the middle.
#
# On Linux, the specified value (in seconds) is the period used to send ACKs.
# Note that to close the connection the double of the time is needed.
# On other kernels the period depends on the kernel configuration.
#
# A reasonable value for this option is 60 seconds.
tcp-keepalive 60
This is my mini redis wrapper:
import redis
from django.conf import settings

REDIS_POOL = redis.ConnectionPool(host=settings.REDIS_HOST, port=settings.REDIS_PORT)

def get_redis_server():
    return redis.Redis(connection_pool=REDIS_POOL)
And this is how I use it:
from celery import shared_task

from redis_wrapper import get_redis_server

# the view and the task run in different, independent processes

def sample_view(request):
    rs = get_redis_server()
    # some get-set stuff with redis

@shared_task
def sample_celery_task():
    rs = get_redis_server()
    # some get-set stuff with redis
Package versions:
celery==3.1.18
django-celery==3.1.16
kombu==3.0.26
redis==2.10.3
So the problem is this: the connection error occurs some time after starting the Celery workers, and after its first occurrence all tasks end with this error until I restart all of my Celery workers. (Interestingly, Celery Flower also fails during that problematic period.)
I suspect my Redis connection pool usage, the Redis configuration, or (less probably) network issues. Any ideas about the cause? What am I doing wrong?
(PS: I will add redis-cli info results when I next see this error.)
UPDATE:
I temporarily solved this problem by adding the --maxtasksperchild parameter to my worker start command, set to 200. Of course this is not the proper way to solve the problem, just a symptomatic cure: it refreshes the worker instance periodically (closing the old process and creating a new one once it has run 200 tasks), which also refreshes my global Redis pool and connections. So I think I should focus on how the global Redis connection pool is used, and I'm still waiting for new ideas and comments.
Sorry for my bad English and thanks in advance.
Have you enabled the RDB background save method in Redis?
If so, check the size of the dump.rdb file in /var/lib/redis.
Sometimes the file grows until it fills the partition, the Redis instance can no longer save to it, and Redis then starts refusing writes.
You can tell Redis to keep accepting writes even when a background save fails by issuing
config set stop-writes-on-bgsave-error no
on redis-cli.
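A small sketch of how you might check this and apply that setting from Python with redis-py (connection parameters are assumptions; the same can of course be done directly in redis-cli):
import redis

r = redis.Redis(host='localhost', port=6379)

# Inspect persistence state: did the last background save fail?
info = r.info('persistence')
print(info.get('rdb_last_bgsave_status'))
print(info.get('rdb_last_save_time'))

# Keep accepting writes even if a background save fails
# (treats the symptom only; the disk still needs freeing).
r.config_set('stop-writes-on-bgsave-error', 'no')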

How do I configure my uWsgi server to protect against the Unreadable Post Error?

This is the problem:
File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/six.py", line 535, in next
return type(self).__next__(self)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/multipartparser.py", line 344, in __next__
output = next(self._producer)
File "/app/.heroku/python/lib/python2.7/site-packages/django/utils/six.py", line 535, in next
return type(self).__next__(self)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/multipartparser.py", line 406, in __next__
data = self.flo.read(self.chunk_size)
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/request.py", line 267, in read
six.reraise(UnreadablePostError, UnreadablePostError(*e.args), sys.exc_info()[2])
File "/app/.heroku/python/lib/python2.7/site-packages/django/http/request.py", line 265, in read
return self._stream.read(*args, **kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 59, in read
result = self.buffer + self._read_limited(size - len(self.buffer))
File "/app/.heroku/python/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 47, in _read_limited
result = self.stream.read(size)
UnreadablePostError: error during read(65536) on wsgi.input
My current configuration reads like this:
[uwsgi]
http-socket = :$(PORT)
master = true
processes = 4
die-on-term = true
module = app.wsgi:application
memory-report = true
chunked-input-limit = 25000000
chunked-input-timeout = 300
socket-timeout = 300
Python: 2.7.x | uWsgi: 2.0.10
To make the problem more specific: this happens when I process images synchronously as part of an image upload. I know that ideally I should do this with Celery, but because of a business requirement I cannot. So I need to configure the timeouts so that I can accept a large image file, process it, and then return the response.
Any light on this would be extremely helpful. Thank you.
The error quoted in the description isn't the full picture; the relevant part is this log entry:
[uwsgi-body-read] Error reading 65536 bytes … message: Client closed connection uwsgi_response_write_body_do() TIMEOUT
This specific error is being raised because (most probably) the client, or something between it and uWSGI, aborted the request.
There are a number of possible causes for this:
A buggy client
Network-level filtering (DPI or some misconfigured firewall)
Bugs / misconfiguration in the server in front of uWSGI
The last one is covered in the uWSGI docs:
If you plan to put uWSGI behind a proxy/router be sure it supports chunked input requests (or generally raw HTTP requests).
To verify your issue really isn't in uWSGI, try to upload the file via the console on the server hosting your uWSGI application. Hit the HTTP endpoint directly, bypassing nginx/haproxy and friends.
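For example, a quick check from Python on the host itself might look like this (URL, port, path, and field name are all assumptions -- point it at your uWSGI http-socket and upload view):
import requests

# Post a large file straight to uWSGI, bypassing any proxy in front of it.
# If this succeeds but uploads through nginx/haproxy fail, the problem is
# in the proxy layer, not in uWSGI or Django.
with open('large_test_image.jpg', 'rb') as f:
    resp = requests.post('http://127.0.0.1:8000/upload/',
                         files={'image': f},
                         timeout=300)
print(resp.status_code)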

cherrypy not closing the sockets

I am using CherryPy as a webserver. It gives good performance for my application, but there is a very big problem with it: CherryPy crashes after a couple of hours, stating that it could not create a socket because there are too many files open:
[21/Oct/2008:12:44:25] ENGINE HTTP Server
cherrypy._cpwsgi_server.CPWSGIServer(('0.0.0.0', 8080)) shut down
[21/Oct/2008:12:44:25] ENGINE Stopped thread '_TimeoutMonitor'.
[21/Oct/2008:12:44:25] ENGINE Stopped thread 'Autoreloader'.
[21/Oct/2008:12:44:25] ENGINE Bus STOPPED
[21/Oct/2008:12:44:25] ENGINE Bus EXITING
[21/Oct/2008:12:44:25] ENGINE Bus EXITED
Exception in thread HTTPServer Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.3/threading.py", line 436, in __bootstrap
self.run()
File "/usr/lib/python2.3/threading.py", line 416, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.3/site-packages/cherrypy/process/servers.py", line 73, in
_start_http_thread
self.httpserver.start()
File "/usr/lib/python2.3/site-packages/cherrypy/wsgiserver/__init__.py", line 1388, in start
self.tick()
File "/usr/lib/python2.3/site-packages/cherrypy/wsgiserver/__init__.py", line 1417, in tick
s, addr = self.socket.accept()
File "/usr/lib/python2.3/socket.py", line 167, in accept
sock, addr = self._sock.accept()
error: (24, 'Too many open files')
[21/Oct/2008:12:44:25] ENGINE Waiting for child threads to terminate..
I tried to figure out what was happening. My application does not open any files or sockets itself; it only opens a couple of Berkeley DBs. I investigated further and looked at the file descriptors used by my CherryPy process (pid 4536) in /proc/4536/fd/.
Initially new sockets were created and cleaned up properly, but after an hour I found about 509 sockets that were not cleaned up, all in the CLOSE_WAIT state. I got this information using the following command:
netstat -ap | grep "4536" | grep CLOSE_WAIT | wc -l
CLOSE_WAIT means the remote client has closed the connection. Why is CherryPy then not closing the socket and freeing the file descriptor? What can I do to resolve the problem?
I tried to play with the following:
cherrypy.config.update({'server.socketQueueSize': '10'})
I thought this would restrict the number of sockets open at any time to 10, but it had no effect at all. This is the only config value I have set, so the rest hold their defaults.
Could somebody shed light on this? Do you think it's a bug in CherryPy? How can I resolve it? Is there a way I can close these sockets myself?
Following is my systems info:
CherryPy-3.1.0
python 2.3.4
Red Hat Enterprise Linux ES release 4 (Nahant Update 7)
Thanks in advance!
I imagine you're storing (in-memory) some piece of data which has a reference to the socket; if you store the request objects anywhere, for instance, that would likely do it.
The last-ditch chance for sockets to be closed is when they're garbage-collected; if you're doing anything that would prevent garbage collection from reaching them, there's your problem. I suggest that you try to reproduce with a Hello World program written in CherryPy; if you can't reproduce there, you know it's in your code -- look for places where you're persisting information which could (directly or otherwise) reference the socket.
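For the reproduction test, something as small as the standard CherryPy hello-world is enough; a sketch below, using the attribute style that also works on older Pythons (port and handler names are arbitrary):
import cherrypy

class HelloWorld(object):
    def index(self):
        return "Hello world!"
    index.exposed = True

# Bind to the same port as the real app so the comparison is fair.
cherrypy.config.update({'server.socket_port': 8080})
cherrypy.quickstart(HelloWorld())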
