I am using CherryPy as a web server. It performs well for my application, but there is one big problem: CherryPy crashes after a couple of hours, stating that it could not create a socket because there are too many files open:
[21/Oct/2008:12:44:25] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('0.0.0.0', 8080)) shut down
[21/Oct/2008:12:44:25] ENGINE Stopped thread '_TimeoutMonitor'.
[21/Oct/2008:12:44:25] ENGINE Stopped thread 'Autoreloader'.
[21/Oct/2008:12:44:25] ENGINE Bus STOPPED
[21/Oct/2008:12:44:25] ENGINE Bus EXITING
[21/Oct/2008:12:44:25] ENGINE Bus EXITED
Exception in thread HTTPServer Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.3/threading.py", line 436, in __bootstrap
self.run()
File "/usr/lib/python2.3/threading.py", line 416, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.3/site-packages/cherrypy/process/servers.py", line 73, in _start_http_thread
self.httpserver.start()
File "/usr/lib/python2.3/site-packages/cherrypy/wsgiserver/__init__.py", line 1388, in start
self.tick()
File "/usr/lib/python2.3/site-packages/cherrypy/wsgiserver/__init__.py", line 1417, in tick
s, addr = self.socket.accept()
File "/usr/lib/python2.3/socket.py", line 167, in accept
sock, addr = self._sock.accept()
error: (24, 'Too many open files')
[21/Oct/2008:12:44:25] ENGINE Waiting for child threads to terminate..
I tried to figure out what was happening. My application does not open any files or sockets itself; it only opens a couple of Berkeley DBs. Investigating further, I looked at the file descriptors used by my CherryPy process (PID 4536) in /proc/4536/fd/.
Initially, new sockets were created and cleaned up properly, but after an hour I found about 509 sockets that had not been cleaned up. All of them were in the CLOSE_WAIT state. I got this information using the following command:
netstat -ap | grep "4536" | grep CLOSE_WAIT | wc -l
CLOSE_WAIT means that the remote client has closed the connection. Why, then, is CherryPy not closing the socket and freeing the file descriptor? What can I do to resolve the problem?
I tried to play with the following:
cherrypy.config.update({'server.socketQueueSize': '10'})
I thought that this would restrict the number of sockets open at any time to 10, but it had no effect. This is the only config I have set, so the rest of the settings hold their default values.
Could somebody shed light on this? Do you think it's a bug in CherryPy? How can I resolve it? Is there a way I can close these sockets myself?
My system info:
CherryPy-3.1.0
python 2.3.4
Red Hat Enterprise Linux ES release 4 (Nahant Update 7)
Thanks in advance!
I imagine you're storing (in-memory) some piece of data which has a reference to the socket; if you store the request objects anywhere, for instance, that would likely do it.
The last-ditch chance for sockets to be closed is when they're garbage-collected; if you're doing anything that would prevent garbage collection from reaching them, there's your problem. I suggest that you try to reproduce with a Hello World program written in CherryPy; if you can't reproduce there, you know it's in your code -- look for places where you're persisting information which could (directly or otherwise) reference the socket.
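To illustrate the point about references, here is a minimal sketch. It uses modern Python 3 rather than the Python 2.3 from the question, and the `leaked` list and `handle_connection` function are hypothetical stand-ins for whatever structure in the application holds on to request or socket objects. As long as something still references the socket, it is never finalized, so its file descriptor stays open:

```python
import gc
import socket
import weakref

# Hypothetical stand-in for an application-level cache that keeps objects alive.
leaked = []

def handle_connection(conn):
    """Pretend request handler that accidentally stores the connection."""
    leaked.append(conn)

server_side, client_side = socket.socketpair()
probe = weakref.ref(server_side)   # lets us observe when the socket is freed

handle_connection(server_side)
del server_side                    # the caller's own reference is gone...
gc.collect()
still_alive = probe() is not None  # ...but the list still pins the socket

leaked.clear()                     # drop the stray reference
gc.collect()
freed = probe() is None            # only now can the socket be finalized

client_side.close()
```

If the Hello World test suggested in the answer does not leak, hunting for a container like `leaked` in the real application is the natural next step.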
I am trying to run two servers from a simple script. I have used the solutions from here, there and others.
For instance, in the example below I am trying to serve two directories on ports 8000 and 8001.
import http.server
import socketserver
import os
import multiprocessing
def run_webserver(path, PORT):
    os.chdir(path)
    Handler = http.server.SimpleHTTPRequestHandler
    httpd = socketserver.TCPServer(('0.0.0.0', PORT), Handler)
    httpd.serve_forever()
    return
if __name__ == '__main__':
    # Define "services" directories and ports to use to host them
    server_details = [
        ("path/to/directory1", 8000),
        ("path/to/directory2", 8001)
    ]
    # Run servers
    servers = []
    for s in server_details:
        p = multiprocessing.Process(
            target=run_webserver,
            args=(s[0], s[1])
        )
        servers.append(p)
    for server in servers:
        server.start()
    for server in servers:
        server.join()
When I run the code above, everything works and I can access both directories at http://localhost:8000 and http://localhost:8001. However, when I exit the script with Ctrl+C and then try to run it again, I get the following error:
Traceback (most recent call last):
File "/home/user/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/user/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "repos/scripts/webhosts.py", line 12, in run_webserver
File "/home/user/anaconda3/lib/python3.6/socketserver.py", line 453, in __init__
self.server_bind()
File "/home/user/anaconda3/lib/python3.6/socketserver.py", line 467, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
This error only appears if I actually accessed the server while it was running. If I did not access it, I can re-run the script. From the error message it looks like something is still using the port, but when I run lsof -n -i4TCP:8000 and lsof -n -i4TCP:8001 I don't get anything. After a while the error stops appearing and I can run the script again.
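Before creating the server, add:
socketserver.TCPServer.allow_reuse_address = True
or set that attribute on the instance before its socket is bound, which means constructing the server with bind_and_activate=False and then calling server_bind() and server_activate() yourself:
httpd.allow_reuse_address = True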
Documentation reference:
BaseServer.allow_reuse_address
Whether the server will allow the reuse of an address. This defaults to False, and can be set in subclasses to change the policy.
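As a minimal sketch of the subclass route that the documentation mentions (the names `ReusableTCPServer`, `make_server` and `has_reuseaddr` are made up for this example), the attribute can be set on a small subclass so that it is already in place when the constructor binds the socket:

```python
import http.server
import socket
import socketserver

class ReusableTCPServer(socketserver.TCPServer):
    # Read by server_bind(), which runs inside __init__, so it has to be set
    # on the class (or before binding), not after the server is constructed.
    allow_reuse_address = True

def make_server(port, host="0.0.0.0"):
    """Create (and bind) a file server on the given port without starting it."""
    handler = http.server.SimpleHTTPRequestHandler
    return ReusableTCPServer((host, port), handler)

def has_reuseaddr(server):
    """Report whether SO_REUSEADDR is actually set on the listening socket."""
    return server.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
```

In the question's script, `run_webserver` would then call `make_server(PORT).serve_forever()` instead of instantiating `socketserver.TCPServer` directly.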
Expanding on my comment:
In a previous edit the OP registered an exit handler with atexit.register(exit_handler). The question is: does it clean up your system resources (i.e. open sockets)?
If your program exits without closing its sockets (because you interrupted it with Ctrl+C), it may take some time for the OS to clean them up (because they are in the TIME_WAIT state). You can read about TIME_WAIT and how to avoid it here.
Using exit handlers is a good way to avoid this:
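import atexit
def clean_sockets():
    # mysocket stands for the server object created elsewhere in the script
    mysocket.server_close()
# register only after the handler has been defined
atexit.register(clean_sockets)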
I am trying to make calls using the PJSIP module in Python. To set up the SIP transport, I do the following:
trans_cfg = pj.TransportConfig()
# port for VoIP communication
trans_cfg.port = 5060
# local system address
trans_cfg.bound_addr = inputs.client_addr
transport = lib.create_transport(pj.TransportType.UDP,trans_cfg)
When I finish the call, I clear the transport by setting transport = None.
I am able to call a user by running my program. But whenever I restart my PC (and only then), I get an error when I run my Python program:
File "pjsuatrail_all.py", line 225, in <module>
main()
File "pjsuatrail_all.py", line 169, in main
transport = transport_setup()
File "pjsuatrail_all.py", line 54, in transport_setup
transport = lib.create_transport(pj.TransportType.UDP,trans_cfg)
File "/usr/local/lib/python2.7/dist-packages/pjsua.py", line 2304, in create_transport
self._err_check("create_transport()", self, err)
File "/usr/local/lib/python2.7/dist-packages/pjsua.py", line 2723, in _err_check
raise Error(op_name, obj, err_code, err_msg)
pjsua.Error: Object: Lib, operation=create_transport(), error=Address already in use
Exception AttributeError: "'NoneType' object has no attribute 'destroy'" in <bound method Lib.__del__ of <pjsua.Lib instance at 0x7f8a4bbb6170>> ignored
As a workaround, I currently do the following:
$sudo lsof -t -i:5060
>> 1137
$sudo kill 1137
After that, my code runs fine.
From the error, I understand that somewhere I am not closing my transport properly. Can anyone help in this regard?
Reference code used
From the details you give, this does not look like a problem with the pjsip wrapper. The transport configuration looks fine.
Looking at the 'create_transport' error, the program cannot create the transport because port 5060 is already occupied by another program.
That is why killing that process lets you run the program without any error. Since you say it happens only after a restart, some program that starts at boot is occupying the port.
You can check like this:
sudo netstat -nlp|grep 5060
In your case the output will include something like
1137/ProgramName
Go to 'ProgramName' in your startup configuration and change it so that it does not take that port.
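If you want the script itself to detect this situation before calling create_transport(), a small probe along these lines could help. This is only a sketch in Python 3 using the standard socket module, not part of the pjsua API, and the helper name `udp_port_in_use` is made up for the example:

```python
import errno
import socket

def udp_port_in_use(port, host="0.0.0.0"):
    """Return True if binding a UDP socket to (host, port) fails with EADDRINUSE."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # some other problem, e.g. a permission error
    else:
        return False
    finally:
        probe.close()
```

Calling `udp_port_in_use(5060)` before setting up the transport would let the program print a clear message instead of the pjsua traceback. It does not tell you which program holds the port; for that, netstat or lsof is still needed.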
This is my full trace:
Traceback (most recent call last):
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/app/trace.py", line 283, in trace_task
uuid, retval, SUCCESS, request=task_request,
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 256, in store_result
request=request, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/base.py", line 490, in _store_result
self.set(self.get_key_for_task(task_id), self.encode(meta))
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 160, in set
return self.ensure(self._set, (key, value), **retry_policy)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 149, in ensure
**retry_policy
File "/home/server/backend/venv/lib/python3.4/site-packages/kombu/utils/__init__.py", line 243, in retry_over_time
return fun(*args, **kwargs)
File "/home/server/backend/venv/lib/python3.4/site-packages/celery/backends/redis.py", line 169, in _set
pipe.execute()
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2593, in execute
return execute(conn, stack, raise_on_error)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/client.py", line 2447, in _execute_transaction
connection.send_packed_command(all_cmds)
File "/home/server/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 532, in send_packed_command
self.connect()
File "/home/pserver/backend/venv/lib/python3.4/site-packages/redis/connection.py", line 436, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 0 connecting to localhost:6379. Error.
[2016-09-21 10:47:18,814: WARNING/Worker-747] Data collector is not contactable. This can be because of a network issue or because of the data collector being restarted. In the event that contact cannot be made after a period of time then please report this problem to New Relic support for further investigation. The error raised was ConnectionError(ProtocolError('Connection aborted.', BlockingIOError(11, 'Resource temporarily unavailable')),).
I searched extensively for this ConnectionError but found no problem matching mine.
My platform is Ubuntu 14.04. This is part of my Redis config. (I can share the whole redis.conf file if you need it. By the way, all parameters in the LIMITS section are commented out.)
# By default Redis listens for connections from all the network interfaces
# available on the server. It is possible to listen to just one or multiple
# interfaces using the "bind" configuration directive, followed by one or
# more IP addresses.
#
# Examples:
#
# bind 192.168.1.100 10.0.0.1
bind 127.0.0.1
# Specify the path for the unix socket that will be used to listen for
# incoming connections. There is no default, so Redis will not listen
# on a unix socket when not specified.
#
# unixsocket /var/run/redis/redis.sock
# unixsocketperm 755
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 0
# TCP keepalive.
#
# If non-zero, use SO_KEEPALIVE to send TCP ACKs to clients in absence
# of communication. This is useful for two reasons:
#
# 1) Detect dead peers.
# 2) Take the connection alive from the point of view of network
# equipment in the middle.
#
# On Linux, the specified value (in seconds) is the period used to send ACKs.
# Note that to close the connection the double of the time is needed.
# On other kernels the period depends on the kernel configuration.
#
# A reasonable value for this option is 60 seconds.
tcp-keepalive 60
This is my mini redis wrapper:
import redis
from django.conf import settings
REDIS_POOL = redis.ConnectionPool(host=settings.REDIS_HOST, port=settings.REDIS_PORT)
def get_redis_server():
    return redis.Redis(connection_pool=REDIS_POOL)
And this is how I use it:
from celery import shared_task
from redis_wrapper import get_redis_server
# view and task run in different, independent processes
def sample_view(request):
    rs = get_redis_server()
    # some get-set stuff with redis
@shared_task
def sample_celery_task():
    rs = get_redis_server()
    # some get-set stuff with redis
Package versions:
celery==3.1.18
django-celery==3.1.16
kombu==3.0.26
redis==2.10.3
So the problem is this: the connection error starts occurring some time after the Celery workers start. After its first appearance, all tasks end with this error until I restart all of my Celery workers. (Interestingly, Celery Flower also fails during that period.)
I suspect my Redis connection pool usage, or my Redis configuration, or, less probably, network issues. Any ideas about the cause? What am I doing wrong?
(PS: I will add redis-cli info results when I see this error again today.)
UPDATE:
I temporarily solved this problem by adding the --maxtasksperchild parameter to my worker start command, set to 200. Of course this is not a proper fix; it only treats the symptom. It periodically recycles the worker (the old process is closed and a new one is created once it has run 200 tasks), which also refreshes my global Redis pool and connections. So I think I should focus on how I use the global Redis connection pool, and I'm still waiting for new ideas and comments.
Sorry for my bad English and thanks in advance.
Have you enabled RDB background saving in Redis?
If so, check the size of the dump.rdb file in /var/lib/redis.
Sometimes the file grows until it fills the root filesystem, and the Redis instance can no longer save to it.
You can tell Redis to keep accepting writes even when a background save fails by issuing
config set stop-writes-on-bgsave-error no
in redis-cli. This only works around the symptom; freeing disk space is the actual fix.
I've developed a Tornado app, but when more than one user logs in, it seems to log the previous user out. I come from an Apache background, so I thought Tornado would either spawn a thread or fork a process per request, but that does not seem to be what happens.
To mitigate this I've installed nginx and configured it as a reverse proxy to forward incoming requests to an available Tornado process. Nginx seems to work fine; however, when I try to start more than one Tornado process, each on a different port, I get the following error:
http_server.listen(options.port)
File "/usr/local/lib/python2.7/dist-packages/tornado/tcpserver.py", line 125, in listen
sockets = bind_sockets(port, address=address)
File "/usr/local/lib/python2.7/dist-packages/tornado/netutil.py", line 145, in bind_sockets
sock.bind(sockaddr)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 98] Address already in use
Basically, I get this for each process I try to start on a different port.
I've read that I should use supervisor to manage my Tornado processes, but I think that is more of a convenience. At the moment I'm wondering whether the problem is in my Tornado code or in my setup somewhere. My Python code looks like this:
from tornado.options import define, options
define("port", default=8000, help="run on given port", type=int)
....
http_server = tornado.httpserver.HTTPServer(app)
http_server.listen(options.port)
tornado.ioloop.IOLoop.instance().start()
My handlers all work fine and I can access the site when I go to localhost:8000. I just need a pair of fresh eyes, please. ;)
Well, I solved the problem. I had a .sh file that tried to start multiple processes with:
python initpumpkin.py --port=8000&
python initpumpkin.py --port=8001&
python initpumpkin.py --port=8002&
python initpumpkin.py --port=8003&
Unfortunately, I hadn't told Tornado to parse the command-line options, so every process fell back to the default port 8000 and tried to listen on it, hence the "address already in use" error. To fix this, make sure to call tornado.options.parse_command_line() at the start of your main block:
if __name__ == "__main__":
    tornado.options.parse_command_line()
Then run from the CLI with whatever arguments you need.
Have you tried starting your server in this manner:
server = tornado.httpserver.HTTPServer(app)
server.bind(port, "0.0.0.0")
server.start(0)
IOLoop.current().start()
server.start takes the number of processes as its parameter; 0 tells Tornado to use one process per CPU on the machine.
I am running Celery 2.2.4/djCelery 2.2.4, using RabbitMQ 2.1.1 as a backend. I recently brought online two new celery servers -- I had been running 2 workers across two machines with a total of ~18 threads and on my new souped up boxes (36g RAM + dual hyper-threaded quad-core), I am running 10 workers with 8 threads each, for a total of 180 threads -- my tasks are all pretty small so this should be fine.
The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here:
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
(20, 41), # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
s = self.sock.recv(65536)
I've checked the Rabbit logs, and I see the process trying to connect:
=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569
I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs EXCEPT that 2 of the workers can't connect to the broker:
[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...
All of the other nodes are able to connect without issue.
I know that there was a posting ( RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info ) last year of a similar nature, but I'm pretty certain that this is different. Could it be that the sheer number of workers is creating some sort of a race condition in amqplib -- I found this thread which seems to indicate that amqplib is not thread-safe, not sure if this matters for Celery.
EDIT: I've tried celeryctl purge on both nodes -- on one it succeeds, but on the other it fails with the following AMQP error:
AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.exceptions.AMQPConnectionException:
(530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX'
with different type, durable or autodelete value", (40, 10), 'Channel.exchange_declare')
On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.
EDIT2: I was able to delete the offending exchange using exchange.delete from camqadm and now the second node hangs too :(.
EDIT3: One thing that also recently changed is that I added an additional vhost to rabbitmq, which my staging node connects to.
Hopefully this will save somebody a lot of time, though it certainly does not save me any embarrassment:
/var was full on the server that was running Rabbit. With all of the nodes that I added, Rabbit was doing a lot more logging and it filled up /var. I couldn't write to /var/lib/rabbitmq, and so no messages were going through.
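A check like the following could be run from a monitoring script to catch this earlier. It is a minimal sketch using only the Python 3 standard library; the function names and the 0.95 threshold are arbitrary choices for illustration, not anything RabbitMQ or Celery provides:

```python
import shutil

def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as empty.
        return 0.0
    return usage.used / usage.total

def is_nearly_full(path, threshold=0.95):
    """True when the filesystem holding path is at or above the threshold."""
    return disk_usage_fraction(path) >= threshold
```

For the situation described above, the call would be `is_nearly_full("/var/lib/rabbitmq")`.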
I had the same symptoms but not the same cause. For anyone else who stumbles upon this: mine was solved by https://stackoverflow.com/a/63591450/284164. I wasn't importing the Celery app at the project level, and .delay() hung until I added that.