python SocketServer stuck on waitpid() syscall


I am using Python (2.7) SocketServer with ForkingMixIn. It worked well.
However, sometimes under heavy usage (tons of rapidly connecting/disconnecting clients) the "server" gets stuck, consuming all the idle CPU (top shows 100% CPU). If I run strace on the process from the CLI, it shows an endless sequence of waitpid() syscalls. According to "ps", though, there are no child processes at this point.
After this problem my server becomes unusable and only restarting it helps :( Clients can connect but get no answer. I guess only the "backlog" queue on the OS side is used, and the Python code never accepts the connection.
It can be easily reproduced, e.g. with some primitive HTTP implementation and a browser (I used Chrome) with CTRL-R (reload) held down for something like 10 seconds. Of course the problem is also triggered without this "brutal" test, in normal usage, just more rarely, and it was quite hard to even come up with an idea of what the problem might be. I wrote my own implementation of something like SocketServer with os.fork() and the socket functions, and it does not have this problem, but I would be happier with a ready-made, "standard" solution.
The problem: it is not a nice thing, as my script implementing a server can be DoS'ed very easily this way.
What I noticed: I installed a signal handler for SIGCHLD. If I remove it, I can't reproduce the problem, but then I see zombie processes (I guess because they are not wait()'ed). Even if I install signal.SIG_IGN as the handler, I experience this problem.
Can anybody help me find out what the problem is and how I can solve it? I'd like to use a signal handler anyway, since it's not nice to leave many zombie processes around, especially after a long run.
Thanks for any idea.

maybe related: What is the cost of many TIME_WAIT on the server side?
It is possible that all of your available connections are in the TIME_WAIT state.
check sysctl net.core.somaxconn for the maximum listen backlog (the queue of connections waiting to be accepted).
check sysctl net.ipv4 for other configuration details (e.g. tw
check ulimit -n for max open file descriptors (sockets included)
you can try: sysctl net.ipv4.tcp_tw_reuse=1 to quickly reuse those sockets (don't keep it enabled unless you know what you're doing.)
check for file handle leaks.
[not-so] stupid question: how is your SocketServer implementation different from the standard one + ForkingMixIn?
However, it is really easy to abuse a ForkingMixIn (fork bomb), you might want to use green threads, e.g. the eventlet library ( http://eventlet.net/doc/index.html )
this might be your problem.
this: http://bugs.python.org/issue7978
this: http://mail.python.org/pipermail/python-bugs-list/2010-April/095492.html
this: http://twistedmatrix.com/trac/ticket/733
You will see that a SIGCHLD handler is discouraged unless you take some extra measures (calling signal.siginterrupt(signal.SIGCHLD, False) in the handler, or using a wake-up fd in the select() call).
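A minimal sketch of the first measure, assuming a Unix system. The names reap_children, install_handler and the reaped list are made up for illustration. The handler collects exited children with a non-blocking waitpid() so it can never spin or block, and siginterrupt() is called after signal.signal() because installing a handler resets the restart behaviour. Note that ForkingMixIn also waits for its own children, so a handler like this may compete with it; that interaction is an assumption to check against the SocketServer version in use.

```python
import errno
import os
import signal

reaped = []  # (pid, status) pairs collected by the handler; illustration only


def reap_children(signum, frame):
    """Collect every child that has already exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as exc:
            if exc.errno == errno.ECHILD:  # no children left at all
                return
            raise
        if pid == 0:  # children exist, but none has exited yet
            return
        reaped.append((pid, status))


def install_handler():
    signal.signal(signal.SIGCHLD, reap_children)
    # Ask for interrupted system calls to be restarted instead of failing
    # with EINTR when SIGCHLD arrives. Must come after signal.signal().
    signal.siginterrupt(signal.SIGCHLD, False)
```

Call install_handler() once in the parent before the server starts forking.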

Related

How to trigger clean shutdown of FastAPI/Uvicorn

I am running a number of FastAPI instances with uvicorn with python's subprocess.Popen. I have a small GUI made with PySimpleGUI with which I want to be able to close servers and restart them at will.
The first problem I encountered is that, at least on Windows, starting the uvicorn server appears to create not one but two new processes, and calling Popen.terminate() only closes one of them, which does not free up the port associated with the server. I fixed this by using the psutil package to check which new processes were created after I instantiate a Popen object, and to track and terminate the second process.
What is still a major problem is that calling psutil.terminate() on the process does not call the FastAPI function under @app.on_event("shutdown"). In the past, we ran all of our servers in individual terminal windows and found that ctrl-c in those windows calls the shutdown event, but I have found no other way of doing so. ctrl-c on my interface will obviously take down the interface and all the servers, and is somewhat unreliable in hitting the shutdown events for all the servers. My other idea was to use psutil.send_signal(signal.CTRL_C_EVENT), but this has the same effect as pressing ctrl-c in the terminal.
So I am at a loss. I have seen multiple posts saying that this is a general shortcoming of uvicorn, but have not seen anything that directly confirms my own experience or offers a solution. I also know that the "shutdown" and "startup" events in FastAPI are ported in from Starlette, and are not very well documented in either package. I have seen suggestions to use gunicorn, but my brief look into that confirmed that it is not compatible with Windows. Any suggestions?

TL;DR:
APIs are meant to be long-running processes
there is a whole industry around virtualization that automatically manages the orchestration of when to start or stop a service
there is also "serverless" infrastructure you can hang any of your processes on, so that you don't have to spend any effort on this at all
If you still want to go against everyone else and manage it yourself, you can do as in this answered question:
##### SOLUTION #####
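import psutil

pid = proc.pid  # proc is the subprocess.Popen object that started uvicorn
parent = psutil.Process(pid)
# Kill every descendant of the launcher process, e.g. the second uvicorn process.
for child in parent.children(recursive=True):
    child.kill()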
##### SOLUTION END ####
A bit of explanation:
From its conception as an architectural pattern, a REST API was meant to be always waiting for user requests coming over the web. Gracefully shutting down something that "was meant to run forever" has never been the general intent; as an industry, we build processes to keep it running 24/7/365.
If you ever want to start or stop one or many APIs simultaneously on the same device, it is highly recommended that you at least go with something like containers and Kubernetes, and script commands against the Kubernetes CLI for that purpose. In exchange for the extra effort you gain process isolation from other processes and from the base OS layer (which will still be less effort than building all that tooling yourself).
My personal favorite is not even doing that and going straight to lambdas, as that is way easier and better in so many ways. Don't take it from me but from one of the industry-leading companies, Cloudflare, and their statement on the subject:
Serverless computing offers a number of advantages over traditional cloud-based or server-centric infrastructure. For many developers, serverless architectures offer greater scalability, more flexibility, and quicker time to release, all at a reduced cost.

Is there any way to terminate a running function from a thread?

I've recently been trying to write my own socket server in Python.
While writing a thread to handle server commands (a sort of command line inside the server), I tried to implement code that restarts the server when raw_input() receives a specific command.
Basically, I want to restart the server as soon as the "Running" variable changes its state from True to False. When it does, I would like to stop the function that started the thread (returning to the main function) and then run it again. Is there a way to do it?
Thank you very much, and I hope I was clear about my problem,
Idan :)

Communication between threads can be done with Events, Queues, Semaphores, etc. Check them out and choose the one that fits your problem best.
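As a rough sketch of the Event option: the server loop checks a threading.Event between units of work, and the command thread sets it to request a stop. The names serve and run_once and the "tick" placeholder are made up for illustration; a real loop would handle one request where the placeholder is.

```python
import threading


def serve(stop_event, handled):
    """Hypothetical server loop: do one unit of work, then re-check the flag."""
    while not stop_event.is_set():
        handled.append("tick")   # stand-in for handling one request
        stop_event.wait(0.01)    # sleeps, but returns early once the event is set


def run_once():
    stop_event = threading.Event()
    handled = []
    worker = threading.Thread(target=serve, args=(stop_event, handled))
    worker.start()
    stop_event.set()             # what the command thread would do on "restart"
    worker.join(timeout=5)
    return worker.is_alive(), handled
```

To restart, the main function can simply call the serving function again with a fresh Event once the old worker has been joined. This only works if the loop never blocks indefinitely inside a single call.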

You can't abort a thread, or raise an exception into it asynchronously, in Python.
The standard Unix solution to this problem is to use a non-blocking socket, create a pipe with pipe, replace all your blocking sock.recv calls with a blocking r, _, _ = select.select([sock, pipe], [], []), and then the other thread can write to the pipe to wake up the other thread.
To make this portable to Windows you'll need to create a UDP localhost socket instead of a pipe, which makes things slightly more complicated, but it's still not hard.
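A minimal sketch of the pipe-and-select approach, assuming a Unix system (it uses os.pipe, so it is not the portable Windows variant). The function names and the "stopped" marker are made up for illustration, and socket.socketpair() stands in for a real client connection.

```python
import os
import select
import socket
import threading


def wait_for_data_or_stop(sock, stop_r, result):
    """Block until sock is readable or the stop pipe has been written to."""
    readable, _, _ = select.select([sock, stop_r], [], [])
    if stop_r in readable:
        os.read(stop_r, 1)               # drain the wake-up byte
        result.append("stopped")
    else:
        result.append(sock.recv(4096))


def demo():
    a, b = socket.socketpair()           # stand-in for a client connection
    stop_r, stop_w = os.pipe()
    result = []
    waiter = threading.Thread(target=wait_for_data_or_stop,
                              args=(a, stop_r, result))
    waiter.start()
    os.write(stop_w, b"x")               # the other thread asks the waiter to stop
    waiter.join(timeout=5)
    for fd in (stop_r, stop_w):
        os.close(fd)
    a.close()
    b.close()
    return result
```

Because the waiting thread blocks in select() rather than in recv(), a single byte written to the pipe is enough to wake it up.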
Or, of course, you can use a higher-level framework, like asyncio in 3.4+, or twisted or another third-party lib, which will wrap this up for you. (Most of them are already running the equivalent of a loop around select to service lots of clients in one thread or a small thread pool, so it's trivial to toss in a stop pipe.)
Are there other alternatives? Yes, but all less portable and less good in a variety of other ways.
Most platforms have a way to asynchronously kill or signal another thread, which you can access via, e.g., ctypes. But this is a bad idea, because it will prevent Python from doing any normal cleanup. Even if you don't get a segfault, this could mean files never get flushed and end up with incomplete/garbage data, locks are left acquired to deadlock your program somewhere completely unrelated a short time later, memory gets leaked, etc.
If you're specifically trying to interrupt the main thread, and you only care about CPython on Unix, you can use a signal handler and the kill function. The signal will take effect on the next Python bytecode, and if the interpreter is blocked on any kind of I/O (or most other syscalls, e.g., inside a sleep), the system will return to the interpreter with an EINTR, allowing it to interrupt immediately. If the interpreter is blocked on something else, like a call to a C library that blocks signals or just does nothing but CPU work for 30 seconds, then you'll have to wait 30 seconds (although that doesn't come up that often, and you should know if it will in your case). Also, threads and signals don't play nice on some older *nix platforms. And signals don't work the same way on Windows, or in some other Python implementations like Jython.
On some platforms (including Windows, but not most modern *nix platforms), you can wake up a blocking socket call just by closing the socket out from under the waiting thread. On other platforms, this will not unblock the thread, or will do it sometimes but not other times (and theoretically it could even segfault your program or leave the socket library in an unusable state, although I don't think either of those will happen on any modern platform).

As far as I understand the documentation, and from some experiments I've done over the last few weeks, there is no way to really force another thread to 'stop' or 'abort', unless the function is aware that it may be stopped and has a foolproof way of avoiding getting stuck in an I/O function. In that case you can use some communication method such as semaphores. The only exception is the specialized threading.Timer class, which has a cancel() method.
So, if you really want to stop the server thread forcefully, you might want to think about running it in a separate process, not a thread.
EDIT: I'm not sure why you want to restart the server; I just assumed it was in case of a failure. The normal procedure in a server is to loop waiting for connections on the socket, and when a connection appears, attend to it and return to that loop.
A better way is to use the GIO library (part of glib) and connect methods to the connection event, so that the connection is attended to asynchronously. This avoids the loop completely. I don't have any real code for this in Python, but here's an example of a client in Python (which uses GIO for reception events) and a server in C, which uses GIO for connections.
Use of GIO makes life so much easier...

Named Event in Python

In python, is there a cross platform way of creating something similar to Windows named Event in one process, and set it from another process to signal something to the first one?
My specific problem is that I need to create a process that on startup will check if any other instances of itself are running, and if so, signal them to quit. With Windows API I would use CreateEvent with the lpName parameter, and SetEvent.

I've spent about a day now searching for a good answer to this and here is what I am coming up with at this moment:
It is possible to use signals to tell the process that some change needs to take place; however, in the more complex legacy codebase I am dealing with, this causes the process to crash. According to the Python signal docs, signals interrupt various I/O operations and the like. You can implement a signal handler with signal.SIGUSR1:
import signal

def signal_handler(signum, stack):
    print('Signal %d received' % signum)

signal.signal(signal.SIGUSR1, signal_handler)
This code can be triggered in Linux et al. through:
$ kill -s SIGUSR1 $pid
I am presently leaning towards kazoo, a Python ZooKeeper library. It requires standing up ZooKeeper as infrastructure.
In my case I have an additional need for toggling configuration values. However, ZooKeeper supports a number of interprocess communication tools that will serve your needs.
UPDATE:
I finally settled on a named pipe (FIFO), calling it inside a thread with readline.
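import os

# fifo_name is the path of the named pipe, chosen elsewhere.
if not os.path.exists(fifo_name):
    os.mkfifo(fifo_name)
while True:
    # open() blocks until some writer opens the FIFO.
    with open(fifo_name, 'r') as config_fifo:
        line = config_fifo.readline()[:-1]  # drop the trailing newline
        print(line)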
I used tempfile.gettempdir() to find a good location for the FIFO in the file system. It requires quite a bit of refinement, however, since I did not care to parse the passed content while you might. Also, if you are planning on having more than one consumer of the event, it will be delivered to only one of them, as it is a queue.
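For completeness, here is a rough sketch of the other side: a second process writing one line into the FIFO to signal the reader. This assumes a Unix system where os.mkfifo is available; send_event and read_one_event are made-up names, and read_one_event is only a single-shot version of the reading loop so the pair can be tried together.

```python
import os


def send_event(fifo_name, message):
    """Write one line to the FIFO; blocks until a reader has it open."""
    with open(fifo_name, "w") as fifo:
        fifo.write(message + "\n")


def read_one_event(fifo_name):
    """Read one line from the FIFO; blocks until a writer has it open."""
    with open(fifo_name, "r") as fifo:
        return fifo.readline().rstrip("\n")
```

A newly started instance could call send_event(path, "quit") against the FIFO of the instance that is already running.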

It seems to me that this is not so much a question as to whether this is possible in Python, but whether such a cross-platform approach exists: if one does, then even if no directly written Python exists, one can always make system calls using subprocess.call() and the like.
As for whether it's a possibility, I can't profess to be much of an expert, but a bit of a search has thrown up these discussions which might prove helpful to you.

ssh - difference blocking and non blocking mode

It sounds like such a stupid question, since every developer who uses an SSH library has probably asked himself this (?). But I can't really find what the difference between blocking and non-blocking is...
I mean, OK... One blocks until it receives the answer; the other sends the queries and returns immediately, and then you check the reply buffer yourself... I got that part.
But why use one rather than the other? I can't manage to find the answer...
Is it about performance? And if there is a difference, why?
Thanks in advance for any answer to this questions.
--- Edit: Forget about the following "bonus question". I've finally coded the non-blocking mode and I experience the same problem, so it must be something in libssh2. So I still don't get the added value of non-blocking mode... ---
Bonus question:
I'm not really sure: could this difference explain something I'm experiencing?
I have a Python script which connects to many hosts to run several commands.
It was using the paramiko library in non-blocking mode. Paramiko is pure Python and really slow at establishing SSH connections to many hosts...
I'm replacing it with pylibssh2, which is a set of Python bindings for the C library libssh2. Since I didn't understand the difference, I started coding in blocking mode.
Results:
- libssh2 is much faster than paramiko (connection to 230 hosts in parallel in 4s instead of 1m30s)
- For running commands successively, libssh2 is also faster.
- When I run commands through SSH from several parallel threads, the code with libssh2 in blocking mode becomes slower than paramiko in non-blocking mode.
- I also noticed that the CPU consumption is very low compared with the previous version. I guess part of this is related to C vs Python, but it seems that, beyond the SSH API, my script itself performs fewer actions. Are threads blocking each other when sending commands through SSH in blocking mode?

The reason is that if you want to do two things at once, say read from some other network connection, and your SSH session, you have two options:
use blocking APIs, and use two threads or processes so you can do them both
use non-blocking APIs so the same thread can do both
This latter approach is called Asynchronous I/O. See for example twisted which uses it extensively.
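To make the second option concrete, here is a minimal sketch in which one thread waits on two connections at once using non-blocking sockets and select(). It uses plain sockets rather than an SSH library; read_both and the "a"/"b" labels are made up for illustration.

```python
import select
import socket


def read_both(conn_a, conn_b, timeout=5.0):
    """Collect one message from each connection using a single thread."""
    pending = {conn_a: "a", conn_b: "b"}
    received = {}
    for conn in pending:
        conn.setblocking(False)
    while pending:
        # Sleep until at least one of the remaining sockets has data.
        readable, _, _ = select.select(list(pending), [], [], timeout)
        if not readable:
            break                        # nothing arrived within the timeout
        for conn in readable:
            received[pending.pop(conn)] = conn.recv(4096)
    return received
```

Whichever peer answers first is handled first, so a slow host does not hold up the others. With blocking calls you would need one thread per connection to get the same effect.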

too many threads due to synch communication

I'm using threads and xmlrpclib in Python at the same time. Periodically, I create a bunch of threads to complete a service on a remote server via xmlrpclib. The problem is that there are times when the remote server doesn't answer. This causes the thread to wait forever for a response which it never gets. Over time, the number of threads in this state increases and reaches the maximum number of threads allowed on the system (I'm using Fedora).
I tried to use socket.setdefaulttimeout(10), but the exception it raises causes the server to become defunct. I used it on the server side, but it seems that it doesn't work :/
Any idea how can I handle this issue?

You are doing what I usually call (originally in Spanish xD) "happy road programming". You should implement your programs to handle undesired cases, not only the ones you want to happen.
The threads here are only showing an underlying mistake: your server can't handle a timeout, and the implementation is rigid in a way that adding a timeout causes the server to crash due to an unhandled exception.
Implement it more robustly: it must be able to withstand an exception, servers can't die because of a misbehaving client. If you don't fix this kind of problem now, you may have similar issues later on.
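As a sketch of what withstanding the exception can look like on the calling side, the function below treats a timeout as an ordinary outcome instead of letting it propagate. It uses a plain socket as a stand-in for the xmlrpclib call; call_with_timeout, the "ping" payload and the None return value are assumptions made for illustration.

```python
import socket


def call_with_timeout(address, timeout=0.5):
    """Return the reply, or None if the peer stays silent or is unreachable."""
    try:
        conn = socket.create_connection(address, timeout=timeout)
    except OSError:
        return None                      # could not connect at all
    try:
        conn.sendall(b"ping")
        return conn.recv(4096)
    except socket.timeout:
        return None                      # give up instead of waiting forever
    finally:
        conn.close()
```

A worker thread built around a call like this always finishes, so the thread count stays bounded even when the remote side never answers.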

It seems like your real problem is that the server hangs on certain requests, and dies if the client closes the socket - the threads are just a side effect of the implementation. If I'm understanding what you're saying correctly, then the only way to fix this would be to fix the server to respond to all requests, or to be more robust with network failure, or (preferably) both.
