too many threads due to synch communication - python

I'm using threads and xmlrpclib in Python at the same time. Periodically, I create a bunch of threads to perform a service on a remote server via xmlrpclib. The problem is that sometimes the remote server doesn't answer, which leaves a thread waiting forever for a response it never gets. Over time, the number of threads in this state grows until it reaches the maximum number of threads allowed on the system (I'm using Fedora).
I tried socket.setdefaulttimeout(10), but the exception it raises causes the server to become defunct. I also used it on the server side, but it doesn't seem to help :/
Any idea how I can handle this issue?

You are doing what I usually call (originally in Spanish xD) "happy road programming". You should implement your programs to handle undesired cases, not only the ones you want to happen.
The threads here are only exposing an underlying mistake: your server can't handle a timeout, and the implementation is so rigid that adding a timeout makes the server crash with an unhandled exception.
Implement it more robustly: it must be able to withstand an exception; servers can't die because of a misbehaving client. If you don't fix this kind of problem now, you may have similar issues later on.
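For instance, a minimal sketch of handling the timeout on the client side (assuming Python 2's xmlrpclib; the URL and the do_work method are placeholders for your actual service):
import socket
import threading
import xmlrpclib

# give every new socket (including the ones xmlrpclib creates) a 10-second timeout
socket.setdefaulttimeout(10)

def call_service(url, payload):
    proxy = xmlrpclib.ServerProxy(url)
    try:
        return proxy.do_work(payload)
    except socket.timeout:
        # the remote server didn't answer in time; give up so the thread can exit
        print "timed out talking to", url
    except (socket.error, xmlrpclib.Fault) as exc:
        # network failure or a fault reported by the server: log it, don't crash
        print "call to", url, "failed:", exc

t = threading.Thread(target=call_service, args=("http://remote-host:8000/", "some data"))
t.start()
On the server side, the handler for each method should likewise catch its own exceptions, so a broken or disconnecting client can never take the server down.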

It seems like your real problem is that the server hangs on certain requests and dies if the client closes the socket - the threads are just a side effect of the implementation. If I'm understanding you correctly, the only way to fix this is to fix the server so it responds to all requests, or to make it more robust against network failures, or (preferably) both.

Related

How to handle a burst of connections to a port?

I've built a server listening on a specific port using Python (asyncore and sockets), and I was curious whether there is anything I can do when too many people connect to my server at once.
The code itself cannot be changed, but would adding more processes work? Or is it a hardware problem, meaning I should focus on putting a load balancer in front and spreading the requests across multiple servers?
This question is borderline between Stack Overflow (code/Python) and Server Fault (server management). I decided to go with SO because of the code, but if you think Server Fault fits better, let me know.
1.
asyncore relies on the operating system for all of the connection handling, so what you are asking is OS dependent; it has very little to do with Python. Using Twisted instead of asyncore wouldn't solve your problem.
On Windows, for example, you can only listen for 5 connections coming in simultaneously.
So the first requirement is: run it on a *nix platform.
The rest depends on how long your handlers take and on your bandwidth.
2.
What you can do is combine asyncore and threading to speed up waiting for the next connection, as sketched below.
I.e. you can make handlers that run in separate threads. It will be a little messy, but it is one possible solution.
When the server accepts a connection, instead of creating a traditional handler (which would slow down checking for the next connection, because asyncore waits until that handler has done at least a little bit of its job), you create a handler that treats read and write as non-blocking.
I.e. it starts a thread to do the job and, only once the data is ready, sends it on a following check of loop().
This way, you allow asyncore.loop() to check the server's socket more often.
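A minimal sketch of that idea (the class name is made up and the "work" is just a placeholder): the dispatcher hands the slow job to a worker thread and only reports itself writable once a result is ready, so asyncore.loop() never waits on it.
import asyncore
import threading

class ThreadedHandler(asyncore.dispatcher):
    def __init__(self, sock):
        asyncore.dispatcher.__init__(self, sock)
        self.outbox = []                 # results produced by worker threads
        self.lock = threading.Lock()

    def handle_read(self):
        data = self.recv(8192)
        if data:
            # do the slow job in a worker thread instead of blocking the loop
            threading.Thread(target=self.do_work, args=(data,)).start()

    def do_work(self, data):
        result = data.upper()            # placeholder for the real, slow job
        with self.lock:
            self.outbox.append(result)

    def writable(self):
        # only ask select() for writability once a result is waiting
        return bool(self.outbox)

    def handle_write(self):
        with self.lock:
            if self.outbox:
                self.send(self.outbox.pop(0))
The server's handle_accept() would then create a ThreadedHandler for each accepted socket instead of a traditional blocking handler.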
3.
Or you can use two different socket maps with two different asyncore.loop()s.
You use one map (dictionary) - say the default one, asyncore.socket_map - to check the server, with one asyncore.loop(), say in the main thread, dedicated only to the server.
And you start a second asyncore.loop() in a thread, using your own dictionary for the client handlers.
So one loop checks only the server that accepts connections; when a connection arrives, it creates a handler that goes into a separate map for handlers, which is checked by the other asyncore.loop() running in a thread.
This way you do not mix the server's connection checks with client handling, so the server is checked again immediately after it accepts a connection, while the other loop balances between the clients.
If you are determined to go even faster, you can exploit multiprocessor machines by having more maps for handlers.
For example, one per CPU, with as many threads running asyncore.loop()s.
Note that sockets are I/O operations backed by system calls, and select() is one too, so the GIL is released while asyncore.loop() is waiting for results. This means you get real benefit from multithreading here, and each CPU can deal with its share of clients largely in parallel.
What you would have to do is make the server distribute the load and start the threaded loops as connections arrive.
Don't forget that asyncore.loop() returns when its map empties, so the loop() in the thread that manages clients must be started when a new connection is accepted and restarted whenever there are no more connections left; the sketch below shows one way to arrange this.
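A rough sketch of that arrangement (the host, port, and echo handler are placeholders): the listening socket stays in the default asyncore.socket_map and is serviced by the main thread, while accepted clients go into a second map serviced by a loop running in a background thread.
import asyncore
import socket
import threading

client_map = {}   # accepted client handlers live in this second map

def client_loop():
    # runs in a background thread; returns as soon as client_map empties,
    # so it has to be (re)started when a new connection shows up
    asyncore.loop(map=client_map)

class EchoHandler(asyncore.dispatcher_with_send):
    def handle_read(self):
        data = self.recv(8192)
        if data:
            self.send(data)   # echo, standing in for the real handler work

class Server(asyncore.dispatcher):
    def __init__(self, host, port):
        asyncore.dispatcher.__init__(self)        # lives in asyncore.socket_map
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind((host, port))
        self.listen(128)
        self.client_thread = None

    def handle_accept(self):
        pair = self.accept()
        if pair is None:
            return
        EchoHandler(pair[0], map=client_map)      # handler goes into the second map
        if self.client_thread is None or not self.client_thread.is_alive():
            self.client_thread = threading.Thread(target=client_loop)
            self.client_thread.daemon = True
            self.client_thread.start()

if __name__ == '__main__':
    Server('0.0.0.0', 8080)   # placeholder address
    asyncore.loop()           # main thread checks only the listening socket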
4.
If you want to be able to run your server on multiple computers and use them as a cluster, then you put a process balancer in front.
I don't see a serious need for it if you wrote the asyncore server correctly and only want to run it on a single computer.

What happens when you have an infinite loop in Django view code?

Something that I just thought about:
Say I'm writing view code for my Django site, and I make a mistake and create an infinite loop.
Whenever someone would try to access the view, the worker assigned to the request (be it a Gevent worker or a Python thread) would stay in a loop indefinitely.
If I understand correctly, the server would send a timeout error to the client after 30 seconds. But what will happen with the Python worker? Will it keep on working indefinitely? That sounds dangerous!
Imagine I've got a server in which I've allocated 10 workers. I let it run and at some point, a client tries to access the view with the infinite loop. A worker will be assigned to it, and will be effectively dead until the next server restart. The dangerous thing is that at first I wouldn't notice it, because the site would just be imperceptibly slower, having 9 workers instead of 10. But then it might happen again and again throughout a long span of time, maybe months. The site would just get progressively slower, until eventually it would be really slow with just one worker.
A server restart would solve the problem, but I'd hate to have my site's functionality depend on server restarts.
Is this a real problem that happens? Is there a way to avoid it?
Update: I'd also really appreciate a way to take a stacktrace of the thread/worker that's stuck in an infinite loop, so I could have that emailed to me so I'll be aware of the problem. (I don't know how to do this because there is no exception being raised.)
Update to people saying things to the effect of "Avoid writing code that has infinite loops": In case it wasn't obvious, I do not spend my free time intentionally putting infinite loops into my code. When these things happen, they are mistakes, and mistakes can be minimized but never completely avoided. I want to know that even when I make a mistake, there'll be a safety net that will notify me and allow me to fix the problem.
It is a real problem. In the case of gevent, due to its cooperative context switching, it can even stop your website from responding immediately.
Everything depends on your environment. For example, when running Django in production through uWSGI you can set harakiri - a time in seconds after which the thread handling a request is killed if it hasn't finished handling the response. It is strongly recommended to set such a value in order to deal with faulty requests or bad code. The event is reported in the uWSGI log. I believe other solutions for running Django in production have similar options.
Otherwise, due to the network architecture, a client disconnecting will not stop the infinite loop, and by default there will be no response at all - just infinite loading. Various timeout options (harakiri being one of them) may end up producing a connection timeout - PHP, for example, has (as far as I remember) a default timeout of 30 seconds and will return a 504 Gateway Timeout. The socket disconnection timeout depends on the HTTP server settings, and it will not stop the application thread; it only closes the client socket.
If you are not using gevent (or any other green threads), an infinite loop will tend to take up 100% of the available CPU power (limited to one core), possibly eating up more and more memory, so your website will get very slow and/or time out really quickly. Django itself is not aware of request time, so - as mentioned before - your production environment stack is the way to prevent this from happening. In the case of uWSGI, http://uwsgi-docs.readthedocs.org/en/latest/Options.html#harakiri-verbose is the way to go.
Harakiri does print a stack trace of the killed process (https://uwsgi-docs.readthedocs.org/en/latest/Tracebacker.html?highlight=harakiri) straight to the uWSGI log, and through the alarm subsystem you can get notified by e-mail (http://uwsgi-docs.readthedocs.org/en/latest/AlarmSubsystem.html).
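As a rough illustration, a uWSGI configuration along those lines might look like this (the module path and numbers are placeholders, not recommendations):
[uwsgi]
# hypothetical Django project; point this at your own WSGI module
module = mysite.wsgi:application
processes = 10
# kill any worker that spends more than 30 seconds on a single request
harakiri = 30
# log extra detail when harakiri triggers
harakiri-verbose = true
# enable the Python tracebacker so the killed worker's stack can be inspected
py-tracebacker = /tmp/tbsocket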
I just tested this on Django's development server.
Results:
It does not give a timeout after 30 seconds (this might be because it's not a production server, though).
It stays loading until I close the page.
I guess one way to avoid it, short of simply never writing code like that, would be to use threading so you have control over timeouts and can stop the thread.
Maybe something like:
import threading
from django.http import HttpResponse

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        print "your possible infinite loop code here"

def possible_loop_view(request):
    thread = MyThread()
    thread.start()
    return HttpResponse("html response")
Yes, your analysis is correct. The worker thread/process will keep running. Moreover, if there is no wait/sleep in the loop, it will hog the CPU; other threads/processes will get very little CPU time, resulting in slow responses across your entire site.
Also, I don't think the server will send any timeout error to the client explicitly. If a TCP timeout is set, the TCP connection will simply be closed.
The client may also have its own timeout for getting a response, which may come into play.
Avoiding such code is, of course, the best prevention. You can also have a monitoring tool on the server watch CPU/memory usage and notify you of abnormal activity so that you can take action.
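Regarding the update about getting a stack trace of the stuck worker: one generic trick, sketched here under the assumption of a Unix worker process (and separate from uWSGI's tracebacker), is to install a signal handler that dumps every thread's stack via sys._current_frames(), then send that signal to the suspect process:
import signal
import sys
import traceback

def dump_stacks(signum, frame):
    # print the current stack of every thread in this process;
    # in production you could format this and e-mail it to yourself instead
    for thread_id, stack in sys._current_frames().items():
        print "Thread %s:" % thread_id
        traceback.print_stack(stack)

# somewhere during startup of the worker process:
signal.signal(signal.SIGUSR1, dump_stacks)
# then, from a shell: kill -USR1 <worker pid> to see where each thread is stuck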

multiprocessing and sockets. How to wait?

I have a cluster with 4 nodes and a master server. The master dispatches jobs that may take from 30 seconds to 15 minutes to end.
The nodes are listening with a SocketServer.TCPServer and in the master, I open a connection and wait for the job to end.
import multiprocessing

def run(nodes, args):
    pool = multiprocessing.Pool(len(nodes))
    return pool.map(load_job, zip(nodes, args))
The load_job function sends the data with socket.sendall and, right after that, calls socket.recv (the reply takes a long time to arrive).
The program runs fine until about 200 or 300 of these jobs have run. When it breaks, socket.recv receives an empty string, and no more jobs can run until I kill the node processes and start them again.
How should I wait for the data to come? Also, error handling in the pool is very poor, because it captures the error from the other process and shows it without the proper traceback, and the error is not common enough to be easy to reproduce...
EDIT:
Now I think this problem has nothing to do with sockets:
After some research, it looks like my nodes are opening way too many processes (because they also run their jobs in a multiprocessing.Pool), and somehow those processes are not being closed!
I found these SO questions (here and here) talking about zombie processes when using multiprocessing in a daemonized process (exactly my case!).
I'll need to further understand the problem, but for now I'm killing the nodes and restoring them after some time.
(I'm replying to the question before the edit, because I don't understand exactly what you meant in it).
socket.recv is not the best way to wait for data on a socket. The best way I know of is to use the select module (documentation here). The simplest use, when waiting for data on a single socket, is select.select([your_socket], [], []), but it can certainly be used for more complex tasks as well.
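For example, a small sketch of that idea (the helper name and timeout value are made up):
import select
import socket

def recv_with_timeout(sock, timeout=60.0, bufsize=4096):
    # wait until the socket is readable, then recv once;
    # returns None on timeout instead of blocking forever
    ready, _, _ = select.select([sock], [], [], timeout)
    if not ready:
        return None
    data = sock.recv(bufsize)
    if data == '':
        # an empty string means the peer closed the connection (see below)
        raise socket.error("peer closed the connection")
    return data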
Regarding socket.recv returning an empty string: when the socket is a TCP socket (as it is in your case), this means the socket has been closed by the peer.
The reasons for this may vary, but the important thing to understand is that after this happens you will no longer receive any data from this socket, so the best thing you can do is close it (socket.close). If you didn't expect it to be closed, this is where you should look for the problem.
Good luck!

python SocketServer stuck on waitpid() syscall

I am using Python (2.7) SocketServer with ForkingMixIn. It worked well.
However, sometimes under heavy usage (tons of rapidly connecting/disconnecting clients) the "server" gets stuck, consuming all the idle CPU (top shows it at 100% CPU). If I strace the process from the CLI, it shows an endless sequence of waitpid() syscalls. According to ps, there are no child processes at this point, though.
After this happens my server implementation becomes unusable and only restarting it helps :( Clients can connect but get no answer; I guess only the "backlog" queue on the OS side is being used, and the Python code never accepts the connections.
It can easily be reproduced, e.g. with a primitive HTTP implementation and a browser (I used Chrome) with CTRL-R (reload) held down for about 10 seconds. Of course the problem is also triggered without this "brutal" test during normal usage, just more rarely, and it was quite hard even to come up with an idea of what the problem could be. I wrote my own implementation of something like SocketServer with os.fork() and the socket functions, and it does not have this problem, but I'd be happier with an already written, standard solution.
The problem: this is not a nice situation, as my script implementing a server can be DoS'ed very easily this way.
What I noticed: I installed a signal handler for SIGCHLD. It seems that if I remove it, I can't reproduce the problem; however, then I see zombie processes (I guess because they are never wait()'ed for). Even if I install the handler with signal.SIG_IGN, I experience this problem.
Can anybody help with what the problem might be and how I can solve it? I'd like to use a signal handler anyway, since it's also not nice to leave many zombie processes behind, especially after a long run.
Thanks for any idea.
maybe related: What is the cost of many TIME_WAIT on the server side?
It is possible that you have all your max connections sitting in TIME_WAIT state.
Check sysctl net.core.somaxconn for the maximum number of connections.
Check sysctl net.ipv4 for other configuration details (e.g. the tcp_tw_* settings).
Check ulimit -n for the maximum number of open file descriptors (sockets included).
You can try sysctl net.ipv4.tcp_tw_reuse=1 to reuse those sockets quickly (don't keep it enabled unless you know what you're doing).
Check for file handle leaks.
[not-so] stupid question: how is your SocketServer implementation different from the standard one + ForkingMixIn?
However, it is really easy to abuse a ForkingMixIn (fork bomb), so you might want to use green threads instead, e.g. the eventlet library ( http://eventlet.net/doc/index.html ).
This might be your problem - see:
http://bugs.python.org/issue7978
http://mail.python.org/pipermail/python-bugs-list/2010-April/095492.html
http://twistedmatrix.com/trac/ticket/733
There you will see that a SIGCHLD handler is discouraged unless you take some extra measures (calling signal.siginterrupt(signal.SIGCHLD, False) in the handler, or using a wake-up fd in the select() call).
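A minimal sketch of the kind of handler those links suggest - reap every exited child with WNOHANG in a loop, and keep the signal from interrupting blocking calls such as accept() or select():
import os
import signal

def reap_children(signum, frame):
    # collect every child that has already exited, without blocking
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:      # ECHILD: no children left at all
            break
        if pid == 0:         # children exist, but none have exited yet
            break

signal.signal(signal.SIGCHLD, reap_children)
# ask the OS to restart interrupted syscalls instead of failing them with EINTR
signal.siginterrupt(signal.SIGCHLD, False)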

Multi-Threading and Asynchronous sockets in python

I'm quite new to python threading/network programming, but have an assignment involving both of the above.
One of the requirements of the assignment is that for each new request I spawn a new thread, but I need to both send to and receive from the browser at the same time.
I'm currently using the asyncore library in Python to catch each request, but as I said, I need to spawn a thread for each request, and I was wondering whether using both threads and asynchronous sockets is overkill or the correct way to do it.
Any advice would be appreciated.
Thanks
EDIT:
I'm writing a proxy server, and I'm not sure whether my client connections are persistent. My client is my browser (I'm using Firefox for simplicity).
It seems to reconnect for each request. My problem is that if I open one tab with http://www.google.com and another with http://www.stackoverflow.com, I only get one request at a time from each tab, instead of multiple requests from Google and from SO.
I answered a question that sounds amazingly similar to yours, where someone had a homework assignment to create a client/server setup with each connection handled in a new thread: https://stackoverflow.com/a/9522339/496445
The general idea is that you have a main server loop constantly looking for a new connection to come in. When one does, you hand it off to a thread, which then does its own monitoring for new communication.
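A stripped-down sketch of that pattern (the port is arbitrary, and the echo "handling" is a placeholder - a real proxy would parse the request and forward it upstream):
import socket
import threading

def handle_client(conn, addr):
    # per-connection worker thread: talk to this client until it hangs up
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break            # client closed the connection
            conn.sendall(data)   # placeholder for the real proxy work
    finally:
        conn.close()

def serve(host='0.0.0.0', port=8080):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(128)
    while True:
        # main loop: block until a client connects, then hand it to a thread
        conn, addr = server.accept()
        t = threading.Thread(target=handle_client, args=(conn, addr))
        t.daemon = True
        t.start()

if __name__ == '__main__':
    serve()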
An extra bit about asyncore vs threading
From the asyncore docs:
There are only two ways to have a program on a single processor do “more than one thing at a time.” Multi-threaded programming is the simplest and most popular way to do it, but there is another very different technique, that lets you have nearly all the advantages of multi-threading, without actually using multiple threads. It’s really only practical if your program is largely I/O bound. If your program is processor bound, then pre-emptive scheduled threads are probably what you really need. Network servers are rarely processor bound, however.
As this quote suggests, using asyncore and threading should, for the most part, be mutually exclusive options. My link above is an example of the threading approach, where the server loop (either in a separate thread or the main one) makes a blocking call to accept a new client. When it gets one, it spawns a thread that continues to handle the communication, and the server goes back into the blocking call.
In the asyncore pattern, you would instead use its async loop, which in turn calls your own registered callbacks for the various activity that occurs. There is no threading here, but rather a polling of all the open file handles for activity. You get the sense of doing things concurrently, but under the hood everything is scheduled serially.
