What can cause a “Resource temporarily unavailable” on sock connect() command - python

I am debugging a Python flask application. The application runs atop uWSGI configured with 6 threads and 1 process. I am using Flask-Executor to offload some slower tasks. These tasks create a connection with the Flask application, i.e., the same process, and perform some HTTP GET requests. The executor is configured to use 2 threads max. This application runs on Ubuntu 16.04.3 LTS.
Every once in a while the threads in the executor completely stop working. The code uses the Python requests library to do the requests. The underlying error message is:
Action failed. HTTPSConnectionPool(host='somehost.com', port=443): Max retries exceeded with url: /api/get/value (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f8d75bb5860>: Failed to establish a new connection: [Errno 11] Resource temporarily unavailable',))
The code that is running within the executor looks like this:
import requests

adapter = requests.adapters.HTTPAdapter(max_retries=3)
session = requests.Session()
session.mount('http://somehost.com:80', adapter)
session.headers.update({'Content-Type': 'application/json'})
...
session.get(uri, params=params, headers=headers, timeout=3)
I've spent a good amount of time trying to peel back the Python requests stack down to the C sockets it uses. I've also tried reproducing this error with small C and Python programs. At first I thought sockets were not getting closed and we were running out of allowable sockets as a resource, but that gives a message more along the lines of "too many open files".
Setting aside the Python stack, what could cause an [Errno 11] Resource temporarily unavailable on a socket connect() call? Also, if you've run into this using requests, are there arguments I could pass in to prevent it?
I've seen the What can cause a "Resource temporarily unavailable" on sock send() command StackOverflow post, but that's about a send() call, not the initial connect(), which is where I suspect the code is getting hung up.

The error message Resource temporarily unavailable corresponds to the error code EAGAIN.
The connect() manpage states that the error EAGAIN occurs in the following situation:
No more free local ports or insufficient entries in the routing cache. For AF_INET see the description of /proc/sys/net/ipv4/ip_local_port_range in ip(7) for information on how to increase the number of local ports.
This can happen when very many connections to the same IP/port combination are in use and no local port for automatic binding can be found. You can check with
netstat -tulpen
to see exactly which connections are causing this.
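A quick way to get a feel for your headroom is to compare the size of the ephemeral port range with the number of sockets already connected to the endpoint. Below is a rough sketch of doing that by reading /proc directly; it is Linux-only, assumes IPv4 on a little-endian machine, and the IP address is just a placeholder:

import socket

def ephemeral_port_range():
    # The kernel hands out local ports for connect() from this range.
    with open('/proc/sys/net/ipv4/ip_local_port_range') as f:
        low, high = map(int, f.read().split())
    return low, high

def count_connections_to(ip, port):
    # /proc/net/tcp stores the remote endpoint as HEXIP:HEXPORT, with the
    # IPv4 address byte-reversed on little-endian machines.
    target = '%s:%04X' % (
        ''.join('%02X' % b for b in reversed(socket.inet_aton(ip))), port)
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header line
        return sum(1 for line in f if line.split()[2] == target)

low, high = ephemeral_port_range()
print('ephemeral ports available:', high - low + 1)
print('sockets to endpoint:', count_connections_to('203.0.113.10', 443))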

Related

Azure functions python - how to prevent SNAT port exhaustion?

So I have an Azure Functions app written in Python, and quite often the code throws an error like this:
HTTPSConnectionPool(host='www.***.com', port=443): Max retries exceeded with url: /x/y/z (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7faba31d0438>: Failed to establish a new connection: [Errno 110] Connection timed out',))
This happens in a few different functions that make https connections.
I contacted support and they told me that this was caused by SNAT port exhaustion and advised me to: "Modify the application to reuse connections instead of creating a connection per request, use connection pooling, use service endpoints if you are connecting to resources in Azure." They sent me this link https://4lowtherabbit.github.io/blogs/2019/10/SNAT/ and also this https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections
The problem is I am unsure how to practically reuse and/or pool connections in Python, and I am unsure what the primary cause of the exhaustion is, as this data is not publicly available.
So I am looking for help with applying their advice to all our http(s) and database connections.
I made the assumption that pymongo and pyodbc (the database clients we use) would handle pooling and reuse despite me creating a new client each time a function runs. Is this incorrect, and if so, how do I reuse these database clients in Python to prevent this?
The problem has so far only occurred when using requests (or the zeep SOAP library, which internally defaults to using requests) to hit an https endpoint. Is there any way I could improve how I use requests, like reusing sessions or closing connections explicitly? I am aware that requests creates a session in the background when calling requests.get, but my knowledge of the library is insufficient to figure out whether this is the problem and how I could solve it. I am thinking I might be able to create and reuse a single session instance for each specific http(s) call in each function, but I am unsure if this is correct, and I have no idea how to actually do it.
In a few places I also use aiohttp and, if possible, would like to achieve the same thing there.
I haven't looked into service endpoints yet but I am about to.
So in short: what can I do in practice to ensure reuse/pooling with requests, pyodbc, pymongo and aiohttp?
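For the requests part, the usual pattern is to create the Session once at module level and reuse it across invocations, so the underlying connection pool (and its keep-alive connections) survives between calls within the same worker process. A minimal sketch; the entry point name and URL are placeholders, not from the original code:

import requests

# Created once per worker process, at module import time, so every
# invocation reuses the same connection pool instead of opening new sockets.
SESSION = requests.Session()

def main(req):  # hypothetical Azure Functions entry point
    # Reuses a pooled (possibly keep-alive) connection where one is available.
    resp = SESSION.get('https://www.example.com/x/y/z', timeout=10)
    return resp.text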

I wish to avoid overrunning the http connections pool

I am creating a tool that will run many simultaneous calls to a RESTful API. I am using the python "Requests" module and the "threading" module. Once I stack too many simultaneous gets on the system I am getting exceptions like this:
ConnectionError: HTTPConnectionPool(host='xxx.net', port=80): Max retries exceeded with url: /thing/subthing/ (Caused by : [Errno 10055] An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full)
What can I do to either increase the buffer and queue space, or ask the Requests module to wait for an available slot?
(I know I could stuff it in a "try" loop, but that seems clumsy)
Use a session. If you use the requests.request family of methods (get, post, ...), each request will use its own session with its own connection pool, therefore not making any use of connection pooling.
If you need to fine-tune the number of connections used within a session, you can do this by changing its HTTPAdapter.
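For example (the numbers here are purely illustrative), mounting an adapter with a larger pool and pool_block=True makes requests wait for a free connection instead of spilling over:

import requests

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=10,  # number of host pools to cache
    pool_maxsize=50,      # connections kept per host pool
    pool_block=True,      # wait for a free connection rather than opening extras
)
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('http://xxx.net/thing/subthing/', timeout=5)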

Python -- ConnectionError: Max retries exceeded

I occasionally get this error when my server (call it Server A) makes requests to a resource on another one of my servers (call it Server B):
ConnectionError: HTTPConnectionPool(host='some_ip', port=some_port): Max retries exceeded with url: /some_url/ (Caused by : [Errno 111] Connection refused)
The message in the exception is
message : None: Max retries exceeded with url: /some_url/ (Caused by redirect)
which I include because it has that extra piece of information (caused by redirect).
As I said, I control both servers involved in this request, so I can make changes to either and/or both. Also, the error appears to be intermittent, in that it doesn't happen every time.
Potentially relevant information -- Server A is a Python server running apache, and Server B is a NodeJS server. I am not exactly a web server wizard, so beyond that, I'm not exactly sure what information would be relevant.
Does anyone know exactly what this error means, or how to go about investigating a fix? Or, does anyone know which server is likely to be the problem, the one making the request, or the one receiving it?
Edit: The error has begun happening with our calls to external web resources also.
You are getting a Connection Refused on "some_ip" and port. That's likely caused by:
- No server actually listening on that port/IP combination
- Firewall settings that send Connection Refused (a less likely cause)
- A misconfigured (more likely) or busy server that cannot handle requests
I believe you are getting that error when server A is trying to connect to server B. (Assuming it's Linux and/or some unix derivative) what does netstat -ln -tcp show on the server? (man netstat to understand the flags; what we are doing here is trying to find which programs are listening on which ports.) If that indeed shows your server B listening, run iptables -L -n to show the firewall rules. If nothing's wrong there, it's most probably a bad configuration of the listen queue (http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/023/2333/2333s2.html), or google for "listen backlog".
This is most likely a bad configuration issue on your server B. (Note: a redirect loop, as someone mentioned above, that isn't handled correctly could just end up making the server busy! So possibly solving that could solve your problem as well.)
If you're using gevent on your python server, you might need to upgrade the version. It looks like there's just some bug with gevent's DNS resolution.
This is a discussion from the requests library: https://github.com/kennethreitz/requests/issues/1202#issuecomment-13881265
This looks like a redirect loop on the Node side.
You mention server B is the node server, you can accidentally create a redirect loop if you set up the routes incorrectly. For example, if you are using express on server B - the Node server, you might have two routes, and assuming you keep your route logic in a separate module:
var routes = require(__dirname + '/routes/router')(app);
//... express setup stuff like app.use & app.configure
app.post('/apicall1', routes.apicall1);
app.post('/apicall2', routes.apicall2);
Then your routes/router.js might look like:
module.exports = Routes;

function Routes(app){
    var self = this;
    if (!(self instanceof Routes)) return new Routes(app);
    //... do stuff with app if you like
}

Routes.prototype.apicall1 = function(req, res){
    res.redirect('/apicall2');
}

Routes.prototype.apicall2 = function(req, res){
    res.redirect('/apicall1');
}
That example is obvious, but you might have a redirect loop hidden in a bunch of conditions in some of those routes. I'd start with the edge cases, like what happens at the end of the conditionals within the routes in question, what is the default behavior if the call for example doesn't have the right parameters and what is the exception behavior?
As an aside, you can use something like node-validator (https://github.com/chriso/node-validator) to help determine and handle incorrect request or post parameters
// Inside router/routes.js:
var check = require('validator').check;

function Routes(app){ /* setup stuff */ }

Routes.prototype.apicall1 = function(req, res){
    try{
        check(req.params.csrftoken, 'Invalid CSRF').len(6,255);
        // Handle it here, invoke appropriate business logic or model,
        // or redirect, but be careful! res.redirect('/secure/apicall2');
    }catch(e){
        // Here you could log the error, but don't accidentally create a redirect loop;
        // send an appropriate response instead
        res.send(401);
    }
}
To help determine whether it is a redirect loop, you can do one of several things: use curl to hit the url with the same post parameters (assuming it is a post; otherwise you can just use Chrome, which will error out in the console if it notices a redirect loop), or write to stdout on the Node server or syslog from inside the offending route(s).
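If you'd rather probe it from Python than curl, a small sketch along these lines (the URL is a placeholder) follows the redirect chain by hand and reports a loop:

import requests

def find_redirect_loop(url, max_hops=10):
    seen = []
    for _ in range(max_hops):
        # Don't follow redirects automatically; inspect each hop ourselves.
        resp = requests.post(url, allow_redirects=False, timeout=5)
        location = resp.headers.get('Location')
        if location is None:
            return None            # chain ended normally, no loop
        url = requests.compat.urljoin(url, location)
        if url in seen:
            return seen + [url]    # loop detected
        seen.append(url)
    return seen                    # too many hops, probably a loop

print(find_redirect_loop('http://server-b.example/apicall1'))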
Hope that helps. Good thing you mentioned the "caused by redirect" part; that, I think, is the problem.
The example above uses express to describe the situation, but of course the problem can exist using just connect, other frameworks, or even your own handler code if you aren't using any frameworks or libraries at all. Either way, I'd make it a habit to put in good parameter checking and always test your edge cases; I've run into this problem exactly when I've been in a hurry in the past.

PyAPNS SSL3_WRITE_PENDING error

I'm using the PyAPNS module and the Bottle framework in a demo of my app to send push notifications to all registered devices.
At the beginning everything works fine; I followed the manual for PyAPNS. But after my service has been running in the background on the server for some time, I start to receive this error:
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
After restarting the service everything works fine again. What should I do about this? And how should I run such a service in the background? (For now I'm just running it in another screen session.)
I had the same issue as you did when using this library (I'm assuming you are in fact using https://github.com/simonwhitaker/PyAPNs, which is what I'm using. There is at least one other lib out there with a similar name, but I don't think you'd be using that).
AFAIK when you're using the simple notification service the APNS server might hang up on you for reasons including using an incorrect token, having a malformed request, etc. Or your connection might simply get broken if your network connection drops out. The PyAPNS code doesn't handle such a hangup very gracefully right now and attempts to re-use the socket even after it has been closed. My experience with the SSL3_WRITE_PENDING error was that I would always see an error such as "error: [Errno 110] Connection timed out" on the socket before I would then get SSL3_WRITE_PENDING when PyAPNS tried to re-use the socket.
If you are seeing the server hangup on you and you want to know why it's doing that, it helps to use the enhanced version of APNS, so that the server will write back info about what you did wrong.
As it happens, there is currently a pull request (https://github.com/simonwhitaker/PyAPNs/pull/23/files) that both moves PyAPNS to use enhanced APNS AND handles disconnections more gracefully. You'll see I commented on that pull request and have created my own fork of PyAPNS that handles disconnections in the way that suited my use case the best.
So you can use the code from the pull request to perhaps find out why the APNS server is hanging up on you. And/or you could use it to simplify your failure recovery so you just retry the send if an exception is thrown rather than having to re-create the APNS object.
Hopefully the pull request will be merged to master soon (possibly including my changes as well).
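In the meantime, a blunt workaround is to catch the failure, rebuild the APNs object and retry the send. This is only a sketch: the constructor arguments and send_notification call follow the PyAPNs README and may differ in your version, and the certificate paths are placeholders.

from apns import APNs, Payload

def send_with_retry(token_hex, alert, attempts=3):
    for attempt in range(attempts):
        try:
            # Re-create the connection on each attempt so a socket that APNS
            # has already hung up on is never reused.
            apns = APNs(use_sandbox=True, cert_file='cert.pem', key_file='key.pem')
            apns.gateway_server.send_notification(token_hex, Payload(alert=alert))
            return
        except Exception:
            if attempt == attempts - 1:
                raise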

104, 'Connection reset by peer' socket error, or When does closing a socket result in a RST rather than FIN?

We're developing a Python web service and a client web site in parallel. When we make an HTTP request from the client to the service, one call consistently raises a socket.error in socket.py, in read:
(104, 'Connection reset by peer')
When I listen in with wireshark, the "good" and "bad" responses look very similar:
- Because of the size of the OAuth header, the request is split into two packets. The service responds to both with ACK.
- The service sends the response, one packet per header (HTTP/1.0 200 OK, then the Date header, etc.). The client responds to each with ACK.
- (Good request) The server sends a FIN, ACK. The client responds with a FIN, ACK. The server responds with an ACK.
- (Bad request) The server sends a RST, ACK; the client doesn't send a TCP response; the socket.error is raised on the client side.
Both the web service and the client are running on a Gentoo Linux x86-64 box running glibc-2.6.1. We're using Python 2.5.2 inside the same virtualenv.
The client is a Django 1.0.2 app that is calling httplib2 0.4.0 to make requests. We're signing requests with the OAuth signing algorithm, with the OAuth token always set to an empty string.
The service is running Werkzeug 0.3.1, which is using Python's wsgiref.simple_server. I ran the WSGI app through wsgiref.validate with no issues.
It seems like this should be easy to debug, but when I trace through a good request on the service side, it looks just like the bad request, in the socket._socketobject.close() function, turning delegate methods into dummy methods. When the send or sendto (can't remember which) method is switched off, the FIN or RST is sent, and the client starts processing.
"Connection reset by peer" seems to place blame on the service, but I don't trust httplib2 either. Can the client be at fault?
** Further debugging - Looks like server on Linux **
I have a MacBook, so I tried running the service on one and the client website on the other. The Linux client calls the OS X server without the bug (FIN ACK). The OS X client calls the Linux service with the bug (RST ACK, and a (54, 'Connection reset by peer')). So, it looks like it's the service running on Linux. Is it x86_64? A bad glibc? wsgiref? Still looking...
** Further testing - wsgiref looks flaky **
We've gone to production with Apache and mod_wsgi, and the connection resets have gone away. See my answer below, but my advice is to log the connection reset and retry. This will let your server run OK in development mode, and solidly in production.
I've had this problem. See The Python "Connection Reset By Peer" Problem.
You have (most likely) run afoul of small timing issues based on the Python Global Interpreter Lock.
You can (sometimes) correct this with a time.sleep(0.01) placed strategically.
"Where?" you ask. Beats me. The idea is to provide some better thread concurrency in and around the client requests. Try putting it just before you make the request so that the GIL is reset and the Python interpreter can clear out any pending threads.
Don't use wsgiref for production. Use Apache and mod_wsgi, or something else.
We continue to see these connection resets, sometimes frequently, with wsgiref (the backend used by the werkzeug test server, and possibly others like the Django test server). Our solution was to log the error, retry the call in a loop, and give up after ten failures. httplib2 tries twice, but we needed a few more. They seem to come in bunches as well - adding a 1 second sleep might clear the issue.
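Roughly, the retry loop looked like this (a sketch, not our exact code; the URL and limits are illustrative):

import logging
import socket
import time

import httplib2

def get_with_retry(url, attempts=10):
    http = httplib2.Http()
    for attempt in range(1, attempts + 1):
        try:
            return http.request(url, 'GET')
        except socket.error as e:
            logging.warning('connection reset on attempt %d: %s', attempt, e)
            time.sleep(1)  # the resets come in bunches; a short sleep helps
    raise RuntimeError('gave up after %d attempts' % attempts)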
We've never seen a connection reset when running through Apache and mod_wsgi. I don't know what they do differently, (maybe they just mask them), but they don't appear.
When we asked the local dev community for help, someone confirmed that they see a lot of connection resets with wsgiref that go away on the production server. There's a bug there, but it is going to be hard to find it.
Normally, you'd get an RST if you do a close which doesn't linger (i.e. in which data can be discarded by the stack if it hasn't been sent and ACK'd) and a normal FIN if you allow the close to linger (i.e. the close waits for the data in transit to be ACK'd).
Perhaps all you need to do is set your socket to linger so that you remove the race condition between a non-lingering close done on the socket and the ACKs arriving?
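For reference, turning lingering on for a plain Python socket looks roughly like this (the 5-second timeout is arbitrary):

import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# l_onoff=1 enables SO_LINGER; l_linger=5 makes close() wait up to 5 seconds
# for unsent data to be transmitted and acknowledged before tearing down.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 5))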
I had the same issue, though in my case it was with an upload of a very large file from a python-requests client posting to an nginx + uwsgi backend.
What ended up being the cause was that the backend had a cap on the max file size for uploads that was lower than what the client was trying to send.
The error never showed up in our uwsgi logs since this limit was actually one imposed by nginx.
Upping the limit in nginx removed the error.
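If nginx is the layer imposing the cap, the directive involved is typically client_max_body_size (the value below is just an example; the default is 1m):

# in nginx.conf, at http, server or location level
client_max_body_size 100m;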
