Python urllib2: Cannot assign requested address

I am sending thousands of requests using urllib2 with proxies. I have received many of the following errors on execution:
urlopen error [Errno 99] Cannot assign requested address
I read here that it may be due to a socket already being bound. Is that the case? Any suggestions on how to fix this?

Here is an answer to a similar looking question that I prepared earlier.... much earlier...
Socket in use error when reusing sockets
The error is different, but the underlying problem is probably the same: you are consuming all available ports and trying to reuse them before the TIME_WAIT state has ended.
[EDIT: in response to comments]
If it is within the capability/spec for your application, one obvious strategy is to control the rate of connections to avoid this situation.
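For example, a minimal sketch of pacing connections with urllib2 (MAX_PER_SECOND and urls are hypothetical placeholders; a token bucket or a worker pool with a bounded queue would be smoother):
import time
import urllib2

MAX_PER_SECOND = 50               # hypothetical budget; tune it to stay within the ephemeral-port supply
interval = 1.0 / MAX_PER_SECOND

for url in urls:                  # 'urls' stands in for however you generate requests
    response = urllib2.urlopen(url)
    response.read()
    response.close()              # give the socket back promptly
    time.sleep(interval)          # crude pacing between connections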
Alternatively, you could use the httplib module. httplib.HTTPConnection() lets you specify a source_address tuple with which you can specify the port from which to make the connection, e.g. this will connect to localhost:1234 from localhost:9999:
import httplib

# Bind the outgoing socket to local port 9999, then connect to localhost:1234.
conn = httplib.HTTPConnection('localhost:1234', source_address=('localhost', 9999))
conn.request('GET', '/index.html')
Then it is a matter of managing the source port assignment as described in my earlier answer. If you are on Windows you can use this method to get around the default range of ports 1024-5000.
There is, of course, an upper limit to how many connections you are going to be able to make, and it is questionable what sort of application would require making thousands of connections in rapid succession.

As mhawke suggested, the issue of TIME_WAIT seems most likely. The system-wide fix for your situation can be to adjust kernel parameters so such connections are cleaned up more often. Two options:
$ sysctl net.ipv4.tcp_tw_recycle=1
This will let the kernel reuse connections in the TIME_WAIT state. It may cause issues with NAT setups. Another option is:
$ sysctl net.ipv4.tcp_max_orphans=8192
$ sysctl net.ipv4.tcp_orphan_retries=1
This tells the kernel to keep at most 8192 connections not attached to any user process and only retry once before killing TCP connections.
Note that these are not permanent changes. Add the settings to /etc/sysctl.conf to make them permanent.
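For reference, a sketch of the matching /etc/sysctl.conf entries (assuming you want all three settings; apply them with sysctl -p):
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_orphan_retries = 1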
http://code.google.com/p/lusca-cache/issues/detail?id=89#c4
http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.obscure.html

I had a similar issue, but I was using POST requests with Python's requests library!
To make it worse, I used multiprocessing over each executor to POST to a server. So thousands of connections were created in seconds, and each took a few seconds to leave the TIME_WAIT state and release its port for the next set of connections.
Out of all the solutions available on the internet that talk about disabling keep-alive, using requests.Session(), et al., I found this answer to work; it makes use of 'Connection': 'close' as a header parameter. You may need to put the header content on a separate line outside the post command, though.
import requests

headers = {
    'Connection': 'close'
}

with requests.Session() as session:
    # 'files' is assumed to be prepared earlier in the original code
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print(results)
Just give it a try with the requests library.


How do I keep a FTP connection alive?

I used ftputil to download a batch of files from an FTP server. It raised the error ftputil.error.FTPIOError: [Errno 60] Operation timed out.
As described in the ftputil documentation,
keep_alive() attempts to keep the connection to the remote server active in order to prevent timeouts from happening. This method is primarily intended to keep the underlying FTP connection of an FTPHost object alive while a file is uploaded or downloaded. This will require either an extra thread while the upload or download is in progress or calling keep_alive from a callback function.
I called keep_alive from a callback function with:
ftp_host.download(source, target, callback=ftp_host.keep_alive)
but it raised ERROR __main__ keep_alive() takes 1 positional argument but 2 were given.
How do I keep a FTP connection alive?
This isn't directly an answer to your question, but it may help finding an answer for your particular problem yourself. Also, a ticket on the ftputil website is better for help with debugging a problem. That said, I think it's fine to ask on StackOverflow first since you don't know in advance if the problem is a simple one or not. :-)
Since FTP is a stateful protocol, client and server can't send arbitrary commands at a given time. The allowed commands and possibly replies are determined by the state the connection is in. See also the state diagrams in RFC 959.
To work around this limitation, ftputil creates a new FTP connection behind the scenes for each remote file object [1]. With this approach, you can still send commands like chdir or start a download while another is still in progress. However, this means that from the perspective of the server, all these FTP connections that come from a single FTPHost object are independent connections, so each of these connections can have their timeout at different times, depending on the usage pattern of the respective connection.
For example, there was ftputil ticket 141, where presumably the main connection initiated by the FTPHost object timed out while a connection used for downloading was still usable.
In your case, it might be helpful to find out which of the underlying connections is timing out (the initial connection or a connection for a remote file). You can use ftputil.session.session_factory to create factories that have FTP debugging enabled (see the documentation).
Unfortunately, a timeout of 60 seconds is quite short, so there are relatively many chances for timeouts.
Especially given the possibility of timeouts in FTP connections, my advice is to write software for FTP transfers in a way so that you can restart the operation (ideally with a new FTPHost object for robustness) where it was interrupted by the timeout. So far I haven't been able to come up with a way to universally work around timeouts. In simple cases you may actually be better off using ftplib directly, although ftputil has robustness and latency improvements that ftplib doesn't have. Using ftplib doesn't save you from timeouts, but at least you don't have any "hidden" connections that may make debugging more difficult.
[1] That said, if you close a remote file in ftputil, the underlying FTP connection can be reused as long as it hasn't timed out. The library checks for a timeout before it reuses the connection.
The picture regarding timeouts is even more complicated by ftputil caching a lot of information from the server to reduce latency. For example, if you call FTPHost.getcwd(), the current directory is retrieved from a cached attribute, not by sending a PWD command to the server and thereby resetting the timeout. Stat information from directory listings is also usually cached.
After a couple of hours looking for solutions, I got it running without '421 Timeout' errors by calling keep_alive from a separate thread. However, your I/O timeout error was probably caused by connection problems.
import ftputil
from threading import Thread
from time import sleep

fhandle = ftputil.FTPHost('host', 'user', 'pwd')
quitThread = 0

def _thread_keep_alive():
    # Ping the control connection every 25 seconds until the main code is done.
    while quitThread == 0:
        print("KEEPALIVE!")
        fhandle.keep_alive()
        sleep(25)

thread = Thread(target=_thread_keep_alive)
thread.start()

# some downloading...

quitThread = 1
fhandle.close()
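If you would rather stay with the callback approach from the question: the "keep_alive() takes 1 positional argument but 2 were given" error happens because ftputil calls the download callback with each data chunk, while keep_alive() takes no arguments. A small wrapper that discards the chunk should work (a sketch, untested against your server):
# The callback receives each downloaded chunk; ignore it and just poke the connection.
ftp_host.download(source, target, callback=lambda chunk: ftp_host.keep_alive())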

Ignore a request in Flask

I want to not answer a request handled by Flask. I don't want to return any error code, data, or an answer at all.
What I am trying to accomplish is this: there is an endpoint that takes sensor data and does not need to return any information. The clients POST the data to this endpoint, but they do not wait for an answer and shut down (I have no control over the clients). So I'm seeing the following error: "[Errno 10053] An established connection was aborted by the software in your host machine". That made me ask myself: why do I even respond to these requests?
I can think of two reasons to do something like this:
You have a "friend" that you want to prevent from accessing your site, or
You have the misguided notion that this will help prevent (D)DoS attacks.
When you say "ignore a request totally": generally speaking, you can't actually do that, unless you know the IP address the traffic is coming from. Then you can instruct your OS, network card, router, switch, load balancer, maybe even your ISP to filter out the traffic coming from that IP.
Otherwise, you're kind of out of luck because of how the Internet works.
HTTP works over TCP*. Specifically the client process looks something like this:
1. Translate DNS (e.g. google.com) to an IP address (e.g. 216.58.218.174)
2. Open up a TCP connection to 216.58.218.174:80 (using Google for the example)
3. Send the HTTP header over to Google:
   GET / HTTP/1.1
4. Read the response
Once that TCP/IP connection has been created to your server, at the very least you're going to have to terminate the connection.
There's really no good way to do this from within Python itself, and certainly not within Flask.
As you've updated your question, it turns out you really don't have to change anything; Flask is already handling the error behind the scenes. It may be routing the message to a specific logger that you might be able to handle if you really don't want to see the messages, but it's not really important.
The only thing you may want to do, if your return processing is expensive (like tying up the database with a several second long query) is look into streaming your response instead, which will fail much more cheaply.
*Mostly. Sure you can do it over UDP, but you probably aren't
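Coming back to the streaming suggestion above, here is a minimal sketch (expensive_report is a hypothetical stand-in for your own processing; the point is that the body is produced inside a generator, so an aborted connection fails early and cheaply instead of tying up the database):
from flask import Flask, Response, request

app = Flask(__name__)

@app.route('/sensor', methods=['POST'])
def sensor():
    payload = request.get_data()          # read the POST body up front
    def generate():
        # The (possibly expensive) reply is built lazily; if the client has
        # already hung up, writing it out fails quickly.
        yield expensive_report(payload)   # hypothetical helper, for illustration only
    return Response(generate(), mimetype='text/plain')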

Python Requests, warning: urllib3.connectionpool:Connection pool is full

I'm using the requests library in python 3 and despite my best efforts I can't get the following warning to disappear:
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: myorganization.zendesk.com
I'm using requests in a multithreaded environment to get and post json files concurrently to a single host, definitely no subdomains. In this current set up I'm using just 20 threads.
I attempted to use a Session in order to get requests to reuse connections and thus get rid of the problem, but it hasn't worked. This is the code in my class constructor:
self.session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=100, pool_maxsize=100)
self.session.mount('http://', adapter)
self.session.headers.update({'Connection':'Keep-Alive'})
self.session.auth = (self._user+"/token", self._token)
According to advice from here, I shouldn't need to increase the pooled connections by that much considering the number of threads I'm using, but despite this I get the warning even after raising both values to 100.
This makes me think that connections are not being reused at all, or, if they are, that too many are being created for some reason. I've updated requests, so it is the most up-to-date version.
Does anyone have any ideas how I can get rid of this? I'm debugging some code and I think this is to blame for some requests not being made correctly.
Related:
Can I change the connection pool size for Python's "requests" module?
Since zendesk communicates over https, you just need to mount the adapter to the https protocol, i.e.
self.session.mount('https://', adapter)
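The adapter only applies to URLs whose scheme matches the mount prefix, so mounting on 'http://' does nothing for https URLs. If you are unsure which scheme will be used, a sketch of mounting the same adapter under both prefixes:
adapter = requests.adapters.HTTPAdapter(
    pool_connections=100, pool_maxsize=100)
self.session.mount('https://', adapter)
self.session.mount('http://', adapter)   # harmless extra, in case any plain-HTTP URL slips through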

Which ports will python use to send html request? with urllib or urllib2

I'm using webpy to make a small site. When I want to use OAuth, I find that the firewall blocks HTTP requests to any site; I can't even use IE to browse the Internet.
So I asked the administrator to open some ports for me, but I don't know which ports will be used by Python or IE to send HTTP requests.
Thanks!
I assume you're talking about the remote ports. In that case, just tell the admin to open the standard web ports. Really, if your admin doesn't know how to make IE work through the firewall, he's hopeless. I suggest walking up to random people on the street saying "80 and 443" until someone looks up, then fire your admin and hire that guy; he can't be any worse.
If your admin does know what he's doing, and wants you to use an HTTP proxy instead of connecting directly, ask him to set up the proxy for you in IE, look at the settings he uses, then come back here and search for how to use HTTP proxies in Python (there are lots of answers on that), and ask for help if you get stuck.
If you're talking about the local ports, because you've got an insane firewall, they'll be picked at random from some large range. If you want to cover every common OS, you need all of 1024-65535 to be opened up, although if you only need to deal with a single platform, most use a smaller range than that, and if the machine won't be doing much but running your program, most have a way to restrict it to an even smaller range (e.g., as small as 255 ports on Windows). Google "ephemeral port" for details.
If you need to restrict your local port, the key is to call bind on your socket before calling connect. If you think you're talking about the local ports, you're probably wrong. Go ask your admin (or the new one you just hired) and make sure. But if you are…
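To illustrate the bind-before-connect idea with a plain socket (9999 and example.com are arbitrary placeholders):
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', 9999))                  # pick the local (source) port before connecting
s.connect(('example.com', 80))      # the outbound connection now originates from port 9999
# ... speak HTTP over s, or hand the socket to code that can ...
s.close()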
If you're using urllib/urllib2, it does not have any way to do what you want, so you can't use it for this. You can drop down to httplib, which lets you pass a source_address, a (host, port) tuple that it will use to bind the socket before connecting. It's not as simple as what you're using, but it's a lot easier than implementing HTTP from scratch.
You might also want to look at requests, which I know has its own native OAuth support, and probably has a way to specify a source address. (I haven't looked, but I usually find that whenever I want to know if requests can do X, it can, and in the most obvious way I think of…) The API for requests is generally similar to urllib2 when urllib2 is sane, simpler and cleaner when urllib2 is messy, so it's usually pretty easy to port things.
At any rate, however you do this, you will have to consider the fact that only one socket can be bound to the same local port at a time. So, what happens if two programs are running at once, and they both need to make an outbound connection, and your admin has only given you one port? One of them will fail. Is that acceptable?
If not, what you really need to do is open a range of ports, and write code that does a random.shuffle on the range, then tries to bind them until one succeeds. Which means you'll need an HTTP library that lets you feed in a socket factory or a pre-opened socket instead of just specifying a port, which most of them do not, which probably means you'll be hacking up a copy of the httplib source.
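A sketch of that shuffle-and-bind idea (the 50000-50100 range is made up; substitute whatever range your admin actually opens):
import random
import socket

def bind_in_allowed_range(low=50000, high=50100):   # hypothetical allowed range
    ports = list(range(low, high))
    random.shuffle(ports)
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(('', port))
            return s                                 # caller connects and speaks HTTP over it
        except socket.error:
            s.close()                                # port already taken; try the next one
    raise RuntimeError('no free port in the allowed range')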
If all else fails, you can always set up a local proxy that binds to whatever source port (or port range) you want when proxying outward. Then you can just use your favorite high-level library, as-is, and connect to the local proxy, and there's no way the firewall can tell what's going on behind the proxy.
As you can see, this is not easy. That's mainly because you very rarely actually need to do this.
Generally when a program wants to use a port but doesn't care which number it has, it uses an "ephemeral port." This is typical for client applications, where the remote port is fixed (by the server), but the local port doesn't make any difference.
Often a firewall will allow outgoing connections from any port, simply blocking incoming connections to unknown ports, on the theory that internal machines making outgoing requests should be allowed to decide what is proper, and that bad actors are all on the "public" side.
You may find that your administrator requires you to use an "HTTP proxy." If so, here are the instructions for Ruby which I imagine you can port to Python: Ruby and Rails - oauth and http proxy

How to speed-up a HTTP request

I need to get json data and I'm using urllib2:
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
connection = opener.open(request)
data = connection.read()
but although the data isn't that big, it is too slow.
Is there a way to speed it up? I can use 3rd party libraries too.
Accept-Encoding: gzip means that the client is ready to accept gzip-encoded content if the server chooses to send it. The rest of the request goes down the socket, through your operating system's TCP/IP stack, and then over the physical layer.
If the server supports ETags, then you can send an If-None-Match header to ensure that content has not changed and rely on the cache. An example is given here.
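A sketch of what that looks like with urllib2 (url, cached_etag and cached_body are assumed to have been saved from a previous successful fetch):
import urllib2

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
request.add_header('If-None-Match', cached_etag)    # ETag from the previous response
try:
    connection = urllib2.build_opener().open(request)
    data = connection.read()
    cached_etag = connection.headers.get('ETag', cached_etag)
except urllib2.HTTPError as e:
    if e.code == 304:                                # Not Modified: reuse the cached body
        data = cached_body
    else:
        raise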
You cannot do much with clients only to improve your HTTP request speed.
You're dependent on a number of different things here that may not be within your control:
Latency/Bandwidth of your connection
Latency/Bandwidth of server connection
Load of server application and its individual processes
Items 2 and 3 are probably where the problem lies, and you won't be able to do much about them. Is the content cacheable? This will depend on your own application's needs and the HTTP headers (e.g. ETags, Cache-Control, Last-Modified) that are returned from the server. The server may only update once a day, in which case you might be better off requesting the data only every hour.
There is unlikely to be an issue with urllib. If you have network issues and performance problems, consider using tools like Wireshark to investigate at the network level. I have very strong doubts that this is related to Python in any way.
If you are making lots of requests, look into threading. Having about 10 workers making requests can speed things up - you don't grind to a halt if one of them takes too long getting a connection.
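For example, a rough sketch with a pool of 10 worker threads (urls is assumed to be your list of endpoints; multiprocessing.dummy provides a thread-backed Pool with the familiar map API):
from multiprocessing.dummy import Pool   # thread pool, not separate processes
import urllib2

def fetch(url):
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    return urllib2.build_opener().open(request).read()

pool = Pool(10)                          # roughly 10 workers, as suggested above
results = pool.map(fetch, urls)          # blocks until every URL has been fetched
pool.close()
pool.join()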
