For now I am doing this: (Python3, urllib)
import urllib.request

url = 'someurl'
headers = (('HOST', 'somehost'),
           ('Connection', 'keep-alive'),
           ('Accept-Encoding', 'gzip,deflate'))
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
for h in headers:
    opener.addheaders.append(h)
data = 'some login data'  # username, pw etc.
opener.open('somesite/login.php', data)
res = opener.open(someurl)
data = res.read()
# ... some stuff here ...
res1 = opener.open(someurl2)
data = res1.read()
# etc.
What is happening is this:
I keep getting gzipped responses from the server and I stay logged in (I am fetching some content which is not available if I were not logged in), but I think the connection is dropping between every opener.open request.
I think that because connecting is very slow and it seems like there is a new connection every time. Two questions:
a) How do I test whether the connection is in fact staying alive or dying?
b) How do I make it stay alive between requests for other urls?
Take care :)
This will be a very delayed answer, but:
You should look at urllib3. It is for Python 2.x, but you'll get the idea when you see its README document.
And yes, urllib by default doesn't keep connections alive. I'm now implementing urllib3 for Python 3 to keep in my toolbag :)
In case you didn't know yet, python-requests offers a keep-alive feature, thanks to urllib3.
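For example, a minimal sketch using a requests Session, which keeps the underlying TCP connection alive across requests via urllib3's connection pooling (the login URL and form fields below are placeholders mirroring the question, not real endpoints):

import requests

# One Session reuses the same connection (keep-alive) for every request to
# the same host, and also carries cookies, so the login persists.
session = requests.Session()
session.headers.update({'Accept-Encoding': 'gzip,deflate'})

# hypothetical login endpoint and credentials, mirroring the question
session.post('http://somesite/login.php', data={'username': 'user', 'password': 'pw'})

res = session.get('http://somesite/someurl')
data = res.content  # gzip/deflate are decoded transparently

res1 = session.get('http://somesite/someurl2')
data1 = res1.content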
Related
I want to fetch some websites' source for a project. When I try to get the response, the program just gets stuck and waits for a response. No matter how long I wait, there is no timeout or response. Here is my code:
link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
writer.write(str(line))
writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try to get "eu.mouser.com" and "uk.farnell.com", it never returns their response, and urlopen does not even raise a timeout. What is the problem there? Is there another way to get a website's source? (Sorry for my bad English)
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by directly providing 5 (seconds) as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
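As for the "global default timeout setting" the docs mention: a minimal sketch, assuming you want every urlopen call to inherit a timeout rather than passing one explicitly, is to set it via socket.setdefaulttimeout:

import socket
import urllib.request

# Set the global default used whenever urlopen is called without a timeout.
socket.setdefaulttimeout(5)
urllib.request.urlopen("https://uk.farnell.com")  # now times out after ~5 seconds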
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is provided with every request. This can be changed before making the request, e.g. via requests' header customization. But there are probably more things to do here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites which lead to ReadTimeout errors are simply ignored, with the possibility of going further, e.g. by enhancing requests.get(link.url, timeout=3) with a suitable headers parameter. But as I already mentioned, this is probably not the only customization that would have to be done, and the legal aspects should also be clarified.
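As a hedged illustration of that headers idea (the header values below are only examples of browser-like fields, not a guaranteed way past any particular site's protection):

import requests

# Browser-like header fields passed via the headers parameter; the values
# are illustrative and may or may not satisfy a given site's bot detection.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-GB,en;q=0.9",
}
response = requests.get("https://eu.mouser.com/", headers=headers, timeout=3)
print(response.status_code)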
I'm having trouble with a POST using httplib. Here is the code:
import base64
import urllib
import httplib

head = {"Authorization": "Basic %s" % base64.encodestring("foo:bar")}
fields = {"token": "088cfe772ce0b7760186fe4762843a11"}
conn = httplib.HTTPSConnection("foundation.iplantc.org")
conn.set_debuglevel(2)
conn.request('POST', '/auth-v1/renew', urllib.urlencode(fields), head)
print conn.getresponse().read()
conn.close()
The POST that comes out is correct. I know because I started a telnet session and typing it in by hand worked fine. Here it is:
'POST /auth-v1/renew HTTP/1.1\r\nHost: foundation.iplantc.org\r\nAccept-Encoding: identity\r\nContent-Length: 38\r\nAuthorization: Basic YXRlcnJlbDpvTnl12aesf==\n\r\n\r\ntoken=088cfe772ce0b7760186fe4762843a11'
But the response from the server is "token not found" when the python script sends it. BTW this does work fine with urllib3 (urllib2 shows the same error), which uses a multipart encoding, but I want to know what is going wrong with the above. I would rather not depend on yet another 3rd party package.
httplib doesn't automatically add a Content-Type header, you have to add it yourself.
(urllib2 automatically adds application/x-www-form-urlencoded as Content-Type).
But what's probably throwing the server off is the additional '\n' after your Authorization header, introduced by base64.encodestring. You'd be better off using base64.urlsafe_b64encode instead.
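Putting both points together, a minimal sketch of the corrected request, assuming the same endpoint and credentials as in the question (untested):

import base64
import urllib
import httplib

# b64encode (like urlsafe_b64encode) does not append a trailing newline,
# unlike encodestring, and the Content-Type header is set explicitly for
# the urlencoded body.
head = {
    "Authorization": "Basic %s" % base64.b64encode("foo:bar"),
    "Content-Type": "application/x-www-form-urlencoded",
}
fields = {"token": "088cfe772ce0b7760186fe4762843a11"}

conn = httplib.HTTPSConnection("foundation.iplantc.org")
conn.request('POST', '/auth-v1/renew', urllib.urlencode(fields), head)
print conn.getresponse().read()
conn.close()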
For a given url, how can I detect the final internet location after HTTP redirects, without downloading the final page (e.g. with a HEAD request), using Python? I am trying to write a mass downloader; my downloading mechanism needs to know the internet location of a page before downloading it.
edit
I ended up doing this, I hope this helps other people. I am still open to other methods.
import urlparse
import httplib

def getFinalUrl(url):
    "Navigates through redirections to get the final url."
    parsed = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(parsed.netloc)
    conn.request("HEAD", parsed.path)
    response = conn.getresponse()
    if str(response.status).startswith("3"):
        new_location = [v for k, v in response.getheaders() if k == "location"][0]
        return getFinalUrl(new_location)
    return url
I strongly suggest you use the requests library. It is well coded and actively maintained. Requests can do anything you need, like prefetch.
From the Requests documentation http://docs.python-requests.org/en/latest/user/advanced/:
By default, when you make a request, the body of the response is downloaded immediately. You can override this behavior and defer downloading the response body until you access the Response.content attribute with the prefetch parameter:
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, prefetch=False)
At this point only the response headers have been downloaded and the connection remains open, hence allowing us to make content retrieval conditional:
if int(r.headers['content-length']) < TOO_LONG:
    content = r.content
    ...
You can further control the workflow by use of the Response.iter_content and Response.iter_lines methods, or by reading from the underlying urllib3.HTTPResponse at Response.raw.
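For instance, a small sketch of streaming the body in chunks, assuming the same tarball_url and the older prefetch-style API quoted above (newer requests versions spell it stream=True):

import requests

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, prefetch=False)

# Read the deferred body piece by piece instead of loading it all at once.
with open("requests-master.tar.gz", "wb") as fh:
    for chunk in r.iter_content(chunk_size=8192):
        fh.write(chunk)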
You can use httplib to send HEAD requests.
You can also have a look at python-requests, which seems to be the new trendy API for HTTP requests, replacing the possibly awkward httplib2. (see Why Not httplib2)
It also has a head() method for this.
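A minimal sketch with requests (the URL is a placeholder): requests.head does not follow redirects unless told to, so pass allow_redirects=True and read the final location from response.url:

import requests

# Follow redirects with HEAD requests only, without downloading any body.
response = requests.head("http://example.com/some-redirecting-url",
                         allow_redirects=True, timeout=5)
print(response.url)      # final location after all redirects
print(response.history)  # the intermediate 3xx responses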
Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. It seems like there might be something wrong with the way urllib2 handles proxies, or with the way I use it!
Anyways, here is the simplest possible code to retrieve files in a loop:
import urlgrabber

for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    urlgrabber.urlgrab(url, proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                       keepalive=1, close_connection=1, throttle=0)
Hello all!
I am trying to write a very simple python script to grab a bunch of files via urllib2.
This script needs to work through the proxy at work (my issue does not exist if grabbing files on the intranet, i.e. without the proxy).
Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? It seems the proxy is rejecting my authentication, but why? The first couple of urlopen requests went through correctly!
Edit: Adding a sleep of 10 seconds between requests to avoid some kind of throttling that might be done by the proxy did not change the results.
Here is a simplified version of my script (with identified information stripped, obviously):
import urllib2

passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})
opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)

for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())
Thanks in advance!
You can minimize the number of connections by using the keepalive handler from the urlgrabber module.
import urllib2
from keepalive import HTTPHandler
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)
fo = urllib2.urlopen('http://www.python.org')
I am unsure that this will work correctly with your Proxy setup.
You may have to hack the keepalive module.
The proxy might be throttling your requests. I guess it thinks you look like a bot.
You could add a timeout, and see if that gets you through.
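A minimal sketch of adding that timeout, assuming the same URL as in the question (Python 2):

import urllib2

# An explicit timeout (in seconds) makes a throttled or stalled request fail
# fast instead of hanging; the URLError can then be logged or retried.
try:
    f = urllib2.urlopen("http://www.iana.org/domains/example/", timeout=10)
    data = f.read()
except urllib2.URLError as e:
    print "request failed:", e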
How do I change my headers and request so that I appear as Firefox?
Like when I make a request to some servers:
import urllib
f = urllib.urlopen("rss feed")
they deny my request, saying my client doesn't have permission...
I get a reply, but the reply contains "your client doesn't have permission".
So how do I get around this and get the data?
http://vsbabu.org/mt/archives/2003/05/27/urllib2_setting_http_headers.html
If you want to use good old urllib instead of newer, fancier urllib2, then as urllib's docs say, and I quote,
For example, applications may want to specify a different User-Agent header than URLopener defines. This can be accomplished with the following code:
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"

urllib._urlopener = AppURLopener()
Of course, you'll want a version (aka User-Agent header) suitable for whatever version of Firefox (or w/ever else;-) you want to pretend you are;-).
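For example, a sketch that pretends to be Firefox (the exact User-Agent string below is only illustrative; substitute whatever Firefox build you want to impersonate, and the feed URL is a placeholder):

import urllib

class FirefoxURLopener(urllib.FancyURLopener):
    # Illustrative Firefox User-Agent string; pick the version you want to mimic.
    version = "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"

urllib._urlopener = FirefoxURLopener()
f = urllib.urlopen("http://example.com/feed.rss")  # placeholder for the real RSS feed URL
data = f.read()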