I am developing a download manager and am using the requests module in Python to check whether a link is valid (and, hopefully, to detect broken links).
My code for checking a link is below:
import requests
url = 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'
r = requests.get(url, allow_redirects=False)  # this line takes 40 seconds
if r.status_code == 200:
    print("link valid")
else:
    print("link invalid")
Now, the issue is that this check takes approximately 40 seconds, which is far too long.
My question is: how can I speed this up, maybe by using urllib2 or something else?
Note: if I replace url with the actual URL, which is 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe', the check takes one second, so it appears to be an issue with requests.
Not all hosts support HEAD requests. You can use this instead:
r = requests.get(url, stream=True)
This only downloads the headers, not the response content. Moreover, if the idea is to get the file afterwards, you don't have to make another request.
See here for more information.
Don't use get, which actually retrieves the file. Use head instead:
r = requests.head(url, allow_redirects=False)
On my machine this takes the check from 6.9 seconds down to 0.4 seconds.
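Both answers can be combined. The following is a minimal sketch, not a definitive implementation: it tries HEAD first and falls back to a streamed GET when the host rejects HEAD. The function name `link_is_valid`, the `session` argument (assumed to be a `requests.Session`, or anything with compatible `head` and `get` methods), the 5-second timeout, and the choice to treat 405/501 as "HEAD not supported" are all assumptions of this example. Unlike the question's code it follows redirects, so a redirected download still counts as valid.

```python
def link_is_valid(session, url, timeout=5.0):
    """Return True if `url` answers with a 2xx status, False otherwise.

    `session` is assumed to be a requests.Session (hypothetical wiring:
    link_is_valid(requests.Session(), url)).
    """
    try:
        # HEAD asks for the headers only, so no file content is transferred.
        response = session.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code in (405, 501):
            # Assumption: these codes mean the host does not support HEAD.
            # stream=True defers the body, so only the headers are read here.
            response = session.get(url, stream=True, allow_redirects=True,
                                   timeout=timeout)
            response.close()  # release the connection without reading the body
        return 200 <= response.status_code < 300
    except OSError:
        # requests.RequestException derives from IOError (an alias of OSError),
        # so timeouts and connection failures land here.
        return False
```

Passing an explicit timeout matters as much as the choice of HEAD: without one, a host that never answers can block the check indefinitely.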
I want to fetch the source of some websites for a project. When I try to get a response, the program just gets stuck waiting. No matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request
link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try "eu.mouser.com" and "uk.farnell.com", no response comes back. I would be happy to skip them, but urlopen does not even raise a timeout. What is the problem here? Is there another way to get a website's source? (Sorry for my bad English.)
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by passing 5 (seconds) directly as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
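As a hedged sketch of how this could be used in practice: the "global default timeout" the docs refer to can be read with `socket.getdefaulttimeout()`, and it is `None` (wait forever) unless something called `socket.setdefaulttimeout()`, which would explain why the question's code never returns. The helper name `fetch_or_none` and the 5-second value are assumptions of this example, not part of the original answer.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=5.0):
    """Return the body of `url` as bytes, or None on failure or timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Connection problems are wrapped in URLError; a stalled read raises
        # socket.timeout directly, as in the traceback above.
        return None


# The global default that the docs mention. It is None (no timeout) unless
# socket.setdefaulttimeout() has been called somewhere in the process.
default_timeout = socket.getdefaulttimeout()
```

With this helper, an unresponsive host simply yields `None` after the timeout, so the caller can skip it and move on.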
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is provided with every request. This can be changed before making the request, e.g. via requests' header customization. But there are probably more things to do here and there.
If you're interested in starting to develop such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress
import requests
from requests import ReadTimeout
Link = namedtuple("Link", ["url", "filename"])
links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}
for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites that lead to ReadTimeout errors are simply ignored, and you can go further, e.g. by passing a suitable headers parameter to requests.get(link.url, timeout=3). But as I already mentioned, this is probably not the only customization that has to be done, and the legal aspects should also be clarified.
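Since the question used urllib.request, here is a small sketch of the header customization mentioned above using only the standard library (with requests, the same dictionary could be passed as `headers=`). The header values, the constant `BROWSER_LIKE_HEADERS`, and the helper names are made up for illustration; there is no guarantee that any particular site accepts them.

```python
import urllib.request

# Hypothetical header values, chosen only for illustration.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-fetcher/0.1)",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_LIKE_HEADERS):
    """Create a Request object that carries the custom headers."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch_text(url, timeout=5.0):
    """Fetch `url` with the custom headers and return the decoded body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping the timeout in `fetch_text` means a site that still refuses to answer fails fast instead of hanging.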
I'm trying to crawl my college website. I set a cookie and add headers, then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing.
I can't figure out what is happening.
Is my code wrong?
Or is it the website?
Can a single geturl() call follow two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It can return the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
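To see which hops were actually followed, the response's `history` list and `url` attribute can be inspected. The helper below is only a sketch; the name `redirect_chain` is an assumption, and `response` is assumed to be a `requests.Response` obtained with `allow_redirects=True`.

```python
def redirect_chain(response):
    """Return every URL visited, first to last, for a requests.Response.

    response.history holds the intermediate redirect responses in order;
    response.url is the URL of the final response.
    Hypothetical usage: redirect_chain(requests.get(url, timeout=5))
    """
    return [hop.url for hop in response.history] + [response.url]
```

The last element of the returned list is the final location, and a list with a single element means no redirect took place.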
I'm fetching a batch of URLs using the Python Requests module. I first want to read only their headers, to get the actual URL and the size of the response. Then I get the actual content for any that pass muster.
So I use stream=True to delay getting the content. This generally works fine.
But I'm encountering an occasional URL that doesn't respond, so I put in timeout=3.
But those requests never time out. They just hang. If I remove stream=True, the request times out correctly. Is there some reason stream and timeout shouldn't work together? Removing stream=True forces me to get all the content.
Doing this:
import requests
url = 'http://bit.ly/1pQH0o2'
x = requests.get(url) # hangs
x = requests.get(url, stream=True) # hangs
x = requests.get(url, stream=True, timeout=1) # hangs
x = requests.get(url, timeout=3) # times out correctly after 3 seconds
There was a relevant GitHub issue:
Timeouts do not occur when stream == True
The fix was included in requests 2.3.0.
I tested it with the latest version and it worked for me.
Do you close your responses? Unclosed and partially read responses can keep multiple connections open to the same resource, and the site may have a connection limit for a single IP.
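A sketch of the header-only pass with both points applied: a timeout on every request and a response that is always closed. The name `probe` and the timeout values are assumptions; `session` is assumed to be a `requests.Session`. Note that the requests timeout limits each wait on the socket rather than the total duration of the call.

```python
from contextlib import closing


def probe(session, url, timeout=(3.05, 10)):
    """Return (final_url, declared_size_or_None) without downloading the body.

    `timeout` is a (connect, read) pair in seconds; the values are arbitrary.
    """
    # stream=True defers the body; closing() guarantees the connection is
    # released even though the body is never read.
    with closing(session.get(url, stream=True, timeout=timeout)) as response:
        size = response.headers.get("Content-Length")
        return response.url, (int(size) if size is not None else None)
```

URLs that pass the size check can then be fetched in a second step, or the body can be read from the same response before it is closed.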
url = 'http://developer.usa.gov/1usagov.json'
r = requests.get(url)
The Python code hangs forever, and I am not behind an HTTP proxy or anything.
Pointing my browser directly at the URL works.
Following up on my comment above: I think your problem is the continuous stream. You need to do something like what the docs show:
r = requests.get(url, stream=True)
if int(r.headers['content-length']) < TOO_LONG:
    # rebuild the content and parse
Use a while instead of the if if you want a continuous loop.
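For a stream that never ends, one option is to consume it line by line with `Response.iter_lines()` and stop after a fixed number of items. This is only a sketch: the name `first_events` is hypothetical, `response` is assumed to come from `requests.get(url, stream=True)`, and the feed is assumed to emit one JSON object per line.

```python
import json
from itertools import islice


def first_events(response, limit=10):
    """Parse at most `limit` JSON lines from a streamed response."""
    # Skip empty lines, which streaming endpoints may send as keep-alives.
    lines = (line for line in response.iter_lines() if line)
    # islice stops pulling from the stream once `limit` lines were taken.
    return [json.loads(line) for line in islice(lines, limit)]
```

Because the body is read lazily, the call returns as soon as `limit` lines have arrived instead of waiting for the end of the stream.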
For a given URL, how can I detect the final internet location after HTTP redirects without downloading the final page (e.g. with a HEAD request), using Python? I am trying to write a mass downloader, and my downloading mechanism needs to know the internet location of a page before downloading it.
edit
I ended up doing this, I hope this helps other people. I am still open to other methods.
import urlparse
import httplib
def getFinalUrl(url):
    "Navigates Through redirections to get final url."
    parsed = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(parsed.netloc)
    conn.request("HEAD",parsed.path)
    response = conn.getresponse()
    if str(response.status).startswith("3"):
        new_location = [v for k,v in response.getheaders() if k == "location"][0]
        return getFinalUrl(new_location)
    return url
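The code above uses the Python 2 module names urlparse and httplib. A possible Python 3 port, offered as a sketch rather than as the author's method, is shown below. The HTTPS support, the query-string handling, the resolution of relative Location headers, the hop limit, and the name `get_final_url` are additions of this example.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def get_final_url(url, max_hops=10, timeout=5.0):
    """Follow redirects with HEAD requests and return the last URL reached."""
    for _ in range(max_hops):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path)  # headers only, no page download
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        if status not in REDIRECT_CODES or not location:
            return url
        url = urljoin(url, location)  # Location may be a relative URL
    raise RuntimeError("more than %d redirects" % max_hops)
```

The loop replaces the recursion so that a redirect cycle ends with an error after `max_hops` requests instead of recursing without bound.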
I strongly suggest you use the requests library. It is well coded and actively maintained. Requests can do anything you need, like prefetch.
From the Requests documentation (http://docs.python-requests.org/en/latest/user/advanced/):
By default, when you make a request, the body of the response is downloaded immediately. You can override this behavior and defer downloading the response body until you access the Response.content attribute with the prefetch parameter:
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, prefetch=False)
At this point only the response headers have been downloaded and the connection remains open, hence allowing us to make content retrieval conditional:
if int(r.headers['content-length']) < TOO_LONG:
    content = r.content
    ...
You can further control the workflow by use of the Response.iter_content and Response.iter_lines methods, or reading from the underlying urllib3 urllib3.HTTPResponse at Response.raw
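In current releases of Requests the `prefetch` parameter no longer exists; `stream=True` plays the role of `prefetch=False`. The sketch below shows the same conditional download with `stream=True` and `Response.iter_content`, which also enforces the limit when the server sends no Content-Length header. The name `download_capped`, the 10 MiB value for `TOO_LONG`, and the `session` argument (assumed to be a `requests.Session`) are assumptions of this example.

```python
from contextlib import closing

TOO_LONG = 10 * 1024 * 1024  # hypothetical limit: 10 MiB


def download_capped(session, url, max_bytes=TOO_LONG, timeout=10):
    """Return the body as bytes, or None if it exceeds `max_bytes`."""
    with closing(session.get(url, stream=True, timeout=timeout)) as response:
        declared = response.headers.get("Content-Length")
        if declared is not None and int(declared) > max_bytes:
            return None  # rejected from the headers alone
        chunks, received = [], 0
        for chunk in response.iter_content(chunk_size=8192):
            received += len(chunk)
            if received > max_bytes:
                return None  # header was missing or understated the size
            chunks.append(chunk)
        return b"".join(chunks)
```

Returning `None` for oversized bodies keeps the caller's logic simple; raising an exception would be an equally reasonable design.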
You can use httplib to send HEAD requests.
You can also have a look at python-requests, which seems to be the new trendy API for HTTP requests, replacing the possibly awkward httplib2. (see Why Not httplib2)
It also has a head() method for this.