Python library requests cannot open a site

import requests

url = 'http://developer.usa.gov/1usagov.json'
r = requests.get(url)
The Python code hangs forever, and I'm not behind an HTTP proxy or anything.
Pointing my browser directly to the URL works.

Following my comment above: I think your problem is the continuous stream. You need to do something like the documentation shows:
r = requests.get(url, stream=True)
if int(r.headers['content-length']) < TOO_LONG:  # TOO_LONG: whatever size limit you choose
    pass  # rebuild the content and parse
Use a while loop instead of the if if you want a continuous loop.
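For a feed like 1usagov.json that never closes the connection, iterating over the stream line by line is one option. A minimal sketch, assuming the feed emits one JSON object per line:
import json
import requests

url = 'http://developer.usa.gov/1usagov.json'

# stream=True stops requests from trying to read the endless body up front
r = requests.get(url, stream=True, timeout=10)

for line in r.iter_lines():
    if not line:              # skip keep-alive newlines
        continue
    print(json.loads(line))   # assumes one JSON object per line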

Getting the redirected url in urllib2

I have a URL, and as soon as I click on it, it redirects me to another webpage. I want to get that redirected URL in my code with urllib2.
Sample code:
import urllib2

link = 'http://mywebpage.com'  # placeholder URL; urlopen needs the scheme
html = urllib2.urlopen(link).read()
Any help is much appreciated
Use the requests library; by default, Requests performs location redirection for all verbs except HEAD.
r = requests.get('https://mywebpage.com')
or turn off redirects:
r = requests.get('https://mywebpage.com', allow_redirects=False)
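If you only want to know where the redirect points without following it, the target is in the Location header of the non-redirected response. A minimal sketch, reusing the placeholder URL from above:
import requests

r = requests.get('https://mywebpage.com', allow_redirects=False)
if r.status_code in (301, 302, 303, 307, 308):
    print(r.headers['Location'])   # where the server wanted to send you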

Python - Requests HTTP range not working

According to this answer, I can use the Range header to download only part of an HTML page, but with this code:
import requests
url = "http://stackoverflow.com"
headers = {"Range": "bytes=0-100"} # first 100 bytes
r = requests.get(url, headers=headers)
print r.text
I get the whole html page. Why isn't it working?
If the web server does not support the Range header, it will be ignored.
Try another server that supports the header, for example tools.ietf.org:
import requests
url = "http://tools.ietf.org/rfc/rfc2822.txt"
headers = {"Range": "bytes=0-100"}
r = requests.get(url, headers=headers)
assert len(r.text) <= 101 # bytes 0-100 inclusive is at most 101 characters
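You can also check whether the server actually honored the range: a partial response comes back with status 206 and a Content-Range header, while an ignored Range header yields a plain 200. A quick check, reusing the request above:
import requests

r = requests.get("http://tools.ietf.org/rfc/rfc2822.txt",
                 headers={"Range": "bytes=0-100"})
if r.status_code == 206:                      # 206 Partial Content
    print("range honored:", r.headers.get("Content-Range"))
else:                                         # plain 200 means the header was ignored
    print("range ignored, got", len(r.content), "bytes")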
I'm having the same problem. The server I'm downloading from supports the Range header. Using requests, the header is ignored and the entire file is downloaded with a 200 status code. Meanwhile, sending the request via urllib3 correctly returns the partial content with a 206 status code.
I suppose this must be some kind of bug or incompatibility. requests works fine with other files, including the one in the example below. Accessing my file requires basic authorization - perhaps that has something to do with it?
If you run into this, urllib3 may be worth trying. You'll already have it because requests uses it. This is how I worked around my problem:
import urllib3
url = "https://www.rfc-editor.org/rfc/rfc2822.txt"
http = urllib3.PoolManager()
response = http.request('GET', url, headers={'Range':'bytes=0-100'})
Update: I tried sending a Range header to https://stackoverflow.com/, which is the link you specified. This returns the entire content with both Python modules as well as curl, despite the response header specifying accept-ranges: bytes. I can't say why.
I tried it without using:
headers = {"Range": "bytes=0-100"}
Try this instead:
import requests

# you can change the url
response = requests.get('http://example.com/')
print(response.text)

Python: urllib2 get nothing which does exist

I'm trying to crawl my college website. I set a cookie, add headers, and then:
homepage = opener.open("website")
content = homepage.read()
print content
I can get the source code sometimes, but sometimes I get nothing at all.
I can't figure out what happened.
Is my code wrong?
Or is it the website?
Can a single geturl() call follow two or even more redirects?
import urllib2

redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It can return the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
import requests

r = requests.get('website', allow_redirects=True)
print r.text
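To see every hop rather than just the final page, requests keeps the intermediate responses in r.history. A minimal sketch (the URL is just a placeholder):
import requests

r = requests.get('http://example.com/some-redirecting-url', allow_redirects=True)
for hop in r.history:                 # one Response per intermediate redirect
    print(hop.status_code, hop.url)
print('final:', r.url)                # where you actually ended up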

Python Requests module doesn't handle timeout if stream=True?

I'm fetching a batch of urls using the Python Requests module. I first want to read their headers only, to get the actual url and size of response. Then I get the actual content for any that pass muster.
So I use stream=True to delay getting the content. This generally works fine.
But I'm encountering the occasional URL that doesn't respond, so I put in timeout=3.
But those never time out; they just hang. If I remove stream=True, it times out correctly. Is there some reason stream and timeout shouldn't work together? Removing stream=True forces me to get all the content.
Doing this:
import requests
url = 'http://bit.ly/1pQH0o2'
x = requests.get(url) # hangs
x = requests.get(url, stream=True) # hangs
x = requests.get(url, stream=True, timeout=1) # hangs
x = requests.get(url, timeout=3) # times out correctly after 3 seconds
There was a relevant GitHub issue:
Timeouts do not occur when stream == True
The fix was included in the requests==2.3.0 release.
I tested it using the latest version, and it worked for me.
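For reference, on a fixed version you can also give separate connect and read timeouts; the read timeout applies to each read while streaming. A minimal sketch, assuming requests 2.4 or newer for the tuple form:
import requests

url = 'http://bit.ly/1pQH0o2'
# (connect timeout, read timeout) in seconds; the read timeout also covers streamed reads
r = requests.get(url, stream=True, timeout=(3.05, 10))
print(r.headers.get('content-length'))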
Do you close your responses? Unclosed and partially read responses can keep multiple connections open to the same resource, and the site may have a connection limit per IP.
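A with block (or an explicit r.close()) guarantees the response is closed and the connection released even if you never read the body. A minimal sketch, assuming a requests version recent enough for Response to work as a context manager:
import requests

with requests.get('http://bit.ly/1pQH0o2', stream=True, timeout=3) as r:
    print(r.status_code, r.headers.get('content-length'))
# the connection is released here even though the body was never read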

python requests is slow

I am developing a download manager and using the requests module in Python to check for a valid link (and, hopefully, broken links).
My code for checking a link is below:
import requests

url = 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'
r = requests.get(url, allow_redirects=False)  # this line takes 40 seconds
if r.status_code == 200:
    print("link valid")
else:
    print("link invalid")
Now, the issue is that this check takes approximately 40 seconds, which is huge.
My question is: how can I speed this up, maybe using urllib2 or something?
Note: also, if I replace url with the actual URL, which is 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe', it takes one second, so it appears to be an issue with requests.
Not all hosts support HEAD requests. You can use this instead:
r = requests.get(url, stream=True)
This actually downloads only the headers, not the response content. Moreover, if the idea is to get the file afterwards, you don't have to make another request.
See here for more info.
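Putting that together for a link checker that can also fetch the file later without a second request — a minimal sketch (the output filename is just an example):
import requests

url = 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'
r = requests.get(url, stream=True, allow_redirects=False)
if r.status_code == 200:
    print("link valid")
    # the body has not been read yet; stream it to disk only if you want the file
    with open('PyScripter-Setup.exe', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
else:
    print("link invalid")
    r.close()   # release the connection without reading the body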
Don't use get, which actually retrieves the file; use:
r = requests.head(url, allow_redirects=False)
which goes from 6.9 seconds on my machine down to 0.4 seconds.
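If you want the speed of HEAD with a safety net for hosts that reject it, one pattern is to try HEAD first and fall back to a streamed GET. A sketch combining the two suggestions above; the 405 check is an assumption, since servers signal unsupported methods in different ways:
import requests

def link_ok(url):
    r = requests.head(url, allow_redirects=False)
    if r.status_code == 405:          # method not allowed: retry with a streamed GET
        r = requests.get(url, stream=True, allow_redirects=False)
        r.close()                     # headers only; drop the connection
    return r.status_code == 200

print(link_ok('http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'))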
