According to this answer I can use the Range header to download only a part of an html page, but with this code:
import requests
url = "http://stackoverflow.com"
headers = {"Range": "bytes=0-100"} # first 100 bytes
r = requests.get(url, headers=headers)
print r.text
I get the whole html page. Why isn't it working?
If the webserver does not support Range header, it will be ignored.
Try with other server that support the header, for example tools.ietf.org:
import requests
url = "http://tools.ietf.org/rfc/rfc2822.txt"
headers = {"Range": "bytes=0-100"}
r = requests.get(url, headers=headers)
assert len(r.text) <= 101 # not exactly 101, because r.text does not include header
I'm having the same problem. The server I'm downloading from supports the Range header. Using requests, the header is ignored and the entire file is downloaded with a 200 status code. Meanwhile, sending the request via urllib3 correctly returns the partial content with a 206 status code.
I suppose this must be some kind of bug or incompatibility. requests works fine with other files, including the one in the example below. Accessing my file requires basic authorization - perhaps that has something to do with it?
If you run into this, urllib3 may be worth trying. You'll already have it because requests uses it. This is how I worked around my problem:
import urllib3
url = "https://www.rfc-editor.org/rfc/rfc2822.txt"
http = urllib3.PoolManager()
response = http.request('GET', url, headers={'Range':'bytes=0-100'})
Update: I tried sending a Range header to https://stackoverflow.com/, which is the link you specified. This returns the entire content with both Python modules as well as curl, despite the response header specifying accept-ranges: bytes. I can't say why.
I tried it without using:
headers = {"Range": "bytes=0-100"}
Try to use this:
import requests
# you can change the url
url = requests.get('http://example.com/')
print(url.text)
Related
Why I'm getting different responses when i use urllib.request.urlopen and requests.get
import requests
r = requests.get('https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg')
r.status_code
response 403
from urllib.request import urlopen
r = urlopen('https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg')
r.getcode()
response 200
First you could check print( r.content ) to see what you get from server.
Usually you can get some explanation which can help to see problem.
For your code it shows problem with header User-Agent
Wikipedia: User-Agent policy
Some servers check header User-Agent to send different content for different systems/browsers/devices. They use it also to detect scripts/bots/spamers/hackers and block them.
If I use header from real browser (or at least short Mozilla/5.0) then it works.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_(1950_poster).jpg'
#url = 'https://upload.wikimedia.org/wikipedia/commons/1/14/Sunset_Boulevard_%281950_poster%29.jpg'
r = requests.get(url, headers=headers)
print(r.status_code)
print(r.content[:100])
with open('image.jpg', 'wb') as fh:
fh.write(r.content)
EDIT:
After running code few times it start working for me even without User-Agent. Maybe they checked it for some different reason.
I'm relatively new to Python so would like some help, I've created a script which simply use the request library and basic auth to connect to an API and returns the xml or Json result.
# Imports
import requests
from requests.auth import HTTPBasicAuth
# Set variables
url = "api"
apiuser = 'test'
apipass = 'testpass'
# CALL API
r = requests.get(url, auth=HTTPBasicAuth(apiuser, apipass))
# Print Statuscode
print(r.status_code)
# Print XML
xmlString = str(r.text)
print(xmlString)
if but it returns a blank string.
If I was to use a browser to call the api and enter the cretentials I get the following response.
<Response>
<status>SUCCESS</status>
<callId>99999903219032190321</callId>
<result xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="Dummy">
<authorFullName>jack jones</authorFullName>
<authorOrderNumber>1</authorOrderNumber>
</result>
</Response>
Can anyone tell me where I'm going wrong.
What API are you connecting to?
Try adding a user-agent to the header:
r = requests.get(url, auth=HTTPBasicAuth(apiuser, apipass), headers={'User-Agent':'test'})
Although this is not an exact answer for the OP, it may solve the issue for someone having a blank response from python-requests.
I was getting a blank response because of the wrong content type. I was expecting an HTML rather than a JSON or a login success. The correct content-type for me was application/x-www-form-urlencoded.
Essentially I had to do the following to make my script work.
data = 'arcDate=2021/01/05'
headers = {
'Content-Type': 'application/x-www-form-urlencoded',
}
r = requests.post('https://www.deccanherald.com/getarchive', data=data, headers=headers)
print(r.status_code)
print(r.text)
Learn more about this in application/x-www-form-urlencoded or multipart/form-data?
Run this and see what responses you get.
import requests
url = "https://google.com"
r = requests.get(url)
print(r.status_code)
print(r.json)
print(r.text)
When you start having to pass things in your GET, PUT, DELETE, OR POST requests, you will add it in the request.
url = "https://google.com"
headers = {'api key': 'blah92382377432432')
r = requests.get(url, headers=headers)
Then you should see the same type of responses. Long story short,
Print(r.text) to see the response, then you once you see the format of the response you get, you can move it around however you want.
I have an empty response only when the authentication failed or is denied.
The HTTP status is still ≤ 400.
However, in the header you can find :
'X-Seraph-LoginReason': 'AUTHENTICATED_FAILED'
or
'X-Seraph-LoginReason': 'AUTHENTICATED_DENIED'
If the request is empty, not even a status code I could suggest waiting some time between printing. Maybe the server is taking time to return the response to you.
import time
time.sleep(5)
Not the nicest thing, but it's worth trying
How can I make a time delay in Python?
I guess there are no errors during execution
EDIT: nvm, you mentioned that you got a status code, I thought you were literally geting nothing.
On the side, if you are using python3 you have to use Print(), it replaced Print
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser. But in Python command line the 4th URL is giving a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.
Perhaps important is the Set-Cookie header. It may be redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and doing something based on that. Those are the big aspects that differentiate a browser from a tool like requests, or urlib. The browser creates sessions, stores cookies, and sends different headers.
I don't know why urllib fails (I get the same response), however requests lib works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
I am using request to fetch data. on USPS.COM the tracking URL is redirecting permanently(301) hence can't see desired page. The URL work's perfectly on Browser.
Update:
Added the Real URL for clarification/debugging
According to Redirection and History - Requests documentation:
Requests will automatically perform location redirection for all verbs
except HEAD.
So, you don't need to worry about redirection.
The problem is that USPS.COM checks User-Agent header and returns different result according to the header value. You need to specify the header to get the same result with the browser.
For example:
import requests
url = 'http://.....'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
assert 'Delivered' in r.content
url = 'http://developer.usa.gov/1usagov.json'
r = requests.get(url)
Python code hangs forever and i not behind a http proxy or anything.
Pointing my browser directly to the url works
Following my comment above.. I think your problem is the continuous stream. You need to do something like in the doc
r = requests.get(url, stream=True)
if int(r.headers['content-length']) < TOO_LONG:
# rebuild the content and parse
with a while instead of if if you want a continuous loop.