I maintain an API client for the TRIAS API (German Link) to retrieve local public transport information for various states / cities in Germany. Recently one of the TRIAS servers (Baden-Württemberg) started responding with an error message to requests.
When I try to send the request via curl, the server responds just fine:
$ curl -vH "Content-Type: text/xml; charset=utf-8" -d @trias-req.xml http://www.efa-bw.de/trias
* Trying 94.186.213.206:80...
* Connected to www.efa-bw.de (94.186.213.206) port 80 (#0)
> POST /trias HTTP/1.1
> Host: www.efa-bw.de
> User-Agent: curl/7.83.1
> Accept: */*
> Content-Type: text/xml; charset=utf-8
> Content-Length: 652
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Tue, 31 May 2022 09:35:35 GMT
< Server: EFAController/10.4.25.9/BW-WW33
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: Authorization, Content-Type
< Access-Control-Allow-Methods: GET
< Access-Control-Expose-Headers: Content-Security-Policy, Location
< Access-Control-Max-Age: 600
< Content-Type: text/xml
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Accept-Ranges: none
< Content-Length: 1520
< Last-Modified: Tue, 31 May 2022 09:35:35 GMT
< Set-Cookie: ServerID=bw-ww33;Path=/
<
<?xml version="1.0" encoding="UTF-8"?>
<trias:Trias xmlns:siri="http://www.siri.org.uk/siri" xmlns:trias="http://www.vdv.de/trias" xmlns:acsb="http://www.ifopt.org.uk/acsb" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:datex2="http://datex2.eu/schema/1_0/1_0" version="1.1"><trias:ServiceDelivery><siri:ResponseTimestamp>2022-05-31T09:35:35Z</siri:ResponseTimestamp><siri:ProducerRef>de:nvbw</siri:ProducerRef><siri:Status>true</siri:Status><trias:Language>de</trias:Language><trias:CalcTime>30</trias:CalcTime><trias:DeliveryPayload><trias:LocationInformationResponse><trias:Location><trias:Location><trias:Address><trias:AddressCode>streetID:1500001248::8222000:-1:T 1:Mannheim:T 1::T 1: 68161:ANY:DIVA_STREET:942862:5641376:MRCV:B_W:0</trias:AddressCode><trias:AddressName><trias:Text>Mannheim, T 1</trias:Text><trias:Language>de</trias:Language></trias:AddressName><trias:PostalCode> 68161</trias:PostalCode><trias:LocalityName>Mannheim</trias:LocalityName><trias:LocalityRef>8222000:-1</trias:LocalityRef><trias:StreetName>T 1</trias:StreetName></trias:A* Connection #0 to host www.efa-bw.de left intact
ddress><trias:LocationName><trias:Text>Mannheim</trias:Text><trias:Language>de</trias:Language></trias:LocationName><trias:GeoPosition><trias:Longitude>8.46987</trias:Longitude><trias:Latitude>49.49121</trias:Latitude></trias:GeoPosition></trias:Location><trias:Complete>true</trias:Complete><trias:Probability>0.763999999</trias:Probability></trias:Location></trias:LocationInformationResponse></trias:DeliveryPayload></trias:ServiceDelivery></trias:Trias>
However, it fails with requests.post(), while it works with urllib.request.urlopen():
$ python efabw.py
502 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>
653
200 b'<?xml version="1.0" encoding="UTF-8"?>\n<trias:Trias xmlns:siri="http://www.siri.org.uk/siri" xmlns:trias="http://www.vdv.de/trias" xmlns:acsb="http://www.ifopt.org.uk/acsb" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:datex2="http://datex2.eu/schema/1_0/1_0" version="1.1"><trias:ServiceDelivery><siri:ResponseTimestamp>2022-05-31T09:37:02Z</siri:ResponseTimestamp><siri:ProducerRef>de:nvbw</siri:ProducerRef><siri:Status>true</siri:Status><trias:Language>de</trias:Language><trias:CalcTime>30</trias:CalcTime><trias:DeliveryPayload><trias:LocationInformationResponse><trias:Location><trias:Location><trias:Address><trias:AddressCode>streetID:1500001248::8222000:-1:T 1:Mannheim:T 1::T 1: 68161:ANY:DIVA_STREET:942862:5641376:MRCV:B_W:0</trias:AddressCode><trias:AddressName><trias:Text>Mannheim, T 1</trias:Text><trias:Language>de</trias:Language></trias:AddressName><trias:PostalCode> 68161</trias:PostalCode><trias:LocalityName>Mannheim</trias:LocalityName><trias:LocalityRef>8222000:-1</trias:LocalityRef><trias:StreetName>T 1</trias:StreetName></trias:Address><trias:LocationName><trias:Text>Mannheim</trias:Text><trias:Language>de</trias:Language></trias:LocationName><trias:GeoPosition><trias:Longitude>8.46987</trias:Longitude><trias:Latitude>49.49121</trias:Latitude></trias:GeoPosition></trias:Location><trias:Complete>true</trias:Complete><trias:Probability>0.763999999</trias:Probability></trias:Location></trias:LocationInformationResponse></trias:DeliveryPayload></trias:ServiceDelivery></trias:Trias>'
The respective code is:
#! /usr/bin/env python3
from urllib.request import Request, urlopen

from requests import post

URL = 'http://www.efa-bw.de/trias'
HEADERS = {'Content-Type': 'text/xml'}


def main():
    with open('trias-req.xml', 'rb') as file:
        xml = file.read()

    # Attempt 1: requests (returns the 502 proxy error)
    response = post(URL, data=xml, headers=HEADERS)
    print(response.status_code, response.text, len(xml))

    # Attempt 2: urllib (returns 200 and the expected XML)
    request = Request(URL, data=xml, headers=HEADERS)
    with urlopen(request) as response:
        print(response.status, response.read())


if __name__ == '__main__':
    main()
Why does the request fail with requests.post() only?
What can I do to debug this further?
Other API servers respond fine to the request with requests.post()
It turned out to be the user agent. After inspecting the headers with tcpdump, I found that the request fails with the user agent python-requests/2.27.1 but succeeds with Python-urllib/3.10 and curl/7.83.1:
#! /usr/bin/env python3
from urllib.request import Request, urlopen

from requests import post

URL = 'http://www.efa-bw.de/trias'
# Override the default python-requests user agent, which the server rejects
HEADERS_REQUESTS = {'Content-Type': 'text/xml', 'User-Agent': 'Python-urllib/3.10'}
HEADERS_URLLIB = {'Content-Type': 'text/xml'}


def main():
    with open('trias-req.xml', 'rb') as file:
        xml = file.read()

    response = post(URL, data=xml, headers=HEADERS_REQUESTS)
    print(response.status_code, response.text, len(xml))

    request = Request(URL, data=xml, headers=HEADERS_URLLIB)
    with urlopen(request) as response:
        print(response.status, response.read())


if __name__ == '__main__':
    main()
$ python efabw.py
200 <?xml version="1.0" encoding="UTF-8"?>
<trias:Trias xmlns:siri="http://www.siri.org.uk/siri" xmlns:trias="http://www.vdv.de/trias" xmlns:acsb="http://www.ifopt.org.uk/acsb" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:datex2="http://datex2.eu/schema/1_0/1_0" version="1.1"><trias:ServiceDelivery><siri:ResponseTimestamp>2022-05-31T10:09:06Z</siri:ResponseTimestamp><siri:ProducerRef>de:nvbw</siri:ProducerRef><siri:Status>true</siri:Status><trias:Language>de</trias:Language><trias:CalcTime>45</trias:CalcTime><trias:DeliveryPayload><trias:LocationInformationResponse><trias:Location><trias:Location><trias:Address><trias:AddressCode>streetID:1500001248::8222000:-1:T 1:Mannheim:T 1::T 1: 68161:ANY:DIVA_STREET:942862:5641376:MRCV:B_W:0</trias:AddressCode><trias:AddressName><trias:Text>Mannheim, T 1</trias:Text><trias:Language>de</trias:Language></trias:AddressName><trias:PostalCode> 68161</trias:PostalCode><trias:LocalityName>Mannheim</trias:LocalityName><trias:LocalityRef>8222000:-1</trias:LocalityRef><trias:StreetName>T 1</trias:StreetName></trias:Address><trias:LocationName><trias:Text>Mannheim</trias:Text><trias:Language>de</trias:Language></trias:LocationName><trias:GeoPosition><trias:Longitude>8.46987</trias:Longitude><trias:Latitude>49.49121</trias:Latitude></trias:GeoPosition></trias:Location><trias:Complete>true</trias:Complete><trias:Probability>0.763999999</trias:Probability></trias:Location></trias:LocationInformationResponse></trias:DeliveryPayload></trias:ServiceDelivery></trias:Trias> 653
200 b'<?xml version="1.0" encoding="UTF-8"?>\n<trias:Trias xmlns:siri="http://www.siri.org.uk/siri" xmlns:trias="http://www.vdv.de/trias" xmlns:acsb="http://www.ifopt.org.uk/acsb" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:datex2="http://datex2.eu/schema/1_0/1_0" version="1.1"><trias:ServiceDelivery><siri:ResponseTimestamp>2022-05-31T10:09:06Z</siri:ResponseTimestamp><siri:ProducerRef>de:nvbw</siri:ProducerRef><siri:Status>true</siri:Status><trias:Language>de</trias:Language><trias:CalcTime>19</trias:CalcTime><trias:DeliveryPayload><trias:LocationInformationResponse><trias:Location><trias:Location><trias:Address><trias:AddressCode>streetID:1500001248::8222000:-1:T 1:Mannheim:T 1::T 1: 68161:ANY:DIVA_STREET:942862:5641376:MRCV:B_W:0</trias:AddressCode><trias:AddressName><trias:Text>Mannheim, T 1</trias:Text><trias:Language>de</trias:Language></trias:AddressName><trias:PostalCode> 68161</trias:PostalCode><trias:LocalityName>Mannheim</trias:LocalityName><trias:LocalityRef>8222000:-1</trias:LocalityRef><trias:StreetName>T 1</trias:StreetName></trias:Address><trias:LocationName><trias:Text>Mannheim</trias:Text><trias:Language>de</trias:Language></trias:LocationName><trias:GeoPosition><trias:Longitude>8.46987</trias:Longitude><trias:Latitude>49.49121</trias:Latitude></trias:GeoPosition></trias:Location><trias:Complete>true</trias:Complete><trias:Probability>0.763999999</trias:Probability></trias:Location></trias:LocationInformationResponse></trias:DeliveryPayload></trias:ServiceDelivery></trias:Trias>'
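As an alternative to tcpdump, the headers that requests is about to send can be inspected locally by preparing the request through a Session without sending it. The following is a minimal sketch; the helper name headers_requests_would_send and the dummy body are made up for illustration, and headers added at a lower level (such as Host) will not appear here.

```python
import requests

URL = 'http://www.efa-bw.de/trias'


def headers_requests_would_send(url, body, headers):
    """Return the headers requests would attach, without contacting the server.

    Hypothetical helper: Session.prepare_request merges the session defaults
    (including the python-requests User-Agent) with the headers passed in.
    """
    session = requests.Session()
    prepared = session.prepare_request(
        requests.Request('POST', url, data=body, headers=headers)
    )
    return dict(prepared.headers)


default = headers_requests_would_send(URL, b'<xml/>', {'Content-Type': 'text/xml'})
overridden = headers_requests_would_send(
    URL, b'<xml/>', {'Content-Type': 'text/xml', 'User-Agent': 'Python-urllib/3.10'}
)

print(default['User-Agent'])     # the library default, python-requests/<version>
print(overridden['User-Agent'])  # the value passed in explicitly
```

Comparing the two dictionaries makes it easy to spot which default header differs from what curl or urllib sends.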
I'm writing an application with a React frontend and a Flask backend (with Flask-CORS installed). When I make a call from the frontend I get this error:
Access to fetch at 'http://127.0.0.1:5000/get' from origin 'http://127.0.0.1:3000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
So I followed this post
https://www.arundhaj.com/blog/definitive-guide-to-solve-cors-access-control-allow-origin-python-flask.html
and configured my application following the instructions.
If I run
$ curl -v -X OPTIONS -H "Origin: http://127.0.0.1:5000" -H "Access-Control-Request-Method: PUT" -H "Access-Control-Request-Headers: Authorization" http://127.0.0.1:5000
I get this response with the right Access-Control-Allow
Trying 127.0.0.1:5000...
* Connected to 127.0.0.1 (127.0.0.1) port 5000 (#0)
> OPTIONS / HTTP/1.1
> Host: 127.0.0.1:5000
> User-Agent: curl/7.77.0
> Accept: */*
> Origin: http://127.0.0.1:5000
> Access-Control-Request-Method: PUT
> Access-Control-Request-Headers: Authorization
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 NOT FOUND
< Server: Werkzeug/2.1.1 Python/3.10.4
< Date: Sat, 23 Apr 2022 09:36:22 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 232
< Access-Control-Allow-Origin: http://127.0.0.1:5000
< Access-Control-Allow-Headers: Authorization
< Access-Control-Allow-Methods: GET, OPTIONS, POST, PUT
If I run the same on http://127.0.0.1:3000 I get this
Trying 127.0.0.1:3000...
* Connected to 127.0.0.1 (127.0.0.1) port 3000 (#0)
> OPTIONS / HTTP/1.1
> Host: 127.0.0.1:3000
> User-Agent: curl/7.77.0
> Accept: */*
> Origin: http://127.0.0.1:3000
> Access-Control-Request-Method: PUT
> Access-Control-Request-Headers: Authorization
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< X-Powered-By: Express
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: *
< Access-Control-Allow-Headers: *
< Content-Security-Policy: default-src 'none'
< X-Content-Type-Options: nosniff
< Content-Type: text/html; charset=utf-8
< Content-Length: 143
< Vary: Accept-Encoding
< Date: Sat, 23 Apr 2022 09:50:15 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot OPTIONS /</pre>
</body>
</html>
* Connection #0 to host 127.0.0.1 left intact
Of course, if I run the application, the same CORS error pops up. I have the impression that flask-cors is not seen by React.
Here is the flask-cors configuration
api_config = {
    "origins": ["http://127.0.0.1:3000"],
    "methods": ["OPTIONS", "GET", "POST", "PUT"],
    "allow_headers": ['Content-Type', 'Authorization']
}
CORS(app, resources={
    r"/*": api_config
})
And I have this in my js file
useEffect(() => {
    fetch("http://127.0.0.1:5000/get", {
        mode: 'cors',
        method: "GET",
        headers: {
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    })
        .then(resp => resp.json())
        .then(resp => console.log(resp))
        .catch(error => console.log(error))
}, []);
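One thing worth checking: the fetch call above sends Content-Type: application/json, which makes the browser issue a preflight OPTIONS request to /get with Origin http://127.0.0.1:3000, whereas the curl test hit / with Origin http://127.0.0.1:5000 and got a 404. Below is a rough Python sketch of the checks a browser applies to a preflight response. It is a simplification (no credentials handling, no safelisted-header logic), and the function name preflight_allows is made up for illustration.

```python
def preflight_allows(status, response_headers, origin, method, request_headers=()):
    """Simplified approximation of a browser's CORS preflight check.

    Assumptions: no credentials mode, and every entry in request_headers
    needs explicit permission (real browsers skip safelisted headers).
    """
    headers = {name.lower(): value for name, value in response_headers.items()}

    # The preflight response itself must have a 2xx status.
    if not 200 <= status <= 299:
        return False
    if headers.get('access-control-allow-origin') not in ('*', origin):
        return False

    def listed(name):
        raw = headers.get(name, '')
        return {item.strip().lower() for item in raw.split(',') if item.strip()}

    methods = listed('access-control-allow-methods')
    simple_methods = {'get', 'head', 'post'}  # allowed without being listed
    if method.lower() not in methods | simple_methods and '*' not in methods:
        return False

    allowed = listed('access-control-allow-headers')
    for header in request_headers:
        name = header.lower()
        # The wildcard does not cover Authorization; it must be listed by name.
        wildcard_ok = '*' in allowed and name != 'authorization'
        if name not in allowed and not wildcard_ok:
            return False
    return True


# Headers and status taken from the curl output against port 5000 above.
observed = {
    'Access-Control-Allow-Origin': 'http://127.0.0.1:5000',
    'Access-Control-Allow-Headers': 'Authorization',
    'Access-Control-Allow-Methods': 'GET, OPTIONS, POST, PUT',
}
print(preflight_allows(404, observed, 'http://127.0.0.1:3000', 'GET', ['Content-Type']))
```

Feeding it the status and headers from a preflight sent to the real endpoint (/get) with the frontend's origin shows which condition fails.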
I have this issue with an Angular + Flask app run with Python 3.10 on macOS.
The same app works perfectly when run with Python 3.8 on Ubuntu.
I still have to try running it with a lower version of Python on the macOS computer to check whether the cause is the OS or the Python version.
Happy to hear I am not the only one with this issue. I hope I will come back to you with some news.
This is a follow-up from a question I saw earlier today. In this question, a user asks about a problem downloading a pdf from this url:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I would think that the two download functions below would give the same result, but the urllib2 version downloads some html with a script tag referencing a pdf loader, while the requests version downloads the real pdf. Can someone explain the difference in behavior?
import urllib2
import requests


def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())


def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests smart enough to wait for dynamic websites to render before downloading?
Edit
Following up on @cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the functions below write the same file for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    # Send the cookie that the server expects on the PDF URL
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())


def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where did requests get that cookie? Is requests making multiple trips to the server?
Edit 2
Cookie came from a redirect header:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
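The trace above shows the cookie being set on the first 302 and never sent back: a plain urlopen has no cookie jar, while requests keeps cookies it receives during a redirect chain. The sketch below reproduces that difference with Python 3's urllib against a small local server. The server is a made-up stand-in for the journal site, not its real behavior, and the handler name and paths are assumptions.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, build_opener


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical server: sets a cookie on a redirect, then checks for it."""

    def do_GET(self):
        if self.path == '/pdf':
            # First hit: redirect and hand out the cookie, like the 302 above.
            self.send_response(302)
            self.send_header('Location', '/pdf?cookieSet=1')
            self.send_header('Set-Cookie', 'I2KBRCK=1; path=/')
            self.send_header('Content-Length', '0')
            self.end_headers()
            return
        # Second hit: serve the "PDF" only if the cookie came back.
        has_cookie = 'I2KBRCK=1' in self.headers.get('Cookie', '')
        body = b'%PDF-fake' if has_cookie else b'<html>cookieAbsent</html>'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url):
    with opener.open(url) as resp:
        return resp.read()


server = HTTPServer(('127.0.0.1', 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/pdf' % server.server_port

# Default opener: follows the redirect but drops the cookie.
without_jar = fetch(build_opener(), url)
# Opener with a cookie jar: stores the cookie and replays it after the redirect.
with_jar = fetch(build_opener(HTTPCookieProcessor(CookieJar())), url)

server.shutdown()
server.server_close()

print(without_jar)
print(with_jar)
```

In Python 2 the equivalent pieces are urllib2.HTTPCookieProcessor and cookielib.CookieJar.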
I'll bet that it's an issue with the User Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same as you report with urllib2). This is part of the request header that lets a website know what type of program/user/whatever is accessing the site (not the library, the HTTP request).
By default, it looks like urllib2 uses: Python-urllib/2.1
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'll bet that the site is reading Darwin/13.1.0 or similar and then serving you the macos appropriate content. Otherwise, it's probably trying to direct you to some default alternate content (or prevent you from scraping that URL).
I'm learning how to log in to an example website using the Python requests module. This video tutorial got me started. Of all the cookies that I see in Google Chrome > Inspect Element > Network tab, I can only retrieve some using the following code:
import requests

with requests.Session() as s:
    url = 'http://www.noobmovies.com/accounts/login/?next=/'
    s.get(url)
    allcookies = s.cookies.get_dict()
    print allcookies
Using this I only get csrftoken like below:
{'csrftoken': 'ePE8zGxV4yHJ5j1NoGbXnhLK1FQ4jwqO'}
But in google chrome, I see all these other cookies apart from csrftoken (sessionid, _gat, _ga etc):
I even tried the following code from here, but the result was the same:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

# Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
# Create an opener to open pages using the http protocol and to process cookies.
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())

# Create a request object to be used to get the page.
req = Request("http://www.noobmovies.com/accounts/login/?next=/")
f = opener.open(req)

# See the first few lines of the page
html = f.read()
print html[:50]

# Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org
the cookies are:
<Cookie csrftoken=ePE8zGxV4yHJ5j1NoGbXnhLK1FQ4jwqO for www.noobmovies.com/>
So, how can I get all the cookies ? Thanks.
The cookies being set are from other pages/resources, probably loaded by JavaScript code. You can check it making the request to the page only (without running the JS code), using tools such as wget, curl or httpie.
The only cookie this server set is csrftoken, as you can see in:
$ wget --server-response 'http://www.noobmovies.com/accounts/login/?next=/'
--2016-02-01 22:51:55-- http://www.noobmovies.com/accounts/login/?next=/
Resolving www.noobmovies.com (www.noobmovies.com)... 69.164.217.90
Connecting to www.noobmovies.com (www.noobmovies.com)|69.164.217.90|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Date: Tue, 02 Feb 2016 00:51:58 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Expires: Tue, 02 Feb 2016 00:51:58 GMT
Vary: Cookie,Accept-Encoding
Cache-Control: max-age=0
Set-Cookie: csrftoken=XJ07sWhMpT1hqv4K96lXkyDWAYIFt1W5; expires=Tue, 31-Jan-2017 00:51:58 GMT; Max-Age=31449600; Path=/
Last-Modified: Tue, 02 Feb 2016 00:51:58 GMT
Length: unspecified [text/html]
Saving to: ‘index.html?next=%2F’
index.html?next=%2F [ <=> ] 10,83K 2,93KB/s in 3,7s
2016-02-01 22:52:03 (2,93 KB/s) - ‘index.html?next=%2F’ saved [11085]
Note the Set-Cookie line.
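The same check can be scripted. As a small sketch, the standard library's http.cookies.SimpleCookie can parse raw Set-Cookie header values and list the cookie names the server itself sets; the helper name cookie_names is made up for illustration, and the header value is copied from the wget output above.

```python
from http.cookies import SimpleCookie


def cookie_names(set_cookie_lines):
    """Return the cookie names found in a list of raw Set-Cookie values."""
    names = []
    for line in set_cookie_lines:
        parsed = SimpleCookie()
        parsed.load(line)  # attributes such as Path or Max-Age are not names
        names.extend(parsed.keys())
    return names


# Header value copied from the wget --server-response output.
lines = [
    'csrftoken=XJ07sWhMpT1hqv4K96lXkyDWAYIFt1W5; '
    'expires=Tue, 31-Jan-2017 00:51:58 GMT; Max-Age=31449600; Path=/'
]
print(cookie_names(lines))
```

Any cookie visible in the browser but missing from this list was set elsewhere, for example by JavaScript or by another resource.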
I'm trying to convert my python script from issuing a curl command via os.system() to using requests. I thought I'd use pycurl, but this question convinced me otherwise. The problem is I'm getting an error returned from the server that I can see when using r.text (from this answer) but I need more information. Is there a better way to debug what's happening?
For what it's worth, I think the issue revolves around converting my --data flag from curl/pycurl to requests. I've created a dictionary of the params I was passing to --data before. My guess is that one of those isn't valid, but how can I get more info to know for sure?
example:
headers2 = {
    "Accept": "*/*",
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36",
    "Origin": "https://somedomain.com",
    "X-Requested-With": "XMLHttpRequest",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.8",
    "Referer": "https://somedomain.com/release_cr_new.html?releaseid=%s&v=2&m=a&prev_release_id=%s" % (current_release_id, previous_release_id),
    "Host": "somedomain.com",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Cookie": 'cookie_val',
}

for bug_id in ids:
    print bug_id
    data = {
        'dump_json': '1',
        'releaseid': current_release_id,
        'v': '2',
        'm': 'a',
        'prev_release_id': previous_release_id,
        'bug_ids': bug_id,
        'set_cols': 'sqa_status&sqa_updates%5B0%5D%5Bbugid%5D=' + bug_id + '&sqa_updates%5B0%5D%5Bsqa_status%5D=6',
    }
    print 'current_release_id', data['releaseid']
    print 'previous_release_id', data['prev_release_id']
    r = requests.post(post_url, data=json.dumps(data), headers=headers2)
    print r.text
The output I'm getting is a pretty generic html message that I've seen before when I've queried the server in the wrong way. So I know I'm reaching the right server at least.
I'm not really expecting any output. This should just post to the server and update a field in the DB.
Anatomy of an http response
Example (loading this page)
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Fri, 27 Sep 2013 19:22:41 GMT
Last-Modified: Fri, 27 Sep 2013 19:21:41 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Date: Fri, 27 Sep 2013 19:21:41 GMT
Content-Length: 12706
<!DOCTYPE html>
<html>
... truncated rest of body ...
The first line is the status line; it consists of the HTTP version, the status code, and the status text.
Headers are key/value pairs. The headers end with an empty line, which marks the start of the payload / body.
The body consumes the rest of the message.
The following explains how to extract the 3 parts:
Status Line
Use the following to get the status line sent back from the server
>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404
>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
(source)
Headers:
r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')
# response headers:
r.headers
# request headers:
r.request.headers
Body
Use r.text.
Post Request Encoding
The 'content-type' you send to the server in the request should match the content-type you're actually sending. In your case, you are sending json but telling the server you're sending form data (which is the default if you do not specify).
From the headers you show above:
"Content-Type":"application/x-www-form-urlencoded",
But your requests.post call sets data=json.dumps(data), which is JSON. The headers should say:
"Content-type": "application/json",
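To see the mismatch concretely, the sketch below prepares the same payload twice with requests, once via data= (form encoding) and once via json=, and prints the Content-Type and body each would send. Nothing goes over the network; the URL path and the shortened payload are placeholders, not the asker's real values.

```python
import json

import requests

# Placeholder URL and a shortened payload, for illustration only.
URL = 'https://somedomain.com/post'
payload = {'dump_json': '1', 'v': '2'}

# data=<dict>: requests form-encodes the body and sets the matching header.
as_form = requests.Request('POST', URL, data=payload).prepare()

# json=<dict>: requests serializes to JSON and sets application/json.
as_json = requests.Request('POST', URL, json=payload).prepare()

print(as_form.headers['Content-Type'], as_form.body)
print(as_json.headers['Content-Type'], as_json.body)
```

Passing data=json.dumps(data) together with a hand-written form Content-Type, as in the question, ends up with a JSON body labelled as form data, which the server is likely to reject or misparse.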
The value returned from the request object contains the request information under .request.
Example:
r = requests.request("POST", url, ...)
print("Request headers:", r.request.headers)
print("Request body:", r.request.body)
print("Response status code:", r.status_code)
print("Response text:", r.text.encode('utf8'))
I am using the python urllib2 library for opening URL, and what I want is to get the complete header info of the request. When I use response.info I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (add-on for firefox), e.g:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import cookielib
import urllib
import urllib2


def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    # This does not give complete header info; how can I get complete header info?
    print response.info()
    return response.read()


url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect, but see the headers from the first page instead, use a URLOpener instead of a FancyURLOpener, which automatically follows redirects.
I see that the server returns HTTP/1.1 302 Found, which is an HTTP redirect.
urllib automatically follows redirects, so the headers returned by urllib are the headers from http://www.trucks.com.mt, not from http://www.yellowpages.com.mt/Malta-Web/127151.aspx
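To look at the headers of the redirecting response itself, one option is an opener whose redirect handler declines to follow. The sketch below uses Python 3's urllib.request (the Python 2 urllib2 classes have the same names) against a tiny local server; the server, the NoRedirect class name, and the /old and /new paths are made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPRedirectHandler, build_opener


class NoRedirect(HTTPRedirectHandler):
    """Decline every redirect so that the 3xx response surfaces as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old redirects to /new."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'landed'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(('127.0.0.1', 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

# Default behavior: the redirect is followed; resp.url is the final location.
with build_opener().open(base + '/old') as resp:
    final_url, final_status = resp.url, resp.status

# With NoRedirect: the 302 is raised as HTTPError, carrying its own headers.
first_status, first_location = None, None
try:
    build_opener(NoRedirect).open(base + '/old')
except HTTPError as err:
    first_status, first_location = err.code, err.headers['Location']

server.shutdown()
server.server_close()

print(final_status, final_url)
print(first_status, first_location)
```

The HTTPError object exposes the 302's status code and headers, including Location and any Set-Cookie sent with the redirect.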