Making Head Requests in Twisted - python

I am relatively new to using Twisted and I am having trouble returning the content-length header when performing a basic head request. I have set up an asynchronous client already but the trouble comes in this bit of code:
def getHeaders(url):
    d = Agent(reactor).request("HEAD", url)
    d.addCallbacks(handleResponse, handleError)
    return d
def handleResponse(r):
    print r.code, r.headers
    whenFinished = twisted.internet.defer.Deferred()
    r.deliverBody(PrinterClient(whenFinished))
    return whenFinished
I am making a HEAD request and passing the URL. As indicated in this documentation, the content-length header is not stored in self.length, but can be accessed from the response's headers. The status code is returned as expected, but the header output is not. Using "http://www.espn.go.com" as an example, it currently returns:
Set-Cookie: SWID=77638195-7A94-4DD0-92A5-348603068D58;
path=/; expires=Fri, 31-Jan-2034 00:50:09 GMT; domain=go.com;
X-Ua-Compatible: IE=edge,chrome=1
Cache-Control: max-age=15
Date: Fri, 31 Jan 2014 00:50:09 GMT
P3P: CP="CAO DSP COR CURa ADMa DEVa TAIa PSAa PSDa IVAi IVDi CONi
OUR SAMo OTRo BUS PHY ONL UNI PUR COM NAV INT DEM CNT STA PRE"
Content-Type: text/html; charset=iso-8859-1
As you can see, no content-length field is returned. If the same request is done in requests then the result will contain the content-length header:
r = requests.head("http://www.espn.go.com")
r.headers
({'content-length': '46133', 'content-encoding': 'gzip'...})
(rest omitted for readability)
What is causing this problem? I am sure it is a simple mistake on my part but I for the life of me cannot figure out what I have done wrong. Any help is appreciated.

http://www.espn.go.com/ returns one response if the client sends an Accept-Encoding: gzip header and another response if it doesn't.
One of the differences between the two responses is the inclusion of the Content-Length header.
If you want to make requests using Agent including Accept-Encoding: gzip then take a look at ContentDecoderAgent or the third-party treq package.

HTTP allows (but does not REQUIRE) entity headers in responses to HEAD requests. The only restriction it places is that 200 responses to HEAD requests MUST NOT include an entity payload. It's up to the origin server to decide which, if any, entity headers it would like to include.
In the case of Content-Length, it makes sense for this to be optional for HEAD; if the entity will be computed dynamically (as with compressing/decompressing content), it's better for the server to avoid the extra work of computing the content length when the response won't include the content anyway.

Related

Using 'Requests' python module for POST request, receiving response as if it were GET

So I am trying to make a script that checks a reservation availability of a bus. The starting link for this is https://reservation.pc.gc.ca/.
In the reserve box the following needs to be selected:
Reservation: Day Use (Guided Hikes, Lake O’Hara Bus)
Park: Yoho-Lake O'Hara
Arrival Date: Jun 16
Party Size: 2
When those options are entered, it takes you to the following page: https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar
It is my understanding that if I send a POST request to that second link with the correct data, it should return the page I'm looking for.
If I look in the dev tools network info when I select the correct parameters the form data is:
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE: -reallllly long string-
__VIEWSTATEGENERATOR: 8D0E13E6
ctl00$MainContentPlaceHolder$rdbListReservationType: Events
ddlLocations: 213a1bc9-9218-4e98-9a7f-0f209008e437
ddlArrivalMonth: 2017-06-16
ddlArrivalDay: 19
ddlNights: 1
ddlDepartureMonth:
ddlDepartureDay:
ddlEquipment:
ddlEquipmentSub:
ddlPartySize: 2
ctl00$MainContentPlaceHolder$chkExcludeAccessible: on
ctl00$MainContentPlaceHolder$imageButtonCalendar.x: 64
ctl00$MainContentPlaceHolder$imageButtonCalendar.y: 56
So the code I wrote is:
import requests
payload = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': '-really long string-',
    '__VIEWSTATEGENERATOR': '8D0E13E6',
    'ctl00$MainContentPlaceHolder$rdbListReservationType': 'Events',
    'ddlLocations': '213a1bc9-9218-4e98-9a7f-0f209008e437',
    'ddlArrivalMonth': '2017-06-16',
    'ddlArrivalDay': 19,
    'ddlNights': 1,
    'ddlDepartureMonth': '',
    'ddlDepartureDay': '',
    'ddlEquipment': '',
    'ddlEquipmentSub': '',
    'ddlPartySize': 2,
    'ctl00$MainContentPlaceHolder$chkExcludeAccessible': 'on',
    'ctl00$MainContentPlaceHolder$imageButtonCalendar.x': 64,
    'ctl00$MainContentPlaceHolder$imageButtonCalendar.y': 56
}
r = requests.get(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
print r.text
r.text ends up just being the second link as if no parameters were entered - as if I just sent a normal GET request to the link. I tried turning the payload values that are integers into strings, I tried removing the empty key:value pairs. No luck. Trying to figure out what I'm missing.
It looks to me like there are 2 things going on:
@errata was correct, and this should be a POST request. You're about halfway there.
What I noticed though is that it seems to post the form data to Home.aspx and the URL that you see after submitting the form is the result of that processing and subsequent redirect.
You might try posting the form data as json to ./Home.aspx.
I found through Postman that this nearly worked, but I had to specify the content-type in order to get the proper results.
If you need to know how to add header and body instructions to the .post() method, it looks like there is a good example (though perhaps slightly outdated) here:
adding header to python request module
Also, fwiw, check out Postman. If you're inexperienced both with requests and with doing it in Python, at least this may lessen some of the burden of trial and error.
You are using
r = requests.get(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
instead of
r = requests.post(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
Digging a bit deeper into your problem, I found out that the URL you are calling is actually redirecting to a different URL (returning HTTP response 302):
$ curl -I "https://reservation.pc.gc.ca/Yoho-LakeO'Hara"
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 77273
Content-Type: text/html; charset=utf-8
Location: https://reservation-pc.fjgc-gccf.gc.ca/GccfLanguage.aspx?lang=eng&ret=https%3a%2f%2freservation.pc.gc.ca%3a443%2fYoho-LakeO%27Hara
Server: Microsoft-IIS/8.0
Set-Cookie: ASP.NET_SessionId=qw4p4e2zxjxx0c2zyq014p45; path=/; secure; HttpOnly
Set-Cookie: CookieLocaleName=en-CA; path=/; secure; HttpOnly
X-Powered-By: ASP.NET
X-Frame-Options: SAMEORIGIN
Date: Wed, 17 May 2017 14:22:53 GMT
However, following the Location from that response also results in a 302:
$ curl -I "https://reservation-pc.fjgc-gccf.gc.ca/GccfLanguage.aspx?lang=eng&ret=https%3a%2f%2freservation.pc.gc.ca%3a443%2fYoho-LakeO%27Hara"
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 179
Content-Type: text/html; charset=utf-8
Location: https://reservation.pc.gc.ca:443/Yoho-LakeO'Hara?gccf=true
Server: Microsoft-IIS/8.0
Set-Cookie: ASP.NET_SessionId=rbcuvexfg4fb340ixtcjd1qy; path=/; secure; HttpOnly
Set-Cookie: _gc_lang=eng; domain=.fjgc-gccf.gc.ca; path=/; secure; HttpOnly
X-Powered-By: ASP.NET
X-Frame-Options: SAMEORIGIN
Date: Wed, 17 May 2017 14:24:55 GMT
All this probably results in Requests transforming your POST into GET in the end...
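One way to check the redirect theory is to send the POST without following redirects and look at the first response. The sketch below assumes the requests library on Python 3; build_form_post, send_first_hop and main are illustrative names, and the payload is a shortened stand-in for the real form fields.

```python
import requests

URL = "https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar"  # from the question


def build_form_post(session, url, payload):
    """Prepare (without sending) a form-encoded POST that carries session cookies."""
    return session.prepare_request(requests.Request("POST", url, data=payload))


def send_first_hop(session, prepared):
    """Send without following redirects, so a 302 and its Location stay visible."""
    return session.send(prepared, allow_redirects=False)


def main():
    # Shortened, illustrative payload; the real form also needs __VIEWSTATE etc.
    payload = {"ddlPartySize": "2", "ddlNights": "1"}
    with requests.Session() as session:
        response = send_first_hop(session, build_form_post(session, URL, payload))
        print(response.request.method, response.status_code,
              response.headers.get("Location"))
```

If main() prints a 302 with a Location header, the POST itself was sent as a POST and the switch to GET happens while the redirect is being followed.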

JSON in post request works in HttpRequester but not in python Requests

I'm stuck in web scraping a page using Python. Basically, the following is the request from HttpRequester (in Mozilla) and it gives me the right response.
POST https://www.hpe.com/h20195/v2/Library.aspx/LoadMore
Content-Type: application/json
{"sort": "csdisplayorder", "hdnOffset": "1", "uniqueRequestId": "d6da6a30bdeb4d77b0e607a6b688de1e", "test": "", "titleSearch": "false", "facets": "wildcatsearchcategory#HPE,cshierarchycategory#No,csdocumenttype#41,csproducttype#18964"}
-- response --
200 OK
Cache-Control: private, max-age=0
Content-Length: 13701
Content-Type: application/json; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 28 May 2016 04:12:57 GMT
Connection: keep-alive
The exact same operation in python 2.7.1 using Requests, fails with an error. The following is the code snippet:
jsonContent = {"sort": "csdisplayorder", "hdnOffset": "1", "uniqueRequestId": "d6da6a30bdeb4d77b0e607a6b688de1e", "test": "", "titleSearch": "false", "facets": "wildcatsearchcategory#HPE,cshierarchycategory#No,csdocumenttype#41,csproducttype#18964"}
catResponse = requests.post('https://www.hpe.com/h20195/v2/Library.aspx/LoadMore', json = jsonContent)
The following is the error that I get:
{"Message":"Value cannot be null.\r\nParameter name: source","StackTrace":" at System.Linq.Enumerable.Contains[TSource](IEnumerable`1 source, TSource value, IEqualityComparer`1 comparer)\r\n
More information:
The Post request that I'm looking for is fired upon:
opening this web page: https://www.hpe.com/h20195/v2/Library.aspx?doctype=41&doccompany=HPE&footer=41&filter_doctype=no&filter_doclang=no&country=&filter_country=no&cc=us&lc=en&status=A&filter_status=rw#doctype-41&doccompany-HPE&prodtype_oid-18964&status-a&sortorder-csdisplayorder&teasers-off&isRetired-false&isRHParentNode-false&titleCheck-false
Clicking on the "Load more" grey button at the end of the page
I'm capturing the exact set of request headers and response from the browser operation and trying to mimic that in Postman, Python code and HttpRequester (Mozilla).
It flags the same error (mentioned above) with Postman and Python, but works with no headers set on my part with HttpRequester.
Can anyone think of an explanation for this?
If both Postman and requests are receiving an error, then there is more context than what HttpRequester is showing. There are a number of headers that I'd expect to be set almost always, including User-Agent and Content-Length, that are missing here.
The usual suspects are cookies (look for Set-Cookie headers in earlier requests, and preserve those by using a requests.Session() object), the User-Agent header and perhaps a Referer header, but do look for other headers too, such as anything starting with Accept.
Have HttpRequester post to http://httpbin.org/post instead for example, and inspect the returned JSON, which tells you what headers were sent. This won't include cookies (those are domain-specific), but anything else could potentially be something the server looks for. Try such headers one by one if cookies are not helping.
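Once both clients have posted to httpbin, the two echoed header sets can be compared mechanically. This is a small sketch in plain Python; header_diff is a hypothetical helper, and the sample dictionaries are made-up values shaped like the "headers" object httpbin returns, not captured data.

```python
def header_diff(reference, candidate):
    """Return headers from `reference` that `candidate` lacks or sends differently.

    Header names are compared case-insensitively, values exactly.
    """
    sent = {name.lower(): value for name, value in candidate.items()}
    return {
        name: value
        for name, value in reference.items()
        if sent.get(name.lower()) != value
    }


# Hypothetical sample data: what the working client and the failing script sent.
working = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json, text/javascript, */*",
}
failing = {
    "content-type": "application/json",
    "User-Agent": "python-requests/2.10.0",
}

# Each header printed here is a candidate to add to the failing request.
for name, value in sorted(header_diff(working, failing).items()):
    print(name, "=", value)
```

The headers it reports are the ones to try adding to the failing request one at a time.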

Content-type is blank in the headers of some requests

I've run this query millions (yes, millions) of times before with other URLs. However, I'm getting a KeyError when checking the content-type of the following webpage.
Code snippet:
r = requests.get("http://health.usnews.com/health-news/articles/2014/10/15/limiting-malpractice-claims-may-not-curb-costly-medical-tests", timeout=10, headers=headers)
if "text/html" in r.headers["content-type"]:
Error:
KeyError: 'content-type'
I checked the content of r.headers and it's:
CaseInsensitiveDict({'date': 'Fri, 20 May 2016 06:44:19 GMT', 'content-length': '0', 'connection': 'keep-alive', 'server': 'BigIP'})
What could be causing this?
Not all servers set a Content-Type header. Use .get() to retrieve a default if it is missing:
if "text/html" in r.headers.get("content-type", ''):
For the URL you gave I can't reproduce this:
$ curl -s -D - -o /dev/null "http://health.usnews.com/health-news/articles/2014/10/15/limiting-malpractice-claims-may-not-curb-costly-medical-tests"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Powered-By: Brightspot
Content-Type: text/html;charset=UTF-8
Date: Fri, 20 May 2016 06:45:12 GMT
Set-Cookie: JSESSIONID=A0C35776067AABCF9E029150C64D8D91; Path=/; HttpOnly
Transfer-Encoding: chunked
but if the header is missing from your response then it usually isn't Python's fault, and certainly not your code's fault.
It could be you encountered a buggy server or temporary glitch, or the server you contacted doesn't like you for one reason or another. Your sample response headers have the content-length set to 0 as well, for example, indicating there was no content to serve at all.
The server that gave you that response is BigIP, a load balancer / network router product from a company called F5. Hard to say exactly what kind (they have global routing servers as well as per-datacenter or cluster load balancers). It could be that the load balancer ran out of back-end servers to serve the request, doesn't have servers in your region, or the load balancer decided that you are sending too many requests and refuses to give you more than just this response, or it is the wrong phase of the moon and Jupiter is in retrograde and it threw a tantrum. We can't know!
But, just in case this happens again, do also look at the response status code. It may well be a 4xx or 5xx status code indicating that something was wrong with your request or with the server. For example, a 429 status code response would indicate you made too many requests in a short amount of time and should slow down. Test for it by checking r.status_code.
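The two checks can be combined in one small helper. This is a sketch only: is_html is a hypothetical name, and treating every non-2xx status as "not HTML" is an assumption that may not suit every crawler.

```python
def is_html(status_code, headers):
    """Return True only for a successful response whose Content-Type is HTML.

    `headers` is any mapping with a .get() method, such as requests' r.headers.
    A plain dict works too, but its keys must then be lowercase.
    """
    if not 200 <= status_code < 300:
        return False
    # A missing header falls back to '', so no KeyError is raised.
    return "text/html" in headers.get("content-type", "").lower()
```

With requests this would be called as is_html(r.status_code, r.headers).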

Testing web-tornado using Firefox's HttpRequest addon

I am testing my web-tornado application using Firefox's HttpRequest add-on but after I log in and receive my secure cookie data, I am not able to re-use it to consume protected methods.
This is my response data:
POST http://mylocalurl:8888/user/login
Content-Type: application/x-www-form-urlencoded
Login=mylogin;Pass=123
-- response --
200 OK
Content-Length: 33
Content-Type: text/html; charset=UTF-8
Server: TornadoServer/2.2.1
Set-Cookie: IdUser="Mjk=|1395170421|ffaf0d6fecf2f91c0dccca7cab03d799ef6637a0"; expires=Thu, 17 Apr 2014 19:20:21 GMT; Path=/
{"Success": true}
-- end response --
Now what I am trying to do is configure HttpRequester to use this cookie for my new requests. I tried to add it using the "Headers" tab, but my server keeps sending me a 403 Forbidden.
Can anyone help me with this? Another tool (for Linux) would be fine too.
I really like fiddler2 for this kind of thing, and there's an alpha build for mono that you may wish to try out: http://www.telerik.com/download/fiddler
If you don't mind paid software you can use Charles, for which there is a free trial.
And if you are testing and already using Python, why not use a simple Python script with requests and its Session object, which persists cookies across requests?
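A rough sketch of such a script, assuming the requests library on Python 3. The host, port, login path and field names are taken from the question; the helper names and the "/user/profile" route are hypothetical. Note that requests joins form fields with "&" rather than the ";" shown in the question.

```python
import requests

BASE_URL = "http://mylocalurl:8888"  # host and port from the question


def login_request(login_name, password):
    """Build the login POST; field names follow the question's request body."""
    return requests.Request("POST", BASE_URL + "/user/login",
                            data={"Login": login_name, "Pass": password})


def login(session, login_name, password):
    """Send the login form; the Session stores any cookie the server sets."""
    prepared = session.prepare_request(login_request(login_name, password))
    response = session.send(prepared)
    response.raise_for_status()
    return response


def fetch_protected(session, path):
    """GET a protected path; the Session sends the stored cookie back."""
    return session.get(BASE_URL + path)


def main():
    with requests.Session() as session:
        login(session, "mylogin", "123")
        # "/user/profile" is a hypothetical protected route.
        print(fetch_protected(session, "/user/profile").status_code)
```

Calling main() with the Tornado server running should print the status of the protected request. A 403 at that point would suggest the cookie is being rejected rather than not being sent.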

OAuth and the YouTube API

I am trying to use the YouTube services with OAuth. I have been able to obtain request tokens, authorize them and transform them into access tokens.
Now I am trying to use those tokens to actually do requests to the YouTube services. For instance I am trying to add a video to a playlist. Hence I am making a POST request to
https://gdata.youtube.com/feeds/api/playlists/XXXXXXXXXXXX
sending a body of
<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:yt="http://gdata.youtube.com/schemas/2007">
<id>XXXXXXXXX</id>
</entry>
and with the headers
Gdata-version: 2
Content-type: application/atom+xml
Authorization: OAuth oauth_consumer_key="www.xxxxx.xx",
oauth_nonce="xxxxxxxxxxxxxxxxxxxxxxxxx",
oauth_signature="XXXXXXXXXXXXXXXXXXX",
oauth_signature_method="HMAC-SHA1",
oauth_timestamp="1310985770",
oauth_token="1%2FXXXXXXXXXXXXXXXXXXXX",
oauth_version="1.0"
X-gdata-key: key="XXXXXXXXXXXXXXXXXXXXXXXXX"
plus some standard headers (Host and Content-Length) which are added by urllib2 (I am using Python) at the moment of the request.
Unfortunately, I get an Error 401: Unknown authorization header, and the headers of the response are
X-GData-User-Country: IT
WWW-Authenticate: GoogleLogin service="youtube",realm="https://www.google.com/youtube/accounts/ClientLogin"
Content-Type: text/html; charset=UTF-8
Content-Length: 179
Date: Mon, 18 Jul 2011 10:42:50 GMT
Expires: Mon, 18 Jul 2011 10:42:50 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
In particular I do not know how to interpret the WWW-Authenticate header, whose realm hints at ClientLogin.
I have also tried to play with the OAuth Playground, and the Authorization header sent by that site looks exactly like mine, except for the order of the fields. Still, on the playground everything works. Well, almost: I get an error telling me that a Developer key is missing, but that is reasonable since there is no way to add one on the playground. Still, I get past the Error 401.
I have also tried to manually copy the Authorization header from there, and I got an Error 400: Bad request.
What am I doing wrong?
Turns out the problem was the newline before xmlns:yt. I was able to debug this using ncat, as suggested here, and inspecting the full response.
I would suggest using the oauth Python module, because it is much simpler and takes care of the auth headers :) https://github.com/simplegeo/python-oauth2. As a solution, I suggest you encode your parameters as UTF-8; I had a similar problem, and the cause was that Google was expecting UTF-8 encoded strings.
