Server returns data with strange encoding/compression - python

I'm requesting data from a site which seems like it is returning base64 encoded data. A response looks like this:
b'LExRPzI+NlFpUXw2Mj9RW1E1MkUyUWksTFFJUWlgZGdkZ19mYmdnX19fW1FKUWlgXWRbUTUyRTJ7MjM2PURRaUxRSVFpaE5OW0xRSVFpYGRnZGhgYmhjZ19fX1tRSlFpYF1jY19kYk5bTFFJUWlgZGdkaGBmYmVnX19fW1FKUWlgXWNOW0xRSVFpYGRnZGhmaGNlZF9fX1tRSlFpYF1jY19kYk5bTFFJUWlgZGdlX2RjZGFjX19fW1FKUWlgXWNhX2BmW1E1MkUyezIzNj1EUWlMUUlRaVxoTk4uTi4='
But, just using base64.decode on that byte sequence doesn't give any meaningful data, so there must be some other step in transforming this data.
Here are the headers of this request:
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html; charset=UTF-8
Date: Mon, 13 Apr 2020 17:48:49 GMT
Server: nginx/1.14.0 (Ubuntu)
Transfer-Encoding: chunked
It is a GET request to this URL https://www.bestfightodds.com/api?f=ggd&m=20222&p=2
Something that seemed like it could work is
data = zlib.decompress(base64.b64decode(r.content))
But any kind of decompression always results in zlib.error: Error -3 while decompressing data: incorrect header check

The payload is not compressed: after decoding it with Base64, the data is highly repetitive, which compressed output would not be.
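Note that Requests already undoes the Content-Encoding: gzip layer before it hands back r.content, so no extra zlib step is needed on top of the Base64 decode. As a rough sketch of how to inspect the payload: the helper names below are hypothetical, and the ROT47 step is only a guess that happens to fit the start of the sample in the question, not a documented format of that site.

```python
import base64


def rot47(text):
    """Rotate every printable ASCII character (codes 33..126) by 47 places."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)
    return "".join(out)


def decode_payload(raw):
    """Base64-decode the bytes, then apply ROT47 (an assumed second step)."""
    return rot47(base64.b64decode(raw).decode("ascii"))


# First 20 characters of the response body quoted in the question.
sample = b"LExRPzI+NlFpUXw2Mj9R"
print(base64.b64decode(sample))   # printable ASCII, not a gzip/zlib header
print(decode_payload(sample))     # looks like the start of a JSON document
```

If the second print shows something JSON-like for the whole body, json.loads can take it from there; if not, the rotation guess does not apply to that response.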

Related

Partial Content Get Request

I'm trying to get partial content from the website http://dijkstra.cs.bilkent.edu.tr/~cs421/partialt11.txt
Below is my get request:
cmd2 = "GET /{} HTTP/1.0\r\nHost: {}\r\nAuthorization: Basic {}\r\nAccept-Ranges: bytes=1-1800\r\n\r\n".format(path2, host2, token2)
where host2, path2, token2 are defined 100% correctly. matchlist[][] gives the range of bytes I want to recover.
However, no matter what I put in Accept-Ranges: bytes=...-..., I get the same amount of the file, and it is not the whole file. I also get 200 OK as the status code instead of Partial Content. Even the accept-ranges header is not filled in the response. Why is that? Thanks in advance. Below is the response:
'HTTP/1.1 200 OK\r\n
Date: Sun, 20 Mar 2022 08:29:23 GMT\r\n
Server: Apache/2.4.6 () OpenSSL/1.0.2k-fips PHP/5.6.40 mod_perl/2.0.11 Perl/v5.16.3\r\n
Last-Modified: Mon, 07 Mar 2022 10:38:52 GMT\r\n
ETag: "73a-5d99e786e8eae"\r\n
Accept-Ranges: bytes\r\n
Content-Length: 1850\r\n
Connection: close\r\n
Content-Type: text/plain\r\n\r\n
Modem Noise Killer (alpha version)\n\nWith this circuit diagram, some basic tools including a soldering iron, and\nfour or five components from Radio Shack, you should be able to cut the\nnoise/garbage that appear'
Accept-Ranges is a response header: the server uses it to advertise that it accepts partial requests. The client has to send Range instead, which in your case is:
Range: bytes=1-1800
Note, however, that the server MAY ignore Range and answer with 200 OK and the full body.
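A minimal sketch of the corrected request string, keeping the format of the one in the question. The function names are hypothetical, and the values passed in at the bottom are placeholders rather than the real host, path, or token.

```python
def build_range_request(path, host, token, first_byte, last_byte):
    """Build a raw HTTP/1.0 GET that asks for an inclusive byte range."""
    return (
        "GET /{} HTTP/1.0\r\n"
        "Host: {}\r\n"
        "Authorization: Basic {}\r\n"
        "Range: bytes={}-{}\r\n"
        "\r\n"
    ).format(path, host, token, first_byte, last_byte)


def is_partial_content(raw_response):
    """Return True if the status line of a raw response reports 206."""
    status_line = raw_response.split("\r\n", 1)[0]
    return status_line.split(" ")[1] == "206"


# Placeholder values, for illustration only.
cmd2 = build_range_request("some/file.txt", "example.org", "dG9rZW4=", 1, 1800)
```

Checking the status line with is_partial_content tells you whether the server honoured the range (206) or ignored it and sent the whole file (200).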

Does requests properly support multipart responses?

I'm getting an error when receiving a multipart response.
WARNING connectionpool Failed to parse headers (url=************): [StartBoundaryNotFoundDefect(), MultipartInvariantViolationDefect()], unparsed data: ''
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 399, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/response.py", line 66, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [StartBoundaryNotFoundDefect(), MultipartInvariantViolationDefect()], unparsed data: ''
Does this mean that the library does not support multipart responses? The response from my server works in all other cases including to the browser so I'm a little confused.
Any ideas?
This is what is coming back from the server (of course body truncated for brevity):
HTTP/1.1 200 OK
X-Powered-By: Servlet/3.1
X-CA-Affinity: 2411441258
Cache-Control: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Encoding: gzip
X-Compressed-By: BICompressionFilter
Content-Type: multipart/related; type="text/xml"; boundary="1521336443366.-7832488688540884419.-1425166373"
Content-Language: en-US
Transfer-Encoding: chunked
Date: Sun, 18 Mar 2018 01:27:23 GMT
a
154e
<i ʲ O x\龅L dre Qyi
/su k
Of course this is encoded. If I decode it in Fiddler this is what it looks like:
HTTP/1.1 200 OK
X-Powered-By: Servlet/3.1
X-CA-Affinity: 2411441258
Cache-Control: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Compressed-By: BICompressionFilter
Content-Type: multipart/related; type="text/xml"; boundary="1521336443366.-7832488688540884419.-1425166373"
Content-Language: en-US
Date: Sun, 18 Mar 2018 01:27:23 GMT
Content-Length: 17419
--1521336443366.-7832488688540884419.-1425166373
Content-Type: text/xml; charset=utf-8
Content-Length: 15261
<?xml version="1.0" encoding="UTF-8"?>
To answer your question: Yes, Requests handles multipart requests just fine. Having said that, I have seen the same error you're getting.
This appears to be a bug in urllib3, but it possibly goes as deep as the httplib package that ships with Python. In your case I would guess it comes back to the UTF-8 encoding of the response, which you obviously can't do much about (unless you also maintain the server side). I believe the warning is perfectly safe to ignore, but simply calling urllib3.disable_warnings() doesn't seem to do the trick for me. If you want to silence this specific warning, you can add a logging filter to your code (credit to the home-assistant maintainers for this approach):
import logging

def filter_urllib3_logging():
    """Filter header errors from urllib3 due to a urllib3 bug."""
    urllib3_logger = logging.getLogger("urllib3.connectionpool")
    # Attach the filter only once, even if this function is called repeatedly.
    if not any(isinstance(x, NoHeaderErrorFilter)
               for x in urllib3_logger.filters):
        urllib3_logger.addFilter(
            NoHeaderErrorFilter()
        )

class NoHeaderErrorFilter(logging.Filter):
    """Filter out urllib3 Header Parsing Errors due to a urllib3 bug."""

    def filter(self, record):
        """Filter out Header Parsing Errors."""
        # Returning False drops the record; everything else passes through.
        return "Failed to parse headers" not in record.getMessage()
Then, just call filter_urllib3_logging() in your setup. It doesn't stop the warnings but it DOES hide them :D
!!PLEASE NOTE!! This will also hide, and thus, make it difficult to diagnose any error that is caused by parsing headers which occasionally could be a legitimate error!
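As far as I can tell, Requests hands the multipart body back as plain bytes and leaves splitting it to the caller. One way to do that with only the standard library is sketched below: it feeds the Content-Type header plus the body to the email parser. The function name and the sample data are hypothetical; the sample is merely shaped like the decoded response in the question.

```python
from email import message_from_bytes


def split_multipart(content_type, body):
    """Split a multipart body into (content type, payload bytes) pairs.

    content_type is the value of the response's Content-Type header, which
    carries the boundary; body is the already-decompressed payload.
    """
    # The email parser expects headers followed by a blank line, so the
    # Content-Type header is prepended to the raw body.
    header = b"Content-Type: " + content_type.encode("latin-1") + b"\r\n\r\n"
    message = message_from_bytes(header + body)
    return [
        (part.get_content_type(), part.get_payload(decode=True))
        for part in message.walk()
        if not part.is_multipart()
    ]


# Hypothetical sample with a made-up boundary.
sample_type = 'multipart/related; type="text/xml"; boundary="demo-boundary"'
sample_body = (
    b"--demo-boundary\r\n"
    b"Content-Type: text/xml; charset=utf-8\r\n"
    b"\r\n"
    b'<?xml version="1.0" encoding="UTF-8"?><root/>\r\n'
    b"--demo-boundary--\r\n"
)
parts = split_multipart(sample_type, sample_body)
```

With a real response this would be called as split_multipart(r.headers["Content-Type"], r.content), since r.content has already had the gzip layer removed.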

Using 'Requests' python module for POST request, receiving response as if it were GET

So I am trying to make a script that checks a reservation availability of a bus. The starting link for this is https://reservation.pc.gc.ca/.
In the reserve box the following needs to be selected:
Reservation: Day Use (Guided Hikes, Lake O’Hara Bus)
Park: Yoho-Lake O'Hara
Arrival Date: Jun 16
Party Size: 2
When those options are entered, it takes you to the following page: https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar
It is my understanding that if I send a POST request to that second link with the correct data it should return the page I'm looking for
If I look in the dev tools network info when I select the correct parameters the form data is:
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE: -reallllly long string-
__VIEWSTATEGENERATOR: 8D0E13E6
ctl00$MainContentPlaceHolder$rdbListReservationType: Events
ddlLocations: 213a1bc9-9218-4e98-9a7f-0f209008e437**
ddlArrivalMonth: 2017-06-16
ddlArrivalDay: 19
ddlNights: 1
ddlDepartureMonth:
ddlDepartureDay:
ddlEquipment:
ddlEquipmentSub:
ddlPartySize:2
ctl00$MainContentPlaceHolder$chkExcludeAccessible: on
ctl00$MainContentPlaceHolder$imageButtonCalendar.x: 64
ctl00$MainContentPlaceHolder$imageButtonCalendar.y: 56
So the code I wrote is:
import requests
payload = {
'__EVENTTARGET': '',
'__EVENTARGUMENT': '',
'__VIEWSTATE':-reallly long string-,
'__VIEWSTATEGENERATOR': '8D0E13E6',
'ctl00$MainContentPlaceHolder$rdbListReservationType': 'Events',
'ddlLocations': '213a1bc9-9218-4e98-9a7f-0f209008e437',
'ddlArrivalMonth': 2017-06-16,
'ddlArrivalDay': 19,
'ddlNights': 1,
'ddlDepartureMonth': '',
'ddlDepartureDay': '',
'ddlEquipment': '',
'ddlEquipmentSub': '',
'ddlPartySize': 2,
'ctl00$MainContentPlaceHolder$chkExcludeAccessible': 'on',
'ctl00$MainContentPlaceHolder$imageButtonCalendar.x': 64,
'ctl00$MainContentPlaceHolder$imageButtonCalendar.y': 56
}
r = requests.get(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
print r.text
r.text ends up just being the second link as if no parameters were entered - as if I just sent a normal GET request to the link. I tried turning the payload values that are integers into strings, I tried removing the empty key:value pairs. No luck. Trying to figure out what I'm missing.
It looks to me like there are 2 things going on:
@errata was correct, and this should be a POST request. You're about halfway there.
What I noticed though is that it seems to post the form data to Home.aspx and the URL that you see after submitting the form is the result of that processing and subsequent redirect.
You might try posting the form data as json to ./Home.aspx.
I found through Postman that this nearly worked, but I had to specify the content-type in order to get the proper results.
If you need to know how to add header and body instructions to the .post() method, it looks like there is a good example (though perhaps slightly outdated) here:
adding header to python request module
Also, FWIW, check out Postman. If you're new both to HTTP requests and to making them from Python, it may at least lessen some of the burden of trial and error.
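To make the "POST with an explicit content type" idea concrete, here is a sketch that builds such a request without sending it. It uses the standard library's urllib instead of Requests so that it runs without extra packages. The absolute Home.aspx URL is an assumption derived from the ./Home.aspx hint above, the helper name is hypothetical, and only a few of the form fields are shown.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


fields = {
    "ddlLocations": "213a1bc9-9218-4e98-9a7f-0f209008e437",
    # Quoted on purpose: a bare 2017-06-16 in Python source is not a string.
    "ddlArrivalMonth": "2017-06-16",
    "ddlArrivalDay": "19",
    "ddlPartySize": "2",
}
# Assumed target URL, based on the ./Home.aspx observation.
request = build_form_post("https://reservation.pc.gc.ca/Home.aspx", fields)
```

The Requests equivalent would be requests.post(url, data=fields, headers=...), where data= produces the same form encoding.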
You are using
r = requests.get(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
instead of
r = requests.post(r"https://reservation.pc.gc.ca/Yoho-LakeO'Hara?Calendar", data=payload)
Digging a bit deeper into your problem, I found that the URL you are calling actually redirects to a different URL (it returns HTTP status 302):
$ curl -I "https://reservation.pc.gc.ca/Yoho-LakeO'Hara"
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 77273
Content-Type: text/html; charset=utf-8
Location: https://reservation-pc.fjgc-gccf.gc.ca/GccfLanguage.aspx?lang=eng&ret=https%3a%2f%2freservation.pc.gc.ca%3a443%2fYoho-LakeO%27Hara
Server: Microsoft-IIS/8.0
Set-Cookie: ASP.NET_SessionId=qw4p4e2zxjxx0c2zyq014p45; path=/; secure; HttpOnly
Set-Cookie: CookieLocaleName=en-CA; path=/; secure; HttpOnly
X-Powered-By: ASP.NET
X-Frame-Options: SAMEORIGIN
Date: Wed, 17 May 2017 14:22:53 GMT
However, following the Location from that response also results in a 302:
$ curl -I "https://reservation-pc.fjgc-gccf.gc.ca/GccfLanguage.aspx?lang=eng&ret=https%3a%2f%2freservation.pc.gc.ca%3a443%2fYoho-LakeO%27Hara"
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 179
Content-Type: text/html; charset=utf-8
Location: https://reservation.pc.gc.ca:443/Yoho-LakeO'Hara?gccf=true
Server: Microsoft-IIS/8.0
Set-Cookie: ASP.NET_SessionId=rbcuvexfg4fb340ixtcjd1qy; path=/; secure; HttpOnly
Set-Cookie: _gc_lang=eng; domain=.fjgc-gccf.gc.ca; path=/; secure; HttpOnly
X-Powered-By: ASP.NET
X-Frame-Options: SAMEORIGIN
Date: Wed, 17 May 2017 14:24:55 GMT
All this probably results in Requests transforming your POST into GET in the end...
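The method rewriting mentioned above can be modelled in a few lines. This is a simplified sketch of the rule Requests applies when it follows redirects, as I understand it, not the library's actual code; both function names are hypothetical.

```python
def method_after_redirect(method, status_code):
    """Return the method used for the follow-up request after a redirect."""
    # 303 See Other: everything except HEAD is retried as GET.
    if status_code == 303 and method != "HEAD":
        return "GET"
    # 302 Found: same treatment, mirroring what browsers do.
    if status_code == 302 and method != "HEAD":
        return "GET"
    # 301 Moved Permanently: only POST is downgraded to GET.
    if status_code == 301 and method == "POST":
        return "GET"
    # 307 and 308 keep the original method.
    return method


def replay(method, status_codes):
    """Apply the rule across a whole chain of redirect responses."""
    for status_code in status_codes:
        method = method_after_redirect(method, status_code)
    return method


# The two 302 responses shown above turn a POST into a GET.
final_method = replay("POST", [302, 302])
```

To see the real chain, pass allow_redirects=False to the call, or inspect r.history on the final response.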

Making Head Requests in Twisted

I am relatively new to using Twisted and I am having trouble returning the content-length header when performing a basic head request. I have set up an asynchronous client already but the trouble comes in this bit of code:
import twisted.internet.defer
from twisted.internet import reactor
from twisted.web.client import Agent

def getHeaders(url):
    d = Agent(reactor).request("HEAD", url)
    d.addCallbacks(handleResponse, handleError)
    return d

def handleResponse(r):
    print r.code, r.headers
    whenFinished = twisted.internet.defer.Deferred()
    r.deliverBody(PrinterClient(whenFinished))
    return whenFinished
I am making a HEAD request and passing the URL. As indicated in this documentation, the content-length header is not stored in self.length, but it can be accessed from the response's self.headers. The status code is returned as expected, but the header output is not what I expect. Using u"http://www.espn.go.com" as an example, it currently returns:
Set-Cookie: SWID=77638195-7A94-4DD0-92A5-348603068D58;
path=/; expires=Fri, 31-Jan-2034 00:50:09 GMT; domain=go.com;
X-Ua-Compatible: IE=edge,chrome=1
Cache-Control: max-age=15
Date: Fri, 31 Jan 2014 00:50:09 GMT
P3P: CP="CAO DSP COR CURa ADMa DEVa TAIa PSAa PSDa IVAi IVDi CONi
OUR SAMo OTRo BUS PHY ONL UNI PUR COM NAV INT DEM CNT STA PRE"
Content-Type: text/html; charset=iso-8859-1
As you can see, no content-length field is returned. If the same request is done in requests then the result will contain the content-length header:
r = requests.head("http://www.espn.go.com")
r.headers
({'content-length': '46133', 'content-encoding': 'gzip'...})
(rest omitted for readability)
What is causing this problem? I am sure it is a simple mistake on my part but I for the life of me cannot figure out what I have done wrong. Any help is appreciated.
http://www.espn.go.com/ returns one response if the client sends an Accept-Encoding: gzip header and another response if it doesn't.
One of the differences between the two responses is the inclusion of the Content-Length header.
If you want to make requests using Agent including Accept-Encoding: gzip then take a look at ContentDecoderAgent or the third-party treq package.
HTTP allows (but does not REQUIRE) entity headers in responses to HEAD requests. The only restriction it places is that 200 responses to HEAD requests MUST NOT include an entity payload. It's up to the origin server to decide which entity headers, if any, it would like to include.
In the case of Content-Length, it makes sense for this to be optional for HEAD: if the entity will be computed dynamically (as with compressing/decompressing content), it's better for the server to avoid the extra work of computing the content length when the response won't include the content anyway.
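The effect can be reproduced locally with the standard library alone. The sketch below starts a throwaway server on localhost whose handler is entirely hypothetical: it includes Content-Length in a HEAD response only when the client offers gzip, which imitates the behaviour described above. The value 46133 is simply copied from the Requests output in the question.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: Content-Length only for gzip-capable clients."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        if "gzip" in self.headers.get("Accept-Encoding", ""):
            self.send_header("Content-Encoding", "gzip")
            self.send_header("Content-Length", "46133")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def head_content_length(port, extra_headers):
    """Send a HEAD request and return the Content-Length header, if any."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("HEAD", "/", headers=extra_headers)
    response = conn.getresponse()
    response.read()
    length = response.getheader("Content-Length")
    conn.close()
    return length


def run_demo():
    """Return Content-Length without and with Accept-Encoding: gzip."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free one
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return (
            head_content_length(port, {}),
            head_content_length(port, {"Accept-Encoding": "gzip"}),
        )
    finally:
        server.shutdown()
        server.server_close()
```

The same request differs only in one header, yet only the second call sees a Content-Length, which matches the difference observed between Agent and Requests.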

OAuth and the YouTube API

I am trying to use the YouTube services with OAuth. I have been able to obtain request tokens, authorize them and transform them into access tokens.
Now I am trying to use those tokens to actually do requests to the YouTube services. For instance I am trying to add a video to a playlist. Hence I am making a POST request to
https://gdata.youtube.com/feeds/api/playlists/XXXXXXXXXXXX
sending a body of
<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:yt="http://gdata.youtube.com/schemas/2007">
<id>XXXXXXXXX</id>
</entry>
and with the headers
Gdata-version: 2
Content-type: application/atom+xml
Authorization: OAuth oauth_consumer_key="www.xxxxx.xx",
oauth_nonce="xxxxxxxxxxxxxxxxxxxxxxxxx",
oauth_signature="XXXXXXXXXXXXXXXXXXX",
oauth_signature_method="HMAC-SHA1",
oauth_timestamp="1310985770",
oauth_token="1%2FXXXXXXXXXXXXXXXXXXXX",
oauth_version="1.0"
X-gdata-key: key="XXXXXXXXXXXXXXXXXXXXXXXXX"
plus some standard headers (Host and Content-Length) which are added by urllib2 (I am using Python) at the moment of the request.
Unfortunately, I get an Error 401: Unknown authorization header, and the headers of the response are
X-GData-User-Country: IT
WWW-Authenticate: GoogleLogin service="youtube",realm="https://www.google.com/youtube/accounts/ClientLogin"
Content-Type: text/html; charset=UTF-8
Content-Length: 179
Date: Mon, 18 Jul 2011 10:42:50 GMT
Expires: Mon, 18 Jul 2011 10:42:50 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
In particular I do not know how to interpret the WWW-Authenticate header, whose realm hints to ClientLogin.
I have also tried to play with the OAuth Playground, and the Authorization header sent by that site looks exactly like mine, except for the order of the fields. Still, on the playground everything works. Well, almost: I get an error saying that a Developer key is missing, but that is reasonable since there is no way to add one on the playground. Still, I get past the Error 401.
I have also tried to manually copy the Authorization header from there, and I got an Error 400: Bad request.
What am I doing wrong?
Turns out the problem was the newline before xmlns:yt. I was able to debug this using ncat, as suggested here, and inspecting the full response.
I would suggest using the python-oauth2 module (https://github.com/simplegeo/python-oauth2), because it is much simpler and takes care of the auth headers. I would also suggest encoding your parameters as UTF-8: I had a similar problem, and it turned out that Google was expecting UTF-8 encoded strings.
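As an illustration of the UTF-8 advice, the sketch below percent-encodes parameter values after encoding them as UTF-8 and joins them into a header value that contains no line breaks. Both function names are hypothetical, the parameters are placeholders, and computing the signature itself is left out.

```python
from urllib.parse import quote


def oauth_escape(value):
    """Percent-encode a value after encoding it as UTF-8."""
    # Only unreserved characters (letters, digits, "-", ".", "_", "~") stay as-is.
    return quote(str(value).encode("utf-8"), safe="~")


def build_authorization_header(params):
    """Join already-computed OAuth parameters into one header value."""
    pairs = [
        '{}="{}"'.format(oauth_escape(key), oauth_escape(value))
        for key, value in sorted(params.items())
    ]
    return "OAuth " + ", ".join(pairs)


# Placeholder parameters, for illustration only.
header = build_authorization_header({
    "oauth_token": "1/XXXX",
    "oauth_version": "1.0",
})
```

The escaped token comes out in the same 1%2F... form that appears in the question's Authorization header.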
