missing 'content-length' header when using python's urllib2 urlopen - python

When I check the 'content-length' header for some web pages using urllib2 in Python, the header is missing. For example, the response from google.com does not include it. Any idea why?
Example:
import urllib2

r = urllib2.urlopen('http://www.google.com')
i = r.info()
print i.keys()
Gives:
['x-xss-protection', 'set-cookie', 'expires', 'server', 'connection', 'cache-control', 'date', 'p3p', 'content-type', 'x-frame-options']

An HTTP response can carry either a Content-Length header or Transfer-Encoding: chunked.
When Transfer-Encoding: chunked is used, the body that follows the headers is split into chunks. Each chunk is preceded by a hexadecimal string which, converted to decimal, gives the length of that chunk. After the last chunk you'll get a 0 for this value, which means you've reached the end of the body.
You can use a regular expression to extract this hexadecimal value (it isn't strictly necessary):
import re

read = ...  # string containing a line or a part of the HTTP response
hexPat = re.compile(r'([0-9A-F]+)\r\n', re.I)
match = re.search(hexPat, read)
chunkLen = int(match.group(1), 16)  # converts the hexadecimal chunk size to decimal
Alternatively, you can skip the regex: read the first hexadecimal value, get the length of the first chunk and receive that chunk, then read the length of the next chunk, and so on, until you find a 0.
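Not from the original answer, but here is a rough sketch of that loop, assuming the whole chunked body has already been read into a string (chunk extensions and trailers are ignored):
import re

def dechunk(raw_body):
    # Decode a complete chunked body held in a string (simplified: no chunk
    # extensions, no trailers, no partial reads).
    hex_pat = re.compile(r'([0-9A-Fa-f]+)\r\n')
    chunks = []
    pos = 0
    while True:
        match = hex_pat.match(raw_body, pos)
        if match is None:
            break
        chunk_len = int(match.group(1), 16)  # hexadecimal size line -> int
        if chunk_len == 0:                   # a zero-length chunk marks the end
            break
        start = match.end()
        chunks.append(raw_body[start:start + chunk_len])
        pos = start + chunk_len + 2          # skip the chunk data and its trailing CRLF
    return ''.join(chunks)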

A HEAD response SHOULD include the same Content-Length value that the corresponding GET response would have, but not all servers do:
Stack Overflow does:
> telnet stackoverflow.com 80
HEAD / HTTP/1.1
Host: stackoverflow.com
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Length: 362245 <--------
Content-Type: text/html; charset=utf-8
Expires: Mon, 04 Oct 2010 11:51:49 GMT
Last-Modified: Mon, 04 Oct 2010 11:50:49 GMT
Vary: *
Date: Mon, 04 Oct 2010 11:50:49 GMT
Google doesn't:
> telnet www.google.com 80
HEAD / HTTP/1.1
Host: www.google.ie
HTTP/1.1 200 OK
Date: Mon, 04 Oct 2010 11:55:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked
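The same comparison can be made from Python; this is just a sketch using the standard library's httplib, not part of the original answer:
import httplib

conn = httplib.HTTPConnection('stackoverflow.com')
conn.request('HEAD', '/')
resp = conn.getresponse()
print resp.status, resp.reason
print resp.getheader('content-length')   # None when the server uses chunked encoding instead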

HTTP header cut in half with `urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data`

I spotted a weird warning in logs:
[WARNING] urllib3.connectionpool:467: Failed to parse headers (url=https://REDACTED): [MissingHeaderBodySeparatorDefect()], unparsed data: 'trol,Content-Type\r\n\r\n'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 465, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/response.py", line 91, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'trol,Content-Type\r\n\r\n'
This is from calling a standard requests.post() on a web service I fully control (a Python app behind nginx).
When I turn on debuglevel=1 in http.client.HTTPResponse I see this:
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx/1.18.0 (Ubuntu)
header: Date: Tue, 30 Nov 2021 22:14:04 GMT
header: Content-Type: application/json
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Vary: Accept-Encoding
header: Access-Control-Allow-Origin: *
header: Access-Control-Allow-Credentials: true
header: Access-Control-Allow-Methods: GET, POST, OPTIONS
header: Access-Control-Allow-Headers: DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Con
Note the last header ending abruptly in ,If-Modified-Since,Cache-Con.
Clearly, requests==2.26.0 (via urllib3==1.26.7 via http.client) cuts the last header in half for some reason during parsing, and then later complains it has "left over" data with the remaining trol,Content-Type\r\n\r\n.
In this case the warning is not critical, because the header is not really needed. But it's scary this is happening, because… what else is being cut / misparsed?
The same endpoint works fine from e.g. curl:
$ curl -i -XPOST https://REDACTED
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 04 Dec 2021 20:08:59 GMT
Content-Type: application/json
Content-Length: 53
Connection: keep-alive
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Con
trol,Content-Type
…JSON response…
Any idea what could be wrong? Many thanks.
Your webserver, or its configuration, looks broken. Have a look at what is generating that CORS Access-Control-Allow-Headers header, because a header value is not permitted to contain a line break.
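For what it's worth, the defect can be reproduced outside requests with the standard library's email parser (which http.client uses for headers). This is only a sketch; the broken header value is copied from the curl output above:
import email.parser

raw_headers = (
    "Server: nginx/1.18.0 (Ubuntu)\r\n"
    "Content-Type: application/json\r\n"
    "Access-Control-Allow-Headers: DNT,If-Modified-Since,Cache-Con\r\n"
    "trol,Content-Type\r\n"   # continuation caused by the illegal line break
    "\r\n"
)
msg = email.parser.Parser().parsestr(raw_headers)
print(msg.defects)               # expect [MissingHeaderBodySeparatorDefect()]
print(repr(msg.get_payload()))   # expect 'trol,Content-Type\r\n\r\n' -- the "unparsed data"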

urllib2 sometimes returns an old page - returns strange headers

I'm working on a Python script that works with JSON returned by a URL.
For a couple of days now, urllib2 has (just sometimes) been returning an old state of the JSON.
I did add headers such as "Cache-Control": "max-age=0", but it still happens occasionally.
If I print out the response info I get:
Server: nginx/1.8.0
Date: Thu, 03 Sep 2015 17:02:47 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3539
Status: 200 OK
X-XHR-Current-Location: /shop/169464.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "b1fbe7a01e0832025a3afce23fc2ab56"
X-Request-Id: 4cc0d399f943ad09a903f18a6ce1c488
X-Runtime: 0.123033
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 1707606900 1707225496
Age: 2860
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from adsl
X-Cache-Lookup: MISS from adsl:21261
Connection: close
Does it have something to do with the "Age" or "X-Rack-Cache" header? Or any ideas how I can fix it?
Thanks in advance!
Try faking the user-agent, removing cookies, and dropping sessions:
import random
import urllib2

fake_user_agent = ['chrome', 'firefox', 'safari']
request = urllib2.Request(url)
request.add_header('User-Agent', random.choice(fake_user_agent))  # pick a random fake agent string
content = urllib2.build_opener().open(request)
If that doesn't work, try using Tor to change your IP per request.
If nothing works, you can't bypass it, because you are almost certainly connecting through a transparent proxy.
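Another workaround, not mentioned in the thread: given the Varnish/X-Cache headers above, you can also try to defeat intermediate caches directly. A rough sketch (the URL is a made-up stand-in):
import time
import urllib2

url = 'http://example.com/shop/169464.json'   # hypothetical stand-in for the real URL
# A throwaway query parameter makes every request look unique to caches.
busted = url + ('&' if '?' in url else '?') + '_=%d' % int(time.time())
request = urllib2.Request(busted, headers={
    'Cache-Control': 'no-cache, max-age=0',
    'Pragma': 'no-cache',
})
content = urllib2.urlopen(request).read()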

Python Scripting Question

I have a question about writing a small tool that would display the headers for any website. I am new to Python, but I wanted to know whether there is anything other than encoding that I would have to account for in my code when developing the tool. I have a rough draft of my code shown below. Any pointers from the Python coders?
#!/usr/bin/python
import sys, urllib

if len(sys.argv) == 2:
    website = sys.argv[1]
    website = urllib.urlopen(sys.argv[1])
    if(website.code != 200):
        print "Something went wrong here"
        print website.code
        exit(0)
    print 'Printing the headers'
    print '-----------------------------------------'
    for header, value in website.headers.items():
        print header + ' : ' + value
Seems a fairly straightforward script (though this question seems like a better fit for Stack Overflow). A couple of comments: first, curl -I is a useful command-line tool to compare against. Second, even when you don't get a 200 status, there are often still useful content or headers you may want to display. E.g.,
$ curl -I http://security.stackexchange.com/asdf
HTTP/1.1 404 Not Found
Cache-Control: private
Content-Length: 24068
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=678b5b9c-0130-4398-9834-673475961dc6; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:24:00 GMT
Also note urllib follows redirects automatically. E.g., with curl you'll see:
$ curl -I http://www.security.stackexchange.com
HTTP/1.1 301 Moved Permanently
Content-Length: 157
Content-Type: text/html; charset=UTF-8
Location: http://security.stackexchange.com/
Date: Fri, 25 Apr 2014 07:26:52 GMT
while your tool will just give:
$ python user3567119.py http://www.security.stackexchange.com
Printing the headers
-----------------------------------------
content-length : 68639
set-cookie : prov=9bf4f3d4-e3ae-4161-8e34-9aaa83f0aa4b; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
expires : Fri, 25 Apr 2014 07:29:32 GMT
vary : *
last-modified : Fri, 25 Apr 2014 07:28:32 GMT
connection : close
cache-control : public, no-cache="Set-Cookie", max-age=60
date : Fri, 25 Apr 2014 07:28:31 GMT
x-frame-options : SAMEORIGIN
content-type : text/html; charset=utf-8
Third, if you continue playing around with HTTP requests in python, I highly recommend using requests. With requests, you'll be able to see the 301 if you do:
In [1]: import requests
In [2]: r=requests.get('http://www.security.stackexchange.com')
In [3]: r
Out[3]: <Response [200]>
In [4]: r.history
Out[4]: (<Response [301]>,)
It's also worth trying out some HTTP requests in just plain old telnet. E.g., telnet security.stackexchange.com 80 then quickly type:
GET / HTTP/1.1
Host: security.stackexchange.com
followed by a blank line. Then you'll see the actual HTTP response on the wire (instead of recreating it after urllib has processed the HTTP response):
HTTP/1.1 200 OK
Cache-Control: public, no-cache="Set-Cookie", max-age=60
Content-Type: text/html; charset=utf-8
Expires: Fri, 25 Apr 2014 07:38:37 GMT
Last-Modified: Fri, 25 Apr 2014 07:37:37 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=a75de1f2-678b-4a9d-bbfd-39e933e60237; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:37:36 GMT
Content-Length: 68849
<!DOCTYPE html>
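Putting those comments together, here is a sketch of the tool rewritten with requests (a hypothetical rework, not the asker's code): it prints headers even for non-200 responses and does not follow redirects, so the 301 above stays visible:
import sys
import requests

if len(sys.argv) == 2:
    r = requests.get(sys.argv[1], allow_redirects=False)
    print r.status_code, r.reason
    print '-----------------------------------------'
    for header, value in r.headers.items():
        print header + ' : ' + value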

Is there a way to tell if a page opened with Mechanize isn't returning "search results"?

I am using Mechanize to log in to a web site and make a search. After extracting the links/info I want, I then recursively move from the current page to the next, and to the next. What I'm wondering is if there's an easy way to tell -- based on header information, for instance -- whether I'm on a "No results found" or similar page. If so, I could quickly check the header for a "404" or no-results page and then return.
I couldn't find it in the documentation, and from what I can tell the answer is no. Can anyone here say more definitively, though, whether the answer is in fact no? Thanks in advance.
(Presently I just do a .find() for 'no results' after I .read() the link.)
NOTES:
1) Header Info for a "good" page (with results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: b501064808b265fc6e478fa88e622710
header: X-Runtime: 0.478829
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
2) Header Info for a "bad" page (no results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:11 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: 1ae89b2b25ba7983f8a48fa17f7a1798
header: X-Runtime: 0.127865
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
The response headers are generated by the server; you could add your own "no results" parameter and parse that... otherwise you have to analyze the content.
If you're set on using the headers, the only difference I can see between the two is that the bad search returned about 4x faster (X-Runtime) -- maybe you could keep a moving average of elapsed response times.
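If you do want to experiment with the timing idea, a rough sketch with mechanize might look like this (the URL and the 0.2s threshold are made up; the body check you already use stays as the fallback):
import mechanize

br = mechanize.Browser()
resp = br.open('http://example.com/search?q=foo')          # hypothetical search URL
runtime = float(resp.info().getheader('X-Runtime', '0'))   # server-side processing time
body = resp.read()
# Treat a very fast response or an explicit message as "no results".
no_results = runtime < 0.2 or 'no results' in body.lower()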

urllib2 python (Transfer-Encoding: chunked)

I used the following Python code to download the HTML page:
import urllib2

response = urllib2.urlopen(current_URL)
msg = response.read()
print msg
For a page such as this one, it opens the URL without error but then prints only part of the HTML page!
In the following lines you can find the HTTP headers of the page. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk! I am having difficulty reading the remaining chunks. How can I read them?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found out that if the Accept-Language header is specified, the server doesn't drop the TCP connection; otherwise it does.
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'
