I am trying to use urllib and urllib2 to scrape some data from a particular website.
The urllib section of the code was primarily for reading and processing the data, while the urllib2 section was mainly for reading and storing it.
The external site went through some changes, and while the urllib section kept working, the urllib2 section simply keeled over.
So I did some checks and noticed that urllib2.urlopen(URL) always returned an empty string, while urllib.urlopen(URL) always worked fine.
I dug deeper and enabled debug logging on both the urllib and urllib2 modules:
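(For reference, one way to produce the send/reply/header traces below, as a minimal sketch; the exact setup may differ from what I used:)

import httplib
import urllib
import urllib2

# httplib prints the send:/reply:/header: lines when debuglevel is set;
# urllib builds on httplib directly, so the class attribute covers it.
httplib.HTTPConnection.debuglevel = 1
# urllib2 takes its debuglevel through a handler instead.
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)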
>>> response2 = urllib2.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:12:28 GMT
header: Transfer-Encoding: chunked
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=F02B1F76CCCF6F41BE48951F6E1A6205; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close
And....
>>> html3 = urllib.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxltd.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:10:36 GMT
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:10:34 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=8CFB903B80C42CA3DA37EDF90D84FF99; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Date: Thu, 28 Nov 2013 19:10:35 GMT
header: Connection: close
As can be seen, the urllib2 flow carries more Connection headers: urllib2 sends HTTP/1.1 requests that each include Connection: close, while urllib sends plain HTTP/1.0 requests. Another difference is that the final 200 response to urllib2 carries Content-Length: 0, while the final response to urllib has no Content-Length header at all.
Can anyone help me figure out why urllib2 fails to retrieve the data while the urllib module works fine?
I am fairly certain it has something to do with these headers, but I would like some confirmation and an explanation of the reasoning.
Thanks.
I would suggest debugging using curl to replicate the headers the two versions of urllib are using. With a bit of trial and error you should be able to find the header causing the problem and go from there.
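If you'd rather stay in Python, a minimal httplib sketch (Python 2) gives the same control over every header in the traces above, which urllib2 itself does not (Connection: close and Accept-Encoding: identity get added automatically). Note it only performs the first request, so each Location redirect has to be followed by hand; the hostname is redacted here just as in the question:

import httplib

# Build the request line and headers explicitly so each suspect header
# from the urllib2 trace can be toggled independently.
conn = httplib.HTTPConnection('www.xxxxxxxxltd.com')
conn.putrequest('GET', '/web/guest/attendancelist',
                skip_host=True, skip_accept_encoding=True)
conn.putheader('Host', 'www.xxxxxxxxltd.com')
conn.putheader('User-Agent', 'Python-urllib/2.6')
# conn.putheader('Connection', 'close')          # toggle suspects like these
# conn.putheader('Accept-Encoding', 'identity')
conn.endheaders()
resp = conn.getresponse()
print resp.status, resp.getheaders()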
I spotted a weird warning in logs:
[WARNING] urllib3.connectionpool:467: Failed to parse headers (url=https://REDACTED): [MissingHeaderBodySeparatorDefect()], unparsed data: 'trol,Content-Type\r\n\r\n'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 465, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/response.py", line 91, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'trol,Content-Type\r\n\r\n'
This is from calling a standard requests.post() on a web service I fully control (a Python app behind nginx).
When I turn on debuglevel=1 in http.client I see this:
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx/1.18.0 (Ubuntu)
header: Date: Tue, 30 Nov 2021 22:14:04 GMT
header: Content-Type: application/json
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Vary: Accept-Encoding
header: Access-Control-Allow-Origin: *
header: Access-Control-Allow-Credentials: true
header: Access-Control-Allow-Methods: GET, POST, OPTIONS
header: Access-Control-Allow-Headers: DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Con
Note the last header ending abruptly in ,If-Modified-Since,Cache-Con.
Clearly, requests==2.26.0 (via urllib3==1.26.7 via http.client) cuts the last header in half for some reason during parsing, and then later complains it has "left over" data with the remaining trol,Content-Type\r\n\r\n.
In this case the warning is not critical, because the header is not really needed. But it's scary this is happening, because… what else is being cut / misparsed?
The same endpoint works fine from e.g. curl:
$ curl -i -XPOST https://REDACTED
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 04 Dec 2021 20:08:59 GMT
Content-Type: application/json
Content-Length: 53
Connection: keep-alive
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Con
trol,Content-Type
…JSON response…
Any idea what could be wrong? Many thanks.
Your webserver, or its configuration, looks broken. Have a look at what is generating that CORS Access-Control-Allow-Headers header because it is not permitted to contain a line break.
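To confirm it, a raw-socket sketch like this shows the response bytes exactly as the server sends them (the hostname is a placeholder, since the real one is redacted); a bare line break in the middle of the Access-Control-Allow-Headers value would be plainly visible in the output:

import socket
import ssl

host = 'redacted.example.com'  # placeholder for the real host
request = (
    'POST / HTTP/1.1\r\n'
    'Host: {}\r\n'
    'Content-Length: 0\r\n'
    'Connection: close\r\n'
    '\r\n'
).format(host).encode('ascii')

ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        tls.sendall(request)
        raw = b''
        while True:
            chunk = tls.recv(4096)
            if not chunk:
                break
            raw += chunk

# repr() makes any stray \n inside a header value visible.
print(repr(raw.split(b'\r\n\r\n', 1)[0]))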
This is a follow-up to a question I saw earlier today. In that question, a user asks about a problem downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I would think that the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests smart enough to wait for dynamic websites to render before downloading?
Edit
Following up on #cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the below functions write the same file for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where did requests get that cookie? Is requests making multiple trips to the server?
Edit 2
Cookie came from a redirect header:
>>> handler = urllib2.HTTPHandler(debuglevel=1)
>>> opener = urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl = urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
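(That also answers the cookie question above: requests keeps a cookie jar across the whole redirect chain, so the I2KBRCK=1 cookie set by the first 302 gets replayed on the ?cookieSet=1 redirect. A minimal sketch of the urllib2 equivalent:)

import cookielib
import urllib2

# A cookie-aware opener replays Set-Cookie values across redirects,
# so the chain should end at the PDF instead of /action/cookieAbsent.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
resp = opener.open('http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009')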
I'll bet that it's an issue with the User-Agent header (I just ran curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same thing you report from urllib2). This is the part of the request that lets a website know what type of program/user/whatever is accessing the site (it identifies the HTTP client, not you).
By default, it looks like urllib2 uses Python-urllib/x.y, where x.y is your Python version (Python-urllib/2.7 in your debug log above).
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'll bet that the site is reading Darwin/13.1.0 or similar and then serving you the macOS-appropriate content. Otherwise, it's probably trying to direct you to some default alternate content (or prevent you from scraping that URL).
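One quick way to test that theory, as a sketch (the User-Agent value here is just an example):

import urllib2

# Send a browser-style User-Agent to see whether the server branches on it.
req = urllib2.Request(
    'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
data = urllib2.urlopen(req).read()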
I am using Mechanize to log in to a web site and run a search. After extracting the links/info I want, I then recursively move from the current page to the next. What I'm wondering is whether there's an easy way to tell, based on header information for instance, that a "No results found" or similar page has come back. If so, I could quickly check the header for a "404" or a no-results page and then return.
I couldn't find anything in the documentation, and from what I can tell the answer is no. Can anyone here say more definitively, though, whether the answer is in fact no? Thanks in advance.
(Presently I just do a .find() for 'no results' after I .read() the link.)
NOTES:
1) Header Info for a "good" page (with results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: b501064808b265fc6e478fa88e622710
header: X-Runtime: 0.478829
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
2) Header Info from a "bad" (no results page)
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:11 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: 1ae89b2b25ba7983f8a48fa17f7a1798
header: X-Runtime: 0.127865
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
The response headers are generated by the server, so you could add your own "no results" header there and parse that; otherwise you have to analyze the content.
If you're set on using the headers, the only difference I can see between the two is that the bad search returned about 4x faster (X-Runtime: 0.127865 versus 0.478829), so maybe you could keep a moving average of elapsed response times.
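If you do want to lean on timing, a rough sketch (the threshold here is hypothetical and would need tuning against real traffic; this is a heuristic, not a reliable signal):

import urllib2

def probably_no_results(url, threshold=0.25):
    # The no-results page above reported X-Runtime: 0.127865 versus
    # 0.478829 for a page with results; the guess is based on that gap.
    resp = urllib2.urlopen(url)
    runtime = float(resp.info().getheader('X-Runtime', '0'))
    return runtime < threshold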
I used the following Python code to download the HTML page:

import urllib2

response = urllib2.urlopen(current_URL)  # current_URL holds the page's URL
msg = response.read()
print msg
For a page such as this one, it opens the URL without error but then prints only part of the HTML page!
In the following lines you can find the HTTP headers of the page. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk! I am having difficulty reading the remaining chunks. How can I read them?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found that if the Accept-Language header is specified, the server doesn't drop the TCP connection; otherwise it does.
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'
Trying to get a login script working, I kept getting the same login page returned, so I turned on debugging of the HTTP stream (I can't use Wireshark or the like because of HTTPS).
I got nothing, so I copied the example below, and it works. Any query to google.com shows debug output, but the query to my target page shows none. What is the difference? If it were a redirect I would expect to see at least the first GET and the redirect headers, and http://google.com redirects as well.
import urllib
import urllib2
import pdb
h = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
urllib2.install_opener(opener)
print '================================'
data = urllib2.urlopen('http://google.com').read()
print '================================'
data = urllib2.urlopen('https://google.com').read()
print '================================'
data = urllib2.urlopen('https://members.poolplayers.com/default.aspx').read()
print '================================'
data = urllib2.urlopen('https://google.com').read()
When I run it, I get this:
$ python ex.py
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Location: http://www.google.com/
header: Content-Type: text/html; charset=UTF-8
header: Date: Sat, 02 Jul 2011 16:20:11 GMT
header: Expires: Mon, 01 Aug 2011 16:20:11 GMT
header: Cache-Control: public, max-age=2592000
header: Server: gws
header: Content-Length: 219
header: X-XSS-Protection: 1; mode=block
header: Connection: close
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:12 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=4ca9123c4f8b617f:FF=0:TM=1309623612:LM=1309623612:S=o3GqHRj5_3BkKFuJ; expires=Mon, 01-Jul-2013 16:20:12 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=eZdXW-qQQC2fRrXps3HpzkGgeWbMCnyT_taxzdvW1icXS1KSM0SSYOL7B8-OPsw0eLLAbvCW863Viv9ICDj4VAL7dmHtF-gsPfro67IFN5SP6WyHHpLL7JsS_-MOvwSD; expires=Sun, 01-Jan-2012 16:20:12 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:14 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=d613768b3704482b:FF=0:TM=1309623614:LM=1309623614:S=xLxMwBVKEG_bb1bo; expires=Mon, 01-Jul-2013 16:20:14 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=im_KcHyhG2LrrGgLsQjYlwI93lFZa2jZjEYBzdn-xXEyQnoGo8xkP0234fROYV5DScfY_6UbbCJFtyP_V00Ji11kjZwJzR63LfkLoTlEqiaY7FQCIky_8hA2NEqcXwJe; expires=Sun, 01-Jan-2012 16:20:14 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
================================
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:16 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=dc2cb55e6476c555:FF=0:TM=1309623616:LM=1309623616:S=o__g-Zcpts392D9_; expires=Mon, 01-Jul-2013 16:20:16 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=R5gy1aTMjL8pghxQmfUkJaMLc3SxmpFxu5XpoZELAsZrdf8ogQLwyo9Vbk_pRkoETvKE-beWbHHBZu3xgJDt6IsjwmSHPaMGSzxXvsWERxsbKwQMy-wlLSfasvUq5x6q; expires=Sun, 01-Jan-2012 16:20:16 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
You'll need an HTTPSHandler:
h = urllib2.HTTPSHandler(debuglevel=1)
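Put together with the original script, a minimal sketch:

import urllib2

# HTTPHandler only traces http:// traffic; https:// URLs go through
# HTTPSHandler, so install both with debuglevel=1.
http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)

data = urllib2.urlopen('https://members.poolplayers.com/default.aspx').read()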