Downloading file with urllib2 vs requests: Why are these outputs different? - python

This is a follow-up to a question I saw earlier today, in which a user asks about a problem downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I would think that the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests smart enough to wait for dynamic websites to render before downloading?
Edit
Following up on @cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the functions below write the same file for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where did requests get that cookie? Is requests making multiple trips to the server?
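One way to check from the requests side (a minimal sketch, assuming requests is installed; resp.history holds any redirect responses that were followed, and resp.request is the final prepared request):

import requests

resp = requests.get('http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009')
for hop in resp.history:                              # intermediate redirect responses, if any
    print hop.status_code, hop.headers.get('Set-Cookie')
print resp.request.headers.get('Cookie')              # the cookie actually sent on the final request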
Edit 2
The cookie came from a Set-Cookie header on a redirect response:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding

I'll bet that it's an issue with the User-Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same thing you report with urllib2). This is part of the request headers that let a website know what type of program/user/whatever is accessing the site (it belongs to the HTTP request, not to the library).
By default, it looks like urllib2 uses Python-urllib/<Python version> (Python-urllib/2.7 in your debug trace).
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'll bet that the site is reading Darwin/13.1.0 or similar and then serving you the macOS-appropriate content. Otherwise, it's probably trying to direct you to some default alternate content (or prevent you from scraping that URL).
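To test that hypothesis, a minimal sketch (Python 2) that sends a browser-style User-Agent through urllib2; the UA string here is just a placeholder value:

import urllib2

url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # placeholder browser-style UA
resp = urllib2.urlopen(req)
print resp.info().getheader('Content-Type')                        # compare with what requests gets back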

Related

Downloaded PDF is corrupted, how can I download it correctly using Python?

I am trying to download a PDF using requests (Python 2.7), and this is the code I am using:
file_resp = requests.post(file_url, data=payload, headers={"Referer": file_referer_url})
with open('test.pdf', 'wb') as f:
    f.write(file_resp.content)
But the downloaded PDF is corrupted. These are the response headers I got:
Cache-Control: no-cache, no-store
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Tue, 23 Jun 2020 05:49:32 GMT
Expires: -1
Pragma: no-cache
Server: Microsoft-IIS/8.5
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-AspNet-Version: 4.0.30319
x-frame-options: SAMEORIGIN
X-Powered-By: ASP.NET
Also, the response body is something like this:
JVBERi0xLjQNCiW0tba3DQoxIDAgb2JqDQo8P...(long sequence like this...)
Please, can anyone help me with what I might be doing wrong?
The response body is a Base64-encoded string (the leading JVBERi0x decodes to %PDF-1), so simply decode the data with base64.b64decode and write the result to a file.
import base64

with open('application.pdf', 'wb') as file_to_save:
    decoded_pdf_data = base64.b64decode(file_resp.content)
    file_to_save.write(decoded_pdf_data)
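As a quick sanity check (a sketch, reusing file_resp from the question above), the decoded bytes should start with the PDF magic number:

import base64

decoded_pdf_data = base64.b64decode(file_resp.content)
print decoded_pdf_data[:5]   # expect '%PDF-' if the decode produced a real PDF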

urllib2.urlopen failing while urllib.urlopen working on same URL

I am trying to use urllib and urllib2 to scrape some data from a particular website.
The urllib section was primarily for reading and processing the data, while the urllib2 section was mainly for reading and storing the data.
The external site experienced some changes, and while the urllib code section kept working, the urllib2 section simply keeled over.
So I did some checks and noticed that urllib2.urlopen(URL) always returned a blank string, while urllib.urlopen(URL) always worked OK.
I dug deeper and enabled debug logging on both the urllib and urllib2 modules:
>>> response2 =urllib2.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:12:28 GMT
header: Transfer-Encoding: chunked
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=F02B1F76CCCF6F41BE48951F6E1A6205; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close
And....
>>> html3=urllib.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxltd.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:10:36 GMT
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:10:34 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=8CFB903B80C42CA3DA37EDF90D84FF99; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Date: Thu, 28 Nov 2013 19:10:35 GMT
header: Connection: close
As can be seen, the urllib2 connection flow carries more headers, one of which is a Connection header with the value close.
Can anyone help me figure out why urllib2 fails to retrieve the data while the urllib module works fine?
I am certain it has something to do with the Connection headers, but I would like some confirmation and an explanation of the reasoning.
Thanks.
I would suggest debugging using curl to replicate the headers the two versions of urllib are using. With a bit of trial and error you should be able to find the header causing the problem and go from there.
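If you'd rather stay in Python, a minimal sketch of the same idea (Python 2; the hostname is the asker's placeholder): replay the request through urllib2 with a header value borrowed from the working urllib request and see whether the body comes back non-empty.

import urllib2

url = 'http://www.xxxxxxxxltd.com/web/guest/attendancelist'
req = urllib2.Request(url, headers={'User-Agent': 'Python-urllib/1.17'})  # mimic the urllib default UA
body = urllib2.urlopen(req).read()
print len(body)   # non-zero would point at the User-Agent; zero means try the other differences in the traces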

Is there a way to tell if a page opened with Mechanize isn't returning "search results"?

I am using Mechanize to log in to a web site and make a search. After extracting the links/info I want, I then recursively move from the current page to the next. What I'm wondering is whether there's an easy way to tell -- based on header information, for instance -- that I've hit a "No results found" or similar page. If so, I could quickly check the header for a "404" or no-results page and then return.
I couldn't find it in the documentation, and from what I can tell the answer is no. Can anyone here say more definitively, though, whether the answer is in fact no? Thanks in advance.
(Presently I just do a .find() for 'no results' after I .read() the link.)
NOTES:
1) Header Info for a "good" page (with results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: b501064808b265fc6e478fa88e622710
header: X-Runtime: 0.478829
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
2) Header Info from a "bad" (no results page)
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:11 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: 1ae89b2b25ba7983f8a48fa17f7a1798
header: X-Runtime: 0.127865
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
The response headers are generated by the server; you could add your own 'no results' header on the server side and parse that. Otherwise you have to analyze the content.
If you're set on using the headers, the only difference I can see between the two is that the bad search returned about 4x faster (X-Runtime: 0.13 vs 0.48), so maybe you could track a moving average of elapsed response times.
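For reference, the content check the asker already does might look like this with mechanize (a sketch; the URL and the 'no results' marker text are placeholders for whatever the target site uses):

import mechanize

br = mechanize.Browser()
resp = br.open('http://example.com/search?q=term')   # placeholder search URL
html = resp.read()
if 'no results' in html.lower():                      # site-specific marker text
    print 'empty results page'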

Python Scraping Web with Session Cookie

Hi, I am trying to scrape some data from this URL:
http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1
As you may have noticed, if cookies and session data are not yet set, you will be redirected to the base URL (http://www.21cineplex.com/).
I tried to do it like this:
import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)
        print splitSource
    except Exception, e:
        print "Error occurred in main block:", e
However, I ended up failing to scrape that particular URL.
A quick inspection reveals that the website sets a session ID (PHPSESSID) and copies it to the client's cookie.
The question is: how do I handle a case like this?
PS: I've tried to install request via pip; however, it gives me a 404:
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Getting page https://pypi.python.org/simple/
URLs to search for versions for request:
* https://pypi.python.org/simple/request/
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Could not find any downloads that satisfy the requirement request
Cleaning up...
Thanks to @Chainik I got it to work now. I ended up modifying my code like this:
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'
opener.open(baseurl)
urllib2.install_opener(opener)
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
Once the HTML text is retrieved, it's all about parsing its content.
Cheers
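For reference, the same approach with the requests library (the PyPI package is named requests, not request, which is why the pip install above 404s) might look like this (a sketch):

import requests

baseurl = 'http://www.21cineplex.com/'
target = 'http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1'

session = requests.Session()                      # a Session keeps PHPSESSID across requests
session.get(baseurl)                              # visit the base URL first to pick up the cookie
resp = session.get(target, headers={'Referer': baseurl})
html_text = resp.text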
Try setting a Referer URL; see below.
Without referer URL set (302 redirect):
$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/
With referer URL set (HTTP/200):
$ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
To set referer URL using urllib, see this post
-- ab1

how can I get complete header info from a urllib2 request?

I am using the Python urllib2 library for opening a URL, and what I want is to get the complete header info of the request. When I use response.info() I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by Live HTTP Headers (a Firefox add-on), e.g.:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import urllib
import urllib2
import cookielib

def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    print response.info()  # this does not give complete header info; how can I get complete header info?
    return response.read()
url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect, but see the headers from the first page instead, use a URLopener instead of a FancyURLopener (the latter automatically follows redirects).
I see that server returns HTTP/1.1 302 Found - HTTP redirect.
urllib automatically follows redirects, so the headers returned by urllib are the headers from http://www.trucks.com.mt, not from http://www.yellowpages.com.mt/Malta-Web/127151.aspx.
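If you want the 302's own headers with urllib2 rather than those of the page it redirects to, one approach (a sketch) is a redirect handler that refuses to follow; urlopen then raises an HTTPError that carries the redirect response's headers:

import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None                               # returning None means the redirect is not followed

opener = urllib2.build_opener(NoRedirectHandler())
try:
    opener.open('http://www.yellowpages.com.mt/Malta-Web/127151.aspx')
except urllib2.HTTPError as e:                     # the 302 now surfaces as an error
    print e.code                                   # 302
    print e.hdrs                                   # headers of the redirect response, including Location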
