I need to scrape all the data available on this website.
Here is my code:
import requests
url = "https://bpcleproc.in/EPROC/viewtender/13474"
r = requests.get(url)
print(r)
print(r.text)
The result I get is:
<Response [200]>
<script type="text/javascript">
window.location.href = "/EPROC/viewtender/13474";
</script>
I don't understand why this isn't working. Is the page generated by JavaScript?
The website sets a bunch of cookies in a redirect loop before bringing you to your final page, for some reason:
$ curl -vv https://bpcleproc.in/EPROC/viewtender/13474
* Trying 103.231.215.7...
* Connected to bpcleproc.in (103.231.215.7) port 443 (#0)
* TLS 1.0 connection using TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
* Server certificate: *.bpcleproc.in
* Server certificate: Go Daddy Secure Certificate Authority - G2
* Server certificate: Go Daddy Root Certificate Authority - G2
> GET /EPROC/viewtender/13474 HTTP/1.1
> Host: bpcleproc.in
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Wed, 25 May 2016 16:43:42 GMT
< Server: Apache-Coyote/1.1
< Cache-Control: no-store
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Pragma: no-cache
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Location: http://bpcleproc.in/EPROC/setcookie
< Content-Type: UTF-8;charset=UTF-8
< Content-Language: 1
< Content-Length: 0
< Set-Cookie: JSESSIONID=D2843B0E183A813195650A281A9FCC7D.tomcat2; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: locale=1; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: moduleId=3; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: isShowDcBanner=1; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: listingStyle=1; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: logo=Untitled.jpg; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: theme=theme-1; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: dateFormat=DD/MM/YYYY; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: conversionValue=103; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: eprocLogoName=1; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: phoneNo=07940016868; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
< Set-Cookie: email="support#bpcleproc.in"; Version=1; Max-Age=86400; Expires=Thu, 26-May-2016 16:43:42 GMT; Path=/EPROC/; Secure; HttpOnly
<
* Connection #0 to host bpcleproc.in left intact
You can use requests.Session to emulate that behavior:
import requests
session = requests.Session()
# First, get the cookies.
# The session keeps track of cookies and requests follows redirects for you
r = session.get("https://bpcleproc.in/EPROC/viewtender/13474")
# Then, simulate following the JS redirect
r = session.get("https://bpcleproc.in/EPROC/viewtender/13474")
print(r)
print(r.text)
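If the site ever needs more than one extra request, the retry can be written as a loop. This is a minimal sketch, assuming the intermediate page always looks like the snippet in the question (a script that assigns window.location.href). The helper names js_redirect_target and fetch_following_js_redirects are made up for this example, and session is assumed to be anything with a requests.Session-style get():

```python
import re
from urllib.parse import urljoin

# Matches: window.location.href = "/some/path";
_JS_REDIRECT = re.compile(r'window\.location\.href\s*=\s*["\']([^"\']+)["\']')


def js_redirect_target(html):
    """Return the URL a JS-redirect stub points to, or None if there is none."""
    match = _JS_REDIRECT.search(html)
    return match.group(1) if match else None


def fetch_following_js_redirects(session, url, max_hops=5):
    """Fetch url, re-requesting while the body is a JS-redirect stub.

    The session keeps the cookies set on earlier hops, which is what the
    site appears to require before it serves the real page.
    """
    response = session.get(url)
    for _ in range(max_hops):
        target = js_redirect_target(response.text)
        if target is None:
            break
        # The target may be relative, so resolve it against the current URL.
        response = session.get(urljoin(response.url, target))
    return response
```

With requests this would be called as fetch_following_js_redirects(requests.Session(), url). The regex is only a heuristic: a real page that assigns window.location.href in its own scripts would match too, which is why max_hops bounds the loop.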
I'm trying to find pages in my network whose full download size is too big, say bigger than 10-20 MiB.
I already know how to crawl. I need something that finds out the size of everything a browser would download for each page, preferably without actually downloading it, though that condition is of minor importance.
Preferably in Python, but if not, at least something I could use inside a bash script (for example curl or wget), which I would call from Python.
For more context: right now I'm using requests and Beautiful Soup in Python to crawl and check the response status of all the web pages.
You can try this:
curl --head https://www.instagram.com
It will give this result:
HTTP/1.1 200 OK
Content-Type: text/html
X-Frame-Options: SAMEORIGIN
Cache-Control: private, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Vary: Cookie, Accept-Language, Accept-Encoding
Content-Language: en
Date: Mon, 23 Jul 2018 17:05:14 GMT
Strict-Transport-Security: max-age=60
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: sessionid=""; Domain=.instagram.com; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: rur=FTW; Domain=.instagram.com; Path=/
Set-Cookie: csrftoken=Y0WEjvNDGdQXAU7YQoUNsVjSodMT6cOZ; Domain=.instagram.com; expires=Mon, 22-Jul-2019 17:05:14 GMT; Max-Age=31449600; Path=/; Secure
Set-Cookie: mid=W1YKygAEAAGowaTCPQjEP25_NhqF; Domain=.instagram.com; expires=Sun, 18-Jul-2038 17:05:14 GMT; Max-Age=630720000; Path=/
Set-Cookie: mcd=3; Domain=.instagram.com; Path=/
Connection: keep-alive
Content-Length: 21754
The Content-Length header on the last line is the information you need: the size of the response body in bytes.
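Since the question asks for Python, the same HEAD request can be sketched with the standard library. This assumes the server answers HEAD and sends a Content-Length header at all (chunked responses often omit it), and it only measures the one resource requested, not the images, scripts, and stylesheets a browser would also fetch. The function names content_length and head_size are made up for this example:

```python
import urllib.request


def content_length(headers):
    """Return the Content-Length header as an int, or None if missing or invalid."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def head_size(url, timeout=10):
    """Send a HEAD request (no body is downloaded) and report the declared size."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return content_length(response.headers)
```

To approximate the full page weight, one option is to parse the HTML for linked resources and sum head_size() over each of them, treating None as unknown.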
I am trying to get JSON response from this URL.
But the JSON I see in the browser is different than what I get from python's requests response.
The code and its output:
#code
import requests
r = requests.get("https://www.bigbasket.com/product/get-products/?slug=fruits-vegetables&page=1&tab_type=[%22all%22]&sorted_on=popularity&listtype=pc")
print("Status code: ", r.status_code)
print("JSON: ", r.json())
print("Headers:\n", r.headers)
#output
Status code: 200
JSON: '{"cart_info": {}, "tab_info": [], "screen_name": ""}'
Headers:
{'Content-Type': 'application/json',
'Content-Length': '52',
'Server': 'nginx',
'x-xss-protection': '1; mode=block',
'x-content-type-options': 'nosniff',
'x-frame-options': 'SAMEORIGIN',
'Access-Control-Allow-Origin': 'https://b2b.bigbasket.com',
'Date': 'Sat, 02 Sep 2017 18:43:51 GMT',
'Connection': 'keep-alive',
'Set-Cookie': '_bb_cid=4; Domain=.bigbasket.com; expires=Fri, 28-Aug-2037 18:43:51 GMT; Max-Age=630720000; Path=/, ts="2017-09-03 00:13:51.164"; Domain=.bigbasket.com; expires=Sun, 02-Sep-2018 18:43:51 GMT; Max-Age=31536000; Path=/, _bb_rd=6; Domain=.bigbasket.com; expires=Sun, 02-Sep-2018 18:43:51 GMT; Max-Age=31536000; Path=/'}
This is what Chrome shows in dev tools:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 4206
Server: nginx
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
Content-Encoding: gzip
x-frame-options: SAMEORIGIN
Access-Control-Allow-Origin: https://b2b.bigbasket.com
Date: Sat, 02 Sep 2017 15:43:20 GMT
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: ts="2017-09-02 21:13:20.193"; Domain=.bigbasket.com; expires=Sun, 02-Sep-2018 15:43:20 GMT; Max-Age=31536000; Path=/
Set-Cookie: _bb_rd=6; Domain=.bigbasket.com; expires=Sun, 02-Sep-2018 15:43:20 GMT; Max-Age=31536000; Path=/
I also tried separating the query string and passing it as the params argument, but it gives the same result.
import requests
s = requests.session()
s.get("https://www.bigbasket.com/product/get-products/?slug=fruits-vegetables&page=1&tab_type=[%22all%22]&sorted_on=popularity&listtype=pc")
r = s.get("https://www.bigbasket.com/product/get-products/?slug=fruits-vegetables&page=1&tab_type=[%22all%22]&sorted_on=popularity&listtype=pc")
print("Status code: ", r.status_code)
print("JSON: ", r.json())
This happens because the site identifies a different city ID for your web browser than for Requests.
You can compare the value of the _bb_cid cookie in both cases.
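One way to act on this is to read the _bb_cid value out of the Set-Cookie header and to send the browser's value explicitly. This is a rough sketch using only the standard library. The helper names are made up for this example, and the city ID passed in is a placeholder: the real value has to be copied from the browser's cookies.

```python
import re
import urllib.request


def bb_cid_from_set_cookie(header):
    """Extract the _bb_cid value from a Set-Cookie header string, or None."""
    match = re.search(r'(?:^|[;,]\s*)_bb_cid=([^;,\s]+)', header)
    return match.group(1) if match else None


def request_with_city(url, city_id):
    """Build a request that sends _bb_cid as a cookie.

    city_id is an assumption: copy the value your browser sends.
    """
    request = urllib.request.Request(url)
    request.add_header("Cookie", "_bb_cid={}".format(city_id))
    return request
```

With requests, the equivalent is passing cookies={"_bb_cid": "<value from browser>"} to requests.get(). Whether the site returns the full product list for a given city ID is not something the source confirms.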
I turned on debug logging for requests and requested a URL:
import logging
import requests
from http.client import HTTPConnection
HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
requests.get('http://uae.souq.com/ae-ar/apple-iphone-7-with-facetime-32gb-4g-lte-gold-11526690/i/')
It outputs:
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP
connection (1): uae.souq.com send: b'GET
/ae-ar/apple-iphone-7-with-facetime-32gb-4g-lte-gold-11526690/i/
HTTP/1.1\r\nHost: uae.souq.com\r\nAccept: */*\r\nUser-Agent:
python-requests/2.12.4\r\nConnection: keep-alive\r\nAccept-Encoding:
gzip, deflate\r\n\r\n' reply: 'HTTP/1.1 301 Moved Permanently\r\n'
DEBUG:requests.packages.urllib3.connectionpool:http://uae.souq.com:80
"GET /ae-ar/apple-iphone-7-with-facetime-32gb-4g-lte-gold-11526690/i/
HTTP/1.1" 301 None
WARNING:requests.packages.urllib3.connectionpool:Failed to parse
headers
(url=http://uae.souq.com:80/ae-ar/apple-iphone-7-with-facetime-32gb-4g-lte-gold-11526690/i/):
[MissingHeaderBodySeparatorDefect()], unparsed data:
'ع-Ù\x81Ù\x8aس-تاÙ\x8aÙ\x85-32-جÙ\x8aجا-اÙ\x84جÙ\x8aÙ\x84-اÙ\x84رابع-اÙ\x84-تÙ\x8a-اÙ\x8a-Ø°Ù\x87بÙ\x8a-11526690/i/\r\nPragma:
no-cache\r\nServer: Apache\r\nStrict-Transport-Security:
max-age=0\r\nVary: Accept-Encoding\r\nX-Content-Type-Options:
nosniff\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Length: 20\r\nDate:
Wed, 08 Mar 2017 16:20:30 GMT\r\nConnection: keep-alive\r\nSet-Cookie:
PLATEFORMC=ae; expires=Thu, 08-Mar-2018 16:20:30 GMT; path=/;
domain=.souq.com\r\nSet-Cookie:
PHPSESSID=gdbudqf2d734i499du12qac0bofbjjvo; path=/; domain=.souq.com;
HttpOnly\r\nSet-Cookie: PLATEFORML=ar; expires=Thu, 08-Mar-2018
16:20:30 GMT; path=/; domain=.souq.com\r\nSet-Cookie:
c_Ident=14889900303248; expires=Sat, 06-Mar-2027 16:20:30 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: registration_source=deleted;
expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: registration_failed=deleted;
expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: login_source=deleted; expires=Thu,
01-Jan-1970 00:00:01 GMT; path=/; domain=.souq.com\r\nSet-Cookie:
login_failed=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: BUYER_CITY_SELECTED=89; expires=Fri,
07-Apr-2017 16:20:30 GMT; path=/\r\nSet-Cookie: UserViews=11526690;
expires=Fri, 07-Apr-2017 16:20:30 GMT; path=/\r\nSet-Cookie:
NSC_tpvr-72.52.8.197-80=ffffffff2d81ae8345525d5f4f58455e445a4a423660;path=/;httponly\r\n\r\n' Traceback (most recent call last): File
"/Users/Sona/work/projects/1688Crawler/myenv/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py",
line 403, in _make_request
assert_header_parsing(httplib_response.msg) File "/Users/Sona/work/projects/1688Crawler/myenv/lib/python3.5/site-packages/requests/packages/urllib3/util/response.py",
line 66, in assert_header_parsing
raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
requests.packages.urllib3.exceptions.HeaderParsingError:
[MissingHeaderBodySeparatorDefect()], unparsed data:
'ع-Ù\x81Ù\x8aس-تاÙ\x8aÙ\x85-32-جÙ\x8aجا-اÙ\x84جÙ\x8aÙ\x84-اÙ\x84رابع-اÙ\x84-تÙ\x8a-اÙ\x8a-Ø°Ù\x87بÙ\x8a-11526690/i/\r\nPragma:
no-cache\r\nServer: Apache\r\nStrict-Transport-Security:
max-age=0\r\nVary: Accept-Encoding\r\nX-Content-Type-Options:
nosniff\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Length: 20\r\nDate:
Wed, 08 Mar 2017 16:20:30 GMT\r\nConnection: keep-alive\r\nSet-Cookie:
PLATEFORMC=ae; expires=Thu, 08-Mar-2018 16:20:30 GMT; path=/;
domain=.souq.com\r\nSet-Cookie:
PHPSESSID=gdbudqf2d734i499du12qac0bofbjjvo; path=/; domain=.souq.com;
HttpOnly\r\nSet-Cookie: PLATEFORML=ar; expires=Thu, 08-Mar-2018
16:20:30 GMT; path=/; domain=.souq.com\r\nSet-Cookie:
c_Ident=14889900303248; expires=Sat, 06-Mar-2027 16:20:30 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: registration_source=deleted;
expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: registration_failed=deleted;
expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: login_source=deleted; expires=Thu,
01-Jan-1970 00:00:01 GMT; path=/; domain=.souq.com\r\nSet-Cookie:
login_failed=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/;
domain=.souq.com\r\nSet-Cookie: BUYER_CITY_SELECTED=89; expires=Fri,
07-Apr-2017 16:20:30 GMT; path=/\r\nSet-Cookie: UserViews=11526690;
expires=Fri, 07-Apr-2017 16:20:30 GMT; path=/\r\nSet-Cookie:
NSC_tpvr-72.52.8.197-80=ffffffff2d81ae8345525d5f4f58455e445a4a423660;path=/;httponly\r\n\r\n'
You can see that it outputs the warning "Failed to parse headers" and then just hangs.
I just want to get the response for that URL with requests. What should I do?
ENV:
Python 3.5.2 (default, Aug 16 2016, 05:35:40)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
requests (2.12.4) Failed
requests (2.13.0) Failed
I'm trying to write an HTTP client in Python using the socket library and can't get the receive part working.
Here is my code:
import socket, sys
class httpBase:
    def __init__(self, host, port=80):
        self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.s.connect((host, port))
    def send(self, msg):
        self.s.sendall(msg)
    def recive(self):
        data = ''
        while 1:
            Tdata = self.s.recv(128)
            print("||" + data + "|")
            data += Tdata
            if data.decode() == '': break
        return data
http = httpBase('www.google.com')
http.send('GET / HTTP/1.1\r\n\r\n'.encode())
print(http.recive())
The problem is the response: without the print inside the recive function I get nothing back. The code just waits and I have to force-stop it.
Here is the response from Google:
|||
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8h3zii3-ibcyo8zdcKg8WmJjbYYr_hCX4NWWvMTCw1dVwTHKtJbo1M6ay977MwX5hswJ6XeadRFIpd5Pe4La2HBRF; expires=Fri, 12-Sep-2014 16:40:26 GMT;|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8h3zii3-ibcyo8zdcKg8WmJjbYYr_hCX4NWWvMTCw1dVwTHKtJbo1M6ay977MwX5hswJ6XeadRFIpd5Pe4La2HBRF; expires=Fri, 12-Sep-2014 16:40:26 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8h3zii3-ibcyo8zdcKg8WmJjbYYr_hCX4NWWvMTCw1dVwTHKtJbo1M6ay977MwX5hswJ6XeadRFIpd5Pe4La2HBRF; expires=Fri, 12-Sep-2014 16:40:26 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Date: Thu, 13 Mar 2014 16:40:26 GMT
Server: gws
Content-Length: 277
X-XSS-Protection:|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8h3zii3-ibcyo8zdcKg8WmJjbYYr_hCX4NWWvMTCw1dVwTHKtJbo1M6ay977MwX5hswJ6XeadRFIpd5Pe4La2HBRF; expires=Fri, 12-Sep-2014 16:40:26 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Date: Thu, 13 Mar 2014 16:40:26 GMT
Server: gws
Content-Length: 277
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
<HTML><HEAD><meta http-equiv="content-type" content=|
||HTTP/1.1 302 Found
Location: http://www.google.co.il/?gfe_rd=ctrl&ei=et8hU-qsFaXY8ge6moCYAg&gws_rd=cr
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=502cb60127440cb1:FF=0:TM=1394728826:LM=1394728826:S=gXXQi28MXZy3d-U7; expires=Sat, 12-Mar-2016 16:40:26 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=pnIbo1mi1JNuqB9sxTHn41_sdPg6Za-1nQLp_Wk8h3zii3-ibcyo8zdcKg8WmJjbYYr_hCX4NWWvMTCw1dVwTHKtJbo1M6ay977MwX5hswJ6XeadRFIpd5Pe4La2HBRF; expires=Fri, 12-Sep-2014 16:40:26 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Date: Thu, 13 Mar 2014 16:40:26 GMT
Server: gws
Content-Length: 277
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.g|
This looks like the same problem as in "Python Recv() stalling": the request uses HTTP/1.1, where connections are persistent by default, so the server does not close the connection after the response and recv() keeps waiting for more data. See there for details.
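One common way around the stall is to ask the server to close the connection and then read until recv() returns an empty result. A minimal Python 3 sketch follows. The function names are made up for this example, and it assumes the server honours Connection: close; the alternative is to parse Content-Length and stop after that many body bytes. The loop checks the latest chunk, not the accumulated buffer, so it ends as soon as the peer closes:

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET that asks the server to close afterwards."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),   # Host is mandatory in HTTP/1.1
        "Connection: close",       # opt out of the persistent connection
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def recv_all(sock, bufsize=4096):
    """Read until the peer closes the connection (recv returns b'')."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def fetch(host, port=80, path="/"):
    """Send one GET and return the raw response bytes (headers and body)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        return recv_all(sock)
```

Calling fetch('www.google.com') would then return the whole 302 response shown above as a single bytes object.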
Trying to get a login script working, I kept getting the same login page returned, so I turned on debugging of the HTTP stream (I can't use Wireshark or the like because of HTTPS).
I got nothing, so I copied the example, and it works. Any query to google.com shows debug output, but my target page does not. What is the difference? If it were a redirect I would expect to see the first GET/redirect header, and http://google.com redirects as well.
import urllib
import urllib2
import pdb
h=urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
urllib2.install_opener(opener)
print '================================'
data = urllib2.urlopen('http://google.com').read()
print '================================'
data = urllib2.urlopen('https://google.com').read()
print '================================'
data = urllib2.urlopen('https://members.poolplayers.com/default.aspx').read()
print '================================'
data = urllib2.urlopen('https://google.com').read()
When I run it, I get this:
$ python ex.py
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Location: http://www.google.com/
header: Content-Type: text/html; charset=UTF-8
header: Date: Sat, 02 Jul 2011 16:20:11 GMT
header: Expires: Mon, 01 Aug 2011 16:20:11 GMT
header: Cache-Control: public, max-age=2592000
header: Server: gws
header: Content-Length: 219
header: X-XSS-Protection: 1; mode=block
header: Connection: close
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:12 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=4ca9123c4f8b617f:FF=0:TM=1309623612:LM=1309623612:S=o3GqHRj5_3BkKFuJ; expires=Mon, 01-Jul-2013 16:20:12 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=eZdXW-qQQC2fRrXps3HpzkGgeWbMCnyT_taxzdvW1icXS1KSM0SSYOL7B8-OPsw0eLLAbvCW863Viv9ICDj4VAL7dmHtF-gsPfro67IFN5SP6WyHHpLL7JsS_-MOvwSD; expires=Sun, 01-Jan-2012 16:20:12 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:14 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=d613768b3704482b:FF=0:TM=1309623614:LM=1309623614:S=xLxMwBVKEG_bb1bo; expires=Mon, 01-Jul-2013 16:20:14 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=im_KcHyhG2LrrGgLsQjYlwI93lFZa2jZjEYBzdn-xXEyQnoGo8xkP0234fROYV5DScfY_6UbbCJFtyP_V00Ji11kjZwJzR63LfkLoTlEqiaY7FQCIky_8hA2NEqcXwJe; expires=Sun, 01-Jan-2012 16:20:14 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
================================
================================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 02 Jul 2011 16:20:16 GMT
header: Expires: -1
header: Cache-Control: private, max-age=0
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=dc2cb55e6476c555:FF=0:TM=1309623616:LM=1309623616:S=o__g-Zcpts392D9_; expires=Mon, 01-Jul-2013 16:20:16 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=48=R5gy1aTMjL8pghxQmfUkJaMLc3SxmpFxu5XpoZELAsZrdf8ogQLwyo9Vbk_pRkoETvKE-beWbHHBZu3xgJDt6IsjwmSHPaMGSzxXvsWERxsbKwQMy-wlLSfasvUq5x6q; expires=Sun, 01-Jan-2012 16:20:16 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
You'll need an HTTPSHandler:
h = urllib2.HTTPSHandler(debuglevel=1)
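The debug level is set per handler, so the HTTP handler in the question only covers http:// URLs. To see both kinds of traffic, both handlers can go into one opener. The sketch below is the Python 3 spelling (urllib.request instead of urllib2); in Python 2 the same two classes live in urllib2. The function name make_debug_opener is made up for this example:

```python
import urllib.request


def make_debug_opener(debuglevel=1):
    """Return an opener that prints the wire traffic for both HTTP and HTTPS."""
    http_handler = urllib.request.HTTPHandler(debuglevel=debuglevel)
    https_handler = urllib.request.HTTPSHandler(debuglevel=debuglevel)
    # build_opener replaces its default handlers with the instances passed in.
    return urllib.request.build_opener(http_handler, https_handler)
```

Passing the result to urllib.request.install_opener() makes later urlopen() calls use it, mirroring the install_opener call in the question.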