I'm using Mechanize in Python to perform some web scraping. Most of the website works but one particular page doesn't return any Content or Response.
My settings are
self._browser = mechanize.Browser()
self._browser.set_handle_refresh(True)
self._browser.set_debug_responses(True)
self._browser.set_debug_redirects(True)
self._browser.set_debug_http(True)
and the code to execute is:
response = self._browser.open(url)
This is the debug output:
add_cookie_header
Checking xyz.com for cookies to return
- checking cookie path=/
- checking cookie <Cookie ASP.NET_SessionId=j3pg0wnavh3yjseyj1v3mr45 for xyz.com/>
it's a match
send: 'GET /page.aspx?leagueID=39 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: xyz.com\r\nCookie: ASP.NET_SessionId=aapg9wnavh3yqyrtg1v3ar45\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 07 Feb 2012 19:04:37 GMT
header: Pragma: no-cache
header: Expires: -1
header: Connection: close
header: Cache-Control: no-cache
header: Content-Length: 0
extract_cookies: Date: Tue, 07 Feb 2012 19:04:37 GMT
Pragma: no-cache
Expires: -1
Connection: close
Cache-Control: no-cache
Content-Length: 0
I've tried with and without Redirect to no avail. Any ideas?
I might add the page works fine in a browser.
The procedure to find out what's the problem usually is this one:
Capture your web browser traffic when successfully opening the url
Capture python traffic when trying to open the url
For the first step, there are many tools available. For example, in firefox, HttpFox and Live HTTP Headers might be quite useful.
For the second step, programmatically logging the headers being sent/received should be enough.
For both steps, you can also capture traffic in your network card with something like wireshark.
Related
I'm working on a python script which works with a JSON returned by an URL.
Since a couple of days urllib2 returns (just sometimes) an old state of the JSON.
I did add the headers "Cache-Control":"max-age=0" etc. still it sometimes happen.
If I print out the request info I get:
Server: nginx/1.8.0
Date: Thu, 03 Sep 2015 17:02:47 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3539
Status: 200 OK
X-XHR-Current-Location: /shop/169464.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "b1fbe7a01e0832025a3afce23fc2ab56"
X-Request-Id: 4cc0d399f943ad09a903f18a6ce1c488
X-Runtime: 0.123033
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 1707606900 1707225496
Age: 2860
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from adsl
X-Cache-Lookup: MISS from adsl:21261
Connection: close
has it something to do with the header "Age" or "X-Cache-Rack"? Or any ideas how I can fix it?
thanks in advance!
try to fake the user-agent, remove cookies, drop sessions.
fake_user_agent = ['chrome','firefox','safari']
request = urllib2.Request(url)
request.add_header('User-Agent', get_random(fake_user_agent))
content = urllib2.build_opener().open(request)
if all doesn't work, then try using tor to change ip per request.
if nothing works, then you can't bypass it because you are most definitely connecting to the transparent proxy
I wrote a program in python to automatically retrieve info from website. But it returns a lot of respond messages, so I cannot see how my program processes. How can I turn it off?
part of code
cookiejar = cookielib.LWPCookieJar()
cookie_support = urllib2.HTTPCookieProcessor(cookiejar)
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
....
url = 'http://weibo.com/2344328334/fans'
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'}
req = urllib2.Request(url = url, headers = header)
result = urllib2.urlopen(req)
text = result.read()
part of output
send: 'GET /2344328334/fans HTTP/1.1\r\nAccept-Encoding: ...'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx
header: Date: Tue, 10 Sep 2013 03:54:02 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Cache-Control: no-cache, must-revalidate
header: Expires: Sat, 26 Jul 1997 05:00:00 GMT
... other messages ...
You open the debug option, just remove the debuglevel=1
And I recommand anather python http lib: requests , It's very easy to use and powerful.Just install with pip install requests , support python2 and python3
It's Feature Support :
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
...
I've followed the following article in an attempt to setup Apache2 caching in order to use it with Django on Ubuntu 12.10 with mod_wsgi. I want Apache to cache some requests for me.
http://www.howtoforge.com/caching-with-apaches-mod_cache-on-ubuntu-10.04
From the article I enabled the modules and setup the following php script to test the caching. The caching works just fine - I only get a new timestamp after 5 minutes.
vi /var/www/cachetest.php
<?php
header("Cache-Control: must-revalidate, max-age=300");
header("Vary: Accept-Encoding");
echo time()."<br>";
?>
Now in my django response, I return an HttpResponse object after setting the appropriate headers the same way:
# Create a Response Object with the content to return and set it's
response = HttpResponse("%s"%(output_display))
response['Cache-Control'] = 'must-revalidate, max-age=20'
response['Vary'] = 'Accept-Encoding'
return response
The caching with the Django request doesn't work at all. I've used Firefox's LiveHeaders to examine the HTTP response headers.
For the example link above and the PHP script the headers look like:
http://localhost/cachetest.php
GET /cachetest.php HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cache-Control: max-age=0
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:29:32 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.4.6-1ubuntu1.1
Cache-Control: must-revalidate, max-age=300
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 34
Connection: close
Content-Type: text/html
----------------------------------------------------------
For my Django Request - the caching doesn't work, it always forces the lengthy operation to complete the response - just like re-loading the php request above with F5. Using the FireFox plugin I seem to be writing the correct headers:
http://localhost/testdjango/testdjango/
GET /testdjango/testdjango/ HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:32:41 GMT
Server: Apache/2.2.22 (Ubuntu)
Vary: Accept-Encoding
Cache-Control: must-revalidate, max-age=20
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
What am I doing wrong? How can I get the django caching to work like the php script? Thanks!
This seems to be your problem:
Transfer-Encoding: chunked
It means a 'streaming response', in terms of mod_mem_cache. And, according to the docs:
By default, a streamed response will not be cached unless it has a
Content-Length header.
You can solve it by setting the MCacheMaxStreamingBuffer directive.
I am using the python urllib2 library for opening URL, and what I want is to get the complete header info of the request. When I use response.info I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (add-on for firefox), e.g:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
def dorequest(url, post=None, headers={}):
cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
urllib2.install_opener( cOpener )
if post:
post = urllib.urlencode(post)
req = urllib2.Request(url, post, headers)
response = cOpener.open(req)
print response.info() // this does not give complete header info, how can i get complete header info??
return response.read()
url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect, but see the headers from the first page instead, use a URLOpener instead of a FancyURLOpener, which automatically follows redirects.
I see that server returns HTTP/1.1 302 Found - HTTP redirect.
urllib automatically follow redirects, so headers returned by urllib is headers from http://www.trucks.com.mt, not http://www.yellowpages.com.mt/Malta-Web/127151.aspx
OK, here's the header(just an example) info I got from Live HTTP Header while logging into an account:
http://example.com/login.html
POST /login.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://example.com
Cookie: blahblahblah; blah = blahblah
Content-Type: application/x-www-form-urlencoded
Content-Length: 39
username=shane&password=123456&do=login
HTTP/1.1 200 OK
Date: Sat, 18 Dec 2010 15:41:02 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.14
Set-Cookie: blah = blahblah_blah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Cache-Control: private, no-cache="set-cookie"
Expires: 0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 4135
Keep-Alive: timeout=10, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Normally I would code like this:
import mechanize
import urllib2
MechBrowser = mechanize.Browser()
LoginUrl = "http://example.com/login.html"
LoginData = "username=shane&password=123456&do=login"
LoginHeader = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
LoginRequest = urllib2.Request(LoginUrl, LoginData, LoginHeader)
LoginResponse = MechBrowser.open(LoginRequest)
Above code works fine. My question is, do I also need to add these following lines (and more in previous header infos) in LoginHeader to make it really looks like firefox's surfing, not mechanize?
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
What parts/how many of header info need to be spoofed to make it looks "real"?
It depends on what you're trying to 'fool'. You can try some online services that do simple User Agent sniffing to gauge your success:
http://browserspy.dk/browser.php
http://www.browserscope.org (look for 'We think you're using...')
http://www.browserscope.org/ua
http://panopticlick.eff.org/ -> will help you to pick some 'too common to track' options
http://networking.ringofsaturn.com/Tools/browser.php
I believe a determined programmer could detect your game, but many log parsers and tools wouldn't once you echo what your real browser sends.
One thing you should consider is that lack of JS might raise red flags, so capture sent headers with JS disabled too.
Here's how you set the user agent for all requests made by mechanize.Browser
br = mechanize.Browser()
br.addheaders = [('User-agent', 'your user agent string here')]
Mechanize can fill in forms as well
br.open('http://yoursite.com/login')
br.select_form(nr=1) # select second form in page (0 indexed)
br['username'] = 'yourUserName' # inserts into form field with name 'username'
br['password'] = 'yourPassword'
response = br.submit()
if 'Welcome yourUserName' in response.get_data():
# login was successful
else:
# something went wrong
print response.get_data()
See the mechanize examples for more info
If you are paranoid about keeping bots/scripts/non-real browsers out, you'd look for things like the order of HTTP requests, let one resource be added using JavaScript. If that resource is not requested, or requested before the JavaScript - then you know it's a "fake" browser.
You could also look at number of requests per connection (keep-alive), or simply verify that all CSS files of the first page (given that they're at the top of the HTML) gets loaded.
YMMV but it can become pretty cumbersome to simulate enough to make some "fake" browser pass as a "real" one (used by humans).