Caching Django Responses with mod_wsgi and Apache2 mem_cache - python

I've followed the following article in an attempt to set up Apache2 caching in order to use it with Django on Ubuntu 12.10 with mod_wsgi. I want Apache to cache some requests for me.
http://www.howtoforge.com/caching-with-apaches-mod_cache-on-ubuntu-10.04
From the article I enabled the modules and set up the following PHP script to test the caching. The caching works just fine - I only get a new timestamp after 5 minutes.
vi /var/www/cachetest.php
<?php
header("Cache-Control: must-revalidate, max-age=300");
header("Vary: Accept-Encoding");
echo time()."<br>";
?>
Now in my django response, I return an HttpResponse object after setting the appropriate headers the same way:
# Create a Response Object with the content to return and set it's
response = HttpResponse("%s"%(output_display))
response['Cache-Control'] = 'must-revalidate, max-age=20'
response['Vary'] = 'Accept-Encoding'
return response
The caching with the Django request doesn't work at all. I've used Firefox's LiveHeaders to examine the HTTP response headers.
For the example link above and the PHP script the headers look like:
http://localhost/cachetest.php
GET /cachetest.php HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cache-Control: max-age=0
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:29:32 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.4.6-1ubuntu1.1
Cache-Control: must-revalidate, max-age=300
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 34
Connection: close
Content-Type: text/html
----------------------------------------------------------
For my Django request, the caching doesn't work; it always forces the lengthy operation to complete the response, just like re-loading the PHP request above with F5. Using the Firefox plugin, I seem to be writing the correct headers:
http://localhost/testdjango/testdjango/
GET /testdjango/testdjango/ HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:32:41 GMT
Server: Apache/2.2.22 (Ubuntu)
Vary: Accept-Encoding
Cache-Control: must-revalidate, max-age=20
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
What am I doing wrong? How can I get the django caching to work like the php script? Thanks!

This seems to be your problem:
Transfer-Encoding: chunked
It means a 'streaming response', in terms of mod_mem_cache. And, according to the docs:
By default, a streamed response will not be cached unless it has a
Content-Length header.
You can solve it by setting the MCacheMaxStreamingBuffer directive.

Related

Python requests' POST file fails when trying to upload a WordPress Theme to Host

I'm trying to write a python script that would help me install a theme remotely. Unfortunately, the upload part doesn't play nice, trying to do it with requests' POST helpers.
The HTTP headers of a successful upload look like this:
http://127.0.0.1/wordpress/wp-admin/update.php?action=upload-theme
POST /wordpress/wp-admin/update.php?action=upload-theme HTTP/1.1
Host: 127.0.0.1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------2455316848522
Content-Length: 2580849
Referer: http://127.0.0.1/wordpress/wp-admin/theme-install.php
Cookie: wordpress_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7Cee7366eea9b5bc9a9d492a664a04cb0916b97b0d211e892875cec86cf43e2f9d; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7C9949f19ef5d900daf1b859c0bb4e2129cf86d6a970718a1b63e3b9e56dc5e710; wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1497486698
Connection: keep-alive
Upgrade-Insecure-Requests: 1
-----------------------------2455316848522: undefined
Content-Disposition: form-data; name="_wpnonce"
b1467671e0
-----------------------------2455316848522
Content-Disposition: form-data; name="_wp_http_referer"
/wordpress/wp-admin/theme-install.php
-----------------------------2455316848522
Content-Disposition: form-data; name="themezip"; filename="oedipus_theme.zip"
Content-Type: application/octet-stream
PK
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2017 01:33:25 GMT
Server: Apache/2.4.25 (Win32) OpenSSL/1.0.2j PHP/7.1.1
X-Powered-By: PHP/7.1.1
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
X-Frame-Options: SAMEORIGIN
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----------------------------------------------------------
To create a simple session for WP, in order to use later for uploads:
import requests

global wp_session

def wpCreateSession(uname, upassword, site_link):
    """
    :param uname: Username for the login.
    :param upassword: Password for the login.
    :param site_link: Site to login on.
    :return: Returns a session for the said website.
    """
    global wp_session
    wp_session = requests.session()
    # Log in once; the session keeps the cookies for later requests.
    wp_session.post(site_link, data={'log' : uname, 'pwd' : upassword})
To upload the said file to WP, using the wp_session global:
def wpUploadTheme(file_name):
    global wp_session
    try:
        with open(file_name, 'rb') as up_file:
            r = wp_session.post('http://127.0.0.1/wordpress/wp-admin/update.php', files = {file_name: up_file})
            print("Got after try.")
    finally:
        up_file.close()
And this last bit is where it doesn't work, the upload is not successful and I get returned to WordPress' basic 404.
I have also tried requests_toolbelt MultiPart_Encoder to no avail.
Question: 'requests' POST file fails when trying to upload
Check your files dict, your dict is invalid
files = {file_name: up_file}
Maybe you need a full blown files dict, for instance:
files = {'themezip': ('oedipus_theme.zip',
                      open('oedipus_theme.zip', 'rb'),
                      'application/octet-stream', {'Expires': '0'})}
From docs.python-requests.org
files = {'file': open('test.jpg', 'rb')}
requests.post(url, files=files)
From SO Answer Upload Image using POST form data in Python-requests

Openstack Swift logging for temp url and Cross domain

We have our own private cloud where I have installed OpenStack Swift. I have a working node (proxy and storage) that allows me to store and retrieve if I use the openstack and swift python cli to store and retrieve files. Additionally I am able to use the python API on a remote machine to store and retrieve files.
The root of my question is how to debug temp url and crossdomain filter issues. Is there a way to turn on detailed debug logging for these filters?
I have the default logging set to
log_name = swift
log_facility = LOG_LOCAL0
log_level = DEBUG
The situation I am trying to troubleshoot is as follows. When I try and use temp url and cross domain (for CORS), I get a 401. I debugged the code and it appears to be an invalid HMAC error. Based on research, this appears to be a date time issue where the client and the server have mismatched times. However both are running the ntpd service so the time should be in sync.
For CORS, it appears that preflight OPTIONS request is succeeding. The subsequent PUT is failing with a 401....
"No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://blahost' is therefore not allowed access. The response had HTTP status code 401."
The strange part is the OPTIONS request is returning "access-control-allow-origin" instead of 'Access-Control-Allow-Origin'... the case is off.
Preflight request:
OPTIONS /v1/AUTH_99cf99f26aaa4b2c923806231b03334c/436/88b6d895-6dbf-4f29-904d-96c9b7959016?temp_url_sig=4a953c34372e37b2a22bb31fb0581a7eb7f02cee&temp_url_expires=1441508891 HTTP/1.1
Host: 23.253.200.41:8080
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Access-Control-Request-Method: PUT
Origin: http://blahhost
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
Access-Control-Request-Headers: accept, content-type
Accept: */*
Referer: http://blahost/binder/436/site/419/folder/17560/file
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Preflight Response:
HTTP/1.1 200 OK
access-control-allow-origin: http://blahost
access-control-allow-methods: HEAD, GET, PUT, POST, COPY, OPTIONS, DELETE
access-control-allow-headers: content-type, accept
Allow: HEAD, GET, PUT, POST, COPY, OPTIONS, DELETE
Content-Length: 0
X-Trans-Id: tx9e359777dfb94148858cd-0055eba012
Date: Sun, 06 Sep 2015 02:08:18 GMT
Connection: keep-alive
Subsequent PUT request (notice it is missing the Access-Control-Allow-Origin)
PUT /v1/AUTH_99cf99f26aaa4b2c923806231b03334c/436/88b6d895-6dbf-4f29-904d-96c9b7959016?temp_url_sig=4a953c34372e37b2a22bb31fb0581a7eb7f02cee&temp_url_expires=1441508891 HTTP/1.1
Host: 23.253.200.41:8080
Connection: keep-alive
Content-Length: 16231
Pragma: no-cache
Cache-Control: no-cache
Accept: */*
Origin: http://blahost
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
Content-Type: application/pdf
Referer: http://blahost/binder/436/site/419/folder/17560/file
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
I would appreciate any advice on how to troubleshoot.
Thanks
Greg
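
An added diagnostic sketch, not part of the original post and not Swift's own code: it recomputes a TempURL signature as an HMAC-SHA1 over the method, expiry and path joined by newlines (the 40-hex-digit `temp_url_sig` in the post has the length of a SHA-1 digest) and reports which input makes validation fail. The key is a constructed placeholder, so these signatures will not match the one in the post; the path, expiry and `Date` header are copied from the requests above.

```python
import hmac
from email.utils import parsedate_to_datetime
from hashlib import sha1

# Constructed placeholder; the real key is the account's Temp-URL-Key metadata value.
KEY = b"constructed-example-key"
# Path, expiry and Date header copied from the requests in the post.
PATH = "/v1/AUTH_99cf99f26aaa4b2c923806231b03334c/436/88b6d895-6dbf-4f29-904d-96c9b7959016"
EXPIRES = 1441508891
SERVER_DATE = "Sun, 06 Sep 2015 02:08:18 GMT"
METHODS = ("GET", "HEAD", "PUT", "POST", "DELETE")


def temp_url_sig(key, method, expires, path):
    """Hex HMAC-SHA1 over 'METHOD\\nEXPIRES\\nPATH', the TempURL signing string."""
    body = "%s\n%d\n%s" % (method.upper(), expires, path)
    return hmac.new(key, body.encode("utf-8"), sha1).hexdigest()


def server_epoch(date_header):
    """Convert an HTTP Date header into Unix seconds (the server's own clock)."""
    return int(parsedate_to_datetime(date_header).timestamp())


def diagnose(key, sig, method, expires, path, now):
    """Return the reasons a TempURL would be rejected; an empty list means it validates."""
    method = method.upper()
    reasons = []
    if expires < now:
        reasons.append("expired %d seconds ago" % (now - expires))
    if not hmac.compare_digest(temp_url_sig(key, method, expires, path), sig):
        # The method is part of the signed string, so a URL signed for GET fails on PUT.
        signed_for = [m for m in METHODS if m != method
                      and hmac.compare_digest(temp_url_sig(key, m, expires, path), sig)]
        if signed_for:
            reasons.append("signature was generated for %s, request is %s"
                           % (signed_for[0], method))
        else:
            reasons.append("signature does not match: check key, path and expiry")
    return reasons


if __name__ == "__main__":
    now = server_epoch(SERVER_DATE)
    print("seconds until expiry by the server's clock:", EXPIRES - now)
    for signed_method in ("PUT", "GET"):
        sig = temp_url_sig(KEY, signed_method, EXPIRES, PATH)
        result = diagnose(KEY, sig, "PUT", EXPIRES, PATH, now)
        print(signed_method, "signature used on PUT ->", result or "valid")
```

Comparing `temp_url_expires` with the `Date` header of a response from the same proxy separates clock skew from a key, path or method mismatch without touching the server's logging configuration.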

Form Submission/POST using Requests in Python

I'm trying to submit a form to a website using the requests module in Python but my form is not submitting correctly. I can submit the form correctly manually on the site but I assume something is wrong with my code that is causing the Python submission to fail. I can successfully login to the website and visit pages/issue GET requests using Python. I can issue a GET request to the page that I submit the form to and it will successfully load, i.e. the requests login works. I have included all of my output below including the Python code, an invalid form submission in Python and a valid form submission from the browser. This may be overkill but I am inexperienced with this and am not sure what is necessary.
My code to login is:
import re
import requests

s = requests.Session()
data = s.get(login_url)
# Pull the CSRF token out of the hidden form field on the login page.
authToken = re.search((r'name="authenticity_token"[\s]'
                       r'type="hidden"[\s]+value="(.+)"'),
                      data.text).group(1)
data_dict = {
    'utf8': '✓',
    'authenticity_token': authToken,
    'admin[email]': username,
    'admin[password]': password,
    'admin[remember_me]': '1',
    'commit': 'Sign in'
}
s.post(login_url, data_dict)
This successfully logs me in and I can submit GET requests to any page and get valid results.
My code to submit the form:
payload = {
    'utf8': '✓',
    'authenticity_token': authToken,
    'progress_course[name]': name,
    'progress_course[description]': desc,
    'commit': 'Create Course'
}
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'en-US,en;q=0.8',
    'Cache-Control':'max-age=0',
    'Connection':'keep-alive',
    'Content-Length':'176',
    'Content-Type':'application/x-www-form-urlencoded',
    'Host':'xxx.com',
    'Origin':'https://xxx.com',
    'Referer':'https://xxx.com/workshop/progress/courses/new',
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'
}
response = s.post(path,data=payload,headers=headers)
My form submission does not work. Here is the python logging module output:
The POST:
send: 'POST /workshop/progress/courses HTTP/1.1
Origin: https://xxx.com\r\nContent-Length: 181
Accept-Language: en-US,en;q=0.8
Accept-Encoding: gzip, deflate
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Host: xxx.com
Referer: https://xxx.com/workshop/progress/courses/new
Cache-Control: max-age=0
Cookie: _brainfit_session=xxx; remember_admin_token=xxx
Content-Type: application/x-www-form-urlencoded
progress_course%5Bdescription%5D=pytest&utf8=%26%23x2713%3B&commit=Create+Course&progress_course%5Bname%5D=pytest&authenticity_token=xxx
The reply to the POST. You can see that this comes from the sign-in page rather than generating a new page as seen in the correct output further down.
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: nginx
header: Date: Tue, 02 Dec 2014 01:57:46 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Status: 302 Found
header: Strict-Transport-Security: max-age=31536000
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Location: https://xxx.com/admins/sign_in
header: Cache-Control: no-cache
header: X-Request-Id: 983472d4-8954-4106-904a-38ea3b6a76a1
header: X-Runtime: 0.039963
DEBUG:requests.packages.urllib3.connectionpool:"POST /workshop/progress/courses HTTP/1.1" 302 None
Redirect GETs and replies. I may be mistaken in what these actually are:
send: 'GET /admins/sign_in HTTP/1.1
Origin: https://xxx.com
Accept-Language: en-US,en;q=0.8
Accept-Encoding: gzip, deflate
Host: xxx.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Connection: keep-alive
Referer: https://xxx.com/workshop/progress/courses/new
Cache-Control: max-age=0
Cookie: _brainfit_session=xxx; remember_admin_token=xxx
Content-Type: application/x-www-form-urlencoded
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: nginx
header: Date: Tue, 02 Dec 2014 01:57:46 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Status: 302 Found
header: Strict-Transport-Security: max-age=31536000
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Location: https://xxx.com/workshop/progress/courses
header: Cache-Control: no-cache
header: Set-Cookie: _brainfit_session=xxx; path=/; secure; HttpOnly
header: X-Request-Id: 9e5535d6-143c-4e98-8bb4-35dd29ab045d
header: X-Runtime: 0.007505
DEBUG:requests.packages.urllib3.connectionpool:"GET /admins/sign_in HTTP/1.1" 302 None
send: 'GET /workshop/progress/courses HTTP/1.1
Origin: https://xxx.com
Accept-Language: en-US,en;q=0.8
Accept-Encoding: gzip, deflate
Host: xxx.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Connection: keep-alive
Referer: https://xxx.com/workshop/progress/courses/new
Cache-Control: max-age=0
Cookie: _brainfit_session=exxx; remember_admin_token=xxx
Content-Type: application/x-www-form-urlencoded
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Tue, 02 Dec 2014 01:57:46 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Status: 200 OK
header: Strict-Transport-Security: max-age=31536000
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Cache-Control: max-age=0, private, must-revalidate
header: Set-Cookie: _brainfit_session=xxx; path=/; secure; HttpOnly
header: X-Request-Id: 4e759b60-af1c-4f39-b033-71ff99b62df4
header: X-Runtime: 0.019001
header: Content-Encoding: gzip
DEBUG:requests.packages.urllib3.connectionpool:"GET /workshop/progress/courses HTTP/1.1" 200 None
The POST headers of a properly submitted form on the site:
Remote Address:xxx
Request URL:https://xxx.com/workshop/progress/courses
Request Method:POST
Status Code:302 Found
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:176
Content-Type:application/x-www-form-urlencoded
Cookie:__utma=xxx; __utmc=xxx; __utmz=xxx.utmcsr=google|utmccn=(organic)|utmcmd=organic|
utmctr=xxx; km_lv=x; kvcd=xxx; km_ai=xxx; km_ni=xxx; km_uq=xxx; has_logged_in=true; WT_FPC=id=xxx; _ga=xxx; _brainfit_session=xxx
Host:xxx.com
Origin:https://xxx.com
Referer:https://xxx.com/workshop/progress/courses
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Form Data
utf8:✓
authenticity_token:xxx
progress_course[name]:test03
progress_course[description]:test03
commit:Create Course
Response Headers
Cache-Control:no-cache
Connection:keep-alive
Content-Type:text/html; charset=utf-8
Date:Mon, 01 Dec 2014 14:01:37 GMT
Location:https://xxx.com/workshop/progress/courses/44
Server:nginx
Set-Cookie:_brainfit_session=xxx; path=/; secure; HttpOnly
Status:302 Found
Strict-Transport-Security:max-age=31536000
Transfer-Encoding:chunked
X-Content-Type-Options:nosniff
X-Frame-Options:SAMEORIGIN
X-Request-Id:8bb04242-bc5a-408f-99a5-9d8bf4eeb611
X-Runtime:0.028559
X-XSS-Protection:1; mode=block
Once the form is correctly submitted there is also a GET response from the page it redirects to. I've excluded additional output here but it can be added if it will help resolve the problem.
Remote Address:xxx
Request URL:https://xxx.com/workshop/progress/courses/44
Request Method:GET
Status Code:200 OK

How can I get complete header info from urllib2 request?

I am using the python urllib2 library for opening URL, and what I want is to get the complete header info of the request. When I use response.info I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (add-on for firefox), e.g:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import cookielib
import urllib
import urllib2

def dorequest(url, post=None, headers={}):
    # Opener with a cookie jar so cookies persist across the redirect chain.
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener( cOpener )
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    print(response.info())  # this does not give complete header info, how can i get complete header info??
    return response.read()

url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect, but see the headers from the first page instead, use a URLOpener instead of a FancyURLOpener, which automatically follows redirects.
I see that the server returns HTTP/1.1 302 Found - HTTP redirect.
urllib automatically follows redirects, so the headers returned by urllib are the headers from http://www.trucks.com.mt, not http://www.yellowpages.com.mt/Malta-Web/127151.aspx

How to get mechanize requests to look like they originate from a real browser

OK, here's the header (just an example) info I got from Live HTTP Header while logging into an account:
http://example.com/login.html
POST /login.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://example.com
Cookie: blahblahblah; blah = blahblah
Content-Type: application/x-www-form-urlencoded
Content-Length: 39
username=shane&password=123456&do=login
HTTP/1.1 200 OK
Date: Sat, 18 Dec 2010 15:41:02 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.14
Set-Cookie: blah = blahblah_blah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Cache-Control: private, no-cache="set-cookie"
Expires: 0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 4135
Keep-Alive: timeout=10, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Normally I would code like this:
import mechanize
import urllib2
MechBrowser = mechanize.Browser()
LoginUrl = "http://example.com/login.html"
LoginData = "username=shane&password=123456&do=login"
LoginHeader = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
LoginRequest = urllib2.Request(LoginUrl, LoginData, LoginHeader)
LoginResponse = MechBrowser.open(LoginRequest)
Above code works fine. My question is, do I also need to add these following lines (and more in previous header infos) in LoginHeader to make it really look like Firefox's surfing, not mechanize?
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
What parts/how many of header info need to be spoofed to make it look "real"?
It depends on what you're trying to 'fool'. You can try some online services that do simple User Agent sniffing to gauge your success:
http://browserspy.dk/browser.php
http://www.browserscope.org (look for 'We think you're using...')
http://www.browserscope.org/ua
http://panopticlick.eff.org/ -> will help you to pick some 'too common to track' options
http://networking.ringofsaturn.com/Tools/browser.php
I believe a determined programmer could detect your game, but many log parsers and tools wouldn't once you echo what your real browser sends.
One thing you should consider is that lack of JS might raise red flags, so capture sent headers with JS disabled too.
Here's how you set the user agent for all requests made by mechanize.Browser
import mechanize
br = mechanize.Browser()
br.addheaders = [('User-agent', 'your user agent string here')]
Mechanize can fill in forms as well
br.open('http://yoursite.com/login')
br.select_form(nr=1)  # select second form in page (0 indexed)
br['username'] = 'yourUserName'  # inserts into form field with name 'username'
br['password'] = 'yourPassword'
response = br.submit()
if 'Welcome yourUserName' in response.get_data():
    # login was successful
    pass
else:
    # something went wrong
    print(response.get_data())
See the mechanize examples for more info
If you are paranoid about keeping bots/scripts/non-real browsers out, you'd look for things like the order of HTTP requests, let one resource be added using JavaScript. If that resource is not requested, or requested before the JavaScript - then you know it's a "fake" browser.
You could also look at number of requests per connection (keep-alive), or simply verify that all CSS files of the first page (given that they're at the top of the HTML) get loaded.
YMMV but it can become pretty cumbersome to simulate enough to make some "fake" browser pass as a "real" one (used by humans).
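
The request-order check described in the last answer, as an added detection sketch that is not part of the original answers: it reads access-log lines and flags clients that load the page without its stylesheet, never fetch the resource that only JavaScript inserts, or fetch it before the script. The site layout (`/static/site.css`, `/static/app.js`, `/static/beacon.gif`) and the log lines are constructed for the example.

```python
import re
from collections import defaultdict

# Hypothetical site layout (assumption): the login page links site.css and app.js,
# and app.js inserts beacon.gif, which never appears in the HTML itself.
PAGE, CSS = "/login.html", "/static/site.css"
SCRIPT, JS_ONLY = "/static/app.js", "/static/beacon.gif"

LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8"

# Constructed access-log lines (combined format, in arrival order). All three
# clients send the same User-Agent; only the request pattern differs.
SAMPLE_LOG = "\n".join(line % UA for line in [
    '192.0.2.10 - - [18/Dec/2010:15:41:02 +0000] "GET /login.html HTTP/1.1" 200 4135 "-" "%s"',
    '192.0.2.10 - - [18/Dec/2010:15:41:02 +0000] "GET /static/site.css HTTP/1.1" 200 900 "-" "%s"',
    '192.0.2.10 - - [18/Dec/2010:15:41:03 +0000] "GET /static/app.js HTTP/1.1" 200 1200 "-" "%s"',
    '192.0.2.10 - - [18/Dec/2010:15:41:03 +0000] "GET /static/beacon.gif HTTP/1.1" 200 43 "-" "%s"',
    '198.51.100.7 - - [18/Dec/2010:15:42:10 +0000] "GET /login.html HTTP/1.1" 200 4135 "-" "%s"',
    '198.51.100.7 - - [18/Dec/2010:15:42:10 +0000] "POST /login.html HTTP/1.1" 200 4135 "-" "%s"',
    '203.0.113.5 - - [18/Dec/2010:15:43:00 +0000] "GET /login.html HTTP/1.1" 200 4135 "-" "%s"',
    '203.0.113.5 - - [18/Dec/2010:15:43:00 +0000] "GET /static/beacon.gif HTTP/1.1" 200 43 "-" "%s"',
    '203.0.113.5 - - [18/Dec/2010:15:43:01 +0000] "GET /static/site.css HTTP/1.1" 200 900 "-" "%s"',
    '203.0.113.5 - - [18/Dec/2010:15:43:01 +0000] "GET /static/app.js HTTP/1.1" 200 1200 "-" "%s"',
])


def classify(log_text):
    """Map each client IP that loaded PAGE to the list of anomalies seen afterwards."""
    by_client = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:  # lines that do not parse are skipped, not guessed at
            by_client[(m.group("ip"), m.group("ua"))].append(m.group("path"))

    verdicts = {}
    for (ip, _ua), paths in by_client.items():
        if PAGE not in paths:
            continue
        after = paths[paths.index(PAGE) + 1:]  # requests that follow the first page load
        reasons = []
        if CSS not in after:
            reasons.append("stylesheet never fetched")
        if JS_ONLY not in after:
            reasons.append("script-inserted resource never fetched")
        elif SCRIPT not in after or after.index(JS_ONLY) < after.index(SCRIPT):
            reasons.append("script-inserted resource fetched before the script")
        verdicts[ip] = reasons
    return verdicts


if __name__ == "__main__":
    for ip, reasons in classify(SAMPLE_LOG).items():
        print(ip, "->", reasons or "consistent with a browser")
```

Warm browser caches and clients sharing one NAT address both produce false positives here, so the output suits scoring alongside other signals better than a hard block.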
