How can I get complete header info from a urllib2 request? - python

I am using Python's urllib2 library to open a URL, and I want the complete header info for the request. When I call response.info() I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by Live HTTP Headers (a Firefox add-on), e.g.:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import cookielib
import urllib
import urllib2

def dorequest(url, post=None, headers=None):
    # Opener that keeps cookies between requests
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers or {})
    response = cOpener.open(req)
    print response.info()  # this does not give complete header info, how can I get complete header info?
    return response.read()

url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.

Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
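That first URL uses a 302 redirect. If you don't want to follow the redirect but want to see the headers from the first page instead, use urllib's URLopener rather than FancyURLopener, which follows redirects automatically. (With urllib2, redirects are followed by HTTPRedirectHandler, so the equivalent there is an opener whose redirect handler is overridden.)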
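A minimal sketch of the "don't follow the redirect" idea, written for Python 3's urllib.request (the successor to urllib2, where the same handler classes exist under the urllib2 name). The names NoRedirectHandler and fetch_first_response are made up for this example.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "not handled"; urllib then falls back to its
        # default error handler, which raises HTTPError for the 3xx response.
        return None


def fetch_first_response(url):
    """Return (status, headers) of the first response, without following redirects."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    try:
        response = opener.open(url)
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and headers of the response.
        # Note: dict() keeps only one value for repeated headers such as Set-Cookie.
        return err.code, dict(err.headers.items())
    return response.status, dict(response.headers.items())
```

Calling fetch_first_response on the URL from the question should then report the 302 and its Location header rather than the headers of the page it redirects to, assuming the server still behaves as in the capture.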

I see that the server returns HTTP/1.1 302 Found, which is an HTTP redirect.
urllib follows redirects automatically, so the headers it returns are the headers from http://www.trucks.com.mt, not from http://www.yellowpages.com.mt/Malta-Web/127151.aspx
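If both sets of headers are wanted, one option is to keep following redirects but record each intermediate response on the way. This is only a sketch for Python 3's urllib.request (the urllib2 equivalent has the same hook); RecordingRedirectHandler and open_with_history are hypothetical names.

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember each redirecting response."""

    def __init__(self):
        super().__init__()
        self.hops = []  # list of (status code, redirecting URL, headers)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the 3xx response before handing over to the stock behaviour.
        self.hops.append((code, req.full_url, dict(headers.items())))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def open_with_history(url):
    """Open url, returning the final response plus the recorded redirect hops."""
    recorder = RecordingRedirectHandler()
    opener = urllib.request.build_opener(recorder)
    response = opener.open(url)
    return response, recorder.hops
```

Here response.headers belongs to the final page, while each entry in hops holds the status and headers of a response that redirected.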

Related

trouble accessing cookies using python requests

I'm a bit new to the python requests library, and am having some trouble accessing cookies after forms authentication. I captured packets in wireshark, and I'm certain that the cookies are being set (HTTP stream output at bottom). I'm following documentation here: https://2.python-requests.org/en/master/user/quickstart/#cookies.
My request is as follows:
r = requests.post('http://192.168.2.111/Account/LogOn', data = {'User': 'Customer', 'Password': 'Customer', 'button': 'submit'})
If I invoke print(r.cookies), the only thing returned is <RequestsCookieJar[]>.
If I try to access a cookie with r.cookies['mydeviceAG_POEWebTool'], I get a key error: KeyError: "name='mydeviceAG_POEWebTool', domain=None, path=None"
The HTTP Stream from wireshark is here:
POST /Account/LogOn HTTP/1.1
Host: 192.168.2.111
User-Agent: python-requests/2.24.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 49
Content-Type: application/x-www-form-urlencoded
User=Customer&Password=Customer&button=submit
HTTP/1.1 302 Found
Date: Thu, 24 Sep 2020 10:30:43 GMT
Server: Apache/2.4.25 (Debian)
Location: /States
X-AspNet-Version: 4.0.30319
Cache-Control: private
Set-Cookie: mydeviceAG_POEWebTool=8081AD7E6EEEB463C0AD8458; path=/
Set-Cookie: .mydeviceAG_POEWebTool_AUTH=jmnjqmKeL0ge8fz/sYrP3Xm+ntUTnPLrWBGtKAmAvnkKIjPKYQVn9xVrRa7EUEHLTfB1KNCKjotabnb7QqnDlQlZKuQkJ0J8rLmxuxrtCMFDsa/d6jyUj/PUckJ8V0Te; path=/; expires=Thu, 24 Sep 2020 13:30:44 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 112
Keep-Alive: timeout=5, m=100
Connection: Keep-Alive
Content-Type: text/html
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here</h2>
</body><html>
After reading the link offered by #D-E-N below, and searching around a bit, I tried the code below, which gets me further but only stores the first of the two cookies in the cookie jar. The server sends back two cookies:
Set-Cookie: AG_POEWebTool=9E508CCAE50EACB4AD1B33D8; path=/
Set-Cookie: .AG_POEWebTool_AUTH=WaXAtyh/tGnoZFPIiV4xoF6BhwGbr0jlaFq
but the only cookie stored is the AG_POEWebTool one.
with requests.Session() as s:
    s.get('http://192.168.2.111/Account/LogOn', timeout=5, params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    r = s.post('http://192.168.2.111/Account/LogOn', data={'User': 'Customer', 'Password': 'Customer', 'button': 'submit'}, timeout=5, params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    cookiedict = s.cookies.get_dict()
    print('URL:')
    for item in cookiedict.items():
        print(item)
    response = s.get('http://192.168.2.111/Logfiles')
This is the response that I get:
cookiedict items:
('AG_POEWebTool', '471B4EB1153E4733E0EA1A40')
Process finished with exit code 0

Python requests' POST file fails when trying to upload a WordPress Theme to Host

I'm trying to write a Python script that would help me install a theme remotely. Unfortunately, the upload part fails when I try to do it with requests' POST helpers.
The HTTP headers of a successful upload look like this:
http://127.0.0.1/wordpress/wp-admin/update.php?action=upload-theme
POST /wordpress/wp-admin/update.php?action=upload-theme HTTP/1.1
Host: 127.0.0.1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------2455316848522
Content-Length: 2580849
Referer: http://127.0.0.1/wordpress/wp-admin/theme-install.php
Cookie: wordpress_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7Cee7366eea9b5bc9a9d492a664a04cb0916b97b0d211e892875cec86cf43e2f9d; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7C9949f19ef5d900daf1b859c0bb4e2129cf86d6a970718a1b63e3b9e56dc5e710; wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1497486698
Connection: keep-alive
Upgrade-Insecure-Requests: 1
-----------------------------2455316848522: undefined
Content-Disposition: form-data; name="_wpnonce"
b1467671e0
-----------------------------2455316848522
Content-Disposition: form-data; name="_wp_http_referer"
/wordpress/wp-admin/theme-install.php
-----------------------------2455316848522
Content-Disposition: form-data; name="themezip"; filename="oedipus_theme.zip"
Content-Type: application/octet-stream
PK
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2017 01:33:25 GMT
Server: Apache/2.4.25 (Win32) OpenSSL/1.0.2j PHP/7.1.1
X-Powered-By: PHP/7.1.1
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
X-Frame-Options: SAMEORIGIN
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----------------------------------------------------------
To create a simple session for WP, in order to use later for uploads:
import requests

wp_session = None

def wpCreateSession(uname, upassword, site_link):
    """
    :param uname: Username for the login.
    :param upassword: Password for the login.
    :param site_link: Site to log in on.
    :return: The session for the said website (also stored in the wp_session global).
    """
    global wp_session
    wp_session = requests.session()
    wp_session.post(site_link, data={'log': uname, 'pwd': upassword})
    return wp_session
To upload the said file to WP, using the wp_session global:
def wpUploadTheme(file_name):
    global wp_session
    try:
        with open(file_name, 'rb') as up_file:
            r = wp_session.post('http://127.0.0.1/wordpress/wp-admin/update.php', files={file_name: up_file})
            print "Got after try."
    finally:
        up_file.close()
This last bit is where it doesn't work: the upload is not successful and I get sent to WordPress' basic 404 page.
I have also tried requests_toolbelt's MultipartEncoder, to no avail.

Question: 'requests' POST file fails when trying to upload
Check your files dict; the one you pass is invalid:
files = {file_name: up_file}
The key has to be the name of the form field (themezip in your captured request), not the file name. Maybe you need a full-blown files dict, for instance:
files = {'themezip': ('oedipus_theme.zip',
                      open('oedipus_theme.zip', 'rb'),
                      'application/octet-stream', {'Expires': '0'})}
From docs.python-requests.org
files = {'file': open('test.jpg', 'rb')}
requests.post(url, files=files)
From SO Answer Upload Image using POST form data in Python-requests
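To see why the dict key matters, here is a rough standard-library sketch of what a multipart/form-data encoder does; requests builds an equivalent body from its data= and files= arguments. The helper encode_multipart is hypothetical, the field names come from the captured browser request, and the file content is a placeholder.

```python
import uuid


def encode_multipart(fields, files):
    """Encode text fields and files as a multipart/form-data body.

    fields: {form_field_name: text_value}
    files:  {form_field_name: (filename, content_bytes, content_type)}
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    delimiter = ("--" + boundary).encode("ascii")
    parts = []
    for name, value in fields.items():
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"),
            b"",
            value.encode("utf-8"),
        ]
    for name, (filename, content, content_type) in files.items():
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode("utf-8"),
            ("Content-Type: " + content_type).encode("ascii"),
            b"",
            content,
        ]
    parts += [delimiter + b"--", b""]  # closing delimiter plus trailing CRLF
    return b"\r\n".join(parts), "multipart/form-data; boundary=" + boundary


# The dict keys become the name="..." of each part, which is what the server looks up.
body, content_type = encode_multipart(
    fields={"_wpnonce": "b1467671e0",
            "_wp_http_referer": "/wordpress/wp-admin/theme-install.php"},
    files={"themezip": ("oedipus_theme.zip", b"PK...", "application/octet-stream")},
)
```

Note that the captured request also carries the ?action=upload-theme query string and the _wpnonce and _wp_http_referer fields, none of which the failing call sends. The nonce presumably has to be read from the upload form first.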

Caching Django Responses with mod_wsgi and Apache2 mem_cache

I followed the article below in an attempt to set up Apache2 caching for use with Django on Ubuntu 12.10 with mod_wsgi. I want Apache to cache some requests for me.
http://www.howtoforge.com/caching-with-apaches-mod_cache-on-ubuntu-10.04
Following the article, I enabled the modules and set up the following PHP script to test the caching. The caching works just fine: I only get a new timestamp after 5 minutes.
vi /var/www/cachetest.php
<?php
header("Cache-Control: must-revalidate, max-age=300");
header("Vary: Accept-Encoding");
echo time()."<br>";
?>
Now in my django response, I return an HttpResponse object after setting the appropriate headers the same way:
# Create a response object with the content to return and set its headers
response = HttpResponse("%s"%(output_display))
response['Cache-Control'] = 'must-revalidate, max-age=20'
response['Vary'] = 'Accept-Encoding'
return response
The caching with the Django request doesn't work at all. I've used Firefox's LiveHeaders to examine the HTTP response headers.
For the example link above and the PHP script the headers look like:
http://localhost/cachetest.php
GET /cachetest.php HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cache-Control: max-age=0
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:29:32 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.4.6-1ubuntu1.1
Cache-Control: must-revalidate, max-age=300
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 34
Connection: close
Content-Type: text/html
----------------------------------------------------------
For my Django request the caching doesn't work: it always runs the lengthy operation to complete the response, just like reloading the PHP request above with F5. According to the Firefox plugin, I seem to be writing the correct headers:
http://localhost/testdjango/testdjango/
GET /testdjango/testdjango/ HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
HTTP/1.1 200 OK
Date: Sun, 10 Mar 2013 02:32:41 GMT
Server: Apache/2.2.22 (Ubuntu)
Vary: Accept-Encoding
Cache-Control: must-revalidate, max-age=20
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
What am I doing wrong? How can I get the django caching to work like the php script? Thanks!

This seems to be your problem:
Transfer-Encoding: chunked
It means a 'streaming response', in terms of mod_mem_cache. And, according to the docs:
By default, a streamed response will not be cached unless it has a Content-Length header.
You can solve it by setting the MCacheMaxStreamingBuffer directive.
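The quoted sentence also hints at a second route: give the response an explicit Content-Length. The sketch below shows this at the plain WSGI level (what mod_wsgi ultimately serves) rather than in Django itself, where the analogous step would be setting response['Content-Length']. The body is a placeholder, and whether Apache still switches to chunked encoding after compressing the response depends on the rest of the setup, which is not verified here.

```python
def application(environ, start_response):
    """Minimal WSGI app that sends an explicit Content-Length."""
    body = b"expensive result goes here"  # placeholder for the real output
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Cache-Control", "must-revalidate, max-age=20"),
        ("Vary", "Accept-Encoding"),
        # With a known length the server does not need chunked encoding
        # to delimit the response.
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```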

Mechanize response returns no content

I'm using Mechanize in Python to perform some web scraping. Most of the website works but one particular page doesn't return any Content or Response.
My settings are
self._browser = mechanize.Browser()
self._browser.set_handle_refresh(True)
self._browser.set_debug_responses(True)
self._browser.set_debug_redirects(True)
self._browser.set_debug_http(True)
and the code to execute is:
response = self._browser.open(url)
This is the debug output:
add_cookie_header
Checking xyz.com for cookies to return
- checking cookie path=/
- checking cookie <Cookie ASP.NET_SessionId=j3pg0wnavh3yjseyj1v3mr45 for xyz.com/>
it's a match
send: 'GET /page.aspx?leagueID=39 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: xyz.com\r\nCookie: ASP.NET_SessionId=aapg9wnavh3yqyrtg1v3ar45\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 07 Feb 2012 19:04:37 GMT
header: Pragma: no-cache
header: Expires: -1
header: Connection: close
header: Cache-Control: no-cache
header: Content-Length: 0
extract_cookies: Date: Tue, 07 Feb 2012 19:04:37 GMT
Pragma: no-cache
Expires: -1
Connection: close
Cache-Control: no-cache
Content-Length: 0
I've tried with and without Redirect to no avail. Any ideas?
I might add the page works fine in a browser.

The usual procedure for finding the problem is:
1. Capture your web browser traffic when it successfully opens the URL.
2. Capture the Python traffic when it tries to open the URL.
For the first step, there are many tools available. For example, in Firefox, HttpFox and Live HTTP Headers might be quite useful.
For the second step, programmatically logging the headers being sent and received should be enough.
For both steps, you can also capture traffic on your network card with something like Wireshark.
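Once both captures exist, comparing them is mostly mechanical. A small sketch of that comparison follows; diff_headers is a hypothetical helper, the script headers are copied from the debug output in the question, and the browser headers are invented for the sake of the example.

```python
def diff_headers(browser_headers, script_headers):
    """Compare two header dicts, ignoring the case of header names."""
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    return {
        "only_in_browser": sorted(set(browser) - set(script)),
        "only_in_script": sorted(set(script) - set(browser)),
        "different": sorted(k for k in set(browser) & set(script)
                            if browser[k] != script[k]),
    }


# Headers the script sent (from the mechanize debug output).
script = {
    "Accept-Encoding": "identity",
    "Host": "xyz.com",
    "Connection": "close",
}
# Headers a browser might send (assumed values, not taken from a real capture).
browser = {
    "Host": "xyz.com",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}
report = diff_headers(browser, script)
```

Headers that appear only in the browser capture, or with different values, are the first candidates to replicate in the script.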

How to get mechanize requests to look like they originate from a real browser

OK, here's the header info (just an example) I got from Live HTTP Headers while logging into an account:
http://example.com/login.html
POST /login.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://example.com
Cookie: blahblahblah; blah = blahblah
Content-Type: application/x-www-form-urlencoded
Content-Length: 39
username=shane&password=123456&do=login
HTTP/1.1 200 OK
Date: Sat, 18 Dec 2010 15:41:02 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.14
Set-Cookie: blah = blahblah_blah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Cache-Control: private, no-cache="set-cookie"
Expires: 0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 4135
Keep-Alive: timeout=10, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Normally I would code like this:
import mechanize
import urllib2
MechBrowser = mechanize.Browser()
LoginUrl = "http://example.com/login.html"
LoginData = "username=shane&password=123456&do=login"
LoginHeader = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
LoginRequest = urllib2.Request(LoginUrl, LoginData, LoginHeader)
LoginResponse = MechBrowser.open(LoginRequest)
The above code works fine. My question is: do I also need to add the following lines (and more from the header info above) to LoginHeader to make it really look like Firefox is doing the browsing, not mechanize?
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Which parts, and how much, of the header info need to be spoofed to make it look "real"?

It depends on what you're trying to 'fool'. You can try some online services that do simple User Agent sniffing to gauge your success:
http://browserspy.dk/browser.php
http://www.browserscope.org (look for 'We think you're using...')
http://www.browserscope.org/ua
http://panopticlick.eff.org/ -> will help you to pick some 'too common to track' options
http://networking.ringofsaturn.com/Tools/browser.php
I believe a determined programmer could detect your game, but many log parsers and tools wouldn't once you echo what your real browser sends.
One thing you should consider is that lack of JS might raise red flags, so capture sent headers with JS disabled too.
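As a sketch of "echoing what your real browser sends", the snippet below attaches the captured Firefox headers to a request using Python 3's urllib.request (the same Request class that the question builds with urllib2). The names BROWSER_HEADERS and build_login_request are made up here, and the form values are the example ones from the question.

```python
import urllib.parse
import urllib.request

# Header values copied from the Live HTTP Headers capture in the question.
# Accept-Encoding is left out on purpose: urllib does not decompress gzip
# responses for you, so advertising gzip would mean decoding the body yourself.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) "
                   "Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-us,en;q=0.5",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Referer": "http://example.com",
}


def build_login_request(url, form):
    """Build (but do not send) a POST request carrying browser-like headers."""
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(url, data=data, headers=BROWSER_HEADERS)


request = build_login_request(
    "http://example.com/login.html",
    {"username": "shane", "password": "123456", "do": "login"},
)
# urllib stores header names capitalized, e.g. "Accept-language".
sent_headers = dict(request.header_items())
```

Inspecting sent_headers before sending makes it easy to check the result against the browser capture.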

Here's how you set the user agent for all requests made by mechanize.Browser
br = mechanize.Browser()
br.addheaders = [('User-agent', 'your user agent string here')]
Mechanize can fill in forms as well
br.open('http://yoursite.com/login')
br.select_form(nr=1)  # select second form in page (0 indexed)
br['username'] = 'yourUserName'  # inserts into form field with name 'username'
br['password'] = 'yourPassword'
response = br.submit()
if 'Welcome yourUserName' in response.get_data():
    # login was successful
    pass
else:
    # something went wrong
    print response.get_data()
See the mechanize examples for more info

If you are paranoid about keeping bots/scripts/non-real browsers out, you'd look for things like the order of HTTP requests: let one resource be added using JavaScript. If that resource is not requested, or is requested before the JavaScript, then you know it's a "fake" browser.
You could also look at the number of requests per connection (keep-alive), or simply verify that all CSS files of the first page (given that they're at the top of the HTML) get loaded.
YMMV but it can become pretty cumbersome to simulate enough to make some "fake" browser pass as a "real" one (used by humans).
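A toy version of the request-order check could look like the sketch below. It assumes the server already has, per client, the ordered list of paths that client requested; the function name and all paths are hypothetical.

```python
def looks_like_real_browser(paths, page, script, injected, stylesheets):
    """Heuristic: page, then the script, then the resource the script injects."""
    try:
        page_at = paths.index(page)
        script_at = paths.index(script, page_at + 1)
        # The injected resource must be requested after the script was fetched.
        paths.index(injected, script_at + 1)
    except ValueError:
        return False
    after_page = paths[page_at + 1:]
    # All stylesheets linked from the page should have been loaded as well.
    return all(css in after_page for css in stylesheets)


real = ["/", "/style.css", "/app.js", "/pixel.gif"]
bot = ["/", "/pixel.gif"]                              # never fetched the script or CSS
early = ["/", "/pixel.gif", "/style.css", "/app.js"]   # resource fetched before the script

verdicts = [
    looks_like_real_browser(p, "/", "/app.js", "/pixel.gif", ["/style.css"])
    for p in (real, bot, early)
]
```

A real deployment would need to tolerate caching and parallel connections, which this sketch ignores.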
