How to call urllib2 get_header method? - python

I was looking into the "python urllib2 download size" question.
The methods that RanRag and jterrace suggested worked fine for me, but I was wondering how to use the urllib2.Request.get_header method to achieve the same thing. So I tried the code below:
>>> import urllib2
>>> req_info = urllib2.Request('http://mirror01.th.ifl.net/releases//precise/ubuntu-12.04-desktop-i386.iso')
>>> req_info.header_items()
[]
>>> req_info.get_header('Content-Length')
>>>
As you can see, get_header returned nothing, and neither did header_items.
So, what is the correct way to use these methods?

The urllib2.Request class is just "an abstraction of a URL request" (http://docs.python.org/library/urllib2.html#urllib2.Request), and does not do any actual retrieval of data. Its get_header and header_items methods only report headers that have been set on the outgoing request, which is why they come back empty here. You must use urllib2.urlopen to retrieve data. urlopen takes either the URL directly as a string or an instance of the Request object.
For example:
>>> req_info = urllib2.urlopen('https://www.google.com/logos/2012/javelin-2012-hp.jpg')
>>> req_info.headers.keys()
['content-length', 'x-xss-protection', 'x-content-type-options', 'expires', 'server', 'last-modified', 'connection', 'cache-control', 'date', 'content-type']
>>> req_info.headers.getheader('Content-Length')
'52741'
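For contrast, here is a minimal sketch of what Request.get_header does report, written for Python 3, where urllib2's Request lives in urllib.request. The URL and the User-Agent value are placeholders, and nothing is fetched. Request stores header names through str.capitalize(), so the lookup key is 'User-agent' rather than 'User-Agent'.

```python
from urllib.request import Request

# Placeholder URL and header value, for illustration only.
req = Request("http://example.com/file.iso",
              headers={"User-Agent": "size-check/0.1"})

# Headers set on the request are visible; names are stored capitalize()d.
request_headers = req.header_items()
user_agent = req.get_header("User-agent")

# Response headers such as Content-Length are not available here,
# because constructing a Request does not contact the server.
content_length = req.get_header("Content-Length")

print(request_headers)
print(user_agent)
print(content_length)
```

To read Content-Length you still need the response object returned by urlopen, as in the answer above.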

Related

How can I easily remove an information from a url?

My users will be redirected to my site with some information in the URL, like this:
http://127.0.0.1:8000/accounts/dasbboard/?trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f
Using urllib gives me:
query='trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f'
But that is not what I need. I want to get the value after reference (621538940cbc9865e63ec43857ed0f). What is the best way to do that in my view? Thank you in advance.
You can first parse the url with urlparse, and then use parse_qs to parse the querystring part:
>>> from urllib.parse import urlparse, parse_qs
>>> url = 'http://127.0.0.1:8000/accounts/dasbboard/?trxref=621538940cbc9865e63ec43857ed0f&reference=621538940cbc9865e63ec43857ed0f'
>>> parse_qs(urlparse(url).query)
{'trxref': ['621538940cbc9865e63ec43857ed0f'], 'reference': ['621538940cbc9865e63ec43857ed0f']}
This is a dictionary that maps keys to a list of values, since a key can occur multiple times. We can then retrieve the reference with:
>>> data = parse_qs(urlparse(url).query)
>>> data['reference'][0]
'621538940cbc9865e63ec43857ed0f'
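If the parameter might be absent, indexing with ['reference'][0] raises KeyError. A small wrapper can return a default instead. This is only a sketch; get_query_param is a hypothetical helper name, not part of urllib.

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name, default=None):
    """Return the first value of query parameter `name`, or `default`."""
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else default


url = ("http://127.0.0.1:8000/accounts/dasbboard/"
       "?trxref=621538940cbc9865e63ec43857ed0f"
       "&reference=621538940cbc9865e63ec43857ed0f")

reference = get_query_param(url, "reference")
status = get_query_param(url, "status", default="")  # not in the URL

print(reference)
print(repr(status))
```

If the view is a Django view, request.GET.get('reference') should give the same value without any manual parsing.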

Replace port in url using python

I want to change the port in a given URL.
OLD=http://test:7000/vcc3
NEW=http://test:7777/vcc3
I tried the code below. I am able to change the host name, but not the port.
>>> from urlparse import urlparse
>>> aaa = urlparse('http://test:7000/vcc3')
>>> aaa.hostname
'test'
>>> aaa.port
7000
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.hostname,"newurl")).geturl()
'http://newurl:7000/vcc3'
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.port,"7777")).geturl()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected a character buffer object
It's not a particularly good error message. It's complaining because you're passing ParseResult.port, an int, to the string's replace method which expects a str. Just stringify port before you pass it in:
aaa._replace(netloc=aaa.netloc.replace(str(aaa.port), "7777"))
I'm astonished that there isn't a simple way to set the port using the urlparse library. It feels like an oversight. Ideally you'd be able to say something like parseresult._replace(port=7777), but alas, that doesn't work.
The details of the port are stored in netloc, so you can simply do:
>>> a = urlparse('http://test:7000/vcc3')
>>> a._replace(netloc='newurl:7777').geturl()
'http://newurl:7777/vcc3'
>>> a._replace(netloc=a.hostname+':7777').geturl() # Keep the same host
'http://test:7777/vcc3'
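The problem is that ParseResult's port is a read-only property derived from netloc, so you can't set the attribute. If you would rather not use the underscore-prefixed _replace() method (it is a documented namedtuple method, despite the underscore), you can build a new ParseResult instead: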
from urllib.parse import urlparse, ParseResult
old = urlparse('http://test:7000/vcc3')
new = ParseResult(scheme=old.scheme, netloc="{}:{}".format(old.hostname, 7777),
                  path=old.path, params=old.params, query=old.query, fragment=old.fragment)
new_url = new.geturl()
A second option is to convert the ParseResult to a list and modify that, as described in the question "Changing hostname in a url".
By the way, the urlparse library is not very flexible in this area.
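The answers above rebuild netloc by hand, which drops any user:password@ prefix and the brackets around an IPv6 host. Here is a sketch of a helper that keeps both, for Python 3. replace_port is a hypothetical name, and the hostname attribute lower-cases the host, so the result may differ in case from the input.

```python
from urllib.parse import urlsplit


def replace_port(url, new_port):
    """Return `url` with its port set to `new_port` (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals must be bracketed inside netloc.
        host = "[" + host + "]"
    userinfo = ""
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        userinfo += "@"
    netloc = "{}{}:{}".format(userinfo, host, new_port)
    # _replace returns a modified copy; geturl() joins it back together.
    return parts._replace(netloc=netloc).geturl()


print(replace_port("http://test:7000/vcc3", 7777))
print(replace_port("http://user:pw@test/vcc3?x=1", 8080))
```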

Modify URL components in Python 2

Is there a cleaner way to modify some parts of a URL in Python 2?
For example
http://foo/bar -> http://foo/yah
At present, I'm doing this:
import urlparse
url = 'http://foo/bar'
# Modify path component of URL from 'bar' to 'yah'
# Use nasty convert-to-list hack due to urlparse.ParseResult being immutable
parts = list(urlparse.urlparse(url))
parts[2] = 'yah'
url = urlparse.urlunparse(parts)
Is there a cleaner solution?
Unfortunately, the documentation is out of date; the results produced by urlparse.urlparse() (and urlparse.urlsplit()) use a collections.namedtuple()-produced class as a base.
Don't turn this namedtuple into a list, but make use of the utility method provided for just this task:
parts = urlparse.urlparse(url)
parts = parts._replace(path='yah')
url = parts.geturl()
The namedtuple._replace() method lets you create a new copy with specific elements replaced. The ParseResult.geturl() method then re-joins the parts into a url for you.
Demo:
>>> import urlparse
>>> url = 'http://foo/bar'
>>> parts = urlparse.urlparse(url)
>>> parts = parts._replace(path='yah')
>>> parts.geturl()
'http://foo/yah'
mgilson filed a bug report (with patch) to address the documentation issue.
I guess the proper way to do it is the following, since using underscore-prefixed methods or variables such as _replace is usually discouraged.
from urlparse import urlparse, urlunparse
res = urlparse('http://www.goog.com:80/this/is/path/;param=paramval?q=val&foo=bar#hash')
l_res = list(res)
# l_res is now ['http', 'www.goog.com:80', '/this/is/path/', 'param=paramval', 'q=val&foo=bar', 'hash']
l_res[2] = '/new/path'
urlunparse(l_res)
# outputs 'http://www.goog.com:80/new/path;param=paramval?q=val&foo=bar#hash'
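The two approaches from this thread can be compared side by side. The sketch below uses Python 3, where the same functions live in urllib.parse, and a leading slash on the new path. It shows that direct assignment fails because the result is an immutable namedtuple, and that _replace and the list round-trip produce the same URL.

```python
from urllib.parse import urlparse, urlunparse

url = "http://foo/bar"
parts = urlparse(url)

# ParseResult is an immutable namedtuple, so direct assignment fails.
try:
    parts.path = "/yah"
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# Option 1: namedtuple's documented _replace() returns a modified copy.
via_replace = parts._replace(path="/yah").geturl()

# Option 2: round-trip through a mutable list (index 2 is the path).
as_list = list(parts)
as_list[2] = "/yah"
via_list = urlunparse(as_list)

print(assignment_failed, via_replace, via_list)
```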

What is a nice, reliable short way to get the charset of a webpage?

I'm a bit surprised that it's so complicated to get a charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of functions, but not this.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'
So you have to get the header, and split it. Twice.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
...
>>> charset
'ISO-8859-1'
That's a surprising amount of steps for such a basic function. Am I missing something?
Have you checked this?
How to download any(!) webpage with correct charset in python?
I did some research and came up with this solution:
import urllib.request
response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()
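This is how I would do it in Python 3. I haven't tested it in Python 2, where the call is urllib2.urlopen (there is no urllib2.request) and the headers object has no get_content_charset method; response.headers.getparam('charset') should give the same information there.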
Here is how it works, since the official Python documentation doesn't explain it very well. The result of urlopen is an http.client.HTTPResponse object. Its headers property is an http.client.HTTPMessage object, which, according to the documentation, "is implemented using the email.message.Message class". That class has a method called get_content_charset, which tries to determine and return the character set of the response.
By default, this method returns None if it is unable to determine the character set, but you can override this behavior by passing a failobj parameter:
encoding = response.headers.get_content_charset(failobj="utf-8")
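Because get_content_charset comes from email.message.Message, its behaviour can be checked without any network access. The sketch below builds a bare Message with a given Content-Type value. charset_of is a hypothetical helper name used only for illustration.

```python
from email.message import Message
from http.client import HTTPMessage


def charset_of(content_type, default=None):
    """Parse the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


with_charset = charset_of("text/html; charset=UTF-8")    # lower-cased result
without_charset = charset_of("text/html")                # no charset parameter
with_default = charset_of("text/html", default="utf-8")  # failobj is used

print(with_charset, without_charset, with_default)
print(issubclass(HTTPMessage, Message))
```

The returned name is lower-cased, so comparisons against values such as 'UTF-8' need to account for that.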
You're not missing anything. It's doing the right thing - the encoding of an HTTP response is a subpart of Content-Type.
Note also that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - that's an ugly hack though (on the part of the page author) and is not too common.
I would go with chardet Universal Encoding Detector.
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}
You are doing it right, but your approach would fail for pages where the charset is declared in a meta tag or is not declared at all.
If you look closer at the chardet sources, it has charsetprober and charsetgroupprober modules that deal with this problem nicely.
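One way to cover the cases mentioned in these answers is a fallback chain: the Content-Type header first, then a meta tag in the body, then a default. The sketch below is for Python 3 and works on values you have already fetched. detect_charset and META_CHARSET_RE are hypothetical names, and the regular expression is a deliberate simplification, not a real HTML parser.

```python
import re
from email.message import Message

# Simplified pattern: matches <meta charset="..."> and the http-equiv form.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(content_type, body, default="utf-8"):
    """Guess a charset from the header, then a <meta> tag, then `default`."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        header_charset = msg.get_content_charset()
        if header_charset:
            return header_charset
    # Only look near the top of the document, where <meta> tags normally are.
    match = META_CHARSET_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=GB2312"></head></html>')

print(detect_charset("text/html; charset=ISO-8859-1", page))
print(detect_charset("text/html", page))
print(detect_charset(None, b"<html></html>"))
```

If neither source declares a charset, chardet.detect(body) could be tried before falling back to the default.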

How can I determine the final URL after redirection using python / urllib2?

I need to get the final URL after redirection in python.
What's a good way to do that?
>>> import urllib2
>>> var = urllib2.urlopen('http://www.stackoverflow.com/')
>>> var.geturl()
'http://stackoverflow.com/'
