I want to change the port in a given URL.
OLD=http://test:7000/vcc3
NEW=http://test:7777/vcc3
I tried the code below. I am able to change the hostname in the URL, but not the port.
>>> from urlparse import urlparse
>>> aaa = urlparse('http://test:7000/vcc3')
>>> aaa.hostname
'test'
>>> aaa.port
7000
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.hostname,"newurl")).geturl()
'http://newurl:7000/vcc3'
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.port,"7777")).geturl()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: expected a character buffer object
It's not a particularly good error message. It's complaining because you're passing ParseResult.port, an int, to the string's replace method, which expects a str. Just stringify the port before you pass it in:
aaa._replace(netloc=aaa.netloc.replace(str(aaa.port), "7777"))
I'm astonished that there isn't a simple way to set the port using the urlparse library. It feels like an oversight. Ideally you'd be able to say something like parseresult._replace(port=7777), but alas, that doesn't work.
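For reference, a minimal round trip with the stringified-port fix applied (same example URL; the printed result is what I'd expect):
from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

aaa = urlparse('http://test:7000/vcc3')
# Stringify the port before handing it to str.replace, then rebuild the URL
fixed = aaa._replace(netloc=aaa.netloc.replace(str(aaa.port), "7777"))
print(fixed.geturl())  # http://test:7777/vcc3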
The details of the port are stored in netloc, so you can simply do:
>>> a = urlparse('http://test:7000/vcc3')
>>> a._replace(netloc='newurl:7777').geturl()
'http://newurl:7777/vcc3'
>>> a._replace(netloc=a.hostname+':7777').geturl() # Keep the same host
'http://test:7777/vcc3'
The problem is that ParseResult's port member is read-only, so you can't change the attribute directly - and don't even try to use the private _replace() method. A solution is here:
from urllib.parse import urlparse, ParseResult
old = urlparse('http://test:7000/vcc3')
new = ParseResult(scheme=old.scheme, netloc="{}:{}".format(old.hostname, 7777),
                  path=old.path, params=old.params, query=old.query, fragment=old.fragment)
new_url = new.geturl()
The second idea is to convert the ParseResult to a list and change it afterwards, as shown here:
Changing hostname in a url
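For example, a minimal sketch of that list-based approach (my own code, assuming Python 3's urllib.parse; on Python 2 the same names live in the urlparse module):
from urllib.parse import urlparse, urlunparse

old = urlparse('http://test:7000/vcc3')
parts = list(old)                              # [scheme, netloc, path, params, query, fragment]
parts[1] = '{}:{}'.format(old.hostname, 7777)  # netloc is index 1
print(urlunparse(parts))                       # http://test:7777/vcc3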
BTW, the urlparse library is not flexible in that area!
I would like to be able to enter a server response code and have Requests tell me what the code means. For example, code 200 --> ok
I found a link to the source code which shows the dictionary structure of the codes and descriptions. I see that Requests will return a response code for a given description:
print requests.codes.processing # returns 102
print requests.codes.ok # returns 200
print requests.codes.not_found # returns 404
But not the other way around:
print requests.codes[200] # returns None
print requests.codes.viewkeys() # returns dict_keys([])
print requests.codes.keys() # returns []
I thought this would be a routine task, but cannot seem to find an answer to this in online searching, or in the documentation.
Alternatively, in Python 2.x, you can use httplib.responses:
>>> import httplib
>>> httplib.responses[200]
'OK'
>>> httplib.responses[404]
'Not Found'
In Python 3.x, use the http module:
In [1]: from http.client import responses
In [2]: responses[200]
Out[2]: 'OK'
In [3]: responses[404]
Out[3]: 'Not Found'
One possibility:
>>> import requests
>>> requests.status_codes._codes[200]
('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '\xe2\x9c\x93')
The first value in the tuple is used as the conventional code key.
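So one way to go from a number back to a name is to invert that mapping yourself; note this is a sketch relying on the private _codes dict, not a supported API:
import requests

# Map each numeric code to its first (conventional) name, e.g. 200 -> 'ok'
code_to_name = {code: names[0]
                for code, names in requests.status_codes._codes.items()}

print(code_to_name[200])  # 'ok'
print(code_to_name[404])  # 'not_found'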
I had the same problem before and found the answer in this question.
Basically:
responsedata.status_code - gives you the integer status code
responsedata.reason - gives the text/string representation of the status code
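For example (the URL is just a placeholder):
import requests

responsedata = requests.get('https://api.github.com')  # placeholder request
print(responsedata.status_code)  # e.g. 200
print(responsedata.reason)       # e.g. 'OK'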
requests.status_codes.codes.OK
works nicely and makes it more readable in my application code
Notice that in the source code, requests.status_codes.codes is of type LookupDict, which overrides the __getitem__ method.
You can see all the supported keys with dir(requests.status_codes.codes).
When using it in combination with Flask:
I like to use the following constants from the flask-api plugin:
from flask_api import status, which gives me a more descriptive version of the HTTP status codes, as in
status.HTTP_200_OK
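For example, a rough sketch of how that reads in a Flask view (assuming Flask and the flask-api package are installed; the route is made up):
from flask import Flask, jsonify
from flask_api import status

app = Flask(__name__)

@app.route('/ping')
def ping():
    # status.HTTP_200_OK is just the integer 200 under a readable name
    return jsonify(ok=True), status.HTTP_200_OK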
With Python 3.x this will work
>>> from http import HTTPStatus
>>> HTTPStatus(200).phrase
'OK'
I'm playing around in Python and there's a URL that I'm trying to use, which goes like this:
https://[username@domain.com]:[password]@domain.com/blah
This is my code:
response = urllib2.urlopen("https://[username@domain.com]:[password]@domain.com/blah")
html = response.read()
print ("data="+html)
This isn't going through; it doesn't like the @ symbols, and probably the : too. I tried searching and read something about unquote, but that's not doing anything. This is the error I get:
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: 'password@updates.opendns.com'
How do I get around this? The actual site is "https://updates.opendns.com/nic/update?hostname=
thank you!
URIs have a bunch of reserved characters separating distinguishable parts of the URI (/, ?, &, # and a few others). If any of these characters appears in either the username (@ does in your case) or the password, it needs to be percent-encoded or the URI becomes invalid.
In Python 3:
>>> from urllib import parse
>>> parse.quote("p@ssword?")
'p%40ssword%3F'
In Python 2:
>>> import urllib
>>> urllib.quote("p@ssword?")
'p%40ssword%3F'
Also, don't put the username and password in square brackets, this is not valid either.
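Putting it together, a minimal sketch of building the URL with the credentials percent-encoded (Python 3; the password value is made up):
from urllib.parse import quote  # urllib.quote on Python 2

username = "username@domain.com"   # example from the question
password = "p@ssword"              # hypothetical password
url = "https://{}:{}@domain.com/blah".format(
    quote(username, safe=""), quote(password, safe=""))
print(url)  # https://username%40domain.com:p%40ssword@domain.com/blah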
Use urlencode! Not sure if urllib2 has it, but urllib has an urlencode function. One sec and I'll get back to you.
I did a quick check, and it seems that you need to use urllib instead of urllib2 for that... importing urllib and then using urllib.urlencode(YOUR URL) should work!
import urllib
url = urllib.urlencode(<your_url_here>)
EDIT: it's actually urllib2.quote()!
I was looking into python urllib2 download size question.
Although the method RanRag or jterrace suggested worked fine for me, I was wondering how to use the urllib2.Request.get_header method to achieve the same thing. So I tried the lines of code below:
>>> import urllib2
>>> req_info = urllib2.Request('http://mirror01.th.ifl.net/releases//precise/ubuntu-12.04-desktop-i386.iso')
>>> req_info.header_items()
[]
>>> req_info.get_header('Content-Length')
>>>
As you can see, get_header returned nothing, and neither did header_items.
So, what is the correct way to use the above methods?
The urllib2.Request class is just "an abstraction of a URL request" (http://docs.python.org/library/urllib2.html#urllib2.Request), and does not do any actual retrieval of data. You must use urllib2.urlopen to retrieve data. urlopen either takes the url directly as a string, or you can pass an instance of the Request object too.
For example:
>>> req_info = urllib2.urlopen('https://www.google.com/logos/2012/javelin-2012-hp.jpg')
>>> req_info.headers.keys()
['content-length', 'x-xss-protection', 'x-content-type-options', 'expires', 'server', 'last-modified', 'connection', 'cache-control', 'date', 'content-type']
>>> req_info.headers.getheader('Content-Length')
'52741'
I'm currently going through the Python Challenge, and I'm up to level 4 (see here). I have only been learning Python for a few months, and I'm trying to learn Python 3 over 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:
import urllib, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print " going to", nothing
    else:
        break
So to convert this to 3, I would change to this:
import urllib.request, urllib.parse, urllib.error, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break
So if I run the 2.x version it works fine: it goes through the loop, scraping the URL, and runs to the end. I get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If I run the 3.x version, I get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
So if I change the r to a b in this line:
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
_bk201
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode the bytes returned by urlopen().read() and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode()  # note: decode() defaults to utf-8
The page doesn't specify a preferred character encoding via the Content-Type header or a <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset
value of "ISO-8859-1" when received via HTTP.
Regular expressions with a str pattern only make sense on text, not on binary data.
So, keep findnothing = re.compile(r"nothing is (\d+)").search, and convert text to string instead.
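Putting both points together, a minimal sketch of the corrected Python 3 loop (the same code as in the question, with decode() added):
import urllib.request, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    # decode() turns the bytes from urlopen() into a str, so the str pattern
    # and the string concatenation both keep working
    text = urllib.request.urlopen(prefix + nothing).read().decode()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break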
Instead of urllib we're using requests, which has two relevant options (you may be able to find similar options in urllib):
Response object
>>> import requests
>>> response = requests.get('https://api.github.com')
Using response.content - this has the bytes type:
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'
While using response.text - you get the decoded response as a str:
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'
The default encoding is utf-8, but you can set it right after the request, like so:
>>> import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
And then response.text will hold the content decoded with the encoding you requested.
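A quick way to see the difference between the two (same example request; the printed types are what I'd expect):
import requests

response = requests.get('https://api.github.com')
print(response.encoding)       # the encoding requests inferred from the headers
print(type(response.content))  # <class 'bytes'> - the raw payload
print(type(response.text))     # <class 'str'>   - the decoded text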
I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of functions, but not this one.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'
So you have to get the header, and split it. Twice.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'
That's a surprising amount of steps for such a basic function. Am I missing something?
Have you checked this?
How to download any(!) webpage with correct charset in python?
I did some research and came up with this solution:
response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()
This is how I would do it in Python 3. I haven't tested it in Python 2 but I am guessing that you would have to use urllib2.request instead of urllib.request.
Here is how it works, since the official Python documentation doesn't explain it very well: the result of urlopen is an http.client.HTTPResponse object. The headers property of this object is an http.client.HTTPMessage object, which, according to the documentation, "is implemented using the email.message.Message class", which has a method called get_content_charset, which tries to determine and return the character set of the response.
By default, this method returns None if it is unable to determine the character set, but you can override this behavior instead by passing a failobj parameter:
encoding = response.headers.get_content_charset(failobj="utf-8")
You're not missing anything. It's doing the right thing - the encoding of an HTTP response is a subpart of Content-Type.
Note also that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - that's an ugly hack though (on the part of the page author) and is not too common.
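If you do need to handle that case, here is a rough sketch (Python 3, my own simplification; a real HTML parser would be more robust than the regex):
import re
import urllib.request

response = urllib.request.urlopen('http://www.google.com/')
charset = response.headers.get_content_charset()
if charset is None:
    # Peek at the start of the body and look for a <meta ... charset=...> declaration
    head = response.read(2048).decode('ascii', errors='replace')
    match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    charset = match.group(1) if match else 'ISO-8859-1'  # HTTP default for text/*
print(charset)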
I would go with chardet Universal Encoding Detector.
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}
You are doing it right, but your approach would fail for pages where the charset is declared in a meta tag or is not declared at all.
If you look closer at the chardet sources, it has charsetprober/charsetgroupprober modules that deal with this problem nicely.
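A rough sketch of decoding a page with chardet's guess (Python 3; google.cn is the example URL from the answer above, and the ISO-8859-1 fallback is my own assumption):
import urllib.request
import chardet

data = urllib.request.urlopen("http://google.cn/").read()
guess = chardet.detect(data)                  # e.g. {'encoding': 'GB2312', 'confidence': 0.99}
encoding = guess['encoding'] or 'ISO-8859-1'  # fall back to the HTTP default
text = data.decode(encoding, errors='replace')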