Python urllib2: ensuring a url is well-formed

In ValueError: unknown url type in urllib2, though the url is fine if opened in a browser, it was pointed out that before calling opener.open() you must ensure the url passed to it is well-formed (i.e. has an "http://" prefix for HTTP urls, "ftp://" for FTP, etc.).
The question was refined to ask "Is it possible to handle such cases automatically with some builtin function or I have to do error handling with subsequent string concatenation?" Or put another way: is there a Python built-in for doing this?
However, this refined question was never answered, hence the re-asking here. It's easy enough to do myself, but why reinvent the wheel, right?

In Python it's quite common to catch an exception instead of checking the value in advance. So something like this would be perfectly OK for me and probably for most Python programmers:
import urllib2

opener = urllib2.build_opener()
try:
    opener.open(url)
except ValueError as e:
    # the url was malformed: fix it (e.g. prepend a scheme) and try again
    pass
But I don't see how you could handle urls without a prefix automatically. The prefix defines the protocol to be used. If it's not given, how would you "guess" it?

If you want to default to prepending http://, you really need to do this on your own. There is no reason why this should be better than prepending e.g. gopher: or mailto: or news: - there are plenty of protocols.
Just because web browsers today hide the http:// prefix from their users doesn't make it obsolete.
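
If you do decide to default to http://, a minimal sketch of the check, assuming the standard-library urlparse module (the helper name ensure_scheme is made up for illustration):

from urlparse import urlparse  # urllib.parse in Python 3

def ensure_scheme(url, default_scheme='http'):
    # urlparse() reports an empty scheme for urls like "www.foo.com/bar"
    if not urlparse(url).scheme:
        url = '%s://%s' % (default_scheme, url)
    return url

ensure_scheme('www.example.com/a') would then return 'http://www.example.com/a', while an already-qualified url passes through unchanged.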

Related

Trying to grab a section of a URL and then append it to the end of a line in python

So I am trying to grab a session ID from the end of a URL, and then add it to another URL. I am basically opening an internal site which goes to a homepage, and then, to search for an item, goes to another page. To get around having to give out the password we use a script for it, currently written in AutoHotkey, which doesn't work very well, has a lot of issues, and generally is more of a pain than just loading the site and logging in.
So here is my progress:
First I tried:
sid = urlparse(browser.current_url).query
url = urljoin('http://internal.site/BelManage/find_pc_by_name.asp?', sid)
That failed, which makes sense. So then I imported urlencode and did:
updateurl='http://internal.site/BelManage/find_pc_by_name.asp?{}'.format(urllib.parse.urlencode(sid))
This fails, stating "not a valid non-string sequence or mapping object".
Because I have to grab the sid with .query from urlparse, I cannot use string concatenation unless I convert sid from a set to a string, and I am not sure of an easy way to do that.
Any ideas of a better way to do this?
Ok, I forgot everything, then remembered, and was able to get this working.
I converted sid to a str via str(sid), defined the url variable with the address plus a simple concatenation, and it seems to be working now.
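
For reference, a minimal sketch of that working approach (browser.current_url comes from the asker's browser-automation setup, and the internal address is theirs):

from urlparse import urlparse  # urllib.parse in Python 3

# .query is already a string, so str() is just a safety net here
sid = str(urlparse(browser.current_url).query)
url = 'http://internal.site/BelManage/find_pc_by_name.asp?' + sid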

Python: How to replace all relative urls in a document with absolute urls

I am writing an app for Google App Engine that fetches the content of a url and then writes the content of that external url to the local page. I am able to do this, but the obvious issue is that the relative urls in the fetched content now point to non-existent pages. I'm not very experienced with python so writing code like this on my own would probably take years.
Here's my code so far:
url = "http://www.google.com/"
try:
result = urllib2.urlopen(url)
self.response.out.write(result.read())
except urllib2.URLError, e:
self.response.out.write(e)
Note: I'm not creating a malicious app.
I can tell you broadly what you'll need to do, but unfortunately, it's a little complicated and you're probably not going to like it. Python defines a very generic parser class, HTMLParser (in the html.parser module in Python 3, or the HTMLParser module in Python 2), for doing exactly this sort of thing. The class defines a feed() method which provides the main point of access for an end user such as yourself. The feed() method rips through the raw html, and as it encounters different html markup items, it calls different "handler" methods for processing each one. You actually use the class by overriding these "handler" methods, most of which are empty by default (i.e., they simply return without doing anything). The HTMLParser documentation provides some example code demonstrating how to implement this override for trivial cases.
For most of the handler methods, you will override the empty default logic by simply telling the handler to print whatever item it encounters, perhaps with an additional "<" or "/" or ">" character printed at the beginning or end as appropriate (the parser strips these out by default). In this way, you will cause the parser to write out the same html code again just exactly as it encountered it. But for one of the handler methods, specifically the handle_starttag() method, you will have to provide some additional logic so that when you encounter an "A" tag with an attribute keyed by "HREF", you inspect the value associated with the "HREF" key, and then substitute a full URL address rather than a relative address if required.
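
A rough sketch of that idea, using the Python 2 module names to match the urllib2 code above (the class name AbsoluteLinkParser is made up, and only anchor tags are rewritten):

from HTMLParser import HTMLParser  # html.parser in Python 3
from urlparse import urljoin       # urllib.parse in Python 3

class AbsoluteLinkParser(HTMLParser):
    # Re-emits the html it is fed, rewriting relative href values
    # against a base url as it goes.
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = [(k, urljoin(self.base_url, v) if k == 'href' and v else v)
                     for k, v in attrs]
        attr_text = ''.join(' %s="%s"' % (k, v or '') for k, v in attrs)
        self.out.append('<%s%s>' % (tag, attr_text))

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

parser = AbsoluteLinkParser('http://www.google.com/')
parser.feed(html)  # html is the page source fetched with urllib2
print ''.join(parser.out)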
The URLs would be relative to the base URL of the page you are looking at. So you need to get that base passed into your backend python code. You could use document.URL if you are calling your python from Javascript.
Or, possibly, self.request.referer will be useful to you.
The answer depends on where the relative URLs are coming from and how you are calling your python; that's not clear from your question.

URL Encoding/Decoding in python (whole url, not just the path)

I have done a lot of searching and experimentation, and I haven't been able to find the solution. So, if there is something trivial I missed, I apologize ahead of time.
Problem:
I have a python turbogears app that is downloading url resources. It is being given a URL to download by clients.
One client in particular sends unescaped urls. For example, 'http://www.foo.com/file with space.txt'
When I try to download it, the download fails, because the server does not recognize this url. It needs to have the spaces escaped to be a valid url.
I know that there are methods (urllib.urlencode, urllib.quote, etc.) that will encode strings. However, they assume that the strings they work on are not urls. If you give a URL to these methods, they escape the scheme of the url and make it even more invalid.
So, the summary is: How do I escape a whole fully qualified url in python?
NOTE: I have tried using urlparse to parse out the url components to get at the path. However sometimes the url will have query parameters, fragments etc. So, I do not want to write code that splits the url into its parts, escapes whatever is required only from the path+query+fragment, and then reconstructs the url.
Is there any helper function that directly takes the url, and escapes it?
Also, note that sometimes I get valid escaped urls from clients. So, I want to handle them as well, without double escaping them.
Ok, I found the following on pypi. This seems to solve the problem.
https://github.com/seomoz/url-py/
This is the url egg from seomoz. Seems to do the job very well.
You can use regular expressions to separate the domain name and the file path, then only urlencode the path. See the re module documentation for details.
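
Sticking to the standard library instead of a regex, a rough sketch of the split-quote-rebuild approach (the safe='%' arguments are an assumption intended to leave already-escaped sequences alone, so valid urls from other clients are not double-escaped):

import urllib
import urlparse  # quote/urlsplit/urlunsplit live in urllib.parse in Python 3

def escape_url(url):
    # Split the url so the scheme and host are never touched,
    # quote only the path/query/fragment, then rebuild it.
    parts = urlparse.urlsplit(url)
    path = urllib.quote(parts.path, safe='/%')
    query = urllib.quote(parts.query, safe='=&%')
    fragment = urllib.quote(parts.fragment, safe='%')
    return urlparse.urlunsplit(
        (parts.scheme, parts.netloc, path, query, fragment))

escape_url('http://www.foo.com/file with space.txt') then yields 'http://www.foo.com/file%20with%20space.txt', while an already-escaped url passes through unchanged.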

Getting a URL with Python

I'm trying to do something similar to placekitten.com, wherein a user can input two strings after the base URL and have those strings alter the output. I'm doing this in Python, and I cannot for the life of me figure out how to grab the URL. In PHP I can do it with the query string and $_REQUEST. I can't find a similar method in Python that doesn't rely on CGI.
(I know I could do this with Django, but that's serious overkill for this project.)
This is just from looking at the docs, but have you tried it?
cherrypy.request.path_info
The docs say:
The ‘relative path’ portion of the Request-URI. This is relative to the script_name (‘mount point’) of the application which is handling this request.
http://docs.cherrypy.org/stable/refman/_cprequest.html#cherrypy._cprequest.Request.path_info
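
A minimal sketch of placekitten-style routing in CherryPy (the class and parameter names are made up; CherryPy's default dispatcher passes unmatched path segments to a method named default as positional arguments):

import cherrypy

class SizeService(object):
    @cherrypy.expose
    def default(self, width, height):
        # a request for /300/200 arrives here as width='300', height='200'
        return 'requested size: %s x %s' % (width, height)

if __name__ == '__main__':
    cherrypy.quickstart(SizeService())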

Any way to detect mistyped urls in python?

My python program involves going to a user-supplied url and then doing stuff on the page. Ideally, mistyped urls would be recognized and pop up an error. But if they have the right syntax and just don't point anywhere, then either an ISP error page or an ad site is loaded instead.
For example:
"http://washingtonn.edu" --> http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu
"http://www.amazdon.com/" --> http://www.amazdon.com/
Is there any way to detect these without knowing all the possible pages? The second one might be pretty hard because it's an actual site, but I'd be happy with catching the first.
Thanks!
Unless I am misunderstanding your question, what you ask for is impossible, doesn't make sense, or is far, far from trivial.
If you think about it, other than a 404 error (where you detect that a page does not exist), if a page does exist there is no way of knowing whether the page is "good" or "bad", as this is subjective. It might be possible to apply some general rules, but you can't cover all the possibilities.
The only way would be something like what Google does with its suggestions, but this would imply a huge database with a popularity ranking of websites, tested every time for proximity to the input; that is far beyond trivial and probably not necessary.
For handling 404 statuses in python you could use something like httplib.
Good luck!
You can check the HTTP status code of your requests. Probably the most interesting for you is the 404 - Not Found status. In the second case, you are right: if the response is a web page, you can't know if it is what the user wanted or a typo.
What you're talking about is heuristics, and it's actually a very complex topic. You could keep a list of common websites and common misspellings: if something cannot resolve (i.e., a 404 HTTP response), check the input against the list and pick the "closest" answer (that is a whole algorithm in itself). It wouldn't be too reliable, though, because a misspelled website may indeed resolve correctly (although to an unintended domain).
A really simple solution, if you're very concerned about misspelled urls, is to just ask for the URL twice.
You could use a regex to check for a valid url, and also use httplib to check the response code and require a 200 to continue.
HTTPConnection.getresponse() returns a response object whose status attribute will be 200 if the url is valid.
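
A rough sketch of that check with httplib (check_url is a made-up name; a HEAD request keeps the check cheap, and an unresolvable host raises an exception rather than returning a status):

import httplib  # http.client in Python 3
from urlparse import urlsplit  # urllib.parse in Python 3

def check_url(url):
    # Returns the HTTP status code, or None if the host is unreachable.
    # Note: an ISP that hijacks failed DNS lookups will still return 200
    # here, which is exactly the problem described in the question.
    parts = urlsplit(url)
    try:
        conn = httplib.HTTPConnection(parts.netloc)
        conn.request('HEAD', parts.path or '/')
        return conn.getresponse().status
    except Exception:
        return None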
