URL Encoding/Decoding in python (whole url, not just the path)

I have done a lot of searching and experimentation, and I haven't been able to find the solution. So, if there is something trivial I missed, I apologize ahead of time.
Problem:
I have a python turbogears app that is downloading url resources. It is being given a URL to download by clients.
One client in particular sends unescaped URLs. For example, 'http://www.foo.com/file with space.txt'
When I try to download it, the download fails, because the server does not recognize this url. It needs to have the spaces escaped to be a valid url.
I know that there are methods (urllib.urlencode, urllib.quote, etc.) that will encode strings. However, they assume that the strings they work on are not URLs. If you give a URL to these methods, they escape the scheme of the URL and make it even more invalid.
So, the summary is: How do I escape a whole fully qualified URL in Python?
NOTE: I have tried using urlparse to parse out the url components to get at the path. However sometimes the url will have query parameters, fragments etc. So, I do not want to write code that splits the url into its parts, escapes whatever is required only from the path+query+fragment, and then reconstructs the url.
Is there any helper function that directly takes the url, and escapes it?
Also, note that sometimes I get valid escaped urls from clients. So, I want to handle them as well, without double escaping them.

Ok, I found the following on pypi. This seems to solve the problem.
https://github.com/seomoz/url-py/
This is the url egg from seomoz. Seems to do the job very well.

You can use regular expressions to separate the domain name and the file path, then only urlencode the path. Here's the regex documentation, here's a tutorial.
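As a minimal sketch of the split-and-escape approach (using the Python 3 standard library's urllib.parse rather than regular expressions, which is a substitution on my part), the helper below escapes only the path, query, and fragment and leaves the scheme and host alone. The name escape_url and the sets of "safe" characters are assumptions for illustration. Keeping '%' in the safe set is what avoids double escaping of URLs that arrive already escaped; the trade-off is that a literal '%' that is not part of an escape sequence is also left as-is.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component. '%' is kept so that
# sequences that are already escaped (e.g. '%20') are not escaped twice.
PATH_SAFE = "/%:@!$&'()*+,;=~"
QUERY_SAFE = PATH_SAFE + "?"


def escape_url(url):
    """Escape the path, query and fragment of a URL; keep scheme and host."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((
        scheme,
        netloc,
        quote(path, safe=PATH_SAFE),
        quote(query, safe=QUERY_SAFE),
        quote(fragment, safe=QUERY_SAFE),
    ))


if __name__ == "__main__":
    print(escape_url("http://www.foo.com/file with space.txt"))
    print(escape_url("http://www.foo.com/file%20with%20space.txt"))
```

Both calls should produce the same escaped URL, since the second input is already escaped and passes through unchanged.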

Related

Is it possible to use multiple slash Django URL as one variable in Django?

I'm new to Django. I'm now creating a project. In that project, I've links like this:
https://localhost:8000/example.com/example/path/
In the URL, the example.com/example/path/ part can be of any length, like this:
example.com
or
example.com/asset/css/style.css
or
domain.com/core/content/auth/assets/js/vendor/jquery.js
I've used <str:domainurl>, but it is not working because the value contains multiple forward slashes. The number of slashes varies because the URLs are generated while web scraping.
So is there a way to use the full URL as one variable?

You want to be using the path path converter [Django docs]:
path('<path:domainurl>/', some_view)
Quoting Django docs:
path - Matches any non-empty string, including the path separator, '/'. This allows you to match against a complete URL path rather than a segment of a URL path as with str.
Note: Design your URL patterns and order them carefully if you are going to use this. Django uses the first matching URL pattern to serve any request.

If you want Django to match some custom strings take a look at re_path (https://docs.djangoproject.com/en/3.1/ref/urls/#django.urls.re_path) function. It basically allows you to pass a regular expression.
For the urls samples you provided you would probably want an expression like this (in your urls.py):
re_path(r'^(?P<my_url>[a-z0-9/\.]+)$', your_view_function)
This will pass a my_url kwarg to your view which you can then process as you want.
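To see what that expression accepts before wiring it into urls.py, it can be tried with Python's re module directly. This is only a sketch: the helper name extract_my_url is hypothetical, and it checks the regular expression itself rather than running Django's resolver. Note that the character class allows only lowercase letters, digits, '/' and '.', so paths containing hyphens, underscores, or uppercase letters would not match.

```python
import re

# Same expression as in the re_path() call above.
URL_PATTERN = re.compile(r'^(?P<my_url>[a-z0-9/\.]+)$')


def extract_my_url(path):
    """Return the captured my_url group, or None if the path does not match."""
    match = URL_PATTERN.search(path)
    return match.group('my_url') if match else None


if __name__ == "__main__":
    for sample in (
        "example.com",
        "example.com/asset/css/style.css",
        "domain.com/core/content/auth/assets/js/vendor/jquery.js",
        "my-site.com/asset",  # contains a hyphen, so it is rejected
    ):
        print(sample, "->", extract_my_url(sample))
```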

Trying to grab a section of a URL and then amend it to the end of a line in python

So I am trying to grab a session ID from the end of a URL and then add it to another URL. I am basically opening an internal site which goes to a homepage; searching for an item then goes to another page. To get around having to give out the password we use a script for it. We currently use AutoHotkey, which doesn't work very well, has a lot of issues, and is generally more of a pain than just loading the site and logging in.
So here is my progress:
First I tried:
sid = urlparse(browser.current_url).query
url = urljoin('http://internal.site/BelManage/find_pc_by_name.asp?', sid)
That failed, which makes sense. So then I imported urlencode and did:
updateurl='http://internal.site/BelManage/find_pc_by_name.asp?{}'.format(urllib.parse.urlencode(sid))
This fails stating not a valid non-string sequence or mapping object.
Because I have to grab the sid with .query from urlparse, I assumed I could not use string concatenation unless I converted sid to a string first, and I am not sure of an easy way to do that.
Any ideas of a better way to do this?

Ok I forgot everything, and then remembered, and was able to get this working.
I converted sid to a str via str(sid), defined the url variable with the address and a simple concatenation and it seems to be working now.
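For reference, a minimal Python 3 sketch of the concatenation approach. The .query attribute returned by urlparse/urlsplit is already a str, so it can be appended directly; urlencode fails here because it expects a mapping or a sequence of pairs, not a string. The helper name build_search_url, the home.asp page, and the sid=abc123 value are made up for illustration.

```python
from urllib.parse import urlsplit

SEARCH_PAGE = 'http://internal.site/BelManage/find_pc_by_name.asp'


def build_search_url(current_url):
    """Copy the query string of current_url onto the search page URL."""
    sid = urlsplit(current_url).query  # already a plain str
    return SEARCH_PAGE + '?' + sid if sid else SEARCH_PAGE


if __name__ == "__main__":
    # Hypothetical URL of the page the browser is currently on.
    print(build_search_url('http://internal.site/BelManage/home.asp?sid=abc123'))
```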

How can I write the regex for those urls

I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one - or both - match the same things.

Following a more concise approach I'd really define the edit URL as:
http://example.com/<pagename>/edit
This is more clear and guessable in my humble opinion.
Then, remember that Django loops your url patterns, in the same order you defined them, and stops on the first one matching the incoming request. So the order they are defined with is really important.
Coming with the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Always remember the leading caret and the trailing dollar sign. They say that the URL must start and end exactly where the regexp does; without them, any leading or trailing characters would still let the regexp match (while you likely want to return a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you prefer the user-friendly style commonly referred to as REST URLs, although RESTfulness is a concept that has nothing to do with URL style).
Summarizing, put the following in your urls:
url(r'^(?P<page_name>[\w]+)$', views.page),
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
You can easily exclude particular characters from URLs by replacing \w with a character set of your own.
To learn more about Django URL Dispatching read here.
Note: Regexps are as powerful as they are dangerous, especially when exposed to the network. Keep them simple, and be sure you really understand what you are defining; otherwise your web application may be exposed to security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
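The two anchored patterns can be tried outside Django with a small first-match loop. This is only a rough simulation of the dispatching order described above, not Django's actual resolver; the resolve helper and the view names 'page' and 'edit' are stand-ins for illustration.

```python
import re

# Patterns in the same order as in the URLconf; the first match wins.
PATTERNS = [
    (re.compile(r'^(?P<page_name>[\w]+)$'), 'page'),
    (re.compile(r'^_edit/(?P<page_name>[\w]+)$'), 'edit'),
]


def resolve(path):
    """Return (view_name, page_name) for the first matching pattern, else None."""
    for regex, view_name in PATTERNS:
        match = regex.search(path)
        if match:
            return view_name, match.group('page_name')
    return None  # Django would answer with a 404 here


if __name__ == "__main__":
    for path in ('HomePage', '_edit/HomePage', 'HomePage/extra'):
        print(path, '->', resolve(path))
```

Because \w does not match '/', the page pattern cannot swallow the _edit/ URLs even though it is listed first.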

Please try the following URLs, that are simpler:
url(r'_edit/(?P<page_name>[\w-]+)', views.edit),
url(r'(?P<page_name>[\w-]+)', views.page),

Python urllib2 ensuring url is well-formed

In ValueError: unknown url type in urllib2, though the url is fine if opened in a browser, it was pointed out that before calling opener.open() you must ensure that the url passed to it is well-formed (ie - has a "http://" prefix for HTTP urls, "ftp://" for FTP, etc).
The question was refined to ask "Is it possible to handle such cases automatically with some builtin function or I have to do error handling with subsequent string concatenation?" Or put another way: is there a Python built-in for doing this?
However, this refined question was never answered, hence the re-asking here. It's easy enough to do myself, but why reinvent the wheel right?

In Python it's quite common to accept an exception instead of checking the value in advance. So something like this would be perfectly OK for me and probably for most Python programmers:
try:
    opener.open(url)
except ValueError as e:
    # fix url and try again
    # ...
    pass
But I don't see how you would like to handle urls without prefix automatically. The prefix defines the protocol to be used. If it's not given, how would you "guess" it?

If you want to default to prepending http://, you really need to do this on your own. There is no reason why this should be better than prepending e.g. gopher: or mailto: or news: - there are plenty of protocols.
Just because web-browsers today hide the http:// prefix from their users doesn't make it obsolete.
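If defaulting to http:// is acceptable for your inputs, doing it yourself takes only a few lines. The sketch below is written for Python 3 and uses a hypothetical helper name, ensure_scheme. It assumes that a URL counts as having a scheme only when it starts with something like "name://", so schemes written without the double slash (for example mailto:) would not be recognized.

```python
import re

# A scheme is a letter followed by letters, digits, '+', '.' or '-',
# and must be followed by '://' at the very start of the string.
SCHEME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')


def ensure_scheme(url, default_scheme='http'):
    """Prepend default_scheme:// when the URL does not start with a scheme."""
    if SCHEME_RE.match(url):
        return url
    return '{}://{}'.format(default_scheme, url)


if __name__ == "__main__":
    print(ensure_scheme('www.example.com/page'))
    print(ensure_scheme('ftp://example.com/file.txt'))
```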

Getting a URL with Python

I'm trying to do something similar to placekitten.com, wherein a user can input two strings after the base URL and have those strings alter the output. I'm doing this in Python, and I cannot for the life of me figure out how to grab the URL. In PHP I can do it with query string and $_REQUEST. I can't find a similar method in Python that doesn't rely on CGI.
(I know I could do this with Django, but that's serious overkill for this project.)

This is just by looking at the docs but have you tried it?
cherrypy.request.path_info
The docs say:
The ‘relative path’ portion of the Request-URI. This is relative to the script_name (‘mount point’) of the application which is handling this request.
http://docs.cherrypy.org/stable/refman/_cprequest.html#cherrypy._cprequest.Request.path_info
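Once you have the relative path as a string (from path_info or an equivalent in another framework), pulling out the two user-supplied values is plain string handling. A minimal sketch, with a hypothetical helper name and a placekitten-style path such as '/200/300' assumed as input:

```python
def split_two_segments(path_info):
    """Return the two path segments as a tuple, or None if there are not exactly two."""
    segments = [segment for segment in path_info.split('/') if segment]
    if len(segments) != 2:
        return None
    return segments[0], segments[1]


if __name__ == "__main__":
    print(split_two_segments('/200/300'))
    print(split_two_segments('/200'))
```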
