How can I write the regex for those urls - python

I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one - or both - match the same things.

For a more concise approach, I would define the edit URL as:
http://example.com/<pagename>/edit
This is clearer and more guessable, in my humble opinion.
Then, remember that Django loops over your URL patterns in the order you defined them and stops at the first one that matches the incoming request. So the order in which they are defined is really important.
Now, to answer your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Always include the leading caret and the trailing dollar sign: they anchor the pattern to the start and the end of the URL respectively. Without them, a URL with extra leading or trailing characters would match as well (while you likely want to return a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you prefer the user-friendly style commonly referred to as REST URLs, although RESTfulness is a concept that has nothing to do with URL style).
To summarize, put the following in your urls:
url(r'^(?P<page_name>[\w]+)$', views.page),
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
You can easily keep particular characters out of your URLs by replacing \w with a character set of your own.
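A rough way to check how these anchored patterns behave is to run them through Python's re module directly. The sketch below assumes, as Django's regex-based resolver does, that the leading slash is stripped before matching and that patterns are tried in order with the first match winning. The resolve helper and the string view names are made up for illustration and are not part of Django.

```python
import re

# The two patterns from the answer, paired with placeholder view names
# (the names are hypothetical stand-ins for views.page and views.edit).
URL_PATTERNS = [
    (re.compile(r'^(?P<page_name>[\w]+)$'), 'page'),
    (re.compile(r'^_edit/(?P<page_name>[\w]+)$'), 'edit'),
]


def resolve(path):
    """Return (view_name, kwargs) for the first matching pattern, else None."""
    path = path.lstrip('/')  # assumption: match against the path without its leading slash
    for regex, view_name in URL_PATTERNS:
        match = regex.search(path)
        if match:
            return view_name, match.groupdict()
    return None  # nothing matched: this is where a 404 would be raised


print(resolve('/HomePage'))
print(resolve('/_edit/HomePage'))
print(resolve('/Home/Page'))
```

Because \w does not include the slash, the page pattern cannot swallow the edit URL. Note that \w does include the underscore, so a bare /_edit would resolve as a page named _edit; a custom character set is one way to rule that out.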
To learn more about Django URL dispatching, read the Django URL dispatcher documentation.
Note: regexps are as powerful as they are dangerous, especially when exposed to the network. Keep them simple, and be sure you really understand what you are defining; otherwise your web application may be exposed to security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Please try the following URLs, which are simpler:
url(r'^_edit/(?P<page_name>[\w-]+)$', views.edit),
url(r'^(?P<page_name>[\w-]+)$', views.page),

Related

Django Url Dispatcher regex

I want SOMETHING to be optional, so basically I want www.site.com/endpoint to redirect to www.site.com/something/endpoint automatically. Is there a way to do it on one line?
Right now I am doing:
url(r'^SOMETHING/endpoint$', 'endpoint', name='endpoint'),
url(r'^endpoint$', RedirectView.as_view(url='SOMETHING/endpoint')),
Cheers.
Why do you want to do this on one line? You're talking about two different URLs that do two different things—one is doing an HTTP redirect and the other is rendering a view. Two lines is the right way to go.
Writing a broader regex to cover both URLs will allow you to use the same view for both, but will not cause a redirect (that is, it will not change the URL to SOMETHING/endpoint).
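For reference, a broader pattern of the kind described here could make the SOMETHING/ prefix optional with a non-capturing group. This is only a sketch checked with Python's re module; the pattern is an assumption about what "broader" would look like, and as noted above it serves both URLs from one view without redirecting.

```python
import re

# One pattern covering both URLs; (?:...)? makes the prefix optional.
both = re.compile(r'^(?:SOMETHING/)?endpoint$')


def matches_both(path):
    """True if the path is 'endpoint' with or without the SOMETHING/ prefix."""
    return both.search(path) is not None


for candidate in ['endpoint', 'SOMETHING/endpoint', 'other/endpoint']:
    print(candidate, matches_both(candidate))
```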
I believe you are asking to let any URL ending in "endpoint" redirect.
To accomplish this, change your redirect URL regex to r"endpoint$". The caret in a regex anchors the match to the start of the string, so dropping it lets the pattern match anywhere in the path.
This regex will match any URL ending in "endpoint", e.g. foo/endpoint, bar/endpoint
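A quick check with Python's re module shows what the unanchored pattern accepts. The sample paths are invented for illustration. Two side effects are worth noticing: the pattern also matches SOMETHING/endpoint itself, so the real view would have to stay listed before the redirect, and it matches any path that merely ends in those letters.

```python
import re

loose = re.compile(r'endpoint$')  # no leading caret: may match anywhere, must end the path

# Hypothetical request paths (leading slash already stripped).
paths = [
    'endpoint',
    'foo/endpoint',
    'bar/endpoint',
    'SOMETHING/endpoint',
    'myendpoint',
    'endpoint/x',
]

matched = [p for p in paths if loose.search(p)]
print(matched)
```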

Regular Expression Query Python

I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
I however want to use it to crawl sites other than ebay, and to customize to my needs.
I am fairly new to python and have limited re experience.
I am unsure of what this line achieves.
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below - regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
As for your question, that line uses something called a 'regular expression' to find matching patterns in a text (in this case, HTML). re.findall() is a function that returns a list, so if we focus on just that:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
The r prefix marks the string as 'raw', meaning that backslashes are kept literally instead of being treated as Python escape sequences.
href="([^"]+)
The parentheses indicate a group (what we care about in the match), and the [^"]+ means 'match one or more characters that aren't a double quote'. As you can probably guess, this group will return the URL of the link.
.*class="vip"
The .* matches anything (well, almost anything) 0 or more times (which here could include other tags, the closing quote of the link, whitespace, etc.). Nothing special with class="vip" - it just needs to appear.
title=\'([^\']+)', lines):
Here you see an escaped quote and then another group as we saw above. This time, we are capturing everything between the two apostrophes that follow the title attribute.
The end result of this is you are iterating through a list of all matches, and those matches are going to look something like (my_matched_link, my_matched_title), which are passed into for url, title, after which further processing is done.
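To see the tuples concretely, the same pattern can be run against a small hand-written snippet. The HTML below is invented for illustration and is not taken from the real site; it only mimics the attribute order the regex expects.

```python
import re

# Made-up markup with href, class="vip", and a single-quoted title, in that order.
lines = (
    '<a href="http://example.com/item/1" class="vip" title=\'First item\'>First</a>\n'
    '<a href="http://example.com/item/2" class="vip" title=\'Second item\'>Second</a>\n'
)

pattern = r'href="([^"]+).*class="vip" title=\'([^\']+)'

# With two groups in the pattern, findall returns a list of 2-tuples.
matches = re.findall(pattern, lines)

for url, title in matches:
    print(url, '->', title)
```

Each link sits on its own line here, which matters because .* does not cross newlines by default.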
I am not sure if this would answer your question. But you can consider scrapy: http://scrapy.org for crawling various websites. It is a nice infrastructure which provides a lot of flexibility and is easy to customize to some specific needs.
Regular expressions are bad for parsing HTML
The above is the main idea I would like to communicate to you. For why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, the HTML text can change (e.g. a new attribute can be added, the order of attributes can change, or some other change may be introduced) while still being interpreted in exactly the same way by web browsers, and yet such a change can completely break your script.
HTML should be parsed using specialized HTML parsers or web scrapers. They know which differences are significant and which are not.
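As a minimal sketch of the parser-based approach, the example below uses html.parser from the Python standard library as a stand-in for the libraries mentioned in this thread, so it runs without extra installs. The VipLinkParser class, the extract_vip_links helper, and the sample markup are all hypothetical. It is meant to show that attribute order stops mattering once a parser does the work.

```python
from html.parser import HTMLParser


class VipLinkParser(HTMLParser):
    """Collect (href, title) for every <a> tag whose class list contains 'vip'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        classes = (attr.get('class') or '').split()
        if 'vip' in classes and 'href' in attr:
            self.links.append((attr['href'], attr.get('title')))


def extract_vip_links(html):
    parser = VipLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Same link, attributes in a different order: the result is identical.
original = '<a href="http://example.com/item/1" class="vip" title=\'First item\'>x</a>'
reordered = '<a title="First item" class="vip" href="http://example.com/item/1">x</a>'

print(extract_vip_links(original))
print(extract_vip_links(reordered))
```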
What to use for scraping?
There are multiple solutions, but one of the most notable is Scrapy. Try it, you may start to love it.

using unicode strings with white space as Django url variable

Is there a problem with using unicode (specifically Hebrew) strings that include white space?
Some of them also include characters such as "%".
I'm experiencing some problems and since this is my first Django project I want to rule out this as a problem before going further into debugging.
And if there is a known Django problem with this kind of URL, is there a way around it?
I know I can reformat the text to solve some of those problems, but since I'm preparing a site that uses raw open government data sets (perfectly legal), I would like to stick to the original format as closely as possible.
thanks for the help
Django shouldn't have any problems with unicode URLs, or with whitespace in URLs for that matter (although you should take care to make sure whitespace is URL-encoded as %20).
Either way, though, using white space in a URL is just bad form. It's not guaranteed to work unless it's urlencoded, and then that's just one more thing to worry about. Best to make any field that will eventually become part of a URL a SlugField (so spaces aren't allowed to begin with) or run the value through slugify before placing it in the URL:
In template:
http://domain.com/{{ some_string_with_spaces|slugify }}/
Or in python code:
from django.template.defaultfilters import slugify
u'http://domain.com/%s/' % slugify(some_string_with_spaces)
Take a look here for a fairly comprehensive discussion on what makes an invalid (or valid) URL.
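If the original text has to be kept rather than slugified, percent-encoding is the other option mentioned above. The sketch below uses urllib.parse from the Python 3 standard library; the sample string (Hebrew for "hello world" followed by "50%") and the domain are made up for illustration.

```python
from urllib.parse import quote, unquote

# Hebrew "shalom olam" plus a literal percent sign, written with escapes
# so the source line stays left-to-right.
raw_name = '\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd 50%'

# quote() encodes the text as UTF-8 and percent-escapes every unsafe byte,
# including the space (%20) and the percent sign itself (%25).
encoded = quote(raw_name)
url = 'http://domain.com/' + encoded + '/'

# unquote() reverses the encoding and gives back the original string.
decoded = unquote(encoded)

print(url)
print(decoded == raw_name)
```

The round trip is what lets the original data-set value survive unchanged, at the cost of a less readable URL.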

Any way to detect mistyped urls in python?

My python program involves going to a user-supplied url and then doing stuff on the page. Ideally, mistyped urls would be recognized and pop up an error. But if they have the right syntax and just don't point anywhere, then either an ISP error page or an ad site is loaded instead.
For example:
"http://washingtonn.edu" --> http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu
"http://www.amazdon.com/" --> http://www.amazdon.com/
Is there any way to detect these without knowing all the possible pages? The second one might be pretty hard because it's an actual site, but I'd be happy with catching the first.
Thanks!
Unless I am misunderstanding your question, what you ask for is impossible, doesn't make sense, or is far far from trivial.
If you think about it, other than a 404 error, where you detect that a page does not exist, there is no way of knowing whether a page that does exist is "good" or "bad", as this is subjective. It might be possible to apply some general rules, but you can't cover all the possibilities.
The only way would be something like what Google does with the suggestions, but this would imply a huge database with a list of popularity of websites, and test every time for proximity, but that is far beyond trivial and probably not necessary.
For handling 404 statuses in Python you could use a library like httplib.
Good luck!
You can check the HTTP status code of your requests. Probably the most interesting one for you is the 404 - Not Found status. In the second case, you are right - if the response is a web page, you can't know whether it is what the user wanted or a typo.
What you're talking about is heuristics, and it's actually a very complex topic. You could have a list of common websites and common misspellings - if something cannot resolve (i.e., 404 HTTP response), check the input against the list and pick the "closest" answer (this is a whole algorithm in and of itself). It wouldn't be too reliable though, because a misspelled website may indeed resolve correctly (although to the unintended domain).
A really simple solution, if you're very concerned about misspelled URLs, is to just ask for the URL twice.
You could use a regex to check for a valid url, and also use httplib to check for the response codes and require a 200 to continue.
HTTPConnection.getresponse() returns an HTTPResponse object; its status attribute will be 200 if the request succeeded.
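A minimal sketch of the status check, using http.client (the Python 3 name for httplib). The helper names head_status and same_host are hypothetical. Comparing hosts is only a rough heuristic for the ISP search-page case from the question, and it assumes the final URL is known, for example from a Location header or from whatever loads the page.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send a HEAD request; return (status_code, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == 'https':
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or '/'
        if parts.query:
            target += '?' + parts.query
        conn.request('HEAD', target)
        response = conn.getresponse()
        return response.status, response.getheader('Location')
    finally:
        conn.close()


def same_host(requested_url, final_url):
    """True if both URLs point at the same host name."""
    return urlsplit(requested_url).hostname == urlsplit(final_url).hostname


# The first example from the question ends up on a different host.
print(same_host(
    'http://washingtonn.edu',
    'http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu',
))
```

A caller would treat anything other than 200 from head_status as an error, and a same_host result of False as a hint that the URL was probably mistyped. The second example from the question passes both checks, which is the limitation the answers above describe.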

Django URL regex question

I had a quick question about Django URL configuration, and I guess REGEX as well. I have a small configuration application that takes in environments and artifacts in the following way:
url(r'^env/(?P<env>\w+)/artifact/(?P<artifact>\w+)/$', 'config.views.ipview', name="bothlist"),
Now, this works fine, but what I would like to do is have it accept additional optional parameters, such as a verbose mode or a no-formatting mode. I know how to do this just fine in the views, but I can't wrap my head around the regex.
the call would be something like
GET /env/<env>/artifact/<artifact>/<opt:verbose>/<opt:noformat>
Any help would be appreciated, thanks!
-Shawn
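I wouldn't put such options into the URL path. As you said, they are optional and only change the output, so they don't belong in the path.
Django matches URL patterns against the path only and ignores the query string, so your existing pattern already handles requests like:
/env/<env>/artifact/<artifact>/?verbose=1&noformat=1
In the view, the options are then available through request.GET. IMHO this is a much better usage of URLs.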
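On the view side, reading such flags could look roughly like the sketch below. The parse_flags helper and the set of accepted truthy values are assumptions made for illustration; the function takes any mapping of query parameters, so in a Django view it would be called as parse_flags(request.GET).

```python
def parse_flags(params):
    """Turn optional query parameters into booleans.

    params is any mapping of parameter names to string values,
    for example Django's request.GET inside a view.
    """
    truthy = {'1', 'true', 'yes', 'on'}  # assumed convention, adjust as needed
    return {
        'verbose': params.get('verbose', '').lower() in truthy,
        'noformat': params.get('noformat', '').lower() in truthy,
    }


# Equivalent of ?verbose=1&noformat=1, then of a request with no query string.
print(parse_flags({'verbose': '1', 'noformat': '1'}))
print(parse_flags({}))
```

Missing parameters simply fall back to False, so the existing URL pattern and any existing callers keep working unchanged.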
