Regular expressions and Russian symbols in Django - python

I have a URL pattern like this:
url(ur'^gradebook/(?P<group>[\w\-А-Яа-я])$', some_view, name='some_view')
and I expect it to process a request like
../gradebook/group='Ф-12б'
but I get an error and the server crashes.
Please help me figure out how to handle the Russian symbols.

The group='…' part is the bigger problem, since neither the equals sign = nor the quote ' is part of the character class.
Furthermore, the character class needs a quantifier so that it matches more than one character:
#                            quantifier ↓
url(ur'^gradebook/(?P<group>[\w\-А-Яа-я]+)$', some_view, name='some_view')
then this can match a URL:
/gradebook/Ф-12б
But if you want to match the group='…' part as well, you should include the = and the ' character in the character class:
#                     extra characters ↓↓
url(ur"^gradebook/(?P<group>[\w\-А-Яа-я'=]+)$", some_view, name='some_view')
Then you can match with:
/gradebook/group='Ф-12б'
That said, this pattern might accept too much, since it also accepts f'q'a=gr=f, for example.
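The difference between the three patterns can be checked outside Django with the plain re module. This is only a sketch under a few assumptions: it uses Python 3 str patterns (so the ur'' prefix is dropped), the helper group_of is a made-up name, and the paths are written without the leading slash because the pattern itself starts with ^gradebook/.

```python
import re

# The original pattern, the one with a quantifier, and the one that also
# allows ' and = inside the captured value.
single = re.compile(r"^gradebook/(?P<group>[\w\-А-Яа-я])$")
multi = re.compile(r"^gradebook/(?P<group>[\w\-А-Яа-я]+)$")
quoted = re.compile(r"^gradebook/(?P<group>[\w\-А-Яа-я'=]+)$")


def group_of(pattern, path):
    """Return the captured 'group' value, or None if the path does not match."""
    m = pattern.search(path)
    return m.group("group") if m else None


print(group_of(single, "gradebook/Ф-12б"))          # no quantifier: no match
print(group_of(multi, "gradebook/Ф-12б"))           # matches the bare value
print(group_of(multi, "gradebook/group='Ф-12б'"))   # = and ' are not allowed
print(group_of(quoted, "gradebook/group='Ф-12б'"))  # matches, quotes included
print(group_of(quoted, "gradebook/f'q'a=gr=f"))     # also matches: too permissive
```

Note that in Python 3 \w already matches Cyrillic letters in str patterns, so the explicit А-Яа-я range is redundant there.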

Related

Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does its job, but I noticed that it also includes the , and . characters in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), though it doesn't really affect this particular program: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude certain symbols at the end of the string, you can use a negative lookbehind. For example, to disallow a trailing . or ,:
.*(?<![.,])$
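As a rough illustration, the same lookbehind can be attached to a URL pattern instead of to .* and $, so that it also works on URLs in the middle of a text. The character class is borrowed from the question, with the assumption that the hyphen is meant literally (it is moved to the end of the class); the sample text is made up.

```python
import re

# The lookbehind rejects any match whose last character is . or , so the
# greedy + is forced to back off by one character in that case.
url_re = re.compile(r"https?://[a-zA-Z0-9.%+\\/_-]+(?<![.,])")

text = "See http://www.google.com/. Also http://example.com/a.b, and more."
print(url_re.findall(text))
```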
Answering in reverse:
No, a character class does not count as a group. It only tells the regular expression engine which characters are allowed at that position, nothing more, nothing less. Groups are created by surrounding part of the pattern with parentheses.
With regards to finding comma and dot: I think I see the problem here, though the rest of this answer may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+. Inside a character class the - character has special meaning: it denotes every character between its two neighbours, ordered by ASCII code. So [A-a] is a valid range; it includes A-Z, but also a bunch of other characters that aren't letters. In your class, +-\\ is therefore a range from + to \. If you want a literal - in the class, it needs to be the last (or first) character: [a-zA-Z0-9.%+\\/_-]+ should work.
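A quick way to see the effect of the hyphen position is to compare the two character classes directly. This is just a sketch; the names buggy, fixed and accidental are made up for the illustration.

```python
import re

buggy = re.compile(r"[a-zA-Z0-9.%+-\\/_]+")  # "+-\\" is a range from + to \
fixed = re.compile(r"[a-zA-Z0-9.%+\\/_-]+")  # trailing "-" is a literal hyphen

# Every character covered by the accidental range, in ASCII order.
accidental = "".join(chr(c) for c in range(ord("+"), ord("\\") + 1))
print(accidental)

# Compare how each class treats a comma, a semicolon and a hyphen.
for ch in ",;-":
    print(repr(ch), bool(buggy.fullmatch(ch)), bool(fixed.fullmatch(ch)))
```

The comma and the semicolon fall inside the accidental range, so only the first class accepts them; the hyphen is accepted by both.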
The comma does not appear explicitly in your regex, but the unintended range from + to \ created by that - covers it, since the comma sits between + and - in ASCII. It shouldn't be allowed anywhere in the URL. In general though, you'll just want to add more groups and more conditions.
First, break the URL apart into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements. For example, maybe the domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that consists of alphanumeric characters and ends, specifically, with .com/ (note the \.; an unescaped . would match any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the URL to end with, say, a dot, then you could end the pattern with something like [A-Za-z0-9]. Note the lack of a dot here, and its length: only a single character. This will change the rest of your regex, so you need to think about that.
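One possible way to write the (scheme)://(domain)(endpoint) split is with named groups in verbose mode. The sketch below is not a URL validator: the group names, the allowed characters and the trailing-dot rule are all assumptions chosen for the illustration, and a lookbehind is used for the trailing dot rather than a final single-character class.

```python
import re

url_re = re.compile(r"""
    (?P<scheme>https?)://
    (?P<domain>[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)   # labels separated by dots
    (?P<endpoint>(?:/[a-zA-Z0-9.%+_/-]*)?)          # optional path
    (?<![.])                                        # must not end in a dot
""", re.VERBOSE)

m = url_re.search("... at http://www.google.com/. It says")
if m:
    print(m.group("scheme"), m.group("domain"), m.group("endpoint"))
```

Because each part has its own group, the requirements for one part can be tightened without touching the others.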
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
import re
web_url_regex = re.compile(
    r'(http://|https://)'  # Capture the scheme name
    r'([a-zA-Z0-9.%+\\/_-]+)'  # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, URLs are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string).
To answer the second question first, no a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regexp, which is the same as the previous character class except that it does not include . (other punctuation is now already not included). So the matched pattern cannot end in .. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''', re.VERBOSE)
text = "... at http://www.google.com/. It says"
m = webURLregex.search(text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the last - does not appear to be intended to define a character range is based on the fact that, if it were, the range would run from + to \, i.e. 053-134 (octal), which already includes the digits and the uppercase letters, making 0-9 and A-Z redundant.

variables with space in url (django)

I am having the same issue as in "How to pass variables with spaces through URL in Django". I have tried the solutions mentioned there, but every request returns "The resource you are looking for has been removed, had its name changed, or is temporarily unavailable."
I am trying to pass a file name, for example: new 3
in urls.py:
url(r'^file_up/delete_file/(?P<oname>[0-9A-Za-z\ ]+)/$', 'app.views.delete_file' , name='delete_file'),
in views.py:
def delete_file(request,fname):
    return render_to_response(
        'app/submission_error.html',
        {'fname':fname,
        },
        context_instance=RequestContext(request)
    )
url : demo.net/file_up/delete_file/new%25203/
Thanks for the help
Thinking this over: are you stuck with having to use spaces? If not, I think you'll find your patterns (and variables) easier to work with if you use a different delimiter. A dash, an underscore, or even a forward slash will look cleaner and be more predictable.
I also found this: https://stackoverflow.com/a/497972/352452 which cites:
The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
You may also be able to capture your space with a literal %20. Not sure. Just leaving some thoughts here that come to mind.
demo.net/file_up/delete_file/new%25203/
This URL is double-encoded. The space is first encoded to %20, then the % character is encoded to %25. Django only decodes the URL once, so the decoded URL is /file_up/delete_file/new%203/. Your pattern allows only letters, digits and spaces, so it does not match the literal characters %20.
If you want to stick to spaces instead of a different delimiter, you should find the source of that URL and make sure it is only encoded once: demo.net/file_up/delete_file/new%203/.
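The double encoding can be reproduced with the standard library. This is a minimal sketch that stands in for what happens to the path: quote and unquote come from urllib.parse, and the character class is copied from the URL pattern in the question; the variable names are made up.

```python
import re
from urllib.parse import quote, unquote

name = "new 3"
once = quote(name)   # the space becomes %20
twice = quote(once)  # the % of %20 becomes %25

# Character class taken from the URL pattern in the question.
oname = re.compile(r"^[0-9A-Za-z\ ]+$")

for encoded in (once, twice):
    decoded = unquote(encoded)  # decode a single time only
    print(encoded, "->", repr(decoded), bool(oname.match(decoded)))
```

Only the singly encoded name decodes back to "new 3" and matches; the doubly encoded one still contains a literal %20 after one decoding pass.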

not understanding the constructed url in Django framework

I'm new to the Django framework and still learning it. I often come across URL patterns in urls.py like the one below:
url(r'^tracking/(?P<some_slug>[\w.-]+)/(?P<mail_64>{})/$'.format(base64_pattern), 'tracking_image_url', name='tracking_image_url'),
I understand the P part, but after it comes [\w.-]+, or sometimes simply \w+.
Can anyone please explain what these terms are and what they stand for?
\w is a regular expression character class that matches any alphanumeric character or the underscore. So \w+ matches one or more such characters in a row, and [\w.-]+ adds . and - to the set of matchable characters.
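The difference is easy to try out with the re module. In this small sketch the sample values and the shortened tracking/ pattern are made up for the illustration; only the \w+ and [\w.-]+ pieces come from the question.

```python
import re

word = re.compile(r"\w+")     # letters, digits and underscore only
slug = re.compile(r"[\w.-]+") # the same, plus . and -

for value in ("some_slug1", "my-slug.v2"):
    print(value, bool(word.fullmatch(value)), bool(slug.fullmatch(value)))

# (?P<name>...) makes the matched text available under that name, which is
# how Django passes it to the view as a keyword argument.
m = re.match(r"^tracking/(?P<some_slug>[\w.-]+)/$", "tracking/my-slug.v2/")
print(m.group("some_slug"))
```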

Django URLs - trailing slash gets added to variable value

I have a django application hosted with Apache. I'm busy using the django restframework to create an API, but I am having issues with URLs. As an example, I have a URL like this:
url(r'path/to/endpoint/(?P<db_id>.+)/$', views.PathDetail.as_view())
If I try to access this URL without the trailing slash, it does not match. If I make the slash optional by adding a question mark at the end, like this:
url(r'path/to/endpoint/(?P<db_id>.+)/?', views.PathDetail.as_view())
This matches with and without a trailing slash. The only issue is that if a trailing slash is used, it now gets included in the db_id variable in my view. So when it searches the database, the id doesn't match. I don't want to have to go through all my views and remove trailing slashes from my url variables using string handling.
So my question is, what is the best way to make a url match both with and without a trailing slash without including that trailing slash in a parameter that gets sent to the view?
Your pattern for the parameter is .+, which means one or more of any character, including /. No wonder the slash is included; why wouldn't it be?
If you want the pattern to include anything but /, use [^/]+ instead. If you want the pattern to include anything except slashes at the end, use .*[^/] for the pattern.
The .+ part of your regex will match one or more characters. This match is "greedy", meaning it will match as many characters as it can.
Check out: http://www.regular-expressions.info/repeat.html.
In the first case, the / has to be there for the full pattern to match.
In the second case, when the slash is missing, the pattern will match anyway because the slash is optional.
If the slash is present, the greedy db_id field will expand to the end (including the slash) and the slash will not match anything, but the overall pattern will still match because the slash is optional.
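Some easy solutions would be to make db_id non-greedy by using the ? modifier and anchoring the pattern at the end, (?P<db_id>.+?)/?$ (without the $, a non-greedy .+? is satisfied with a single character), or to make the field not match any slashes: (?P<db_id>[^/]+)/?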
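The three variants can be compared side by side with plain re. This is a sketch under some assumptions: every pattern is anchored with $ so that the comparison is fair, the helper db_id is a made-up name, and 42 is just a sample id.

```python
import re

greedy = re.compile(r"path/to/endpoint/(?P<db_id>.+)/?$")
lazy = re.compile(r"path/to/endpoint/(?P<db_id>.+?)/?$")
no_slash = re.compile(r"path/to/endpoint/(?P<db_id>[^/]+)/?$")


def db_id(pattern, path):
    """Return the captured db_id, or None if the path does not match."""
    m = pattern.search(path)
    return m.group("db_id") if m else None


for path in ("path/to/endpoint/42", "path/to/endpoint/42/"):
    print(path, db_id(greedy, path), db_id(lazy, path), db_id(no_slash, path))
```

Only the greedy .+ swallows the trailing slash. The [^/]+ variant is the stricter of the two fixes, since it also refuses ids that contain a slash in the middle.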

Regex in Django for URL with % or &

I have a URL that is going to be united-states/boulder-21781/tool-&-anchor/mulligan-21/. Assuming the best strategy is to encode the &, the URL changes to united-states/boulder-21781/tool-%26-anchor/mulligan-21/
I'm trying to write a URLconf that will accept this, but the regex I'm using isn't working. I have:
url(r'^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': '(?i)([\.\-\_\w]+)'}, 'view_tip_page', name='tip_page'),
What do I add to capture the %? Or should I just include the &?
My first recommendation would be to not do it. As you yourself are demonstrating, not everybody knows that a & is a perfectly valid character in a URI before the first ?, and you are bound to get into trouble. It also looks ugly, is harder to type, and more jarring than, say, and, or even just n. Having said that, if you really want it in there, just put it in there in the character class.
Not related to your question, the way you're building that regex is weird; you're not capturing any of the bits of the path for use by the view. You're also including the (?i) global modifier four times, and specifying _ which is already part of \w. I dunno, I'd expect something like
r'(?i)(?P<country>[.\w-]+)/(?P<city>[.\w-]+)-(?P<cityno>\d+)/...etc...
but maybe I'm missing something.
Well currently there is no way for you to match % or & in your regex. Depending on whether it is encoded or not, you will need to add one or the other to the character class in your regex, and it should match.
I might change it to something like the following:
r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'}
And proof that it works:
>>> import re
>>> pattern = re.compile(r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'})
>>> s = 'united-states/boulder-21781/tool-%26-anchor/mulligan-21/'
>>> match = pattern.match(s)
>>> match.groups()
('united-states', 'boulder', '21781', 'tool-%26-anchor', 'mulligan', '21')
A few comments on your regex:
The (?i) isn't really doing anything, since you are using \w which will already match both upper and lowercase. If you do want to use (?i) I would move it out of the replacement string and into the format string ('(?i)...' % {'regex': '...'} instead of '...' % {'regex': '(?i)...'}), since otherwise it will show up multiple times.
Note that character class was changed from [\.\-\_\w] to [-.%\w], this is because underscores are included in \w, you don't need to escape the hyphen if it comes at the beginning of the character class, and you don't need to escape the . inside of character classes.
Also, \w does match digits so technically to match something like 'boulder-21781' you could just use %(regex)s instead of %(regex)s-(\d+), but I didn't want to change that in case it was intentionally adding some additional verification of the format.
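If it is not clear whether the pattern will see the encoded or the raw form, one option is to allow both % and & in the character class. The sketch below assumes exactly that; the names segment and pattern are made up, and the rest of the pattern follows the one used above.

```python
import re

# One path segment: word characters plus - . % and &
segment = r"([-.%&\w]+)"
pattern = re.compile(
    r"^%(seg)s/%(seg)s-(\d+)/%(seg)s/%(seg)s-(\d+)/$" % {"seg": segment}
)

encoded = "united-states/boulder-21781/tool-%26-anchor/mulligan-21/"
raw = "united-states/boulder-21781/tool-&-anchor/mulligan-21/"

for path in (encoded, raw):
    print(pattern.match(path).groups())
```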
