not understanding the constructed url in Django framework - python

I'm new to the Django framework and still learning it. I often come across URL patterns in urls.py like the one below:
url(r'^tracking/(?P<some_slug>[\w.-]+)/(?P<mail_64>{})/$'.format(base64_pattern), 'tracking_image_url', name='tracking_image_url'),
I understand the ?P part, but after it [\w.-]+ is added, or sometimes simply \w+.
Can anyone please explain what these terms are and what they stand for?

\w is a regular expression shorthand that matches any alphanumeric character or underscore. So \w+ matches one or more such characters, and [\w.-]+ adds . and - to the set of matchable characters.
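A minimal sketch of the difference, using Python's re module directly rather than Django. The value of base64_pattern is not shown in the question, so the one below is a hypothetical stand-in, and the sample strings are made up for illustration.

```python
import re

# \w+ : letters, digits and underscore only
word_pattern = re.compile(r'\w+')
# [\w.-]+ : the same set plus a literal dot and a literal hyphen
slug_pattern = re.compile(r'[\w.-]+')

# Hypothetical stand-in: the real base64_pattern is not shown in the question
base64_pattern = r'[A-Za-z0-9+/=_-]+'
tracking = re.compile(
    r'^tracking/(?P<some_slug>[\w.-]+)/(?P<mail_64>{})/$'.format(base64_pattern)
)

print(bool(word_pattern.fullmatch('my_slug1')))    # only word characters
print(bool(word_pattern.fullmatch('my-slug.v2')))  # contains - and .
print(bool(slug_pattern.fullmatch('my-slug.v2')))  # - and . are now allowed

match = tracking.match('tracking/my-slug.v2/dXNlcg==/')
print(match.groupdict())  # the named groups become a dict of captured values
```

The names inside (?P<...>) are what end up as keys in groupdict().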

Related

Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does its job properly, but I noticed that it also includes the ',' and '.' characters in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), although it doesn't really seem to affect this particular program: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude some symbols at the end of the string you can use a negative lookbehind. For example, to disallow a trailing . or ,:
.*(?<![.,])$
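A rough sketch of how that lookbehind could be combined with a URL pattern. It assumes \S* as a simplified stand-in for the question's character class, drops the $ so that URLs in the middle of a sentence are found, and uses a made-up sample sentence.

```python
import re

# \S* greedily takes every non-whitespace character, then the negative
# lookbehind forces the engine to back off until the last character
# of the match is neither '.' nor ','
url_regex = re.compile(r'https?://\S*(?<![.,])')

text = 'See http://www.google.com/, or https://example.org/page.'
urls = url_regex.findall(text)
print(urls)
```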
Answering in reverse:
No, character classes are just a shorthand for "any one of the bracketed characters". They don't create groups the way surrounding something with parentheses would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regard to finding the comma and dot: actually, I see the problem here, though the rest below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+ Inside a character class the - character has a special meaning: everything between the two characters on either side of it, by ASCII code. So [A-a] is a valid range. It includes A-Z, but also a bunch of other characters that aren't A-Z. If you want to include a literal - in the class, it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
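A small check of that range behaviour, as a sketch: the two classes are the ones quoted above, and the probe characters are picked for illustration.

```python
import re

# '+-\\' inside the class is read as the range from '+' to '\'
accidental = re.compile(r'[a-zA-Z0-9.%+-\\/_]+')
# with '-' moved to the end it is just a literal hyphen
fixed = re.compile(r'[a-zA-Z0-9.%+\\/_-]+')

for ch in ',;-':
    in_accidental = bool(accidental.fullmatch(ch))
    in_fixed = bool(fixed.fullmatch(ch))
    print(repr(ch), in_accidental, in_fixed)
```

The comma and semicolon sit between + and \ in ASCII order, so only the first class accepts them; the hyphen is accepted by both.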
For the comma: it isn't listed explicitly in your regex, but it falls inside the accidental +-\\ range, which is why it gets matched. With the - moved to the end it is no longer allowed anywhere in the URL. In general though, you'll just want to add more groups/more conditions.
First, break apart the URL into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the URL to end with, say, a dot, then you could end with something like [A-Za-z0-9] -- note the lack of a dot here, plus its length: only a single character. This will change the rest of your regex, so you need to think about that.
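One way the scheme/domain/endpoint split could look, as a sketch rather than a complete URL matcher. The group names, the extra dot in the domain class (to allow www.) and the sample sentence are assumptions made for the example.

```python
import re

url_regex = re.compile(r'''
    (?P<scheme>https?)://                            # http or https
    (?P<domain>[a-zA-Z0-9.]+\.com)                   # domain, restricted to .com
    (?P<endpoint>/[a-zA-Z0-9.%+/_-]*[A-Za-z0-9])?    # optional path, must end alphanumeric
''', re.VERBOSE)

m = url_regex.search('Docs at http://www.google.com/search. Enjoy')
parts = m.groupdict()
print(parts)
```

Because the endpoint has to end in a single alphanumeric character, the sentence-ending dot is left out of the match.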
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
import re

web_url_regex = re.compile(
    r'(http://|https://)'       # Capture the scheme name
    r'([a-zA-Z0-9.%+\\/_-]+)'   # Everything else, apparently (literal - kept last)
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
To answer the second question first: no, a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regex. It is the same as the previous character class except that it does not include the dot (other punctuation is now already excluded), so the matched pattern cannot end in a dot. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
text = "... at http://www.google.com/. It says"  # renamed from str to avoid shadowing the builtin
m = webURLregex.search(text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the last - does not appear to be intended to define a character range is based on the fact that, if it were, the range would run from 053 to 134 (octal), that is from + to \, which already includes the digits and the uppercase letters, making 0-9 and A-Z redundant.

What does the regex [^\s]*? mean?

I am starting to learn how to write a Python spider to download pictures from the web, and I found the following code. I know some basic regex.
I know \.jpg means .jpg and | means or. What is the meaning of [^\s]*? in the first line? And why is \s used?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
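To make the lazy versus greedy distinction concrete, here is a short sketch. The sample URL is invented so that two extensions appear in a row, and the alternation is shortened to two extensions.

```python
import re

text = 'http://example.com/a.jpg.png'

# Lazy: stop at the first point where an extension can follow
lazy = re.match(r'http:[^\s]*?\.(?:jpg|png)', text).group()
# Greedy: take as much as possible, then back off to the last extension
greedy = re.match(r'http:[^\s]*\.(?:jpg|png)', text).group()
print(lazy)
print(greedy)

# [^\s] and \S describe the same set of characters
sample = 'a b\tc'
same = re.findall(r'[^\s]+', sample) == re.findall(r'\S+', sample)
print(same)
```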
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of these are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
These are equally illegal as URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.
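The three patterns can be compared side by side with a quick sketch. The first two sample strings come from the examples above; the https one is made up.

```python
import re

first = re.compile(r'http:[^\s]*?(\.jpg|\.png|\.gif)')
second = re.compile(r'http://.*?(\.jpg|\.png|\.gif)')
better = re.compile(r'https?://\S+\.(jpe?g|png|gif)')

samples = ['http:.png', 'http://foo bar.png', 'https://example.com/cat.jpeg']

results = {}
for s in samples:
    # fullmatch: the whole sample must be matched by the pattern
    results[s] = tuple(bool(p.fullmatch(s)) for p in (first, second, better))
    print(s, results[s])
```

Each tuple reads (first, second, better). Only the suggested pattern rejects both malformed strings while accepting the https URL.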

Django: Can't resolve an URL pattern from urls.py

My problem is the following:
Inside my urls.py I have defined these url patterns:
url(r'^image/upload', 'main.views.presentations.upload_image'),
url(r'^image/upload-from-url', 'main.views.presentations.upload_image_from_url'),
The problem is that when I request this URL from my browser:
myowndomain:8000/image/upload-from-url
Django always executes the first pattern (r'^image/upload').
Is there any solution to my problem?
Django uses the first matching pattern, and your ^image/upload pattern doesn't include anything to stop it matching the longer text. The solution is to require that your pattern also match the end of the string:
r'^image/upload$'
By convention, Django URLs generally have a trailing slash as well, but that's not strictly required:
r'^image/upload/$'
You need to insert the dollar sign "$" at the end of the pattern. The dollar sign is an anchor: it matches a position rather than a character, namely the end of the string. Because image/upload-from-url also begins with image/upload, both URLs match the first pattern, so you need to explicitly say where the pattern stops.
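The first-match behaviour can be imitated with plain re, as a sketch. The resolve function below is a hypothetical, simplified stand-in and not Django's actual resolver, and the view names are shortened from the question.

```python
import re

def resolve(path, urlpatterns):
    """Return the view name of the first pattern that matches the path."""
    for regex, view_name in urlpatterns:
        if re.search(regex, path):
            return view_name
    return None

unanchored = [
    (r'^image/upload', 'upload_image'),
    (r'^image/upload-from-url', 'upload_image_from_url'),
]
anchored = [
    (r'^image/upload$', 'upload_image'),
    (r'^image/upload-from-url$', 'upload_image_from_url'),
]

print(resolve('image/upload-from-url', unanchored))  # first pattern wins
print(resolve('image/upload-from-url', anchored))    # $ stops the short one
```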

Django URL Reg-Ex

Hi all,
How does this expression actually work?
urlpatterns = patterns('',
url(r'^get/(?P<app_id>\d+)/$', 'app.views.app'),
...
)
I understand what it does, at least that it maps a URL entered by the user to the app() function in the app's views. I also understand it is a regular expression that ends up taking the id of the app and mapping it to the URL. But where is this function going? What is going on with the r'^...?P /$ (I get that \d+ is a digit regex for the id itself, but that's about it).
I also understand this url function draws from the django.conf.urls module.
Perhaps my misunderstanding is more buried in my lack of regex experience. Nonetheless, I need help! I do not like using things I do not understand, and I am guilty.
Let's take a look: r'^get/(?P<app_id>\d+)/$'
The r'' prefix makes this a raw string: every character between the quotes, backslashes included, is taken literally.
The ^ character matches the beginning of the string. For example, forget/123/ won't match the expression because it doesn't start with get. If the ^ weren't there, it would match, because the pattern would only require get/... to appear somewhere in the string rather than at the start.
The $ character matches the end of the string. If absent, get/123/xd may match the expression, and this is not desired.
(?P<>) is a way to give a name/alias to a group in the expression.
You should read Python's regular expressions documentation. It's well worth knowing regular expressions because they're very useful.
Hope this helps!
r just changes how the following string literal is interpreted. Backslashes (\) are not treated as the start of escape sequences, which means the regex in the string will be used as is.
^ at the beginning and $ at the end match the start and the end of the string respectively.
(?P<name>...) is a named capturing group: it lets you cut out a part of the URL and pass it as a parameter into the view. See more in the Django named groups docs.
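Roughly what happens with the named group, sketched with plain re. The app view and the dispatch helper here are hypothetical stand-ins written for the example; Django's real dispatch does more than this.

```python
import re

pattern = re.compile(r'^get/(?P<app_id>\d+)/$')

def app(request, app_id):
    # Hypothetical view: receives the captured group as a keyword argument
    return 'app {}'.format(app_id)

def dispatch(path):
    # Simplified stand-in for URL dispatch: match, then call the view
    match = pattern.search(path)
    if match is None:
        return None
    return app(request=None, **match.groupdict())

print(dispatch('get/123/'))     # captured value is the string '123'
print(dispatch('forget/123/'))  # rejected by ^
print(dispatch('get/123/xd'))   # rejected by $
```

Note that the captured value arrives as a string, not an integer.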
Hope that helps.

List of Python regular expressions for a newbie?

I recently learned a little Python and I couldn't find a good list of the RegExes (don't know if that is the correct plural...) with complete explanations that even a rookie will understand :)
Does anybody know of such a list?
Vide:
Well, for starters - hit up the python docs on the re module. Good list of features and methods, as well as info about special regex characters such as \w. There's also a chapter in Dive into Python about regular expressions that uses the aforementioned module.
Check out the re module docs for some basic RegEx syntax.
For more, read Introduction To RegEx, or other of the many guides online. (or books!)
You could also try RegEx Buddy, which helps you learn regular expressions by parsing them and telling you what they do.
The Django Book http://www.djangobook.com/en/2.0/chapter03/ chapter on urls/views has a great "newbie" friendly table explaining the gist of regexes. Combine that with the info in the Python docs http://docs.python.org/library/re.html and you'll master RegEx in no time.
an excerpt:
Regular Expressions
Regular expressions (or regexes) are a compact way of specifying patterns in text. While Django URLconfs allow arbitrary regexes for powerful URL matching, you’ll probably only use a few regex symbols in practice. Here’s a selection of common symbols:
Symbol     Matches
. (dot)    Any single character
\d         Any single digit
[A-Z]      Any character between A and Z (uppercase)
[a-z]      Any character between a and z (lowercase)
[A-Za-z]   Any character between a and z (case-insensitive)
+          One or more of the previous expression (e.g., \d+ matches one or more digits)
?          Zero or one of the previous expression (e.g., \d? matches zero or one digits)
*          Zero or more of the previous expression (e.g., \d* matches zero, one or more than one digit)
{1,3}      Between one and three (inclusive) of the previous expression (e.g., \d{1,3} matches one, two or three digits)
But it's turtles all the way down!
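A few of the symbols from that table can be tried out directly, as a sketch. The matches helper is a hypothetical convenience wrapper and the test strings are arbitrary.

```python
import re

def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'.', 'x'))             # any single character
print(matches(r'\d+', '2024'))        # one or more digits
print(matches(r'\d?', ''))            # zero or one digit
print(matches(r'[A-Za-z]+', 'Django'))
print(matches(r'\d{1,3}', '123'))     # between one and three digits
print(matches(r'\d{1,3}', '1234'))    # four digits is one too many
```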
