Do character classes count as groups in regular expressions?

Do character classes count as groups in regular expressions? - python

A small project I got assigned is supposed to extract website URLs from given text. Here's how the most relevant portion of it looks like :
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.' in URL strings it prints. So my first question is, how do I make it exclude any punctuation symbols in the end of the string it detects ?
My second question is referring to the title itself ( finally ), but doesn't really seem to affect this particular program I'm working on : Do character classes ( in this case [a-zA-Z0-9.%+-\/_]+ ) count as groups ( group[3] in this case ) ?
Thanks in advance.

To exclude some symbols at the end of string you can use negative lookbehind. For example, to disallow . ,:
.*(?<![.,])$

answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parenthesis would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding comma and dot: Actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+ the - character has special meaning: everything between these two characters -- by ascii code. so [A-a] is a valid range. It include A-Z, but also a bunch of other characters that aren't A-Z. If you want to include - in the range, then it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work
For comma, I actually don't see it represented in your regex, so I can't comment specifically on that. It shouldn't be allowed anywhere in the url. In general though, you'll just want to add more groups/more conditions.
First, break apart the url into the specifc groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses an alphanumeric character, and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the url to end with, say, a dot, then you could do something [A-Za-z0-9] -- note the lack of a dot here, plus, it's length -- only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
r'(http://|https://)' # Capture the scheme name
r'([a-zA-Z0-9.%+-\\/_])' # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)

To answer the second question first, no a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regexp, which is the same as the previous character class except that it does not include . (other punctuation is now already not included). So the matched pattern cannot end in .. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
str = "... at http://www.google.com/. It says"
m = re.search(webURLregex, str)
if m:
print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the second - does not appear to be intended to define a character range is based on the fact that, if it was, such a range would be from 056-134 (octal) which would include also the alphabetical characters, making the a-zA-Z redundant.

Related

How do I build a tokenizing regex based iterator in python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
print(string[prev.end(): curr.start()]) # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings (these are actually extra matches to string start and string end) in the beginning and end of its list of token outputs so to remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work but it doesn't (and also seems to behave wildly different on other regex engines) for some reason it wouldn't build a single match from the string's start and the delimiters following it, the string start somehow also consumes the character following it! (this is also where I see divergence from other engines, is this a BUG? or does it have something to do with special non corporeal characters and the or (|) operator in python that I'm not aware of?), this solution also did nothing for the double match containing the string's end, once it matched the delimiters and then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems, the split at the beginning doesn't contain string start (but I don't care too much about that anyway since I'm interested in the tokens themselves), it also matches string start when there are no delimiters at the beginning of the string but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place) but try as I might I couldn't get rid of the delimiter + end and then delimiter match problem (which yields an additional empty string), still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice, once with the preceding delimiters and once by itself (2 group 2 matches).
My questions are:
Why do I get such a strange behavior in attempt #1
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
remember that the solution can't change the string and must
produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves (This last part might seem to complicate the answer unnecessarily since otherwise I have a simple answer but if you must know (and if you don't read no further) it's part of a bigger framework I'm building where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns which are used to extract fields from semi structured classifier driven messages)

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]' # newline, hyphen, or space
search_regex = r'''^(?!{0}) # string start with no delimiter
| # or
{0}+ # sequence of delimiters (at least one)
| # or
(?<!{0})$ # string end with no delimiter
'''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
do_something_with(string[previous_end:match.start()])
previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.

It sounds like you're just trying to return a list of all the "words" separated by any number of deliminating chars. You could instead just use regex groups and the negation regex ^ to achieve this:
# match any number of consecutive non-delim chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = '\n\- '
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

What does the regex [^\s]*? mean?

I am starting to learn python spider to download some pictures on the web and I found the code as follows. I know some basic regex.
I knew \.jpg means .jpg and | means or. what's the meaning of [^\s]*? of the first line? I am wondering why using \s?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)

Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
Which is equally illegal in valid URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.

It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, that . would eat up the whole rest of the file because it would be chomping at the bit to match as many things as possible, and since it matches everything, it would go on forever.

Regex in Django for URL with % or &

I have a URL that is either going to be united-states/boulder-21781/tool-&-anchor/mulligan-21/. Assuming the best strategy is to encode the &, the url changes to united-states/boulder-21781/tool-%26-anchor/mulligan-21/
I'm trying to write a url conf that will accept this, but the regex I'm using isn't working. I have:
url(r'^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex'= '(?i)([\.\-\_\w]+)'}, 'view_tip_page', name='tip_page'),
What do I add to capture the %? or should i just include the &?

My first recommendation would be to not do it. As you yourself are demonstrating, not everybody knows that a & is a perfectly valid character in a URI before the first ?, and you are bound to get into trouble. It also looks ugly, is harder to type, and more jarring than, say, and, or even just n. Having said that, if you really want it in there, just put it in there in the character class.
Not related to your question, the way you're building that regex is weird; you're not capturing any of the bits of the path for use by the view. You're also including the (?i) global modifier four times, and specifying _ which is already part of \w. I dunno, I'd expect something like
r'(?i)(?P<country>[.\w-]+)/(?P<city>[.\w-]+)-(?P<cityno>[\d+])/...etc...
but maybe I'm missing something.

Well currently there is no way for you to match % or & in your regex. Depending on whether it is encoded or not, you will need to add one or the other to the character class in your regex, and it should match.
I might change it to something like the following:
r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'}
And proof that it works:
>>> pattern = re.compile(r'(?i)^%(regex)s/%(regex)s-(\d+)/%(regex)s/%(regex)s-(\d+)/$' % {'regex': r'([-.%\w]+)'})
>>> s = 'united-states/boulder-21781/tool-%26-anchor/mulligan-21/'
>>> match = pattern.match(s)
>>> match.groups()
('united-states', 'boulder', '21781', 'tool-%26-anchor', 'mulligan', '21')
A few comments on your regex:
The (?i) isn't really doing anything, since you are using \w which will already match both upper and lowercase. If you do want to use (?i) I would move it out of the replacement string and into the format string ('(?i)...' % {'regex': '...'} instead of '...' % {'regex': '(?i)...'}), since otherwise it will show up multipe times.
Note that character class was changed from [\.\-\_\w] to [-.%\w], this is because underscores are included in \w, you don't need to escape the hyphen if it comes at the beginning of the character class, and you don't need to escape the . inside of character classes.
Also, \w does match digits so technically to match something like 'boulder-21781' you could just use %(regex)s instead of %(regex)s-(\d+), but I didn't want to change that in case it was intentionally adding some additional verification of the format.

beginning and ending sign in regular expression in python

'[A-Za-z0-9-_]*'
'^[A-Za-z0-9-_]*$'
I want to check if a string only contains the sign in the above expression, just want to make sure no more weird sign like #%&/() are in the strings.
I am wondering if there's any difference between these two regular expression? Did the beginning and ending sign matter? Will it affect the result somehow?

Python regular expressions are anchored at the beginning of strings (like in many other languages): hence the ^ sign at the beginning doesn’t make any difference. However, the $ sign does very much make one: if you don’t include it, you’re only going to match the beginning of your string, and the end could contain anything – including the characters you want to exclude. Just try re.match("[a-z0-9]", "abcdef/%&").
In addition to that, you may want to use a regular expression that simply excludes the characters you’re testing for, it’s much safe (hence [^#%&/()] – or maybe you have to do something to escape the parentheses; can’t remember how it works at the moment).

The beginning and end sign match the beginning and end of a String.
The first will match any String that contains zero or more ocurrences of the class [A-Za-z0-9-_] (basically any string whatsoever...).
The second will match an empty String, but not one that contains characters not defined in [A-Za-z0-9-_]

Yes it will. A regex can match anywhere in its input. # will match in your first regex.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.