Roman numerals in Python using "re" module - python

I'm working along on this page and continuing the code to cover the 10's place. My "pattern" is:
>>> pattern = '^M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
If I remove the caret (^) from the front of the "pattern", then strings like 'hat' will find a match:
>>> pattern = 'M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
>>> print re.search(pattern,'hat')
<_sre.SRE_Match object at 0x1004ba360>
but when I leave the caret in the front, then it works fine and 'hat' doesn't find a match. What does the caret do and why does 'hat' find a match?

If you actually print what it's matching, i.e.:
print re.search(pattern,"hat").group()
You'll see nothing, because it's matching the empty string: "". In your regex, every element is optional: each one ends with ?, indicating 0 or 1 of whatever came before it. Without the ^ at the front, your regex will match any string, because the empty string just before the end of the input always satisfies it. It essentially boils down to: pattern = '$', which again matches everything.
The ^ anchors the match to the start of the string. With both ^ and $ in place, the pattern has to account for the whole string, and "hat" contains nothing the pattern can consume, so there is no match; however, if you put "" in lieu of "hat", you will get a match.
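A minimal sketch of the behaviour described above, written for Python 3 (the question itself uses Python 2 print syntax). The names unanchored and anchored are only for illustration and do not appear in the question.

```python
import re

# The asker's pattern without and with the leading caret.
unanchored = r'M?M?M?(CM?|CD?|D?C?C?C?)(XC?|XL?|L?X?X?X?)$'
anchored = '^' + unanchored

# Without ^, the search succeeds with an empty match at the end of 'hat'.
empty_match = re.search(unanchored, 'hat')
print(repr(empty_match.group()), empty_match.span())

# With ^, the whole string must fit the pattern, so 'hat' fails...
print(re.search(anchored, 'hat'))

# ...while the empty string and a real numeral both match.
print(re.search(anchored, '') is not None)
print(re.search(anchored, 'MCMXC').group())
```

Printing span() alongside group() makes it visible that the "successful" match on 'hat' is zero characters wide.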

Related

Use Python 3.4+ Regex to match up to, but not including a # symbol, plus a range of lowercase letters

I would like to use Regex in Python 3.4+ to match a combination of the '#' symbol + the next lowercase letter. There's a bunch of obfuscating data in the strings that's making it tricky for me to do this in one clean line of regex. Here's an example string:
Stack #Overflow is a question and answer website for #professional and enthusiast programmers.
I'd like the regex here to match up to the word '#professional' (because it's lowercase), skipping over the '#Overflow' occurrence (because it's uppercase). After the operation I want to be left with:
professional and enthusiast programmers
or
#professional and enthusiast programmers
I can get it to match up to the first # with ^[^#]*, but I'm not seeing a good way to put a range of chars in there to specify that the following character needs to be lowercase(a-z, etc).
My initial thought was to try ^[^#a-z]*, but this doesn't work.
Any ideas of how to make this work with Python?
You're looking for a "positive lookahead" -- an assertion which consumes no part of the string but checks the characters that follow.
>>> s = 'Stack #Overflow is a question and answer website for #professional and enthusiast programmers.'
>>> re.search('#(?=[a-z])', s)
<re.Match object; span=(53, 54), match='#'>
The (?=...) part is the positive lookahead, asserting that the # is immediately followed by a lowercase character -- notice this matches the second # and not the first. From here you can get the rest of the string:
>>> s[_.end():]
'professional and enthusiast programmers.'
_ here is the result of the last expression evaluated in the REPL (you'd want to assign the match to a variable in your actual code).
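Outside the REPL, the same idea might look like the sketch below. The variable names (match, without_hash, with_hash) are assumptions for illustration; the two slices correspond to the two outputs the question says it would accept.

```python
import re

s = ('Stack #Overflow is a question and answer website '
     'for #professional and enthusiast programmers.')

# Find the first '#' that is immediately followed by a lowercase letter.
match = re.search(r'#(?=[a-z])', s)

if match:
    # Everything after the '#'.
    without_hash = s[match.end():]
    # Everything from the '#' onwards.
    with_hash = s[match.start():]
    print(without_hash)
    print(with_hash)
```

Checking the match before slicing avoids an AttributeError when the string contains no such '#'.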
I think you can use the pattern r'#([a-z])(.*)' with re.search to get the expected result:
import re
line = "Stack #Overflow is a question and answer website for #professional and enthusiast programmers."
# '#' followed by a lowercase letter, then the rest of the line
matchObj = re.search(r'#([a-z])(.*)', line)
if matchObj:
    print("match string : ", matchObj.group())

Python regex: how to achieve this complex replacement rule?

I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.sub treats '.' as the usual regex wildcard instead of understanding that it has to replace the single literal full stop.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub with the same regular expression in place of re.search, not in addition to it.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
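A small sketch of the two points above, under the assumption that Attempt 1 effectively ran re.sub with '.' as its pattern. The sample string text and the variable names are hypothetical.

```python
import re

# Attempt 1 passed '.' as the pattern; unescaped, it matches any character.
s1 = 'a.b.c.d.e.f.g.h'
everything_removed = re.sub('.', '', s1)
# re.escape turns '.' into a pattern for a literal full stop.
only_dots_removed = re.sub(re.escape('.'), '', s1)
print(repr(everything_removed), repr(only_dots_removed))

# re.sub replaces the entire match, not just the parenthesized group.
text = 'xx thingmooother yy'
whole_match_removed = re.sub(r'thing(moo)other', '', text)
print(repr(whole_match_removed))
```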
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets -- this is a context where dot and colon do not have any special meaning [1]), repeated as many times as possible. The lookarounds on both sides say we cannot match these characters when there is whitespace \s on either side. They also exclude the characters themselves, so that there is no way for the regex engine to find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
[1] You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
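As a check, the lookaround pattern from this answer can be run against the five examples listed in the question. This is only a sketch: the names PUNCT_RUN, examples and results are made up for illustration.

```python
import re

# Runs of '.'/':' that are not next to whitespace or to more '.'/':'.
PUNCT_RUN = re.compile(r'(?<![\s.:])[.:]+(?![\s.:])')

# Input -> expected output, copied from the question.
examples = {
    'a.bcd': 'abcd',
    'a..::.:::.:bcde.....:fg': 'abcdefg',
    'a.b.c.d.e.f.g.h': 'abcdefgh',
    'a .b': 'a .b',
    'a..::.:::.:bcde.. ...:fg': 'abcde.. ...:fg',
}

results = {text: PUNCT_RUN.sub('', text) for text in examples}
for text, expected in examples.items():
    print(repr(text), '->', repr(results[text]), results[text] == expected)
```

Note that a run at the very start or end of the string has no neighbour on one side, so the negative lookaround succeeds there and the run is removed; the question does not say whether that is wanted.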

Python regex or | is greedy

>>> import re
>>> p = re.compile('.*&l=(.*)(&|$)')
>>> p.search('foo&l=something here&bleh').group(1)
'something here&bleh' # want to remove strings after &
>>> p.search('foo&l=something here').group(1)
'something here' # this is OK
The Python documentation (2.7) says that the or operator '|' is never greedy, but my code has not been working as expected. I want the regex to stop searching when it reaches the next & instead of going through the entire string.
You need to change the .* inside the first capturing group to [^&]*
p = re.compile('.*&l=([^&]*)')
Your regex p = re.compile('.*&l=(.*)(&|$)') also matches the extra characters because the .* inside the first capturing group is greedy and consumes everything up to the end of the string. $ matches the boundary at the end of the string, so the overall pattern finds a match there.
Since .* followed by $ already succeeds, the engine never needs to backtrack.
Your regex tries to match everything with (.*). When it reaches the end of the string, & cannot match there, but the alternative $ can, so the match succeeds without ever backing up to the earlier &. That's why you are getting that result.
Change your regex to
.*&l=(.*?)(&|$)
Adding the ? will make your regex lazy.
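The three variants discussed so far can be compared side by side. This is a sketch using the question's own inputs; the names greedy, lazy and negated are assumptions.

```python
import re

greedy = re.compile(r'.*&l=(.*)(&|$)')    # original pattern
lazy = re.compile(r'.*&l=(.*?)(&|$)')     # reluctant quantifier
negated = re.compile(r'.*&l=([^&]*)')     # negated character class

with_tail = 'foo&l=something here&bleh'
no_tail = 'foo&l=something here'

for name, pattern in [('greedy', greedy), ('lazy', lazy), ('negated', negated)]:
    print(name,
          repr(pattern.search(with_tail).group(1)),
          repr(pattern.search(no_tail).group(1)))
```

The negated class states the intent ("anything except &") directly and does not depend on backtracking order, which is one reason to prefer it here.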
A simple example that demonstrates the issue:
Let's say you want to match everything until the first % character appears, and let's say you write the following regex:
.*%
Let's see how the engine works given the string "abc%def%g".
The engine first sees .* and tries to consume everything, so it matches the whole string. Then it tries to match % and fails, so it backtracks one character and tries % against g, which still fails. It backtracks again, tries % against the second %, and this time it matches. So you'll get abc%def% as a result.
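The walk-through above can be reproduced directly; the lazy and negated-class variants are added for comparison and are not part of the original example.

```python
import re

s = 'abc%def%g'

# Greedy: .* grabs everything, then backs up to the last '%'.
greedy_result = re.search(r'.*%', s).group()

# Lazy: .*? grows one character at a time and stops at the first '%'.
lazy_result = re.search(r'.*?%', s).group()

# Negated class: cannot run past a '%' at all.
negated_result = re.search(r'[^%]*%', s).group()

print(greedy_result, lazy_result, negated_result)
```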

How to fix: b = re.search('\(.*).\d+\.tld', a)

This is the code:
a = '000.222.tld'
b = re.search('(.*).\d+\.tld', a)
would like to see it print
000
so far..
print b.group(0)
gives me this:
000.222.tld
print b.group(1)
gives me this:
000.2
There are a few problems with your expression:
b = re.match('\(.*)\.\d+\.com', a)
First, that \( means that you're escaping the (—it will only match a literal ( character in the search string. You're not trying to match any parentheses, you're trying to create a capturing group, so don't escape the parens. (Also, you're not escaping the matching ), so you'd get an error about mismatched parens trying to use this…)
Second, you're trying to match .com, but your sample input ends in .tld. Those obviously aren't going to match. Presumably you wanted to match any string of letters, or some other rule?
Finally, you're not using a raw string literal, or escaping your backslashes. Sometimes you get away with this, but do you know the Python backslash-escape rules by heart so well that you can be sure that \d or \. doesn't mean anything? Do you expect anyone who reads your code to also know?
If you fix all of those problems, your regex works:
>>> a = '1.2.tld'
>>> b = re.match(r'(.*)\.\d+\.[A-Za-z]+', a)
>>> b.group(1)
'1'
Now that you've completely changed both the expression and the input, you have completely different problems:
b = re.search('(.*).\d+\.tld', a)
The main problem here, besides again not using a raw string literal, is that you didn't escape the first ., so you're searching for any character there. Since regular expressions are greedy by default, the first .* will capture as much as it can while still leaving room for any character, 1 or more digits, and .tld, so it will match 000.2. But if you escape the ., it will capture as much as it can while still leaving room for a literal ., 1 or more digits, and .tld, which is exactly what you want.
>>> a = '000.222.tld'
>>> b = re.search(r'(.*)\.\d+\.tld', a)
>>> b.group(1)
'000'
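For a side-by-side view of the difference the escaped dot makes, here is a short sketch; the names unescaped and escaped are chosen only for this illustration.

```python
import re

a = '000.222.tld'

# Unescaped '.', so any character may sit between the group and the digits.
unescaped = re.search(r'(.*).\d+\.tld', a)

# Escaped '\.', so only a literal full stop is accepted there.
escaped = re.search(r'(.*)\.\d+\.tld', a)

print(repr(unescaped.group(1)), repr(escaped.group(1)))
```

Both patterns are written as raw strings so that the backslashes reach the regex engine unchanged.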
Meanwhile, there are some great regular expression debuggers, both downloadable and online. I don't want to recommend one in particular, but Debuggex makes it easy to create a sharable link to a particular test, so here is your first one, and here is your second. Check out the examples and see how much easier it is to find the problems with your pattern that way.
You can do it without regex:
b = a.split('.', 1)[0]

Why is the minimal (non-greedy) match affected by the end of string character '$'?

EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression affects the greediness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note that when the "$" is removed, however, the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?
Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)
The non-greedy modifier only affects where the match stops, never where it starts. If you want the a's to start as late as possible, you will have to add a greedy .* to the beginning of your pattern and capture the a+? part in a group.
Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
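To see how a prefix changes where the captured run of a's begins, here is a sketch comparing a greedy .* prefix with a lazy .+? prefix. The capturing group and the variable names are additions for illustration.

```python
import re

s = 'baaaaaaaa'

# No prefix: the match starts at the first 'a' and must reach the end.
plain = re.search(r'a+?$', s)

# Greedy prefix: .* eats as much as it can, leaving one 'a' for the group.
greedy_prefix = re.search(r'.*(a+?)$', s)

# Lazy prefix: .+? takes only the 'b', so the group still has to stretch.
lazy_prefix = re.search(r'.+?(a+?)$', s)

print(plain.span(), greedy_prefix.span(1), lazy_prefix.span(1))
```

Only the greedy prefix pushes the start of the group to the last a; a lazy prefix gives up as early as it can and leaves the a+? to cover the rest.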
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.
The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.
With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.
On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
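The claim that a+ will settle for fewer a's and a+? will consume more when the rest of the pattern demands it can be checked with a sketch like this one; the test strings 'baaaa' and 'aaab' are hypothetical.

```python
import re

s = 'baaaaaaaa'

# With $, greedy and reluctant end up with the same span.
greedy_span = re.search(r'a+$', s).span()
reluctant_span = re.search(r'a+?$', s).span()

# Greedy a+ gives one 'a' back so the trailing literal 'a' can match.
gave_back = re.search(r'(a+)a', 'baaaa').group(1)

# Reluctant a+? keeps extending until the literal 'b' can match.
stretched = re.search(r'(a+?)b', 'aaab').group(1)

print(greedy_span, reluctant_span, gave_back, stretched)
```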
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under the hood, but are required to return the same results a DFA (text-directed) regex would.
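Ordered alternation is easy to observe in Python's re module. In this sketch the patterns and names are chosen only for illustration.

```python
import re

# Alternatives are tried left to right; the first one that works wins,
# even if a later alternative would give a longer match.
short_first = re.search(r'a|ab', 'ab').group()
long_first = re.search(r'ab|a', 'ab').group()

print(short_first, long_first)
```

A leftmost-longest engine would report 'ab' for both patterns, so the first result shows that Python's matcher does not follow that rule.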
Answer to original question:
Why does the first search() span multiple "/"s rather than taking the shortest match?
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'
Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'
Edit: Removed text related to old versions of the question.
Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.
>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
'butshouldseethis'
