Regular expression misses match at beginning of string

Regular expression misses match at beginning of string - python

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounding by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphtongue, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphtongues end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookahead (?<=[{vowels}]|^|{diphtongues}), but regex won't accept different lengths (even removing the diphtongues doesn't work, apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using regex, which allows variable-width lookbehinds, which solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:

I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
See the regex demo
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.
Note that order of alternative matters. If you change to (?:^|a), the match starts at the start of the string, b* matches empty string, ab* grabs the first abb in abbabb, and since there is a right after, you get abb as a match. There is no way to match anything after the first a.

Remember that python "short-circuits", so, if it matches "^", its not going to continue looking to see if it matches "a" too. This will "consume" the matching character, so in cases where it matches "a", "a" is consumed and not available for the next group to match, and because using the (?:) syntax is non-capturing, that "a" is "lost", and not available to be captured by the next grouping (b*(?:a)b*), whereas when "^" is consumed by the first grouping, that first "a" would match in the second grouping.

Related

using Python re to check a string

I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
status=False
uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
correctIDs=[]
for prot in potential_uniprots:
if is_uniprot(prot) == True:
correctIDs.append(prot)
print(correctIDs)

Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets as the brackets dictate that the regex should match all characters in the square brackets. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, comma, or a digit whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1} as the regex will match one by default if no quantifiers are specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = (len(string) == 6 or len(string) == 8)
But you can also force the regex to match certain lengths use \b (word-boundary), which you have already done. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
To this:
return (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
# -OR-
stats = (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
return status
Because uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID) is never going to return anything other than True or False, so it is safe to return that expression.

The Behavior of Alternative Match "|" with .* in a Regex

I seldom use | together with .* before. But today when I use both of them together, I find some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc',''),('','defg')]
The result of the first caputure is all right, but the second capture is beyond my expectation, for I have expected the second capture would be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!

You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
print(m.group(0)) # => abcdefg
print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.

Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.

unexpected re.sub behavior

I defined
s='f(x) has an occ of x but no y'
def italicize_math(line):
p="(\W|^)(x|y|z|f|g|h)(\W|$)"
repl=r"\1<i>\2</i>\3"
return re.sub(p,repl,line)
and made the following call:
print(italicize_math(s)
The result is
'<i>f</i>(x) has an occ of <i>x</i> but no <i>y</i>'
which is not what I expected. I wanted this instead:
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Can anyone tell me why the first occurence of x was not enclosed in inside the "i" tags?

You seem to be trying to match non-alphanumeric characters (\W) when you really want a word boundary (\b):
>>> p=r"(\b)(x|y|z|f|g|h)(\b)"
>>> re.sub(p,repl,s)
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Of course, ( is non alpha-numeric -- The reason your inner content doesn't match is because \W consumes a character in the match. so with a string like 'f(x)', you match the ( when you match f. Since ( was already matched, it won't match again when you try to match x. By contrast, word boundaries don't consume any characters.

Because the group construct is matching the position at the beginning of the string first and x would overlap the previous match. Also, the first and third groups are redundant since they can be replaced by word boundaries; and you can make use of a character class to combine letters.
p = r'\b([fghxyz])\b'
repl = r'<i>\1</i>'

Like previous answer mention, its because the ( char being consume when matching f thus cause subsequent x to fail the match.
beside replace with word boundary \b, you could also use lookahead regex which just do a peek and won't consume anything match inside the lookahead. Since it didn't consume anything, you don't need the \3 either
p=r"(\W|^)(x|y|z|f|g|h)(?=\W|$)"
repl=r"\1<i>\2</i>"
re.sub(p,repl,line)

Python regex: greedy pattern returning multiple empty matches

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:
[^\.?!\r\n]*
Output:
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print matches
['Australians go hard', '', '', '', '']
From the Python documentation:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern is producing an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string, I just don't see how it would doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.
Adding the ^ anchor has this effect:
>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print matches
['Australians go hard']
Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.

The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:
the zero-length string between the end of hard and the first !
the zero-length string between the first and second !
the zero-length string between the second and third !
the zero-length string between the third ! and the end of the text
You can slice/dice this further if you like here.
Adding that ^ anchor to the front now ensures that only a single substring can match the pattern, since the beginning of the input text occurs exactly once.

Get start and stop indexes of overlapping matches?

I need to know start and end indexes of matches from next regular expression:
pat = re.compile("(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")
Example string is s='GATGDTATGDTAAAA'
pat.findall(s) returns desired matches ['ATGDTATGD', 'ATGDTAAAA']. How to extract start and end indexes?
I tried:
iters = pat.finditer(s)
for it in iters:
print it.start()
print it.end()
However, it.end() always coincide with it.start(), because the beginning of my pattern starts fro (?= so it does not consume any string (I need it to capture overlapping matches). Obviously pat.findall extracted desired string, but how to get start and stop indexes?

As #Tomalak said, the regexp engine has no builtin notion of overlapping matches, so there's no "clever" solution to be found (which turned out to be wrong - see below). But it's straightforward to do it with a loop:
import re
pat = re.compile("ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
s = 'GATGDTATGDTAAAA'
i = 0
while True:
m = pat.search(s, i)
if m:
start, end = m.span()
print "match at {}:{} {!r}".format(start, end, m.group())
i = start + 1
else:
break
which displays
match at 1:10 'ATGDTATGD'
match at 6:15 'ATGDTAAAA'
It works by starting the search over again one character beyond the start of the last match, until no more matches are found.
"Clever" or time bomb?
If you want to live dangerously, there's a 2-character change you can make to your original finditer code:
print it.start(1)
print it.end(1)
That is, get the start and end of the first (1) capturing group. By not passing an argument, you were getting the start and end of the match as a whole - but of course a matching assertion always matches an empty string (and so start and end are equal).
I say this is living dangerously because the semantics of a capturing group inside an assertion (whether lookahead or lookbehind, positive or negative, ...) are fuzzy at best. Hard to say whether you may have stumbled into a bug (or implementation accident) here! Cute :-)
EDIT: After a night's sleep and a brief discussion on Python-Dev, I believe this behavior is intentional (& so reliable too). To find all the (possibly overlapping!) matches for a regexp R, wrap it like so:
pat = re.compile("(?=(" + R + "))")
and then
for m in pat.finditer(some_string):
m.group(1) # the matched substring
m.span(1) # the slice indices of the match substring
# etc
works fine.
Best to read (?=(R)) as "match an empty string here, but only if R starts here, and, if that succeeds, put the info about what R matched into group 1". Then finditer() proceeds as it always does when matching an empty string: it moves the start of the search to the next character, and tries again (same as what the by-hand loop in my first answer did).
Using this with findall() is trickier, because if R contains capturing groups too, you'll get all of them (can't pick & choose, as you can do with a match object such as finditer() returns).

There are no overlapping matches in regular expressions.
Either you match something or you don't. Anything you match can only be part of one match/sub-match.
Look-aheads are ephemeral, they don't increase any real counters.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regular expression misses match at beginning of string - python

Related

using Python re to check a string

The Behavior of Alternative Match "|" with .* in a Regex

unexpected re.sub behavior

Python regex: greedy pattern returning multiple empty matches

Get start and stop indexes of overlapping matches?

Categories

Resources