Get start and stop indexes of overlapping matches? - python

I need to know start and end indexes of matches from next regular expression:
pat = re.compile("(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")
Example string is s='GATGDTATGDTAAAA'
pat.findall(s) returns desired matches ['ATGDTATGD', 'ATGDTAAAA']. How to extract start and end indexes?
I tried:
iters = pat.finditer(s)
for it in iters:
print it.start()
print it.end()
However, it.end() always coincide with it.start(), because the beginning of my pattern starts fro (?= so it does not consume any string (I need it to capture overlapping matches). Obviously pat.findall extracted desired string, but how to get start and stop indexes?

As #Tomalak said, the regexp engine has no builtin notion of overlapping matches, so there's no "clever" solution to be found (which turned out to be wrong - see below). But it's straightforward to do it with a loop:
import re
pat = re.compile("ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
s = 'GATGDTATGDTAAAA'
i = 0
while True:
m = pat.search(s, i)
if m:
start, end = m.span()
print "match at {}:{} {!r}".format(start, end, m.group())
i = start + 1
else:
break
which displays
match at 1:10 'ATGDTATGD'
match at 6:15 'ATGDTAAAA'
It works by starting the search over again one character beyond the start of the last match, until no more matches are found.
"Clever" or time bomb?
If you want to live dangerously, there's a 2-character change you can make to your original finditer code:
print it.start(1)
print it.end(1)
That is, get the start and end of the first (1) capturing group. By not passing an argument, you were getting the start and end of the match as a whole - but of course a matching assertion always matches an empty string (and so start and end are equal).
I say this is living dangerously because the semantics of a capturing group inside an assertion (whether lookahead or lookbehind, positive or negative, ...) are fuzzy at best. Hard to say whether you may have stumbled into a bug (or implementation accident) here! Cute :-)
EDIT: After a night's sleep and a brief discussion on Python-Dev, I believe this behavior is intentional (& so reliable too). To find all the (possibly overlapping!) matches for a regexp R, wrap it like so:
pat = re.compile("(?=(" + R + "))")
and then
for m in pat.finditer(some_string):
m.group(1) # the matched substring
m.span(1) # the slice indices of the match substring
# etc
works fine.
Best to read (?=(R)) as "match an empty string here, but only if R starts here, and, if that succeeds, put the info about what R matched into group 1". Then finditer() proceeds as it always does when matching an empty string: it moves the start of the search to the next character, and tries again (same as what the by-hand loop in my first answer did).
Using this with findall() is trickier, because if R contains capturing groups too, you'll get all of them (can't pick & choose, as you can do with a match object such as finditer() returns).

There are no overlapping matches in regular expressions.
Either you match something or you don't. Anything you match can only be part of one match/sub-match.
Look-aheads are ephemeral, they don't increase any real counters.

Related

How can I use regex to match sub-string start by words (not included) to the end of string, and keep non-greedy at same time?

I want to find a sub-string that starts with words (\d月|\d日) (not included in result) and to the end of the string, at the same time, keep the sub-string shortest (non-greedy). for example,
str1 = "秋天9月9日长江工程完成"
res1 = re.search(r'(\d月|\d日).*', str1).group() #return 9月9日长江工程完成
I want to return the result like 长江工程完成,
for another example,
str2 ="秋天9月9日9日长江工程完成"
it should get same results like previous one
thus I tried these several methods, but all return un-expected results, please give me some suggestion...
res1 = re.search(r'(?:(?!\d月|\d日))(?:\d月|\d日)', str1).group() #return 9月
res1 = re.search(r'(?:\d月|\d日)((?:(?!\d月|\d日).)*?)', content).group() #return 9月
If you want to capture the rest of the string, surround .* with a group.
To capture one or more of the same pattern, you can use the + operator.
import re
content = "9月9日9月长江工程完成"
match = re.match(r'(?:\d月|\d日)+(.*)', content)
print(match[1])
Output:
长江工程完成
(?:(?!\d月|\d日))(?:\d月|\d日)
This pattern only captures the initial words, because you don't capture the rest as a group. (Also, it only allows for exactly two occurences).
(?:\d月|\d日)((?:(?!\d月|\d日).)*?)
This pattern requires only matches strings that look like this:
9月4日a6日b0月x - probably not what you need
P.S. Make sure you pick right function from the re: match, search or fullmatch (see What is the difference between re.search and re.match?). You said that you need the whole string needs to start with the given words, so match or fullmatch.

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to now how to capture all the groups that match the pattern, not only the last one.
With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101
I think you need something like this....
b="HELLO,THERE,WORLD"
re.findall('[\w]+',b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']
After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that will match either n words, as long as your n is predetermined. For instance, if I want to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the next sentences, with one, two or three capturing groups.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
regexp_str = "^(#{group_regexp})"
(count - 1).times.each do
regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
end
regexp_str += "$"
return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)
Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD
The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, what can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my #matches = $string =~ /([^,]+)/g;
say "#matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ascii letters[A-Z]+.
I know that my answer came late but it happens to me today and I solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
So the first group (([A-Z]+),)+ will match all the repeated patterns except the final one ([A-Z]+) that will match the final one. and this will be dynamic no matter how many repeated groups in the string.
You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
console.log(match[1]); // do whatever you want with each match
match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.
Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
let regex = /^((,?([A-Z]+))+)$/gm
let m
while ((m = regex.exec(str)) !== null) {
matches.unshift(m[3])
return str.replace(m[2], '')
}
return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.
Design a regex that matches each particular element of the list rather then a list as a whole. Apply it with /g
Iterate throught the matches, cleaning them from any garbage such as list separators that got mixed in. You may require another regex, or you can get by with simple replace substring method.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);
repeat the A-Z pattern in the group for the regular expression.
data="HELLO,THERE,WORLD"
pattern=r"([a-zA-Z]+)"
matches=re.findall(pattern,data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

Regular expression misses match at beginning of string

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounding by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphtongue, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphtongues end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookahead (?<=[{vowels}]|^|{diphtongues}), but regex won't accept different lengths (even removing the diphtongues doesn't work, apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using regex, which allows variable-width lookbehinds, which solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
See the regex demo
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.
Note that order of alternative matters. If you change to (?:^|a), the match starts at the start of the string, b* matches empty string, ab* grabs the first abb in abbabb, and since there is a right after, you get abb as a match. There is no way to match anything after the first a.
Remember that python "short-circuits", so, if it matches "^", its not going to continue looking to see if it matches "a" too. This will "consume" the matching character, so in cases where it matches "a", "a" is consumed and not available for the next group to match, and because using the (?:) syntax is non-capturing, that "a" is "lost", and not available to be captured by the next grouping (b*(?:a)b*), whereas when "^" is consumed by the first grouping, that first "a" would match in the second grouping.

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the current target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(r'\[\?]\s*(.*?)Reduced', s)
if m:
print(m.group(1))
# => I want this string.
See the Python demo.
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved following with index of Reduced. Resulting substring is printed.
Every time this [?]...Reduced pattern is returned, the index is updated to the rest of the string. The search is continued from that index.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this
string.Reduced<insert_randomletters>'
idx = s.find('[?]')
while idx is not -1:
start = idx
end = s.find('Reduced',idx)
print(s[start+3:end].strip())
idx = s.find('[?]',end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.

How to specify the regex string in python

I have the following 2 strings of train station IDs (showing the direction of travel) separated by "-".
String A (strA):
NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27
String B (strB):
NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1
I want to find out which of String A or B contains stations "NS4" followed by "NS1" (answer should be String B).
My current code as follows:
searchStr = ".*NS4-.*NS1(-.*|)"
re.search(searchStr, strA)
re.search(searchStr, strB)
But the result keep returning a match in String A.
May I know how to specify 'searchStr' in order to match only String B?
Two ways to do it: tokenizing and improving the regex.
Tokenizing
tokA = strA.split('-')
tokB = strB.split('-')
print('NS4' in tokA and tokA.index('NS1') > tokA.index('NS4'))
print('NS4' in tokB and tokB.index('NS1') > tokB.index('NS4'))
# False
# True
Regex
import re
pattern = '(^|-)NS4.+NS1(-|$)'
print(re.search(pattern, strA) is not None)
print(re.search(pattern, strB) is not None)
# False
# True
Performance
Tokenization: 2.3072939129997394
Regex: 11.138173280000046
But if you really need performance, I'm sure there are faster ways. Even the tokenization method does multiple passes.
As an alternative to tokenizing, you could use the following expression.
NS4(?=.*?NS1(?!\d))
It literally means:
The characters "NS4" literally.
Followed by any characters, until it finds NS1.
NS1 cannot be followed by a digit.
To educate readers as to what I've used:
(?=) is a Positive Lookahead.
Whatever you place inside this token must be found for the match to be True.
I placed .*? to match anything, as few times as possible using the ? quantifier, followed by NS1 since that is what we want to find.
(?!) is a Negative Lookahead
Whatever you place inside this token, as you might guess, must NOT be found for the match to be True.
I placed a digit in here, so that things like NS10 or NS11 or NS19 are never matched.

Categories

Resources