in regex how to match multiple Or conditions but exclude one condition - python

If I need to match a string "a" with any combination of symbols ##$ before and after it, such as #a#, #a#, $a$ etc, but not a specific pattern #a$. How can I exclude this? Suppose there're too many combinations to manually spell out one-by-one. And it's not negative lookahead or behind cases as seen in other SO answers.
import re
pattern = "[#|#|&]a[#|#|&]"
string = "something#a&others"
re.findall(pattern, string)
Currently the pattern returns results like '#a&' as expected, but also wrongly return on the string to be excluded. The correct pattern should return [] on re.findall(pattern,'#a$')

You can use the character class to list all the possible characters, and use a single negative lookbehind after the match to assert not #a$ directly to the left.
Note that you don't need the | in the character class, as it would match a pipe char and is the same as [#|#&]
[##&$]a[##&$](?<!#a\$)
Regex demo | Python demo
import re
pattern = r"[##&$]a[##&$](?<!#a\$)"
print(re.findall(pattern,'something#a&others#a$'))
Output
['#a&']

I was going to suggest a fairly ugly and complex regex pattern with lookarounds. But instead, you could just proceed with your current pattern and then use a list comprehension to remove the false positive case:
inp = "something#a&others #a$"
matches = re.findall(r'[##&$]+a[##&$]+', inp)
matches = [x for x in matches if x != '#a$']
print(matches) # ['#a&']

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))',string)
groups = match.groups()
alternations = []
for i in range(1, len(groups)):
if (groups[i] != None):
pattern = patterns[i-1]
break
print(pattern)
Result: mno.*pqr
Expressions inside round brackets are capture groups, they correspond to the 1st to last index of the response. The 0th index is the whole match.
Then you would need to find the index which matched. Except your patterns would need to be fined before hand.
Well you could iterate the terms in the regex alternation:
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
if re.search(term, string):
print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question How should I write the regex statement? should actually be:
There is no known way to write the regex statement using the provided regex pattern which will allow to extract from the regex search result the information which of the alternatives have triggered the match.
And as there is no way to do it using the given pattern it is necessary to change the regex pattern which then makes it possible to extract from the match the requested information.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex pattern search and has the disadvantage that there is a chance that it fails for some special search pattern alternatives.
The below provided code allows usage of simpler regex patterns without defining groups and works the "other way around" by checking which of the alternate patterns triggers a match in the found match for the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It might fail in case when in the text found match string fails to be also a match for the single regex pattern what can happen for example if a non-word boundary \B requirement is used in the search pattern ( as mentioned in the comments by Kelly Bundy ).
A not failing alternative solution is to perform the regex search using a modified regex pattern. Below an approach using a dictionary for defining the alternatives and a function returning the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
pattern = '|'.join(dct_alts.values())
re_match = re.match(pattern, text)
return(dct_alts[re_match.lastindex])
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
matches = []
for pattern in lst_alts:
re_match = re.match(pattern, text)
if re_match:
matches.append(pattern)
return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

Python regex: excluding square brackets and the text inside

I am trying to write a regex that excludes square brackets and the text inside them.
My sample text looks like this: 'WordA, WordB, WordC, [WordD]'
I want to match each text item in the string except '[WordD]'. I've tried using a negative lookahead, something like... [A-Z][A-Za-z]+(?!\[[A-Z]+\]) but doing so is still matching the text inside the brackets.
Is negative lookahead the best approach? If so, where am I going wrong?
Rather than a regex, you might consider splitting by commas and then filtering by whether the word starts with [:
output = [word for word in str.split(', ') if word[0] != '[']
If you use a regex, you can match either the beginning of the string, or lookbehind for a space:
re.findall(r'(?:^|(?<= ))[A-Z][A-Za-z]+', str)
Or you could negative lookahead for ] at the end, after a word boundary:
output = re.findall(r'[A-Z][A-Za-z]+\b(?!\])', str)
This can be as simple as
(\w+),
Regex Demo
Retrieve value of Group 1 for desired result.
I'm guessing that maybe you were trying to write some expression similar to:
[A-Z][a-z]*[A-Z](?=,|$)
or,
[A-Z][a-z]+[A-Z](?=,|$)
Test
import re
regex = r"[A-Z][a-z]*[A-Z](?=,|$)"
string = """
WordA, WordB, WordC, [WordD]
WordA, WordB, WordC, [WordD], WordE
"""
print(re.findall(regex, string))
Output
['WordA', 'WordB', 'WordC', 'WordA', 'WordB', 'WordC', 'WordE']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Empty Regex response using finditer and lookahead

I'm having trouble understanding regex behaviour when using lookahead.
I have a given string in which I have two overlapping patterns (starting with M and ending with p). My expected output would be MGMTPRLGLESLLEp and MTPRLGLESLLEp. My python code below results in two empty strings which share a common start with the expected output.
Removal of the lookahead (?=) results in only ONE output string which is the larger one. Is there a way to modify my regex term to prevent empty strings so that I can get both results with one regex term?
import re
string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=M(.*?)p)')
sequences = pattern.finditer(string)
for results in sequences:
print(results.group())
print(results.start())
print(results.end())
The overlapping matches trick with a look-ahead makes use of the fact that the (?=...) pattern matches at an empty location, then pulls out the captured group nested inside the look-ahead.
You need to print out group 1, explicitly:
for results in sequences:
print(results.group(1))
This produces:
GMTPRLGLESLLE
TPRLGLESLLE
You probably want to include the M and p characters in the capturing group:
pattern = re.compile(r'(?=(M.*?p))')
at which point your output becomes:
MGMTPRLGLESLLEp
MTPRLGLESLLEp

Find out till where a regex satisfies a sentence

I have some sentence and a regular expression. Is it possible to find out till where in the regex my sentence satisfies. For example consider my sentence as MMMV and regex as M+V?T*Z+. Now regex till M+V? satisfies the sentences and the remaining part of regex is T*Z+ which should be my output.
My approach right now is to break the regex in individual parts and store that in a list and then match by concatenating first n parts till sentence matches. For example if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in loop first by M+, second by M+V? and so on till complete match is found and then take the remaining list as output. Below is the code
re_exp = ['M+', 'V?', 'T*', 'Z+']
for n in range(len(re_exp)):
re_expression = ''.join(re_exp[:n+1])
if re.match(r'{0}$'.format(re_expression), sentence_language):
return re_exp[n+1:]
Is there a better approach to achieve this may be by using some parsing library etc.
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
def match(pattern, string):
res = pat = ""
for p in re.findall(r"\w[+*?]?", pattern):
m = re.match(p, string)
if m:
g = m.group()
string = string[len(g):]
res, pat = res + g, pat + p
else:
break
return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
>>> print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still have O(n²) for repeatedly slicing the string.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
You can use () to enclose groups in regex. For example: M+V?(T*Z+), the output you want is stored in the first group of the regex.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);

regex match special case

I have a cinematic scenario with a bunch of strings like this:
80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla
And my goal is to match all the numbers up to : in selected sequence (selected is 80101_ in this example, strings #2, #3, #5, #6), matching strings without existing numbers (like 80101_:Blablab, string #4) but without matching the string with _intertitle (string #1).
My current regex looks like this (code in Python):
selection = "80101"; # I'm getting this from elsewhere
pattern = selection + "_" + "\d*";
This matches all the strings with/without numbers but also a string with _intertitle. If I modify my pattern like this "\d[^:]*", it doesn't match _intertitle but also doesn't match the string without numbers... I can't get the right pattern, could anyone please lead me in the right direction? Thanks.
I think you should add "(?=:)" in the and of your pattern:
r"80101_\d*(?=:)"
This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.
You could use a negative lookahead:
80101_\d*(?!intertitle)
That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.
regex101 demo
Your pattern could be written as:
pattern = selection + r"_\d*(?!intertitle)"
You need anchors and multiline flag. Also, you should add the :.* at the end of the regex as well to match the whole string.
^80101_\d*:.*$
See the Demo: https://regex101.com/r/yqGgrv/1
Here is the respective python code as well:
In [1]: s = """80101_intertitle:Blablabla
...: 80101_1:BlablablaBlablabla
...: 80101_2:Blablabla
...: 80101_:BlablablaBlablablaBlablabla
...: 80101_3:BlablablaBlablabla
...: 80101_11:Blablabla
...: 801_1:Blablabla
...: 801_2:Blablabla"""
In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]:
['80101_1:BlablablaBlablabla',
'80101_2:Blablabla',
'80101_:BlablablaBlablablaBlablabla',
'80101_3:BlablablaBlablabla',
'80101_11:Blablabla']
Yes, that is easily done:
import re
s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
print(match)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
print(match)

Categories

Resources