See which component in regex alternation was captured - python

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.

To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention

You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))',string)
groups = match.groups()
alternations = []
for i in range(1, len(groups)):
if (groups[i] != None):
pattern = patterns[i-1]
break
print(pattern)
Result: mno.*pqr
Expressions inside round brackets are capture groups, they correspond to the 1st to last index of the response. The 0th index is the whole match.
Then you would need to find the index which matched. Except your patterns would need to be fined before hand.

Well you could iterate the terms in the regex alternation:
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
if re.search(term, string):
print("MATCH => " + term)
This prints:
MATCH => abc.*def

The right answer to the question How should I write the regex statement? should actually be:
There is no known way to write the regex statement using the provided regex pattern which will allow to extract from the regex search result the information which of the alternatives have triggered the match.
And as there is no way to do it using the given pattern it is necessary to change the regex pattern which then makes it possible to extract from the match the requested information.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex pattern search and has the disadvantage that there is a chance that it fails for some special search pattern alternatives.
The below provided code allows usage of simpler regex patterns without defining groups and works the "other way around" by checking which of the alternate patterns triggers a match in the found match for the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It might fail in case when in the text found match string fails to be also a match for the single regex pattern what can happen for example if a non-word boundary \B requirement is used in the search pattern ( as mentioned in the comments by Kelly Bundy ).
A not failing alternative solution is to perform the regex search using a modified regex pattern. Below an approach using a dictionary for defining the alternatives and a function returning the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
pattern = '|'.join(dct_alts.values())
re_match = re.match(pattern, text)
return(dct_alts[re_match.lastindex])
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
matches = []
for pattern in lst_alts:
re_match = re.match(pattern, text)
if re_match:
matches.append(pattern)
return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

Related

How can I add a string inside a string?

The problem is simple, I'm given a random string and a random pattern and I'm told to get all the posible combinations of that pattern that occur in the string and mark then with [target] and [endtarget] at the beggining and end.
For example:
given the following text: "XuyZB8we4"
and the following pattern: "XYZAB"
The expected output would be: "[target]X[endtarget]uy[target]ZB[endtarget]8we4".
I already got the part that identifies all the words, but I can't find a way of placing the [target] and [endtarget] strings after and before the pattern (called in the code match).
import re
def tagger(text, search):
place_s = "[target]"
place_f = "[endtarget]"
pattern = re.compile(rf"[{search}]+")
matches = pattern.finditer(text)
for match in matches:
print(match)
return test_string
test_string = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
pattern = "XYZAB"
print(tagger(test_string, pattern))
I also tried the for with the sub method, but I couldn't get it to work.
for match in matches:
re.sub(match.group(0), place_s + match.group(0) + place_f, text)
return text
re.sub allows you to pass backreferences to matched groups within your pattern. so you do need to enclose your pattern in parentheses, or create a named group, and then it will replace all matches in the entire string at once with your desired replacements:
In [10]: re.sub(r'([XYZAB]+)', r'[target]\1[endtarget]', test_string)
Out[10]: 'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
With this approach, re.finditer is not not needed at all.

Regular expression with condition

I have been working on the python code to extract document Ids from text documents where IDs can be at the random line in the text using regex.
This document ID is comprised of four letters followed by a hyphen, followed by three numbers and optionally ending in a letter. For example, each of the following is valid document IDs:
ABCD-123
ABCD-123V
XKCD-999
COMP-200
I have tried following regular expression for finding all ids:
re = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z]{0,1})", text.read())
These expressions work correctly but I have a problem when Ids are connected to words like:
XKCD-999James
The regular expression should return XKCD-999 but it is returning XKCD-999J which is incorrect.
What changes should I do in RE to get the correct?
Use a negative lookahead assertion to ignore patterns that have trailing letters:
exp = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z](?![A-Za-z]))?", text.read())
# ^^^^^^^^^^^^^^^^^^^^
As you are using word characters, you can optionally match a char A-Z followed by a word boundary.
\b[A-Z]{4}-[0-9]{3}(?:[A-Z]\b)?
Regex demo
Note that using re.findall will return the captured groups, so if you want to return just the whole match, you can omit the groups.
With the capture groups, the pattern can be:
\b([A-Z]{4})(-)([0-9]{3}(?:[A-Z]\b)?)
Regex demo
How about you use a boundary operation \b ?
[A-Z]{4}-\d{3}(?:[A-Z]\b)?
Regex101 Sample - https://regex101.com/r/DhC5Vd/4
text = "XKCD-999James"
exp = re.findall(r"[A-Z]{4}-\d{3}(?:[A-Z]\b)?", text)
#OUTPUT: ['XKCD-999']

in regex how to match multiple Or conditions but exclude one condition

If I need to match a string "a" with any combination of symbols ##$ before and after it, such as #a#, #a#, $a$ etc, but not a specific pattern #a$. How can I exclude this? Suppose there're too many combinations to manually spell out one-by-one. And it's not negative lookahead or behind cases as seen in other SO answers.
import re
pattern = "[#|#|&]a[#|#|&]"
string = "something#a&others"
re.findall(pattern, string)
Currently the pattern returns results like '#a&' as expected, but also wrongly return on the string to be excluded. The correct pattern should return [] on re.findall(pattern,'#a$')
You can use the character class to list all the possible characters, and use a single negative lookbehind after the match to assert not #a$ directly to the left.
Note that you don't need the | in the character class, as it would match a pipe char and is the same as [#|#&]
[##&$]a[##&$](?<!#a\$)
Regex demo | Python demo
import re
pattern = r"[##&$]a[##&$](?<!#a\$)"
print(re.findall(pattern,'something#a&others#a$'))
Output
['#a&']
I was going to suggest a fairly ugly and complex regex pattern with lookarounds. But instead, you could just proceed with your current pattern and then use a list comprehension to remove the false positive case:
inp = "something#a&others #a$"
matches = re.findall(r'[##&$]+a[##&$]+', inp)
matches = [x for x in matches if x != '#a$']
print(matches) # ['#a&']

non greedy Python regex from end of string

I need to search a string in Python 3 and I'm having troubles implementing a non greedy logic starting from the end.
I try to explain with an example:
Input can be one of the following
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'
I need to find the last part of the string ending with
somestring_otherstring.xml
In all the above cases the regex should return XX1234567890_84481.xml
My best try is:
result = re.search('(_.+)?\.xml$', test1, re.I).group()
print(result)
Here I used:
(_.+)? to match "_anystring" in a non greedy mode
\.xml$ to match ".xml" in the final part of the string
The output I get is not correct:
_x-y-z_XX1234567890_84481.xml
I found some SO questions (link) explaining the regex starts from the left even with non greedy qualifier.
Could anyone explain me how to implement a non greedy regex from the right?
Your pattern (_.+)?\.xml$ captures in an optional group from the first underscore until it can match .xml at the end of the string and it does not take the number of underscores that should be between into account.
To only match the last part you can omit the capturing group. You could use a negated character class and use the anchor $ to assert the end of the line as it is the last part:
[^_]+_[^_]+\.xml$
Regex demo | Python demo
That will match
[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string
For example:
import re
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search('[^_]+_[^_]+\.xml$', test1, re.I)
if result:
print(result.group())
Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:
'[^_]+_[^_]+\.xml$'
The [^_] is a character class matching any character which is not an underscore.
You need to use this regex to capture what you want,
[^_]*_[^_]*\.xml
Demo
Check out this Python code,
import re
arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']
for s in arr:
m = re.search(r'[^_]*_[^_]*\.xml', s)
if (m):
print(m.group(0))
Prints,
XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
The problem in your regex (_.+)?\.xml$ is, (_.+)? part will start matching from the first _ and will match anything until it sees a literal .xml and whole of it is optional too as it is followed by ?. Due to which in string _x-y-z_XX1234567890_84481.xml, it will also match _x-y-z_XX1234567890_84481 which isn't the correct behavior you desired.

Python Regex to Extract Domain from Text

I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say,
"this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website.com']
How can i modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com? I'm missing something pretty basic ...
Try this one (thanks #SunDeep for the update):
\s(?:www.)?(\w+.com)
Explanation
\s matches any whitespace character
(?:www.)? non-capturing group, matches www. 0 or more times
(\w+.com) matches any word character one or more times, followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www.)?(\w+.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
Next, I only look for .com domains, you could adjust my regular expression to something like:
\s(?:www.)?(\w+.(com|org|net))
To match whichever types of domains you were looking for.
Here a try :
import re
s = "www.website1.com"
k = re.findall ( '(www.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
O/P like :
'website1.com'
if it is s = "website1.com" also it will o/p like :
'website1.com'

Categories

Resources