Find out till where a regex satisfies a sentence - python

I have a sentence and a regular expression. Is it possible to find out how much of the regex my sentence satisfies? For example, take the sentence MMMV and the regex M+V?T*Z+. The part of the regex up to M+V? matches the sentence, and the remaining part, T*Z+, should be my output.
My current approach is to break the regex into individual parts, store them in a list, and then match by concatenating the first n parts until the sentence matches. For example, if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in a loop, first against M+, then against M+V?, and so on until a complete match is found, and take the remaining list as output. Below is the code:
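import re

def remaining_parts(re_exp, sentence_language):
    # Try ever longer prefixes of the part list until one matches the whole sentence.
    for n in range(len(re_exp)):
        re_expression = ''.join(re_exp[:n+1])
        if re.match(r'{0}$'.format(re_expression), sentence_language):
            return re_exp[n+1:]

re_exp = ['M+', 'V?', 'T*', 'Z+']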
Is there a better approach to achieve this, maybe by using some parsing library?

Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
import re

def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            # drop the part of the string this pattern part consumed
            string = string[len(g):]
            res, pat = res + g, pat + p
        else:
            break
    return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still have O(n²) for repeatedly slicing the string.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
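A minimal sketch of one way to avoid the repeated slicing, assuming the same restricted \w[+*?]? pattern shape as above. It is not the hand-written matcher mentioned in the previous paragraph; it still uses re, but passes a start position to the compiled pattern's match method instead of cutting the string. The function name match_prefix is made up for this example. It also returns the unmatched remainder of the pattern, which is what the question asked for, and it shares the a?a+b limitation described above.

```python
import re

def match_prefix(pattern, string):
    """Return (matched pattern parts, remaining pattern parts, consumed text)."""
    parts = re.findall(r"\w[+*?]?", pattern)
    pos = 0  # index in string where the next part has to start matching
    for i, part in enumerate(parts):
        # Pattern.match accepts a start position, so no slicing is needed
        m = re.compile(part).match(string, pos)
        if m is None:
            return "".join(parts[:i]), "".join(parts[i:]), string[:pos]
        pos = m.end()
    return "".join(parts), "", string[:pos]

print(match_prefix("M+V?T*Z+", "MMMV"))    # ('M+V?T*', 'Z+', 'MMMV')
print(match_prefix("M+V?T*Z+", "MTTZZZ"))  # ('M+V?T*Z+', '', 'MTTZZZ')
```

Note that T* counts as matched for MMMV because it can match the empty string, which is consistent with the example output above.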

You can use () to enclose groups in regex. For example: M+V?(T*Z+), the output you want is stored in the first group of the regex.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);
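Since the question asks for Python, here is a rough equivalent of the snippet above using the re module. It is only a sketch of the same grouping idea. Note that the group holds the text matched by T*Z+, not the unmatched part of the pattern, and that a sentence such as MMMV does not match at all because Z+ is required.

```python
import re

# Same pattern as the JavaScript snippet: the group wraps the trailing T*Z+
m = re.search(r'M+V?(T*Z+)', 'MMMVTZ')
print(m.group(1))  # TZ

# A sentence that stops before Z+ gives no match, so there is no group to read
print(re.search(r'M+V?(T*Z+)', 'MMMV'))  # None
```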

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
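A variant of the same idea, sketched with named groups and the Match object's lastgroup attribute instead of lastindex. The helper name first_matching_alternative and the group names alt0, alt1, ... are invented for this example. The assumption is that naming the outer groups keeps the lookup simple even when a sub-pattern contains capture groups of its own, since the outer group is the last one to close.

```python
import re

def first_matching_alternative(patterns, text):
    """Return the first sub-pattern of the alternation that matched, or None."""
    # Wrap every sub-pattern in its own named group: (?P<alt0>...)|(?P<alt1>...)
    combined = "|".join(
        "(?P<alt{}>{})".format(i, p) for i, p in enumerate(patterns)
    )
    m = re.search(combined, text)
    if m is None:
        return None
    # lastgroup is the name of the last group that closed, e.g. 'alt1'
    return patterns[int(m.lastgroup[3:])]

patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
print(first_matching_alternative(patterns, 'mno_foobar_pqrt'))  # mno.*pqr
```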
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))', string)
groups = match.groups()
# groups[0] is the outer group; groups[1:] line up with the entries of patterns
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i-1]
        break
print(pattern)
Result: abc.*def
Expressions inside round brackets are capture groups; they correspond to indices 1 through the last index of the match. Index 0 is the whole match.
You then need to find the index of the group that matched. Note that your patterns need to be defined beforehand.
Well you could iterate the terms in the regex alternation:
import re
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
# strip the outer parentheses and split the alternation into its terms
terms = pattern[1:-1].split("|")
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question How should I write the regex statement? should actually be:
With the regex pattern as provided, there is no known way to extract from the search result which of the alternatives triggered the match.
Because this cannot be done with the given pattern, the pattern has to be changed so that the requested information can be extracted from the match.
One workaround for this regex engine limitation is shown below, but it requires an additional regex search and can fail for some special sub-patterns.
The code below keeps the simpler pattern without groups and works the "other way around": it checks which of the alternative patterns matches the text found by the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It can fail when the matched substring, taken on its own, is no longer a match for the individual sub-pattern. This can happen, for example, if a non-word-boundary assertion \B is used in the search pattern (as mentioned in the comments by Kelly Bundy).
A more robust alternative is to perform the regex search with a modified pattern. Below is an approach that uses a dictionary to define the alternatives and a function that returns the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    return dct_alts[re_match.lastindex]
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts:
        re_match = re.match(pattern, text)
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

How can I use regex to match the substring from given start words (not included) to the end of the string, while staying non-greedy?

I want to find the substring that starts after the words (\d月|\d日) (not included in the result) and runs to the end of the string, while keeping the substring as short as possible (non-greedy). For example:
str1 = "秋天9月9日长江工程完成"
res1 = re.search(r'(\d月|\d日).*', str1).group() #return 9月9日长江工程完成
I want to return the result like 长江工程完成,
for another example,
str2 ="秋天9月9日9日长江工程完成"
it should get same results like previous one
So I tried the following methods, but they all return unexpected results. Any suggestions?
res1 = re.search(r'(?:(?!\d月|\d日))(?:\d月|\d日)', str1).group() #return 9月
res1 = re.search(r'(?:\d月|\d日)((?:(?!\d月|\d日).)*?)', content).group() #return 9月
If you want to capture the rest of the string, surround .* with a group.
To capture one or more of the same pattern, you can use the + operator.
import re
content = "9月9日9月长江工程完成"
match = re.match(r'(?:\d月|\d日)+(.*)', content)
print(match[1])
Output:
长江工程完成
(?:(?!\d月|\d日))(?:\d月|\d日)
This pattern only captures the initial words, because you don't capture the rest as a group. (Also, it only allows for exactly two occurrences.)
(?:\d月|\d日)((?:(?!\d月|\d日).)*?)
This pattern only matches strings that look like this:
9月4日a6日b0月x - probably not what you need
P.S. Make sure you pick the right function from re: match, search or fullmatch (see What is the difference between re.search and re.match?). You said that the whole string needs to start with the given words, so use match or fullmatch.
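To illustrate the choice of function, here is a small sketch that runs the pattern from this answer against the two sample strings from the question. Those strings begin with 秋天 rather than with the date words, so under that assumption re.match finds nothing and re.search is the function that returns the wanted tail.

```python
import re

pattern = r'(?:\d月|\d日)+(.*)'
str1 = "秋天9月9日长江工程完成"
str2 = "秋天9月9日9日长江工程完成"

# re.match anchors at the start of the string, where 秋天 does not fit the pattern
print(re.match(pattern, str1))            # None

# re.search scans forward to the first date word and captures what follows the run
print(re.search(pattern, str1).group(1))  # 长江工程完成
print(re.search(pattern, str2).group(1))  # 长江工程完成
```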

in regex how to match multiple Or conditions but exclude one condition

If I need to match a string "a" with any combination of symbols ##$ before and after it, such as #a#, #a#, $a$ etc, but not a specific pattern #a$. How can I exclude this? Suppose there're too many combinations to manually spell out one-by-one. And it's not negative lookahead or behind cases as seen in other SO answers.
import re
pattern = "[#|#|&]a[#|#|&]"
string = "something#a&others"
re.findall(pattern, string)
Currently the pattern returns results like '#a&' as expected, but it also wrongly matches the string that should be excluded. The correct pattern should return [] for re.findall(pattern, '#a$').
You can use the character class to list all the possible characters, and use a single negative lookbehind after the match to assert not #a$ directly to the left.
Note that you don't need the | in the character class, as it would match a pipe char and is the same as [#|#&]
[##&$]a[##&$](?<!#a\$)
import re
pattern = r"[##&$]a[##&$](?<!#a\$)"
print(re.findall(pattern,'something#a&others#a$'))
Output
['#a&']
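The earlier note about | inside a character class can be checked with a short experiment. The sample string and the two variable names below are made up for illustration; the point is only that a | inside [...] is matched as a literal pipe character rather than acting as alternation.

```python
import re

with_pipe = "[#|&]a[#|&]"     # the | is just one more character in the class
without_pipe = "[#&]a[#&]"    # same class without the pipe
sample = "x|a| y#a& z"        # invented test string containing literal pipes

print(re.findall(with_pipe, sample))     # ['|a|', '#a&']
print(re.findall(without_pipe, sample))  # ['#a&']
```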
I was going to suggest a fairly ugly and complex regex pattern with lookarounds. But instead, you could just proceed with your current pattern and then use a list comprehension to remove the false positive case:
import re
inp = "something#a&others #a$"
matches = re.findall(r'[##&$]+a[##&$]+', inp)
matches = [x for x in matches if x != '#a$']
print(matches) # ['#a&']

Regex to extract the string

I need help with regex to get the following out of the string
dal001.caxxxxx.test.com. ---> caxxxxx.test.com
caxxxx.test.com -----> caxxxx.test.com
So basically in the first example, I don't want dal001 or anything that starts with 3 letters and 3 digits and want the rest of the string if it starts with only ca.
In second example I want the whole string that starts only with ca.
So far I have tried (^[a-z]{3}[\d]+\.)?(ca.*) but it doesn't work when the string is
dal001.mycaxxxx.test.com.
Any help would be appreciated.
You can use
^(?:[a-z]{3}\d{3}\.)?(ca.*)
See the regex demo. To make it case insensitive, compile with re.I (re.search(rx, s, re.I), see below).
Details:
^ - start of string
(?:[a-z]{3}\d{3}\.)? - an optional sequence of 3 letters and then 3 digits and a .
(ca.*) - Group 1: ca and the rest of the string.
See the Python demo:
import re
rx = r"^(?:[a-z]{3}\d{3}\.)?(ca.*)"
strs = ["dal001.caxxxxx.test.com","caxxxx.test.com"]
for s in strs:
    m = re.search(rx, s)
    if m:
        print( m.group(1) )
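As a further sketch, the same pattern can be tried on the problem string from the question and on an upper-case variant with and without re.I. The helper name extract and the upper-case input are made up for this example.

```python
import re

rx = r"^(?:[a-z]{3}\d{3}\.)?(ca.*)"

def extract(s, flags=0):
    """Return group 1 of the match, or None when the string does not qualify."""
    m = re.search(rx, s, flags)
    return m.group(1) if m else None

# The string from the question that tripped up the original attempt is rejected
print(extract("dal001.mycaxxxx.test.com."))       # None

# Upper-case input only matches when re.I is passed
print(extract("DAL001.CAxxxxx.test.com"))         # None
print(extract("DAL001.CAxxxxx.test.com", re.I))   # CAxxxxx.test.com
```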
Use re.sub like so:
import re
strs = ['dal001.caxxxxx.test.com', 'caxxxx.test.com']
for s in strs:
    s = re.sub(r'^[A-Za-z]{3}\d{3}[.]', '', s)
    print(s)
# caxxxxx.test.com
# caxxxx.test.com
If you are using re:
import re
my_strings = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com']
my_regex = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
compiled_regex = re.compile(my_regex)
for a_string in my_strings:
    if compiled_regex.match(a_string):
        # keep only group 1, i.e. the part starting with ca
        print(compiled_regex.sub(r'\1', a_string))
my_regex matches a string that starts (^ anchors to the start of the string) with [3 letters][3 digits][a .], but only optionally, and using a non-capturing group (the (?:) will not get a numbered reference to use in sub). In either case, it must then contain ca followed by anything, and this part is used as the replacement in the call to re.sub. re.compile is used to make it a bit faster, in case you have many strings to match.
Note on re.compile:
Some answers don't pre-compile the regex before the loop. They trade one line of code for a pattern lookup on every iteration: the module-level functions such as re.sub have to find the compiled pattern again on each call (Python's re module caches recently compiled patterns, so this is normally a cache lookup rather than a full recompilation). If you use a regex in a loop body, compiling it first is a cheap habit: it can speed up the loop, and it adds no meaningful cost even when the number of iterations is small. To judge for yourself, time the compiled and non-compiled versions of the same loop with the same regex for different numbers of iterations and trials.
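One way to run such a comparison yourself is sketched below with the standard timeit module. The list size and repeat count are arbitrary choices for illustration, and the function names are made up. No timings are quoted here because they depend on the machine and the Python version.

```python
import re
import timeit

PATTERN = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
# Arbitrary workload: the two sample strings repeated many times
STRINGS = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com'] * 500

def with_module_function():
    # re.sub looks the pattern up again on every call
    return [re.sub(PATTERN, r'\1', s) for s in STRINGS]

def with_compiled_pattern():
    # compile once, then reuse the pattern object inside the loop
    compiled = re.compile(PATTERN)
    return [compiled.sub(r'\1', s) for s in STRINGS]

t_module = timeit.timeit(with_module_function, number=20)
t_compiled = timeit.timeit(with_compiled_pattern, number=20)
print(f"module-level re.sub:  {t_module:.4f} s")
print(f"pre-compiled pattern: {t_compiled:.4f} s")
```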

Replace particular strings in python

I need to replace all occurrences of "W32 L30" with "W32in L30in" in a large corpus of text. The numbers after W, L also vary.
I thought of using this regex expressions
[W]([-+]?\d*\.\d+|\d+)
[L]([-+]?\d*\.\d+|\d+)
But these would only find the number after each W and L, so it's still laborious and very time consuming to replace every occurrence so I was wondering if there's a way to do this directly in regex.
You can use a capture group and simplify the regex, then use a backreference to do the replacement, like this:
import re
RGX = re.compile(r'([WL]([-+]?\d*\.\d+|\d+))(in)?')
result = RGX.sub(r'\1in', some_string)
The \1 is used to reference the first capture group: the result of the string we capture with [WL]([-+]?\d*\.\d+|\d+). The last part (in)? optionally also matches the word in, such that in case there is already an in, we simply replace it with the same value.
So if some_string is for instance:
>>> some_string
'A W2 in C3.15 where L2.4in and a bit A4'
>>> RGX.sub(r'\1in', some_string)
'A W2in in C3.15 where L2.4in and a bit A4'
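Applied to the kind of input given in the question, the same substitution can be wrapped in a small function. This is a sketch; the name add_inches and the sample text are made up. Because of the optional (in)? group, running it a second time leaves already converted text unchanged.

```python
import re

RGX = re.compile(r'([WL]([-+]?\d*\.\d+|\d+))(in)?')

def add_inches(text):
    """Append 'in' to every W<number> / L<number>, keeping an existing 'in'."""
    return RGX.sub(r'\1in', text)

once = add_inches("W32 L30 and W12.5 L7")
print(once)              # W32in L30in and W12.5in L7in
print(add_inches(once))  # W32in L30in and W12.5in L7in
```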
