The problem is simple: I'm given a random string and a random pattern, and I'm told to find all the possible combinations of that pattern's characters that occur in the string and mark them with [target] at the beginning and [endtarget] at the end.
For example:
given the following text: "XuyZB8we4"
and the following pattern: "XYZAB"
The expected output would be: "[target]X[endtarget]uy[target]ZB[endtarget]8we4".
I already have the part that identifies all the matches, but I can't find a way to place the [target] and [endtarget] strings before and after each match (called match in the code).
import re
def tagger(text, search):
    place_s = "[target]"
    place_f = "[endtarget]"
    pattern = re.compile(rf"[{search}]+")
    matches = pattern.finditer(text)
    for match in matches:
        print(match)
    return test_string
test_string = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
pattern = "XYZAB"
print(tagger(test_string, pattern))
I also tried calling the sub method inside the for loop, but I couldn't get it to work.
    for match in matches:
        re.sub(match.group(0), place_s + match.group(0) + place_f, text)
    return text
re.sub lets you use backreferences to matched groups in the replacement string. Enclose your pattern in parentheses (or create a named group), and re.sub will replace every match in the entire string at once with your desired replacement:
In [10]: re.sub(r'([XYZAB]+)', r'[target]\1[endtarget]', test_string)
Out[10]: 'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
With this approach, re.finditer is not needed at all.
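As a minimal sketch, the one-liner above could be wrapped back into the tagger function from the question. Building the character class with re.escape and using a replacement function are my own additions (assumptions, not part of the original answer); they keep characters such as ], ^ or - in search, and backslashes in the tags, from being interpreted specially. It assumes search is non-empty.

```python
import re

def tagger(text, search, start="[target]", end="[endtarget]"):
    # Escape the characters so that ']', '^', '-' or '\' in `search`
    # cannot change the meaning of the character class.
    pattern = re.compile("[" + re.escape(search) + "]+")
    # m.group(0) is the whole match, so no capture group is needed.
    return pattern.sub(lambda m: start + m.group(0) + end, text)

print(tagger("XuyZB8we4", "XYZAB"))
```

For the example in the question this should print the expected "[target]X[endtarget]uy[target]ZB[endtarget]8we4".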
In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
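One caveat with lastindex - 1 is that it assumes the sub-patterns contain no capture groups of their own, since those would shift the numbering. A possible variant, sketched here under the assumption that the sub-patterns do not already use named groups or numbered backreferences, wraps each alternative in a named group and reads Match.lastgroup instead. The group names alt0, alt1, ... and the function name are hypothetical.

```python
import re

def which_alternative(patterns, string):
    # Wrap each alternative in a named group: alt0, alt1, ... (hypothetical names).
    combined = '|'.join(
        '(?P<alt{}>{})'.format(i, p) for i, p in enumerate(patterns)
    )
    match = re.search(combined, string)
    if match is None:
        return None
    # lastgroup is the name of the last group to close, which is the
    # outer named group of whichever alternative matched.
    index = int(match.lastgroup[3:])
    return patterns[index]

print(which_alternative([r'abc.*def', r'mno.*pqr', r'mno.*pqrt'], 'mno_foobar_pqrt'))
print(which_alternative([r'a(b)c', r'x(y)z'], 'xyz'))
```

The second call uses sub-patterns with inner groups, which is the case where the index arithmetic on lastindex would need adjusting.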
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))', string)
groups = match.groups()
# groups[0] is the outer group; groups[1:] line up with the alternatives
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i-1]
        break
print(pattern)
Result: abc.*def
Expressions inside parentheses are capture groups; they are numbered from 1 up to the last group in the match. Group 0 is the whole match.
You then need to find the index of the group that matched. However, your patterns need to be defined beforehand.
Well you could iterate the terms in the regex alternation:
import re
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
# Strip the outer parentheses and split the alternation into its terms
terms = pattern[1:-1].split("|")
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question "How should I write the regex statement?" should actually be:
There is no known way to write the regex statement, using the provided pattern, that lets you extract from the search result which of the alternatives triggered the match.
Since this cannot be done with the given pattern, the pattern has to be changed so that the requested information can be extracted from the match.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex search and can fail for some special alternatives.
The code below allows simpler regex patterns without defining groups and works the other way around: it checks which of the alternatives matches the string found by the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It can fail when the matched substring, taken on its own, no longer matches the individual alternative. This can happen, for example, when the pattern uses a non-word-boundary assertion \B (as mentioned in the comments by Kelly Bundy).
A more robust alternative is to perform the regex search with a modified pattern. Below is an approach that uses a dictionary to define the alternatives and a function that returns the matched group:
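To make the failure case concrete, here is a small sketch with a made-up pattern and text (both are my own illustration, not from the original answer). The list comprehension replaces next(...) so that the empty result is visible instead of raising StopIteration.

```python
import re

pattern = r'foo\B|bar'
text = 'foobar'

# The full pattern matches 'foo' because 'b' follows it (no word boundary).
matched = re.match(pattern, text)[0]

# Re-checking each alternative against 'foo' alone fails: at the end of
# the string there is a word boundary, so foo\B no longer matches.
candidates = [p for p in pattern.split('|') if re.match(p, matched)]
print(matched, candidates)
```

Here the second check finds no alternative at all, even though the original search succeeded.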
import re
# The dictionary key is the index of the matching group in the found match.
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    return dct_alts[re_match.lastindex]
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness, here is a function returning a list of all the alternatives that match (not only the first one):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts:
        re_match = re.match(pattern, text)
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']
How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured; \g<1> refers to the first group value and \2 refers to the second group value in the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues, since there is a digit right after the backreference.
Since replace_with contains no backslashes, there is no need to preprocess (escape) anything in it.
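If the replacement text could contain backslashes (for instance because it comes from user input), one option is to pass a function as the replacement, because the string a function returns is inserted as-is. The sketch below is only an illustration; the helper name replace_between and its parameters are hypothetical.

```python
import re

def replace_between(text, before, after, replacement):
    # before/after are treated as literal text, hence re.escape.
    pattern = re.compile(
        '(' + re.escape(before) + ').*?(' + re.escape(after) + ')',
        flags=re.DOTALL,
    )
    # The return value of a replacement function is inserted verbatim, so
    # backslashes or sequences like \1 in `replacement` are not interpreted.
    return pattern.sub(lambda m: m.group(1) + replacement + m.group(2), text)

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
print(replace_between(l, 'page1/', '_type-A', '222.6'))
```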
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by what the assertions require, but the assertions themselves are not consumed. This means that re.sub looks for a string with page1/ in front of it and _type behind it, but only replaces the part in between.
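To see that the assertions are not consumed, it may help to look at the match object directly. This is just an illustrative check using the same string and pattern as above.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
m = re.search(r"(?<=page1/).*?(?=_type)", l)

print(repr(m.group(0)))  # only the text between the two assertions
print(l[:m.start()])     # everything before the match, still ending in page1/
print(l[m.end():])       # everything after the match, still starting with _type
```

Because page1/ and _type lie outside the match span, re.sub leaves them untouched.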
I want to use a regular expression to detect and substitute some phrases. These phrases follow the same pattern but deviate at some points. All the phrases are in the same string.
For instance I have this string:
/this/is//an example of what I want /to///do
I want to catch all the words inside and including the // and substitute them with "".
To solve this, I used the following code:
import re
txt = "/this/is//an example of what i want /to///do"
re.search("/.*/",txt1, re.VERBOSE)
pattern1 = r"/.*?/\w+"
a = re.sub(pattern1,"",txt)
The result is:
' example of what i want '
which is what I want, that is, to substitute the phrases within // with "". But when I run the same pattern on the following sentence
"/this/is//an example of what i want to /do"
I get
' example of what i want to /do'
How can I use one regex and remove all the phrases and //, irrespective of the number of // in a phrase?
In your example code, you can omit the line re.search("/.*/",txt1, re.VERBOSE), as it executes the search but you are not doing anything with the result.
You can match 1 or more / followed by word chars:
/+\w+
Or a slightly broader match: one or more / followed by any chars other than / or whitespace chars:
/+[^\s/]+
/+ Match 1+ occurrences of /
[^\s/]+ Match 1+ occurrences of any char except a whitespace char or /
import re
strings = [
    "/this/is//an example of what I want /to///do",
    "/this/is//an example of what i want to /do"
]
for txt in strings:
    pattern1 = r"/+[^\s/]+"
    a = re.sub(pattern1, "", txt)
    print(a)
Output
example of what I want
example of what i want to
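Note that removing the phrases leaves the surrounding spaces behind. If that matters, a small follow-up step can normalize the whitespace. This is a sketch only; the function name strip_slash_phrases is made up for the example.

```python
import re

def strip_slash_phrases(txt):
    without_phrases = re.sub(r"/+[^\s/]+", "", txt)
    # split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so join() leaves single spaces only.
    return " ".join(without_phrases.split())

print(strip_slash_phrases("/this/is//an example of what I want /to///do"))
print(strip_slash_phrases("/this/is//an example of what i want to /do"))
```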
You can use
/(?:[^/\s]*/)*\w+
Details:
/ - a slash
(?:[^/\s]*/)* - zero or more repetitions of: zero or more chars other than a slash and whitespace, followed by a slash
\w+ - one or more word chars.
See the Python demo:
import re
rx = re.compile(r"/(?:[^/\s]*/)*\w+")
texts = ["/this/is//an example of what I want /to///do", "/this/is//an example of what i want to /do"]
for text in texts:
    print(rx.sub('', text).strip())
# => example of what I want
#    example of what i want to
I have a cinematic scenario with a bunch of strings like this:
80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla
And my goal is to match all the numbers up to : in selected sequence (selected is 80101_ in this example, strings #2, #3, #5, #6), matching strings without existing numbers (like 80101_:Blablab, string #4) but without matching the string with _intertitle (string #1).
My current regex looks like this (code in Python):
selection = "80101"  # I'm getting this from elsewhere
pattern = selection + "_" + r"\d*"
This matches all the strings with/without numbers but also a string with _intertitle. If I modify my pattern like this "\d[^:]*", it doesn't match _intertitle but also doesn't match the string without numbers... I can't get the right pattern, could anyone please lead me in the right direction? Thanks.
I think you should add "(?=:)" at the end of your pattern:
r"80101_\d*(?=:)"
This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.
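Put together with the variable selection from the question, this could look like the sketch below. Wrapping selection in re.escape is my own precaution (an assumption, in case it ever contains regex metacharacters); the sample data is the list from the question.

```python
import re

lines = """80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla"""

selection = "80101"  # in the question this comes from elsewhere
# re.escape guards against regex metacharacters in the selection.
pattern = re.escape(selection) + r"_\d*(?=:)"

found = re.findall(pattern, lines)
print(found)
# ['80101_1', '80101_2', '80101_', '80101_3', '80101_11']
```

The pattern is not anchored, so it would also match inside a longer prefix such as 180101_1:; adding ^ with re.MULTILINE would rule that out if needed.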
You could use a negative lookahead:
80101_\d*(?!intertitle)
That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.
Your pattern could be written as:
pattern = selection + r"_\d*(?!intertitle)"
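One thing to keep in mind with this variant is that it only excludes the literal text intertitle; any other non-digit suffix still produces a match on the prefix. A quick sketch to illustrate (the 80101_credits line and the helper first_match are made up for the example):

```python
import re

selection = "80101"
pattern = re.escape(selection) + r"_\d*(?!intertitle)"

def first_match(s):
    # Return the matched text, or None when the pattern does not match.
    m = re.search(pattern, s)
    return m.group(0) if m else None

samples = ["80101_intertitle:Bla", "80101_7:Bla", "80101_:Bla", "80101_credits:Bla"]
results = [first_match(s) for s in samples]
print(results)
```

The last sample still matches 80101_, which may or may not be what you want, depending on which other suffixes can occur in the scenario.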
You need anchors and the multiline flag. Also, you should add :.* at the end of the regex to match the whole line.
^80101_\d*:.*$
See the Demo: https://regex101.com/r/yqGgrv/1
Here is the respective python code as well:
In [1]: s = """80101_intertitle:Blablabla
...: 80101_1:BlablablaBlablabla
...: 80101_2:Blablabla
...: 80101_:BlablablaBlablablaBlablabla
...: 80101_3:BlablablaBlablabla
...: 80101_11:Blablabla
...: 801_1:Blablabla
...: 801_2:Blablabla"""
In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]:
['80101_1:BlablablaBlablabla',
'80101_2:Blablabla',
'80101_:BlablablaBlablablaBlablabla',
'80101_3:BlablablaBlablabla',
'80101_11:Blablabla']
Yes, that is easily done:
import re
s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
    print(match)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
    print(match)
I'm trying to match an entire word exactly using regex, where the word I'm searching for is a variable value coming from user input. I've tried this:
regex = r"\b(?=\w)" + re.escape(user_input) + r"\b"
if re.match(regex, string_to_search[i], re.IGNORECASE):
    <some code>...
but it matches every occurrence of the string. It matches "var"->"var" which is correct but also matches "var"->"var"iable and I only want it to match "var"->"var" or "string"->"string"
Input: "sword"
String_to_search = "There once was a swordsmith that made a sword"
Desired output: Match "sword" to "sword" and not "swordsmith"
It seems you want to use a pattern that matches an entire string. Note that the \b word boundary is needed when you want to find partial matches. When you need a full string match, you need anchors. Since re.match anchors the match at the start of the string, all you need is $ (end of string position) at the end of the pattern:
regex = '{}$'.format(re.escape(user_input))
and then use
re.match(regex, search_string, re.IGNORECASE)
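An alternative sketch for the full-string case is re.fullmatch, which requires the whole string to match without adding any anchor yourself. Unlike a trailing $, it also rejects a candidate that ends with a newline. The helper name is_exact_match is hypothetical.

```python
import re

def is_exact_match(user_input, candidate):
    # fullmatch succeeds only if the whole candidate matches the pattern.
    return re.fullmatch(re.escape(user_input), candidate, re.IGNORECASE) is not None

print(is_exact_match("var", "var"))
print(is_exact_match("var", "variable"))
```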
You can try re.finditer like this:
>>> import re
>>> user_input = "var"
>>> text = "var variable var variable"
>>> regex = r"(?=\b%s\b)" % re.escape(user_input)
>>> [m.start() for m in re.finditer(regex, text)]
[0, 13]
It'll find all matches iteratively.
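Applied to the example sentence from the question, a sketch could look like this. It assumes the user input starts and ends with a word character, since \b behaves differently next to punctuation; the function name whole_word_positions is made up.

```python
import re

def whole_word_positions(word, text):
    # \b only works as intended when `word` starts and ends with a word character.
    regex = r"\b" + re.escape(word) + r"\b"
    return [m.start() for m in re.finditer(regex, text, re.IGNORECASE)]

sentence = "There once was a swordsmith that made a sword"
positions = whole_word_positions("sword", sentence)
print(positions)
```

Only the standalone sword at the end of the sentence is reported; the sword inside swordsmith is skipped because no word boundary follows it.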