Regex - Replacing a matching subgroup - python

The following code finds, in a string, the names of regex-like groups to be replaced. I would like to use it to rename name_1, name_2 and not_escaped to test_name_1, test_name_2 and test_not_escaped respectively. In each match m, the name is m.group(2). How can I do that?
import re
p = re.compile(r"(?<!\\)(\\\\)*\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
for m in p.finditer(text):
    print(
        '---',
        m.group(),
        m.group(2),
        sep='\n'
    )
This gives the following output.
---
\g<name_1>
name_1
---
\g<name_2>
name_2
---
\\\\\g<not_escaped>
not_escaped

You need to reproduce the whole text of group 0, using \<digit> back-references to reuse the captured groups:
p.sub(r'\1\\g<test_\2>', text)
Here \1 refers to the initial backslashes group, and \2 to the name to be prefixed by test_.
For this to work, move the * inside the first capturing group so that the group always participates in the match, even when it matches no backslashes:
p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
I've used a non-capturing group ((?:...)) to still keep the backslashes grouped together.
Demo:
>>> text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
>>> p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
>>> print(p.sub(r'\1\\g<test_\2>', text))
</\g<test_name_1>\g<test_name_2>\\\\\g<test_not_escaped>\\g<escaped>>>
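The same substitution can also be written with a function as the replacement instead of a template string. The sketch below is one possible variant and not part of the answer above; the helper name add_prefix is made up for illustration.

```python
import re

# Same pattern as in the demo: group 1 holds the pairs of escaped
# backslashes, group 2 holds the group name.
p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"

def add_prefix(m):
    # Rebuild the whole match: leading backslashes, then \g<test_NAME>.
    # A string returned by a replacement function is used literally.
    return m.group(1) + "\\g<test_" + m.group(2) + ">"

result = p.sub(add_prefix, text)
print(result)
```

A function is convenient when the new name needs more logic than a fixed prefix, since it has access to the full match object.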

A simpler way is to keep the regex for finding the names and do the replacement itself with str.replace, one call per name found:
import re
p = re.compile(r"(?<!\\)(\\\\)*\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
for m in p.finditer(text):
    name = m.group(2)
    # Note: str.replace rewrites every occurrence of the name in text.
    text = text.replace(name, 'test_' + name)

Related

Replace a character enclosed with lowercase letters

All the examples I've found on stack overflow are too complicated for me to reverse engineer.
Consider this toy example
s = "asdfasd a_b dsfd"
I want s = "asdfasd a'b dsfd"
That is: find two characters separated by an underscore and replace that underscore with an apostrophe
Attempt:
re.sub("[a-z](_)[a-z]","'",s)
# "asdfasd ' dsfd"
I thought the () were supposed to solve this problem?
Even more confusing is the fact that it seems that we successfully found the character we want to replace:
re.findall("[a-z](_)[a-z]",s)
#['_']
why doesn't this get replaced?
Thanks
Use look-ahead and look-behind patterns:
re.sub("(?<=[a-z])_(?=[a-z])","'",s)
Look-ahead and look-behind patterns are zero-width: they check the surrounding text without making it part of the match, so it is not replaced.
UPD:
The problem was that re.sub replaces the whole matched expression, including the preceding and the following letter.
re.findall was still matching the whole expression, but it returned only the group (the parentheses inside), which is what you observed. The whole match was still a_b.
Look-ahead/look-behind expressions check that the match is preceded/followed by a pattern, but do not include that pattern in the match.
Another option is to create several groups and put those groups into the replacement: re.sub("([a-z])_([a-z])", r"\1'\2", s)
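To make the difference between the whole match and the captured group concrete, here is a small sketch using the string from the question. The variable names are only for illustration.

```python
import re

s = "asdfasd a_b dsfd"

# The group captures only "_", but the match itself spans "a_b".
m = re.search(r"[a-z](_)[a-z]", s)
whole, group = m.group(0), m.group(1)

# Two ways to replace just the underscore.
with_lookaround = re.sub(r"(?<=[a-z])_(?=[a-z])", "'", s)
with_groups = re.sub(r"([a-z])_([a-z])", r"\1'\2", s)

print(whole, group)
print(with_lookaround)
print(with_groups)
```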
When using re.sub, capture the text you want to keep and put it back in the replacement; leave the text you want to remove uncaptured.
Use
re.sub(r"([a-z])_(?=[a-z])",r"\1'",s)
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [a-z]                    any character of: 'a' to 'z'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  _                        '_'
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [a-z]                    any character of: 'a' to 'z'
--------------------------------------------------------------------------------
  )                        end of look-ahead
Python code:
import re
s = "asdfasd a_b dsfd"
print(re.sub(r"([a-z])_(?=[a-z])",r"\1'",s))
Output:
asdfasd a'b dsfd
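One practical difference between capturing the following letter and only looking ahead at it shows up with adjacent underscores. The input "a_b_c" below is an assumed example that does not appear in the question.

```python
import re

s = "a_b_c"

# Capturing both letters consumes "b", so the second underscore
# no longer has a letter available in front of it.
two_groups = re.sub(r"([a-z])_([a-z])", r"\1'\2", s)

# A look-ahead leaves "b" unconsumed, so it can start the next match.
lookahead = re.sub(r"([a-z])_(?=[a-z])", r"\1'", s)

print(two_groups)
print(lookahead)
```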
re.sub replaces everything it matched.
There is a more general way to solve this that does not require modifying your regular expression.
Code below:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
    _reg = "|".join(match.groups())
    return re.sub(_reg, r, match.group(0)) if _reg else r
#
re.sub(reg, repl, s)
Output: 'Data: year=am_replace_str, monthday=am_replace_str, month=am_replace_str, some other text'
Of course, if your pattern always contains groups, the code can be simplified to:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
    _reg = "|".join(match.groups())
    return re.sub(_reg, r, match.group(0))
#
re.sub(reg,repl, s)
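The approach above searches the matched text again for the group values. A possible alternative, sketched here under the assumption that the groups are not nested and appear left to right, is to splice the replacement in at each group's span. The function name replace_groups is hypothetical.

```python
import re

def replace_groups(pattern, replacement, text):
    """Replace only the captured groups of each match, keeping the rest."""
    def repl(match):
        whole = match.group(0)
        offset = match.start()
        pieces = []
        last = 0
        # Assumes groups are not nested and occur left to right.
        for i in range(1, match.re.groups + 1):
            start, end = match.span(i)
            if start == -1:  # this group did not take part in the match
                continue
            pieces.append(whole[last:start - offset])
            pieces.append(replacement)
            last = end - offset
        pieces.append(whole[last:])
        return "".join(pieces)
    return re.sub(pattern, repl, text)

s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
result = replace_groups(reg, "am_replace_str", s)
print(result)
```

Unlike the code above, this sketch leaves a match unchanged when the pattern has no groups, and it does not treat the captured values as regular expressions.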

Python Regex - find all occurences of a group after a prefix

I have strings like these:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
I would like to get the elements separated by space after the H inside groups, so I tried:
myRegex = r'\bH\s*[\s|\:]+(?:\s?(\b\d+[\.?\d+]*\b))*'
It's fine with string s1
print(re.search(myRegex , s1).groups())
It gives me ('1234.34.34',), which is fine.
But for s2, I have:
print(re.search(myRegex , s2).groups())
It returns only the last group ('123.5',), but I expect ('1234.34.34', '12.12', '123.5').
Do you have an idea how to get my expected value?
In addition, I'm not limited to 2 groups; I may have many more.
Thanks a lot
Fred
In your pattern, the part (?:\s?(\b\d+[\.?\d+]*\b))* has a capturing group inside a repeated non-capturing group. A repeated capturing group keeps only the value from the last iteration of the outer group.
The last iteration matches 123.5, so that is the value of group 1.
One option is to match the whole run of numbers with a single capturing group.
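The "last iteration wins" behaviour can be seen on a smaller input. The pattern and string below are simplified stand-ins for the ones in the question.

```python
import re

data = "10 20 30"

# The group sits inside a repeated group, so only the value from the
# final repetition is kept.
m = re.search(r"(?:(\d+)\s?)+", data)
last_only = m.groups()

# Matching one item at a time with findall keeps every value.
all_values = re.findall(r"\d+", data)

print(last_only)
print(all_values)
```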
\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b
If you have the group, you could use split:
import re
s2 = 'H: 1234.34.34 12.12 123.5'
myRegex = r'\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b'
res = re.search(myRegex, s2)
if res:
    print(res.group(1).split())
Output
['1234.34.34', '12.12', '123.5']
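A variation on the same idea, using only the standard re module: capture everything after the prefix once, then run findall on that part. This is a sketch; the function name values_after_h is made up, and unlike the pattern above it also accepts numbers without a dot.

```python
import re

def values_after_h(line):
    """Return the numbers that follow 'H:' in line, or [] if there is no prefix."""
    m = re.search(r"\bH:\s*(.*)", line)
    if m is None:
        return []
    # One findall match per number, so every value is kept.
    return re.findall(r"\d+(?:\.\d+)*", m.group(1))

print(values_after_h('H: 1234.34.34'))
print(values_after_h('H: 1234.34.34 12.12 123.5'))
```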
Using the PyPI regex module, you could use \G to get iterative matches for the numbers and \K to forget what has been matched so far, which would be the space before the number.
(?:\bH:|\G(?!\A)) \K\d+(?:\.\d+)+
Assuming your string always starts with H:, you can do the following:
s2 = 'H: 1234.34.34 12.12 123.5'
output = s2.split("H: ")[-1].split()
The output will be
['1234.34.34', '12.12', '123.5']
The first split keeps everything after the "H: ".
The second split breaks that remainder on whitespace.
Based on your examples, you don't need a regex, split() will suffice:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
match1 = s1.split()[1:]
match2 = s2.split()[1:]
print(match1)
print(match2)
['1234.34.34']
['1234.34.34', '12.12', '123.5']

python regex find text with digits

I have text like:
sometext...one=1290...sometext...two=12985...sometext...three=1233...
How can I find one=1290 and two=12985 but not three or four or five? There can be 4 or 5 digits after =. I tried this:
import re
pattern = r"(one|two)+=+(\d{4,5})+\D"
found = re.findall(pattern, sometext, flags=re.IGNORECASE)
print(found)
It gives me results like: [('one', '1290')].
If I use pattern = r"((one|two)+=+(\d{4,5})+\D)" it gives me [('one=1290', 'one', '1290')]. How can I get just one=1290?
You were close. You need to use a single capture group (or none for that matter):
((?:one|two)+=+\d{4,5})+
Full code:
import re
string = 'sometext...one=1290...sometext...two=12985...sometext...three=1233...'
pattern = r"((?:one|two)+=+\d{4,5})+"
found = re.findall(pattern, string, flags=re.IGNORECASE)
print(found)
# ['one=1290', 'two=12985']
Make the inner groups non capturing: ((?:one|two)+=+(?:\d{4,5})+\D)
You are getting results like [('one', '1290')] rather than one=1290 because you are using capture groups. Use:
r"(?:one|two)=(?:\d{4,5})(?=\D)"
I have removed the additional + repeaters, as they were (I think?) unnecessary. You don't want to match things like oneonetwo===1234, right?
Using (?:...) rather than (...) defines a non-capture group. This prevents the result of the capture from being returned, and you instead get the whole match.
Similarly, using (?=\D) defines a look-ahead - so this is excluded from the match result.
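How re.findall shapes its result depends on the number of capture groups in the pattern. The sketch below runs three variants of the pattern on the sample text from the question; the variable names are only for illustration.

```python
import re

text = "sometext...one=1290...sometext...two=12985...sometext...three=1233..."

# No capture groups: findall returns the whole matches.
no_groups = re.findall(r"(?:one|two)=\d{4,5}(?=\D)", text)

# One capture group: findall returns that group for each match.
one_group = re.findall(r"((?:one|two)=\d{4,5})(?=\D)", text)

# Two capture groups: findall returns a tuple per match.
two_groups = re.findall(r"(one|two)=(\d{4,5})(?=\D)", text)

print(no_groups)
print(one_group)
print(two_groups)
```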

Stripping variable borders with python re

How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = '''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The ? makes the first [a-zA-Z]* lazy, so that for [[go]] it matches the empty (shortest) string and the second [a-zA-Z]* captures the actual go string.
\1 substitutes the first (in this case the only) capture group of the pattern for each non-overlapping match in the string s. The raw string r'\1' is used so that \1 is not interpreted as the character with code 0x01.
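The effect of the raw string can be checked directly. This sketch reuses the pattern from this answer and compares r"\1" with "\1"; the variable names are assumptions for illustration.

```python
import re

s = "[[merit|merited]] and [[eat|eaten]] and [[go]]"
p = r"\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]"

# Raw string: re.sub sees a backslash and "1", a reference to group 1.
raw = re.sub(p, r"\1", s)

# Non-raw string: Python turns "\1" into the single character 0x01
# before re.sub ever sees it, so that character is inserted literally.
not_raw = re.sub(p, "\1", s)

print(raw)
print(repr(not_raw))
```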
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = r'(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]], with the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
Or change your pattern to:
p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which drops the group around the entire match and captures only what you want. You can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\]   #literal matches of brackets
    (?:           )*                  #non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                    #matches any word that is followed by a '|' character
                    (  ...    )       #captures into group one the final word
I think this is more robust than what you originally had, because it also works when there are more than two options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
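Since the question asks how to substitute when the replacement varies per match, another option is to pass a function to re.sub. This is a sketch rather than the method of the answer above; the helper name last_option is made up, and the pattern assumes the bracketed text never contains a ] character.

```python
import re

def last_option(m):
    # m.group(1) is everything between [[ and ]]; keep the part after
    # the last "|" (or the whole text if there is no "|").
    return m.group(1).split("|")[-1]

s = "[[merit|merited]] and [[ate|eat|eaten]] and [[go]]"
result = re.sub(r"\[\[([^\]]*)\]\]", last_option, s)
print(result)
```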

(?:) regular expression Python

I came across a regular expression today but it was very poorly and scarcely explained. What is the purpose of (?:) regex in python and where & when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expressoin = re.findall(r'(?:a-z\+a-z)', word);
From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was not.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
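A small sketch of that scoping effect, using an assumed sample sentence:

```python
import re

text = "monty's ham and ham"

# Grouped: the alternatives apply only to the word after "monty's ".
scoped = re.findall(r"monty's (?:spam|ham|eggs)", text)

# Ungrouped: the pattern means "monty's spam" OR "ham" OR "eggs".
unscoped = re.findall(r"monty's spam|ham|eggs", text)

# Capturing group: findall returns the group instead of the whole match.
captured = re.findall(r"monty's (spam|ham|eggs)", text)

print(scoped)
print(unscoped)
print(captured)
```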
As for your sample attempt; using re.findall() you often do want to capture output. You most likely are looking for:
re.findall('([a-z]\+[a-z])', word)
where re.findall() will return a list of tuples of all captured groups; if there is only one captured group, it's a list of strings containing just the one group per match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall('([a-z]\+[a-z])', word)
['x+y']
?: makes a group non-capturing.
For example, with the regex (\d+) the match will be in group 1 (\1).
But if you use (?:\d+), there will be no group 1.
It is used for non-capturing groups:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'
