Python - Replace part of regex? - python

I have the following piece of code:
import re
s = "Example {String}"
replaced = re.sub('{.*?}', 'a', s)
print replaced
Which prints:
Example a
Is there a way to modify the regex so that it prints:
Example {a}
That is I would like to replace only the .* part and keep all the surrounding characters. However, I need the surrounding characters to define my regex. Is this possible?

Two ways to do this, either capture your pre- and post-fixes, or use lookbehinds and lookaheads.
# capture
s = "Example {String}"
replaced = re.sub(r'({).*?(})', r'\1a\2', s)
# lookaround
s = "Example {String}"
replaced = re.sub(r'(?<={).*?(?=})', r'a', s)
If you capture you pre- and post-fixes, you're able to use those capture groups in your substitution using \{capture_num}, e.g. \1 and \2.
If you use lookarounds, you essentially DON'T MATCH those pre-/post-fixes, you simply assert that they are there. Thus the substitution only swaps the letters in between.
Of course for this simple sub, the better solution is probably:
re.sub(r'{.*?}', r'{a}', s)

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))',string)
groups = match.groups()
alternations = []
for i in range(1, len(groups)):
if (groups[i] != None):
pattern = patterns[i-1]
break
print(pattern)
Result: mno.*pqr
Expressions inside round brackets are capture groups, they correspond to the 1st to last index of the response. The 0th index is the whole match.
Then you would need to find the index which matched. Except your patterns would need to be fined before hand.
Well you could iterate the terms in the regex alternation:
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
if re.search(term, string):
print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question How should I write the regex statement? should actually be:
There is no known way to write the regex statement using the provided regex pattern which will allow to extract from the regex search result the information which of the alternatives have triggered the match.
And as there is no way to do it using the given pattern it is necessary to change the regex pattern which then makes it possible to extract from the match the requested information.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex pattern search and has the disadvantage that there is a chance that it fails for some special search pattern alternatives.
The below provided code allows usage of simpler regex patterns without defining groups and works the "other way around" by checking which of the alternate patterns triggers a match in the found match for the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It might fail in case when in the text found match string fails to be also a match for the single regex pattern what can happen for example if a non-word boundary \B requirement is used in the search pattern ( as mentioned in the comments by Kelly Bundy ).
A not failing alternative solution is to perform the regex search using a modified regex pattern. Below an approach using a dictionary for defining the alternatives and a function returning the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
pattern = '|'.join(dct_alts.values())
re_match = re.match(pattern, text)
return(dct_alts[re_match.lastindex])
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
matches = []
for pattern in lst_alts:
re_match = re.match(pattern, text)
if re_match:
matches.append(pattern)
return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

Replace a character enclosed with lowercase letters

All the examples I've found on stack overflow are too complicated for me to reverse engineer.
Consider this toy example
s = "asdfasd a_b dsfd"
I want s = "asdfasd a'b dsfd"
That is: find two characters separated by an underscore and replace that underscore with an apostrophe
Attempt:
re.sub("[a-z](_)[a-z]","'",s)
# "asdfasd ' dsfd"
I thought the () were supposed to solve this problem?
Even more confusing is the fact that it seems that we successfully found the character we want to replace:
re.findall("[a-z](_)[a-z]",s)
#['_']
why doesn't this get replaced?
Thanks
Use look-ahead and look-behind patterns:
re.sub("(?<=[a-z])_(?=[a-z])","'",s)
Look ahead/behind patterns have zero width and thus do not replace anything.
UPD:
The problem was that re.sub will replace the whole matched expression, including the preceding and the following letter.
re.findall was still matching the whole expression, but it also had a group (the parenthesis inside), which you observed. The whole match was still a_b
lookahead/lookbehind expressions check that the search is preceded/followed by a pattern, but do not include it into the match.
another option was to create several groups, and put those groups into the replacement: re.sub("([a-z])_([a-z])", r"\1'\2", s)
When using re.sub, the text to keep must be captured, the text to remove should not.
Use
re.sub(r"([a-z])_(?=[a-z])",r"\1'",s)
See proof.
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of look-ahead
Python code:
import re
s = "asdfasd a_b dsfd"
print(re.sub(r"([a-z])_(?=[a-z])",r"\1'",s))
Output:
asdfasd a'b dsfd
The re.sub will replace everything it matched .
There's a more general way to solve your problem , and you do not need to re-modify your regular expression.
Code below:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0)) if _reg else r
#
re.sub(reg,repl, s)
output: 'Data: year=am_replace_str, monthday=am_replace_str, month=am_replace_str, some other text'
Of course, if your case does not contain groups , your code may like this :
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0))
#
re.sub(reg,repl, s)

python regex find text with digits

I have text like:
sometext...one=1290...sometext...two=12985...sometext...three=1233...
How can I find one=1290 and two=12985 but not three or four or five? There are can be from 4 to 5 digits after =. I tried this:
import re
pattern = r"(one|two)+=+(\d{4,5})+\D"
found = re.findall(pattern, sometext, flags=re.IGNORECASE)
print(found)
It gives me results like: [('one', '1290')].
If i use pattern = r"((one|two)+=+(\d{4,5})+\D)" it gives me [('one=1290', 'one', '1290')]. How can I get just one=1290?
You were close. You need to use a single capture group (or none for that matter):
((?:one|two)+=+\d{4,5})+
Full code:
import re
string = 'sometext...one=1290...sometext...two=12985...sometext...three=1233...'
pattern = r"((?:one|two)+=+\d{4,5})+"
found = re.findall(pattern, string, flags=re.IGNORECASE)
print(found)
# ['one=1290', 'two=12985']
Make the inner groups non capturing: ((?:one|two)+=+(?:\d{4,5})+\D)
The reason that you are getting results like [('one', '1290')] rather than one=1290 is because you are using capture groups. Use:
r"(?:one|two)=(?:\d{4,5})(?=\D)"
I have removed the additional + repeaters, as they were (I think?) unnecessary. You don't want to match things like oneonetwo===1234, right?
Using (?:...) rather than (...) defines a non-capture group. This prevents the result of the capture from being returned, and you instead get the whole match.
Similarly, using (?=\D) defines a look-ahead - so this is excluded from the match result.

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurences of $VAR or ${VAR}. If something like \$VAR or \${VAR} was given, it would not match. If it were given something like \\$VAR or \\${VAR} or any multiple of 2 \'s, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?
The second groupings of double slashes are written as a redundant character class [\\\\]*, matching one or more backslashes, but should be a repeating group ((?:\\\\)*) matching one or more sets of two backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)
To write a regular expression that finds $ unless it is escaped using E unless it in turn is also escaped EE:
import re
values = dict(BLOB='some value')
def repl(m):
return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' exposes your embed language without thinking about backslashes in Python string literals and regular expression patterns.

python. re.findall and re.sub with '^'

I try to change string like s='2.3^2+3^3-√0.04*2+√4',
where 2.3^2 has to change to pow(2.3,2), 3^3 - pow(3,3), √0.04 - sqrt(0.04) and
√4 - sqrt(4).
s='2.3^2+3^3-√0.04*2+√4'
patt1='[0-9]+\.[0-9]+\^[0-9]+|[0-9]+\^[0-9]'
patt2='√[0-9]+\.[0-9]+|√[0-9]+'
idx1=re.findall(patt1, s)
idx2=re.findall(patt2, s)
idx11=[]
idx22=[]
for i in range(len(idx1)):
idx11.append('pow('+idx1[i][:idx1[i].find('^')]+','+idx1[i][idx1[i].find('^')+1:]+')')
for i in range(len(idx2)):
idx22.append('sqrt('+idx2[i][idx2[i].find('√')+1:]+')')
for i in range(len(idx11)):
s=re.sub(idx1[i], idx11[i], s)
for i in range(len(idx22)):
s=re.sub(idx2[i], idx22[i], s)
print(s)
Temp results:
idx1=['2.3^2', '3^3']
idx2=['√0.04', '√4']
idx11=['pow(2.3,2)', 'pow(3,3)']
idx22=['sqrt(0.04)', 'sqrt(4)']
but string result:
2.3^2+3^3-sqrt(0.04)*2+sqrt(4)
Why calculating 'idx1' is right, but re.sub don't insert this value into string ?
(sorry for my english:)
Try this using only re.sub()
Input string:
s='2.3^2+3^3-√0.04*2+√4'
Replacing for pow()
s = re.sub("(\d+(?:\.\d+)?)\^(\d+)", "pow(\\1,\\2)", s)
Replacing for sqrt()
s = re.sub("√(\d+(?:\.\d+)?)", "sqrt(\\1)", s)
Output:
pow(2.3,2)+pow(3,3)-sqrt(0.04)*2+sqrt(4)
() means group capture and \\1 means first captured group from regex match. Using this link you can get the detail explanation for the regex.
I've only got python 2.7.5 but this works for me, using str.replace rather than re.sub. Once you've gone to the effort of finding the matches and constructing their replacements, this is a simple find and replace job:
for i in range(len(idx11)):
s = s.replace(idx1[i], idx11[i])
for i in range(len(idx22)):
s = s.replace(idx2[i], idx22[i])
edit
I think you're going about this in quite a long-winded way. You can use re.sub in one go to make these changes:
s = re.sub('(\d+(\.\d+)?)\^(\d+)', r'pow(\1,\3)', s)
Will substitute 2.3^2+3^3 for pow(2.3,2)+pow(3,3) and:
s = re.sub('√(\d+(\.\d+)?)', r'sqrt(\1)', s)
Will substitute √0.04*2+√4 to sqrt(0.04)*2+sqrt(4)
There's a few things going on here that are different. Firstly, \d, which matches a digit, the same as [0-9]. Secondly, the ( ) capture whatever is inside them. In the replacement, you can refer to these captured groups by the order in which they appear. In the pow example I'm using the first and third group that I have captured.
The prefix r before the replacement string means that the string is to be treated as "raw", so characters are interpreted literally. The groups are accessed by \1, \2 etc. but because the backslash \ is an escape character, I would have to escape it each time (\\1, \\2, etc.) without the r.

Categories

Resources