Replace named captured groups with arbitrary values in Python - python

I need to replace the value inside each capture group of a regular expression with some arbitrary value. I've had a look at re.sub, but it seems to work differently.
I have a string like this one:
s = 'monthday=1, month=5, year=2018'
and I have a regex that matches it with named capture groups, like the following:
regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
Now I want to replace the group named d with aaa, the group named m with bbb and the group named Y with ccc, as in the following example:
'monthday=aaa, month=bbb, year=ccc'
Basically, I want to keep all the text outside the groups and substitute the text captured by each group with some arbitrary value.
Is there a way to achieve the desired result?
Note
This is just an example; I could have other input regexes with a different structure but identically named capturing groups.
Update
Since most people seem to be focusing on the sample data, here is another sample. Let's say I have this other input data and regex:
input = '2018-12-12'
regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
As you can see, I still have the same three named capturing groups, but the structure is totally different. What I need, as before, is to replace each capturing group with some arbitrary text:
'ccc-bbb-aaa'
that is, replace the capture group named Y with ccc, the capture group named m with bbb and the capture group named d with aaa.
If regexes are not the best tool for the job, I'm open to other proposals that achieve my goal.

This is a completely backwards use of regex. The point of capture groups is to hold text you want to keep, not text you want to replace.
Since you've written your regex the wrong way, you have to do most of the substitution operation manually:
import re

def replace_groups(pattern, string, replacements):
    """
    Replaces the text captured by named groups.
    """
    pattern = re.compile(pattern)
    # create a dict of {group_index: group_name} for use later
    groupnames = {index: name for name, index in pattern.groupindex.items()}

    def repl(match):
        # we have to split the matched text into chunks we want to keep and
        # chunks we want to replace
        # captured text will be replaced. uncaptured text will be kept.
        text = match.group()
        # match.start(i) and match.end(i) are positions in the whole string,
        # so shift them to be relative to the matched text
        offset = match.start()
        chunks = []
        lastindex = 0
        for i in range(1, pattern.groups + 1):
            groupname = groupnames.get(i)
            # skip unnamed groups, groups without a replacement and groups
            # that did not take part in the match
            if groupname not in replacements or match.start(i) == -1:
                continue
            # keep the text between this group and the previous one
            chunks.append(text[lastindex:match.start(i) - offset])
            # then instead of the captured text, insert the replacement text for this group
            chunks.append(replacements[groupname])
            lastindex = match.end(i) - offset
        chunks.append(text[lastindex:])
        # join all the chunks to obtain the final string with replacements
        return ''.join(chunks)

    # for each occurrence call our custom replacement function
    return pattern.sub(repl, string)

>>> replace_groups(regex, s, {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
'monthday=aaa, month=bbb, year=ccc'

You can use string formatting with a regex substitution:
import re
s = 'monthday=1, month=5, year=2018'
s = re.sub(r'(?<=\=)\d+', '{}', s).format(*['aaa', 'bbb', 'ccc'])
Output:
'monthday=aaa, month=bbb, year=ccc'
Edit: given an arbitrary input string and regex, you can use formatting like so:
input = '2018-12-12'
regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
# note: this regex matches the whole date, so the entire match becomes a single '{}' and new_s is 'aaa'
new_s = re.sub(regex, '{}', input).format(*["aaa", "bbb", "ccc"])
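For both samples in the question, one possible approach is to use the span of each named group and splice the replacement in, working from right to left so that earlier offsets stay valid. This is only a minimal sketch: the function name replace_named_groups and the variable names are made up for illustration, and it assumes the named groups do not overlap or nest inside each other.

```python
import re

def replace_named_groups(pattern, text, replacements):
    """Replace the text captured by each named group listed in `replacements`."""
    def repl(match):
        base = match.start()      # group spans are positions in the whole string
        result = match.group()
        # (start, end, name) for every wanted named group that took part in the match
        spans = [(match.start(name), match.end(name), name)
                 for name in match.re.groupindex
                 if name in replacements and match.start(name) != -1]
        # splice from right to left so the remaining offsets are still correct
        for start, end, name in sorted(spans, reverse=True):
            result = result[:start - base] + replacements[name] + result[end - base:]
        return result
    return re.sub(pattern, repl, text)

values = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}
kv_regex = r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})'
date_regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'

print(replace_named_groups(kv_regex, 'monthday=1, month=5, year=2018', values))
print(replace_named_groups(date_regex, '2018-12-12', values))
```

The text outside the named groups is left untouched, so the same call should work for either regex structure as long as the group names stay the same.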

Extended Python 3.x solution for the extended example (re.sub() with a replacement function):
import re

d = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}  # predefined dict of replacement words
pat = re.compile(r'(monthday=)(?P<d>\d{1,2})|(month=)(?P<m>\d{1,2})|(year=)(?P<Y>20\d{2})')

def repl(m):
    # the one named group that took part in this match
    pair = next(t for t in m.groupdict().items() if t[1])
    # preceding `key` for the currently replaced sequence (i.e. 'monthday=' or 'month=' or 'year=')
    k = next(filter(None, m.groups()))
    return k + d.get(pair[0], '')

s = 'Data: year=2018, monthday=1, month=5, some other text'
result = pat.sub(repl, s)
print(result)
The output:
Data: year=ccc, monthday=aaa, month=bbb, some other text
For Python 2.7:
change the line k = next(filter(None, m.groups())) to:
k = filter(None, m.groups())[0]

I suggest you use a loop:
import re

regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
s = 'monthday=1, month=1, year=2017 \n'
s += 'monthday=2, month=2, year=2019'
regex_as_str = 'monthday={d}, month={m}, year={Y}'
matches = [match.groupdict() for match in regex.finditer(s)]
for match in matches:
    s = s.replace(
        regex_as_str.format(**match),
        regex_as_str.format(**{'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
    )
You can do this multiple times with your different regex patterns,
or you can join ("or") both patterns together.

Related

Multiple regex substitutions using a dict with regex expressions as keys

I want to make multiple substitutions to a string using multiple regular expressions. I also want to make the substitutions in a single pass to avoid creating multiple instances of the string.
Let's say for argument that I want to make the substitutions below, while avoiding multiple use of re.sub(), whether explicitly or with a loop:
import re
text = "local foals drink cola"
text = re.sub("(?<=o)a", "w", text)
text = re.sub("l(?=a)", "co", text)
print(text) # "local fowls drink cocoa"
The closest solution I have found for this is to compile a regular expression from a dictionary of substitution targets and then to use a lambda function to replace each matched target with its value in the dictionary. However, this approach does not work when using metacharacters, thus removing the functionality needed from regular expressions in this example.
Let me demonstrate first with an example that works without metacharacters:
import re
text = "local foals drink cola"
subs_dict = {"a":"w", "l":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
print(text) # "coocwco fowcos drink cocow"
Now observe that adding the desired metacharacters to the dictionary keys results in a KeyError:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
>>> KeyError: "a"
The reason for this is that the sub() function correctly finds a match for the expression "(?<=o)a", so this must now be found in the dictionary to return its substitution, but the value submitted for dictionary lookup by match.group(0) is the corresponding matched string "a". It also does not work to search for match.re in the dictionary (i.e. the expression that produced the match) because the value of that is the whole disjoint expression that was compiled from the dictionary keys (i.e. "(?<=o)a|l(?=a)").
EDIT: In case anyone would benefit from seeing thejonny's solution implemented with a lambda function as close to my originals as possible, it would work like this:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join("("+key+")" for key in subs_dict))
group_index = 1
indexed_subs = {}
for target, sub in subs_dict.items():
    indexed_subs[group_index] = sub
    group_index += re.compile(target).groups + 1
text = re.sub(subs_regex, lambda match: indexed_subs[match.lastindex], text)
print(text) # "local fowls drink cocoa"
If no expression you want to use matches an empty string (which is a valid assumption if you want to replace), you can use groups before |ing the expressions, and then check which group found a match:
(exp1)|(exp2)|(exp3)
Or maybe use named groups so you don't have to count the subgroups inside the subexpressions.
The replacement function can then look at which group matched and choose the replacement from a list.
I came up with this implementation:
import re

def dictsub(replacements, string):
    """replacements has the form {"regex1": "replacement", "regex2": "replacement2", ...}"""
    exprall = re.compile("|".join("("+x+")" for x in replacements))
    # gi is the index of the wrapping group of the current expression
    gi = 1
    replacements_by_gi = {}
    for (expr, replacement) in replacements.items():
        replacements_by_gi[gi] = replacement
        # skip this expression's own groups plus its wrapping group
        gi += re.compile(expr).groups + 1
    def choose(match):
        return replacements_by_gi[match.lastindex]
    return re.sub(exprall, choose, string)

text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))
That prints local fowls drink cocoa.
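The named-groups idea mentioned above could look roughly like this. It is only a sketch: the function name dictsub_named and the generated group names sub0, sub1, ... are made up for illustration, and it assumes that none of the expressions already defines a group with one of those names and that none of them matches an empty string.

```python
import re

def dictsub_named(replacements, string):
    """Like dictsub, but identifies the matching expression by a generated group name."""
    by_name = {}
    parts = []
    for i, (expr, replacement) in enumerate(replacements.items()):
        name = f"sub{i}"                      # hypothetical generated group name
        by_name[name] = replacement
        parts.append(f"(?P<{name}>{expr})")   # wrap each expression in a named group
    combined = re.compile("|".join(parts))
    # match.lastgroup is the name of the wrapping group that matched
    return combined.sub(lambda match: by_name[match.lastgroup], string)

text = "local foals drink cola"
print(dictsub_named({"(?<=o)a": "w", "l(?=a)": "co"}, text))
```

With named wrappers there is no need to count the groups inside each subexpression, because the lookup goes through match.lastgroup instead of a running group index.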
You could do this by keeping your key as the expected match and storing both your replace and regex in a nested dict. Given you're looking to match specific chars, this definition should work.
subs_dict = {"a": {'replace': 'w', 'regex': '(?<=o)a'}, 'l': {'replace': 'co', 'regex': 'l(?=a)'}}
subs_regex = re.compile("|".join([subs_dict[k]['regex'] for k in subs_dict.keys()]))
re.sub(subs_regex, lambda match: subs_dict[match.group(0)]['replace'], text)
'local fowls drink cocoa'

Replace characters with particular format with a variable value in python

I have filenames with the particular format as given
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
.
.
.
I want to replace the ?.M part that follows BH with the corresponding value from the list name.
name=['T','D','FG'.....]
expected output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
.
.
.
Is it possible with str.replace()?
You could use the built-in regex module (re) alongside the following pattern to effectively replace the content in your strings.
Pattern
'(?<=BH)[A-Z]+\.M'
This pattern uses a lookbehind (which does not consume any characters) to check for the substring 'BH', then matches one or more uppercase characters ([A-Z]+) followed by the substring '.M'.
Solution
The below solution uses re.sub() alongside the pattern outlined above to return a string with the substring matched by the pattern replaced with that defined here as replacement.
import re
original = 'II.NIL.10.BHB.M.2078.198.160857'
replacement = 'FG'
output = re.sub(r'(?<=BH)[A-Z]+\.M', replacement, original)
print(output)
Output
II.NIL.10.BHFG.2078.198.160857
Processing multiple files
To repeat this process for multiple files you could apply the above logic within a loop/comprehension, running the re.sub() function on each original/replacement pairing and storing/processing appropriately.
The below example uses the data from your original question alongside the above logic to create a list containing the results of each re.sub() operation by way of a dictionary mapping between the original filenames and substrings to be inserted using re.sub().
import re

originals = [
    'II.NIL.10.BHZ.M.2058.190.160877',
    'II.NIL.10.BHA.M.2008.190.168857',
    'II.NIL.10.BHB.M.2078.198.160857'
]
replacements = ['T','D','FG']
mapping = dict(zip(originals, replacements))
results = [re.sub(r'(?<=BH)[A-Z]+\.M', v, k) for k,v in mapping.items()]
for r in results:
    print(r)
Output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
Nope, you cannot use str.replace with a wildcard. You will have to use a regex, with something such as the following:
import re

filenames = ['II.NIL.10.BHA.M.2008.190.168857', 'II.NIL.10.BHB.M.2078.198.160857',
             'II.NIL.10.BHC.M.2078.198.160857']
name = ['T','D','FG']
newfilenames = []
for i in range(len(filenames)):
    newfilenames.append(re.sub(r'BH.?\.M', 'BH'+name[i], filenames[i]))
print(' '.join(newfilenames)) # outputs II.NIL.10.BHT.2008.190.168857 II.NIL.10.BHD.2078.198.160857 II.NIL.10.BHFG.2078.198.160857
You can use iter with next in the replacement lambda of re.sub:
import re
name = iter(['T','D','FG'])
s = """
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
"""
result = re.sub(r'(?<=BH)\w\.\w', lambda x: next(name), s)
Output:
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857

Python Regex to extract multiple complex groups

I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
where A-I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
that is, an arbitrary number of comma-separated pairs, separated from each other by semicolons, with one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually matches the whole text but fails to capture the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This clearly needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re

s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    # s[7:] skips the leading 'Sample='
    other_vals = [(p.group(1), p.group(2)) for p in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re

data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>        # either comma separated ending in semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                # OR
    (?P<end_part>    # the ending token which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)
results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using str.split.
An example:
words = [part.split(',') for part in 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')]
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
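Since the question also asks for validation (at least two pairs before the final value), the split approach can be extended with a few checks. The following is only a sketch under that reading of the requirements; the function name parse_sample is made up for illustration, and the input is assumed to be the part after "Sample=".

```python
def parse_sample(text):
    """Return (pairs, final_value) or raise ValueError if the text is malformed."""
    # everything before the last semicolon is pairs, the rest is the final value
    *pair_parts, final_val = text.split(';')
    pairs = [tuple(part.split(',')) for part in pair_parts]
    # each pair must have exactly two non-empty values, and we need at least two pairs
    if len(pairs) < 2 or any(len(p) != 2 or not all(p) for p in pairs):
        raise ValueError('expected at least two "a,b" pairs before the final value')
    if not final_val or ',' in final_val:
        raise ValueError('final value is missing or malformed')
    return pairs, final_val

pairs, final_val = parse_sample('val11,val12;val21,val22;valn1,valn2;final_val')
print(pairs)
print(final_val)
```

Malformed input such as 'a,b;c' (only one pair) raises ValueError instead of silently returning a partial result.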

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is one of my attempts, but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
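Sometimes it's good to take a break. I found a solution that works for me with bleach:
import bleach

def filter_class(name, value):
    # keep only the class attribute with the wanted value
    if name == 'class' and value == 'aaa':
        return True
    return False

attrs = {
    'div': filter_class,
}
# tags must be a collection of tag names; ('div') without a trailing comma is just the string 'div'
bleach.clean(html, tags=['div'], attributes=attrs, strip_comments=True)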
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
NOTE: If instead of a and b you have longer sequences, use a (?!(?:a|b) non-capturing group in the look-ahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
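As a quick illustration of that longer-sequence variant, here is a small sketch; the sample string is made up for this example, and the two class names are the ones from the pattern above.

```python
import re

# keep only class="arbuz" and class="baklazhan", drop every other class
pattern = re.compile(r',? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"')
text = 'class="arbuz", class="ab", class="baklazhan", class="etc"'
cleaned = pattern.sub('', text)
print(cleaned)
```

Note that if the very first entry is one of the removed ones, a leading ", " is left behind and may need to be stripped separately.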
Another pretty simple solution:
import re
st = 'class="a", class="b", class="ab", class="body", class="etc"'
res = re.findall(r'class="[a-b]"', st)
print(res)
['class="a"', 'class="b"']
You can also use re.sub very easily:
res = re.sub(r',? ?class="[a-zA-Z][a-zA-Z].*"', "", st)
print(res)
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print(",".join(text.split(",")[:2]))
This would give class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
import re

def keep(text, keep_list):
    keep_set = set(re.findall(r"class\s*=\s*[\"'](.*?)[\"']", text)).intersection(keep_list)
    # sort so that the output order is deterministic
    output_list = ['class="%s"' % a_class for a_class in sorted(keep_set)]
    return ', '.join(output_list)

print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"]))
print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"]))
This would print:
class="a", class="b"
class="body"

Can I expand a single string with the union of several regular expressions in python?

I'm hacking on a package that transforms file types, allowing the user to specify the transformation (a python function) and a regular expression for how the filename is to be changed.
In one case, I have a series of regexes and a single output string which I'd like to be expanded with the union of all my regex groups:
import re
re_strings = ['(.*).txt', '(.*).ogg', 'another(?P<name>.*)']
regexes = [re.compile(s) for s in re_strings]
input_files = ['cats.txt', 'music.ogg', 'anotherpilgrim.xls']
matches = [regexes[i].match(input_files[i]) for i in range(len(regexes))]
outputstr = r'Text file about: \1, audio file about: \2, and another file on \g<name>.'
# should be 'Text file about: cats, audio file about: music, and another file on pilgrim.xls'
I'd like to have outputstr be expanded with the union of the regular expressions (perhaps concatenation makes more sense for the \2 reference?). I could concatenate the re's, separating them by some unused character:
final_re = re.compile('\n'.join(re_strings))
final_files = '\n'.join(input_files)
match = final_re.search(final_files)
But this forces the re's to match the entire file, not just some portion of the filename. I can put in a catch-all group between the files a la (.*?) but that will surely mess up the group references and it might mess up the original patterns (which I have no control over). I guess I could also force named groups everywhere, then union all the regex .groupdict()s...
Python doesn't allow partial expansion so all group references have to be valid, so there's no chance of doing a series of expansions for the groupdict anyway like:
for m in matches:
    outputstr = m.expand(outputstr)
Thanks for any advice!
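To illustrate the point about partial expansion, here is a small sketch using only the first pattern (with the dot escaped, which is an adjustment made for this example). A template that references only existing groups expands, while one that also references a group the pattern does not have is rejected rather than partially expanded.

```python
import re

m = re.match(r'(.*)\.txt', 'cats.txt')

# a template that only references existing groups expands normally
expanded = m.expand(r'Text file about: \1')
print(expanded)

# a template that references a group this pattern does not have is rejected
error_message = None
try:
    m.expand(r'Text file about: \1, audio file about: \2')
except re.error as exc:
    error_message = str(exc)
print(error_message)
```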
Just for the record, here is how to combine the results of several regular expression matches and make replacements across them all.
Given several query strings, and several regular expression matches:
import re
query_str = ["abcdyyy", "hijkzzz"]
re_pattern = [r"(a)(b)(?P<first_name>c)(d)",
              r"(h)(i)(?P<second_name>j)(k)"]
# match each separately
matches = [re.search(p, q) for p, q in
           zip(re_pattern, query_str)]
We would like to make a substitution string combining the results of all the searches:
replacement = r"[\4_\g<first_name>_\2_\1:\5:\6:\8:\g<second_name>]"
To do this, we need to:
Merge the search results
Have a proxy in place of the merged results (match_substitute)
Have a proxy object to handle named groups e.g. "first_name" (pattern_substitute)
This is handled by the following code. The results are in "result":
# note: written for Python 2 (dict.iteritems) and relies on the internal sre_parse module
import sre_parse

#
# dummy object to provide group() function and "string" member variable
#
class match_substitute:
    def __init__(self, matches):
        self.data = []
        for m in matches:
            self.data.extend(m.groups())
        self.string = ""
    # regular expression groups count from 1 not from zero!
    def group(self, ii):
        return self.data[ii - 1]

#
# dummy object to provide groupindex dictionary for named groups
#
class pattern_substitute:
    def __init__(self, matches, re_pattern):
        #
        # Named group support
        # Increment indices so they point to the correct position
        # in the merged list of matching groups
        #
        self.groupindex = dict()
        offset = 0
        for p, m in zip(re_pattern, matches):
            for k, v in sre_parse.parse(p).pattern.groupdict.iteritems():
                self.groupindex[k] = v + offset
            offset += len(m.groups())

match = match_substitute(matches)
pattern = pattern_substitute(matches, re_pattern)
#
# parse and substitute
#
template = sre_parse.parse_template(replacement, pattern)
result = sre_parse.expand_template(template, match)
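Because sre_parse is an internal module, a version that stays within the public re API may be easier to carry over to Python 3. The following is only a sketch and not a drop-in replacement: the function name expand_merged is made up for illustration, it supports only \N and \g<name> (or \g<N>) references with N starting at 1, and it assumes every referenced group took part in its match.

```python
import re

def expand_merged(template, matches):
    """Expand group references in `template` against the merged groups of `matches`."""
    groups = []   # positional groups of all matches, concatenated in order
    named = {}    # named groups of all matches (later matches win on a name clash)
    for m in matches:
        groups.extend(m.groups())
        named.update(m.groupdict())

    def lookup(ref):
        key = ref.group('name') or ref.group('num')
        if key.isdigit():
            return groups[int(key) - 1]   # group numbers start at 1
        return named[key]

    # find \g<name> or \N references and replace each with the merged group text
    return re.sub(r'\\g<(?P<name>\w+)>|\\(?P<num>[1-9]\d?)', lookup, template)

query_str = ["abcdyyy", "hijkzzz"]
re_pattern = [r"(a)(b)(?P<first_name>c)(d)",
              r"(h)(i)(?P<second_name>j)(k)"]
matches = [re.search(p, q) for p, q in zip(re_pattern, query_str)]
replacement = r"[\4_\g<first_name>_\2_\1:\5:\6:\8:\g<second_name>]"
print(expand_merged(replacement, matches))
```

The group numbering follows the same convention as the merged list above: groups of the second match continue counting where the first match's groups end.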
