Regex substitute multiple patterns with corresponding replacement patterns - python

tokens = ['analytics', 'mining', 'quantities', ...]
for i in tokens:
    stem = re.sub(r'(\w+)(tics$)', r'\1sis', i, flags=re.IGNORECASE)
In this example, I'm replacing 'analytics' with 'analysis' using the re.sub().
What I want to do is to do this replacement using multiple patterns, for example:
stem = re.sub(r'(\w+)(ing$)', r'\1e', i, flags=re.IGNORECASE)
So that 'mining' would be replaced by 'mine'. And so on.
I was thinking of using a dict with patterns and repls. I imagine the dict would look something like this:
rules = {
    r'(\w+)(tics$)': r'\1sis',
    r'(\w+)(ing$)': r'\1e',
    ...
}
Would the backreference even work in a dict? I also don't know how to implement a dict into re.sub. How should I proceed?
Edit for further clarification:
The whole tokens list has a lot of items and I want to do the replacement on every word that matches a pattern. For example, there might be the word 'dining' further down in the tokens, and I want the second rule to catch that and replace it with 'dine'.

Try using a custom function in re.sub
Ex:
import re
tokens = ['analytics', 'mining', 'quantities', 'analyticsddd']
replacement = {"tics": "sis", "ing": "e"}
ptrn = re.compile("(" + "|".join(replacement.keys()) + ")$")
for i in tokens:
    print(ptrn.sub(lambda x: replacement.get(x.group(), x.group()), i))
Output:
analysis
mine
quantities
analyticsddd
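To address the original question directly: backreferences such as \1 are ordinary characters in a raw string, so they can be stored as dict values and passed to re.sub later. The following is a minimal sketch under that assumption. The names RULES, COMPILED and stem_token are illustrative and not from the question, and the sketch stops at the first rule that matches.

```python
import re

# Hypothetical rule table: regex pattern -> replacement template.
# The \1 backreference is just text here until re.sub interprets it.
RULES = {
    r'(\w+)tics$': r'\1sis',
    r'(\w+)ing$': r'\1e',
}

# Compile once so the patterns are not re-parsed for every token.
COMPILED = [(re.compile(p, re.IGNORECASE), r) for p, r in RULES.items()]

def stem_token(token):
    """Apply the first rule that matches; return the token unchanged otherwise."""
    for pattern, repl in COMPILED:
        new_token, count = pattern.subn(repl, token)
        if count:
            return new_token
    return token

tokens = ['analytics', 'mining', 'quantities', 'dining']
stems = [stem_token(t) for t in tokens]
print(stems)
```

Because dicts preserve insertion order in Python 3.7+, rules are tried in the order they are written, so more specific suffixes should be listed first.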

Related

Replace characters with particular format with a variable value in python

I have filenames with the particular format as given
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
.
.
.
I want to replace the ?.M part that follows BH with the corresponding value from a list of strings called name.
name=['T','D','FG'.....]
expected output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
.
.
.
Is it possible with str.replace()?
You could use the built-in regex module (re) alongside the following pattern to effectively replace the content in your strings.
Pattern
'(?<=BH)[A-Z]+\.M'
This pattern uses a lookbehind, (?<=BH), which asserts that the substring 'BH' comes immediately before the match without making it part of the match. It then matches one or more uppercase characters, [A-Z]+, followed by the literal substring '.M'.
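A small check can make the lookbehind behaviour concrete. This sketch uses one of the filenames from the question; the variable names are illustrative only.

```python
import re

text = 'II.NIL.10.BHB.M.2078.198.160857'
m = re.search(r'(?<=BH)[A-Z]+\.M', text)

matched = m.group()                        # only the part after 'BH'
preceding = text[m.start() - 2:m.start()]  # the text the lookbehind asserted
print(matched, preceding)
```

Since 'BH' is asserted rather than consumed, the replacement string passed to re.sub does not need to repeat it.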
Solution
The below solution uses re.sub() alongside the pattern outlined above to return a string with the substring matched by the pattern replaced with that defined here as replacement.
import re
original = 'II.NIL.10.BHB.M.2078.198.160857'
replacement = 'FG'
output = re.sub(r'(?<=BH)[A-Z]+\.M', replacement, original)
print(output)
Output
II.NIL.10.BHFG.2078.198.160857
Processing multiple files
To repeat this process for multiple files you could apply the above logic within a loop/comprehension, running the re.sub() function on each original/replacement pairing and storing/processing appropriately.
The below example uses the data from your original question alongside the above logic to create a list containing the results of each re.sub() operation by way of a dictionary mapping between the original filenames and substrings to be inserted using re.sub().
import re
originals = [
    'II.NIL.10.BHZ.M.2058.190.160877',
    'II.NIL.10.BHA.M.2008.190.168857',
    'II.NIL.10.BHB.M.2078.198.160857'
]
replacements = ['T','D','FG']
mapping = {originals[i]: replacements[i] for i, _ in enumerate(originals)}
results = [re.sub(r'(?<=BH)[A-Z]+\.M', v, k) for k,v in mapping.items()]
for r in results:
    print(r)
Output
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
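One caveat with the dictionary mapping is that duplicate filenames would collapse into a single key. A possible alternative, sketched here under the assumption that the two lists are already in matching order, is to pair them with zip. The names BH_SUFFIX and rename_all are hypothetical, and the pattern is tightened slightly to require dots around the matched field.

```python
import re

# Assumed shape: '.BH' + uppercase letters + '.M' followed by another dot.
BH_SUFFIX = re.compile(r'(?<=\.BH)[A-Z]+\.M(?=\.)')

def rename_all(filenames, names):
    """Replace the field after 'BH' in each filename with the paired name."""
    if len(filenames) != len(names):
        raise ValueError("need exactly one name per filename")
    # A function replacement avoids any backslash processing of the name.
    return [
        BH_SUFFIX.sub(lambda m, n=name: n, filename, count=1)
        for filename, name in zip(filenames, names)
    ]

originals = [
    'II.NIL.10.BHZ.M.2058.190.160877',
    'II.NIL.10.BHA.M.2008.190.168857',
    'II.NIL.10.BHB.M.2078.198.160857',
]
renamed = rename_all(originals, ['T', 'D', 'FG'])
for r in renamed:
    print(r)
```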
No, you cannot use str.replace with a wildcard. You will have to use a regex, with something such as the following:
import re
filenames = ['II.NIL.10.BHA.M.2008.190.168857', 'II.NIL.10.BHB.M.2078.198.160857',
             'II.NIL.10.BHC.M.2078.198.160857']
name = ['T','D','FG']
newfilenames = []
for i in range(len(filenames)):
    newfilenames.append(re.sub(r'BH.?\.M', 'BH'+name[i], filenames[i]))
print(' '.join(newfilenames)) # outputs II.NIL.10.BHT.2008.190.168857 II.NIL.10.BHD.2078.198.160857 II.NIL.10.BHFG.2078.198.160857
You can use iter with next in the replacement lambda of re.sub:
import re
name = iter(['T','D','FG'])
s = """
II.NIL.10.BHZ.M.2058.190.160877
II.NIL.10.BHA.M.2008.190.168857
II.NIL.10.BHB.M.2078.198.160857
"""
result = re.sub(r'(?<=BH)\w\.\w', lambda x: next(name), s)
print(result)
Output:
II.NIL.10.BHT.2058.190.160877
II.NIL.10.BHD.2008.190.168857
II.NIL.10.BHFG.2078.198.160857
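The iterator approach raises StopIteration if the text contains more matches than there are names. A hedged variant, assuming that leaving the extra matches untouched is acceptable, passes a default to next. The function name fill_in_order is illustrative.

```python
import re

def fill_in_order(text, names):
    """Substitute names into successive matches; leave extra matches as they are."""
    pending = iter(names)
    # next(pending, default) returns the default once the names are exhausted.
    return re.sub(r'(?<=BH)\w\.\w', lambda m: next(pending, m.group()), text)

s = (
    "II.NIL.10.BHZ.M.2058.190.160877\n"
    "II.NIL.10.BHA.M.2008.190.168857\n"
    "II.NIL.10.BHB.M.2078.198.160857"
)
result = fill_in_order(s, ['T', 'D'])  # deliberately one name short
print(result)
```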

Python - How to use regex to find multiple words and extract them at the same time

Using a regular expression, I want to find all the matching words in a sentence and extract the wanted part of each match at the same time.
I use findall from the re module to find the matching words, plus brackets to extract the parts I want.
For example, I have the string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the two characters that follow "0xQQ" or "0xWW", which should result in the list ["1A", "2B", "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile(r"0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print(MyList)
So my expected result is ["1A", "2B", "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile(r"0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
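The difference between the two patterns can be seen side by side. This sketch reuses the sample string from the question; the variable names are illustrative.

```python
import re

text = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"

# Two capturing groups: findall returns one tuple per match,
# with an empty string for the group that did not participate.
two_groups = re.findall(r"0xQQ(\w{2})|0xWW(\w{2})", text)

# One capturing group behind a non-capturing alternation:
# findall returns a flat list of the captured strings.
one_group = re.findall(r"0x(?:QQ|WW)(\w{2})", text)

print(two_groups)
print(one_group)
```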
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']

Replace named captured groups with arbitrary values in Python

I need to replace the value inside a capture group of a regular expression with some arbitrary value; I've had a look at the re.sub, but it seems to be working in a different way.
I have a string like this one :
s = 'monthday=1, month=5, year=2018'
and I have a regex matching it with captured groups like the following :
regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
now I want to replace the group named d with aaa, the group named m with bbb and group named Y with ccc, like in the following example :
'monthday=aaa, month=bbb, year=ccc'
basically I want to keep all the non matching string and substitute the matching group with some arbitrary value.
Is there a way to achieve the desired result ?
Note
This is just an example, I could have other input regexs with different structure, but same name capturing groups ...
Update
Since it seems like most of the people are focusing on the sample data, I add another sample, let's say that I have this other input data and regex :
input = '2018-12-12'
regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
as you can see I still have the same number of capturing groups(3) and they are named the same way, but the structure is totally different... What I need though is as before replacing the capturing group with some arbitrary text :
'ccc-bbb-aaa'
replace capture group named Y with ccc, the capture group named m with bbb and the capture group named d with aaa.
In case regexes are not the best tool for the job, I'm open to other proposals that achieve my goal.
This is a completely backwards use of regex. The point of capture groups is to hold text you want to keep, not text you want to replace.
Since you've written your regex the wrong way, you have to do most of the substitution operation manually:
import re

def replace_groups(pattern, string, replacements):
    """
    Replaces the text captured by named groups.
    """
    pattern = re.compile(pattern)
    # create a dict of {group_index: group_name} for use later
    groupnames = {index: name for name, index in pattern.groupindex.items()}

    def repl(match):
        # we have to split the matched text into chunks we want to keep and
        # chunks we want to replace
        # captured text will be replaced. uncaptured text will be kept.
        text = match.group()
        # match.start(i) and match.end(i) are positions in the whole string,
        # so shift them to be relative to the matched text
        offset = match.start()
        chunks = []
        lastindex = 0
        for i in range(1, pattern.groups + 1):
            groupname = groupnames.get(i)
            if groupname not in replacements:
                continue
            # keep the text between this match and the last
            chunks.append(text[lastindex:match.start(i) - offset])
            # then instead of the captured text, insert the replacement text for this group
            chunks.append(replacements[groupname])
            lastindex = match.end(i) - offset
        chunks.append(text[lastindex:])
        # join all the chunks to obtain the final string with replacements
        return ''.join(chunks)

    # for each occurrence call our custom replacement function
    return pattern.sub(repl, string)

>>> replace_groups(regex, s, {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
'monthday=aaa, month=bbb, year=ccc'
You can use string formatting with a regex substitution:
import re
s = 'monthday=1, month=5, year=2018'
s = re.sub(r'(?<==)\d+', '{}', s).format(*['aaa', 'bbb', 'ccc'])
Output:
'monthday=aaa, month=bbb, year=ccc'
Edit: for the second sample, substituting with the full regex would replace the whole date with a single placeholder, so replace each run of digits instead and pass the values in the order they appear in the string:
input = '2018-12-12'
new_s = re.sub(r'\d+', '{}', input).format(*["ccc", "bbb", "aaa"])
Extended Python 3.x solution on extended example (re.sub() with replacement function):
import re
d = {'d':'aaa', 'm':'bbb', 'Y':'ccc'} # predefined dict of replace words
pat = re.compile(r'(monthday=)(?P<d>\d{1,2})|(month=)(?P<m>\d{1,2})|(year=)(?P<Y>20\d{2})')
def repl(m):
    pair = next(t for t in m.groupdict().items() if t[1])
    k = next(filter(None, m.groups())) # preceding `key` for currently replaced sequence (i.e. 'monthday=' or 'month=' or 'year=')
    return k + d.get(pair[0], '')
s = 'Data: year=2018, monthday=1, month=5, some other text'
result = pat.sub(repl, s)
print(result)
The output:
Data: year=ccc, monthday=aaa, month=bbb, some other text
For Python 2.7 :
change the line k = next(filter(None, m.groups())) to:
k = filter(None, m.groups())[0]
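A possible simplification of the same idea, offered only as a sketch: if the key prefixes are moved into lookbehinds, each alternative contains a single named group, and Match.lastgroup then names the group that matched. The dict name values is illustrative.

```python
import re

values = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}  # group name -> replacement text

# Each alternative asserts its key with a fixed-width lookbehind,
# so only the digits are consumed and replaced.
pat = re.compile(
    r'(?<=monthday=)(?P<d>\d{1,2})'
    r'|(?<=month=)(?P<m>\d{1,2})'
    r'|(?<=year=)(?P<Y>20\d{2})'
)

s = 'Data: year=2018, monthday=1, month=5, some other text'
# lastgroup is the name of the one named group that took part in the match.
result = pat.sub(lambda m: values[m.lastgroup], s)
print(result)
```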
I suggest you use a loop
import re
regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
s = 'monthday=1, month=1, year=2017 \n'
s+= 'monthday=2, month=2, year=2019'
regex_as_str = 'monthday={d}, month={m}, year={Y}'
matches = [match.groupdict() for match in regex.finditer(s)]
for match in matches:
    s = s.replace(
        regex_as_str.format(**match),
        regex_as_str.format(**{'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
    )
You can do this multiple times with your different regex patterns,
or you can join ("or") both patterns together.
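For regexes of differing structure that share the same group names, one option is to look up each named group's span on the match object and splice the replacements in from right to left, so that earlier positions stay valid. This is a sketch under the assumption that the named groups do not overlap or nest; replace_named_groups is a hypothetical helper name.

```python
import re

def replace_named_groups(pattern, text, replacements):
    """Replace the text captured by each named group with replacements[name]."""
    regex = re.compile(pattern)

    def repl(match):
        whole = match.group(0)
        base = match.start()  # spans are absolute, so shift them to the match
        spans = sorted(
            (
                (match.span(name), value)
                for name, value in replacements.items()
                if name in regex.groupindex and match.start(name) != -1
            ),
            reverse=True,  # rightmost group first keeps earlier offsets valid
        )
        for (start, end), value in spans:
            whole = whole[:start - base] + value + whole[end - base:]
        return whole

    return regex.sub(repl, text)

repl_map = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}
first = replace_named_groups(
    r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})',
    'monthday=1, month=5, year=2018', repl_map)
second = replace_named_groups(
    r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))',
    'on 2018-12-12 at noon', repl_map)
print(first)
print(second)
```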

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is one of my attempts, but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
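Sometimes it's good to take a break. I found a solution that works for me with bleach:
# written against an older bleach release; bleach 2.0 and later call
# attribute filters as filter_class(tag, name, value)
def filter_class(name, value):
    if name == 'class' and value == 'aaa':
        return True

attrs = {
    'div': filter_class,
}
bleach.clean(html, tags=['div'], attributes=attrs, strip_comments=True)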
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
NOTE: If instead of a and b you have longer sequences, use a non-capturing group, (?:a|b), inside the look-ahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
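The look-ahead pattern can also be built from a list of class names to keep. The following is a minimal sketch, assuming the names should be matched literally (hence re.escape); drop_other_classes is a hypothetical helper name.

```python
import re

def drop_other_classes(text, keep):
    """Remove every class="..." entry whose value is not in keep."""
    allowed = "|".join(re.escape(name) for name in keep)
    # The negative look-ahead rejects a value that is exactly one of the kept names.
    pattern = re.compile(r',? ?\bclass="(?!(?:%s)")[^"]+"' % allowed)
    return pattern.sub('', text)

sample = 'class="a", class="b", class="ab", class="body", class="etc"'
kept = drop_other_classes(sample, ['a', 'b'])
print(kept)
```

Note that if the first entry in the string is one that gets removed, the comma before the next kept entry is left behind, so the result may need a final strip of leading separators.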
Another pretty simple solution:
st = 'class="a", class="b", class="ab", class="body", class="etc"'
import re
res = re.findall(r'class="[a-b]"', st)
print(res)
['class="a"', 'class="b"']
You can also use re.sub, as long as the entries to drop all come after the ones to keep:
res = re.sub(r',? ?class="[a-zA-Z][a-zA-Z].*"', "", st)
print(res)
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print(",".join(text.split(",")[:2]))
This would give class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
import re

def keep(text, keep_list):
    wanted = set(keep_list)
    # \s* allows optional whitespace around the "=" sign
    found = re.findall(r"class\s*=\s*[\"'](.*?)[\"']", text)
    # filter in order of appearance (without duplicates) so the output is deterministic
    output_list = ['class="%s"' % a_class for a_class in dict.fromkeys(found) if a_class in wanted]
    return ', '.join(output_list)

print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"]))
print(keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"]))
This would print:
class="a", class="b"
class="body"

Trying to convert a nested loop with two sequences into a lambda

I've got this function that checks all the words in the first sequence and,
if a word ends with one of the words in the second sequence, removes that ending substring.
I'm trying to achieve all that in one simple lambda function that is supposed to go into a pipeline processing, and can't find a way to do it.
I'll be grateful if you could help me with this:
str_test = ("Thiship is a test string testing slowly i'm helpless")
stem_rules = ('less', 'ship', 'ing', 'es', 'ly','s')
str_test2 = str_test.split()
for i in str_test2:
    for j in stem_rules:
        if(i.endswith(j)):
            str_test2[str_test2.index(i)] = i[:-len(j)]
            break
This is a one-liner that activates a (simple?) lambda that does it.
(lambda words, rules: sum([[word[:-len(rule)]] if word.endswith(rule) else [] for word in words for rule in rules], []))(str_test.split(), stem_rules)
It's not clear how it's working, and it's not good practice to do it.
What it generally does is create a list with a single string out of matches, or an empty list out of misses, and then aggregates everything to single list, containing only the matches.
Currently it will output on every match, and not just longest match or anything like that, but once you figure out how it's working, maybe you can select the shortest match from the list of matches for each word in the input.
May god be with you.
The first thing I'd do is drop the i.endswith(j) for j in stem_rules check and replace it with a regex that matches and captures the prefix string and matches (but doesn't capture) any suffix:
import re
match_end = re.compile("(.*?)(?:" + "|".join(re.escape(stem) + "$" for stem in stem_rules) + ")")
# This is the same as:
re.compile(r"""
    (.*?)      # Capturing group matching the prefix
    (?:        # Begins a non-capturing group...
      stem1$|
      stem2$|
      stem3$   # ...which matches an alternation of the stems, asserting end of string
    )          # ends the non-capturing group""", re.X)
Then you can use that regex to sub each item in the list.
f = lambda word: match_end.sub(r"\1", word)
Use that wrapped in a list comprehension and you should have your result
words = [f(word) for word in str_test.split()]
# or map(f, str_test.split())
To convert your current code into a single lambda, each step in the pipeline needs to behave in a very functional manner: receive some data, and then emit some data. You need to avoid anything that deviates from that paradigm, in particular the use of things like break. Here's one way to rewrite the steps in that manner:
text = ("Thiship is a test string testing slowly i'm helpless")
stems = ('less', 'ship', 'ing', 'es', 'ly','s')
# The steps:
# - get words from the text
# - pair each word with its matching stems
# - create a list of cleaned words (stems removed)
# - make the new text
words = text.split()
wstems = [ (w, [s for s in stems if w.endswith(s)]) for w in words ]
cwords = [ w[0:-len(ss[0])] if ss else w for w, ss in wstems ]
text2 = ' '.join(cwords)
print(text2)
With those parts in hand, a single lambda can be created using ordinary substitution. Here's the monstrosity:
f = lambda txt: [
    w[0:-len(ss[0])] if ss else w
    for w, ss in [ (w, [s for s in stems if w.endswith(s)]) for w in txt.split() ]
]
text3 = ' '.join(f(text))
print(text3)
I wasn't sure whether you want the lambda to return the new words or the new text -- adjust as needed.
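The effect of break (stop at the first matching rule) can also be expressed with next over a generator, which keeps the lambda short. This is only a sketch; strip_suffix is an illustrative name, and the rules are tried in the order given, as in the original loop.

```python
stem_rules = ('less', 'ship', 'ing', 'es', 'ly', 's')

# next() takes the first stripped form the generator yields,
# or falls back to the word itself when no rule matches.
strip_suffix = lambda word, rules=stem_rules: next(
    (word[:-len(rule)] for rule in rules if word.endswith(rule)),
    word,
)

str_test = "Thiship is a test string testing slowly i'm helpless"
stemmed = [strip_suffix(w) for w in str_test.split()]
print(stemmed)
```

Because it takes one word and returns one word, strip_suffix can be dropped into a pipeline step with map or a list comprehension.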
