Count overlapping regex matches once again - python

How can I obtain the number of overlapping regex matches using Python?
I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:
input example string: akka
search pattern: a.*k
A proper function should yield 2 as the number of matches, since there are two possible end positions (k letters).
The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).
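For context, the standard re approaches report only a single match for this input, which is why the usual overlapping-match tricks don't directly answer the question (a small illustration added here, not part of the original question):
import re

# Both the plain pattern and the common zero-width lookahead trick
# find only one match, starting at the first 'a'.
print(re.findall(r'a.*k', 'akka'))        # ['akk']
print(re.findall(r'(?=(a.*k))', 'akka'))  # ['akk']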

I think that what you're looking for is probably better done with a parsing library like lepl:
>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]
I believe that the length of the output from parser.parse_all is what you're looking for.
Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.
Edit: Based on the comment from @Shamanu4, I see you want matching results starting from any position. You can do that as follows:
>>> import itertools
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
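Wrapped into a small helper, the same idea looks roughly like this (a sketch assuming lepl is installed; count_overlapping is an illustrative name, not part of the original answer):
from lepl import Literal, Any

def count_overlapping(parser, text):
    # Count every parse the parser produces, starting from each position in text.
    parser.config.no_full_first_match()
    matches = []
    for i in range(len(text)):
        matches.extend(list(parser.parse_all(text[i:])))
    return len(matches)

print(count_overlapping(Literal('a') + Any()[:] + Literal('k'), 'akka'))  # 2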

Yes, it is ugly and unoptimized, but it seems to work. It is a simple brute-force attempt that tries all possible but unique variants:
def myregex(pattern, text, dir=0):
    import re
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            for r in myregex(pattern, "%s%s%s" % (m.group('prefix'), m.group('suffix')[1:], m.group('end')), 1):
                yield r
            if dir < 1:
                for r in myregex(pattern, "%s%s%s" % (m.group('prefix'), m.group('suffix')[:-1], m.group('end')), -1):
                    yield r
def myprocess(pattern, text):
    parts = pattern.split("*")
    for i in range(0, len(parts)-1):
        res = ""
        for j in range(0, len(parts)):
            if j == 0:
                res += "(?P<prefix>"
            if j == i:
                res += ")(?P<suffix>"
            res += parts[j]
            if j == i+1:
                res += ")(?P<end>"
            if j < len(parts)-1:
                if j == i:
                    res += ".*"
                else:
                    res += ".*?"
            else:
                res += ")"
        for r in myregex(res, text):
            yield r

def mycount(pattern, text):
    return set(myprocess(pattern, text))
test:
>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])

Related

Python RegEx: how to replace each match individually

I have a string s, a pattern p and a replacement r; I need to get the list of strings in which only one match with p has been replaced with r.
Example:
s = 'AbcAbAcc'
p = 'A'
r = '_'
// Output:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
I have tried re.finditer(p, s), but I couldn't figure out how to replace each match with r.
You can replace them manually after finding all the matches:
[s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
The result is:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
How does it work?
re.finditer(p, s) will find all matches (each will be a re.Match object)
the re.Match objects have start() and end() methods which return the location of the match
you can replace the part of the string between start and end using: s[:start] + replacement + s[end:]
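The same expression can be wrapped into a small reusable helper (replace_each is an illustrative name, not from the answer):
import re

def replace_each(pattern, repl, s):
    # One copy of s per match, with only that single match replaced.
    return [s[:m.start()] + repl + s[m.end():] for m in re.finditer(pattern, s)]

print(replace_each('A', '_', 'AbcAbAcc'))
# ['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']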
You don't need regex for this, it's as simple as
[s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
Full code: See it working here
s = 'AbcAbAcc'
p = 'A'
r = '_'
x = [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
print(x)
Outputs:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
As mentioned, this only works for a single character; for anything longer than one character, or if a regex is required, use zvone's answer.
For a performance comparison between mine and zvone's answer (plus a third method of doing this without regex), see here or test it yourself with the code below:
import timeit,re
s = 'AbcAbAcc'
p = 'A'
r = '_'
def x1():
    return [s[:i]+r+s[i+1:] for i, c in enumerate(s) if c == p]

def x2():
    return [s[:i]+r+s[i+1:] for i in range(len(s)) if s[i] == p]

def x3():
    return [s[:m.start()] + r + s[m.end():] for m in re.finditer(p, s)]
print(x1())
print(timeit.timeit(x1, number=100000))
print(x2())
print(timeit.timeit(x2, number=100000))
print(x3())
print(timeit.timeit(x3, number=100000))

Replace a substring in a string according to a list

According to tutorialspoint:
The method replace() returns a copy of the string in which the occurrences of old have been replaced with new. https://www.tutorialspoint.com/python/string_replace.htm
Therefore one can use:
>>> text = 'fhihihi'
>>> text.replace('hi', 'o')
'fooo'
With this idea, given a list [1,2,3] and a string 'fhihihi', is there a method to replace the substring hi with 1, 2, and 3 in order? For example, this theoretical solution would yield:
'f123'
You can create a format string out of your initial string:
>>> text = 'fhihihi'
>>> replacement = [1,2,3]
>>> text.replace('hi', '{}').format(*replacement)
'f123'
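One caveat worth noting (my illustration, not part of the answer): if the original text itself contains literal braces, they must be escaped before .format() is applied, or format() will treat them as placeholders:
text = 'f{hi}hihi'
# Double up pre-existing braces so .format() leaves them alone.
safe = text.replace('{', '{{').replace('}', '}}').replace('hi', '{}')
print(safe.format(1, 2, 3))  # f{1}23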
Use re.sub:
import re

text = 'fhihihi'
counter = 0

def replacer(match):
    global counter
    counter += 1
    return str(counter)

re.sub(r'hi', replacer, text)
This is going to be way faster than any alternative using str.replace
One solution with re.sub:
text = 'fhihihi'
lst = [1,2,3]
import re
print(re.sub(r'hi', lambda g, l=iter(lst): str(next(l)), text))
Prints:
f123
Other answers gave good solutions. If you want to re-invent the wheel, here is one way.
text = "fhihihi"
target = "hi"
l = len(target)
i = 0
c = 1  # start numbering at 1 so the output matches 'f123'
new_string_list = []
while i < len(text):
    if text[i:i + l] == target:
        new_string_list.append(str(c))
        i += l
        c += 1
        continue
    new_string_list.append(text[i])
    i += 1
print("".join(new_string_list))
A list is used here to avoid creating a new string on every concatenation.
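If you want the loop above to consume the replacement list itself rather than a counter, a sketch along the same lines might look like this (replace_in_order is an illustrative name):
def replace_in_order(text, target, replacements):
    out, i, it = [], 0, iter(replacements)
    while i < len(text):
        if text.startswith(target, i):
            out.append(str(next(it)))  # take the next replacement from the list
            i += len(target)
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

print(replace_in_order('fhihihi', 'hi', [1, 2, 3]))  # f123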

Parsing text files with "magic" values

Background
I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:
A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA
The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:
A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0
Problem
Sometimes, text files have keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword/delimiter that is unlikely to pop up in normal circumstances, i.e:
A[#1] + B[#2] - C[#3] [[#1]] #1 # #1
Question
Does python's text file parser have any special indexing/escape sequences I could use instead of simple keywords?
Use a regular expression replacement function with a dictionary.
Match everything between brackets (using a character class that excludes the brackets themselves), replace it with the value from the dict, and keep the original value if the key is not found:
import re
d = {"BANANA":"12", "PINEAPPLE":"20","CHERRY":"100","BANANA":"400"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(re.sub("\[([^\[\]]*)\]",lambda m : "[{}]".format(d.get(m.group(1),m.group(1))),s))
prints:
A[400] + B[20] - C[100] [[400]]
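As a side note (my illustration): the dict literal above lists BANANA twice, and Python keeps the last value for a duplicate key, which is why BANANA maps to 400 in the output:
d = {"BANANA": "12", "BANANA": "400"}
print(d["BANANA"])  # 400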
You can use re.sub to perform the substitution. This answer creates a list of randomized values to demonstrate; however, the list can be replaced with the data you are using:
import re
import random
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
random_data = [[random.randint(1, 2000) for i in range(4)] for _ in range(10)]
final_results = [new_s.format(*i) for i in random_data]
for command in final_results:
print(command)
Output:
A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]
Just use
\[([^][]+)\]
and replace this with the desired result, e.g. 123.
Broken down, this says
\[ # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\] # closing bracket
See a demo on regex101.com.
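Applied from Python, that replacement might look like this (a sketch; the answer itself only gives the pattern):
import re

s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(re.sub(r'\[([^][]+)\]', '[123]', s))
# A[123] + B[123] - C[123] [[123]]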
For your changed requirements, you could use an OrderedDict:
import re
from collections import OrderedDict
rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()
def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)
string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)
Which yields
A[#1] + B[#2] - C[#3] [[#1]]
The idea here is to put every (potentially) new item into the dict, then look up its index. OrderedDicts remember insertion order.
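On Python 3.7+ a plain dict preserves insertion order as well, so the same approach works without OrderedDict (a minimal sketch, not the answer's original code):
import re

rx = re.compile(r'\[([^][]+)\]')
seen = {}

def replacer(match):
    item = match.group(1)
    seen.setdefault(item, len(seen) + 1)  # assign the next index the first time an item appears
    return '[#{}]'.format(seen[item])

print(rx.sub(replacer, "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"))
# A[#1] + B[#2] - C[#3] [[#1]]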
For the sake of academic completeness, you could do it all on your own as well:
import re
class Replacer:
    rx = re.compile(r'\[([^][]+)\]')
    keywords = []

    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)

    def replace(self, string):
        return self.rx.sub(self.do_replace, string)

    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx
        self.keywords.append(item)
        return len(self.keywords) - 1
string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
rpl = Replacer()
string = rpl.replace(string)
print(string)
This can also be done using pyparsing.
This parser essentially defines noun to be the uppercase words within square brackets, then defines a sequence of them (complete) to be one line of input.
To replace the identified items with other things, define a class derived from dict such that anything not in the dict is left unchanged.
>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
...
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'
I think it's easier — and clearer — using plex. The snag is that it appears to be available only for Py2. It took me an hour or two to make sufficient conversion work to Py3 to get this.
Just three types of tokens to watch for, then a similar number of branches within a while statement.
from plex import *
from io import StringIO
stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')
lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])

class Replace(dict):
    def __missing__(self, key):
        return key

replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0] == 'not_brackets':  # the token name must match the lexicon entry
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])
print(''.join(new_statement))
Result:
A[1] + B[2] - C[CHERRY] [[1]]

Python regular expression?

import re
pattern = re.compile(r"(\d{3})+$")
print pattern.match("123567").groups()
output result:
('567',)
I need the result to be ('123', '567').
The (\d{3}) group can only output the last repetition, but I want every group.
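For background (an added illustration): when a capture group is repeated with +, the re module keeps only the text of the last repetition, which is why only '567' is captured. If you simply need the three-digit chunks, findall sidesteps the problem:
import re

print(re.match(r'(\d{3})+$', '123567').groups())  # ('567',)  only the last repetition is kept
print(re.findall(r'\d{3}', '123567'))             # ['123', '567']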
Here is a somewhat Pythonic way of doing it.
Solution 1
Python Code
import re

p = re.compile(r'(?<=\d)(?=(?:\d{3})+$)')
test_str = "2890191245"
tmp = [x.start() for x in re.finditer(p, test_str)]
res = [test_str[0: tmp[0]]] + [(test_str[tmp[i]: tmp[i] + 3]) for i in range(len(tmp))]
Solution 2 (one liner)
print(re.sub(r"(?<=\d)(?=(\d{3})+$)", ",", test_str).split(","))

Splitting a string before the nth occurrence of a character [duplicate]

Is there a Python-way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is a regex.
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re

n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m:
    print m.groups()
or have a nice function:
import re

def nthofchar(s, c, n):
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n-1, c, c)
    l = ()
    m = re.match(regex, s)
    if m:
        l = m.groups()
    return l

s = '20_231_myString_234'
print nthofchar(s, '_', 2)
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]

s1, s2 = nth_split('20_231_myString_234', '_', 2)
print s1, ":", s2
I like this solution because it works without any actual regex and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # the occurrence at which you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() lets you limit the number of splits performed:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
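The linked script is not reproduced here, but a minimal benchmark along the same lines might look like this (function names are illustrative, not the author's):
import re
import timeit

s = '20_231_myString_234'

def with_split_at():
    r = s.split('_', 2)[2]
    return s[:-len(r) - 1], r

def with_iterative_find():
    p, c = -1, 0
    while c < 2:
        p = s.index('_', p + 1)
        c += 1
    return s[:p], s[p + 1:]

def with_regex():
    return re.match(r'^((?:[^_]*_){1}[^_]*)_(.*)', s).groups()

for f in (with_split_at, with_iterative_find, with_regex):
    print(f.__name__, timeit.timeit(f, number=100000))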
>>> import re
>>> s = '20_231_myString_234'
>>> occurrences = [m.start() for m in re.finditer('_', s)]  # positions of every '_'
>>> occurrences
[2, 6, 15]
>>> result = [s[:occurrences[1]], s[occurrences[1]+1:]]  # [s[:6], s[7:]]
>>> result
['20_231', 'myString_234']
It depends on what your pattern for the split is. If, for example, the first two elements are always numbers, you can build a regular expression and use the re module; it can split your string as well.
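For instance, if the first two fields are always numeric, a pattern along these lines would do it (a sketch, not from the answer):
import re

s = '20_231_myString_234'
m = re.match(r'(\d+_\d+)_(.*)', s)
print(m.groups())  # ('20_231', 'myString_234')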
I had a larger string to split at every nth occurrence of a delimiter and ended up with the following code:
# Split into groups of 6 space-separated pieces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print n_split_groups
Thanks @perreal!
A function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval noted in his answer, and @jamylak commented in his, the split and rsplit methods accept a second (optional) maxsplit parameter to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0]   # Gives '20_231'
second_part = text.split('_', 2)[2]   # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
