I want to replace a few occurrences from the end of a string. I have tried this:
replace_me="<!doctype html><html><body><p>hello</p></body></html>"
print('replacing 3 matches of > from back of string. please wait...')
replace_me.replace('>', '&gt;', -1)
print(replace_me)
But it gives me unreplaced output: <!doctype html><html><body><p>hello</p></body></html>
Is it even possible to replace the last few occurrences of a string?
If you're replacing from the back, you can just flip the string and use a reversed replacement to replace the last N occurrences:
replace_me="<!doctype html><html><body><p>hello</p></body></html>"
N=3
newstr = '&gt;'[::-1]
replace_me_new = replace_me[::-1].replace('>', newstr, N)[::-1]
print(replace_me_new)
which outputs:
<!doctype html><html><body><p>hello</p&gt;</body&gt;</html&gt;
To generalize in a way that mimics str.replace():
def rreplace(s, old, new, count=-1):
    return s[::-1].replace(old[::-1], new[::-1], count)[::-1]
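As a quick sanity check, here is a self-contained run of rreplace (the function is repeated so the snippet stands alone; the sample string is my own):

```python
def rreplace(s, old, new, count=-1):
    # Reverse the string, replace the reversed pattern from the front,
    # then reverse back so the replacements land at the end.
    return s[::-1].replace(old[::-1], new[::-1], count)[::-1]

print(rreplace('a-b-c-d', '-', '+', 2))  # 'a-b+c+d': only the last two dashes change
```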
def replace_last(a, b, s, n=1):
    for _ in range(n):
        i = s.rindex(a)
        s = s[0:i] + b + s[i + len(a):]
    return s
Usage:
>>> replace_last('>', '&gt;', "<!doctype html><html><body><p>hello</p></body></html>", n=3)
'<!doctype html><html><body><p>hello</p&gt;</body&gt;</html&gt;'
I have a string s, a pattern p, and a replacement r; I need to get the list of strings in which only one match of p has been replaced with r.
Example:
s = 'AbcAbAcc'
p = 'A'
r = '_'
# Output:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
I have tried re.finditer(p, s), but I couldn't figure out how to replace each match with r.
You can replace them manually after finding all the matches:
[s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
The result is:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
How does it work?
re.finditer(p, s) finds all the matches (each one is a re.Match object)
re.Match objects have start() and end() methods, which return the location of the match
you can replace the part of the string between start and end with: s[:start] + replacement + s[end:]
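Because this approach works from match positions rather than single characters, it also handles multi-character patterns. A self-contained run (the pattern 'Ab' is my own illustrative choice):

```python
import re

s = 'AbcAbAcc'
p = 'Ab'  # a multi-character pattern, chosen here purely for illustration
r = '_'

# One output string per match, replacing only that match.
out = [s[:m.start()] + r + s[m.end():] for m in re.finditer(p, s)]
print(out)  # ['_cAbAcc', 'Abc_Acc']
```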
You don't need regex for this, it's as simple as
[s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
Full code:
s = 'AbcAbAcc'
p = 'A'
r = '_'
x = [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
print(x)
Outputs:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
As mentioned, this only works on one character, for anything longer than one character or requiring a regex, use zvone's answer.
For a performance comparison between my answer and zvone's (plus a third method of doing this without regex), test it yourself with the code below:
import timeit, re

s = 'AbcAbAcc'
p = 'A'
r = '_'

def x1():
    return [s[:i] + r + s[i+1:] for i, c in enumerate(s) if c == p]

def x2():
    return [s[:i] + r + s[i+1:] for i in range(len(s)) if s[i] == p]

def x3():
    return [s[:m.start()] + r + s[m.end():] for m in re.finditer(p, s)]
print(x1())
print(timeit.timeit(x1, number=100000))
print(x2())
print(timeit.timeit(x2, number=100000))
print(x3())
print(timeit.timeit(x3, number=100000))
Is there a Pythonic way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is regex.
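The same count-split-join idea can be wrapped in a small helper (a sketch, with names of my own choosing):

```python
def split_nth_join(text, delim, n):
    # Split on every delimiter, then rejoin the first n groups
    # and the remaining groups separately.
    groups = text.split(delim)
    return delim.join(groups[:n]), delim.join(groups[n:])

print(split_nth_join('20_231_myString_234', '_', 2))  # ('20_231', 'myString_234')
```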
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re

n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m: print(m.groups())
or have a nice function:
import re
def nthofchar(s, c, n):
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n-1, c, c)
    l = ()
    m = re.match(regex, s)
    if m: l = m.groups()
    return l

s = '20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]

s1, s2 = nth_split('20_231_myString_234', '_', 2)
print(s1, ":", s2)
I like this solution because it works without any actual regex and can easily be adapted to another "nth" or delimiter.
import re

string = "20_231_myString_234"
occur = 2  # on which occurrence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print(part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() allows you to limit the number of splits:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
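A quick self-contained check of remove_head_parts (repeated here so the snippet runs on its own):

```python
def remove_head_parts(s, delim, n):
    # maxsplit=n stops splitting after n delimiters;
    # element n of the result is the untouched tail.
    return s.split(delim, n)[n]

print(remove_head_parts('20_231_myString_234', '_', 2))  # myString_234
```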
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
>>> import re
>>> s = '20_231_myString_234'
>>> occurrences = [m.start() for m in re.finditer('_', s)]  # positions of '_'
>>> occurrences
[2, 6, 15]
>>> result = [s[:occurrences[1]], s[occurrences[1]+1:]]  # [s[:6], s[7:]]
>>> result
['20_231', 'myString_234']
It depends on what your pattern for this split is. If the first two elements are always numbers, for example, you may build a regular expression and use the re module, which is able to split your string as well.
I had a larger string to split at every nth occurrence of a delimiter; I ended up with the following code:
# Split after every 6 space-separated groups
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!
A function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups

s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print(split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter, maxsplit, to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0]   # gives '20_231'
second_part = text.split('_', 2)[2]   # gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
I have a string '0000000000000201' in Python:
dpid_string = '0000000000000201'
What is the best way to convert this to the following string?
00:00:00:00:00:00:02:01
You'd partition the string into chunks of size 2, and join them with str.join():
':'.join([dpid_string[i:i + 2] for i in range(0, len(dpid_string), 2)])
Demo:
>>> dpid_string = '0000000000000201'
>>> ':'.join([dpid_string[i:i + 2] for i in range(0, len(dpid_string), 2)])
'00:00:00:00:00:00:02:01'
seq = '0000000000000201'
length = 2
":".join([seq[i:i+length] for i in range(0, len(seq), length)])
Although not very simple, you can do
dpid_string = '0000000000000201'
''.join([':' + char if not i % 2 else char for i, char in enumerate(dpid_string)])[1:]
To break it down from within the list comprehension:
[char for char in dpid_string] just loops over the characters and returns them as a list.
We want a string back, so we join the full list using ''.join(list).
Now we want to react to the location of each character, so we need its index; therefore we use for i, char in enumerate(list).
If the index is even, we add a colon before the char (i % 2 is falsy).
This leaves us with a colon at index 0, which we remove by slicing with [1:].
An alternative using re.sub:
import re
dpid_string = '0000000000000201'
subbed = re.sub('(..)(?!$)', r'\1:', dpid_string)
# 00:00:00:00:00:00:02:01
Read this as: take every two characters that aren't at the end of the string and replace them with those two characters followed by :.
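If you would rather let the standard library do the chunking, textwrap.wrap breaks a long "word" into pieces of at most the given width, which fits this fixed-width case:

```python
import textwrap

dpid_string = '0000000000000201'
# wrap() splits the string into chunks of at most 2 characters,
# breaking the long "word" as it goes (break_long_words is True by default).
chunks = textwrap.wrap(dpid_string, 2)
print(':'.join(chunks))
```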
I am trying to write a script that will find strings that share an overlapping region of 5 letters at the beginning or end of each string (shown in example below).
facgakfjeakfjekfzpgghi
pgghiaewkfjaekfjkjakjfkj
kjfkjaejfaefkajewf
I am trying to create a new string which concatenates all three, so the output would be:
facgakfjeakfjekfzpgghiaewkfjaekfjkjakjfkjaejfaefkajewf
Edit:
This is the input:
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
(the list is not ordered)
What I've written so far, which is not correct:
def findOverlap(seq):
    i = 0
    while i < len(seq):
        for x[i]:
            # check if x[0:5] == [:5] elsewhere
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
findOverlap(x)
Create a dictionary mapping the first 5 characters of each string to its tail
strings = {s[:5]: s[5:] for s in x}
and a set of all the suffixes:
suffixes = set(s[-5:] for s in x)
Now find the string whose prefix does not match any suffix:
prefix = next(p for p in strings if p not in suffixes)
Now we can follow the chain of strings:
result = [prefix]
while prefix in strings:
    result.append(strings[prefix])
    prefix = strings[prefix][-5:]
print("".join(result))
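Putting those steps together into one runnable snippet, using the tuple from the question:

```python
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')

# Map each string's first 5 characters to its remainder.
strings = {s[:5]: s[5:] for s in x}
# Collect every string's 5-character suffix.
suffixes = set(s[-5:] for s in x)
# The chain starts at the one prefix that is not any string's suffix.
prefix = next(p for p in strings if p not in suffixes)

result = [prefix]
while prefix in strings:
    result.append(strings[prefix])
    prefix = strings[prefix][-5:]

print("".join(result))
```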
A brute-force approach: try all permutations and return the first whose adjacent terms link up:
def solution(x):
    from itertools import permutations
    for perm in permutations(x):
        linked = [perm[i][:-5] for i in range(len(perm)-1)
                  if perm[i][-5:] == perm[i+1][:5]]
        if len(perm)-1 == len(linked):
            return "".join(linked) + perm[-1]
    return None

x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
print(solution(x))
Alternatively, loop over each pair of candidates, reverse the second string, and reuse a common-prefix check.
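A minimal sketch of that pairwise check (the fixed overlap length of 5 comes from the question; the helper name is my own):

```python
def overlaps(a, b, k=5):
    # True when the last k characters of a equal the first k of b,
    # i.e. b can be appended to a with a k-character overlap.
    return a[-k:] == b[:k]

print(overlaps('facgakfjeakfjekfzpgghi', 'pgghiaewkfjaekfjkjakjfkj'))  # True
```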
How can I obtain the number of overlapping regex matches using Python?
I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:
input example string: akka
search pattern: a.*k
A proper function should yield 2 as the number of matches, since there are two possible end positions (two k letters).
The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).
I think that what you're looking for is probably better done with a parsing library like lepl:
>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]
I believe that the length of the output from parser.parse_all is what you're looking for.
Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.
Edit: Based on the comment from @Shamanu4, I see you want matching results starting from any position; you can do that as follows:
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
Yes, it is ugly and unoptimized, but it seems to work. This is a simple try of all possible but unique variants:
def myregex(pattern, text, dir=0):
    import re
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            for r in myregex(pattern, "%s%s%s" % (m.group('prefix'), m.group('suffix')[1:], m.group('end')), 1):
                yield r
            if dir < 1:
                for r in myregex(pattern, "%s%s%s" % (m.group('prefix'), m.group('suffix')[:-1], m.group('end')), -1):
                    yield r

def myprocess(pattern, text):
    parts = pattern.split("*")
    for i in range(0, len(parts)-1):
        res = ""
        for j in range(0, len(parts)):
            if j == 0:
                res += "(?P<prefix>"
            if j == i:
                res += ")(?P<suffix>"
            res += parts[j]
            if j == i+1:
                res += ")(?P<end>"
            if j < len(parts)-1:
                if j == i:
                    res += ".*"
                else:
                    res += ".*?"
            else:
                res += ")"
        for r in myregex(res, text):
            yield r

def mycount(pattern, text):
    return set(myprocess(pattern, text))
test:
>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])