str.replace() or re.sub() continually until substring no longer present - python

Let's say I have the following string: 'streets are shiny.' I wish to find every occurrence of the string 'st' and replace it with 'ts'. So the result should read 'tsreets are shiny.'.
I know this can be done using re.sub() or str.replace(). However, say I have the following strings:
'st'
'sts'
'stst'
I want them to change to 'ts','tss' and 'ttss' respectively, as I want all occurrences of 'st' to change to 'ts'.
What is the best way to replace these strings with optimal runtime? I know I could continually perform a check to see if "st" in string until this returns False, but is there a better way?

I think that a while loop that just checks if the 'st' is in the string is best in this case:
def recursive_replace(s, sub, new):
    # Repeat the replacement until no occurrence of sub is left.
    while sub in s:
        s = s.replace(sub, new)
    return s
tests = ['st', 'sts', 'stst']
print([recursive_replace(test, 'st', 'ts') for test in tests])
#OUT: ['ts', 'tss', 'ttss']
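One thing to watch with a loop like this: it never terminates if the replacement itself contains the search string (for example, replacing 'st' with 'sst'). A minimal sketch of a guarded variant follows; the name replace_until_gone and the max_passes cap are assumptions added for illustration, and the guard only covers the most obvious non-terminating case.

```python
def replace_until_gone(s, sub, new, max_passes=10_000):
    """Apply s.replace(sub, new) repeatedly until sub no longer occurs."""
    if not sub:
        raise ValueError("sub must be non-empty")
    if sub in new:
        # Every pass would reintroduce sub, so the loop could never finish.
        raise ValueError("replacement contains the search string")
    for _ in range(max_passes):
        if sub not in s:
            return s
        s = s.replace(sub, new)
    # Hypothetical safety net for pathological inputs.
    raise RuntimeError("gave up after max_passes passes")


print([replace_until_gone(t, "st", "ts") for t in ["st", "sts", "stst"]])
```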

While the looping solutions are probably the simplest, you can actually write a re.sub call with a custom function to do all the transformations at once.
The key insight for this is that your rule (changing st to ts) will end up moving all the s's in a block of mixed s's and t's to the right of all the t's. We can simply count the s's and t's and make an appropriate replacement:
import re
def sub_func(match):
    text = match.group(1)
    return "t"*text.count("t") + "s"*text.count("s")
re.sub(r'(s[st]*t)', sub_func, text)
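To check that the single-pass regex agrees with the looping approach, here is a small self-contained sketch. The wrapper names move_t_left and loop_replace are made up for this comparison; the pattern is the one from the answer above.

```python
import re


def _sort_block(match):
    # Within one matched run of s/t characters, put every t before every s.
    block = match.group(0)
    return "t" * block.count("t") + "s" * block.count("s")


def move_t_left(text):
    # A run starts at an 's' and ends at the last 't' of the same s/t block.
    return re.sub(r"s[st]*t", _sort_block, text)


def loop_replace(text):
    # Reference behaviour: keep replacing until no 'st' is left.
    while "st" in text:
        text = text.replace("st", "ts")
    return text


for sample in ["st", "sts", "stst", "streets are shiny."]:
    assert move_t_left(sample) == loop_replace(sample)
    print(repr(sample), "->", repr(move_t_left(sample)))
```

The regex version touches each character a bounded number of times, while the loop may need one full pass per position an s has to travel.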

You can do that with a pretty simple while loop:
s = "stst"
while 'st' in s:
    s = s.replace("st", "ts")
print(s)
# ttss

If you want to keep checking until nothing is left to replace, then the other answers work well (bear in mind that with something like stt you would get stt->tst->tts). I don't know if you want that.
I think, however, that you are trying to replace multiple occurrences of st with ts in a single pass. If that is the case, you should definitely use str.replace (string.replace in the Python 2 string module). It replaces every occurrence of a substring, or only the first count occurrences if you pass the optional limit.
This should be faster according to this.
str.replace(old, new[, count])
example:
>>> st = 'streets are shiny.streets are shiny.streets are shiny.'
>>> st.replace('st', 'ts')
'tsreets are shiny.tsreets are shiny.tsreets are shiny.'

Naively you could do:
>>> ['t'*s.count('t')+'s'*s.count('s') for s in ['st', 'sts', 'stst']]
['ts', 'tss', 'ttss']

Related

How to split strings with special characters without removing those characters?

I'm writing a function that needs to return an abbreviated version of a str. Each word in the returned str must consist of its first letter, the number of characters removed, and its last letter; the abbreviation is applied per word, not per sentence. After that I need to join the words again in the same format, including the special characters. I tried using the re.findall() method, but it drops the special characters, so " ".join() leaves them out of the result.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
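As one possible refinement along those lines, the sketch below uses three capturing groups instead of indexing into group(0), and re.IGNORECASE so that uppercase letters are handled too. The compiled-pattern name WORD is made up here, and contractions such as "don't" are still left untouched because neither half is long enough to match.

```python
import re

# Three groups: first letter, inner letters (at least two), last letter.
WORD = re.compile(r"\b([a-z])([a-z]{2,})([a-z])\b", re.IGNORECASE)


def abbreviate(text):
    # Keep the outer letters and replace the inner run with its length.
    return WORD.sub(lambda m: f"{m.group(1)}{len(m.group(2))}{m.group(3)}", text)


print(abbreviate("Elephant-rides are really FUN!"))
```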
Thank you for viewing my question. After a few more searches and some trial and error, I finally found a way to make my code work without changing it too much. I simply substituted re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and changed " ".join() to "".join(), since the separating spaces are no longer needed (the capturing group makes re.split keep the delimiters).
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
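The reason this works is that re.split keeps the separators in its result when the pattern contains a capturing group. A short sketch of the difference, using an example sentence chosen for illustration:

```python
import re

text = "elephant-rides are fun!"

# With a capturing group, each delimiter is returned as its own list item.
kept = re.split(r"([\W\d_])", text)

# Without the group, the delimiters are discarded.
dropped = re.split(r"[\W\d_]", text)

print(kept)
print(dropped)

# Because nothing was thrown away, joining with "" restores the input.
assert "".join(kept) == text
```

Note the empty string at the end of both lists: it appears because the text ends with a delimiter, and it is harmless when joining.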

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# list() is needed on Python 3, where filter() returns an iterator
time_info = list(filter(None, str_list))
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
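A minimal sketch of the named-group variant, reusing the pattern and the placeholder group names groupA and groupB from the sentence above:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# (?P<name>...) labels a group so it can be looked up by name.
match = re.search(r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.', f)

time_info = match.groupdict()
print(time_info)
print(match.group('groupA'))
```

In real code more descriptive names (for instance start and end) would probably read better, and it is worth checking that match is not None before using it.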
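If the timestamps are always after the second _ then you can use str.split and str.strip. Note that strip(".txt") removes any of the characters ., t and x from both ends rather than the literal suffix; that is harmless here because the rest of the name begins and ends with digits: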
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
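For the file names in the question, a findall version might look like the sketch below. It assumes every timestamp has the shape of eight digits, a T, then six digits, which matches the two sample names but is not stated anywhere in the question.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match the timestamps themselves instead of splitting on what surrounds them.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)
```

Because findall only returns what the pattern matches, there are no leading or trailing empty strings to clean up.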
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Right-to-left string replace in Python?

I want to do a string replace in Python, but only do the first instance going from right to left. In an ideal world I'd have:
myStr = "mississippi"
print myStr.rreplace("iss","XXX",1)
> missXXXippi
What's the best way of doing this, given that rreplace doesn't exist?
rsplit and join could be used to simulate the effects of an rreplace
>>> 'XXX'.join('mississippi'.rsplit('iss', 1))
'missXXXippi'
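Wrapped in a helper, the rsplit/join idea also gives you a count argument for free. This is only a sketch; the function name rreplace and its default count of 1 are choices made here to mirror the method wished for in the question.

```python
def rreplace(s, old, new, count=1):
    # rsplit cuts at the last `count` occurrences; join puts `new` in the cuts.
    return new.join(s.rsplit(old, count))


print(rreplace("mississippi", "iss", "XXX"))
print(rreplace("mississippi", "iss", "XXX", 2))
print(rreplace("mississippi", "zzz", "XXX"))
```

If old does not occur at all, rsplit returns a one-element list and the string comes back unchanged.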
>>> myStr[::-1].replace("iss"[::-1], "XXX"[::-1], 1)[::-1]
'missXXXippi'
>>> re.sub(r'(.*)iss',r'\1XXX',myStr)
'missXXXippi'
The regex engine consumes the whole string and then starts backtracking until iss is found. Then it replaces the found string with the needed pattern.
Some speed tests
The solution with [::-1] turns out to be faster.
The solution with re was only faster for long strings (longer than 1 million symbols).
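The original timings are not reproduced here, but a rough way to rerun such a comparison yourself is sketched below with timeit. The function names, the test string and the repeat count are arbitrary choices, and the numbers will differ between machines and Python versions, so no expected timings are shown.

```python
import re
import timeit


def by_slicing(s, old, new):
    # Reverse everything, replace the first hit, reverse back.
    return s[::-1].replace(old[::-1], new[::-1], 1)[::-1]


def by_rsplit(s, old, new):
    return new.join(s.rsplit(old, 1))


def by_regex(s, old, new):
    # re.escape keeps `old` literal; the lambda keeps `new` literal as well.
    pattern = "(.*)" + re.escape(old)
    return re.sub(pattern, lambda m: m.group(1) + new, s, count=1, flags=re.DOTALL)


text = "mississippi" * 100
for fn in (by_slicing, by_rsplit, by_regex):
    seconds = timeit.timeit(lambda: fn(text, "iss", "XXX"), number=1000)
    print(f"{fn.__name__}: {seconds:.4f}s")
```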
you may reverse a string like so:
myStr[::-1]
to replace, just add the .replace, with the search and replacement strings reversed as well (here "ssi" is "iss" reversed, and "XXX" reads the same both ways):
print(myStr[::-1].replace("ssi","XXX",1))
however now your string is backwards, so re-reverse it:
myStr[::-1].replace("ssi","XXX",1)[::-1]
and you're done.
If your replace strings are static, just reverse them in the source file to reduce overhead.
If not, the same trick will work:
myStr[::-1].replace("iss"[::-1],"XXX"[::-1],1)[::-1]
def rreplace(s, old, new):
    try:
        place = s.rindex(old)
        return ''.join((s[:place], new, s[place+len(old):]))
    except ValueError:
        return s
You could also use str.rpartition() which splits the string by the specified separator from right and returns a tuple:
myStr = "mississippi"
first, sep, last = myStr.rpartition('iss')
print(first + 'XXX' + last)
# missXXXippi
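One caveat with rpartition: when the separator is not found it returns two empty strings followed by the whole input, so the snippet above would prepend 'XXX' to the unchanged string. A small sketch of a guarded helper (the name rreplace_once is made up for this example):

```python
def rreplace_once(s, old, new):
    head, sep, tail = s.rpartition(old)
    if not sep:
        # `old` does not occur in `s`: leave the string as it is.
        return s
    return head + new + tail


print(rreplace_once("mississippi", "iss", "XXX"))
print(rreplace_once("mississippi", "zzz", "XXX"))
```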
Using the package fishhook (available through pip), you can add this functionality.
from fishhook import hook
@hook(str)
def rreplace(self, old, new, count=-1):
    return self[::-1].replace(old[::-1], new[::-1], count)[::-1]
print('abxycdxyef'.rreplace('xy', '--', count=1))
# abxycd--ef
It's kind of a dirty hack, but you could reverse the string and replace with also reversed strings. Python strings have no reverse() method, so use the [::-1] slice:
"mississippi"[::-1].replace('iss'[::-1], 'XXX'[::-1], 1)[::-1]

Convert not_camel_case to notCamelCase and/or NotCamelCase in Python?

Basically, the reverse of this. Here's my attempt, but it's not working.
def titlecase(value):
    s1 = re.sub('(_)([a-z][A-Z][0-9]+)', r'\2'.upper(), value)
    return s1
def titlecase(value):
    return "".join(word.title() for word in value.split("_"))
Python is more readable than regex, and easier to fix when it's not doing what you want.
If you want the first letter lowercase as well, I would use a second function that calls the function above to do most of the work, then just lowercases the first letter:
def titlecase2(value):
    return value[:1].lower() + titlecase(value)[1:]
You have an error with your regex. Instead of
([a-z][A-Z][0-9]+) # would match 'oN3' but not 'one'
use
([a-zA-Z0-9]+) # matches any alphanumeric word
However, this also won't work because r'\2'.upper() can't be used that way. Instead, try:
s1 = re.sub('(_)([a-zA-Z0-9]+)', lambda p: p.group(2).capitalize(), value)
@kindall provided a good solution (credit goes to him).
But if you want the "myCamel" syntax, where the first word is not capitalized, then you have to change it a bit:
def titlecase(value):
    rest = value.split("_")
    return rest[0]+"".join(word.title() for word in rest[1:])
For NotCamelCase, using a regex or a loop sounds like overkill.
str.title().replace("_", "")
Like jtbandes said, you should mash the character classes together like
([a-zA-Z0-9]+)
The next trick is what you do with the replacement. When you say
r'\2'.upper()
the upper() actually happens before sub is called, so sub only ever sees the unchanged template. But you can use another feature of sub: you can pass a function to handle the match:
re.sub('(_)([a-zA-Z0-9]+)', lambda match: match.group(2).capitalize(), value)
Now your lambda will get called with the match. sub already replaces every match in the string; subn does the same but returns a (new_string, number_of_substitutions) tuple, hence the [0]:
re.subn('(_)([a-zA-Z0-9]+)', lambda match: match.group(2).capitalize(), value)[0]
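Putting those pieces together, the following sketch shows why the .upper() call has no effect and what sub and subn each return. The sample value is taken from the question title; the variable names are arbitrary.

```python
import re

value = "not_camel_case"
pattern = r"(_)([a-zA-Z0-9]+)"

# .upper() runs on the template string itself, which has no letters to change.
assert r"\2".upper() == r"\2"

# A function receives each match object and returns its replacement text.
camel = re.sub(pattern, lambda m: m.group(2).capitalize(), value)

# subn performs the same substitutions and also reports how many were made.
camel_again, n_subs = re.subn(pattern, lambda m: m.group(2).capitalize(), value)

print(camel)
print(camel_again, n_subs)
```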

How to convert specific character sequences in a string to upper case using Python?

I am looking to accomplish the following and am wondering if anyone has a suggestion as to how best go about it.
I have a string, say 'this-is,-toronto.-and-this-is,-boston', and I would like to convert all occurrences of ',-[a-z]' to ',-[A-Z]'. In this case the result of the conversion would be 'this-is,-Toronto.-and-this-is,-Boston'.
I've been trying to get something working with re.sub(), but as yet haven't figured out how:
testString = 'this-is,-toronto.-and-this-is,-boston'
re.sub(r',_([a-z])', r',_??', testString)
Thanks!
re.sub can take a function which returns the replacement string:
import re
s = 'this-is,-toronto.-and-this-is,-boston'
t = re.sub(',-[a-z]', lambda x: x.group(0).upper(), s)
print(t)
prints
this-is,-Toronto.-and-this-is,-Boston
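If you prefer to keep a capturing group around the letter, as in the attempt from the question, the replacement function can rebuild the match from its groups. A minimal sketch of that variant:

```python
import re

s = 'this-is,-toronto.-and-this-is,-boston'

# Group 1 is the ",-" prefix (kept as is), group 2 the letter to upper-case.
t = re.sub(r'(,-)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s)
print(t)
```

Calling upper() on the whole match, as in the answer above, gives the same result here because ",-" has no letters to change; the grouped form only matters when the prefix could contain letters.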
