Python removing delimiters from strings - python

I have two related questions.
def remove_delimiters(delimiters, s):
    for d in delimiters:
        ind = s.find(d)
        while ind != -1:
            s = s[:ind] + s[ind+1:]
            ind = s.find(d)
    return ' '.join(s.split())

delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'", "..."]
d_dataset_list = ['hey-you...are you ok?']
d_list = []
for d in d_dataset_list:
    d_list.append(remove_delimiters(delimiters, d))
print(d_list)
Output = 'heyyouare you ok'
What is the best way to avoid words being joined together when a delimiter is removed? For example, so that the output is 'hey you are you ok'.
There may be many different runs of dots, for example .. or .......... etc. How can I implement a rule so that whenever more than one . appears in a row, the whole run is removed? I want to avoid hard-coding every sequence in my delimiters list. Thank you.

You could try something like this:
Given the delimiters d, join them into a regular-expression character class:
>>> import re
>>> d = ",.!?/&-:;#'..."
>>> "["+"\\".join(d)+"]"
"[,\\.\\!\\?\\/\\&\\-\\:\\;\\#\\'\\.\\.\\.]"
Split the string using this regex with re.split
>>> s = 'hey-you...are you ok?'
>>> re.split("["+"\\".join(d)+"]", s)
['hey', 'you', '', '', 'are you ok', '']
Join all the non-empty fragments back together
>>> ' '.join(w for w in re.split("["+"\\".join(d)+"]", s) if w)
'hey you are you ok'
Also, if you just want to remove all non-word characters, you can use the character class \W instead of manually enumerating all the delimiters (note that \W also matches whitespace, and that the underscore counts as a word character):
>>> ' '.join(w for w in re.split(r"\W", s) if w)
'hey you are you ok'
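The same idea can be wrapped in a function that replaces each run of delimiters with a single space in one pass. This is only a sketch: it assumes every delimiter is a single character, and the function name simply mirrors the one in the question rather than any library API.

```python
import re

def remove_delimiters(delimiters, s):
    # Assumption: each delimiter is a single character, so a character class works.
    # re.escape protects characters such as '-' or ']' inside the class.
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    # Replace each run of delimiters with one space, then normalise whitespace.
    return " ".join(re.sub(pattern, " ", s).split())

print(remove_delimiters(",.!?/&-:;#'", "hey-you...are you ok?"))
# hey you are you ok
```

Because the class is followed by +, a run such as ... or .......... is handled without listing each length separately.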

First of all, your function for removing delimiters can be simplified greatly by using str.replace (http://www.tutorialspoint.com/python/string_replace.htm).
This also answers your first question: instead of just removing each delimiter, replace it with a space, then collapse the extra spaces using the pattern you already have (split() with no arguments treats consecutive whitespace as one separator).
A better function, which does this, would be:
def remove_delimiters(delimiters, s):
    new_s = s
    for i in delimiters:  # replace each delimiter in turn with a space
        new_s = new_s.replace(i, ' ')
    return ' '.join(new_s.split())
To answer your second question, I'd say it's time for regular expressions. Note that the dot must be escaped and the + placed outside any brackets, so that a whole run of dots is replaced at once:
>>> import re
>>> ss = 'hey ... you are ....... what?'
>>> print(re.sub(r'\.+', ' ', ss))
hey   you are   what?
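If only runs of two or more dots should be treated as a delimiter, as the question describes, a quantifier with a lower bound can express that rule. The following is a minimal sketch; collapse_dot_runs is a hypothetical helper name, not something from the question.

```python
import re

def collapse_dot_runs(s):
    # \.{2,} matches two or more consecutive literal dots; a single dot is left alone.
    spaced = re.sub(r"\.{2,}", " ", s)
    # Collapse the whitespace that the replacement may have introduced.
    return " ".join(spaced.split())

print(collapse_dot_runs("hey ... you are ....... what?"))
# hey you are what?
```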

Related

How do I split a string with several delimiters, but only once on each delimiter? Python

I am trying to split a string such as the one below, with all of the delimiters below, but only once.
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = '\t \v ;'
The output, in this case, would be:
['it', ' seems; like', 'a good\tday to watch', 'a\vmovie.']
Obviously the example above is a nonsense example, but I am trying to learn whether or not this is possible. Would a fairly involved regex be in order?
Apologies if this question had been asked before. I did a fair bit of searching and could not find something quite like my example. Thanks for your time!
This should do the trick:
import re

def split_once_by(s, delims):
    delims = set(delims)
    parts = []
    while delims:
        delim_re = '({})'.format('|'.join(re.escape(d) for d in delims))
        result = re.split(delim_re, s, maxsplit=1)
        if len(result) == 3:
            first, delim, s = result
            parts.append(first)
            delims.remove(delim)
        else:
            break
    parts.append(s)
    return parts
Example:
>>> split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;')
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
Burning Alcohol's answer inspired me to write this (IMO) better function:
def split_once_by(s, delims):
    split_points = sorted((s.find(d), -len(d), d) for d in delims)
    start = 0
    for stop, _longest_first, d in split_points:
        if stop < start: continue
        yield s[start:stop]
        start = stop + len(d)
    yield s[start:]
with usage:
>>> list(split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;'))
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
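Two details of the generator are easy to miss: a delimiter that does not occur at all gets the index -1 from str.find and is therefore skipped, and the -len(d) entry makes the longer delimiter win when two delimiters start at the same position. The sketch below repeats the function so that it runs on its own; the sample strings are made up for illustration.

```python
def split_once_by(s, delims):
    # Sort by first occurrence; on ties, the longer delimiter comes first.
    split_points = sorted((s.find(d), -len(d), d) for d in delims)
    start = 0
    for stop, _longest_first, d in split_points:
        if stop < start:
            # Either not found (-1) or already consumed by an earlier split.
            continue
        yield s[start:stop]
        start = stop + len(d)
    yield s[start:]

print(list(split_once_by('a;b', [';', '\t'])))   # '\t' is absent and ignored
# ['a', 'b']
print(list(split_once_by('xaby', ['a', 'ab'])))  # 'ab' wins over 'a'
# ['x', 'y']
```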
A simple algorithm would do:
test_string = r'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = [r'\t', r'\v', ';']
# find the index of each first occurrence and sort by it
delimiters = sorted(delimiters, key=lambda delimiter: test_string.find(delimiter))
splitted_string = [test_string]
# perform split with option maxsplit
for index, delimiter in enumerate(delimiters):
    if delimiter in splitted_string[-1]:
        splitted_string += splitted_string[-1].split(delimiter, maxsplit=1)
        splitted_string.pop(index)
print(splitted_string)
# ['it', ' seems; like', 'a good\\tday to watch', 'a\\vmovie.']
Just create a list of patterns and apply each of them once:
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
patterns = ['\t', '\v', ';']
for pattern in patterns:
    string = '*****'.join(string.split(pattern, maxsplit=1))
print(string.split('*****'))
Output:
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
So, what is "*****"?
On each iteration, the split method returns a list, so in the next iteration you can't call .split() again (a list has no such method). You therefore join the pieces of that list with some unusual marker like "*****", "###", "^^^^^^^" or whatever you want, so that split() can be applied again in the next iteration. The marker must not occur in the original string.
Finally, each "*****" in your string stands for one pattern from the list, so you can use it to make the final split.

Splitting string using different scenarios using regex

I have two scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
I want it to be split as:
'hello', 'getting', 'good.<li>hii' (Scenario 1)
'hello', 'getting', 'good', 'li', 'hi' (Scenario 2)
Any ideas, please?
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
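Both results above start with an empty string because the input begins with separator characters. A small wrapper can drop such empty pieces. This is only a sketch, and split_nonempty is a hypothetical helper name.

```python
import re

def split_nonempty(pattern, text):
    # re.split yields '' when the text starts or ends with a separator; filter those out.
    return [part for part in re.split(pattern, text) if part]

s = "##$hello?? getting good.<li>hii"
print(split_nonempty(r"[^\w<>.]+", s))
# ['hello', 'getting', 'good.<li>hii']
print(split_nonempty(r"\W+", s))
# ['hello', 'getting', 'good', 'li', 'hii']
```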
This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
    new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
    new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches whatever is in the brackets one or more times, as many as possible. In this case that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
For the first scenario, just use a regex to find all runs consisting of word characters and <, > or .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one you need to replace the repeated characters only if the word is not a valid English word. You can do this using the nltk corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for a solution without regex: string.punctuation gives you a string of all ASCII punctuation characters.
Use it with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is a step-by-step description of how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']

How to do CamelCase split in python

What I was trying to achieve, was something like this:
>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
So I searched and found this perfect regular expression:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
As the next logical step I tried:
>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']
Why does this not work, and how do I achieve the result from the linked question in python?
Edit: Solution summary
I tested all provided solutions with a few test cases:
string: ''
    AplusKminus: ['']
    casimir_et_hippolyte: []
    two_hundred_success: []
    kalefranz: string index out of range # with modification: either [] or ['']
string: ' '
    AplusKminus: [' ']
    casimir_et_hippolyte: []
    two_hundred_success: [' ']
    kalefranz: [' ']
string: 'lower'
    all algorithms: ['lower']
string: 'UPPER'
    all algorithms: ['UPPER']
string: 'Initial'
    all algorithms: ['Initial']
string: 'dromedaryCase'
    AplusKminus: ['dromedary', 'Case']
    casimir_et_hippolyte: ['dromedary', 'Case']
    two_hundred_success: ['dromedary', 'Case']
    kalefranz: ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']
string: 'CamelCase'
    all algorithms: ['Camel', 'Case']
string: 'ABCWordDEF'
    AplusKminus: ['ABC', 'Word', 'DEF']
    casimir_et_hippolyte: ['ABC', 'Word', 'DEF']
    two_hundred_success: ['ABC', 'Word', 'DEF']
    kalefranz: ['ABCWord', 'DEF']
In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space and thereby violates the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string on an empty-string input, whereas the solution by @200_success returns an empty list.
I don't know where the Python community stands on that issue, so I am fine with either one. And since 200_success's solution is simpler, I accepted it as the correct answer.
As @AplusKminus has explained, re.split() never splits on an empty pattern match (this was the behavior before Python 3.7; from 3.7 on, re.split() does split on empty matches). Therefore, instead of splitting, you should try finding the components you are interested in.
Here is a solution using re.finditer() that emulates splitting:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
Use re.sub() and split()
import re
name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()
Result
'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
Most of the time, when you don't need to check the format of a string, a global search (findall) is simpler than a split (for the same result):
re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')
returns
['Camel', 'Case', 'XYZ']
To deal with dromedary too, you can use:
re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
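The two lookahead spellings can be compared side by side. The sketch below checks them on a handful of sample identifiers chosen for illustration; the constant names are hypothetical.

```python
import re

# Original form: an uppercase run must be followed by an uppercase letter or the end.
WITH_ALTERNATION = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
# Shortened form: the run must NOT be followed by a non-uppercase character.
WITH_NEGATION = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

for sample in ['camelCaseXYZ', 'XYZCamelCase', 'IPAddress', 'lower', 'UPPER']:
    assert WITH_ALTERNATION.findall(sample) == WITH_NEGATION.findall(sample)

print(WITH_NEGATION.findall('XYZCamelCase'))
# ['XYZ', 'Camel', 'Case']
```

One caveat: $ also matches just before a trailing newline, so the two forms can give different results for input that ends in a newline.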
Working solution, without regexp
I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.
Here is a quite straightforward solution in pure python:
def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]



And some tests
TESTS = [
    ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
    ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
    ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ("XYZCamelCaseXYZ", ['XYZ', 'Camel', 'Case', 'XYZ']),
    ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
    ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
    ("Ta", ['Ta']),
    ("aT", ['a', 'T']),
    ("a", ['a']),
    ("T", ['T']),
    ("", []),
]

def test():
    for (q, a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()
Edit: a solution which streams data in one pass
This solution leverages the fact that the decision to split word or not can be taken locally, just considering the current character and the previous one.
def camel_case_split(s):
    u = True  # case of previous char
    w = b = ''  # current word, buffer for last uppercase letter
    for c in s:
        o = c.isupper()
        if u and o:
            w += b
            b = c
        elif u and not o:
            if len(w) > 0:
                yield w
            w = b + c
            b = ''
        elif not u and o:
            yield w
            w = ''
            b = c
        else:  # not u and not o:
            w += c
        u = o
    if len(w) > 0 or len(b) > 0:  # flush
        yield w + b
In theory this is faster and uses less memory.
The same test suite applies, but the list must be built by the caller:
def test():
    for (q, a) in TESTS:
        r = list(camel_case_split(q))
        print(q, a, r)
        assert r == a
I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.
RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)
The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:
assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
import re
re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result (requires Python 3.7+, where re.split() splits on empty matches)
# '(?<=[a-z])' --> lookbehind: preceded by a lowercase char (group A)
# '(?=[A-Z])' --> lookahead: followed by an UPPERCASE char (group B)
# '(group A)(group B)' --> matches between 'aA' or 'aB' or 'bA' and so on
The documentation for python's re.split says:
Note that split will never split a string on an empty pattern match.
When seeing this:
>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.
Since the documentation states that this is not a bug but intended behavior, you have to work around it when trying to create a camel case split:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
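On Python 3.7 and later the workaround is no longer needed, because re.split() there does split on empty matches. A minimal sketch, assuming such an interpreter (the function name camel_case_split_37 is made up for this example):

```python
import re

# The zero-width pattern from the question, unchanged.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def camel_case_split_37(identifier):
    # Assumption: Python 3.7+; older versions return the input unsplit.
    return CAMEL_BOUNDARY.split(identifier)

print(camel_case_split_37("CamelCaseXYZ"))
# ['Camel', 'Case', 'XYZ']
print(camel_case_split_37("XYZCamelCase"))
# ['XYZ', 'Camel', 'Case']
```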
This solution also supports numbers and spaces, and automatically removes underscores:
def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)
Some tests:
tests = [
"XYZCamelCase",
"CamelCaseXYZ",
"Camel_CaseXYZ",
"3DCamelCase",
"Camel5Case",
"Camel5Case5D",
"Camel Case XYZ"
]
for test in tests:
    print(test, "=>", camel_terms(test))
results:
XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
Simple solution:
re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
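To turn that substitution into a list, the result can be split on whitespace. The sketch below (simple_camel_split is a hypothetical name) also shows a limitation: since the pattern only looks for a lowercase letter or digit followed by an uppercase letter, a leading acronym stays glued to the next word.

```python
import re

def simple_camel_split(text):
    # Insert a space between a lowercase letter or digit and a following uppercase letter.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
    return spaced.split()

print(simple_camel_split("CamelCaseXYZ"))
# ['Camel', 'Case', 'XYZ']
print(simple_camel_split("XYZCamelCase"))  # acronym is not separated
# ['XYZCamel', 'Case']
```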
Here's another solution that requires less code and no complicated regular expressions:
def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]
Edit
The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like
def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()
    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
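Both versions above index string[0], so an empty string raises an IndexError, and both force the first character to upper case. A guarded variant might look like the following sketch; it is one possible modification, not necessarily the one the answer's author had in mind, and the name camel_case_split_safe is hypothetical.

```python
def camel_case_split_safe(string):
    # Assumption: an empty input should produce an empty list.
    if not string:
        return []
    # Keep the first character as-is instead of upper-casing it.
    bldrs = [[string[0]]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])  # lower -> upper transition starts a new word
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]

print(camel_case_split_safe(""))
# []
print(camel_case_split_safe("dromedaryCase"))
# ['dromedary', 'Case']
```

Like the original, this variant only splits on a lowercase-to-uppercase transition, so 'ABCWordDEF' still yields ['ABCWord', 'DEF'].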
I know that the question is tagged regex. Still, I always try to stay as far away from regex as possible, so here is my solution without regex:
from functools import reduce  # reduce is a builtin only in Python 2

def split_camel(text, char):
    if len(text) <= 1:  # To avoid adding a wrong space in the beginning
        return text + char
    if char.isupper() and text[-1].islower():  # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else:  # Do nothing part
        return text + char

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"
EDIT:
As suggested, here is the code to put the functionality in a single function.
from functools import reduce  # reduce is a builtin only in Python 2

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1:  # To avoid adding a wrong space in the beginning
            return text + char
        if char.isupper() and text[-1].islower():  # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else:  # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']
Putting a more comprehensive approach out there. It takes care of several issues such as numbers, strings starting with a lowercase letter, single-letter words, etc.
import re

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)
        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words

    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words
My requirement was a bit more specific than the OP's. In particular, in addition to handling all of the OP's cases, I needed the following, which the other solutions do not provide:
- treat all non-alphanumeric input (e.g. !@#$%^&*() etc.) as a word separator
- handle digits as follows:
  - cannot be in the middle of a word
  - cannot be at the beginning of the word unless the phrase starts with a digit
def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',  # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', '\\1 \\2',  # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])', '\\1 \\2',  # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2',  # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]
Output:
for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))
:[]
:[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']
Maybe this will be enough for some people:
a = "SomeCamelTextUpper"
def camelText(val):
    return ''.join([' ' + i if i.isupper() else i for i in val]).strip()
print(camelText(a))
It doesn't work with input like "CamelXYZ", but for the 'typical' CamelCase scenario it should work just fine.
I think the following is the optimum:
import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))

print(count_word())

Grouping the characters and performing substitution

I want to replace my string based on the values in my dictionary. I want to try this with regular expression.
d = { 't':'ch' , 'r' : 'gh'}
s = ' Text to replace '
m = re.search('#a pattern to just get each character ',s)
m.group() # this should get me 'T' 'e' 'x' 't' .....
# How can I replace each character in string s with its corresponding value in my dictionary? I looked at re.sub() but couldn't figure out how it can be used here.
I want to generate an output -> Texch cho gheplace
Using re.sub:
>>> d = { 't':'ch' , 'r' : 'gh'}
>>> s = ' Text to replace '
>>> import re
>>> pattern = '|'.join(map(re.escape, d))
>>> re.sub(pattern, lambda m: d[m.group()], s)
' Texch cho gheplace '
The second argument to the re.sub can be a function. The return value of the function is used as a replacement string.
If no character in the values of the dictionary also appears as a key in the dictionary, then it's fairly simple. You can use str.replace directly, like this:
for char in d:
    s = s.replace(char, d[char])
print(s) # Texch cho gheplace
Even simpler, you can use the following, which works even if the keys appear in any of the values in the dictionary:
s, d = ' Text to replace ', { 't':'ch' , 'r' : 'gh'}
print("".join(d.get(char, char) for char in s)) # Texch cho gheplace
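The difference between the two approaches shows up as soon as a replacement value contains another key. The sketch below uses a made-up dictionary to illustrate this; both function names are hypothetical.

```python
def replace_sequential(s, d):
    # Each pass sees the output of the previous pass, so replacements can cascade.
    for char in d:
        s = s.replace(char, d[char])
    return s

def replace_single_pass(s, d):
    # Every original character is looked up exactly once.
    return "".join(d.get(char, char) for char in s)

d = {'a': 'b', 'b': 'c'}  # the value 'b' is also a key
print(replace_sequential('ab', d))
# cc
print(replace_single_pass('ab', d))
# bc
```

The sequential result relies on dictionaries keeping insertion order, which is guaranteed from Python 3.7 on.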

How to strip characters from the right side of every word in Python?

Say, if I have a text like
text='a!a b! c!!!'
I want an outcome like this:
text='a!a b c'
So, if a word ends with '!', I want to get rid of it. If there are multiple '!' at the end of a word, all of them should be removed.
print " ".join(word.rstrip("!") for word in text.split())
As an alternative to the split/strip approach
" ".join(x.rstrip("!") for x in text.split())
which won't preserve whitespace exactly, you could perhaps use a regex such as
re.sub(r"!+\B", "", text)
which removes every run of exclamation marks that isn't immediately followed by the start of a word.
>>> import re
>>> testWord = 'a!a b! c!!!'
>>> re.sub(r'(!+)(?=\s|$)', '', testWord)
'a!a b c'
This preserves any extra spaces you may have in your string which does not happen with str.split()
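The whitespace difference between the two approaches can be seen with an input that contains repeated spaces. This is a small sketch with a made-up sample string and hypothetical function names.

```python
import re

def strip_bang_split(text):
    # split() discards the original spacing; words are rejoined with single spaces.
    return " ".join(word.rstrip("!") for word in text.split())

def strip_bang_regex(text):
    # Remove runs of '!' that are followed by whitespace or the end of the string.
    return re.sub(r"!+(?=\s|$)", "", text)

text = "a!a  b!   c!!!"
print(repr(strip_bang_split(text)))
# 'a!a b c'
print(repr(strip_bang_regex(text)))
# 'a!a  b   c'
```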
Here's a non-regex, non-split based approach:
from itertools import groupby
def word_rstrip(s, to_rstrip):
    words = (''.join(g) for k, g in groupby(s, str.isspace))
    new_words = (w.rstrip(to_rstrip) for w in words)
    return ''.join(new_words)
This works first by using itertools.groupby to group together contiguous characters based on whether or not they're whitespace:
>>> s = "a!a b! c!!"
>>> [''.join(g) for k,g in groupby(s, str.isspace)]
['a!a', ' ', 'b!', ' ', 'c!!']
Effectively, this is like a whitespace-preserving .split(). Once we've got this, we can use rstrip as we always would, and then recombine:
>>> [''.join(g).rstrip("!") for k,g in groupby(s, str.isspace)]
['a!a', ' ', 'b', ' ', 'c']
>>> ''.join(''.join(g).rstrip("!") for k,g in groupby(s, str.isspace))
'a!a b c'
We can also pass whatever we like:
>>> word_rstrip("a!! this_apostrophe_won't_vanish these_ones_will'''", "!'")
"a this_apostrophe_won't_vanish these_ones_will"
