What is a clean way in Python to do this simple text fix: making sure every full stop (except the last one) is followed by a space? Assume that a dot not followed by a space is the only possible error in the input string.
I am doing this:
import re

def textFix(text):
    result = re.sub(r'\.(?!\s)', '. ', text)
    # Drop the space that was added after the final dot
    if result.endswith(' '):
        return result[:-1]
    return result

You may check it with \.(?!\s|$)
See the regex demo. It matches a dot not followed by whitespace or the end of the string, that is, any non-final dot that has no whitespace after it.
Or, you may also consider \.(?=\S) to match any dot followed by a non-whitespace char.
See another demo.
Python demo:
import re
rx = r"\.(?=\S)"
s = "Text1. Text2.Text3."
result = re.sub(rx, ". ", s)
# => "Text1. Text2. Text3."

Your technique looks fine, but also include a check to avoid adding a space after the last dot (.)
where (?!$) makes sure that a . followed by the end of the string ($) is not matched, so no space is added after it.
Regex 101 demo
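A minimal sketch of that check, assuming both conditions (no whitespace, not the end of the string) are folded into a single negative lookahead. The name fix_spacing is only illustrative and does not come from the posts above:

```python
import re

def fix_spacing(text):
    # Insert a space after every dot that is followed by neither
    # whitespace nor the end of the string.
    return re.sub(r"\.(?!\s|$)", ". ", text)

print(fix_spacing("Text1. Text2.Text3."))
```

Because the final dot is never matched, no trailing space has to be trimmed afterwards. Note that this sketch would also put a space inside a decimal number such as 3.14, a case the question rules out by assumption.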


Split a string by comma except when in bracket and except when directly before and/or after the comma is a dash "-"?

Just trying to figure out how to split a string by comma except when in brackets AND except when there is a dash directly before and/or after the comma. I have already found some good solutions for the bracket problem, but I do not have any clue how to extend them to my problem.
Here is an example:
example_string = 'A-la-carte-Küche, Garnieren (Speisen, Getränke), Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung'
aim = ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)', 'Kosten-, Leistungsrechnung', 'Berufsausbildung, -fortbildung']
So far, I have managed to do the following:
>>> re.split(r',\s*(?![^()]*\))', example_string)
>>> out: ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)', 'Kosten-', 'Leistungsrechnung', 'Berufsausbildung', '-fortbildung']
Note the difference between aim and out for the terms 'Kosten-, Leistungsrechnung' and 'Berufsausbildung, -fortbildung'.
Would be glad if someone could help me out such that the output looks like aim.
Thanks in advance!
If you can make use of the python regex module, you could do:
The pattern matches:
\([^()]*\) Match from an opening till closing parenthesis
(*SKIP)(*F) Skip the match
| Or
(?<!-)\s*,\s*(?!,) Match a comma between optional whitespace chars to split on
Regex demo
import regex
example_string = 'A-la-carte-Küche, Garnieren (Speisen, Getränke), Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung'
print(regex.split(r"\([^()]*\)(*SKIP)(*F)|(?<!-)\s*,\s*(?!,)", example_string))
['A-la-carte-Küche', ' Garnieren (Speisen, Getränke)', ' Kosten-, Leistungsrechnung', ' Berufsausbildung', ' -fortbildung']
You can use
re.split(r'(?<!-),(?!\s*-)\s*(?![^()]*\))', example_string)
See the Python demo. Details:
(?<!-) - a negative lookbehind that fails the match if there is a - char immediately to the left of the current location
, - a comma
(?!\s*-) - a negative lookahead that fails the match if there are zero or more whitespaces and then a - char immediately to the right of the current location
\s* - zero or more whitespaces
(?![^()]*\)) - a negative lookahead that fails the match if there are zero or more chars other than ) and ( and then a ) char immediately to the right of the current location.
See the regex demo, too.
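As a quick check, the lookaround pattern above can be run against the example from the question using only the standard re module. This is a small sketch; example_string and aim are taken from the question, while the names pattern and out are just local choices:

```python
import re

example_string = ('A-la-carte-Küche, Garnieren (Speisen, Getränke), '
                  'Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung')
aim = ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)',
       'Kosten-, Leistungsrechnung', 'Berufsausbildung, -fortbildung']

# Split on a comma that has no dash directly before it, no dash after
# optional whitespace, and no closing parenthesis ahead of it.
pattern = re.compile(r'(?<!-),(?!\s*-)\s*(?![^()]*\))')
out = pattern.split(example_string)
print(out == aim)
```

The parenthesis lookahead only looks for the next ) without an intervening (, so it is meant for flat, non-nested brackets like the ones in the example.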

Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is not preceded or followed by a digit and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
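To see how the extracting approach behaves on both inputs from the question, here is a small sketch. The helper name extract_sentences and the final strip() are additions for illustration, not part of the answer above:

```python
import re

SENTENCE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # Each match is a run of numbering tokens (10.1 or 10.1.) and non-dot
    # chars; strip() drops the space left after the previous sentence's dot.
    return [m.strip() for m in SENTENCE.findall(text)]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
```

Since the pattern has no capturing groups, findall returns the whole matches, which is what makes the one-line list comprehension possible.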
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re

doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    # split all sentences
    s = re.split(r'\.\s+', doc)
    # remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]
    # return sentences
    return s
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that is preceded by a non-digit character. (?=\D) then makes sure there's a non-digit character after the current position. When both conditions hold, the string is split at that position.
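Building on the fixed code, here is a hedged sketch that also trims the leftover whitespace and the sentence-final period, so the result matches the lists asked for in the question. The function name split_sentences is hypothetical:

```python
import re

def split_sentences(text):
    # Zero-width split (needs Python 3.7+): the position must be preceded
    # by "<non-digit>." and followed by a non-digit.
    parts = re.split(r"(?<=\D\.)(?=\D)", text)
    # Trim surrounding whitespace and the trailing period of each part.
    return [p.strip().rstrip(".") for p in parts if p.strip()]

print(split_sentences('10.1 This is a sentence. Another sentence.'))
print(split_sentences('10.1. This is a sentence. Another sentence.'))
```

The clause number 10.1. survives in the second input because its final dot is preceded by a digit, so the lookbehind does not allow a split there.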

Merging three regex patterns used for text cleaning to improve efficiency

Given a text I want to make some modifications:
replace uppercase chars at the beginning of a sentence with lowercase.
remove chars like ’ or ' (without adding whitespace)
remove unwanted chars for example ³ or ? , ! . (and replace with whitespace)
import re

def multiple_replace(text):
    # first sub so words like can't will change to cant and not can t
    # (this statement is missing from the post; the pattern is assumed from the list above)
    first_strip = re.sub(r"[’']", '', text)
    def cap(match):
        # body is truncated in the post; lowercasing is assumed from the comment below
        return match.group().lower()
    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    # second sub to change all words that begin a sentence to lowercase
    second_strip = p.sub(cap, first_strip)
    # third_strip is to remove all . from text unless they are used in decimal numbers
    third_strip = re.sub(r'(?<!\d)\.|\.(?!\d)', '', second_strip)
    # fourth strip to remove unexpected chars that might be in text, for example !,?³, and replace with whitespace
    forth_strip = re.sub('[^A-Za-z0-9##_$&%]+', ' ', third_strip)
    return forth_strip
I am wondering if there is a more efficient way of doing it? Because I am going over the text 4 times just so it can be in the right format for me to parse. This seems a lot especially if there are millions of documents. Is there a more efficient way of doing this?
You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
    # lowercase the matched sentence-initial letter
    return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
a test here this is test but keep 1.2
Python demo
You could also use a lambda with all 3 patterns and 2 capturing groups, checking the group values in the callback, but I think that would not improve readability or make the code easier to change or test.
import re
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|[`']+|((?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
a test here this is test but keep 1.2
Python demo
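Since the question is about running this over millions of documents, one option is to compile the patterns once and reuse them inside a function. The sketch below wraps the three-step version from this answer. The names clean_text, SENTENCE_START, QUOTES_AND_DOTS and UNWANTED are made up for illustration, the ’ character is added to the quote class as an assumption based on the question, and whether this is measurably faster would need to be timed on real data.

```python
import re

# Compiled once at import time and reused for every document.
SENTENCE_START = re.compile(r"(?<=[.?!]\s)[A-Z]|^[A-Z]", re.M)
QUOTES_AND_DOTS = re.compile(r"[`'’]+|(?<!\d)\.|\.(?!\d)")
UNWANTED = re.compile(r"[^A-Za-z0-9#_$&%.]+")

def clean_text(text):
    # 1. Lowercase a capital letter that starts a sentence or a line.
    text = SENTENCE_START.sub(lambda m: m.group().lower(), text)
    # 2. Remove quote characters and dots that are not between two digits.
    text = QUOTES_AND_DOTS.sub("", text)
    # 3. Collapse every run of remaining unwanted characters into one space.
    return UNWANTED.sub(" ", text)

print(clean_text("A test here. this `` (*)is. Test, but keep 1.2"))
```

Keeping the steps separate keeps each pattern easy to test on its own, which is the readability trade-off mentioned above.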

Regex to fix (all the matches or none) at the end to one

I'm trying to fix the . at the end to only one in a string. For example,
line = ""
I have the regex \.*$ in Ruby, which is to be replaced by a single ., as in this demo, but it doesn't seem to work as expected. I've searched for similar posts, and the closest I've found is this answer in Python, which suggests the following:
>>> text1 = ''
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> ''
But it fails if there is no . at the end. So I tried \b\.*$, as seen here, but this fails on the 3rd test, which has some ?'s at the end.
My question is: why doesn't \.*$ match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
You might use an alternation that either matches 2 or more dots, or asserts that what is directly to the left is not, for example, !, ? or a dot itself.
In the replacement use a single dot.
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
Regex demo | Python demo
For example
import re
strings = [
    # test strings go here
]
for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
Regex demo
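A runnable sketch of both variants, using made-up sample strings because the ones from the question are not shown here. Two assumptions are baked in: the lookbehind is extended to (?<![.!?]) so that strings ending in ! or ? are left alone, and the positive-lookbehind variant is written as (?<=[^.!?]), which is one possible reading of the remark above rather than the answer's exact pattern. \Z is used instead of $ so that only the true end of the string counts. The name end_with_one_dot is hypothetical.

```python
import re

# Assumed extension of the answer's pattern: ! and ? also block the insert.
END_DOTS = re.compile(r"(?:\.{2,}|(?<![.!?]))\Z")
# Positive-lookbehind variant: needs a preceding char, so "" stays "".
END_DOTS_NONEMPTY = re.compile(r"(?:\.{2,}|(?<=[^.!?]))\Z")

def end_with_one_dot(s, keep_empty=False):
    # Collapse 2+ trailing dots to one, or append a dot if none is there.
    pattern = END_DOTS_NONEMPTY if keep_empty else END_DOTS
    return pattern.sub(".", s)

for sample in ["abc...", "abc", "abc.", "abc??", ""]:
    print(repr(sample), "->", repr(end_with_one_dot(sample)))
```

The alternation avoids the problem with \.*$, which can also match the empty string at the very end and so may be replaced a second time.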

How to replace .. in a string in python

I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get is haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?
Test on sites like regexpal:
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] is treated as a candidate: the class matches any one of them.
And the dot character has a special meaning in a regex; to avoid this meaning, you should prefix it with an escape character: \
You can read more about regular expressions metacharacters here:
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = r'\.\.' #If you want to remove when there's 2 dots
pattern = r'\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, I find the replace() method much simpler:
s = "haha..hehe.hoho"
print(s.replace('..', ' '))
gives your desired output:
haha hehe.hoho
re.sub('[..+]+',' ', s)
re.sub(r'\.\.+',' ', s)
[..+]+ means: match any of the characters in the list (. or +) one or more times. So it matches the single . as well as the .. in your input. Make the change as below:
s = re.sub(r'\.\.+',' ', s)
[] is a character class and will match any one character listed in it (here, a single .).
I'm guessing you used it because a simple . wouldn't work, because it's a metacharacter meaning any character. You can simply escape it with a \ to mean a literal dot. As such:
s = re.sub(r'\.\.',' ', s)
Here is what your regex means:
So, you allow for 1 or more literal periods or plus symbols, which is not what you want.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
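The difference between the character class and the quantified literal dot can be seen side by side in a short sketch. The helper replace_dots and the three-dot sample string are only for illustration:

```python
import re

s = "haha..hehe.hoho"

def replace_dots(pattern, text):
    # Replace every match of the pattern with a single space.
    return re.sub(pattern, " ", text)

# Character class: any run of '.' or '+' chars, so the single dot matches too.
print(replace_dots(r"[..+]+", s))
# Exactly two literal dots.
print(replace_dots(r"\.{2}", s))
# The two quantifiers only differ on longer runs of dots.
print(replace_dots(r"\.{2}", "haha...hehe.hoho"))
print(replace_dots(r"\.{2,}", "haha...hehe.hoho"))
```

With three dots in a row, \.{2} leaves one dot behind, while \.{2,} consumes the whole run.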
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.

