Merging three regex patterns used for text cleaning to improve efficiency - python

Given a text I want to make some modifications:
replace uppercase chars at the beginning of a sentence.
remove chars like ’ or ' (without adding whitespace)
remove unwanted chars for example ³ or ? , ! . (and replace with whitespace)
def multiple_replace(text):
# first sub so words like can't will change to cant and not can t
first_strip=re.sub("[’']",'',text)
def cap(match):
return (match.group().lower())
p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
#second sub to change all words that begin a sentence to lowercase
second_strip = p.sub(cap,first_strip)
# third_strip is to remove all . from text unless they are used in decimal numbers
third_strip= re.sub(r'(?<!\d)\.|\.(?!\d)','',second_strip)
# fourth strip to remove unexpected char that might be in text for example !,?³ and replace with whitespace
forth_strip=re.sub('[^A-Za-z0-9##_$&%]+',' ', third_strip)
return forth_strip
I am wondering if there is a more efficient way of doing it? Because I am going over the text 4 times just so it can be in the right format for me to parse. This seems a lot especially if there are millions of documents. Is there a more efficient way of doing this?

You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
Output
a test here this is test but keep 1.2
Python demo
You could also use a lambda with all 3 patterns and 2 capturing groups checking the group values in the callback, but I think that would not benefit the readability or making it easier to change or test.
import re
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|[`']+|((?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
Output
a test here this is test but keep 1.2
Python demo

Related

Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is not preceded or followed with a digit, and that is not at the end of the string a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'
def sentences(doc):
#split all sentences
s = re.split(r'\.\s+', doc)
#remove empty index or remove period from absolute last index, if present
if s[-1] == '':
s = s[0:-1]
elif s[-1].endswith('.'):
s[-1] = s[-1][:-1]
#return sentences
return s
print(sentences(doc))
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that has a non-digit character. (?=\D) then makes sure there's a non digit after the current position. When everything applies, it splits correctly.

insert space between regex match

I want to un-join typos in my string by locating them using regex and insert a space character between the matched expression.
I tried the solution to a similar question ... but it did not work for me -(Insert space between characters regex); solution- to use the replace string as '\1 \2' in re.sub .
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
DEMO
I'm not sure about other possible inputs, we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
after that we would replace any two spaces with a single space and it might work, not sure though:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
The expression is explained on the top right panel of this demo, if you wish to explore further or modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
Capture groups, marked by parenthesis ( and ), should be around the patterns you want to match.
So this should work for you
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean,'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (first set of parenthesis), match non-digit and non-whitespace as group 2.
The reason why your's doesn't work was that you did not have parenthesis surrounding the patterns you want to reuse.

Add space after full stops

What is the clean way in python to do this simple text fixing - checking if every full stop (except the last one) is followed by space. Assume that having a dot not followed by an empty space is the only possible error we can get in the input string.
I am doing this:
def textFix(text):
result = re.sub('\.(?!\s)', '. ', text)
if (result[len(result) - 1]) == ' ':
return result[:-1]
return result
You may check it with
\.(?!\s|$)
See the regex demo. It matches a dot not followed with whitespace or end of string, that is, any non-final dot that has no whitespace after it.
Or, you may also consider
\.(?=\S)
to match any dot followed with a non-whitespace char.
See another demo.
Python demo:
import re
rx = r"\.(?=\S)"
s = "Text1. Text2.Text3."
result = re.sub(rx, ". ", s)
print(result)
# => "Text1. Text2. Text3."
Your technique looks perfect. But also include a check to avoid adding space after last dot (.)
\.(?!\s)(?!$)
where (?!$) helps make sure if the . is followed by end of string $ then isn't matched and so no space is added after it.
Regex 101 demo

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo

How to replace .. in a string in python

I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
my output i get haha hehe hoho
desired output
haha hehe.hoho
What am i doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside a [] will be checked against your expression
And the dot character has a special meaning on a regex, to avoid this meaning you should prefix it with a escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified
range.
. Matches any single character except "n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print s.replace('..',' ')
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)
[..+]+ , this meaning in regex is that use the any in the list at least one time. So it matches the .. as well as . in your input. Make the changes as below:
s = re.sub('\.\.+',' ', s)
[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)
Here is what your regex means:
So, you allow for 1 or more literal periods or plus symbols, which is not the case.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print s
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.

Categories

Resources