Python (2.7) - Replacing multiple patterns in a string using re - python

I am trying to think of a more elegant way of replacing multiple patterns in a given string using re. The problem is small: collapse every run of extra spaces in the string into a single space, and insert a space wherever a letter directly follows a period. So the sentence
'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
should be corrected to:
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
My solution, below, seems a bit messy. I was wondering whether there was a nicer way of doing this, as in a one-liner regex.
def correct( astring ):
    import re
    bstring = re.sub( r' +', ' ', astring )
    letters = [frag.strip( '.' ) for frag in re.findall( r'\.\w', bstring )]
    for letter in letters:
        bstring = re.sub( r'\.{}'.format( letter ), '. {}'.format( letter ), bstring )
    return bstring

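import re
s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
# Add the space after each period first, then collapse whitespace runs,
# so periods that already had a space do not end up with two.
print(re.sub(r"\s+", " ", s.replace(".", ". ")).rstrip())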
This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.

You could use re.sub as shown below. This puts exactly one space after every dot except the last one, and it replaces every other run of one or more whitespace characters with a single space.
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub(r'(?<!\.)\s+', ' ' ,re.sub(r'\.\s*(?!$)', r'. ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
OR
>>> re.sub(r'\.\s*(?!$)', r'. ', re.sub(r'\s+', ' ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
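As a further variant, the second substitution can use a lookahead so that only periods directly followed by a word character receive a space. This is a minimal Python 3 sketch rather than part of the answer above; the name correct_spacing and the sample string are made up for illustration, and the approach would also split decimal numbers such as 3.14.

```python
import re

def correct_spacing(text):
    # Hypothetical helper: collapse every whitespace run into one space...
    collapsed = re.sub(r'\s+', ' ', text)
    # ...then add a space after any period directly followed by a word character.
    return re.sub(r'\.(?=\w)', '. ', collapsed)

sample = 'Too many    spaces.And.Some periods are  not. placed properly.'
print(correct_spacing(sample))
```

Because the lookahead does not consume the following letter, the final period and periods already followed by a space are left alone.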

An approach without using any RegEX
>>> ' '.join(s.replace('.', '. ').split())
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'

What pure regex? Like this?
>>> import re
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub(r'\s+$', '', re.sub(r'\s+', ' ', re.sub(r'\.', '. ', s)))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'

Related

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> ' #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
You can capture the portion of the pattern you want to retain and refer to it with a backreference in your replacement string; a lookahead leaves the following letter available for the next match:
x = re.sub(r"([a-zA-Z])[$#%!\s]+(?=[a-zA-Z])", r'\1 ', word)
The problem is that your regex matches the wrong thing entirely.
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)
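To illustrate the difference between what findall reports and what sub replaces, here is a small Python 3 sketch. re.sub always replaces the whole match; to swap out only one capturing group, a replacement function can rebuild the match around that group's span. The helper name replace_group and the pattern used below are assumptions for illustration, not something from the answers above.

```python
import re

def replace_group(pattern, repl, text, group=1):
    """Hypothetical helper: replace only the span of one group in each match."""
    def _rebuild(m):
        whole = m.group(0)
        if m.group(group) is None:
            # The group did not take part in this match; leave it untouched.
            return whole
        # Convert the group's absolute span into offsets within the match.
        start = m.start(group) - m.start(0)
        end = m.end(group) - m.start(0)
        return whole[:start] + repl + whole[end:]
    return re.sub(pattern, _rebuild, text)

word = 'This##Is # A # Test#'
# A letter, then the symbols to replace (group 1), then a letter that is
# only looked at, so it can start the next match.
result = replace_group(r'[a-zA-Z]([$#%!\s]+)(?=[a-zA-Z])', ' ', word)
print(result)
```

The trailing # survives because no letter follows it, so the lookahead fails there.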

Remove a character in string if it doesn't belong to a group of matching pattern in Python

I have a string that contains many words. I want to remove a closing parenthesis unless the word it is attached to starts with _.
Example input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub(r'(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
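The same result can probably be reached without a callable, by letting the pattern itself reject words that start with an underscore. The sketch below is one possible Python 3 formulation and assumes that a "word" means a run of \w characters beginning at a word boundary; the function name is invented for the example.

```python
import re

def strip_unmarked_parens(text):
    # \b anchors at the start of a word, (?!_) rejects words that begin
    # with an underscore, and \1 puts the word back without the ')'.
    return re.sub(r'\b(?!_)(\w+)\)', r'\1', text)

s = 'this is an example to _remove) brackets under certain) conditions.'
print(strip_unmarked_parens(s))
```

Since the underscore counts as a word character, there is no word boundary inside _remove, so the engine cannot start a match partway through that word.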
I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
                   for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
                   for word in words.split(" ")])

Exclude matching the period character in a [\W\d]+ regex

I'd like to remove everything from a string except alphabetic characters and periods.
I made the below function in python. How would I extend the regex so periods are NOT stripped from the string? This needs to work for unicode strings.
def normalize(self, text):
    text = re.sub(ur"(?u)[\W\d]+", ' ', text)
    print text
    return text
Change the semantics from 'strip everything in this group' to 'strip everything that's not in this group' and use:
text = re.sub(ur"(?u)[^a-zA-Z\.]+", ' ', text)
Update:
I don't think the solution above will work with every Unicode alphabet.
The answers here offer alternative modules to the builtin re that support Unicode letter groups.
Another option is to combine the two approaches:
>>> text = '1234abcd.à!##$'
>>> re.sub(ur'(?u)([^\w\.]|\d)+',' ',text)
' abcd.\xc3 '
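For comparison, here is how the combined approach might look in Python 3, where str patterns are Unicode-aware by default and the ur prefix and (?u) flag are no longer needed. This is a sketch under that assumption; the function name is made up, and it additionally strips underscores, which \w would otherwise keep.

```python
import re

def keep_letters_and_periods(text):
    # Replace each run of characters that are either non-word characters
    # other than '.', or digits, or underscores, with a single space.
    return re.sub(r'(?:[^\w.]|[\d_])+', ' ', text)

print(keep_letters_and_periods('1234abcd.à!##$'))
```

Accented letters such as à survive because \w matches any Unicode word character in Python 3.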

How to explode sentences with "。" but ignore the "。" in the double quotation marks

I am writing a program that extracts the abstract of a Chinese article. First I have to split (explode) the text into sentences at symbols like “。!?”.
In Chinese articles, quoted speech is enclosed in double quotation marks, and the quoted words may contain "。" that should not be split on. For example, the following sentence:
他说:“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说:“今天天气很好
我很开心
”
The result is wrong, but how can I solve it?
I have tried using a regular expression, but I am not good at them, so I couldn't figure it out.
PS: I am writing this program in Python 3.
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other character as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (ie, won't include the delimiter), just move the capturing parenthesis and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note that the first capturing group now closes before the delimiter class
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without some complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts (indices 0, 2, 4) are the ones not between quotes. We'll use index % 2 == 1 to pick out the odd parts, which are the quoted ones, and keep them intact.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        part.append(p)
    else:
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes, and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text.). Next, explode the strings as before. Finally, replace the $%$%$%$ with a dot again.
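The three steps described above could be sketched in Python 3 roughly as follows. This is only an illustration: the function name, the default terminator set, and the use of private-use code points as placeholders (instead of a string like $%$%$%$) are all assumptions, and the placeholders are presumed never to occur in the input. Quotes are assumed not to be nested.

```python
import re

def split_sentences(text, terminators='。!?'):
    # Map each terminator to a private-use code point (assumed absent from the text).
    protect = {ch: chr(0xE000 + i) for i, ch in enumerate(terminators)}
    restore = {v: k for k, v in protect.items()}

    # Step 1: protect terminators that appear inside “...” quotes.
    def _protect(m):
        return ''.join(protect.get(ch, ch) for ch in m.group(0))
    protected = re.sub('“[^”]*”', _protect, text)

    # Step 2: cut after each remaining terminator, keeping it with its sentence.
    cls = re.escape(terminators)
    pieces = re.findall('[^{0}]+[{0}]?'.format(cls), protected)

    # Step 3: put the protected terminators back.
    return [''.join(restore.get(ch, ch) for ch in piece) for piece in pieces]

print(split_sentences('今天天气很好。他说:“今天天气很好。我很开心。”'))
```

Using one placeholder per terminator keeps the restore step unambiguous when several different punctuation marks appear inside a quotation.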
May be this will work:
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^"]*"){2})*[^"]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
This makes sure that 。 is matched only when outside double quotes.
OUTPUT:
Array
(
    [0] => 他说:“今天天气很好
    [1] => 我很开心
    [2] => ”
)

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried out various combinations of regular expressions, but I am unable to grasp how the patterns work; each expression gives me a different yet inexplicable result (I do admit I am a beginner, but still).
I tried:
"((\w+)(\s?))*"
To the best of my knowledge this should greedily match one or more alphanumerics followed by either one or no white-space character, and then it should greedily match this entire pattern repeatedly. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE.
In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
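The effect of capturing groups on findall can be seen side by side in the short Python 3 sketch below. It also shows one possible pattern, chosen here purely for illustration, that allows exactly one space between words, which is the limit the question's P.S. asks about.

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall returns the groups rather than the whole
# match, and a repeated group only remembers its last repetition.
captured = re.findall(r'((\w+)(\s?))*', s)
print(captured[0])

# With a non-capturing group, findall returns the whole match.
# Each further word must be preceded by exactly one space.
runs = re.findall(r'\w+(?: \w+)*', s)
print(runs)

# Two spaces in a row end the run, so the text is cut there.
broken = re.findall(r'\w+(?: \w+)*', 'two  spaces here')
print(broken)
```

This is why the pattern itself was fine while the findall output looked wrong: only the reporting of groups differs, not what was matched.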
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of consecutive white-space characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row; what ends a sentence is a punctuation mark, or more generally any character that is not in the class above.
([a-zA-Z0-9\s])*
The above regex will match a series of alphanumeric or white-space characters, repeated zero or more times. You can refine it to the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must start with an alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')
def main():
    i = 0
    for s in re_sentence.finditer(source):
        print("%d:%s" % (i, s.group(0)))
        i += 1
if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space has the '+' meta-character, so that two or more spaces in a row will be an end-of-sentence.
