Punctuation not detected between words with no space - python

How can I split sentences, when punctuation is detected (.?!) and occurs between two words without a space?
Example:
>>> splitText = re.split("(?<=[.?!])\s+", "This is an example. Not
working as expected.Because there isn't a space after dot.")
output:
['This is an example.',
"Not working as expected.Because there isn't a space after dot."]
expected:
['This is an example.',
'Not working as expected.',
'Because there isn't a space after dot.']`

splitText = re.split("[.?!]\s*", "This is an example. Not working as expected.Because there isn't a space after dot.")
+ is used for 1 or more of something, * for zero of more.
if you need to keep the . you probably don't want to split, instead you could do:
splitText = re.findall(".*?[.?!]", "This is an example. Not working as expected.Because there isn't a space after dot.")
which gives
['This is an example.',
' Not working as expected.',
"Because there isn't a space after dot."]
you can trim those by playing with the regex (eg '\s*.*?[.?!]') or just using .trim()

Use
https://regex101.com/r/icrJNl/3/.
import re
from pprint import pprint
split_text = re.findall(".*?[?.!]", "This is an example! Working as "
"expected?Because.")
pprint(split_text)
Note: .*? is a lazy (or non-greedy) quantifier in opposite to .* which is a greedy quantifier.
Output:
['This is an example!',
' Working as expected?',
'Because.']
Another solution:
import re
from pprint import pprint
split_text = re.split("([?.!])", "This is an example! Working as "
"expected?Because.")
pprint(split_text)
Output:
['This is an example',
'!',
' Working as expected',
'?',
'Because',
'.',
'']

Related

Why are there space outcome in my re.split() result

I want to extract the strings in the brackets and single quote in the given a string, e.g. Given ['this'], extract this
, yet it keeps haunting me that the following example and result:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
Now I really don't understand where are the first and last ' ' in the result coming from, I guarantee that the input target_string has no such leading and trailing space, I don't expect that they occurred in the result
Can anybody help me fix this, please?
Using re.split match every time the pattern is found and since your string starts and ends with the pattern is output a '' at he beguining and end to be able to use join on the output and form the original string
If you want to capture why don't you use re.findall instead of re.split? you have very simple use if you only have one word per bracket.
target_string = "['this']['current']"
re.findall("\w", target_string)
output
['this', 'current']
Note the above will not work for:
['this is the', 'current']
For such a case you can use lookahead (?=...) and lookbehind (?<=...) and capture everything in a nongreedy way .+?
target_string = "['this is the', 'current']"
re.findall("(?<=\[\').+?(?=\'\])", target_string) # this patter is equivalent "\[\'(.+)\'\]"
output:
['this is the', 'current']

Join negative particle

How to attach a negative particle to the next word in python (for all texts where "no" can be)?
For example, make this string ['This is not apple']
Into this: ['This is not_apple']
You can use a regular expression:
\bnot\s+(?=\w) to match the word not (but not other words ending in not) followed by one or more space and another word.
import re
s = 'This is not apple'
s2 = re.sub(r'\bnot\s+(?=\w)', 'not_', s)
output: 'This is not_apple'
with a list
import re
l = ['This is not apple']
[re.sub(r'\bnot\s+(?=\w)', 'not_', s) for s in l]
output: ['This is not_apple']

Using regex to find all phrases that are completely capitalized

I want to use regex to match with all substrings that are completely capitalized, included the spaces.
Right now I am using regexp: \w*[A-Z]\s]
HERE IS Test WHAT ARE WE SAYING
Which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match with all substrings that are allcaps, so that it returns:
HERE IS
WHAT ARE WE SAYING
You can use word boundaries \b and [^\s] to prevent starting and ending spaces. Put together it might look a little like:
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
matches = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
matches.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall('[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches any space or capitalized letter, that is not followed by a non-capitalized letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
input = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split('\s*(?:\w*[^A-Z\s]\w*\s*)+', input)
print(parts);
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any sequential cluster of words which contains one or more letter which is not uppercase.
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall('((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']

Python (2.7) - Replacing multiple patterns in a string using re

I am trying to think of a more elegant way of replacing multiple patterns in a given string using re in relation to a little problem, which is to remove from a given string all substrings consisting of more than two spaces and also all substrings where a letter starts after a period without any space. So the sentence
'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
should be corrected to:
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
My solution, below, seems a bit messy. I was wondering whether there was a nicer way of doing this, as in a one-liner regex.
def correct( astring ):
import re
bstring = re.sub( r' +', ' ', astring )
letters = [frag.strip( '.' ) for frag in re.findall( r'\.\w', bstring )]
for letter in letters:
bstring = re.sub( r'\.{}'.format( letter ), '. {}'.format( letter ), bstring )
return bstring
s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
print(re.sub("\s+"," ",s).replace(".",". ").rstrip())
This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.
You could use re.sub function like below. This would add exactly two spaces next to the dot except the last dot and it also replaces one or more spaces except the one after dot with a single space.
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub(r'(?<!\.)\s+', ' ' ,re.sub(r'\.\s*(?!$)', r'. ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
OR
>>> re.sub(r'\.\s*(?!$)', r'. ', re.sub(r'\s+', ' ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
An approach without using any RegEX
>>> ' '.join(s.split()).replace('.','. ')[:-1]
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
What pure regex? Like this?
>>> import re
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub('\s+$', '', re.sub('\s+', ' ', re.sub('\.', '. ', s)))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'

python: split string after comma and dots

I have a piece of code which splits a string after commas and dots (but not when a digit is before or after a comma or dot):
text = "This is, a sample text. Some more text. $1,200 test."
print re.split('(?<!\d)[,.]|[,.](?!\d)', text)
The result is:
['This is', ' a sample text', ' Some more text', ' $1,200 test', '']
I don't want to lose the commas and dots. So what I am looking for is:
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
Besides, if a dot in the end of text it produces an empty string in the end of the list. Furthermore, there are white-spaces at the beginning of the split strings. Is there a better method without using re? How would you do this?
Unfortunately you can't use re.split() on a zero-length match, so unless you can guarantee that there will be whitespace after the comma or dot you will need to use a different approach.
Here is one option that uses re.findall():
>>> text = "This is, a sample text. Some more text. $1,200 test."
>>> print re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text)
['This is,', ' a sample text.', ' Some more text.', ' $1,200 test.', '']
This doesn't strip whitespace and you will get an empty match at the end if the string ends with a comma or dot, but those are pretty easy fixes.
If it is a safe assumption that there will be whitespace after every comma and dot you want to split on, then we can just split the string on that whitespace which makes it a little simpler:
>>> print re.split(r'(?<=[,.])(?<!\d.)\s', text)
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']

Categories

Resources