Join negative particle - python

How can I attach a negative particle to the next word in Python (for any text in which "not" can occur)?
For example, turn this string: ['This is not apple']
into this: ['This is not_apple']

You can use a regular expression:
\bnot\s+(?=\w) matches the word not (the word boundary keeps it from matching inside other words ending in not, such as cannot), followed by one or more spaces, with a lookahead that requires another word character to follow without consuming it.
import re
s = 'This is not apple'
s2 = re.sub(r'\bnot\s+(?=\w)', 'not_', s)
output: 'This is not_apple'
With a list:
import re
l = ['This is not apple']
[re.sub(r'\bnot\s+(?=\w)', 'not_', s) for s in l]
output: ['This is not_apple']
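If you also need to handle other negative particles (the question mentions "no" as well), a minimal sketch could keep the matched particle in a group and reuse it in the replacement; the particle list here is an assumption:
import re

# Sketch: handle several negative particles; the list is an assumption.
particles = ("not", "no")
pattern = re.compile(r'\b(' + '|'.join(particles) + r')\s+(?=\w)')

texts = ['This is not apple', 'There is no apple']
print([pattern.sub(r'\1_', t) for t in texts])
# ['This is not_apple', 'There is no_apple']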

Related

Regex only returning first part of match

I'd like to extract the cat and another mat from this sentence:
>>> text = "the cat sat on another mat"
>>>
>>> re.findall('(the|another)\s+\w+', text)
['the', 'another']
But it doesn't return 'the cat' and 'another mat'. If I change it to re.findall('another\s+\w+', text) then it finds that part, so why doesn't the (first thing|second thing) alternation work?
(Using Python's re module)
I would do
import re
text = "the cat sat on another mat"
re.findall('the\s+\w+|another\s+\w+', text)
The result should be
>>> ['the cat', 'another mat']
When the pattern contains a capture group, re.findall returns only the substrings matched by that group. In this case, use a non-capturing group instead so that re.findall returns the entire matches:
re.findall('(?:the|another)\s+\w+', text)
This returns:
['the cat', 'another mat']
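As a side note (this illustrates the rule above, not something from the question): if the pattern contains two capturing groups, re.findall returns a list of tuples, one element per group, rather than the full matches:
import re
text = "the cat sat on another mat"
# Two capturing groups: findall returns tuples of the captured parts.
print(re.findall(r'(the|another)\s+(\w+)', text))
# [('the', 'cat'), ('another', 'mat')]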

Using regex to find all phrases that are completely capitalized

I want to use a regex to match all substrings that are completely capitalized, including the spaces.
Right now I am using this regex: \w*[A-Z]\s]
HERE IS Test WHAT ARE WE SAYING
Which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match all substrings that are in all caps, so that it returns:
HERE IS
WHAT ARE WE SAYING
You can use word boundaries \b and [^\s] to prevent the match from starting or ending with a space. Put together, it might look like:
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
pattern = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
pattern.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall('[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches any run of spaces or capital letters that is not followed by a lowercase letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
input = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split('\s*(?:\w*[^A-Z\s]\w*\s*)+', input)
print(parts);
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any sequential cluster of words which contains one or more letter which is not uppercase.
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall('((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']
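Put together as a self-contained snippet (reusing the string from the answers above), that might look like:
import re

s = 'HERE IS Test WHAT ARE WE SAYING'
# One or more all-caps words separated by whitespace; the word boundaries
# keep leading and trailing spaces out of the match.
print(re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b", s))
# ['HERE IS', 'WHAT ARE WE SAYING']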

Punctuation not detected between words with no space

How can I split text into sentences on punctuation (.?!) when the punctuation occurs between two words without a space?
Example:
>>> splitText = re.split("(?<=[.?!])\s+", "This is an example. Not working as expected.Because there isn't a space after dot.")
output:
['This is an example.',
 "Not working as expected.Because there isn't a space after dot."]
expected:
['This is an example.',
 'Not working as expected.',
 "Because there isn't a space after dot."]
splitText = re.split("[.?!]\s*", "This is an example. Not working as expected.Because there isn't a space after dot.")
+ is used for one or more of something, * for zero or more.
If you need to keep the punctuation, you probably don't want to split; instead you could do:
splitText = re.findall(".*?[.?!]", "This is an example. Not working as expected.Because there isn't a space after dot.")
which gives
['This is an example.',
' Not working as expected.',
"Because there isn't a space after dot."]
You can trim those by playing with the regex (e.g. '\s*.*?[.?!]') or just by using .strip().
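For example, a quick post-processing pass over the findall() result from above (this is just one way to do the stripping):
splitText = [part.strip() for part in splitText]
# ['This is an example.', 'Not working as expected.',
#  "Because there isn't a space after dot."]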
Use .*?[?.!] (see https://regex101.com/r/icrJNl/3/):
import re
from pprint import pprint
split_text = re.findall(".*?[?.!]", "This is an example! Working as expected?Because.")
pprint(split_text)
Note: .*? is a lazy (non-greedy) quantifier, as opposed to .*, which is greedy.
Output:
['This is an example!',
' Working as expected?',
'Because.']
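A small side-by-side sketch of the difference (using the same sentence as above):
import re
text = "This is an example! Working as expected?Because."
print(re.findall(".*?[?.!]", text))  # lazy: stops at the first terminator
# ['This is an example!', ' Working as expected?', 'Because.']
print(re.findall(".*[?.!]", text))   # greedy: runs to the last terminator
# ['This is an example! Working as expected?Because.']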
Another solution:
import re
from pprint import pprint
split_text = re.split("([?.!])", "This is an example! Working as expected?Because.")
pprint(split_text)
Output:
['This is an example',
'!',
' Working as expected',
'?',
'Because',
'.',
'']
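If you want complete sentences back from that split output, one possible follow-up (this pairing step is an addition, not part of the answer above) is to zip each text piece with the punctuation captured right after it:
import re
pieces = re.split("([?.!])", "This is an example! Working as expected?Because.")
# Pair every text chunk with the delimiter that follows it,
# dropping the trailing empty string.
sentences = [a.strip() + b for a, b in zip(pieces[0::2], pieces[1::2])]
print(sentences)
# ['This is an example!', 'Working as expected?', 'Because.']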

re.findall() isn't as greedy as expected - Python 2.7

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence should be in the list, but everything in the list does need to be a complete sentence. Below is the code that will illustrate the issue:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
Per this regex tester, I should, in theory, be getting a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I am actually getting is like this:
>>> [' World', ' speaking']
The documentation indicates that findall searches from left to right and that the * and + operators are handled greedily. Appreciate the help.
The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
It is easy to see what is going on using re.finditer() and exploring the match objects:
>>> import re
>>> text = "Hello World! This is your captain speaking."
>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)
>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
The solution to your problem is to make the group non-capturing with ?:. Then you get the expected results:
>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
You can change your regex somewhat:
>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']

Find all the words in string except one word in python with regex

I'm working with regex in python and I'd like to search for all the words in a string except one word.
Code:
import re
string = "The world is too big"
print re.findall("regex", string)
If I want to get all the words except the word "too" (so the output will be ["The", "world", "is", "big"]), how can I implement this with a regex?
You don't even need a regex for this task; simply use split and filter:
sentence = "The world is too big"
sentence = list(filter(lambda x: x != 'too', sentence.split()))
print(sentence)
Delete 'too' from the string, then split it:
re.sub(r'\btoo\b','',string).split()
Out[15]: ['The', 'world', 'is', 'big']
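If you specifically want a regex-based answer, a possible sketch (the lookahead approach here is an assumption, not taken from the answers above) is to match every word except the excluded one:
import re
string = "The world is too big"
# Match any word, but skip positions where the whole word is "too".
print(re.findall(r"\b(?!too\b)\w+", string))
# ['The', 'world', 'is', 'big']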
