Join negative particle - python

How can I attach a negative particle to the next word in Python (for any text in which "not" can occur)?
For example, turn this string: ['This is not apple']
into this: ['This is not_apple']

You can use a regular expression:
\bnot\s+(?=\w) matches the word not (the word boundary keeps it from matching inside other words ending in not, such as cannot), followed by one or more spaces, with a lookahead that requires another word character to follow without consuming it.
import re
s = 'This is not apple'
s2 = re.sub(r'\bnot\s+(?=\w)', 'not_', s)
output: 'This is not_apple'
With a list:
import re
l = ['This is not apple']
[re.sub(r'\bnot\s+(?=\w)', 'not_', s) for s in l]
output: ['This is not_apple']
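If you also need to handle other negative particles (the question mentions "no" as well), a minimal sketch could keep the matched particle in a group and reuse it in the replacement; the particle list here is an assumption:
import re

# Sketch: handle several negative particles; the list is an assumption.
particles = ("not", "no")
pattern = re.compile(r'\b(' + '|'.join(particles) + r')\s+(?=\w)')

texts = ['This is not apple', 'There is no apple']
print([pattern.sub(r'\1_', t) for t in texts])
# ['This is not_apple', 'There is no_apple']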

Related

Regex only returning first part of match

I'd like to extract the cat and another mat from this sentence:
>>> text = "the cat sat on another mat"
>>>
>>> re.findall('(the|another)\s+\w+', text)
['the', 'another']
But it doesn't return 'the cat' and 'another mat'. If I change it to re.findall('another\s+\w+', text) then it finds that part, so why doesn't the (first thing|second thing) alternation work?
(Using Python's re module)
I would do
import re
text = "the cat sat on another mat"
re.findall('the\s+\w+|another\s+\w+', text)
The result should be
>>> ['the cat', 'another mat']
When the pattern contains a capture group, re.findall returns only the substrings matched by that group. In this case, use a non-capturing group instead so that re.findall returns the entire matches:
re.findall('(?:the|another)\s+\w+', text)
This returns:
['the cat', 'another mat']
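As a side note (this illustrates the rule above, not something from the question): if the pattern contains two capturing groups, re.findall returns a list of tuples, one element per group, rather than the full matches:
import re
text = "the cat sat on another mat"
# Two capturing groups: findall returns tuples of the captured parts.
print(re.findall(r'(the|another)\s+(\w+)', text))
# [('the', 'cat'), ('another', 'mat')]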

Using regex to find all phrases that are completely capitalized

I want to use a regex to match all substrings that are completely capitalized, including the spaces.
Right now I am using this regex: \w*[A-Z]\s]
HERE IS Test WHAT ARE WE SAYING
Which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match all substrings that are in all caps, so that it returns:
HERE IS
WHAT ARE WE SAYING
You can use word boundaries \b and [^\s] to prevent the match from starting or ending with a space. Put together, it might look like:
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
pattern = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
pattern.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall('[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches any run of spaces or capital letters that is not followed by a lowercase letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
input = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split('\s*(?:\w*[^A-Z\s]\w*\s*)+', input)
print(parts);
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any sequential cluster of words which contains one or more letter which is not uppercase.
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall('((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']
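Put together as a self-contained snippet (reusing the string from the answers above), that might look like:
import re

s = 'HERE IS Test WHAT ARE WE SAYING'
# One or more all-caps words separated by whitespace; the word boundaries
# keep leading and trailing spaces out of the match.
print(re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b", s))
# ['HERE IS', 'WHAT ARE WE SAYING']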

Punctuation not detected between words with no space

How can I split text into sentences on punctuation (.?!) when the punctuation occurs between two words without a space?
Example:
>>> splitText = re.split("(?<=[.?!])\s+", "This is an example. Not working as expected.Because there isn't a space after dot.")
output:
['This is an example.',
 "Not working as expected.Because there isn't a space after dot."]
expected:
['This is an example.',
 'Not working as expected.',
 "Because there isn't a space after dot."]
splitText = re.split("[.?!]\s*", "This is an example. Not working as expected.Because there isn't a space after dot.")
+ is used for one or more of something, * for zero or more.
If you need to keep the punctuation, you probably don't want to split; instead you could do:
splitText = re.findall(".*?[.?!]", "This is an example. Not working as expected.Because there isn't a space after dot.")
which gives
['This is an example.',
' Not working as expected.',
"Because there isn't a space after dot."]
You can trim those by playing with the regex (e.g. '\s*.*?[.?!]') or just by using .strip().
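For example, a quick post-processing pass over the findall() result from above (this is just one way to do the stripping):
splitText = [part.strip() for part in splitText]
# ['This is an example.', 'Not working as expected.',
#  "Because there isn't a space after dot."]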
Use .*?[?.!] (see https://regex101.com/r/icrJNl/3/):
import re
from pprint import pprint
split_text = re.findall(".*?[?.!]", "This is an example! Working as expected?Because.")
pprint(split_text)
Note: .*? is a lazy (non-greedy) quantifier, as opposed to .*, which is greedy.
Output:
['This is an example!',
' Working as expected?',
'Because.']
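A small side-by-side sketch of the difference (using the same sentence as above):
import re
text = "This is an example! Working as expected?Because."
print(re.findall(".*?[?.!]", text))  # lazy: stops at the first terminator
# ['This is an example!', ' Working as expected?', 'Because.']
print(re.findall(".*[?.!]", text))   # greedy: runs to the last terminator
# ['This is an example! Working as expected?Because.']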
Another solution:
import re
from pprint import pprint
split_text = re.split("([?.!])", "This is an example! Working as expected?Because.")
pprint(split_text)
Output:
['This is an example',
'!',
' Working as expected',
'?',
'Because',
'.',
'']
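If you want complete sentences back from that split output, one possible follow-up (this pairing step is an addition, not part of the answer above) is to zip each text piece with the punctuation captured right after it:
import re
pieces = re.split("([?.!])", "This is an example! Working as expected?Because.")
# Pair every text chunk with the delimiter that follows it,
# dropping the trailing empty string.
sentences = [a.strip() + b for a, b in zip(pieces[0::2], pieces[1::2])]
print(sentences)
# ['This is an example!', 'Working as expected?', 'Because.']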

re.findall() isn't as greedy as expected - Python 2.7

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence should be in the list, but everything in the list does need to be a complete sentence. Below is the code that will illustrate the issue:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
Per this regex tester, I should, in theory, be getting a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I am actually getting is like this:
>>> [' World', ' speaking']
The documentation indicates that findall searches from left to right and that the * and + operators are handled greedily. Appreciate the help.
The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
It is easy to see what is going on using re.finditer() and exploring the match objects:
>>> import re
>>> text = "Hello World! This is your captain speaking."
>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)
>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
The solution to your problem is to make the group non-capturing with ?:. Then you get the expected results:
>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
You can change your regex somewhat:
>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']

Find all the words in string except one word in python with regex

I'm working with regex in python and I'd like to search for all the words in a string except one word.
Code:
import re
string = "The world is too big"
print re.findall("regex", string)
If I want to get all the words except the word "too" (so the output will be ["The", "world", "is", "big"]), how can I implement this with a regex?
You don't even need a regex for this task; simply use split and filter:
sentence = "The world is too big"
sentence = list(filter(lambda x: x != 'too', sentence.split()))
print(sentence)
Delete 'too' from the string, then split it:
re.sub(r'\btoo\b','',string).split()
Out[15]: ['The', 'world', 'is', 'big']
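If you specifically want a regex-based answer, a possible sketch (the lookahead approach here is an assumption, not taken from the answers above) is to match every word except the excluded one:
import re
string = "The world is too big"
# Match any word, but skip positions where the whole word is "too".
print(re.findall(r"\b(?!too\b)\w+", string))
# ['The', 'world', 'is', 'big']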
