Python regex to match sentences that have n words

I'm looking for a Python regex that matches sentences with 3 words before a given string.
For example, given the sentence "this is the test", I want a match if and only if there are 3 words before the string test.
re.match(r'(\d\w+\d){3}test', "this is the test")
I thought the above would work, but it didn't.

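If you want to match strings with 3 words, separated by whitespace, followed by 'test': (?:\w+\s+){3}test (a bare \b is a zero-width word boundary, so (\b){3} would not consume any words)
If you want to extract the 3 words, separated by whitespace, that are followed by 'test': (\w+\s+){3}test (the whole match holds the 3 words plus the stopword; the repeated group itself keeps only the last word)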
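If you want the same thing, but also want to allow one optional word character directly before the stopword (\w? matches a word character, not whitespace; the \s+ inside the group already absorbs any run of whitespace): (\w+\s+){3}\w?test
If you want to do the same, but require the string to end with the stopword: (\w+\s+){3}\w?test$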
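As a rough illustration of the idea above (a minimal sketch; the helper name has_three_words_before is made up for this example), note that re.match only matches at the start of the string, while re.search looks for the pattern anywhere in it:

```python
import re

# Three "word + whitespace" repetitions, then the literal stopword.
THREE_WORDS_THEN_TEST = re.compile(r'(?:\w+\s+){3}test')

def has_three_words_before(text):
    """Hypothetical helper: True if text starts with 3 words followed by 'test'."""
    return THREE_WORDS_THEN_TEST.match(text) is not None

print(has_three_words_before("this is the test"))  # True
print(has_three_words_before("is the test"))       # False
# The attempt from the question fails because \d matches digits, not letters.
print(re.match(r'(\d\w+\d){3}test', "this is the test"))  # None
```

Using THREE_WORDS_THEN_TEST.search instead of .match would also accept strings where more than 3 words precede the stopword.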

Related

Split string from digits/number according to sentence length

I have cases where I need to separate chars/words from digits/numbers that are written consecutively, but I need to do this only when the char/word length is more than 3.
For example,
input
ferrari03
output must be:
ferrari 03
However, it shouldn't do anything for the following:
fe03, 03fe, 03ferrari etc.
Can you help me with this? I'm trying to do this without coding any logic, using only the re library in Python.

Using re.sub() we can try:
import re

inp = ["ferrari03", "fe03", "03ferrari", "03fe"]
# Insert a space only when the whole word is 3+ letters followed by 1+ digits.
output = [re.sub(r'^([A-Za-z]{3,})([0-9]+)$', r'\1 \2', i) for i in inp]
print(output)  # ['ferrari 03', 'fe03', '03ferrari', '03fe']
Given an input word, the above regex will match should that word begin with 3 or more letters and end in 1 or more digits. In that case, we capture the letters and numbers in the \1 and \2 capture groups, respectively. We replace by inserting a separating space.
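The question says "more than 3" while the pattern above uses {3,} (3 or more). A small sketch with the minimum letter count as a parameter makes that choice explicit; the helper name split_trailing_digits and its min_letters argument are assumptions for illustration only:

```python
import re

def split_trailing_digits(word, min_letters=3):
    """Hypothetical helper: put a space between a leading run of at least
    min_letters letters and the trailing digits, else return word unchanged."""
    pattern = r'^([A-Za-z]{%d,})([0-9]+)$' % min_letters
    return re.sub(pattern, r'\1 \2', word)

print(split_trailing_digits("ferrari03"))             # ferrari 03
print(split_trailing_digits("fe03"))                  # fe03
print(split_trailing_digits("abc03", min_letters=4))  # abc03
```

Passing min_letters=4 gives the strict "more than 3" reading of the requirement.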

Removing the subsequent words after identified key words

I am trying out regular expressions and would like to find out if there is any way to remove the words immediately following a word I have identified.
For example,
text = "This is the full sentence that I wish to apply regex on."
If I want to remove the word "full", I understand that I can do the following:
result = re.sub(r"full", "", text)
which would give me
This is the sentence that I wish to apply regex on.
Is there any way to get the following from the text above (i.e. remove everything up to "full" plus the first 5 words after it, keeping only what follows)?
apply regex on.

Match everything from the beginning of the string up to full, then repeat the group \s\S+ 5 times to match the next 5 words, and replace the whole match with the empty string:
^.*?full(?:\s\S+){5}\s
https://regex101.com/r/H6tdCH/1
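In Python this could look roughly like the following sketch. The helper name drop_through_keyword and its parameters are made up here; the keyword is passed through re.escape in case it contains regex metacharacters:

```python
import re

def drop_through_keyword(text, keyword, n_words):
    """Hypothetical helper: remove everything up to keyword plus the next
    n_words whitespace-separated words (and the whitespace after them)."""
    pattern = r'^.*?%s(?:\s\S+){%d}\s' % (re.escape(keyword), n_words)
    return re.sub(pattern, '', text)

text = "This is the full sentence that I wish to apply regex on."
print(drop_through_keyword(text, "full", 5))  # apply regex on.
```

If the keyword does not occur, or fewer than n_words words follow it, the pattern does not match and the text comes back unchanged.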

Match only list of words in string

I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks

You can use ^...$ to anchor the match: ^ marks the beginning and $ the end of the string you want to match. Also note that you need to add (...) around your whole group of words so that {1,3} applies to all of them:
^((investment)|(property)|(something)|(else)){1,3}$
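A minimal Python sketch of the anchored pattern, built from a word list. The helper name only_listed_words and its max_count parameter are assumptions for illustration, and a non-capturing group is used because the individual captures are not needed:

```python
import re

def only_listed_words(text, words, max_count=3):
    """Hypothetical helper: True if text is made up of 1..max_count
    occurrences of the given words and nothing else."""
    alternation = '|'.join(re.escape(w) for w in words)
    pattern = r'^(?:%s){1,%d}$' % (alternation, max_count)
    return re.match(pattern, text) is not None

words = ["investment", "property", "something", "else"]
print(only_listed_words("investmentproperty", words))           # True
print(only_listed_words("abcinvestmentproperty", words))        # False
print(only_listed_words("somethinginvestmentproperty", words))  # True
```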

How to extract numbers from string using regular expression in Python except when within brackets?

Here are my test strings:
Word word word; 123-125
Word word (1000-1000)
Word word word (1000-1000); 99-999
Word word word word
What regular expression should I use to extract only those numbers (format: \d+-\d+) that are not within brackets (here, 123-125 and 99-999)?
I've tried this:
(\d+-\d+)(?!\))
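But it's matching:
Word word word; 123-125 (matches 123-125)
Word word (1000-1000) (matches 1000-100)
Word word word (1000-1000); 99-999 (matches 1000-100 and 99-999)
Word word word word (no match)
Note that the bracketed numbers are matched without their last digit, the one just before the closing bracket.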
I was trying to drop any match that is followed by a bracket, but it's only dropping one digit rather than the whole match! What am I missing here?
Any help will be greatly appreciated.

You can use a negative look-ahead to get only those values you need like this:
(?![^()]*\))(\d+-\d+)
The (?![^()]*\)) look-ahead checks that no closing round bracket can be reached from the current position without crossing another bracket, i.e. that the hyphenated numbers are not inside a bracketed group.
Sample code:
import re

# Python 3: a plain raw string replaces the Python 2-only ur'' prefix.
p = re.compile(r'(?![^()]*\))(\d+-\d+)')
test_str = "Word word word; 123-125\nWord word (1000-1000)\nWord word word (1000-1000); 99-999\nWord word word word"
print(p.findall(test_str))
Output of the sample program:
['123-125', '99-999']

Another way is to describe everything you don't want:
[^(\d]*(?:\([^)]*\)[^(\d]*)*
This works as an always-true prefix: digits outside brackets are always preceded by zero or more characters that are not digits, interleaved with bracketed groups.
You only need to capture the digits in a group:
p = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')
The advantage of this approach is that you don't need to test a lookahead at each position in the string, so it is a fast pattern. The drawback is that it consumes a little more memory, because each whole match is a longer string.

Regular expression in Python sentence extractor

I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want to use it to select all of a sentence like 'Put 1.5 grams of powder in', where, if powder were a key word, it would get the whole sentence and not '5 grams of powder'.
I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:
def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However, now I no longer print any sentences, just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.

If you don't have to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):
re.split(r'\.\s', text)
Note that the last sentence will include the trailing . or will be empty (if the text ends with whitespace after the last period). To fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
Also have a look at the somewhat more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing).
For a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize:
nltk.tokenize.sent_tokenize(text)
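Putting the re.split variant together with the keyword filter from the question might look like this sketch. The function names split_sentences and sentences_containing are assumptions, and the keyword test is a plain substring check:

```python
import re

def split_sentences(text):
    """Split on a period followed by whitespace, after trimming the final
    period (and any trailing whitespace) so the last piece is clean."""
    trimmed = re.sub(r'\.\s*$', '', text)
    return re.split(r'\.\s', trimmed)

def sentences_containing(text, keywords):
    """Hypothetical helper: keep sentences that contain any keyword."""
    return [s for s in split_sentences(text) if any(k in s for k in keywords)]

text = "Put 1.5 grams of powder in. Stir well. "
print(split_sentences(text))                   # ['Put 1.5 grams of powder in', 'Stir well']
print(sentences_containing(text, ["powder"]))  # ['Put 1.5 grams of powder in']
```

The period in 1.5 is not followed by whitespace, so it does not split the sentence.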

Here you get it as an iterator. It works with my test cases. It considers a sentence to be anything (non-greedy) up to a period that is followed by either a space or the end of the line.
import re

# Raw string so that \w and \. reach the regex engine unchanged.
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))

If you are sure that . is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
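A sketch of the same idea with the keyword list passed in rather than hard-coded. The function name sentences_with_keywords is made up, the keywords are escaped with re.escape, and leading whitespace is stripped from each match. It relies on the same assumption as above, namely that periods only ever end sentences:

```python
import re

def sentences_with_keywords(text, keywords):
    """Hypothetical helper: return each period-terminated sentence that
    contains one of the keywords (assumes '.' only ends sentences)."""
    alternation = '|'.join(re.escape(k) for k in keywords)
    pattern = re.compile(r'[^.]*?(?:%s).*?\.' % alternation)
    return [m.group().strip() for m in pattern.finditer(text)]

text = "Add the powder slowly. Wait ten minutes. Then add sugar."
print(sentences_with_keywords(text, ["powder", "sugar"]))
# ['Add the powder slowly.', 'Then add sugar.']
```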
