Python regex to match sentences that have n words

I'm looking for a Python regex that matches sentences with 3 words before a string.
For example, given the sentence "this is the test", I want to match it only if there are 3 words before the string test.
re.match(r'(\d\w+\d){3}test', "this is the test")
I thought the above would work, but it didn't.

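If you want to match strings with 3 words, separated by whitespace(s), followed by 'test': (?:\w+\s+){3}test. Note that (\b){3}test does not do this: \b is a zero-width assertion, so repeating it consumes no words.
If you want to extract the 3 words, separated by whitespace(s), followed by 'test': ((?:\w+\s+){3})test. With (\w+\s+){3}test the group keeps only the last repetition, i.e. the third word.
The \s+ after each word already allows any amount of whitespace before the stopword. The variant (\w+\s+){3}\w?test additionally allows one optional word character directly before 'test'.
If you want the same, but with the string ending with the stopword: (\w+\s+){3}test$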
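A minimal sketch of the "three words before test" check, wrapping the repetition in an outer group so that all three words are captured. The variable names and the second sample string are assumptions added for illustration.

```python
import re

# Outer group captures all three words; the inner group is non-capturing.
pattern = re.compile(r'((?:\w+\s+){3})test')

match = pattern.search("this is the test")
words = match.group(1).split() if match else []
print(words)  # ['this', 'is', 'the']

# Only two words precede 'test' here, so there is no match.
too_short = pattern.search("is the test")
print(too_short)  # None
```

Because re.search is used, a longer sentence would still match on its last three words before 'test'; re.match or a leading ^ would be needed to require exactly three words from the start of the string.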

Related

Split string from digits/number according to sentence length

I have cases where I need to separate letters/words from digits/numbers that are written consecutively, but only when the word is more than 3 characters long.
For example,
input
ferrari03
output must be:
ferrari 03
However, it shouldn't do any action for the followings:
fe03, 03fe, 03ferrari etc.
Can you help me with this one? I'm trying to do this without coding any logic, using only the re library in Python.
Using re.sub() we can try:
import re
inp = ["ferrari03", "fe03", "03ferrari", "03fe"]
output = [re.sub(r'^([A-Za-z]{3,})([0-9]+)$', r'\1 \2', i) for i in inp]
print(output) # ['ferrari 03', 'fe03', '03ferrari', '03fe']
Given an input word, the above regex matches only if the word begins with 3 or more letters and ends in 1 or more digits. In that case, we capture the letters and the digits in the \1 and \2 capture groups, respectively, and replace the match with the two groups separated by a space. If "more than 3" is meant strictly, change {3,} to {4,}.
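Since the question's threshold can be read as "at least 3" or "more than 3" letters, here is a small sketch that makes the minimum length a parameter. The helper name split_trailing_digits and its min_letters argument are hypothetical, not part of the original answer.

```python
import re

def split_trailing_digits(word, min_letters=3):
    # Hypothetical helper: insert a space between a leading run of at least
    # min_letters letters and a trailing run of digits; otherwise return
    # the word unchanged.
    pattern = r'^([A-Za-z]{%d,})([0-9]+)$' % min_letters
    return re.sub(pattern, r'\1 \2', word)

print(split_trailing_digits("ferrari03"))   # ferrari 03
print(split_trailing_digits("fe03"))        # fe03
print(split_trailing_digits("abc12", 4))    # abc12
```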

Removing the subsequent words after identified key words

I am trying out regular expressions and would like to find out if there is any way to remove immediate subsequent words after the word I have identified.
For example,
text = "This is the full sentence that I wish to apply regex on."
If I want to remove the word "full", I understand that I can do the following:
result = re.sub(r"full", "", text)
which would give me
This is the sentence that I wish to apply regex on.
Is there any way to get the following statement from my above text? (e.g. keep parts 5 words after the word "full" or remove the first 5 words after "full" )
apply regex on.
Match everything from the beginning of the string up to and including full, then repeat a group of \s\S+ 5 times to match 5 words, and replace the whole match with the empty string:
^.*?full(?:\s\S+){5}\s
https://regex101.com/r/H6tdCH/1
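Applied in Python with re.sub, the pattern above can be used as in the sketch below. The second part, which keeps only the five words after "full", is an assumed variation on the same idea rather than something from the answer.

```python
import re

text = "This is the full sentence that I wish to apply regex on."

# Remove everything up to "full" plus the five words that follow it.
result = re.sub(r'^.*?full(?:\s\S+){5}\s', '', text)
print(result)  # apply regex on.

# Variation: keep only the five words that follow "full".
m = re.search(r'full((?:\s\S+){5})', text)
next_five = m.group(1).strip() if m else ''
print(next_five)  # sentence that I wish to
```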

Match only list of words in string

I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks
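You can use ^...$ to anchor the pattern to the beginning (^) and end ($) of the string, so that the whole string must consist of the listed words. Also note you need to add (...) around your group of words for the {1,3} to apply to all of them; in your pattern the quantifier applies only to the last alternative: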
^((investment)|(property)|(something)|(else)){1,3}$
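A short sketch checking the anchored pattern against the strings from the question. Non-capturing groups are used here for brevity, and the fourth test string is an added assumption to show the upper bound of {1,3}.

```python
import re

# One to three of the listed words, and nothing else in the string.
pattern = re.compile(r'^(?:investment|property|something|else){1,3}$')

candidates = [
    "investmentproperty",
    "abcinvestmentproperty",
    "somethinginvestmentproperty",
    "investmentpropertysomethingelse",  # four words: exceeds {1,3}
]
results = {s: bool(pattern.match(s)) for s in candidates}
for s, ok in results.items():
    print(s, ok)
```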

How to extract numbers from string using regular expression in Python except when within brackets?

Here are my test strings:
Word word word; 123-125
Word word (1000-1000)
Word word word (1000-1000); 99-999
Word word word word
What regular expression should I use to extract only those numbers (format: \d+-\d+) that are not within brackets (here, 123-125 and 99-999)?
I've tried this:
(\d+-\d+)(?!\))
But besides 123-125 and 99-999, it also matches a truncated 1000-100 inside the brackets on the second and third lines.
Note that the last digit before the closing bracket is left out of the match.
I was trying to drop any match that is followed by a bracket, but it's only dropping one digit rather than the whole match! What am I missing here?
Any help will be greatly appreciated.
You can use a negative look-ahead to get only those values you need like this:
(?![^()]*\))(\d+-\d+)
The (?![^()]*\)) look-ahead checks that the current position is not followed by a closing round bracket with no other bracket in between, i.e. that the hyphenated numbers are not inside brackets.
Sample code:
import re
p = re.compile(r'(?![^()]*\))(\d+-\d+)')
test_str = "Word word word; 123-125\nWord word (1000-1000)\nWord word word (1000-1000); 99-999\nWord word word word"
print(p.findall(test_str))
Output of the sample program:
['123-125', '99-999']
Another way is to describe everything you don't want:
[^(\d]*(?:\([^)]*\)[^(\d]*)*
Then you can rely on something that is always true: the digits are always preceded by zero or more characters that are not digits, and by zero or more bracketed groups.
You only need to capture the digits in a group:
p = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')
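The advantage of this approach is that you don't need to test a lookahead at each position in the string, so it is a fast pattern. The drawback is that it consumes a little more memory, because the whole match produces longer strings. Note also that it only skips a bracketed group when an unbracketed number follows later in the string; on a string such as "Word word (1000-1000)" alone, the search restarts inside the brackets and returns 1000-1000.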
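The two patterns above can be compared side by side as in the following sketch. The names lookahead and consuming are illustrative, and the single-line string is an added test case: it shows that the consuming pattern, as written, can restart inside the brackets when no unbracketed number follows.

```python
import re

lookahead = re.compile(r'(?![^()]*\))(\d+-\d+)')
consuming = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')

test_str = ("Word word word; 123-125\n"
            "Word word (1000-1000)\n"
            "Word word word (1000-1000); 99-999\n"
            "Word word word word")

full_lookahead = lookahead.findall(test_str)
full_consuming = consuming.findall(test_str)
print(full_lookahead)   # ['123-125', '99-999']
print(full_consuming)   # ['123-125', '99-999']

# A string that contains only a bracketed range.
single = "Word word (1000-1000)"
single_lookahead = lookahead.findall(single)
single_consuming = consuming.findall(single)
print(single_lookahead)  # []
print(single_consuming)  # ['1000-1000']
```

If the input is processed one line at a time, the look-ahead version is therefore the safer of the two.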

Regular expression in Python sentence extractor

I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want to use it to select all of a sentence like 'Put 1.5 grams of powder in' where if powder was a key word it would get the whole sentence and not '5 grams of powder'
I am trying to figure out how to express that a sentence is between two occurrences of a period followed by a space. My new filter is:
def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However now I no longer print any sentences just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.
If you don't have to use an iterator, re.split would be a bit simpler for your use case (a custom definition of a sentence):
re.split(r'\.\s', text)
Note that the last element will still include the trailing period, or will be an empty string if the text ends with whitespace after the last period. To fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
Also have a look at the somewhat more general case in the answer to "Python - RegEx for splitting text into sentences (sentence-tokenizing)".
For a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize:
nltk.tokenize.sent_tokenize(text)
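Putting the re.split approach together with the keyword filter from the question, a minimal sketch might look like this. The helper name sentences_with_keywords and the sample text are assumptions made for illustration.

```python
import re

def sentences_with_keywords(text, keywords):
    # Hypothetical helper: a sentence ends at a period followed by
    # whitespace, so the period in "1.5" does not split the sentence.
    trimmed = re.sub(r'\.\s*$', '', text)      # drop the final period
    sentences = re.split(r'\.\s', trimmed)
    return [s for s in sentences if any(k in s for k in keywords)]

text = "Mix slowly. Put 1.5 grams of powder in. Stir well."
found = sentences_with_keywords(text, ["powder"])
print(found)  # ['Put 1.5 grams of powder in']
```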
Here you get it as an iterator. Works with my testcases. It considers a sentence to be anything (non-greedy) until a period, which is followed by either a space or the end of the line.
import re
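# A sentence starts at a word character and runs (non-greedily) to a period
# that is followed by a space or the end of a line.
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))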
If you are sure that . is used for nothing besides sentences delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
