Finding words in phrases using regular expression - python

I want to use a regular expression to find phrases that contain:
1 - One of the N words (any)
2 - All of the N words (all)
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
...     found = re.findall(reg, phrase)
...     print(found)
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems I am doing something wrong: in both cases I need to match a whole word, not just part of a word.
Can anyone point out the issue?

Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')
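To cover both cases from the question ("any" and "all"), here is a minimal sketch in Python 3. The helper names contains_any and contains_all and the extra "countryside" phrase are made up for illustration, and re.escape is added on the assumption that the words are literal text rather than regex fragments.

```python
import re

words = ["country", "place"]
phrases = [
    "This is an place",
    "France is a European country, and a wonderful place to visit",
    "Paris is a place, it s the capital of the country.side",
    "The countryside is quiet",  # illustrative extra phrase, not from the question
]

# One alternation wrapped in word boundaries: matches if ANY word occurs as a whole word.
any_re = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")


def contains_any(phrase):
    """True if at least one of the words occurs as a whole word."""
    return any_re.search(phrase) is not None


def contains_all(phrase):
    """True if every word occurs as a whole word, in any order."""
    return all(re.search(r"\b" + re.escape(w) + r"\b", phrase) for w in words)


for phrase in phrases:
    print(contains_any(phrase), contains_all(phrase), phrase)
```

"countryside" is rejected because there is no word boundary between "country" and "side", while "country.side" still counts because the dot is not a word character.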

Related

regex match word and what comes after it

I need some help with a regex I am writing. I have a list of words that I want to match, together with any words that come after them (words meaning [A-Za-z/\s]+), i.e. no parentheses, symbols, or numbers.
words = ['qtr','hard','quarter'] # keywords that must exist
test=['id:12345 cli hard/qtr Mix',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
expected output is
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
What I have tried so far
re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)
The problem with your pattern, '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the doubled square brackets. [[A-Za-z/\s]+] is parsed as the character class [[A-Za-z/\s] (which also contains a literal [) repeated one or more times, followed by a literal ], so it can only match when the text contains a ]. Note that \s inside square brackets still matches whitespace; a plain space is used below simply because it is enough for this data.
You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for the pattern in each string in the list. If a match is found, take group 0 and append it to the result list; otherwise, append None:
pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)
OUTPUT:
result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
Because the character class includes the space character, some values may end with a trailing space; you can strip the whitespace afterwards.
Idea extracted from the existing answer and made shorter :
>>> pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
As mentioned in a comment:
This is inefficient because it runs the same search twice per string, once for pattern.search(x).group(0) and once for if pattern.search(x), so a plain list comprehension is not the best fit here.
We can avoid the double search by feeding the match objects through a generator expression:
>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
You can put all the required words in an alternation (an "or" expression) and put your word definition after it:
import re
words = ['qtr','hard','quarter']
# Raw strings throughout, so that \s reaches the regex engine unchanged
regex = r"(" + "|".join(words) + r")[A-Za-z/\s]+"
p = re.compile(regex)
test=['id:12345 cli hard/qtr Mix(qtr',
      'id:12345 cli qtr 90%',
      'id:12345 cli hard (red)',
      'id:12345 cli hard work','Hello world']
for string in test:
    result = p.search(string)
    if result is not None:
        # Reuse the match object instead of searching a second time
        print(result.group(0))
    else:
        print(result)
Output:
hard/qtr Mix
qtr
hard
hard work
None
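Putting the suggestions from these answers together, a sketch that reproduces the expected output exactly might look like the following. It assumes the keywords should match only as whole words (hence the \b anchors, which the answers above do not use) and that they are literal text (hence re.escape); trailing whitespace is stripped afterwards.

```python
import re

words = ['qtr', 'hard', 'quarter']
test = ['id:12345 cli hard/qtr Mix',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work',
        'Hello world']

# Longest keywords first, so a longer keyword is preferred over a shorter one
# that is its prefix (not needed for this list, but a cheap safeguard).
alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

# Keyword as a whole word, followed by any run of letters, slashes and whitespace.
pattern = re.compile(r"\b(?:" + alternation + r")\b[A-Za-z/\s]*", re.IGNORECASE)

result = []
for x in test:
    m = pattern.search(x)
    # strip() removes the trailing space picked up before '90%' or '(red)'
    result.append(m.group(0).strip() if m else None)

print(result)
# ['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
```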

How to get all sentences that contain multiple words in Python

I am trying to write a regular expression to get all sentences containing two words (order doesn't matter), but I can't find a solution.
"Supermarket. This apple costs 0.99."
I want to get back the following sentence:
This apple costs 0.99.
I tried:
([^.]*?(apple)*?(costs)[^.]*\.)
I have problems because the price contains a dot. Also, this expression returns results that contain only one of the words.
Approach: for each phrase, find the sentences that contain all of its words. For each word in the phrase, check whether the sentence contains it, and repeat this for every sentence. The lookup is faster if the words of each sentence are stored in a set instead of a list.
Below is an implementation of the above approach in Python:
def getRes(sent, ph):
    sentHash = dict()
    # Loop for adding hashed sentences to sentHash
    for s in range(1, len(sent)+1):
        sentHash[s] = set(sent[s-1].split())
    # For each phrase
    for p in range(0, len(ph)):
        print("Phrase"+str(p + 1)+":")
        # Get the list of words
        wordList = ph[p].split()
        res = []
        # Then check in every sentence
        for s in range(1, len(sentHash)+1):
            wCount = len(wordList)
            # Every word in the phrase
            for w in wordList:
                if w in sentHash[s]:
                    wCount -= 1
            # If every word in the phrase matches
            if wCount == 0:
                # Add the (1-based) sentence index to the result list
                res.append(s)
        if len(res) == 0:
            print("NONE")
        else:
            print(' '.join(map(str, res)))

# Driver function
def main():
    sent = ["Strings are an array of characters",
            "Sentences are an array of words"]
    ph = ["an array of", "sentences are strings"]
    getRes(sent, ph)

main()
You use the negated character class [^.], which matches any character except a dot.
But in your example data, Supermarket. This apple costs 0.99., there are 2 dots before the dot at the end, so [^.]* cannot cross either of them: a match cannot start before the dot after Supermarket and cannot run through the price 0.99.
You could, for example, match up to and including the first dot, then assert costs with a lookahead and use a capture group to match the part that contains apple, making sure the line ends with a dot.
Because one word is asserted with a lookahead and the other is matched, the two words are found in either order.
^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$
Explanation
^[^.]*\. From the start of the string, match until and including the first dot
\s* Match zero or more whitespace characters
(?=.*\bcosts\b) Positive lookahead, assert costs at the right
( Capture group 1 (this has the desired value)
.*\bapple\b.*\. Match the rest of the line that includes apple and ends with a dot
) Close group 1
$ Assert end of string
import re
regex = r"^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$"
test_str = ("Supermarket. This apple costs 0.99.\n"
"Supermarket. This costs apple 0.99.\n"
"Supermarket. This apple is 0.99.\n"
"Supermarket. This orange costs 0.99.")
print(re.findall(regex, test_str, re.MULTILINE))
Output
['This apple costs 0.99.', 'This costs apple 0.99.']
I also suggest first extracting the sentences and then finding those that contain both words.
However, splitting text into sentences is hard because of abbreviations, unusual names, etc. One way to do it is with the nltk.tokenize.punkt module.
You'll need to install NLTK and then run this in Python:
import nltk
nltk.download('punkt')
After that you can use the English sentence tokenizer together with one regex search per word:
TEXT = 'Mr. Bean is in supermarket. iPhone 12 by Apple Inc. costs $999.99.'
WORD1 = 'apple'
WORD2 = 'costs'
import nltk.data, re
# Regex helper
find_word = lambda w, s: re.search(r'(^|\W)' + w + r'(\W|$)', s, re.I)
eng_sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for sent in eng_sent_detector.tokenize(TEXT):
    if find_word(WORD1, sent) and find_word(WORD2, sent):
        print(sent, "\n----")
Output:
iPhone 12 by Apple Inc. costs $999.99.
----
Notice that it handles numbers and abbreviations for you.
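Once the sentences have been separated (by NLTK or any other means), the "both words, any order" check can also be written as a single regex with one lookahead per word. The sketch below assumes the sentences are already split and each fits on one line; the helper name all_words_pattern is hypothetical.

```python
import re


def all_words_pattern(words):
    """Build a regex that matches a line containing every word, in any order."""
    # One lookahead per word: each is checked from the start of the string
    # and consumes nothing, so the order of the words does not matter.
    lookaheads = "".join(r"(?=.*\b" + re.escape(w) + r"\b)" for w in words)
    return re.compile("^" + lookaheads + ".*$", re.IGNORECASE)


# Sentences assumed to be already split (taken from the example data above).
sentences = [
    "Supermarket.",
    "This apple costs 0.99.",
    "This costs apple 0.99.",
    "This apple is 0.99.",
    "This orange costs 0.99.",
]

both = all_words_pattern(["apple", "costs"])
matching = [s for s in sentences if both.search(s)]
print(matching)
# ['This apple costs 0.99.', 'This costs apple 0.99.']
```

Since the dot in the price is just another character for .*, it no longer gets in the way once sentence splitting is handled separately.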

Python - How to use re.finditer with multiple patterns

I want to search for 3 words in a string and put them in a list,
something like:
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got' ]
and the answer should look like:
['got', 'which','had','got']
I haven't found a way to use re.finditer this way. Sadly, I'm required to use finditer rather than findall.
You can build the pattern from your list of searched words, then build your output list with a list comprehension from the matches returned by finditer:
import re
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got' ]
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
# the regex will be r'\b(had|which|got)\b'
out = [m.group() for m in regex.finditer(sentence)]
print(out)
# ['got', 'which', 'had', 'got']
The idea is to combine the entries of the pattern list into a single regular expression using alternation (|).
Then, you can use the following code fragment:
import re
sentence = 'Tom once got a bike which he had left outside in the rain so it got rusty. ' \
'Luckily, Margot and Chad saved money for him to buy a new one.'
pattern = ['had', 'which', 'got']
regex = re.compile(r'\b({})\b'.format('|'.join(pattern)))
# regex = re.compile(r'\b(had|which|got)\b')
results = [match.group(1) for match in regex.finditer(sentence)]
print(results)
The result is ['got', 'which', 'had', 'got'].
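One practical difference from findall is that finditer yields match objects, so the position of each hit is available as well. A small sketch of that, with re.escape added on the assumption that the search words are literal text:

```python
import re

sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got']

# Escape each word, join with |, and wrap the alternation in word boundaries.
regex = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in pattern) + r")\b")

# Each match object carries the matched text and where it starts in the sentence.
hits = [(m.group(), m.start()) for m in regex.finditer(sentence)]

for word, start in hits:
    print(word, "at index", start)
```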

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, because each character is consumed at most once. However, a lookahead (?=...) (i.e. "followed by") is only a check and consumes nothing, so the regex engine can test it at every position in the string. If you put a capturing group inside the lookahead, you can therefore obtain overlapping substrings.
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the matches, testing each one with this regex: (^[A-Z][a-z.-]+$), to ensure that both the current match and the previous one are capitalized words.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
# Start at 1 so that m[i - 1] never wraps around to the last word
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every second pair, because each match consumes both of its words. An easy solution is to run the search again after skipping the first word, like this:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2.
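A different way to get the overlapping pairs, without lookaheads or a second pass, is to split the title into words and pair each word with its neighbour. This is only a sketch: it assumes words are separated by whitespace (hyphen-joined words are not handled), and the definition of a "capitalized word" as [A-Z][a-z.]* is borrowed from the patterns above.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# A capitalized word: one uppercase letter, then lowercase letters or dots.
capitalized = re.compile(r"[A-Z][a-z.]*")

words = title.split()

# zip(words, words[1:]) yields every adjacent pair, so pairs naturally overlap.
pairs = [
    a + " " + b
    for a, b in zip(words, words[1:])
    if capitalized.fullmatch(a) and capitalized.fullmatch(b)
]

print(pairs)
# ['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js',
#  'Node.js And', 'And The', 'The Browser']
```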

Python - Don't Understand The Returned Results of This Concatenated Regex Pattern

I am a Python newb trying to get more understanding of regex. Just when I think I got a good grasp of the basics, something throws me - such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe joined list of nouns' + 'whitespace' - how come:
' eggs' was found with a space before it and not after it?
'bacon' was found with no spaces either side of it?
'donkey' was found with no spaces either side of it and the fact there is no whitespace after it?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joined list of nouns, so the first \s is part of the eggs alternative, the bacon and donkey alternatives have no spaces, and the dog alternative includes the final \s metacharacter.
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
Demo:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.
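If the surrounding spaces are not actually needed in the result, word boundaries may be a better fit than \s. The sketch below (Python 3) compares the two; the second test string is made up to show that \s consumes the space it matches, so nouns that share a single space cannot both be found, whereas \b consumes nothing.

```python
import re

text = "Some nouns like eggs egg bacon what a lovely donkey"
noun_list = ['eggs', 'bacon', 'donkey', 'dog']
alternatives = "|".join(re.escape(n) for n in noun_list)

spaced = re.compile(r"\s(?:" + alternatives + r")\s")   # whitespace on both sides
bounded = re.compile(r"\b(?:" + alternatives + r")\b")  # zero-width word boundaries

print(spaced.findall(text))    # [' eggs ', ' bacon ']
print(bounded.findall(text))   # ['eggs', 'bacon', 'donkey']

# Illustrative string: adjacent nouns share a single space.
adjacent = " eggs bacon donkey "
print(spaced.findall(adjacent))   # [' eggs ', ' donkey ']
print(bounded.findall(adjacent))  # ['eggs', 'bacon', 'donkey']
```

With \b, 'donkey' at the very end of the text is found too, because the end of the string counts as a boundary.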
