Match shortest substring in list using for loop - python

I am trying to match items (single words) from one list with items (full sentences) from a second list. This is my code:
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        if word in line:
            print(word,line)
The problem now is that my code outputs substrings, so when looking for a sentence in which 'Python' occurs, I am also getting 'Pythons'; similarly, I am getting 'Funny' when I only want the sentence containing the word 'Fun'.
I have tried adding spaces surrounding the words in the list, but this is not an ideal solution, because the sentences may contain punctuation, and the code does not return a match.
Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice

Since you want exact matches, it'd be better to use == instead of in.
import string
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word: # strip removes any punctuation from both ends of wrd
                print(word,line)

Retrieving "Fun!" for Fun while not retrieving "Pythons" for Python is not as easy (it requires more lines of code). It can be done, of course, but your rules are not very clear to me at this point. Have a look at this, though:
tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
Below you get exactly the same thing, only this time with good old for loops instead of a list comprehension. I thought it might help you understand the code above more easily.
a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
What is happening here is that the code returns the token it found together with the phrase in which it found it. It does this by taking each phrase in the sentences list and splitting it on whitespace. So "Pythons" and "Python" are not treated as the same, as you wanted, but neither are "Fun!" and "Fun". This is also case sensitive.
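To see why "Fun!" is missed, it can help to look at what split() actually produces. The following is only a small sketch; the variable names raw_words and clean_words are made up for illustration.

```python
import string

phrase = "That's Fun!"

# Splitting on whitespace keeps punctuation attached to the words.
raw_words = phrase.split()
print(raw_words)

# Stripping punctuation from both ends of each word recovers 'Fun'.
# Punctuation inside a word (the apostrophe in "That's") is left alone.
clean_words = [w.strip(string.punctuation) for w in raw_words]
print(clean_words)

print('Fun' in raw_words)    # membership test on the raw split
print('Fun' in clean_words)  # membership test after stripping
```

The first membership test fails and the second succeeds, which is the difference between this answer and the one that strips punctuation.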

You might want to use dynamically generated regular expressions, i.e. for "Python" the regexp will look like r'\bPython\b', where '\b' is a word boundary. Note the raw string (in a normal string literal '\b' is a backspace character) and the use of search rather than match, since match only looks at the start of the line.
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
import re
for word in tokens:
    regexp = re.compile(r'\b' + re.escape(word) + r'\b')
    for line in sentences:
        if regexp.search(line):
            print(word,line)
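Two details matter with the word-boundary approach: the pattern has to be built from raw strings, and the line has to be scanned with search rather than match. A minimal sketch that collects the pairs instead of printing them directly (the helper name whole_word_matches is hypothetical, not from the question):

```python
import re

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

def whole_word_matches(tokens, sentences):
    """Return (token, sentence) pairs where token occurs as a whole word."""
    pairs = []
    for word in tokens:
        # re.escape guards against tokens that contain regex metacharacters.
        pattern = re.compile(r'\b' + re.escape(word) + r'\b')
        for line in sentences:
            if pattern.search(line):  # search scans the whole line
                pairs.append((word, line))
    return pairs

matches = whole_word_matches(tokens, sentences)
for word, line in matches:
    print(word, line)

# In a non-raw literal, '\b' is the backspace character, not a word boundary.
print('\b' == '\x08')
```

With the sample data this should give the three pairs from the desired output, and neither "Pythons" nor "Who's Funny" is reported.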

Tokenizing the sentence is better than splitting it on spaces, since a tokenizer separates punctuation from the words.
For example:
>>> sentence = 'this is a test.'
>>> 'test' in sentence.split(' ')
False
>>> nltk.word_tokenize(sentence)
['this', 'is', 'a', 'test', '.']
Code:
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
import nltk
for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token,sentence)
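If installing nltk is not an option, the same idea can be approximated with the standard library. The sketch below is an assumption-laden stand-in, not nltk's algorithm: simple_tokenize is a hypothetical helper that treats runs of word characters and single punctuation marks as tokens, so it splits contractions differently from nltk.word_tokenize.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a tokenizer: word-character runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

example = simple_tokenize('this is a test.')
print(example)

# Same nested loop as above, written as a list comprehension.
pairs = [(t, s) for s in sentences for t in tokens if t in simple_tokenize(s)]
print(pairs)
```

Because the outer loop runs over sentences, the pairs come out in sentence order rather than token order.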

Related

Regex - question about finding every word in a string that begins with a letter

import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. I got a bit side-tracked and wanted to see if I could get a list of all the words in the string passed to random_regex.findall() that begin with a letter, so I wrote ^\w for the regex pattern. However, for some reason my output only prints "R" and not the rest of the string. Would anyone be able to explain why and tell me how to fix this?
import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
A regex find all should work here:
import re
inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words) # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
^ anchors the match to the start of the string and \w matches a single character, so the pattern only finds the "R" of RoboCop. Use \w+ to get all of the words. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
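To make the difference between the patterns concrete, here is a short comparison on the same string. The last pattern is an extra illustration that is not part of the question: it picks out only the words starting with a particular letter.

```python
import re

text = 'RoboCop eats baby food. BABY FOOD.'

# ^ anchors to the start of the string, \w is a single character.
start_only = re.findall(r'^\w', text)
# \b\w gives the first character of every word.
first_chars = re.findall(r'\b\w', text)
# \w+ gives every whole word.
words = re.findall(r'\w+', text)
# Words that start with b or B (the b inside RoboCop is not at a word boundary).
b_words = re.findall(r'\b[bB]\w*', text)

print(start_only)
print(first_chars)
print(words)
print(b_words)
```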

Regex: Only want space character before and after match

I am using Regex tokenizer for a text passage, and I would like to extract all words that only have white space before and after that. Here is my code:
tokenizer = RegexpTokenizer('[0-9a-z][^\s\']*[a-z]')
For instance, the sentence "we don't have 500 dollars" will end up becoming "we don have dollars". I would like to have "don" eliminated, since it is not followed by whitespace. How do I do so?
You can use positive lookahead and lookbehind to achieve this.
Code:
import re
pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"
print(re.findall(pattern, "we don't have 500 dollars"))
print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))
Output:
['we', 'have', '500', 'dollars']
['Your', 'no', 'good', 'Torrance']
You can play around with this here
https://regex101.com/r/IeLC88/3
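An alternative worth considering is to phrase the same condition negatively: the token must not be preceded or followed by a non-whitespace character. This is a sketch of that variant, not the answerer's pattern; (?<!\S) and (?!\S) also succeed at the start and end of the string, so no separate ^ and $ branches are needed.

```python
import re

# A run of letters/digits with no non-space character directly before or after it.
pattern = re.compile(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")

first = pattern.findall("we don't have 500 dollars")
second = pattern.findall("Your money's no good here, Mr. Torrance")
print(first)
print(second)
```

On the two sample sentences this should return the same lists as the lookahead/lookbehind version above.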

How to treat certain words as delimiters in nltk Python?

I'm trying to tokenize the below text with stopwords('is', 'the', 'was') as delimiters
The expected output is this:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I wrote to try to produce the above output:
import nltk
stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
My code output is this:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I make the expected output?
So the problem involves both stopwords and sentence delimiters. Assuming that a sentence ends with the symbol ., you can split on several delimiters at once by using re.split().
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
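Because the string ends with a delimiter (the final .), the split returns an additional '' at the end. Note also that 'the' survives in the last item: the space before it was already consumed by the " is " match, so " the " cannot match there. Assuming that the structure of the sentences is consistent, you can slice the result to drop the trailing empty string.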
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
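One way to also drop stopwords that appear back to back (such as "is the") is to let a single match swallow a whole run of them. The following is a sketch under the assumption that stopwords are always surrounded by whitespace and never start the text; the names stop_run and parts are made up for illustration.

```python
import re

stopwords = ['is', 'the', 'was']
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

# One or more stopwords in a row, each preceded by whitespace, then trailing whitespace.
stop_run = r"(?:\s+(?:" + "|".join(map(re.escape, stopwords)) + r"))+\s+"
# Also split on a period and any whitespace that follows it.
pattern = re.compile(stop_run + r"|\.\s*")

# Filter out the empty string produced by the final period.
parts = [p for p in pattern.split(s) if p]
print(parts)
```

Because the pattern requires whitespace on both sides of each stopword, words that merely contain one (for example "this" or "island") are not split.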

How to put key-words in NLTK tokenize?

Input: "My favorite game is call of duty."
I want to set "call of duty" as a keyword, so that this phrase is treated as one token during tokenization.
The result I want is: ['my','favorite','game','is','call of duty']
So, how do I set such keywords in Python NLP?
I think what you want is keyphrase extraction. You can do it, for instance, by first tagging each word with its PoS tag and then applying some sort of regular expression over the PoS tags to join interesting words into keyphrases.
import nltk
from nltk import pos_tag
from nltk import tokenize
def extract_phrases(my_tree, phrase):
    # Recursively collect every subtree whose label equals phrase
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))
    for child in my_tree:
        if isinstance(child, nltk.Tree):
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)
    return my_phrases
def main():
    sentences = ["My favorite game is call of duty"]
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)
    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([x[0] for x in phrase.leaves()]))
if __name__ == "__main__":
    main()
This will output the following:
Noun phrases:
(NP favorite/JJ game/NN) favorite_game
(NP call/NN) call
(NP duty/NN) duty
But you can play around with
grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
trying other types of expressions, so that you can get exactly what you want, depending on the words/tags you want to join together.
Also if you are interested, check this very good introduction to keyphrase/word extraction:
https://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/
This is, of course, way too late to be useful to the OP, but I thought I'd put this answer here for others:
It sounds like what you might be really asking is: How do I make sure that compound phrases like 'call of duty' get grouped together as one token?
You can use nltk's multiword expression tokenizer, like so:
import nltk
string = 'My favorite game is call of duty'
tokenized_string = nltk.word_tokenize(string)
mwe = [('call', 'of', 'duty')]
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ') # the default separator is '_'
tokenized_string = mwe_tokenizer.tokenize(tokenized_string)
where mwe stands for multi-word expression. The value of tokenized_string will be ['My', 'favorite', 'game', 'is', 'call of duty'].
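To illustrate what a multi-word expression tokenizer does conceptually, here is a dependency-free sketch of the merging step. It is not nltk's implementation; merge_phrases is a hypothetical helper that greedily joins any run of tokens equal to one of the given phrases.

```python
def merge_phrases(tokens, phrases, separator=' '):
    """Greedily join any run of tokens that equals one of the phrases."""
    # Try longer phrases first so the longest match wins; skip empty phrases.
    ordered = sorted((tuple(p) for p in phrases if p), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for phrase in ordered:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                merged.append(separator.join(phrase))
                i += n
                break
        else:
            # No phrase starts at position i: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = 'My favorite game is call of duty'.split()
result = merge_phrases(tokens, [('call', 'of', 'duty')])
print(result)
```

The separator argument plays the same role as in the nltk version: pass '_' to get call_of_duty instead.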

find best approach to recognize list of sequence word in sentence

I have two lists of words that I would like to find in a sentence, based on their order. Is it possible to use a regular expression for this, or should I check the sentence with if conditions?
n_ali = set(['ali','aliasghar'])
n_leyla = set(['leyla','lili','leila'])
positive_adj = set(['good','nice','handsome'])
negative_adj = set(['bad','hate','lousy'])
sentence = "aliasghar is nice man. ali is handsome man of my life. lili has so many bad attitude who is next to my friend. "
I would like to find any pattern as below:
n_ali + positive_adj
n_ali + negative_adj
n_leyla + positive_adj
n_leyla + negative_adj
I am using Python 3.5 in VS2015 and I am new to NLTK. I know how to create a regular expression to check for a single word, but I am not sure what the best approach is for a list of similar names. Please suggest the best way to implement this.
You should consider removing stopwords.
import nltk
from nltk.corpus import stopwords
>>> words = [word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')]
>>> words
['aliasghar', 'nice', 'man', '.', 'ali', 'handsome', 'man', 'life', '.', 'lili', 'many', 'bad', 'attitude', 'next', 'friend', '.']
Alright, now you have the data like you want it (mostly). Let's use simple looping to store the results in pairs for ali and leila separately.
>>> ali_adj = []
>>> leila_adj = []
>>> for i, word in enumerate(words[:-1]):
...     if word in n_ali and (words[i+1] in positive_adj.union(negative_adj)):
...         ali_adj.append((word, words[i+1]))
...     if word in n_leyla and (words[i+1] in positive_adj.union(negative_adj)):
...         leila_adj.append((word, words[i+1]))
...
>>>
>>> ali_adj
[('aliasghar', 'nice'), ('ali', 'handsome')]
>>> leila_adj
[]
Note that we could not find any adjectives to describe leila because "many" isn't a stopword. You may have to do this type of cleaning of the sentence manually.
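Instead of cleaning the sentence by hand, one option is to allow a few intervening words between the name and the adjective. The sketch below uses only the standard library and rests on two assumptions of mine: sentences end at ., ! or ?, and an adjective within four words after the name belongs to that name. The helper name name_adjective_pairs and the window parameter are hypothetical.

```python
import re

n_ali = {'ali', 'aliasghar'}
n_leyla = {'leyla', 'lili', 'leila'}
adjectives = {'good', 'nice', 'handsome', 'bad', 'hate', 'lousy'}

def name_adjective_pairs(text, names, adjectives, window=4):
    """Pair each name with the first adjective within `window` words after it."""
    pairs = []
    # Work sentence by sentence so a pair never crosses a sentence boundary.
    for sent in re.split(r'[.!?]', text.lower()):
        words = re.findall(r'\w+', sent)
        for i, word in enumerate(words):
            if word in names:
                for nxt in words[i + 1:i + 1 + window]:
                    if nxt in adjectives:
                        pairs.append((word, nxt))
                        break
    return pairs

sentence = ("aliasghar is nice man. ali is handsome man of my life. "
            "lili has so many bad attitude who is next to my friend.")
ali_pairs = name_adjective_pairs(sentence, n_ali, adjectives)
leyla_pairs = name_adjective_pairs(sentence, n_leyla, adjectives)
print(ali_pairs)
print(leyla_pairs)
```

With a window of four words, "lili ... bad" is picked up even though "has so many" sits in between; a smaller window would miss it.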
