I have the same problem that was discussed in this link: Python extract sentence containing word, but the difference is that I want to find 2 words in the same sentence. I need to extract sentences from a corpus that contain 2 specific words. Could anyone help me, please?
If this is what you mean:
import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
define_words = 'some apple'
print(re.findall(r"([^.]*?%s[^.]*\.)" % define_words, txt))

Output: [" Let's go buy some apples."]
You can also try it with user input:

define_words = input("Enter string: ")
Check whether each sentence contains all of the defined words:

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
words = 'go apples'.split(' ')
sentences = re.findall(r"([^.]*\.)", txt)
for sentence in sentences:
    if all(word in sentence for word in words):
        print(sentence)
This would be simple using the TextBlob package together with Python's built-in sets.
Basically, iterate through the sentences of your text, and check whether there exists an intersection between the set of words in the sentence and your search words.
from textblob import TextBlob

search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")
matches = []
for sentence in blob.sentences:
    words = set(sentence.words)
    if search_words & words:  # intersection
        matches.append(str(sentence))
print(matches)
# ["Let's go buy some apples."]
Update:
Or, more Pythonically,
from textblob import TextBlob

search_words = set(["buy", "apples"])
blob = TextBlob("I like to eat apple. Me too. Let's go buy some apples.")
matches = [str(s) for s in blob.sentences if search_words & set(s.words)]
print(matches)
# ["Let's go buy some apples."]
I think you want an answer using NLTK. And I guess those 2 words don't need to be consecutive, right?
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "I like to eat apple. Me too. Let's go buy some apples."
>>> words = ['like', 'apple']
>>> sentences = sent_tokenize(text)
>>> for sentence in sentences:
...     if all(word in sentence for word in words):
...         print(sentence)
...
I like to eat apple.
I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that contain the word 'van'. I can't get sentences with the phrases 'blue tinge' or 'off the road' because the code below can only handle single keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk
nltk.download('punkt')

search_words = set(["off the road", "blue tinge", "van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")
matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)

Output: ["Oh, I'd use the money for a van."]
If you want to check for an exact match of the search keywords, this can be accomplished using:

from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)

The output is:
['That is the off the road vehicle I had in mind for my adventure.',
 "Oh, I'd use the money for a van.",
 'The one with the blue tinge.']
If you want partial matching, then use fuzzywuzzy for percentage matching.
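fuzzywuzzy is a third-party package; if you'd rather stay in the standard library, difflib can approximate the same idea. A minimal sketch, where the sliding window loosely mimics fuzzywuzzy's partial_ratio and the 0.8 threshold is an arbitrary choice, not a library default:

```python
from difflib import SequenceMatcher

def partial_match(keyword, sentence, threshold=0.8):
    # Slide a keyword-sized window of words over the sentence and keep
    # the best similarity ratio against the keyword.
    words = sentence.split()
    size = len(keyword.split())
    best = 0.0
    for i in range(max(1, len(words) - size + 1)):
        window = " ".join(words[i:i + size])
        best = max(best, SequenceMatcher(None, keyword.lower(), window.lower()).ratio())
    return best >= threshold

print(partial_match("blue tinge", "The one with the blue tinge."))  # True
print(partial_match("van", "I like apples."))                       # False
```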
So there's a text, and phrases taken from this text whose punctuation marks we need to recover:
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
The output I need is:
output = [['apples, and donuts'], ['a donut, i would']]
I'm a beginner, so I was thinking about using .replace(), but I don't know how to slice a string and access the exact part I need from the text. Could you help me with that? (I'm not allowed to use any libraries.)
You can try a regex for that:
import re
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
print([re.findall(i[0].replace(" ", r"\W*"), text) for i in phrases])
Output
[['apples, and donuts'], ['a donut, i would']]
By iterating over the phrases list and replacing each space with \W*, the regex findall method can detect the search phrase while ignoring the punctuation.
You can remove all the punctuation from the text and then use a plain substring search. Your only problem then is how to map the text you've found back to the original.
You can do that by remembering, for each character kept in the cleaned text, its original position in the text. Here's an example. I removed the nested list around each phrase as it looks superfluous; you can easily account for it if you need to.
from pprint import pprint

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = ['apples and donuts', 'a donut i would']

def find_phrase(text, phrases):
    clean_text, indices = prepare_text(text)
    res = []
    for phr in phrases:
        i = clean_text.find(phr)
        if i != -1:
            # Map the match in the cleaned text back to the original span.
            res.append(text[indices[i] : indices[i + len(phr) - 1] + 1])
    return res

def prepare_text(text, punctuation='.,;!?'):
    # Strip punctuation, remembering each kept character's original index.
    s = ''
    ind = []
    for i in range(len(text)):
        if text[i] not in punctuation:
            s += text[i]
            ind.append(i)
    return s, ind

if __name__ == "__main__":
    pprint(find_phrase(text, phrases))

['apples, and donuts', 'a donut, i would']
I'm trying to identify all sentences that contain in-text citations in a journal article in PDF format.
I converted the .pdf to .txt and want to find all sentences that contain a citation, possibly in one of the following formats:
Smith (1990) stated that....
An agreement was made on... (Smith, 1990).
An agreement was made on... (April, 2005; Smith, 1990)
Mixtures of the above
I first tokenized the txt into sentences:
import nltk
from nltk.tokenize import sent_tokenize
ss = sent_tokenize(text)
This makes type(ss) a list, so I converted the list into a str to use re.findall:

def listtostring(s):
    str1 = ' '
    return str1.join(s)

ee = listtostring(ss)
Then, my idea was to identify the sentences that contained a four-digit number:

import re

for sentence in ee:
    zz = re.findall(r'\d{4}', ee)
    if zz:
        print(zz)
However, this extracts only the years, not the sentences that contain them.
Using regex, a pattern that can have decent recall while avoiding inappropriate matches (\d{4} alone may give you a few) is:
\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)
A Python example (using spaCy instead of NLTK) would then be:

import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not.")
l = [sent.text for sent in doc.sents]
for sentence in l:
    if re.findall(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)', sentence):
        print(sentence)
import re

l = ['This is 1234', 'Hello', 'Also 1234']
for sentence in l:
    if re.findall(r'\d{4}', sentence):
        print(sentence)

Output:
This is 1234
Also 1234
I have used the following code to extract sentences from a file (each sentence should contain some or all of the search keywords):

search_keywords = ['mother', 'sing', 'song']
with open('text.txt', 'r') as in_file:
    text = in_file.read()
sentences = text.split(".")
for sentence in sentences:
    if all(word in sentence for word in search_keywords):
        print(sentence)
The problem with the above code is that it does not print the required sentence if one of the search keywords does not match the sentence's words. I want code that prints sentences containing some or all of the search keywords. It would be great if the code could also search for a phrase and extract the corresponding sentence.
It seems like you want to count the number of search_keywords in each sentence. You can do this as follows:

sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords = ['mother', 'sing', 'song']
for sentence in sentences:
    print("{} key words in sentence:".format(sum(1 for word in search_keywords if word in sentence)))
    print(sentence + "\n")

# Outputs:
# 2 key words in sentence:
# My name is sing song
#
# 1 key words in sentence:
#  I am a mother
#
# 0 key words in sentence:
#  I am happy
#
# 2 key words in sentence:
#  You sing like my mother
Or, if you only want the sentence(s) with the most matching search_keywords, you can build a dictionary and find the maximum value:

dct = {}
for sentence in sentences:
    dct[sentence] = sum(1 for word in search_keywords if word in sentence)
best_sentences = [key for key, value in dct.items() if value == max(dct.values())]
print("\n".join(best_sentences))

# Outputs:
# My name is sing song
#  You sing like my mother
If I understand correctly, you should be using any() instead of all():

if any(word in sentence for word in search_keywords):
    print(sentence)
So you want to find sentences that contain at least one keyword. You can use any() instead of all().
EDIT:
If you want to find the sentences which contain the most keywords:

sent_words = []
for sentence in sentences:
    sent_words.append(set(sentence.split()))
num_keywords = [len(sent & set(search_keywords)) for sent in sent_words]

# Find only one sentence
ind = num_keywords.index(max(num_keywords))

# Find all sentences with that number of keywords
ind = [i for i, x in enumerate(num_keywords) if x == max(num_keywords)]
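A runnable version of that counting approach, using the example sentences from the earlier answer (the data is only for illustration):

```python
sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords = ['mother', 'sing', 'song']

# Word sets per sentence, so counting keywords is a set intersection.
sent_words = [set(sentence.split()) for sentence in sentences]
num_keywords = [len(words & set(search_keywords)) for words in sent_words]

# All sentences tied for the maximum keyword count.
best = [sentences[i] for i, n in enumerate(num_keywords) if n == max(num_keywords)]
print(best)  # ['My name is sing song', ' You sing like my mother']
```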
I am writing code for a baseline tagger. Based on the Brown corpus, it assigns the most common tag to each word. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user input sentence it would be tagged as a plural noun. If the word is not found in the corpus, it is tagged as a noun by default.
The code I have so far returns every tag for the word, not just the most frequent one. How can I make it return only the most frequent tag per word?
import nltk
from nltk.corpus import brown

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u == item[0][0]:
                t = (u, item[0][1])
                result.append(t)
                continue
            t = (u, "NN")
            result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print(tags)

if __name__ == '__main__':
    main()
If it's English, there is a default POS tagger in NLTK which a lot of people have been complaining about, but it's a nice quick fix (more like a band-aid than paracetamol); see POS tagging - NLTK thinks noun is adjective:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
If you want to train a baseline tagger from scratch, I recommend you follow an example like this but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger
By building a UnigramTagger like in spaghetti-tagger, you should automatically achieve the most common tag for every word.
However, if you want to do it the non-machine-learning way, by first counting word:POS pairs, what you'll need is some sort of type-to-token mapping; also see Part-of-speech tag without context using nltk:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain

def type_token_ratio(documentstream):
    ttr = defaultdict(list)
    for token, pos in list(chain(*documentstream)):
        ttr[token].append(pos)
    return ttr

def most_freq_tag(ttr, word):
    return Counter(ttr[word]).most_common()[0][0]

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best tag for the word.
print(Counter(documents_ttr['quick']).most_common()[0])

# Best tags for a sentence.
print([most_freq_tag(documents_ttr, i) for i in sent1.split()])
NOTE: A document stream can be defined as a list of sentences, where each sentence contains a list of tokens with/without tags.
Create a dictionary called word_tags whose keys are words (unannotated) and whose values are lists of tags in descending frequency (based on your fdist).
Then:

for u in uinput:
    result.append(word_tags[u][0])
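A self-contained sketch of that dictionary, using a toy tagged word list in place of the Brown-corpus fdist (the sample counts are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy tagged corpus standing in for brown.tagged_words().
tagged_words = [("works", "NNS")] * 3 + [("works", "VBZ")] * 2 + [("fox", "NN")]

counts = defaultdict(Counter)
for word, tag in tagged_words:
    counts[word][tag] += 1

# word -> tags in descending frequency, as described above.
word_tags = {word: [tag for tag, _ in c.most_common()] for word, c in counts.items()}

# Most frequent tag per input word, defaulting to "NN" for unseen words.
result = [(u, word_tags.get(u, ["NN"])[0]) for u in "the works fox".split()]
print(result)  # [('the', 'NN'), ('works', 'NNS'), ('fox', 'NN')]
```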
You can simply use Counter to find the most repeated item in a list:

from collections import Counter

default_tag = Counter(tags).most_common(1)[0][0]
If your question is "how does a unigram tagger work?", you might be interested in reading more of the NLTK source code:
http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger
Anyway, I suggest you read NLTK book chapter 5, especially:
http://nltk.org/book/ch05.html#the-lookup-tagger
Just like the sample in the book you can have a conditional frequency distribution, which returns the best tag for each given word.
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
In this case cfd["fox"].max() will return the most likely tag for "fox" according to the Brown corpus. Then you can make a dictionary of the most likely tags for each word of your sentence:
likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())
Notice that for new words in your sentence this will raise an error. But if you understand the idea, you can make your own tagger.