Is it possible to get concordance for a phrase in NLTK?
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_loc = "c://temp//text//"
files = ".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)
corpus.concordance("claim")
For example, the above returns:
on okay okay okay i can give you the claim number and my information and
decide on the shop okay okay so the claim number is xxxx - xx - xxxx got
and now if I try corpus.concordance("claim number") it does not work... I do have code that does this by using the .partition() method and some further processing, but I'm wondering if it's possible to do the same using concordance.
According to this issue it is not (yet) possible to search for multiple words with the concordance() function.
If you read the discussion under the very issue that #b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:
>>> from nltk.app import concordance
>>> concordance()
I munged together this solution...
def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # concordance replication via
    # https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]

    # For each token in the phrase list, rebase its offsets to the start of the phrase
    offsets_norm = []
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])

    # We have found an occurrence of the phrase wherever the rebased offsets intersect
    # (the intersection method takes an arbitrary number of arguments:
    #  http://stackoverflow.com/a/3852792/454773)
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs = [' '.join(con_sub) for con_sub in concordance_txt]
    return outputs
def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)
>>> n_concordance_tokenised(text1, 'monstrous size')
[u'one was of a most monstrous size . ... This came towards ',
 u'; for Whales of a monstrous size are oftentimes cast up dead ']
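The raw-text wrapper can be used in the same way; for example (the sample sentence below is made up, not taken from the corpus above):

raw = "The claim number is on file. Please read the claim number back to me."
print(n_concordance(raw, 'claim number', left_margin=3, right_margin=3))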
I want to do fuzzy matching on a string with multiple words.
The target string could be like:
"Hello, I am going to watch a film today."
where the words I want to search for are:
"flim toda".
This hopefully should return "film today" as a search result.
I have used this method but it seems to be working only with one word.
import difflib
def matches(large_string, query_string, threshold):
    words = large_string.split()
    matched_words = []
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i + n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            matched_words.append(match)
    return matched_words
large_string = "Hello, I am going to watch a film today"
query_string = "film"
print(list(matches(large_string, query_string, 0.8)))
This only works with one word at a time, and it only returns a match when there is little noise.
Is there any way to do such fuzzy matching with words?
The feature you are thinking of is called "query suggestion"; it does rely on spell checking, but it mainly relies on Markov chains built out of search-engine query logs.
That being said, you can use an approach similar to the one described in this answer: https://stackoverflow.com/a/58166648/140837
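If you'd rather stay with difflib, one way to extend the question's single-word approach to multi-word queries is to score a sliding window of words instead of individual words. A rough sketch of that idea (the function name and the 0.7 threshold are mine, not from the linked answer):

import difflib

def fuzzy_phrase_matches(large_string, query_string, threshold=0.7):
    # Compare the query against every window with the same number of words,
    # keeping windows whose similarity ratio clears the threshold.
    words = large_string.split()
    size = len(query_string.split())
    hits = []
    for i in range(len(words) - size + 1):
        window = ' '.join(words[i:i + size])
        ratio = difflib.SequenceMatcher(None, window.lower(), query_string.lower()).ratio()
        if ratio >= threshold:
            hits.append((ratio, window))
    # best matches first
    return [window for ratio, window in sorted(hits, reverse=True)]

print(fuzzy_phrase_matches("Hello, I am going to watch a film today.", "flim toda"))
# should print something like ['film today.']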
You can simply use fuzzysearch; please see the example below:
from fuzzysearch import find_near_matches
text_string = "Hello, I am going to watch a film today."
matches = find_near_matches('flim toda', text_string, max_l_dist=2)
print([text_string[m.start:m.end] for m in matches])
This will give you the desired output.
['film toda']
Please note that you can tune the max_l_dist parameter based on how much deviation you want to tolerate.
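For reference, a tighter tolerance will drop the match, since 'flim' is two edits away from 'film' (a hedged illustration, assuming the same text_string as above):

print(find_near_matches('flim toda', text_string, max_l_dist=1))  # likely []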
I'm running the chunk of code under "3. Word to Vectors Integration" from this NLP tutorial. It uses spaCy's lexemes to calculate the most similar words to whatever word you give it; I'm using it to try to find the nearest synonyms to a given word. However, if you replace apple with a word like "look", you get a lot of related words, but not synonyms (examples: pretty, there, over, etc.).
I was thinking of modifying the code to also filter by part of speech, so that I could get just verbs in the output and work from there. To do that, I'd need to use tokens so I can use token.pos_, since that attribute isn't available for lexemes. Does anyone know a way to take the output (the list called "others" in the code) and change it from a lexeme to a token? I was reading over spaCy's information document for lexemes here, but I haven't been able to find anything about converting.
I've also tried adding a section of code at the end to the other person's code:
from numpy import dot
from numpy.linalg import norm
import spacy
from spacy.lang.en import English

nlp = English()
parser = spacy.load('en_core_web_md')
my_word = u'calm'

# Look up the lexeme (and its vector) for my_word
# (the variable is still named 'apple' after the original tutorial)
apple = parser.vocab[my_word]

# Cosine similarity function
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

others = list({w for w in parser.vocab if w.has_vector and w.orth_.islower()
               and w.lower_ != my_word})
print("done listing")

# sort by similarity score
others.sort(key=lambda w: cosine(w.vector, apple.vector))
others.reverse()

for word in others[:10]:
    print(word.orth_)
The part I added:
b = ""
for word in others[:10]:
a = str(word) + ' '
b += a
doc = nlp(b)
print(doc)
token = doc[0]
counter = 1
while counter < 50:
token += doc[counter]
counter += 1
print(token)
This is the output error:
token += doc[counter]
TypeError: unsupported operand type(s) for +=: 'spacy.tokens.token.Token' and 'spacy.tokens.token.Token'

<spacy.lexeme.Lexeme object at 0x000002920ABFAA68> <spacy.lexeme.Lexeme object at 0x000002920BD56EE8>
Does anyone have any suggestions to fix what I did or another way to change the lexeme to a token? Thank you!
You have to create a Doc to get a Token, as the Token doesn't own any data; it's just a view. So, you could do something like:
doc = Doc(lex.vocab, words=[lex.orth_])
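For the part-of-speech filtering goal, here is a rough sketch (assuming the parser pipeline and others list from the question's code; note that a bare Doc has no POS tags, so the second option runs the text back through the loaded pipeline to populate token.pos_):

from spacy.tokens import Doc

# Option 1: wrap a lexeme in a Doc, as in the answer above (no POS tags yet)
lex = others[0]
doc = Doc(lex.vocab, words=[lex.orth_])
token = doc[0]

# Option 2: process the lexeme's text with the loaded model so pos_ is set,
# then keep only the verbs
verbs = []
for lex in others[:50]:
    token = parser(lex.orth_)[0]
    if token.pos_ == 'VERB':
        verbs.append(token.text)

print(verbs[:10])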
I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between them (here it is 2 words between "Data" and "Analyst"). Currently I am using DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst') to get the counts for "job Analyst".
How can I specify the number of words in between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from a string.

    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge and corner cases are
    not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
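For example, reusing parse_words on a sentence that contains the target word twice:

>>> words = parse_words("and don't think this day's last moment won't come at last")
>>> list_indices('last', words)
[5, 10]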
And finally a decision-making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions
    of one another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
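Which should yield something like:

>>> df
                                           desc    ja2    ja3
0             My latest Data job was an Analyst  False   True
1                  some day my prince will come  False  False
2  Oh, somewhere over the rainbow bluebirds fly  False  False
3            Won't you share a common disaster?  False  False
4                        job! rainbow! analyst.   True   True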
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
I have a text file with a lot of comments/sentences, and I want to somehow find the most common phrases repeated in the document itself. I tried fiddling with it a bit with NLTK and I found this thread: How to extract common / significant phrases from a series of text entries
However, after trying it, I get odd results like these:
>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)
[('m', 'e'), ('t', 's')]
And in another file where the phrase "this is funny" is very common, I get an empty list [].
How should I go about doing this?
Here's my full code:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt')
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))
I haven't used nltk, but I suspect the problem is that from_words expects an iterable of tokens, not a filename; given a plain string it will iterate over its characters, which would explain the single-character "bigrams" you are seeing.
Something akin to
with open('MkXVM6ad9nI.txt') as wordfile:
    text = wordfile.read()

tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)
is likely to work, although there's probably a specialised API for files too.
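Putting the question's code and that fix together, a minimal end-to-end version might look like this (the filename is the one from the question):

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# read and tokenise the file, then build the finder from the tokens
with open('MkXVM6ad9nI.txt') as wordfile:
    tokens = nltk.wordpunct_tokenize(wordfile.read())

finder = BigramCollocationFinder.from_words(tokens)

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# the 10 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))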
I'm working on a lemmatizer using Python, NLTK and the WordNetLemmatizer.
Here is a random example that outputs what I was expecting:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
Output: 'worse'
Well, everything here is fine. The behaviour is the same with other adjectives like 'better' (an irregular form) or 'older' (note that the same test with 'elder' will never output 'old', but I guess that WordNet is not an exhaustive list of all existing English words).
My question comes when trying the word 'further':
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective
Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb
Output: 'far'
This is the exact opposite of the behaviour for the word 'worse'!
Can anybody explain to me why? Is it a bug coming from the WordNet synsets data, or does it come from my misunderstanding of English grammar?
Please excuse me if the question is already answered; I've searched on Google and SO, but when specifying the keyword "further" everything I find is a mess of loosely related results, because of the popularity of this word...
Thank you in advance,
Romain G.
WordNetLemmatizer uses the ._morphy function to access a word's lemma (see http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the possible lemma with the minimum length:
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
And the ._morphy function applies rules iteratively to get a lemma; the rules keep reducing the length of the word and substituting the affixes with the MORPHOLOGICAL_SUBSTITUTIONS. It then checks whether the reduced forms are themselves words in the database:
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
However, if the word is in the list of exceptions, it will return the fixed value kept in the exception map; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
Going back to your example, worse -> bad and further -> far CANNOT be achieved from the rules, thus it has to be from the exception list. Since it's an exception list, there are bound to be inconsistencies.
The exception lists are kept in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adj.exc.
From adv.exc:
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
From adj.exc:
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
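A quick way to confirm those entries from Python (this pokes at the corpus reader's internal _exception_map, so treat it as a sketch that may differ across NLTK versions):

from nltk.corpus import wordnet

print(wordnet._exception_map[wordnet.ADJ]['worse'])    # expected: ['bad']
print(wordnet._exception_map[wordnet.ADV]['further'])  # expected: ['far']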