I have been going through many Libraries like whoosh/nltk and concepts like word net.
However I am unable to tackle my problem. I am not sure if I can find a library for this or I have to build this using the above mentioned resources.
Question:
My scenario is that I have to search for key words.
Say I have key words like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 pages book.
The catch is:
Now they can also be written as 'Sales should be documented' or 'company selling should be written in the text files'. (For Sales Document - Keyword) Is there an approach here or will I have to build something?
The code for the POS Tags is as follows. If no library is available I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet
def tag(x):
return pos_tag(word_tokenize(x))
synonyms = []
antonyms = []
for syn in wordnet.synsets("Sales document"):
#print("Down2")
print (syn)
#print("Down")
for l in syn.lemmas():
print(" \n")
print(l)
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
for i in synonyms:
print(tag(i))
Update:
We went ahead and made a python program - Feel free to fork it. (Pun intended)
Further the Git Dhund is very untidy right now will clean it once completed.
Currently it is still in a development phase.
The is the link.
To match occurrences like "Sales should be documented", this can be done by increasing the slop parameter in the Phrase query object of Whoosh.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in Query like this: "Sales should be documented"~5
To match the second example "company selling should be written in the text files", this needs a semantic processing for your texts. Whoosh has a low-level implementation for wordnet thesaurus to allow you index synonyms but it has only one-word synonyms.
Related
I am working on an IR project, I need an alternative to both stemming (which returns unreal words) and lemmatization (which may not change the word at all)
So I looked for a way to get forms of a word.
This python script gives me derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and Wordnet:
from nltk.corpus import wordnet as wn
str = "retrieving"
synsets = wn.synsets(str)
s = set()
result = ""
for synset in synsets:
related = None
lemmas = synset.lemmas()
for lemma in lemmas:
forms = lemma.derivationally_related_forms()
for form in forms:
name = form.name()
s.add(name)
print(list(s))
The output is:
['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']
But what I really want is only : 'retrieval' , 'retriever' , not 'think' or 'recovery'...etc
and the result is also missing other forms, such as: 'retrieve'
I know that the problem is that "synsets" include words different from my input word, so I get unrelated derivated forms
Is there a way to get the result I am expecting?
You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
Another approach, not using Wordnet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (Which links back to this question How to get english language word database? )
The simplest algorithm would be an O(N) linear search, doing Levenshtein Distance on each. Or run your stemmer on each entry.
If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is you could do a one-off indexing of all entries by the stemmer result.
I have to identify cities in a document (has only characters), I do not want to maintain an entire vocabulary as it is not a practical solution. I also do not have Azure text analytics api account.
I have already tried using Spacy, I did ner and identified geolocation and that output is passed to spellchecker() to train the model. But the issue with this is that ner requires sentences and my input has words.
I am relatively new to this field.
You can check out the geotext library.
Working example with a sentence:
text = "The capital of Belarus is Minsk. Minsk is not so far away from Kiev or Moscow. Russians and Belarussians are nice people."
from geotext import GeoText
places = GeoText(text)
print(places.cities)
Output:
['Minsk', 'Minsk', 'Kiev', 'Moscow']
Working example with list of words:
wordList = ['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']
for i in range(len(wordList)):
places = GeoText(wordList[i])
if places.cities:
print(places.cities)
Output:
['London']
['Vilnius']
['Delhi']
geograpy is another alternative. However, I find geotext light due to lesser number of external dependencies.
there is a list of libraries that may help you,
but from my experience, there is not a perfect library for this. If you know all the cities that may appear in the text, then vocabulary is the best thing
When using spacy, the lemma of a token (lemma_) depends on the POS. Therefore, a specific string can have more than one lemmas. For example:
import spacy
nlp = spacy.load('en')
for tok in nlp(u'He leaves early'):
if tok.text == 'leaves':
print (tok, tok.lemma_)
for tok in nlp(u'These are green leaves'):
if tok.text == 'leaves':
print (tok, tok.lemma_)
Will yield that the lemma for 'leaves' can be either 'leave' or 'leaf', depending on context. I'm interested in:
1) Get all possible lemmas for a specific string, regardless of context. Meaning, applying the Lemmatizer without depending on the POS or exceptions, just get all feasible options.
In addition, but independently, I would also like to apply tokenization and get the "correct" lemma.
2) Running over a large corpus only tokenization and lemmatizer, as efficiently as possible, without damaging the lemmatizer at all. I know that I can drop the 'ner' pipeline for example, and shouldn't drop the 'tagger', but didn't receive a straightforward answer regarding parser etc. From a simulation over a corpus, it seems like results yielded the same, but I thought that the 'parser' or 'sentenzicer' should affect? My current code at the moment is:
import multiprocessing
our_num_threads = multiprocessing.cpu_count()
corpus = [u'this is a text', u'this is another text'] ## just an example
nlp = spacy.load('en', disable = ['ner', 'textcat', 'similarity', 'merge_noun_chunks', 'merge_entities', 'tensorizer', 'parser', 'sbd', 'sentencizer'])
nlp.pipe(corpus, n_threads = our_num_threads)
If I have a good answer on 1+2, I can then for my needs use for words that were "lemmatized", consider other possible variations.
Thanks!
I am wondering that is there any efficient way to extract expected target phrase or key phrase from given sentence. So far I tokenized the given sentence and get POS tag for each word. Now I am not sure how to extract target key phrase or keyword from given sentence. The way of doing this is not intuitive to me.
Here is my input sentence list:
sentence_List= {"Obviously one of the most important features of any computer is the human interface.", "Good for everyday computing and web browsing.",
"My problem was with DELL Customer Service", "I play a lot of casual games online[comma] and the touchpad is very responsive"}
here is the tokenized sentence:
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in sentence_List]
tokenized=[i for i in tokenized_sents]
Here I used Spacy to get POS tag of words:
import spacy
nlp = spacy.load('en_core_web_sm')
res=[]
for i in range(len(sentence_list.index)):
for token in i:
res.append(token.pos_)
so I may use NER (a.k.a, name entity relation) from spacy but its output is not the same thing with my pre-defined expected target phrase. Does anyone know how to accomplish this task either using Spacy or stanfordcorenlp module in python? what is an efficient solution to make this happen? Any idea? Thanks in advance :)
desired output:
I want to get the list of target phrase from respective sentence list as follow:
target_phraseList={"human interface","everyday computing","DELL Customer Service","touchpad"}
so I concatenate my input sentence_list with an expected target phrase, my final desired output would be like this:
import pandas as pd
df=pd.Series(sentence_List, target_phraseList)
df=pd.DataFrame(df)
How can I get my expected target phrases from a given input sentence list by using spacy? Any idea?
You can possibly do this using spacy by Phrase Matcher.
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
matcher.add('DELL', None, nlp(u"DELL Customer Service"))
doc = nlp(u"My problem was with DELL Customer Service")
matches = matcher(doc)
I am using NLTK to extract nouns from a text-string starting with the following command:
tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))
It works fine in English. Is there an easy way to make it work for German as well?
(I have no experience with natural language programming, but I managed to use the python nltk library which is great so far.)
Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.
See nltk.corpus.europarl_raw and this answer for example configuration.
Also, consider tagging this question with "nlp".
The Pattern library includes a function for parsing German sentences and the result includes the part-of-speech tags. The following is copied from their documentation:
from pattern.de import parse, split
s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print s.sentences[0]
>>> Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
Update: Another option is spacy, there is a quick example in this blog article:
import spacy
nlp = spacy.load('de')
doc = nlp(u'Ich bin ein Berliner.')
# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
Part-of-Speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given token. Most (but not all) of these taggers use a statistical model of sorts as the main or sole device to "do the trick". Such taggers require some "training data" upon which to build this statistical representation of the language, and the training data comes in the form of corpora.
The NTLK "distribution" itself includes many of these corpora, as well a set of "corpora readers" which provide an API to read different types of corpora. I don't know the state of affairs in NTLK proper, and if this includes any german corpus. You can however locate free some free corpora which you'll then need to convert to a format that satisfies the proper NTLK corpora reader, and then you can use this to train a POS tagger for the German language.
You can even create your own corpus, but that is a hell of a painstaking job; if you work in a univeristy, you gotta find ways of bribing and otherwise coercing students to do that for you ;-)
Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. There are python recipes for German NLP that I've compiled and you can access them on http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html
#-*- coding: utf8 -*-
import os, glob, codecs
def installStanfordTag():
if not os.path.exists('stanford-postagger-full-2013-06-20'):
os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
os.system('unzip stanford-postagger-full-2013-06-20.zip')
return
def tag(infile):
cmd = "./stanford-postagger.sh "+models[m]+" "+infile
tagout = os.popen(cmd).readlines()
return [i.strip() for i in tagout]
def taglinebyline(sents):
tagged = []
for ss in sents:
os.popen("echo '''"+ss+"''' > stanfordtemp.txt")
tagged.append(tag('stanfordtemp.txt')[0])
return tagged
installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast':'models/german-fast.tagger',
'dewac':'models/german-dewac.tagger',
'hgc':'models/german-hgc.tagger'}
os.chdir(stagdir)
print os.getcwd()
m = 'fast' # It's best to use the fast german tagger if your data is small.
sentences = ['Ich bin schwanger .','Ich bin wieder schwanger .','Ich verstehe nur Bahnhof .']
tagged_sents = taglinebyline(sentences) # Call the stanford tagger
for sent in tagged_sents:
print sent
I have written a blog-post about how to convert the German annotated TIGER Corpus in order to use it with the NLTK. Have a look at it here.
It seems to be a little late to answer the question, but it might be helpful for anyone who finds this question by googling like i did. So i'd like to share the things I found out.
The HannoverTagger might be a useful tool for this Task.
You can find tutorials here and here(german), but the second one is in german.
The Tagger seems to use the STTS Tagset, if you need a complete list of all Tags.