I'm trying to increase the efficiency of a non-conformity management program. Basically, I have a database containing a few hundred rows; each row describes a non-conformity in a free-text field.
Text is provided in Italian, and I have no control over what the user writes.
I'm trying to write a Python program using NLTK to detect how many of these rows report the same problem, written differently but with similar content.
For example, the following sentences need to be related, with a high degree of confidence:
I received 10 pieces less than what was ordered
10 pieces have not been shipped
I already found the following article describing how to preprocess text for analysis:
How to Develop a Paraphrasing Tool Using NLP (Natural Language Processing) Model in Python
I also found other questions on SO, but they all refer to word similarity, comparison of two sentences, or comparison using a reference meaning.
This one uses a reference meaning.
This one refers to comparison of two sentences.
In my case, I have no reference and I have multiple sentences that need to be grouped if they refer to similar problems, so I wonder whether this job is even possible to do with a script.
This answer says that it cannot be done, but it's quite old and maybe someone knows something new.
Thanks to everyone who can help me.
Thanks to Anurag Wagh's advice, I figured it out.
I used this tutorial about gensim and how to use it in many ways.
Chapter 18 does what I was asking for, but during my tests I found a better way to achieve my goal.
Chapter 11 shows how to build an LDA model and how to extract a list of main topics from a set of documents.
Here is the code I used to build the LDA model:
# Step 0: Import packages and stopwords
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess, lemmatize
from gensim import corpora
import nltk
from nltk.corpus import stopwords
import pattern.it
import re
import string
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

# Step 1: Load the documents, one non-conformity per line
docs = [doc for doc in open('file.txt', encoding='utf-8')]

# dictionary of Italian stop words
it_stop_words = nltk.corpus.stopwords.words('italian')
it_stop_words = it_stop_words + [<custom stop words>]

# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word
def lemmatize_word(input_word):
    in_word = input_word
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # parse() returns a tagged string; the lemma is the fifth field of the first token
    the_lemmatized_word = word_it.split()[0][0][4]
    return the_lemmatized_word

# Step 2: Prepare data (remove punctuation, stop words and apostrophes)
data_processed = []
for doc in docs:
    word_tokenized_list = nltk.tokenize.word_tokenize(doc)
    word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
    word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
    word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
    word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
    data_processed.append(word_tokenized_no_punct_no_sw_no_apostrophe)

# Step 3: Build the dictionary and the bag-of-words corpus
dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]

# Step 4: Train the LDA model
lda_model = LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=7,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

# save the model
lda_model.save('lda_model.model')

# See the topics
lda_model.print_topics(-1)
With the model trained, I can get the list of topics for each new non-conformity and detect whether it's related to something already reported in other non-conformities.
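For completeness, here is a minimal sketch of how a new description can be matched against the trained model; the Italian string is a made-up example and preprocess is just a placeholder standing for the same Step 2 cleaning used above:

# classify a new, unseen non-conformity with the trained model
new_text = "10 pezzi non sono stati spediti"   # hypothetical new description
new_tokens = preprocess(new_text)              # placeholder for the Step 2 pipeline above
new_bow = dct.doc2bow(new_tokens)

# (topic_id, probability) pairs for the new document, dominant topic first
new_topics = sorted(lda_model.get_document_topics(new_bow), key=lambda t: t[1], reverse=True)
print(new_topics[0])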
Perhaps converting each document to a vector and then computing the distance between the vectors would be helpful.
doc2vec can be helpful here.
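A minimal sketch of that idea with gensim's Doc2Vec; the Italian strings are just made-up examples, and the tiny corpus is only there to show the API, not to give meaningful similarities:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: one TaggedDocument per database row
rows = ["ricevuti 10 pezzi in meno rispetto all'ordine",
        "10 pezzi non sono stati spediti",
        "imballaggio danneggiato durante il trasporto"]
tagged = [TaggedDocument(words=row.lower().split(), tags=[i])
          for i, row in enumerate(rows)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# infer a vector for a new, unseen description and look for the closest rows
new_vec = model.infer_vector("mancano 10 pezzi nella consegna".lower().split())
print(model.dv.most_similar([new_vec], topn=2))  # gensim 4.x; older versions use model.docvecs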
For a project, I would like to be able to get the noun form of an adjective or adverb, if there is one, using NLP.
For example, "deathly" would return "death" and "dead" would return "death".
"lively" would return "life".
I've tried using the spaCy lemmatizer, but it does not manage to get the base root form.
For example, if I'd do:
import spacy
nlp = spacy.load('en_core_web_sm')
z = nlp("deathly lively")
for token in z:
    print(token.lemma_)
It would return:
>>> deathly lively
instead of:
>>> death life
Does anyone have any ideas?
Any answer is appreciated.
From what I've seen so far, SpaCy is not great at doing what you want it to do. Instead, I am using a third-party library called pyinflect, which is intended to be used as an extension to SpaCy.
While it isn't perfect, I think it will work better than your current approach.
I'm also considering another third-party library called inflect, which might be worth checking out as well.
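Not a definitive answer, but here is a minimal sketch of how pyinflect hooks into SpaCy; the ._.inflect call returns None when it doesn't know a form for the requested tag, so you have to check coverage on your own word list:

import spacy
import pyinflect  # registers the ._.inflect extension on spaCy tokens

nlp = spacy.load("en_core_web_sm")
doc = nlp("deathly lively dead")

for token in doc:
    # ask for the singular-noun (NN) form; None means no form is available
    print(token.text, "->", token._.inflect("NN"))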
I have been going through many libraries like Whoosh and NLTK and concepts like WordNet.
However, I am unable to tackle my problem. I am not sure whether I can find a library for this or whether I have to build it using the above-mentioned resources.
Question:
My scenario is that I have to search for keywords.
Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 page book.
The catch is:
They can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (for the keyword 'Sales Document'). Is there an existing approach for this, or will I have to build something?
The code for the POS tags is as follows. If no library is available, I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet

def tag(x):
    return pos_tag(word_tokenize(x))

synonyms = []
antonyms = []

for syn in wordnet.synsets("Sales document"):
    print(syn)
    for l in syn.lemmas():
        print(" \n")
        print(l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

for i in synonyms:
    print(tag(i))
Update:
We went ahead and made a Python program. Feel free to fork it (pun intended).
Also, the Dhund Git repo is very untidy right now; we will clean it up once it's completed.
Currently it is still in a development phase.
Here is the link.
Occurrences like "Sales should be documented" can be matched by increasing the slop parameter in Whoosh's Phrase query object.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in Query like this: "Sales should be documented"~5
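A minimal sketch of both forms, assuming a toy index with a single content field (the directory name and the example sentence are just placeholders):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.query import Phrase
from whoosh.qparser import QueryParser

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", Schema(content=TEXT(stored=True)))
writer = ix.writer()
writer.add_document(content=u"Sales should be documented in the ERP system.")
writer.commit()

with ix.searcher() as searcher:
    # allow up to 5 words between "sales" and "documented"
    q = Phrase("content", [u"sales", u"documented"], slop=5)
    print([hit["content"] for hit in searcher.search(q)])

    # the same thing through the query parser syntax
    q2 = QueryParser("content", ix.schema).parse(u'"sales documented"~5')
    print([hit["content"] for hit in searcher.search(q2)])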
Matching the second example, "company selling should be written in the text files", requires semantic processing of your texts. Whoosh has a low-level implementation of a WordNet thesaurus that lets you index synonyms, but it only supports one-word synonyms.
I recently approached NLP and tried to use NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews made by travelers, so I have to manage a lot of texts written in different languages. I need to do two main operations: POS tagging and lemmatization. I have seen that in NLTK there is the possibility to choose the right language for sentence tokenization, like this:
tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')
I haven't found the right way to set the language for the POS tagger and the lemmatizer yet. How can I set the correct corpora/dictionaries for non-English texts such as Italian, French, Spanish or German? I also see that there is the possibility to import the "TreeBank" or "WordNet" modules, but I don't understand how I can use them. Otherwise, where can I find the respective corpora?
Can you give me some suggestions or references? Please keep in mind that I'm not an expert in NLTK.
Many Thanks.
If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger: a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. See experimental results including performance speed and tagging accuracy on 13 languages in this paper. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. RDRPOSTagger also supports the pre-trained Universal POS tagging models for 40 languages.
In Python, you can use the pre-trained models to tag a raw, unlabeled text corpus as follows:
python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest
If you would like to program with RDRPOSTagger, please follow code lines 92-98 in the RDRPOSTagger.py module in the pSCRDRTagger package. Here is an example:
r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR") #Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT") #Load a German lexicon
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")
r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")
There is no option that you can pass to NLTK's POS-tagging and lemmatizing functions that will make them process other languages.
One solution would be to get a training corpus for each language and train your own POS-taggers with NLTK, then figure out a lemmatizing solution, maybe dictionary-based, for each language.
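A rough sketch of that route, assuming you already have tagged_sents, a hypothetical list of [(word, tag), ...] sentences for the target language (e.g. converted from a Universal Dependencies treebank):

import nltk

# tagged_sents: assumed list of tagged sentences for the target language
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

backoff = nltk.DefaultTagger('NOUN')                       # fall back to a default tag
unigram = nltk.UnigramTagger(train_sents, backoff=backoff)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(bigram.tag(['Questa', 'frase', 'è', 'un', 'esempio']))
print(bigram.evaluate(test_sents))  # newer NLTK versions call this .accuracy()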
That might be overkill though, as there is already a one-stop solution for both tasks in Italian, French, Spanish and German (and many other languages): TreeTagger. It is not as state-of-the-art as the POS-taggers and lemmatizers available for English, but it still does a good job.
What you want is to install TreeTagger on your system and be able to call it from Python. Here is a GitHub repo by miotto that lets you do just that.
The following snippet shows you how to test that you set up everything correctly. As you can see, I am able to POS-tag and lemmatize in one function call, and I can do it just as easily in English and in French.
>>> import os
>>> os.environ['TREETAGGER'] = "/opt/treetagger/cmd" # Or wherever you installed TreeTagger
>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_en.tag('Does this thing even work?')
[[u'Does', u'VBZ', u'do'], [u'this', u'DT', u'this'], [u'thing', u'NN', u'thing'], [u'even', u'RB', u'even'], [u'work', u'VB', u'work'], [u'?', u'SENT', u'?']]
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> tt_fr.tag(u'Mon Dieu, faites que ça marche!')
[[u'Mon', u'DET:POS', u'mon'], [u'Dieu', u'NOM', u'Dieu'], [u',', u'PUN', u','], [u'faites', u'VER:pres', u'faire'], [u'que', u'KON', u'que'], [u'\xe7a', u'PRO:DEM', u'cela'], [u'marche', u'NOM', u'marche'], [u'!', u'SENT', u'!']]
Since this question gets asked a lot (and since the installation process is not super straightforward, IMO), I will write a blog post on the matter and update this answer with a link to it as soon as it is done.
EDIT:
Here is the above-mentioned blog post.
I quite like using SpaCy for multilingual NLP. They have trained models for Catalan, Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian and Spanish.
You would simply load a different model depending on the language you're working with:
import spacy
nlp_DE = spacy.load("de_core_news_sm")
nlp_FR = spacy.load("fr_core_news_sm")
It's not as accurate as TreeTagger or the Hanover Tagger, but it is very easy to use, and it outputs usable results that are much better than NLTK's.
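Building on the snippet above, POS tags and lemmas then come straight from the token attributes (the German sentence is just an example):

doc = nlp_DE("Die Bewertungen der Reisenden waren durchweg positiv.")
for token in doc:
    print(token.text, token.pos_, token.lemma_)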
Is it possible to use non-standard part of speech tags when making a grammar for chunking in the NLTK? For example, I have the following sentence to parse:
complication/patf associated/qlco with/prep breast/noun surgery/diap
independent/adj of/prep the/det use/inpr of/prep surgical/diap device/medd ./pd
Locating the phrases I need from the text is greatly assisted by specialized tags such as "medd" or "diap". I thought that because you can use RegEx for parsing, it would be independent of anything else, but when I try to run the following code, I get an error:
grammar = r'TEST: {<diap>}'
cp = nltk.RegexpParser(grammar)
cp.parse(sentence)
ValueError: Transformation generated invalid chunkstring:
<patf><qlco><prep><noun>{<diap>}<adj><prep><det><inpr><prep>{<diap>}<medd><pd>
I think this has to do with the tags themselves, because NLTK can't generate a tree from them, but is it possible to skip that part and just get the chunked items returned? Maybe NLTK isn't the best tool, and if so, can anyone recommend another module for chunking text?
I'm developing in Python 2.7.6 with the Anaconda distribution.
Thanks in advance!
Yes, it is possible to use custom tags for NLTK chunking. I have done the same.
Refer: How to parse custom tags using nltk.Regexp.parser()
The ValueError and the error description suggest that there is an error in the formation of your grammar, and you need to check that. You can update the question with your grammar to get suggestions on corrections.
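For reference, here is a minimal sketch of custom-tag chunking, assuming a recent NLTK; note that the sentence has to be passed as a list of (word, tag) tuples, not as a raw string:

import nltk

# the sentence from the question, written as the (word, tag) pairs RegexpParser expects
sentence = [("complication", "patf"), ("associated", "qlco"), ("with", "prep"),
            ("breast", "noun"), ("surgery", "diap"), ("independent", "adj"),
            ("of", "prep"), ("the", "det"), ("use", "inpr"), ("of", "prep"),
            ("surgical", "diap"), ("device", "medd"), (".", "pd")]

grammar = r"TEST: {<diap><medd>?}"   # a diap tag, optionally followed by medd
cp = nltk.RegexpParser(grammar)
tree = cp.parse(sentence)

for subtree in tree.subtrees():
    if subtree.label() == "TEST":
        print(subtree)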
from nltk.tokenize import word_tokenize
import nltk

# example_sent was not defined in the original snippet; any input sentence works here
example_sent = "The quick brown fox jumps over the lazy dog in New York"

# POS Tagging
words = word_tokenize(example_sent)
pos = nltk.pos_tag(words)
print(pos)

# Chunking: one or more adjectives followed by one or more nouns
chunk = r'Chunk: {<JJ.?>+<NN.?>+}'
par = nltk.RegexpParser(chunk)
par2 = par.parse(pos)
print('Chunking - ', par2)

print('------------------------------ Parsing the filtered chunks')
# printing only the required chunks
for i in par2.subtrees():
    if i.label() == 'Chunk':
        print(i)

print('------------------------------NER')
# NER
ner = nltk.ne_chunk(pos)
print(ner)