How can I use NLP to group multiple sentences by semantic similarity - python

I'm trying to increase the efficiency of a non-conformity management program. Basically, I have a database containing a few hundred rows; each row describes a non-conformity using a text field.
Text is provided in Italian and I have no control over what the user writes.
I'm trying to write a Python program using NLTK to detect how many of these rows report the same problem, written differently but with similar content.
For example, the following sentences need to be related, with a high rate of confidence
I received 10 pieces less than what was ordered
10 pieces have not been shipped
I already found the following article describing how to preprocess text for analysis:
How to Develop a Paraphrasing Tool Using NLP (Natural Language Processing) Model in Python
I also found other questions on SO but they all refer to word similarity, two sentences comparison, or comparison using a reference meaning.
This one uses a reference meaning
This one refers to two sentences comparison
In my case, I have no reference and I have multiple sentences that need to be grouped if they refer to similar problems, so I wonder if this job is even possible to do with a script.
This answer says that it cannot be done, but it's quite old and maybe someone knows something new.
Thanks to everyone who can help me.

Thanks to Anurag Wagh's advice I figured it out.
I used this tutorial about gensim and how to use it in many ways.
Chapter 18 does what I was asking for, but during my tests I found a better way to achieve my goal.
Chapter 11 shows how to build an LDA model and how to extract a list of main topics from a set of documents.
Here is the code I used to build the LDA model:
# Step 0: Import packages and stopwords
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess, lemmatize
from nltk.corpus import stopwords
from gensim import corpora
import re
import nltk
import string
import pattern
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

docs = [doc for doc in open('file.txt', encoding='utf-8')]

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
it_stop_words = it_stop_words + [<custom stop words>]

# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word
def lemmatize_word(input_word):
    in_word = input_word
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    the_lemmatized_word = word_it.split()[0][0][4]
    return the_lemmatized_word

# Step 2: Prepare Data (Remove stopwords and lemmatize)
data_processed = []
for doc in docs:
    word_tokenized_list = nltk.tokenize.word_tokenize(doc)
    word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
    word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
    word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
    word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
    data_processed.append(word_tokenized_no_punct_no_sw_no_apostrophe)

dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]

lda_model = LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=7,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

# save the model
lda_model.save('lda_model.model')

# See the topics
lda_model.print_topics(-1)
With the trained model I can get a list of topics for each new non-conformity and detect whether it is related to something already reported by other non-conformities.
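For example, here is a minimal sketch of how a new non-conformity could be scored against the trained model (the example text is invented; the preprocessing mirrors the steps above):
# hypothetical new non-conformity text (Italian, like the real data)
new_doc = "10 pezzi non sono stati spediti"
new_tokens = [x.lower() for x in nltk.tokenize.word_tokenize(new_doc)
              if x not in string.punctuation and x not in it_stop_words]
new_bow = dct.doc2bow(new_tokens)
# topic distribution for the new document: a list of (topic_id, probability)
print(lda_model.get_document_topics(new_bow))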

Perhaps converting documents to vectors and then computing the distance between two vectors would be helpful.
doc2vec can be helpful here.
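A minimal sketch with gensim's Doc2Vec (assuming gensim 4.x; the toy corpus and the parameter values below are placeholders, not tuned settings):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus; in practice these would be the tokenized non-conformity texts
texts = [["received", "10", "pieces", "less", "than", "ordered"],
         ["10", "pieces", "have", "not", "been", "shipped"]]
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# infer a vector for a new sentence and find the closest existing documents
vec = model.infer_vector(["10", "pieces", "missing", "from", "the", "order"])
print(model.dv.most_similar([vec], topn=2))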

Related

How to get the semantic meaning of a word/phrase

Hi guys, I need help with something. Currently I'm working on a project where I have to find the semantic meaning of a word/phrase. For example:
Hi, hello, good morning should return regards, etc.
Any suggestion?
Thanks in advance
Your question is a bit vague, but here are two ideas that might help:
1. WordNet
WordNet is a lexical database that provides synonyms, categorisations and to some extent the 'semantic meaning' of English words. Here is the web interface to explore the database. Here is how to use it via NLTK.
Example:
from nltk.corpus import wordnet as wn
# get all possible meanings of a word. e.g. "welcome" has two possible meanings as a noun, three meanings as a verb and one meaning as an adjective
wn.synsets('welcome')
# output: [Synset('welcome.n.01'), Synset('welcome.n.02'), Synset('welcome.v.01'), Synset('welcome.v.02'), Synset('welcome.v.03'), Synset('welcome.a.01')]
# get the definition of one of these meanings:
wn.synset('welcome.n.02').definition()
# output: 'a greeting or reception'
# get the hypernym of the specific meaning, i.e. the more abstract category it belongs to
wn.synset('welcome.n.02').hypernyms()
# output: [Synset('greeting.n.01')]
2. Zero-shot-classification
HuggingFace Transformers and zero-shot classification: You can also use a pre-trained deep learning model to classify your text. In this case, you need to manually create labels for all possible different meanings you are looking for in your texts. e.g.: ["greeting", "insult", "congratulation"].
Then you can use the deep learning model to predict which label (broadly speaking 'semantic meaning') is the most adequate for your text.
Example:
# pip install transformers==3.1.0 # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = "Hi, I welcome you to this event"
candidate_labels = ["greeting", "insult", "congratulation"]
classifier(sequence, candidate_labels)
# output: {'sequence': 'Hi, I welcome you to this event',
# 'labels': ['greeting', 'congratulation', 'insult'],
# 'scores': [0.9001138210296631, 0.09858417510986328, 0.001302019809372723]}
=> Each of your labels received a score and the label with the highest score would be the "semantic meaning" of your text.
Here is an interactive web application to see what the library does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.
You have not shown any effort to write your own code, but here is a small example.
words = ['hello', 'hi', 'good morning']
x = input('Word here: ')
if x.lower() in words:
    print('Regards')

Python Spacy's Lemmatizer: getting all options for lemmas with maximum efficiency

When using spacy, the lemma of a token (lemma_) depends on the POS. Therefore, a specific string can have more than one lemma. For example:
import spacy
nlp = spacy.load('en')
for tok in nlp(u'He leaves early'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
for tok in nlp(u'These are green leaves'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
This will yield that the lemma for 'leaves' can be either 'leave' or 'leaf', depending on context. I'm interested in:
1) Getting all possible lemmas for a specific string, regardless of context. That is, applying the lemmatizer without depending on the POS or exceptions, just getting all feasible options.
In addition, but independently, I would also like to apply tokenization and get the "correct" lemma.
2) Running only tokenization and the lemmatizer over a large corpus, as efficiently as possible, without damaging the lemmatizer at all. I know that I can drop the 'ner' pipeline, for example, and shouldn't drop the 'tagger', but I didn't receive a straightforward answer regarding the parser etc. From a simulation over a corpus, the results seem to be the same, but I thought the 'parser' or 'sentencizer' should have an effect? My current code at the moment is:
import multiprocessing
our_num_threads = multiprocessing.cpu_count()
corpus = [u'this is a text', u'this is another text'] ## just an example
nlp = spacy.load('en', disable = ['ner', 'textcat', 'similarity', 'merge_noun_chunks', 'merge_entities', 'tensorizer', 'parser', 'sbd', 'sentencizer'])
nlp.pipe(corpus, n_threads = our_num_threads)
If I have a good answer on 1 and 2, I can then, for my needs, take the words that were lemmatized and consider other possible variations.
Thanks!
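For part 1, a speculative sketch that calls spaCy's rule-based lemmatizer directly with each open-class POS (this assumes spaCy v2.0/2.1, where LEMMA_INDEX, LEMMA_EXC and LEMMA_RULES can still be imported; later versions moved these tables to spacy-lookups-data):
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
# collect the lemmas produced for every open-class POS, regardless of context
candidates = set()
for pos in ('NOUN', 'VERB', 'ADJ'):
    candidates.update(lemmatizer('leaves', pos))
print(candidates)  # expected to contain both 'leaf' and 'leave'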

NLP Phrase Search in Python

I have been going through many libraries like Whoosh/NLTK and concepts like WordNet.
However I am unable to tackle my problem. I am not sure if I can find a library for this or I have to build this using the above mentioned resources.
Question:
My scenario is that I have to search for keywords.
Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 page book.
The catch is:
They can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (for the 'Sales Document' keyword). Is there an approach for this, or will I have to build something?
The code for the POS Tags is as follows. If no library is available I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet

def tag(x):
    return pos_tag(word_tokenize(x))

synonyms = []
antonyms = []
for syn in wordnet.synsets("Sales document"):
    #print("Down2")
    print(syn)
    #print("Down")
    for l in syn.lemmas():
        print(" \n")
        print(l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
for i in synonyms:
    print(tag(i))
Update:
We went ahead and made a Python program - feel free to fork it. (Pun intended)
Further, the Dhund Git repository is very untidy right now; we will clean it up once it's completed.
Currently it is still in a development phase.
This is the link.
Occurrences like "Sales should be documented" can be matched by increasing the slop parameter in Whoosh's Phrase query object.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in Query like this: "Sales should be documented"~5
To match the second example, "company selling should be written in the text files", you need semantic processing of your texts. Whoosh has a low-level implementation of a WordNet thesaurus that lets you index synonyms, but it supports only one-word synonyms.
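As a rough sketch (the schema, index directory and documents below are made up for illustration), a slop-based Phrase query could look like this:
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.query import Phrase

# build a tiny throwaway index
os.makedirs("indexdir", exist_ok=True)
schema = Schema(path=ID(stored=True), content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(path=u"/doc1", content=u"Sales should be documented by every branch.")
writer.add_document(path=u"/doc2", content=u"Purchase documents are archived yearly.")
writer.commit()

with ix.searcher() as searcher:
    # allow up to 3 words between "sales" and "documented"
    query = Phrase("content", [u"sales", u"documented"], slop=3)
    for hit in searcher.search(query):
        print(hit["path"])  # expected to print /doc1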

Classify a noun into abstract or concrete using NLTK or similar

How can I categorize a list of nouns into abstract or concrete in Python?
For example:
"Have a seat in that chair."
In the above sentence, chair is a noun and can be categorized as concrete.
I would suggest training a classifier using pretrained word vectors.
You need two libraries: spacy for tokenizing text and extracting word vectors, and scikit-learn for machine learning:
import spacy
from sklearn.linear_model import LogisticRegression
import numpy as np
nlp = spacy.load("en_core_web_md")
Distinguishing concrete and abstract nouns is a simple task, so you can train a model with very few examples:
classes = ['concrete', 'abstract']
# todo: add more examples
train_set = [
    ['apple', 'owl', 'house'],
    ['agony', 'knowledge', 'process'],
]
X = np.stack([list(nlp(w))[0].vector for part in train_set for w in part])
y = [label for label, part in enumerate(train_set) for _ in part]
classifier = LogisticRegression(C=0.1, class_weight='balanced').fit(X, y)
When you have a trained model, you can apply it to any text:
for token in nlp("Have a seat in that chair with comfort and drink some juice to soothe your thirst."):
    if token.pos_ == 'NOUN':
        print(token, classes[classifier.predict([token.vector])[0]])
The result looks satisfying:
# seat concrete
# chair concrete
# comfort abstract
# juice concrete
# thirst abstract
You can improve the model by applying it to different nouns, spotting the errors and adding them to the training set under the correct label.
Try to use WordNet via NLTK and explore the hypernym tree of the words you are interested in. WordNet is a lexical database that organises words in a tree-like structure based on their abstraction level. You can use this to get more abstract versions of your target word.
For example, the code below tells you that the word "chair" belongs to the category "seats", which belongs to the over-arching category "entity".
The word "anger" on the other hand, belongs to the category "emotion".
from nltk.corpus import wordnet as wn
wn.synsets('chair')
wn.synset('chair.n.01').hypernyms()
# [Synset('seat.n.03')]
wn.synset('chair.n.01').root_hypernyms()
# [Synset('entity.n.01')]
wn.synsets('anger')
wn.synset('anger.n.01').hypernyms()
# [Synset('emotion.n.01')]
=> look at the NLTK WordNet documentation and play around with the hypernym trees to categorise words into abstract or concrete categories. You will have to define yourself what exactly you mean by "abstract" or "concrete" and which categories of words you want to put into these two buckets though.
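Building on that, here is a hedged sketch of one possible rule: treat a noun as concrete if its first noun sense descends from physical_entity.n.01 and as abstract if it descends from abstraction.n.06 (this particular cut-off is my own choice, not an official WordNet classification):
from nltk.corpus import wordnet as wn

def concrete_or_abstract(word):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return 'unknown'
    # collect every ancestor on every hypernym path of the first noun sense
    ancestors = {s.name() for path in synsets[0].hypernym_paths() for s in path}
    if 'physical_entity.n.01' in ancestors:
        return 'concrete'
    if 'abstraction.n.06' in ancestors:
        return 'abstract'
    return 'unknown'

for w in ['chair', 'anger', 'juice', 'knowledge']:
    print(w, concrete_or_abstract(w))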
One simple way could be to maintain a dictionary of abstract nouns and, for unknown words, look at suffixes; words with the following suffixes are generally abstract nouns (a small sketch follows the list):
-ship
-ism
-ity
-ness
-age
-acy
-ment
-ability, etc.
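Here is a small sketch of that heuristic (the dictionary of known abstract nouns is just a placeholder):
ABSTRACT_NOUNS = {'love', 'anger', 'freedom'}  # placeholder dictionary of known abstract nouns
ABSTRACT_SUFFIXES = ('ship', 'ism', 'ity', 'ness', 'age', 'acy', 'ment', 'ability')

def is_abstract(noun):
    noun = noun.lower()
    return noun in ABSTRACT_NOUNS or noun.endswith(ABSTRACT_SUFFIXES)

for w in ['chair', 'friendship', 'knowledge', 'agreement']:
    # note: words like 'knowledge' slip through because they match no suffix
    print(w, 'abstract' if is_abstract(w) else 'concrete')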
First, tokenize the words with word_tokenize(string), then use pos_tag from NLTK.
import nltk
from nltk import *
string = "Have a seat in that chair."
words = nltk.word_tokenize(string)
nltk.pos_tag(words)
This is not tested, but I think it should be something like this.

POS tagging in German

I am using NLTK to extract nouns from a text-string starting with the following command:
tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))
It works fine in English. Is there an easy way to make it work for German as well?
(I have no experience with natural language processing, but I managed to use the Python NLTK library, which is great so far.)
Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.
See nltk.corpus.europarl_raw and this answer for example configuration.
Also, consider tagging this question with "nlp".
The Pattern library includes a function for parsing German sentences and the result includes the part-of-speech tags. The following is copied from their documentation:
from pattern.de import parse, split
s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print s.sentences[0]
>>> Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
Update: Another option is spacy, there is a quick example in this blog article:
import spacy
nlp = spacy.load('de')
doc = nlp(u'Ich bin ein Berliner.')
# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
Part-of-speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given context. Most (but not all) of these taggers use a statistical model of sorts as the main or sole device to "do the trick". Such taggers require some "training data" upon which to build this statistical representation of the language, and the training data comes in the form of corpora.
The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpora readers" which provide an API to read different types of corpora. I don't know the current state of affairs in NLTK proper, and whether it includes any German corpus. You can however locate some free corpora which you'll then need to convert to a format that satisfies the proper NLTK corpora reader, and then you can use this to train a POS tagger for the German language.
You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you'll have to find ways of bribing and otherwise coercing students to do that for you ;-)
Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. There are Python recipes for German NLP that I've compiled; you can access them at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html
#-*- coding: utf8 -*-
import os, glob, codecs

def installStanfordTag():
    if not os.path.exists('stanford-postagger-full-2013-06-20'):
        os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
        os.system('unzip stanford-postagger-full-2013-06-20.zip')
    return

def tag(infile):
    cmd = "./stanford-postagger.sh " + models[m] + " " + infile
    tagout = os.popen(cmd).readlines()
    return [i.strip() for i in tagout]

def taglinebyline(sents):
    tagged = []
    for ss in sents:
        os.popen("echo '''" + ss + "''' > stanfordtemp.txt")
        tagged.append(tag('stanfordtemp.txt')[0])
    return tagged

installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast': 'models/german-fast.tagger',
          'dewac': 'models/german-dewac.tagger',
          'hgc': 'models/german-hgc.tagger'}
os.chdir(stagdir)
print(os.getcwd())

m = 'fast'  # It's best to use the fast German tagger if your data is small.
sentences = ['Ich bin schwanger .', 'Ich bin wieder schwanger .', 'Ich verstehe nur Bahnhof .']
tagged_sents = taglinebyline(sentences)  # call the Stanford tagger

for sent in tagged_sents:
    print(sent)
I have written a blog-post about how to convert the German annotated TIGER Corpus in order to use it with the NLTK. Have a look at it here.
It seems to be a little late to answer the question, but it might be helpful for anyone who finds this question by googling like I did. So I'd like to share the things I found out.
The HannoverTagger might be a useful tool for this task.
You can find tutorials here and here, but the second one is in German.
The tagger seems to use the STTS tagset, in case you need a complete list of all tags.
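As a tentative sketch of how it can be called (the package name HanTa, the model file morphmodel_ger.pgz and the tag_sent call are taken from my reading of its documentation, so treat them as assumptions):
from HanTa import HanoverTagger as ht
import nltk

tagger = ht.HanoverTagger('morphmodel_ger.pgz')  # pretrained German model shipped with HanTa
words = nltk.word_tokenize('Ich bin ein Berliner.', language='german')
print(tagger.tag_sent(words))  # should yield (word, lemma, STTS tag) triples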
