Objective
I am trying to build an Ontology-based semantic search engine specific to our data.
Problem Statement
Our data is just a bunch of sentences and what I want is to give a phrase and get back the sentences which are:
Similar to that phrase
Has a part of a sentence that is similar to the phrase
A sentence which is having contextually similar meanings
Let me try giving you an example suppose I search for the phrase "Buying Experience", I should get the sentences like:
I never thought car buying could take less than 30 minutes to sign and buy.
I found a car that I liked and the purchase process was straightforward and easy
I absolutely hated going car shopping, but today I’m glad I did
The search term will always be one to three-word phrases. Ex: buying experience, driving comfort, infotainment system, interiors, mileage, performance, seating comfort, staff behavior.
Implementation explored already
OpenSemanticSearchAWS Comprehend
Current Implementation
Right now I am exploring one implementation ('Holmes Extractor') which mostly suites the objective I am trying to achieve. For my use case, I am using 'Topic Matching' available in Holmes.
Holmes has already incorporated many important concepts in its design such as NER, pronoun resolution, word embedding based similarity, ontology-based extraction, which makes this implementation even more promising.
What I have tried so far on Holmes
Manually curated a labeled set with sentences
Registered and serialized our dataset with unique label id for each document (sentence)
For comparison prepared a template to get precision, recall, and f1_score of each iteration of running queries (search query is the same as the label)
Using spaCy's 'en_core_web_lg' for word embedding
Score template looks something like:
Query | Predicted Rows | Actual Number of rows in the queried set | Precision | Recall | f1_score
Please find below the parameters used in each iteration. For all the iterations, generated the scores for Manager.overall_similarity_threshold in range(0.0 to 1.0) and topic_match_documents_returning_dictionaries_against.embedding_penalty in range(0.0 to 1.0)
Iteration 1:
Without any Ontology and Manager.embedding_based_matching_on_root_words=False
Iteration 2:
Without any Ontology and Manager.embedding_based_matching_on_root_words=True
Build a custom Ontology using Protege, which is specific to our data set
Iteration 3:
With custom Ontology and Manager.embedding_based_matching_on_root_words=False, Ontology.symmetric_matching=True
Iteration 4:
With custom Ontology and Manager.embedding_based_matching_on_root_words=True, Ontology.symmetric_matching=True
At this stage, I can observe that
custom Ontology,
Manager.embedding_based_matching_on_root_words=True,
Manager.overall_similarity_threshold in range (0.6-0.8),
topic_match_documents_returning_dictionaries_against.embedding_penalty in range in range(0.6-0.8),
together are producing very strong scores.
average scores for 0.8 -> precision: 0.78, recall: 0.712, f1_score: 0.738
average scores for 0.7 -> precision: 0.738, recall: 0.7435, f1_score: 0.726
Still, results are not that accurate even after providing custom ontology, because of a lack of stemming keywords in the Ontology graph.
For example, if in Ontology we have given something like Mileage->fuel efficiency, fuel economy
Holmes will not match sentences with fuel efficient under Mileage, since "fuel efficient" is not mentioned in the Graph
Iteration 5:
With custom Ontology and Manager.embedding_based_matching_on_root_words=False, Ontology.symmetric_matching=False
Iteration 6:
With custom Ontology and Manager.embedding_based_matching_on_root_words=True, Ontology.symmetric_matching=False
Iteration 7:
Along with Iteration 4 parameters, and topic_match_documents_returning_dictionaries_against.tied_result_quotient in range(0.9 to 0.1)
Again results are better in 0.8-0.9 range
Iteration 8:
I had pre-downloaded ontologies from a few sources:
Automobile Ontology:
https://github.com/mfhepp/schemaorg/blob/automotive/data/schema.rdfa
Vehicle Sales Ontology:
http://www.heppnetz.de/ontologies/vso/ns
Product Ontology:
http://www.productontology.org/dump.rdf
Mesh Thesaurus:
https://data.europa.eu/euodp/en/data/dataset/eurovoc
English thesaurus:
https://github.com/mromanello/skosifaurus
https://raw.githubusercontent.com/mromanello/skosifaurus/master/thesaurus.rdf
General Ontologies:
https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30
https://wiki.dbpedia.org/Downloads2015-04#dbpedia-ontology
https://lod-cloud.net/dataset/dbpedia
https://wiki.dbpedia.org/services-resources/ontology
https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
https://tools.wmflabs.org/wikidata-exports/rdf/
All the above ontologies are available in either ttl, n3 or rdf format. And holmes extractor (implicitly RDFlib) works on OWL syntax. Therefore, the conversion of the given formats to the OWL format for heavy files is another challenge.
I have tried loading the first 4 ontologies on Holmes after conversion to OWL. But the conversion process takes time as well. Also, loading big ontologies on Holmes itself is taking a good amount of time.
As per Holmes documentation, Holmes works best with the ontologies that have been built for specific subject domains and use cases.
What next
The challenges which I am facing here are:
Finding the proper Ontologies available around customer experience in the automobile domain
Making the conversion process of ontologies faster
How can we easily generate the domain-specific ontology from our existing data
Configuring the right set of parameters along with Ontology to get more accurate results for search phrase queries
Any help will be appreciated. Thanks a lot for the help in advance
Related
I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped each review per product, such that I can have different reviews succeeding one after the other in my dataframe (i.e: different authors for one product). Furthermore, I also already tokenized the reviews (and all other pre-processing methods). Below is a mock-up dataframe of what I'm having (the list of tokens per product is actually very high, as well as the number of products):
Product
reviews_tokenized
XGame3000
absolutely amazing simulator feel inaccessible ...
Poliamo
production value effect tend cover rather ...
Artemis
absolutely fantastic possibly good oil ...
Ratoiin
ability simulate emergency operator town ...
However, I'm not sure of what would be the most efficient between doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it has the ability to find similarities by taking into account the paragraph/sentence, and find the topic of it (which I'd like to have, since I'm trying to cluster products by topics), but I'm a bit worry about the fact that the reviews are from different authors, and thus might bias the embeddings? Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which giving me a quite good silhouette score (~0.7).
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed = SEED, ns_exponent = 0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
product2vec = [model3.infer_vector((df['tokens'][i].split(' '))) for i in range(0,len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: Below is the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
Lately I am doing a research with purpose of unsupervised clustering of a huge texts database. Firstly I tried bag-of-words and then several clustering algorithms which gave me a good result, but now I am trying to step into doc2vec representation and it seems to not be working for me, I cannot load prepared model and work with it, instead training my own doesnt prove any result.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100,workers=8)
(around 20-50 words each) but the similarity score which is proposed by gensim like
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
is working much worse than the same for Bag-of-words with my model.
By much worse i mean that identical or almost identical text have similarity score compatible to text which dont have any connection i can think about. So i decided to use model from Is there pre-trained doc2vec model? to use some pretrained model which might have more connections between words. Sorry for somewhat long preambula but the question is how do i plug it in? Can someone provide some ideas how do i, using the loaded gensim model from https://github.com/jhlau/doc2vec convert my own dataset of text into vectors of same length? My data is preprocesssed (stemmed, no punctuation, lowercase, no nlst.corpus stopwords)and i can deliver it from list or dataframe or file if needed, the code question is how to pass my own data to pretrained model? Any help would be appreciated.
UPD: outputs that make me feel bad
Train Document (6134): «use medium paper examination medium habit one
week must chart daily use medium radio television newspaper magazine
film video etc wake radio alarm listen traffic report commuting get
news watch sport soap opera watch tv use internet work home read book
see movie use data collect journal basis analysis examining
information using us gratification model discussed textbook us
gratification article provided perhaps carrying small notebook day
inputting material evening help stay organized smartphone use note app
track medium need turn diary trust tell tell immediately paper whether
actually kept one begin medium diary soon possible order give ample
time complete journal write paper completed diary need write page
paper use medium functional analysis theory say something best
understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio
television newspaper magazine film video etc wake radio alarm listen
traffic report commuting get news watch sport soap opera watch tv use
internet work home read book see movie use data collect journal basis
analysis examining information using us gratification model discussed
textbook us gratification article provided perhaps carrying small
notebook day inputting material evening help stay organized smartphone
use note app track medium need turn diary trust tell tell immediately
paper whether actually kept one begin medium diary soon possible order
give ample time complete journal write paper completed diary need
write page paper use medium functional analysis theory say something
best understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
This looks perfectly ok, but looking on other outputs
Train Document (1185): «photography garry winogrand would like paper
life work garry winogrand famous street photographer also influenced
street photography aim towards thoughtful imaginative treatment detail
referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page
essay tang dynasty essay discus buddhism tang dynasty name artifact
tang dynasty discus them history put heading paragraph information
tang dynasty discussed essay»
Shows us that the score of similarity between two exactly same texts which are the most similar in the system and two like super distinct is almost the same, which makes it problematic to do anything with the data.
To get most similar documents i use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
If your result on your self-trained model are unsatisfactory, there could be other problems in your training and inference code (that's not shown yet in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.
I have 1000 text files which have discharge summary for patients
SAMPLE_1
The patient was admitted on 21/02/99. he appeared to have pneumonia at the time
of admission so we empirically covered him for community-acquired pneumonia with
ceftriaxone and azithromycin until day 2 when his blood cultures grew
out strep pneumoniae that was pan sensitive so we stopped the
ceftriaxone and completed a 5 day course of azithromycin. But on day 4
he developed diarrhea so we added flagyl to cover for c.diff, which
did come back positive on day 6 so he needs 3 more days of that…” this
can be summarized more concisely as follows: “Completed 5 day course
of azithromycin for pan sensitive strep pneumoniae pneumonia
complicated by c.diff colitis. Currently on day 7/10 of flagyl and
c.diff negative on 9/21.
SAMPLE_2
The patient is an 56-year-old female with history of previous stroke; hypertension;
COPD, stable; renal carcinoma; presenting after
a fall and possible syncope. While walking, she accidentally fell to
her knees and did hit her head on the ground, near her left eye. Her
fall was not observed, but the patient does not profess any loss of
consciousness, recalling the entire event. The patient does have a
history of previous falls, one of which resulted in a hip fracture.
She has had physical therapy and recovered completely from that.
Initial examination showed bruising around the left eye, normal lung
examination, normal heart examination, normal neurologic function with
a baseline decreased mobility of her left arm. The patient was
admitted for evaluation of her fall and to rule out syncope and
possible stroke with her positive histories.
I also have a csv file which is 1000rows X 5columns. Each row has information entered manually for each of the text file.
So for example for the above two files, someone has manually entered these records in the csv file:
Sex, Primary Disease,Age, Date of admission,Other complications
M,Pneumonia, NA, 21/02/99, Diarhhea
F,(Hypertension,stroke), 56, NA, NA
My question is:
How do I represent use this information of text:labels to a machine learning algorithm
Do I need to do some manual labelling around the areas of interest in all the 1000 text files?
If yes then how and which method to use. (i.e. like <ADMISSION> was admitted on 21/02/99</ADMISSION>,
<AGE>56-year-old</AGE>)
So basically how do I use this text:labels data to automate the filling of labels.
As far as I can tell the point is not to mark up the texts, but to extract the information represented by the annotations. This is an information extraction problem, and you should read up on techniques for this. The CSV file contains the information you want to extract (your "gold standard", so you should start by splitting it into training (90%) and testing (10%) subsets.
There is a named entity recognition task in there: Recognize diseases, numbers, dates and gender. You could use an off-the shelf chunker, or find an annotated medical corpus and use it to train one. You can also use a mix of approaches; spotting words that reveal gender is something you could hand-code pretty easily, for example. Once you have all these words, you need some more work, for example, to distinguish the primary disease from the symptoms; the age from other numbers, and the date of admission from any other dates. This is probably best done as a separate classification task.
I recommend you now read through the nltk book, chapter by chapter, so that you have some idea of what the available techniques and tools are. It's the approach that matters, so don't get bogged down in comparisons of specific machine learning engines.
I'm afraid the algorithm that fills the gaps has not yet been invented. If the gaps were strongly correlated or had some sort of causality you might be able to model that with some sort of Bayesian model. Still with the amount of data you have this is pretty much impossible.
Now on the more practical side of things. You can take two approaches:
Treat the problem as a document-level task in which case you can just take all rows with a label and train on them and infer the labels/values of the rest. You should look at Naïve Bayes, Multi-class SVM, MaxEnt, etc. for the categorical columns and linear regression for predicting the numerical values.
Treat the problem as an information extraction task in which case you have to add the annotation you mentioned inside the text and train a sequence model. You should look at CRF, structured SVM, HMM, etc. Actually, you could look at some systems that adapt multiclass classifiers to sequence labeling tasks, e.g. SVMTool for POS tagging (can be adapted to most sequence labeling tasks).
Now about the problems, you will face. In 1. it is very unlikely that you will predict the date of the record with any algorithm. It might be possible to roughly predict the patient age as this is something that usually correlates with diseases, etc. And it's very very unlikely that you will be able to even set up the disease column as an entity extraction task.
If I have to solve your problem I would probably pick approach 2. which is imho the correct approach but could is also quite a bit of work. In that case, you will need to create markup annotations yourself. A good starting point is an annotation tool called brat. Once you have your annotations, you could develop a classifier in the style of CoNLL-2003.
What you are trying to achieve seems quite a bit, especially with 1000 records. I think (depending on your data) you may be better off using ready products instead of building them yourself. There are open source and commercial products that might be able to use -- lexigram.io has an API, MetaMap and Apache cTAKES are state-of-the-art open source tools for clinical entity extraction.
I am working on a simple naive bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and concluding if they are relevant or not. The factors on which this decision is based takes into account the probabilities of certain attributes being present on that page. These would be certain keywords that increase the relevancy of the page. The keywords are apple, banana, mango. The relevant/irrelevant score is for each user. Assume that a user marks the page relevant/irrelevant equally likely.
Now for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user would have relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
This is a slightly endless topic as there are millions of factors involved. Python is a good example as it drives most of goolge(for what I know). And this brings us to the very beginning of google-there was an interview with Larry Page some years ago who was speaking about the search engines before google-for example when he typed the word "university", the first result he found had the word "university" a few times in it's title.
Going back to naive Bayes classifiers - there are a few very important key factors - assumptions and pattern recognition. And relations of course. For example you mentioned apples - that could have a few possibilities. For example:
Apple - if eating, vitamins, and shape is present we assume that the we are most likely talking about a fruit.
If we are mentioning electronics, screens, maybe Steve Jobs - that should be obvious.
If we are talking about religion, God, gardens, snakes - then it must have something to do with Adam and Eve.
So depending on your needs, you could have a basic segments of data where each one of these falls into, or a complex structure containing far more details. So yes-you base most of those on plain assumptions. And based on those you can create a more complex patterns for further recognition-Apple-iPod, iPad -having similar pattern in their names, containing similar keywords, mentioning certain people-most likely related to each other.
Irrelevant data is very hard to spot-at this very point you are probably thinking that I own multiple Apple devices, writing on a large iMac, while this couldn't be further from the truth. So this would be a very wrong assumption to begin with. So the classifiers themselves must make a very good segmentation and analysis before jumping to exact conclusions.
I have a question regarding sentiment analysis that i need help with.
Right now, I have a bunch of tweets I've gathered through the twitter search api. Because I used my search terms, I know what are the subjects or entities (Person names) that I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of english words with known valence/sentiment score and calculate the sentiments (+/-) based on availability of these words in the tweet. The problem is that sentiments calculated this way - I'm actually looking more at the tone of the tweet rather than ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously in a positive tone, but person A should get a negative.
To improve my sentiment analysis, I can probably take into account negation and modifiers from my word list. But how exactly can I get my sentiments analysis to look at the subject of the message (and possibly sarcasm) instead?
It would be great if someone can direct me towards some resources....
While awaiting for answers from researchers in AI field I will give you some clues on what you can do quickly.
Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis would be to treat it as a supervised learning problem, where you have some small training corpus that includes human made annotations (later about that) and a testing corpus on which you test how well you approach/system is performing. For training you will need some classifiers, like SVM, HMM or some others, but keep it simple. I would start from binary classification: good, bad. You could do the same for a continuous spectrum of opinion ranges, from positive to negative, that is to get a ranking, like google, where the most valuable results come on top.
For a start check libsvm classifier, it is capable of doing both classification {good, bad} and regression (ranking).
The quality of annotations will have a massive influence on the results you get, but where to get it from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions of customers about restaurants they recently visited and gave some feedback about the food, service or atmosphere.
The connection about their opinions and numerical world is expressed in terms of numbers of stars they gave to the restaurant. You have natural language on one site and restaurant's rate on another.
Looking at this example you can devise your own approach for the problem stated.
Take a look at nltk as well. With nltk you can do part of speech tagging and with some luck get names as well. Having done that you can add a feature to your classifier that will assign a score to a name if within n words (skip n-gram) there are words expressing opinions (look at the restaurant corpus) or use weights you already have, but it's best to rely on a classfier to learn weights, that's his job.
In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special-case of a joke, which is another exception in your program. Etcetera, etc, etc.
A good example (posted by ScienceFriction somewhere here on SO):
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the breaks system of the Toyota.
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)
I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making a good progress towards the solution.
For example, 'part-of-speech' might help you to figure out subject, verb and object in the sentence. And 'n-grams' might help you in the Toyota vs. thriller example to figure out the context. Look at TagHelperTools. It is built on top of weka and provides part-of-speech and n-grams tagging.
Still, it is difficult to get the results that OP wants, but it won't take 40 years.