I'm trying to build an app that finds relevant sentences in a document based on keywords or statements that a user enters. Doing this manually, needle-in-a-haystack style, seems highly inefficient.
Is there an ideal approach or library that can tackle this problem?
The field that deals with problems like the one you described is called information retrieval.
The simplest method for such queries is based on the bag-of-words model: you treat documents as vectors, so that their cosine similarity reflects how many words they have in common.
In Python you can do this, for example, with utilities from scikit-learn (fairly low-level) or with more production-ready tools like whoosh - see Python for Humanities for a sample tutorial.
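To make this concrete, here is a minimal sketch of the scikit-learn route; the sentences and the query are invented example data:

    # Minimal bag-of-words sketch with scikit-learn; the sentences
    # and query are made up for the example.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "The quarterly report shows strong revenue growth.",
        "Our office relocated to Main Street last year.",
        "Cosine similarity compares two word-count vectors.",
    ]
    query = "revenue growth report"

    # Fit on the document's sentences, then project the query into
    # the same vector space.
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])

    # Rank sentences by cosine similarity to the query.
    scores = cosine_similarity(query_vector, sentence_vectors).ravel()
    for idx in scores.argsort()[::-1]:
        print(f"{scores[idx]:.3f}  {sentences[idx]}")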
If you want to dig deeper, I encourage you to read the Information Retrieval book, at least the first couple of chapters.
I am working on a project to extract keywords from short texts (3-4 sentences). Using the spaCy library I extract noun phrases and named entities and use them as keywords. However, I would like to sort them by their importance with respect to the original text.
I tried standard information retrieval approaches, like tf-idf, and even a couple of graph-based algorithms, but with such short texts the results weren't great.
I was thinking that maybe an NN with an attention mechanism could help me rank those keywords. Is there any way to use the pre-trained models that come with spaCy to do some kind of ranking?
How about something like maximal marginal relevance? http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
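In case a sketch helps: MMR greedily picks the keyword that is most similar to the document while least similar to the keywords already picked. The vectors could come from spaCy's pre-trained embeddings (doc.vector / span.vector); the lambda value below is an arbitrary choice:

    # Rough Maximal Marginal Relevance (MMR) re-ranking sketch.
    # doc_vec / keyword_vecs are assumed to be dense embeddings,
    # e.g. spaCy's doc.vector and span.vector; lam=0.7 is arbitrary.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def mmr_rank(doc_vec, keyword_vecs, lam=0.7):
        """Return keyword indices, best-ranked first."""
        remaining = list(range(len(keyword_vecs)))
        selected = []
        while remaining:
            def mmr_score(i):
                relevance = cosine(keyword_vecs[i], doc_vec)
                redundancy = max((cosine(keyword_vecs[i], keyword_vecs[j])
                                  for j in selected), default=0.0)
                return lam * relevance - (1 - lam) * redundancy
            best = max(remaining, key=mmr_score)
            selected.append(best)
            remaining.remove(best)
        return selected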
I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is. The defined set of words would be,
{owner, director, office, industry...}-(1)
and the intended output has to be something like,
{Mr. Smith James, , Main Street, Financial Banking}-(2)
I was looking at a method based on semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to pull nouns out of those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as named entity recognition (NER).
In Python there is a popular library called spaCy that can be used for that. The standard models are able to detect 18 different entity types, which is fairly good coverage.
Person and company names should be extracted easily, while whole addresses and the industry might be more difficult; you may have to train your own model on those entity types. spaCy also provides an API for training your own models.
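For the off-the-shelf part, a minimal sketch (it assumes the small English model en_core_web_sm has been downloaded; the sample text is invented):

    # Minimal spaCy NER sketch. Assumes the model was installed with:
    #   python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = ("Smith James is the owner of Acme Corp. The office is "
            "situated on Main Street and the company operates in "
            "financial banking.")

    # doc.ents holds the detected entities; ent.label_ is the entity
    # type, e.g. PERSON, ORG, GPE.
    for ent in nlp(text).ents:
        print(ent.text, ent.label_)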
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that's sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could take advantage of that instead.
I am working on a project where I need to search within all the documents (PDF, DOC, etc.) present in a database for those relevant to a query.
Earlier I used a simple relation where I stored relevant keywords associated with each document, and if the query contained those keywords I fetched those documents. But this method is not very reliable, as those keywords might be misleading. I need to search within the documents themselves, and I am looking for a practical search algorithm that scales well and has low time complexity.
Any suggestions and resources are most welcome.
Thank you.
Try the Rabin-Karp search algorithm (based on hash codes). Since you have to search for more than one pattern across many documents, it can compute the hash codes of all the patterns up front and look for all of them in one pass.
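To show the idea, here is a toy sketch for several patterns of equal length; real implementations group patterns by length and pick the modulus more carefully:

    # Toy Rabin-Karp sketch for multiple equal-length patterns.
    def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
        m = len(patterns[0])
        assert all(len(p) == m for p in patterns), "equal-length patterns only"

        def full_hash(s):
            h = 0
            for ch in s:
                h = (h * base + ord(ch)) % mod
            return h

        pattern_hashes = {full_hash(p): p for p in patterns}
        high = pow(base, m - 1, mod)  # weight of the leading character

        matches = []
        h = full_hash(text[:m])
        for i in range(len(text) - m + 1):
            p = pattern_hashes.get(h)
            if p is not None and text[i:i + m] == p:  # confirm, hashes can collide
                matches.append((i, p))
            if i + m < len(text):  # roll the window one character forward
                h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
        return matches

    # e.g. rabin_karp_multi("the cat sat on the mat", ["cat", "mat"])
    # -> [(4, 'cat'), (19, 'mat')]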
Is there any metric that measures the wealth of information in a text?
I am thinking of anything that can reliably identify unique information segments within a text. Simple metrics using frequency distributions or unique words are okay, but they don't quite capture unique information at the sentence level.
With coding methods I would have to manually code each sentence/word or anything else that counts as a unique piece of information in a text, and that could take a while. So I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I can count 4 unique information points: what Navtilos is, where it is, how it was created, and when.
Obviously a human interprets text differently than a computer. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.
A frequency distribution may work effectively, but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that the uniqueness of a keyword to a document reflects how well it describes that document's contents). For this, you can use a measure like tf-idf.
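As a toy illustration of tf-idf as a specificity score (the documents are invented; in practice you would score the extractor's keyword phrases rather than raw unigrams):

    # Toy tf-idf specificity scoring with scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Navtilos is a small volcanic islet of the Santorini volcano.",
        "Santorini is a popular travel destination in Greece.",
        "Volcanic eruptions created several islets near Santorini.",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    # For each document, print the terms with the highest tf-idf
    # weight, i.e. the terms most specific to that document.
    for i, doc in enumerate(docs):
        row = tfidf[i].toarray().ravel()
        top = row.argsort()[::-1][:3]
        print(doc)
        print("  ", [(terms[j], round(float(row[j]), 3)) for j in top])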
I want to get related [things/questions] in my app, similar to what StackOverflow does when you tab out of the Title field.
I can think of only one way to do it, which I think might be fast enough:
Do a search for the title in the corpus of titles of all [things], and return the first x matches. We can use whatever search is being used for site search.
What other ways are there to do this that are fast enough? This will be triggered on tab-out, so heavy server-side processing is not feasible.
I am just looking for the general approach, but I am using MySQL and Django, so if your answer uses those, all the better.
[I cannot think of good tags for it, so please feel free to edit them]
You're looking for a content-based recommendation algorithm. AFAICT StackOverflow's version looks at the tags and the words in the title, and finds questions that share some of them. It can be implemented as a nearest-neighbour search in a space where documents are represented as TF-IDF vectors.
Implementation-wise, go with any Django search engine that supports stemming, stopwords, non-strict matches, and tf-idf weights. Algorithmic complexity isn't high (just a few index lookups), so it doesn't matter if it's written in Python.
If you can't find a search engine that does all of this, leave the stemming and stopword removal to the search engine, query it on individual words, and do your own tf-idf scoring on top, with a bonus for shared tags.
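A compact sketch of that last option; the titles, tags, and the tag-bonus weight are all invented for the example:

    # TF-IDF nearest-neighbour lookup over titles, with a made-up
    # bonus for shared tags.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "How do I join two tables in MySQL?",
        "Django ORM: filtering across relations",
        "Fast full-text search for question titles",
    ]
    tags = [{"mysql", "sql"}, {"django", "orm"}, {"search", "performance"}]

    vectorizer = TfidfVectorizer(stop_words="english")
    title_vectors = vectorizer.fit_transform(titles)

    def related(new_title, new_tags, tag_weight=0.2, top_k=2):
        """Rank stored questions by title similarity plus a tag bonus."""
        sims = cosine_similarity(vectorizer.transform([new_title]),
                                 title_vectors).ravel()
        scores = [s + tag_weight * len(new_tags & t)
                  for s, t in zip(sims, tags)]
        order = sorted(range(len(titles)),
                       key=lambda i: scores[i], reverse=True)
        return [(titles[i], round(float(scores[i]), 3)) for i in order[:top_k]]

    print(related("filter related tables in Django", {"django"}))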