How does Lucene's positional index work so efficiently? - python

Usually any search engine software creates inverted indexes to make searches faster. The basic format is:
word: <docnum, positions>, <docnum, positions>, <docnum, positions>, ...
Whenever there is a quoted query like "Harry Potter Movies", the word positions must match exactly, and a within-k-word query like hello /4 world generally means: find the word world within a distance of 4 words to the left or right of the word hello. My question: we could linearly scan the postings and compute the word distances the query asks for, but if the collection is really large we can't afford to scan all the postings. So is there some other data structure or kind of optimisation that Lucene or Solr uses?
A first solution could be to search only some k postings for each word. Another could be to search only the top documents (usually called a champion list, sorted by tf-idf or similar at indexing time), but then better documents may be ignored. Both solutions share the same disadvantage: neither guarantees result quality. Yet a Solr server gives assured result quality even on large collections. How?
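(For concreteness, a minimal sketch of the naive linear position check described above, in plain Python with made-up position lists; this is not how Lucene implements it.)

def within_k(positions_a, positions_b, k):
    # Return True if some occurrence of term A lies within k word positions
    # of some occurrence of term B (either side), given the sorted position
    # lists of both terms in one document.
    i, j = 0, 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # advance whichever pointer lags behind
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

# "hello /4 world" checked against one document's position lists:
print(within_k([3, 17, 42], [20, 99], 4))   # True: positions 17 and 20 are 3 apart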

The phrase query you are asking about here is actually really efficient to compute the positions of, because you're asking for the documents where 'Harry' AND 'Potter' AND 'Movies' occur.
Lucene is pretty smart, but the core of its algorithm for this is that it only needs to visit the position lists of documents where all three of these terms occur at all.
Lucene's postings are also sharded into multiple files:
Inside the counts-files are: (Document, TF, PositionsAddr)+
Inside the positions-files are: (PositionsArray)
So it can sweep across the (doc, tf, pos_addr) entries for each of these three terms, and only consult the PositionsArray when all three words occur in a specific document. Phrase queries can be really quick, because you visit at most the documents of the least-frequent term.
If you want to see a phrase query run slowly (and do lots of disk seeks!), try: "to be or not to be" ... here the AND part doesn't help much because all the terms are very common.
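A rough sketch of that two-stage idea in Python, where a toy in-memory dict stands in for Lucene's on-disk counts and positions data and the index contents are invented for illustration:

def phrase_match(postings, phrase_terms):
    # postings: {term: {doc_id: sorted list of token positions}}
    # Doc-level AND first, driven by the rarest term: we never examine
    # more documents than the least-frequent term appears in.
    rarest = min(phrase_terms, key=lambda t: len(postings[t]))
    candidates = set(postings[rarest])
    for t in phrase_terms:
        candidates &= set(postings[t])

    hits = []
    for doc in candidates:                      # positions consulted only here
        for p in postings[phrase_terms[0]][doc]:
            # the i-th phrase term must occur at position p + i
            if all(p + i in postings[t][doc] for i, t in enumerate(phrase_terms)):
                hits.append(doc)
                break
    return hits

index = {
    "harry":  {1: [0, 40], 2: [7]},
    "potter": {1: [1], 2: [8], 3: [2]},
    "movies": {1: [2], 3: [9]},
}
print(phrase_match(index, ["harry", "potter", "movies"]))   # [1]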

Related

Text classification using a sieve

I'm currently somewhat lost with my problem; hopefully I can describe it accurately enough.
What I'm trying to do:
I have a list of tiny text fragments (think twitter) and want to categorize them into different topics.
To do so, I check for the occurrence of certain keywords in the text - keywords that may contain wildcards. The current approach of using Google's RE2 regular expression engine is really slow, because we have to check every keyword until one hits -> linear search.
What am I looking for? Something like a PostgreSQL trgm GIN index on which I can execute reverse searches - so that instead of going through all the keywords, I have some faster way of searching.
Actually using PostgreSQL does not work in this case, because it is to be offered as a kind of SaaS.
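(To make the bottleneck concrete, this is roughly the linear scan described above, with Python's built-in re module standing in for the RE2 bindings; the keyword table and topics are invented for illustration.)

import re

# hypothetical keyword patterns per topic; '*' wildcards become \w*
KEYWORDS = {
    "sports":   ["football*", "champion* league"],
    "politics": ["election*", "parliament*"],
}

COMPILED = [(topic, re.compile(pattern.replace("*", r"\w*"), re.IGNORECASE))
            for topic, patterns in KEYWORDS.items()
            for pattern in patterns]

def categorize(text):
    # linear scan: try every keyword pattern until one hits
    for topic, regex in COMPILED:
        if regex.search(text):
            return topic
    return None

print(categorize("Watching the Champions League final tonight"))   # sports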

Direction needed: finding terms in a corpus

My question isn't about a specific code issue, but rather about the best direction to take on a Natural Language Processing challenge.
I have a collection of several hundred Word and PDF files (from which I can export the raw text) on the one hand, and a list of terms on the other. Terms can consist of one or more words. What I need to do is identify in which file(s) each term is used, applying stemming and lemmatization.
How would I best approach this? I know how to extract text, apply tokenization, lemmatization, etc., but I'm not sure how I could search for occurrences of terms in lemmatized form inside a corpus of documents.
Any hint would be most welcome.
Thanks!
I would suggest you create an inverted index of the documents, where you record the locations of each word in a list, with the word form as the index.
Then you create a secondary index, keyed by the lemmatised forms, whose values are the lists of inflected forms that belong to each lemma.
When you look up a lemmatised word, e.g. go, you go to the secondary index to retrieve the inflected forms go, goes, going, went; next you go to the inverted index and get all the locations for each of the inflected forms.
For ways of implementing this in an efficient manner, look at Witten/Moffat/Bell, Managing Gigabytes, which is a fantastic book on this topic.
UPDATE: multi-word units can be worked out from the same index. To look for "software developer", look up "software" and "developer" and then merge the locations: every time their positions differ by 1 they are adjacent and you have found a match; discard all the ones that are further apart. A sketch of both indexes and the adjacency merge follows below.
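A minimal sketch of the two indexes and the adjacency merge; the documents are invented and a tiny hand-written lemma table stands in for a real lemmatizer (NLTK, spaCy, ...):

from collections import defaultdict

docs = {
    "a.txt": "the developers went home after going over the software",
    "b.txt": "a software developer goes over code",
}

# toy lemma map; words not listed are their own lemma
LEMMA = {"went": "go", "goes": "go", "going": "go", "developers": "developer"}

inverted = defaultdict(lambda: defaultdict(list))    # word form -> doc -> positions
lemma_to_forms = defaultdict(set)                    # lemma -> inflected forms

for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        inverted[token][doc_id].append(pos)
        lemma_to_forms[LEMMA.get(token, token)].add(token)

def lookup(lemma):
    # all locations of any inflected form of the lemma
    hits = defaultdict(list)
    for form in lemma_to_forms.get(lemma, {lemma}):
        for doc_id, positions in inverted[form].items():
            hits[doc_id].extend(positions)
    return dict(hits)

def lookup_phrase(lemma_a, lemma_b):
    # multi-word units: keep only locations where the two parts are adjacent
    a, b = lookup(lemma_a), lookup(lemma_b)
    return {d: [p for p in a[d] if p + 1 in b.get(d, [])]
            for d in a if any(p + 1 in b.get(d, []) for p in a[d])}

print(lookup("go"))                              # locations of went/going/goes
print(lookup_phrase("software", "developer"))    # {'b.txt': [1]}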

Ways to search and tag a MongoDB database of academic papers

Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a mongoDB database.
I'd like to write a search feature (using Python, R, whatever) that takes entered text and returns the most relevant articles. Clearly, relevance is really hard -- that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how, though) and then return them sorted by the number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define well, but in simple terms I'd say no: there is just no single, precise definition of fairly well. A lot of challenges intrinsic to the particular problem arise when one tries to implement a good search algorithm. Those challenges lie in:
diversity of user needs: users in different fields have different intentions and, as a result, different expectations of a search result page;
diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and so are good starting points; TF*IDF and BM25 are the two most popular.
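As a starting point, a compact pure-Python Okapi BM25 scorer (the standard formula with the usual k1/b defaults; the example documents are invented, and this is a proof of concept rather than production code):

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    # rank docs (list of strings) against a query string with Okapi BM25
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(term for d in tokenized for term in set(d))   # document frequencies

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

papers = ["deep learning for natural language processing",
          "bayesian methods in genomics",
          "deep generative models"]
print(bm25_scores("deep learning", papers))   # highest score for the first paper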
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with full-text indexing support is good enough for a proof of concept, but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you get the ability to control:
how exactly words are stemmed (this is important to solve understemming/overstemming problems);
what counts as a word. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query?);
how exactly search is performed: which words can be ignored during search, which ones are required to be found, and which ones are required to be found near each other (think of search phrases like "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work well.
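To give a flavour of the stemming and synonym points above, a hedged sketch using NLTK's Porter stemmer and a hand-made synonym table rather than Lucene's actual analysis chains:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
SYNONYMS = {"o2": ["oxygen"], "oxygen": ["o2"]}   # toy expansion table

def analyze(text):
    # tokenize, stem, and expand synonyms -- roughly what an analysis
    # chain does to text before terms ever reach the index
    terms = []
    for token in text.lower().split():
        terms.append(stemmer.stem(token))
        for synonym in SYNONYMS.get(token, []):
            terms.append(stemmer.stem(synonym))
    return terms

print(analyze("oxygen levels"))   # ['oxygen', 'o2', 'level']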

How to automatically label a cluster of words using semantics?

The context is : I already have clusters of words (phrases actually) resulting from kmeans applied to internet search queries and using common urls in the results of the search engine as a distance (co-occurrence of urls rather than words if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries: ['my husband attacked me', 'he was arrested by the police', 'the trial is still going on', 'my husband can go to jail for harassing me?', 'free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NLP, but I should make clear that I don't want to extract words using POS tagging (at least that is not the expected final outcome, though it may be a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the XML structure of WordNet) and then summarize the context in one or a few words.
Any ideas? I can use R or Python; I read a little about NLTK but haven't found a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself has benefits. 1) You may discover you had the wrong number of clusters (the k parameter) or that there was too much junk in the input to begin with. 2) You will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need a quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, etc.
When we talk about semantics in this area we mean statistical semantics. Statistical or distributional semantics is very different from other definitions of semantics, which have logic and reasoning behind them. Statistical semantics is based on the distributional hypothesis, which takes context to be the meaning-bearing aspect of words and phrases. Meaning in this very abstract and general sense is, in different literatures, called topics. There are several unsupervised methods for modelling topics, such as LDA, or even word2vec, which basically provides a word similarity metric or suggests a list of similar words as another kind of context. Usually, once you have these unsupervised clusters, you need a domain expert to tell you the meaning of each cluster.
However, for several reasons you might accept a low-accuracy assignment of a word as the general topic (or, as in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at word sense disambiguation tasks that look for coarse-grained word senses. For WordNet, this is known as the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter
As for your question about choosing words from the given phrases, there is also an active question about "converting a phrase to a vector"; my answer to that question, in word2vec fashion, might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if they come to mind.
The paper Automatic Labelling of Topic Models explains the author's approach to this problem. To provide an overview I can tell you that they generate some label candidates using the information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is to identify the set of keywords in the cluster, get all their synonyms, and then find the hypernyms of each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: a word cluster containing the words dog and wolf should not be labelled with either word, but as canids. The papers achieve this using synonymy and hypernymy.
Cluster Labeling by Word Embeddings and WordNet's Hypernymy
Automated Text Clustering and Labeling using Hypernyms
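A small sketch of the synonym-then-hypernym idea via NLTK's WordNet interface (how to pick the keywords from a cluster and how to weight the shared hypernyms is left out):

from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def candidate_labels(cluster_words, top_n=3):
    # collect the hypernyms of every noun sense of every cluster word
    # and return the most frequently shared ones as label candidates
    counts = Counter()
    for word in cluster_words:
        for synset in wn.synsets(word, pos=wn.NOUN):
            for hypernym in synset.hypernyms():
                counts[hypernym.lemma_names()[0]] += 1
    return [label for label, _ in counts.most_common(top_n)]

print(candidate_labels(["dog", "wolf"]))   # a canine/canid sense should rank high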

How to do related questions autopopulate

I want to get related [things/questions] in my app, similar to what Stack Overflow does when you tab out of the Title field.
I can think of only one way to do it, which I think might be fast enough:
Do a search for the title in the corpus of titles of all [things], and return the first x matches. We can use whatever search is already being used for site search.
What other ways are there to do this that are fast enough? This will be triggered on tab-out, so heavy server-side processing is not feasible.
I am just looking for an approach, but I am using MySQL and Django, so if your answer uses those, all the better.
[I cannot think of good tags for it, so please feel free to edit them]
You're looking at a content-based recommendation algorithm. AFAICT StackOverflow's looks at the tags and the words in the title, and finds questions that share some of these. It can be implemented as a nearest neighbour search in a space where documents are represented as TF-IDF vectors.
Implementation-wise, go with any Django search engine that supports stemming, stopwords, non-strict matches, and tf-idf weights. Algorithmic complexity isn't high (just a few index lookups), so it doesn't matter if it's written in Python.
If you don't find a search engine doing what you want, leave the stemming and stopwords to the search engine, call the search engine on individual words, and do your own tf-idf scoring with a score that favors similar tags.
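If you want to prototype this outside a search engine first, a hedged sketch with scikit-learn's TF-IDF vectorizer and cosine similarity (the titles are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing_titles = [
    "How to parse JSON in Django views",
    "MySQL full text search with Django ORM",
    "Fastest way to deduplicate a Python list",
]

vectorizer = TfidfVectorizer(stop_words="english")
title_matrix = vectorizer.fit_transform(existing_titles)

def related(new_title, top_n=2):
    # return the top_n existing titles most similar to new_title
    query_vector = vectorizer.transform([new_title])
    similarities = cosine_similarity(query_vector, title_matrix).ravel()
    ranked = similarities.argsort()[::-1][:top_n]
    return [(existing_titles[i], round(float(similarities[i]), 3)) for i in ranked]

print(related("Django full text search on MySQL"))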
