Come up with a search algorithm to search within documents - python

I was working on a project where I need to search within all the documents (PDF, DOC, etc.) present in the database that are relevant to any query.
Earlier I used a simple relation where I stored relevant keywords associated with each document, and if the query contained those keywords I fetched those documents. But this method is not very reliable, as the stored keywords might be misleading. I need to search within the documents themselves, and I am looking for a practical search algorithm that scales well and has low time complexity.
Any suggestions and resources are most welcome.
Thank you.

Try the Rabin-Karp search algorithm (based on hashing). Since you have to search for more than one pattern across many documents, it can compute the hashes of all patterns up front and then look for all of them in a single pass over each document.
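A minimal sketch of the idea in Python, assuming for simplicity that all patterns share one length (real implementations group patterns by length and make one pass per group):

    def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
        """Return (position, pattern) pairs for every match of any pattern."""
        m = len(next(iter(patterns)))
        n = len(text)
        if n < m:
            return []

        def h(s):
            value = 0
            for ch in s:
                value = (value * base + ord(ch)) % mod
            return value

        # Hash every pattern once; keep the strings to verify on collision.
        targets = {}
        for p in patterns:
            targets.setdefault(h(p), []).append(p)

        high = pow(base, m - 1, mod)   # weight of the window's leading char
        window = h(text[:m])           # hash of the first text window
        matches = []
        for i in range(n - m + 1):
            for p in targets.get(window, ()):
                if text[i:i + m] == p:          # hashes can collide: verify
                    matches.append((i, p))
            if i + m < n:                       # roll the window forward
                window = ((window - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod
        return matches

    print(rabin_karp_multi("the cat sat on the mat", {"cat", "sat", "mat"}))
    # -> [(4, 'cat'), (8, 'sat'), (19, 'mat')]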

Related

Text classification using a sieve

I'm currently somewhat lost with my problem; hopefully I can describe it accurately enough.
What I'm trying to do:
I have a list of tiny text fragments (think Twitter) and want to categorize them into different topics.
To do so, I check for the occurrence of certain keywords in the text, keywords that may contain wildcards. The current approach, using Google's RE2 regular-expression engine, is really slow, because we have to check every keyword until one hits: a linear search.
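For illustration, the slow linear scan looks roughly like this (a sketch using Python's stdlib re as a stand-in for RE2; the topics and wildcard patterns are made up):

    import re

    # Hypothetical wildcard keywords per topic, compiled to regexes.
    CATEGORIES = {
        "sports": ["foot*ball", "champions league"],
        "tech": ["comput*", "python"],
    }

    COMPILED = [
        (topic, re.compile(pattern.replace("*", r"\w*"), re.IGNORECASE))
        for topic, patterns in CATEGORIES.items()
        for pattern in patterns
    ]

    def categorize(fragment):
        # Linear search: every pattern is tried until one matches.
        for topic, regex in COMPILED:
            if regex.search(fragment):
                return topic
        return None

    print(categorize("Python 3 is great"))  # -> 'tech'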
What I'm looking for: something like a PostgreSQL pg_trgm GIN index against which I can run reverse searches, so that instead of going through all the keywords I have some faster way of searching.
Actually using PostgreSQL in this case is not an option, because the whole thing is to be offered as a kind of SaaS.

How does Lucene's positional index work so efficiently?

Usually, search engine software creates inverted indexes to make searches faster. The basic format is:
word: <docnum, positions>, <docnum, positions>, <docnum, positions> ...
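For concreteness, a toy positional index over two documents might look like this (a sketch, not Lucene's actual on-disk format):

    # Toy positional inverted index: term -> {doc_id: [positions]}.
    docs = {
        0: "harry potter movies are fun",
        1: "the movies of harry potter",
    }

    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)

    # index["harry"]  -> {0: [0], 1: [3]}
    # index["potter"] -> {0: [1], 1: [4]}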
Whenever there is a quoted search query like "Harry Potter Movies", it means the positions of the words must match exactly. In within-k-words queries like hello /4 world, it generally means: find the word world within a distance of 4 words, to the left or right, of the word hello. My question is this: we could employ a solution that linearly checks the postings and calculates the word distances required by the query, but if the collection is really large we can't really scan all the postings. So is there some other data structure or optimisation that Lucene or Solr uses?
A first solution could be to check only some k postings for each word. Another is to check only the top documents (usually called a champion list, sorted by TF-IDF or similar during indexing), but better documents may then be ignored. Both solutions share a disadvantage: they don't guarantee quality. Yet a Solr server gives assured result quality even on large collections. How?
The phrase query you are asking about here is actually really efficient to compute the positions of, because you're asking for the documents where 'Harry' AND 'Potter' AND 'Movies' occur.
Lucene is pretty smart, but the core of its algorithm for this is that it only needs to visit the positions lists of documents where all three of these terms even occur.
Lucene's postings are also sharded into multiple files:
Inside the counts-files are: (Document, TF, PositionsAddr)+
Inside the positions-files are: (PositionsArray)
So it can sweep across the (doc, tf, pos_addr) for each of these three terms, and only consult the PositionsArray when all three words occur in the specific document. Phrase queries have the opportunity to be really quick, because you only visit at most all the documents from the least-frequent term.
If you want to see a phrase query run slowly (and do lots of disk seeks!), try: "to be or not to be" ... here the AND part doesn't help much because all the terms are very common.
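A sketch of that strategy in Python, reusing the toy index from the question above: intersect the document lists, cheapest first, and only walk the position lists for documents that contain every term:

    def phrase_query(index, phrase):
        """Return doc ids where the words of phrase occur consecutively."""
        terms = phrase.lower().split()
        postings = [index.get(t, {}) for t in terms]
        if not all(postings):
            return []

        # The AND step: intersect document sets, least-frequent term first,
        # so most documents are skipped without ever touching positions.
        doc_sets = sorted((set(p) for p in postings), key=len)
        candidates = set.intersection(*doc_sets)

        # Only now consult the position lists, and only for survivors.
        results = []
        for doc_id in sorted(candidates):
            if any(all(pos + offset in postings[offset][doc_id]
                       for offset in range(1, len(terms)))
                   for pos in postings[0][doc_id]):
                results.append(doc_id)
        return results

    print(phrase_query(index, "harry potter"))  # -> [0, 1]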

Extract relevant sentences based on input words NLP

I'm trying to build an app that finds relevant sentences in a document based on keywords or statements that a user enters. Performing this manually, needle-in-the-haystack style, seems highly inefficient.
Is there an ideal approach or library that can tackle this problem?
The field that deals with problems like the one you described is called information retrieval.
The simplest method for such queries is based on the bag-of-words model: you treat documents as vectors, such that their cosine similarity reflects how many words they share.
In Python you can do this, for example, with utilities from scikit-learn (this would be low-level), or with more production-ready tools like Whoosh; see Python for Humanities for a sample tutorial.
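A minimal bag-of-words ranking with scikit-learn (the sentences and query here are purely illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Treat each sentence as a tiny "document" and rank all of them
    # by cosine similarity to the user's query.
    sentences = [
        "The cat sat on the mat.",
        "Stock prices rose sharply on Monday.",
        "A kitten is a young cat.",
    ]
    query = "cat"

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])

    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for score, sentence in sorted(zip(scores, sentences), reverse=True):
        print(f"{score:.2f}  {sentence}")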
If you want to dig deeper, I encourage you to read an information retrieval textbook, at least the first couple of chapters.

Ways to search and tag a MongoDB database of academic papers

Apologies for the vague nature of this question, but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.
I'd like to write a search feature (using Python, R, whatever) such that when you enter text, it returns the most relevant articles. Clearly, "relevant" is really hard; that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how, though) and then return them sorted by the largest number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define "well", but in simple terms I'd say no: there is just no single, accurate definition of "fairly well". A lot of challenges intrinsic to the particular problem arise when one tries to implement a good search algorithm. Those challenges lie in:
diversity of user needs: users in different fields have different intentions and, as a result, different expectations of a search result page;
diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and they make good starting points: TF-IDF and BM25 are the two most popular.
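For reference, a bare-bones BM25 scorer (the standard formula with the usual k1 and b parameters; no stemming, stopwords, or other preprocessing):

    import math

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        """Score every document in docs against query using BM25."""
        tokenized = [doc.lower().split() for doc in docs]
        n = len(tokenized)
        avgdl = sum(len(doc) for doc in tokenized) / n
        terms = query.lower().split()

        # Inverse document frequency for each query term.
        idf = {}
        for term in terms:
            df = sum(term in doc for doc in tokenized)
            idf[term] = math.log((n - df + 0.5) / (df + 0.5) + 1)

        scores = []
        for doc in tokenized:
            score = 0.0
            for term in terms:
                tf = doc.count(term)
                score += idf[term] * tf * (k1 + 1) / (
                    tf + k1 * (1 - b + b * len(doc) / avgdl))
            scores.append(score)
        return scores

    docs = ["the quick brown fox", "the lazy dog", "quick quick fox"]
    print(bm25_scores("quick fox", docs))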
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB, or any RDBMS with full-text indexing support, is good enough for a proof of concept (a minimal example follows at the end of this answer), but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you get the ability to manage:
how exactly words are stemmed (this is important for solving understemming/overstemming problems);
what counts as a word. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query?);
how exactly the search is performed: which words can be ignored during the search, which ones are required to be found, and which ones are required to be found near each other (think of search phrases like "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/ElasticSearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work fine.
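And for the proof-of-concept route mentioned above, MongoDB's built-in text index gets you surprisingly far; a minimal sketch with pymongo (the database, collection, and field names are hypothetical):

    from pymongo import MongoClient, TEXT

    client = MongoClient()            # assumes a local mongod instance
    papers = client.research.papers   # hypothetical db/collection names

    # One-time setup: a full-text index over the paper body.
    papers.create_index([("body", TEXT)])

    def search(query, limit=10):
        """Return the most relevant papers for query, best first."""
        return list(
            papers.find(
                {"$text": {"$search": query}},
                {"title": 1, "score": {"$meta": "textScore"}},
            )
            .sort([("score", {"$meta": "textScore"})])
            .limit(limit)
        )

    for doc in search("convolutional neural networks"):
        print(round(doc["score"], 2), doc.get("title"))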

How to autopopulate related questions

I want to get related [things/questions] in my app, similar to what StackOverflow does when you tab out of the Title field.
I can think of only one way to do it, which I think might be fast enough:
Do a search for the title in the corpus of titles of all [things], and return the first x matches. We can use whatever search is being used for site search.
What are other ways to do this that are fast enough? This is going to fire on tab-out, so heavy server-side processing is not feasible.
I am just looking for the way to do this, but I am using MySQL and Django, so if your answer uses those, all the better.
[I cannot think of good tags for it, so please feel free to edit them]
You're looking at a content-based recommendation algorithm. AFAICT StackOverflow's looks at the tags and the words in the title, and finds questions that share some of these. It can be implemented as a nearest neighbour search in a space where documents are represented as TF-IDF vectors.
Implementation-wise, go with any Django search engine that supports stemming, stopwords, non-strict matches, and tf-idf weights. Algorithmic complexity isn't high (just a few index lookups), so it doesn't matter if it's written in Python.
If you don't find a search engine doing what you want, leave the stemming and stopwords to the search engine, call the search engine on individual words, and do your own tf-idf scoring with a score that favors similar tags.
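A toy version of that TF-IDF nearest-neighbour lookup with scikit-learn (the titles are made up, and a real setup would vectorize all existing titles up front and cache them):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "How do I sort a dict by value in Python?",
        "Sorting a list of tuples by the second element",
        "Connecting Django to MySQL",
        "What is a metaclass in Python?",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    title_vectors = vectorizer.fit_transform(titles)

    def related(new_title, top_n=3):
        """Return existing titles most similar to the one being typed."""
        scores = cosine_similarity(
            vectorizer.transform([new_title]), title_vectors).ravel()
        ranked = scores.argsort()[::-1][:top_n]
        return [(round(scores[i], 2), titles[i])
                for i in ranked if scores[i] > 0]

    print(related("sort a dictionary in python"))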
