Building a full text search engine: where to start [closed]

Building a full text search engine: where to start [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I want to write a web application using Google App Engine (so the reference language would be Python). My application needs a simple search engine, so the users would be able to find data specifying keywords.
For example, if I have one table with those rows:
1 Office space 2 2001: A space
odyssey 3 Brazil
and the user queries for "space", rows 1 and 2 would be returned. If the user queries for "office space", the result should be rows 1 and 2 too (row 1 first).
What are the technical guidelines/algorithms to do this in a simple way?
Can you give me good pointers to the theory behind this?
Thanks.
Edit: I'm not looking for anything complex here (say, indexing tons of data).

Read Tim Bray's series of posts on the subject.
Background
Usage of search engines
Basics
Precision and recall
Search engne intelligence
Tricky search terms
Stopwords
Metadata
Internationalization
Ranking results
XML
Robots
Requirements list

I found these two books very useful when I used to build full-text search engines.
Information Retrieval
Managing Gigabytes

I would not build it yourself, if possible.
App Engine includes the basics of a Full Text searching engine, and there is a great blog post here that describes how to use it.
There is also a feature request in the bug tracker that seems to be getting some attention lately, so you may want to hold out, if you can, until that is implemented.

As always start in wikipedia. First start is usually building an inverted index.

Here's an original idea:
Don't build an index. Seriously.
I was faced with a similar progblem some time ago. I needed a fast method to search megs and megs of text that came from documentation. I needed to match not just words, but word proximity in large documents (is this word near that word). I just ended up writing it in C, and the speed of it surprised me. It was fast enough that it didn't need any optimizing or indexing.
With the speed of today's computers, if you write code that runs straight on the metal (compiled code), you often don't need an order log(n) type algorithm to get the performance you need.

Lucene or Autonomy! These are not out of the box solutions for you. You will have to write wrappers on top of their interfaces.
They certainly do take care of the stemming, grammar , relational operators etc

First build your index.
Go through the input, split into words
For each word check if it is already in the index, if it is add the current record number to the index list, if not add the word and record number.
To look up a word go to the (possibly sorted) index and return all the record numbers for that word.
It's very esy to do this for a reasoable size list using Python's builtin storage types.
As an extra refinement you only want to store the base part of a word, eg 'find' for 'finding' - look up stemming algorithms.

The book Introduction to Information Retrieval provides a good introduction to the field.
A dead-tree version is published by Cambridge University Press, but you can also find a free online edition (in HTML and PDF) following the link above.

See also a question I asked: How-to: Ranking Search Results.
Surely there are more approaches, but this is the one I'm using for now.

Honestly, smarter people than I have figured this stuff out. I'd load up the solr app and make json calls from my appengine app and let solr take care of indexing.

I just found this article this weekend: http://www.perl.com/pub/a/2003/02/19/engine.html
Looks not too complicated to do a simple one (though it would need heavy optimizing to be an enterprise type solution for sure). I plan on trying a proof of concept with some data from Project Gutenberg.
If you're just looking for something you can explore and learn from I think this is a good start.

Look into the book "Managing Gigabytes" it covers storage and retrieval of huge amounts of plain text data -- eg. both compression and actual searching, and a variety of the algorithms that can be used for each.
Also for plain text retrieval you're best off using a vector based search system rather than a keyword->document indexing system as vector based systems can be much faster, and, more importantly can provide relevancy ranking relatively trivially.

Try this:
Say the variable table is your list of search entries.
query = input("Query: ").strip().lower()#Or raw_input, for python 2
end = []
for item in table:
if query in item.strip().lower():
end.append(item)
print end #Narrowed results
It just iterates through all of the items to see if the query is in any one of them. It works for a simple in-app search function. Maybe not for the whole internet, though.

Related

AIML for Intelligent Answering Engine

I have heard about a programming language called AIML which can be used for programming Intelligent Robots.
I am a web developer and have a web crawler build using Python 2.7 and have indexed Wikipedia ...
So I wanted to build a answering engine using python which would use a string variable
(It is a HUGE variable containing the whole of Wikipedia) as a source of information and use AI to answer...
Finally, I wanted to put this up on my school website...
So can I do that in AIML?
Later on I also want to modify it so as to give my live scores answers to questions like:
"What is the age of ~someperson~?" etc.
For that I'll send my web crawler to index some score pages etc..
Can I program this sort of answering agent in AIML?
If yes please provide links to tutorials which tell me how to do that? (using string variables as a source of information to parse queries and answer like a human)
moreover, AIML uses syntax like:
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
<think><set name="topic">Me</set></think>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
Where pattern is the query and template is answer, so does that mean I have to sit and write these tags for all possible queries?
Or can I make it use its brains to figure out what the person wants and give them answers
using the string variable as its source of information.
Thank you.

AIML
It looks like AIML is a form of pattern matching. Moreover, it looks like this is mainly meant for dialog based agents. Therefore, to use AIML, you would likely need to manually generate every question and the correct response (answer).
Question answering
What it seems like you are really after is what we call a question answering system. Very briefly, a QA system generally has these components:
Question analysis.
Extract keywords.
(Sometimes) determine expected answer type (location, person, color, number, etc.).
Candidate document selection---doing a search on your knowledge base using an information retrieval system.
Candidate document analysis.
Answer extraction---select some part of the document (sentence(s), paragraph(s)).
Response generation.
Scores and ranks each answer.
Displays the most confident answer(s).
Research
If you're really want to dig deeply into this area, I'd suggest using Google Scholar and search for some of the terms I've mentioned, which will give you some research papers that go into detail about many of these topics. Some papers to get you started:
Natural language question answering: The view from here
Answering complex, list and context questions with LCC's Question-Answering Server
The structure and performance of an open-domain question answering system
Learning surface text patterns for a question answering system
Learning question classifiers
What is not in the Bag of Words for Why-QA?
Shameless plug
I've recently taken a course on natural language processing, and developed a rudimentary QA system that uses Wikipedia as a knowledge base. (Actually, I used the Simple English Wikipedia because it was much easier to work with; though the system does work with the full version just much more slowly.)
If you are interested in looking at some Python code as a reference, you may do so on the project's GitHub page: bwbaugh/causeofwhy. In addition, there is some more detailed documentation on what goes on in each step of the system components.
There is also a very basic working demo of the QA system in action that is (currently) available, however bear in mind the system is a proof-of-concept and can take upwards of 30 seconds to respond to a question (depending on the question).

Search Engine for a single DB column

I'm looking for a search engine that I can point to a column in my database that supports advanced functions like spelling correction and "close to" results.
Right now I'm just using
SELECT <column> from <table> where <colname> LIKE %<searchterm>%
and I'm missing some results particularly when users misspell items.
I've written some code to fix misspellings by running it through a spellchecker but thought there may be a better out-of-the box option to use. Google turns up lots of options for indexing and searching the entire site where I really just need to index and search this one table column.

Apache Solr is a great Search Engine that provides (1) N-Gram Indexing (search for not just complete strings but also for partial substrings, this helps greatly in getting similar results) (2) Provides an out of box Spell Corrector based on distance metric/edit distance (which will help you in getting a "did you mean chicago" when the user types in chicaog) (3) It provides you with a Fuzzy Search option out of box (Fuzzy Searches helps you in getting close matches for your query, for an example if a user types in GA-123 he would obtain VMDEO-123 as a result) (4) Solr also provides you with "More Like This" component which would help you out like the above options.
Solr (based on Lucene Search Library) is open source and is slowly rising to become the de-facto in the Search (Vertical) Industry and is excellent for database searches (As you spoke about indexing a database column, which is a cakewalk for Solr). Lucene and Solr are used by many Fortune 500 companies as well as Internet Giants.
Sphinx Search Engine is also great (I love it too as it has very low foot print for everything & is C++ based) but to put it simply Solr is much more popular.
Now Python support and API's are available for both. However Sphinx is an exe and Solr is an HTTP. So for Solr you simply have to call the Solr URL from your python program which would return results that you can send to your front end for rendering, as simple as that)
So far so good. Coming to your question:
First you should ask yourself that whether do you really require a Search Engine? Search Engines are good for all use cases mentioned above but are really made for searching across huge amounts of full text data or million's of rows of tabular data. The Algorithms like Did you Mean, Similar Records, Spell Correctors etc. can be written on top. Before zero-ing on Solr please also search Google for (1) Peter Norvig Spell Corrector & (2) N-Gram Indexing. Possibility is that just by writing few lines of code you may get really the stuff that you were looking out for.
I leave it up to you to decide :)

I would suggest looking into open source technologies like Sphynx Search.

Before going down the Solr/Sphinx route for full text indexing - which adds complexity and their own overhead - you can try the built-in full text engine in PostgreSQL if you are using that database. It's easy to setup and performs better than LIKE queries.
Check out https://github.com/hcarvalhoalves/django-tsearch2

Design help for static content with fixed keywords search framework

I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see the a collection of source code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She is wont be bothered about the code format, but somehow needs to see an "identifier- documentation" mapping to get the idea of traceability.
I take the tokens from source code files - somehow split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms on the documentation. Search frameworks are best for doing this specific task - drilling down documents to locate stuff using powerful information retrieval algorithms. Whoosh looked really great python search... with a number of analyzer and filters.
Though the problem is similar to search - it differs in that the user is not physically doing any search. So am I solving the problem the right way? Given that everything is static and needs to computed only once - am I using a wrong tool(a search framework) for the job?

I'm not sure, if I understand your use case. The user sees the source code and has some ways of jumping from a token to the appropriate part or a listing of the possible parts of the documentation, right?
Then a search tool seems to be the right tool for the job, although you could precompile every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step to the solution. Then you can look up the pages/chapters/sections for every token of the source, where all or most of its components are mentioned.

Using MongoDB on Django for real-time search?

I'm working on a project that is quite search-oriented. Basically, users will add content to the site, and this content should be immediately available in the search results. The project is still in development.
Up until now, I've been using Haystack with Xapian. One thing I'm worried about is the performance of the website once a lot of content is available. Indexing will have to occur very frequently if I want to emulate real-time search.
I was reading up on MongoDB recently. I haven't found a satisfying answer to my question, but I have the feeling that MongoDB might be of help for the real-time search indexing issue I expect to encounter. Is this correct? In other words, would the search functionality available in MongoDB be more suited for a real-time search function?
The content that will be available on the site is large unstructured text (including HTML) and related data (prices, tags, datetime info).
Thanks in advance,
Laundro

I don't know much about MongoDB, but I'm using with great success Sphinx Search - simple, powerful and very fast tool for full text indexing&search. It also provides Python wrapper out-of-the-box.
It would be easier to pick it up if Haystack provided bindings for it, unfortunately Sphinx bindings are still on a wish list.
Nevertheless, setting Spinx up is so quick (I did it in a few hours, for existing in-production Django-based CRM), that maybe you can give it a try before switching to a more generic solution.

MongoDB is not really a "dedicated full text search engine". Based on their full text search docs you can only create a array of tags that duplicates the string data or other columns, which with many elements (hundreds or thousands) can make inserts very expensive.
Agree with Tomasz, Sphinx Search can be used for what you need. Real time indexes if you want it to be really real time or Delta indexes if several seconds of delay are acceptable.

Automatically pick tags from context using Python

How can I pick tags from an article or a user's post using Python?
Is the following method ok?
Build a list of word frequency from the text and sort them.
Remove some common words and pick the top 10 words remained in the list as the tags.
If the above method is ok, what library can detect if which words are common, like "the, if, you, etc" and which are descriptive words?

Here's an article on removing stop words. The link to the stop word list in the article is broken but here's another one.

The Natural Language Toolkit offers a broad variety of methods for this kind of stuff. I can't give you hands-on advice as I'm not familiar with this subject, but I think it's worth the effort to read a few articles about this topic first before you start: just picking words from the text directly won't get you very far I think, you should probably try to find similar words to the ones for that tags already exist. And of course you need to filter out the common words of the language like "the" and stuff. Again, this Python library can help you with this, at least for a few common languages.

I'd suggest you download the Stack Overflow data dump. There you get a lot of real world posts, with appropriate tags, to test different algorithms of tag selection.
But generally I doubt it will work too well. For your own question "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also the tags you chose manually contain "python" and "context", none of which shows up with high word frequency.

Train Bayes or Fischer filter with already tagged data (e.g. with Stackoverflow data dump suggested by sth) and use it to classify new posts. I'd recommend reading excellent Programming Collective Intelligence book by Toby Segaran for more information and python examples on this topic.

Instead of blacklisting words that shouldn't be tags, why don't you instead build a whitelist of words that would make for good tags?
Start with an handful of tags that you would like to have, like Python, off-topic, football, rickroll or whatnot (depends on the kind of site you are building!) and have the system only suggest between those, then let users handpick appropriate tags and also let them type in their own tags.
When enough users suggest a tag, it gets into the pool of "known good" tags for auto suggestion -- maybe after some sort of moderation, so that you can still blacklist stupid tags like the, lolol, or typoed tags like objectoriented when you have object-oriented.
Only show few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the file linux command is not too shabby on this) will help your suggestion system.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.