I am working on a Django project where I need to implement full-text search. I have seen SOLR and found good comments about it, but as it's implemented in Java, it would need a Java environment installed on the system alongside Python. Looking for a Python equivalent of SOLR, I have seen Whoosh, but I am not sure whether Whoosh is as efficient and robust as SOLR. Should I go with the SOLR option anyway, or are there better options than Whoosh and SOLR for Python?
Please suggest.
Thanks in advance
Whoosh is actually very fast for a pure-Python implementation. That said, it's still at least an order of magnitude slower than the Java- and C++-based engines. Depending on the amount of data you need to index and search, and your requirements on maximum allowable latency and concurrent searches, it may not be an option.
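For a sense of what the pure-Python route looks like, here is a minimal Whoosh sketch (the index directory and field names are placeholders):

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # Build a tiny on-disk index; "indexdir" and the fields are placeholders.
    os.makedirs("indexdir", exist_ok=True)
    schema = Schema(path=ID(stored=True), body=TEXT)
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(path=u"/q/1", body=u"Full text search in a Django project")
    writer.commit()

    # Search the "body" field and print the stored path of each hit.
    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse(u"search")
        for hit in searcher.search(query):
            print(hit["path"])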
SOLR is a bit of a complicated beast, but it's by far the most comprehensive search solution. Mix it with solrpy for stunning results. Yes, you will need Java hosting.
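A rough solrpy sketch, assuming a SOLR instance at the default example URL and fields that exist in your SOLR schema (both are assumptions here):

    import solr

    # Connect to a running SOLR instance (default example URL, an assumption).
    conn = solr.SolrConnection("http://localhost:8983/solr")

    # Index a document and commit so it becomes searchable.
    conn.add(id=1, title="Full text search with SOLR and Django")
    conn.commit()

    # Query and iterate over the matching documents.
    response = conn.query("title:search")
    for hit in response.results:
        print(hit["id"], hit.get("title"))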
You might also want to check out the Python bindings for Xapian. Xapian is very, very fast, but less of a complete solution than SOLR. They are GPL licensed though, so that may or may not be viable for you.
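A small sketch with the Xapian Python bindings (the database path is a placeholder):

    import xapian

    # Create or open an on-disk database and index one document.
    db = xapian.WritableDatabase("xapian.db", xapian.DB_CREATE_OR_OPEN)
    doc = xapian.Document()
    doc.set_data("Full text search in a Django project")

    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))
    tg.set_document(doc)
    tg.index_text("Full text search in a Django project")
    db.add_document(doc)

    # Parse a query and fetch the top ten matches.
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_database(db)
    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query("search"))
    for match in enquire.get_mset(0, 10):
        print(match.document.get_data().decode("utf-8"))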
I have used Lucene and Lucene extensions like SOLR and Nutch, and I found that Lucene pretty much satisfies what I need. I've only tried Whoosh once but chose Lucene because:
1) I am using Java
2) I had trouble making UTF-8 work with Whoosh (not sure if it works out of the box now). In Lucene, I had no trouble working with Chinese characters.
If you're using Python as your programming language and Whoosh satisfies your needs, then I'd suggest using it over the Java alternatives: you get better integration, avoid external dependencies, and can customize faster if you need to code additional functionality.
UPDATE: If you're interested in using Lucene, it has a Python wrapper: See http://lucene.apache.org/pylucene/
I am trying to build a web application that requires intensive mathematical calculations. Can I use Django to populate Python charts and pandas DataFrames?
You can use any framework to do so. If you have worked with Python before, I can recommend Django, since you then have the same (clean Python) syntax throughout your project. This is good because you keep the same logic everywhere, but it should not be your major concern when it comes to choosing the right framework for your needs. For example, if you are a top Ruby on Rails developer, I would not suggest learning Django just because of pandas.
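To make that concrete, a Django view can do the pandas work directly and return JSON; a minimal sketch (the view name and data are made up):

    import pandas as pd
    from django.http import JsonResponse

    def price_summary(request):
        # Hypothetical data; in a real app you would build the DataFrame
        # from your models, e.g. pd.DataFrame(list(Item.objects.values())).
        df = pd.DataFrame({"price": [10.0, 12.5, 9.9], "qty": [3, 1, 4]})
        df["total"] = df["price"] * df["qty"]
        return JsonResponse(df.describe().to_dict())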
In general, a lot of packages/libraries are written in other languages, but you will still be able to use them from Django/Python. For example, the famous Elasticsearch search backend has its roots in Java but is still used in a lot of Django apps.
But it also goes the other way around: Celery is written in Python but can be used from Node.js or PHP. There are hundreds of examples, but I think you get the point.
I hope that I brought some light into the darkness. If you have questions please leave them in the comments.
I am working on a Django website. I want to search through a lot of text from Django models (the texts are something like Stack Overflow questions). I am doing the search with Haystack + Whoosh, and it is very nice to use, much better than Model.objects.filter(body_text__icontains="food").
So I would like to know whether I can get spelling suggestions using Whoosh or some other pure-Python package. I don't like Solr, since it needs Java, and after every update I need to rebuild the index using Java (Solr).
Whoosh's documentation for version 2.4.1 indicates it does indeed have a pure-Python spelling suggestion module.
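A minimal sketch of that API, assuming an existing index with a body text field (names are placeholders):

    from whoosh import index

    # Open an index built earlier; "indexdir" and "body" are placeholders.
    ix = index.open_dir("indexdir")
    with ix.searcher() as searcher:
        # Suggest corrections for a misspelled word, drawn from indexed terms.
        corrector = searcher.corrector("body")
        print(corrector.suggest("serach", limit=3))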
I'm looking for a search engine that I can point to a column in my database that supports advanced functions like spelling correction and "close to" results.
Right now I'm just using
    SELECT <column> FROM <table> WHERE <colname> LIKE '%<searchterm>%'
and I'm missing some results particularly when users misspell items.
I've written some code to fix misspellings by running it through a spellchecker but thought there may be a better out-of-the box option to use. Google turns up lots of options for indexing and searching the entire site where I really just need to index and search this one table column.
Apache Solr is a great search engine that provides:
1) N-gram indexing: you can search not just for complete strings but also for partial substrings, which helps greatly in getting similar results.
2) An out-of-the-box spell corrector based on an edit-distance metric, which will give you a "did you mean chicago" when the user types in chicaog.
3) A fuzzy search option out of the box: fuzzy searches help you get close matches for your query; for example, if a user types in GA-123 he could obtain VMDEO-123 as a result.
4) A "More Like This" component, which returns records similar to a given result, much like the options above.
Solr (based on the Lucene search library) is open source and is slowly becoming the de-facto standard in the search (vertical) industry, and it is excellent for database searches (you spoke about indexing a database column, which is a cakewalk for Solr). Lucene and Solr are used by many Fortune 500 companies as well as internet giants.
The Sphinx search engine is also great (I love it too, as it has a very low footprint and is C++ based), but to put it simply, Solr is much more popular.
Now, Python support and APIs are available for both. However, Sphinx runs as a standalone daemon, while Solr is exposed over HTTP. So for Solr you simply call the Solr URL from your Python program, which returns results that you can send to your front end for rendering; as simple as that.
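For illustration, a minimal sketch hitting Solr's standard /select handler ("mycore" is a placeholder core name):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Query a hypothetical "mycore" core, asking for JSON output.
    params = urlencode({"q": "chicago", "wt": "json", "rows": 10})
    url = "http://localhost:8983/solr/mycore/select?" + params
    with urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))

    # Solr wraps the hits in response -> docs.
    for doc in data["response"]["docs"]:
        print(doc)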
So far so good. Coming to your question:
First you should ask yourself whether you really require a search engine. Search engines are good for all the use cases mentioned above, but they are really made for searching across huge amounts of full-text data or millions of rows of tabular data. Algorithms like did-you-mean, similar records, and spell correctors can be written on top. Before zeroing in on Solr, please also search Google for (1) Peter Norvig's spell corrector and (2) n-gram indexing (a condensed sketch of the former follows below). It's possible that just by writing a few lines of code you'll get exactly what you were looking for.
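A condensed sketch of the Norvig approach, assuming a corpus.txt training file (in your case, the text of the column itself); the edits2 step from the original essay is omitted for brevity:

    import re
    from collections import Counter

    # Word frequencies from a training corpus; corpus.txt is a placeholder.
    WORDS = Counter(re.findall(r"\w+", open("corpus.txt").read().lower()))

    def edits1(word):
        # All strings one edit (delete/transpose/replace/insert) away.
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    def correct(word):
        # Prefer the word itself, then the most frequent known word one edit away.
        candidates = known([word]) or known(edits1(word)) or [word]
        return max(candidates, key=lambda w: WORDS[w])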
I leave it up to you to decide :)
I would suggest looking into open-source technologies like Sphinx Search.
Before going down the Solr/Sphinx route for full-text indexing, which adds complexity and its own overhead, you can try the built-in full-text engine in PostgreSQL if you are using that database. It's easy to set up and performs better than LIKE queries.
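A minimal sketch of what that looks like from Django via raw SQL (the table and column names are placeholders):

    from django.db import connection

    def search_articles(term):
        # Match rows whose body matches the query using PostgreSQL's
        # built-in text search; a GIN index on to_tsvector('english', body)
        # keeps this fast.
        with connection.cursor() as cursor:
            cursor.execute(
                """
                SELECT id, body
                FROM myapp_article
                WHERE to_tsvector('english', body)
                      @@ plainto_tsquery('english', %s)
                """,
                [term],
            )
            return cursor.fetchall()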
Check out https://github.com/hcarvalhoalves/django-tsearch2
I'm working on a project that is quite search-oriented. Basically, users will add content to the site, and this content should be immediately available in the search results. The project is still in development.
Up until now, I've been using Haystack with Xapian. One thing I'm worried about is the performance of the website once a lot of content is available. Indexing will have to occur very frequently if I want to emulate real-time search.
I was reading up on MongoDB recently. I haven't found a satisfying answer to my question, but I have the feeling that MongoDB might be of help for the real-time search indexing issue I expect to encounter. Is this correct? In other words, would the search functionality available in MongoDB be more suited for a real-time search function?
The content that will be available on the site is large unstructured text (including HTML) and related data (prices, tags, datetime info).
Thanks in advance,
Laundro
I don't know much about MongoDB, but I'm using Sphinx Search with great success: a simple, powerful and very fast tool for full-text indexing & search. It also provides a Python wrapper out of the box.
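A minimal sketch with sphinxapi.py, the client shipped with Sphinx ("myindex" stands for an index defined in your sphinx.conf):

    import sphinxapi

    # Connect to a running searchd on its default port.
    client = sphinxapi.SphinxClient()
    client.SetServer("localhost", 9312)

    # Query a hypothetical "myindex" index.
    result = client.Query("full text search", "myindex")
    if result:
        for match in result["matches"]:
            print(match["id"], match["weight"])
    else:
        print(client.GetLastError())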
It would be easier to pick up if Haystack provided bindings for it; unfortunately, Sphinx bindings are still on the wish list.
Nevertheless, setting Sphinx up is quick (I did it in a few hours, for an existing in-production Django-based CRM), so maybe you can give it a try before switching to a more generic solution.
MongoDB is not really a dedicated full-text search engine. Based on their full-text search docs, you can only create an array of tags that duplicates the string data or other columns, and with many elements (hundreds or thousands) that can make inserts very expensive.
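A hedged sketch of that keyword-array pattern with pymongo (the database, collection, and field names are made up):

    from pymongo import MongoClient

    coll = MongoClient().mydb.items

    # Duplicate the searchable words into an indexed "keywords" array;
    # a multikey index makes lookups fast, but every insert pays for it.
    coll.create_index("keywords")
    coll.insert_one({
        "body": "Fresh food prices in the city",
        "keywords": ["fresh", "food", "prices", "city"],
    })
    print(list(coll.find({"keywords": "food"})))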
I agree with Tomasz: Sphinx Search can be used for what you need. Use real-time indexes if you want it to be truly real-time, or delta indexes if several seconds of delay are acceptable.
For my GAE app I need to do some natural language processing to extract the subject and object from an input sentence.
Apparently NLTK can't be installed (easily) on GAE so I am looking for another solution.
I noticed GAE comes with Antlr3 but from browsing their documentation it solves a different kind of grammar problem.
Any ideas?
You can easily build an NLTK RPC server on some machine and access it.
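A minimal sketch of such a server using the standard library's XML-RPC support (Python 3 shown; the host, port, and exposed function are illustrative, and the NLTK tokenizer/tagger data must be downloaded on that machine):

    from xmlrpc.server import SimpleXMLRPCServer
    import nltk  # requires the "punkt" and tagger data to be downloaded

    def pos_tag(sentence):
        # Tokenize and part-of-speech tag the sentence; subject/object
        # extraction can be layered on top of these tags.
        return nltk.pos_tag(nltk.word_tokenize(sentence))

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(pos_tag, "pos_tag")
    server.serve_forever()

The GAE app would then call it with something like xmlrpc.client.ServerProxy("http://nlp-host:8000/").pos_tag("The cat chased the mouse"), where "nlp-host" is whatever machine you run it on.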
Another option is to find another web-based service that already does that (such as OpenCalais).
With regards to the NLTK problem specifically, my solution would probably be to fix the weird imports that NLTK is doing, and use that as originally planned. When you're done, submit a patch of course.
That said, if this ultimately involves touching the data store, the answer is that it probably can't be done in a performant way, unless your data set is small or for some reason your NLP stuff doesn't need to hit some kind of full-text index. The GAE guys are working on it, but they have indicated that no one should be expecting a quick resolution to this particular issue.