Explain search (Sphinx/Haystack) in simple context? - python

Could you explain how search engines like Sphinx, Haystack, etc fit in to a web framework. If you could explain in a way that someone new to web development could understand that would help.
One example use case I made up for this question is a book search feature. Let's say I have a NoSQL database that contains book objects, each containing author, title, ISBN, etc.; how does something like Sphinx/Haystack/another search engine fit in with my database to search for a book with a given ISBN?

Firstly, Haystack isn't a search engine, it's a library that provides a Django API to existing search engines like Solr and Whoosh.
That said, your example isn't really a very good one. You wouldn't use a separate search engine to search by ISBN, because your database would already have an index on the Book table which would efficiently do that search. A search engine could come in at two places. Firstly, you could index some or all of the book's contents to search on: databases are not very good at full-text search, but this is an area where search engines shine. Secondly, you could provide a search against multiple fields - say, author, title, publisher and description - in one go.
Also, search engines provide useful functionality like suggestions, faceting and so on that you won't get from a database.
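To make the multi-field case concrete, here is a minimal sketch in plain Python. It is not how Solr or Whoosh are implemented; the BOOKS data and the tokenize, build_index and search helpers are hypothetical names made up for illustration. The ISBN lookup is a direct key lookup, while the free-text query goes through an inverted index that maps each word to the books containing it in any of the chosen fields.

```python
import re
from collections import defaultdict

# Hypothetical sample data, keyed by ISBN as a database primary key would be.
BOOKS = {
    "9780000000001": {"title": "Learning Python", "author": "Ann Example",
                      "description": "An introduction to programming."},
    "9780000000002": {"title": "Cooking Basics", "author": "Bob Sample",
                      "description": "Python-free recipes for beginners."},
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(books, fields=("title", "author", "description")):
    """Map each word to the set of ISBNs whose chosen fields contain it."""
    index = defaultdict(set)
    for isbn, book in books.items():
        for field in fields:
            for word in tokenize(book[field]):
                index[word].add(isbn)
    return index

def search(index, query):
    """Return the ISBNs of books containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(BOOKS)

# Exact ISBN lookup needs no search engine: it is a plain key lookup.
book = BOOKS["9780000000001"]

# A free-text query is matched against title, author and description at once.
python_hits = search(INDEX, "python")
intro_hits = search(INDEX, "python introduction")
```

A real search engine adds ranking, stemming, suggestions and faceting on top of this basic idea.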

Related

Advanced Search algorithm on Google Appengine datastore and indexes

I am using appengine with python (version 2.7) for a web application which deals with job listings and job search.
The backend consists of a "Job" table with 20+ fields such as title, date, experience etc. I have the necessary composite indexes defined for each permutation and combination of the filters. As you would have guessed, the number of indexes is high.
The front-end provides option for users to search for jobs and filter them using the columns.
This works as expected but with the following drawbacks:
Slow Search Performance
The search is divided into two parts: inbuilt datastore filtering and then a custom filtering on top of the refined results. The custom filtering is required to further apply the complex filters which are not supported by appengine.
Exploding composite indexes
Some columns (5, for instance) accept only a fixed set of values, so filtering on them is pretty straightforward. Other fields can hold user-defined values, so filtering on them requires custom Python code.
Jinja is the templating engine which then renders the data into the html.
Advanced Search + Index References: https://cloud.google.com/appengine/articles/indexselection
Is there a better approach/algorithm for implementing the search and advanced search in the appengine?
You might want to consider using the Full Text Search API available in App Engine. In essence, when entities are created in Cloud Datastore, you would create a Document with the entity ID/Key and all searchable fields and send it to the Search API for indexing. Any updates to the Datastore entities would also need to update the corresponding Search document. Also, when entities are deleted, delete the corresponding Search document.
Modify your Application's search code to perform the Search on Indexed documents instead of Datastore queries. Retrieve a page (e.g. 50) of Document IDs. Fetch the data for the 50 entities using a Datastore Get and display the results.
Per the documentation -
The Search API lets your application perform Google-like full-text
searches over structured data, and supports Geolocation-based queries.
It can be useful in any application domain that benefits from
full-text search, such as:
This would definitely give a better Search experience for your application users when compared with Datastore queries.
Once you implement this, you might be able to just get rid of the composite indexes from Datastore.
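The workflow above can be sketched roughly as follows. This is only an in-memory illustration and does not call the real Datastore or Search API: the DATASTORE and SEARCH_INDEX dicts, the SEARCHABLE_FIELDS names and the three helper functions are all assumptions made up for the example. It shows the two points that matter: every write or delete touches both stores, and a search returns IDs first and fetches one page of entities by ID afterwards.

```python
# In-memory stand-ins (assumptions): a real app would use Cloud Datastore
# and the App Engine Search API instead of these two dicts.
DATASTORE = {}      # job_id -> full entity (all 20+ fields)
SEARCH_INDEX = {}   # job_id -> set of searchable words

SEARCHABLE_FIELDS = ("title", "skills")  # hypothetical field names

def save_job(job_id, entity):
    """Create or update an entity and re-index its search document."""
    DATASTORE[job_id] = entity
    words = set()
    for field in SEARCHABLE_FIELDS:
        words.update(str(entity.get(field, "")).lower().split())
    SEARCH_INDEX[job_id] = words

def delete_job(job_id):
    """Delete the entity together with its search document."""
    DATASTORE.pop(job_id, None)
    SEARCH_INDEX.pop(job_id, None)

def search_jobs(query, page_size=50):
    """Search the index for IDs, then fetch one page of entities by ID."""
    terms = query.lower().split()
    ids = sorted(job_id for job_id, words in SEARCH_INDEX.items()
                 if all(term in words for term in terms))
    page = ids[:page_size]
    return [DATASTORE[job_id] for job_id in page]

save_job(1, {"title": "Python Developer", "skills": "django sql", "experience": 3})
save_job(2, {"title": "Java Developer", "skills": "spring sql", "experience": 5})
# Updating an entity replaces its search document as well.
save_job(2, {"title": "Senior Java Developer", "skills": "spring", "experience": 6})
delete_job(1)
results = search_jobs("java developer")
```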

How to use full-text search in sqlite3 database in django?

I am working on a django application with sqlite3 database, that has a fixed database content. By fixed I mean the content of the db won't change over time. The model is something like this:
class QScript(models.Model):
    ch_no = models.IntegerField()
    v_no = models.IntegerField()
    v = models.TextField()
There are around 6500 records in the table. Given a text that may have some words missing, or some words misspelled, I need to determine its ch_no and v_no. For example, if there is a v field in db with text "This is an example verse", a given text like "This an egsample verse" should give me the ch_no and v_no from db. This can be done using Full text search I believe.
My queries are:
can full-text search do this? My guess from what I have studied, it can, as said in sqlite3 page: full-text searches is "what Google, Yahoo, and Bing do with documents placed on the World Wide Web". Cited in SO, I read this article too, along with many others, but didn't find anything that closely matches my requirements.
How to use FTS in django models? I read this but it didn't help. It seems too outdated. Read here that: "...requires direct manipulation of the database to add the full-text index". Searching gives mostly MySQL related info, but I need to do it in sqlite3. So how to do that direct manipulation in sqlite3?
Edit:
Is my choice of sticking to sqlite3 correct? Or should I use something different (like haystack+elasticsearch as said by Alex Morozov)? My db will not grow any larger, and I have studied that for small sized db, sqlite is almost always better (my situation matches the fourth in sqlite's when to use checklist).
SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.
In short, the SQLite FTS extension is fairly basic, and isn't meant to compete with, say, Google.
As for Django integration, I don't believe there is any. You will likely need to use Django's interface for raw SQL queries, for both creating and querying the FTS table.
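As a rough sketch of what that raw SQL could look like, here is an example using Python's built-in sqlite3 module. It assumes your SQLite build includes the FTS5 extension; the table name verse_fts and the find_verses helper are made up for illustration. In Django, the same statements could be sent through a cursor obtained from django.db.connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 virtual table; ch_no and v_no are stored but not indexed for search.
conn.execute(
    "CREATE VIRTUAL TABLE verse_fts USING fts5("
    "v, ch_no UNINDEXED, v_no UNINDEXED, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO verse_fts (v, ch_no, v_no) VALUES (?, ?, ?)",
    [
        ("The quick brown fox jumps over the lazy dog", 1, 1),
        ("This is an example verse", 1, 2),
    ],
)

def find_verses(term):
    """Return (ch_no, v_no) pairs for rows whose text matches the term."""
    rows = conn.execute(
        "SELECT ch_no, v_no FROM verse_fts WHERE verse_fts MATCH ? ORDER BY rowid",
        (term,),
    )
    return rows.fetchall()

stemmed = find_verses("jumping")  # the porter stemmer reduces "jumping" and "jumps" to one stem
typo = find_verses("jmups")       # a misspelled word has no matching token
```

Note that the MATCH argument uses the FTS query syntax, so raw user input containing quotes or operators would need escaping first.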
I think that while sqlite is an amazing piece of software, its full-text search capabilities are quite limited. Instead you could index your database using Haystack Django app with some backend like Elasticsearch. Having this setup (and still being able to access your sqlite database) seems to me the most robust and flexible way in terms of FTS.
Elasticsearch has a fuzzy search based on the Levenshtein distance (in a nutshell, it would handle your "egsample" queries). So all you need is to make a right type of query:
from django import forms
from haystack import indexes
from haystack.forms import SearchForm
from haystack.generic_views import SearchView

# Assumes QScript is imported from your app's models module.
class QScriptIndex(indexes.SearchIndex, indexes.Indexable):
    # Document field named "text" so that the text__fuzzy filter below refers to it.
    text = indexes.CharField(document=True, model_attr='v')

    def get_model(self):
        return QScript

class QScriptSearchForm(SearchForm):
    text_fuzzy = forms.CharField(required=False)

    def search(self):
        sqs = super(QScriptSearchForm, self).search()
        if not self.is_valid():
            return self.no_query_found()
        text_fuzzy = self.cleaned_data.get('text_fuzzy')
        if text_fuzzy:
            # Fuzzy lookup: matches terms within a small edit distance.
            sqs = sqs.filter(text__fuzzy=text_fuzzy)
        return sqs

class QScriptSearchView(SearchView):
    form_class = QScriptSearchForm
Update: Since PostgreSQL has a Levenshtein distance function, you could also leverage it as the Haystack backend as well as a standalone search engine. If you choose the second way, you'll have to implement a custom query expression, which is relatively easy if you're using a recent version of Django.
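For intuition about what Levenshtein-based fuzzy matching does, here is a small pure-Python sketch. It is not Elasticsearch's or PostgreSQL's implementation; the VERSES rows and the fuzzy_score and best_match helpers are hypothetical, and the threshold of 2 edits is an arbitrary choice for the example.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_score(query, text, max_distance=2):
    """Count query words within max_distance edits of some word in text."""
    text_words = text.lower().split()
    return sum(
        1 for q in query.lower().split()
        if any(levenshtein(q, w) <= max_distance for w in text_words)
    )

# Hypothetical rows shaped like the QScript model: (ch_no, v_no, v).
VERSES = [
    (1, 1, "This is an example verse"),
    (1, 2, "Another line entirely different"),
]

def best_match(query):
    """Return (ch_no, v_no) of the verse with the most fuzzy word matches."""
    ch_no, v_no, _ = max(VERSES, key=lambda row: fuzzy_score(query, row[2]))
    return ch_no, v_no

match = best_match("This an egsample verse")
```

Scanning every row like this is fine for a few thousand records, but a search engine does the same kind of matching against an index instead of comparing against every word.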

Google Apps Engine Datastore Search

I was wondering if there was any way to search the datastore for an entry. I have a bunch of entries for songs (title, artist, rating) but I'm not sure how to really search through them for both song title and artist. We take in a search term and are looking for all entries that "match." But we are lost :( Any help is much appreciated!
We are using python
edit1: current code is useless, it's an exact search, but it might help you see the issue
query = song.gql("SELECT * FROM song WHERE title = searchTerm OR artist = searchTerm")
The song data you work with sounds like a rather static data set (primarily inserts, few or no updates). In that case there is a GAE technique called Relation Index Entity (RIE), which is an efficient way to implement keyword-based search.
It requires some preparation work, briefly:
Build a special RIE entity holding all searchable keywords from each song (a one-to-one relationship).
The RIE stores them in a StringListProperty, which supports searches like this:
keywords = 'SearchTerm'
(true for an entity if any of the values in its keywords list equals 'SearchTerm')
An AND condition works immediately by adding multiple filters like the one above.
An OR condition needs more work: an in-memory merge of AND-only queries.
You can find details on solution workflow and code samples in my blog Relation Index Entities with Python for Google Datastore.
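The AND/OR behaviour described above can be illustrated with a small in-memory sketch. It does not use the Datastore API: SONG_KEYWORDS stands in for the RIE entities, and query_and and query_or are hypothetical helpers that mimic list-membership filters and the in-memory merge.

```python
# Hypothetical in-memory stand-in for RIE entities: song key -> keyword list.
# On App Engine this list would live in a StringListProperty.
SONG_KEYWORDS = {
    "song1": ["yesterday", "beatles"],
    "song2": ["help", "beatles"],
    "song3": ["yesterday", "cover", "band"],
}

def query_and(*terms):
    """AND query: a song matches only if every term is in its keyword list."""
    wanted = [t.lower() for t in terms]
    return {key for key, words in SONG_KEYWORDS.items()
            if all(t in words for t in wanted)}

def query_or(*term_groups):
    """OR query: run one AND-only query per group and merge the results in memory."""
    merged = set()
    for group in term_groups:
        merged |= query_and(*group)
    return merged

both = query_and("Yesterday", "Beatles")
either = query_or(["yesterday"], ["help"])
```

Keywords are stored lowercased here, so the search terms are lowercased before comparison; on the datastore the equality filter would be an exact match.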
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine

Best full-text search for google-app-engine

Do you know the best full-text search for GAE?
Thanks.
Read this blog post which details how to add full-text search to App Engine models.
It also details how to make only certain fields searchable, and turn on stemming.
Now we can use the experimental Search API:
The Search API allows your application to perform Google-like searches
over structured data. You can search across several different types of
data (plain text, HTML, atom, numbers, dates, and geographic
locations). Searches return a sorted list of matching text. You can
customize the sorting and presentation of results.
Documentation: https://developers.google.com/appengine/docs/python/search/overview
Early presentation: http://www.google.com/events/io/2011/sessions/full-text-search.html
Google App Engine - Full Text Search

Search functionality for Django

I'm developing a web app using Django, and I'll need to add search functionality soon. Search will be implemented for two models, one being an extension of the auth user class and another one with the fields name, tags, and description. So I guess nothing too scary here in context of searching text.
For development I am using SQLite and as no database specific work has been done, I am at liberty to use any database in production. I'm thinking of choosing between PostgreSQL or MySQL.
I have gone through several posts on Internet about search solutions, nevertheless I'd like to get opinions for my simple case. Here are my questions:
is full-text search an overkill in my case?
is it better to rely on the database's full-text search support? If so, which database should I use?
should I use an external search library, such as Whoosh, Sphinx, or Xapian? If so, which one?
EDIT:
tags is a Tagfield (from the django-tagging app) that sits on a m2m relationship. description is a field that holds HTML and has a max_length of 1024 bytes.
If that field tags means what I think it means, i.e. you plan to store a string which concatenates multiple tags for an item, then you might need full-text search on it... but it's a bad design; rather, you should have a many-many relationship between items and a tags table (in another table, ItemTag or something, with 2 foreign keys that are the primary keys of the items table and tags table).
I can't tell whether you need full-text search on description as I have no indication of what it is -- nor whether you need the reasonable but somewhat rudimentary full-text search that MySQL 5.1 and PostgreSQL 8.3 provide, or the more powerful one in e.g. sphinx... maybe talk a bit more about the context of your app and why you're considering full-text search?
Edit: so it seems the only possible need for full-text search might be on description, and that looks like it's probably limited enough that either MySQL 5.1 or PostgreSQL 8.3 will serve it well. Me, I have a sweet spot for PostgreSQL (even though I'm reasonably expert at MySQL too), but that's a general preference, not specifically connected to full-text search issues. This blog does provide one reason to prefer PostgreSQL: you can have full-text search and still be transactional, while in MySQL full-text indexing only works on MyISAM tables, not InnoDB [[except if you add sphinx, of course]] (also see this follow-on for a bit more on full-text search in PostgreSQL and Lucene). Still, there are of course other considerations involved in picking a DB, and I don't think you'll be doing terribly with either (unless having to add sphinx for full-text plus transaction is a big problem).
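The many-to-many tag design mentioned above can be sketched with plain SQL, run here through Python's sqlite3 module. The table and column names (item, tag, item_tag) and the items_with_tag helper are made up for illustration; in Django a ManyToManyField creates an equivalent join table for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Join table: one row per (item, tag) pair, with two foreign keys.
    CREATE TABLE item_tag (
        item_id INTEGER NOT NULL REFERENCES item(id),
        tag_id INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (item_id, tag_id)
    );
    INSERT INTO item (id, name) VALUES (1, 'Django tutorial'), (2, 'Jazz album');
    INSERT INTO tag (id, name) VALUES (1, 'python'), (2, 'web'), (3, 'music');
    INSERT INTO item_tag (item_id, tag_id) VALUES (1, 1), (1, 2), (2, 3);
""")

def items_with_tag(tag_name):
    """Return the names of items carrying the given tag, using an exact match."""
    rows = conn.execute(
        "SELECT item.name FROM item "
        "JOIN item_tag ON item_tag.item_id = item.id "
        "JOIN tag ON tag.id = item_tag.tag_id "
        "WHERE tag.name = ? ORDER BY item.name",
        (tag_name,),
    )
    return [name for (name,) in rows]

web_items = items_with_tag("web")
```

With tags stored this way, a tag search is an ordinary equality lookup and needs no full-text index.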
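Django has full-text search support in its QuerySet filters through the search lookup (at the time of writing it works only on MySQL, and the full-text index has to be added to the database by hand). Right now, if you only have two models that need searching, just make a view that searches the fields on both: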
search_string = "+Django -jazz Python"
first_models = FirstModel.objects.filter(headline__search=search_string)
second_models = SecondModel.objects.filter(headline__search=search_string)
You could further filter them to make sure the results are unique, if necessary.
Additionally, there is a regex filter that may be even better for dealing with your html fields and tags since the regex can instruct the filter on exactly how to process any delimiters or markup.
Whether you need an external library depends on your needs. How much traffic are we talking about? The external libraries are generally better when it comes to performance, but as always there are advantages and disadvantages. I am using Sphinx with django-sphinx plugin, and I would recommend it if you will be doing a lot of searching.
Haystack looks promising. And it supports Whoosh on the back end.
