How to use full-text search in an sqlite3 database in Django?

I am working on a django application with sqlite3 database, that has a fixed database content. By fixed I mean the content of the db won't change over time. The model is something like this:
class QScript(models.Model):
    ch_no = models.IntegerField()
    v_no = models.IntegerField()
    v = models.TextField()
There are around 6500 records in the table. Given a text that may have some words missing or misspelled, I need to determine its ch_no and v_no. For example, if there is a v field in the db with the text "This is an example verse", a given text like "This an egsample verse" should give me the ch_no and v_no from the db. I believe this can be done using full-text search.
My queries are:
Can full-text search do this? From what I have studied, I think it can; the sqlite3 page says full-text search is "what Google, Yahoo, and Bing do with documents placed on the World Wide Web". I read this article (cited on SO) too, along with many others, but didn't find anything that closely matches my requirements.
How do I use FTS in Django models? I read this but it didn't help; it seems too outdated. I read here that it "...requires direct manipulation of the database to add the full-text index". Searching gives mostly MySQL-related info, but I need to do it in sqlite3. So how do I do that direct manipulation in sqlite3?
Edit:
Is my choice of sticking to sqlite3 correct, or should I use something different (like haystack + elasticsearch, as said by Alex Morozov)? My db will not grow any larger, and I have read that for a small database sqlite is almost always better (my situation matches the fourth item in SQLite's "when to use" checklist).

SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer splits the text into words and lowercases them: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stemming words down to their roots, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.
In short, the SQLite FTS extension is fairly basic, and isn't meant to compete with, say, Google.
As for Django integration, I don't believe there is any. You will likely need to use Django's interface for raw SQL queries, for both creating and querying the FTS table.
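A minimal sketch of that raw-SQL route, assuming the app's table is named myapp_qscript and picking an FTS4 table with the porter tokenizer (both are assumptions, not anything the question specifies):
from django.db import connection

def build_fts_index():
    # Create a separate FTS virtual table and copy the verse text into it.
    with connection.cursor() as cursor:
        cursor.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS qscript_fts "
            "USING fts4(v, tokenize=porter)"
        )
        cursor.execute(
            "INSERT INTO qscript_fts (docid, v) "
            "SELECT id, v FROM myapp_qscript"
        )

def fts_search(text):
    # Return the ids of QScript rows whose verse text matches the FTS query.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT docid FROM qscript_fts WHERE qscript_fts MATCH %s",
            [text],
        )
        return [row[0] for row in cursor.fetchall()]
Note that this still only does keyword/stem matching; by itself it will not catch misspellings like "egsample".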

I think that while sqlite is an amazing piece of software, its full-text search capabilities are quite limited. Instead you could index your database using Haystack Django app with some backend like Elasticsearch. Having this setup (and still being able to access your sqlite database) seems to me the most robust and flexible way in terms of FTS.
Elasticsearch has a fuzzy search based on the Levenshtein distance (in a nutshell, it would handle your "egsample" queries). So all you need is to make the right type of query:
from django import forms
from haystack import indexes
from haystack.forms import SearchForm
from haystack.generic_views import SearchView

from myapp.models import QScript  # adjust to your app's actual module path

class QScriptIndex(indexes.SearchIndex, indexes.Indexable):
    v = indexes.CharField(document=True)

    def get_model(self):
        return QScript

class QScriptSearchForm(SearchForm):
    text_fuzzy = forms.CharField(required=False)

    def search(self):
        sqs = super(QScriptSearchForm, self).search()
        if not self.is_valid():
            return self.no_query_found()
        text_fuzzy = self.cleaned_data.get('text_fuzzy')
        if text_fuzzy:
            sqs = sqs.filter(text__fuzzy=text_fuzzy)
        return sqs

class QScriptSearchView(SearchView):
    form_class = QScriptSearchForm
Update: As long as PostgreSQL has the Levenshtein distance function, you could also leverage it either as the Haystack backend or as a standalone search engine. If you choose the second way, you'll have to implement a custom query expression, which is relatively easy if you're using a recent version of Django.
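For the standalone route, a minimal sketch of such a custom expression, assuming PostgreSQL with the fuzzystrmatch extension enabled (CREATE EXTENSION fuzzystrmatch) and a recent Django; the query text is just the example from the question:
from django.db.models import Func, IntegerField, Value

from myapp.models import QScript  # adjust to your app's actual module path

class Levenshtein(Func):
    # Maps to PostgreSQL's levenshtein(text, text) from fuzzystrmatch.
    function = 'levenshtein'
    output_field = IntegerField()

# Rank all verses by edit distance to the query text and take the closest one.
closest = (
    QScript.objects
    .annotate(distance=Levenshtein('v', Value('This an egsample verse')))
    .order_by('distance')
    .first()
)
One caveat: PostgreSQL's levenshtein() only accepts inputs up to 255 characters, so this works for short verse texts only.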

Related

How do you iterate over StringProperty as a string object?

My code is as follows:
class sample(ndb.Model):
    text = ndb.StringProperty()

class sample2(webapp2.RequestHandler):
    def get(self, item):
        q = sample.query()
        q1 = query.filter(item in sample.text)
I want to search for a specific word (item) in the text inside sample. How do I go about it? I tried this link, but it doesn't really answer my question: How do I get the value of a StringProperty in Python for Google App Engine?
Unfortunately, you can't do queries like that with the datastore (or likely any other database). Datastore queries are based on indices, and indices are not able to store complicated information, such as whether a string has a certain substring.
App Engine also has a Search API that you can use to do more complicated text searching.
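A minimal sketch of what the Search API route might look like with the question's sample model, assuming the first-generation (Python 2) runtime where that API is available; the index name is made up:
from google.appengine.api import search

index = search.Index(name='sample_text')

def index_sample(entity):
    # Mirror each sample entity into a searchable document.
    index.put(search.Document(
        doc_id=str(entity.key.id()),
        fields=[search.TextField(name='text', value=entity.text)],
    ))

def find_samples(word):
    # Full-text query over the 'text' field; returns matching document ids.
    results = index.search('text: %s' % word)
    return [doc.doc_id for doc in results]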
First of all, Python 2 has been deprecated since the 1st of January 2020, and the Google Cloud Platform strongly encourages migration to newer runtimes. The ndb library is also no longer recommended, and code relying on it would need to be updated before you could start using the Python 3 runtime. All these reasons make me suggest switching directly, before you spend much time on them.
Having said that, you could find the answer in the ndb library documentation about queries and properties. There are also a couple of examples in the Github code samples.
First you need to retrieve the entity with a query, and then access its data through Python object attribute notation. More or less it will look like the following:
entity = sample.query().get()  # Retrieve the first sample entity
substring = 'word' in entity.text  # True if 'word' is a substring of text

Explain search (Sphinx/Haystack) in simple context?

Could you explain how search engines like Sphinx, Haystack, etc. fit into a web framework? If you could explain in a way that someone new to web development could understand, that would help.
One example use case I made up for this question is a book search feature. Let's say I have a NoSQL database that contains book objects, each containing author, title, ISBN, etc.; how does something like Sphinx/Haystack/another search engine fit in with my database to search for books with a given ISBN?
Firstly, Haystack isn't a search engine, it's a library that provides a Django API to existing search engines like Solr and Whoosh.
That said, your example isn't really a very good one. You wouldn't use a separate search engine to search by ISBN, because your database would already have an index on the Book table which would efficiently do that search. Where a search engine would come in could be in two places. Firstly, you could index some or all of the book's contents to search on: databases are not very good at full-text search, but this is an area where search engines shine. Secondly, you could provide a search against multiple fields - say, author, title, publisher and description - in one go.
Also, search engines provide useful functionality like suggestions, faceting and so on that you won't get from a database.
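To make the split concrete, a rough sketch of where Haystack would sit for the book example; the Book model, app path, and field names are assumptions made up for illustration:
from haystack import indexes

from myapp.models import Book  # hypothetical model with author, title, isbn fields

class BookIndex(indexes.SearchIndex, indexes.Indexable):
    # The document field is what the backend (Solr, Whoosh, ...) full-text
    # indexes; by convention it is built from a template that concatenates
    # title, author, description, and so on.
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='author')
    isbn = indexes.CharField(model_attr='isbn')

    def get_model(self):
        return Book
A full-text query then goes through the engine, e.g. SearchQuerySet().models(Book).filter(content='victorian detective novels'), while an exact ISBN lookup stays a plain database query (Book.objects.get(isbn=...)).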

How to use High Replication Datastore

Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However, I am still completely confused about its practical usage. I understand the benefits (from the video) and they sound great, but what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding new functionality that the previous guestbook code example does not have (it seems you can change the guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: with high replication datastore compatible queries I refer to queries that always return the latest data and not potential stale data. Using entity groups seems to be the way to go here but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name')  # datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
    # Save the data, and redirect to the view page
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()  # datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id))  # datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me: does using ancestor keys have any impact on the model in the datastore? For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
    return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However, 'Guestbook' does not exist in the model, so how can you use db.Key.from_path on this, and why does it work? Does this change how data is stored in the datastore in a way I need to take into account when retrieving the data (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!
I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.
The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
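To make that concrete with the Item example, a sketch of the entity-group approach; the 'ItemList' kind and the 'default' key name are made up for illustration:
# One well-known parent key puts every Item in a single entity group.
ITEMS_KEY = db.Key.from_path('ItemList', 'default')

item = Item(parent=ITEMS_KEY, name='example')
item.put()  # datastore request

# An ancestor query is strongly consistent in the HRD: it always sees the
# latest writes to its entity group, at the cost of the write-rate limit
# mentioned above.
items = Item.all().ancestor(ITEMS_KEY).order('name').fetch(100)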
In answer to your follow-up question, "Guestbook" is the name of the entity.

Google Apps Engine Datastore Search

I was wondering if there was any way to search the datastore for an entry. I have a bunch of entries for songs (title, artist, rating) but I'm not sure how to really search through them for both song title and artist. We take in a search term and are looking for all entries that "match." But we are lost :( any help is much appreciated!
We are using Python.
Edit 1: my current code is useless (it's an exact search), but it might help you see the issue:
query = song.gql("SELECT * FROM song WHERE title = searchTerm OR artist = searchTerm")
The song data you work with sounds like a rather static data set (primarily inserts, no or few updates). In that case there is a GAE technique called Relation Index Entity (RIE) which is an efficient way to implement keyword-based search.
But some preparation work is required, briefly:
build a special RIE entity where you place all searchable keywords from each song (one-to-one relationship);
the RIE stores them in a StringListProperty, which supports filters like keywords = 'SearchTerm' (this matches if any of the values in the keywords list equals 'SearchTerm');
an AND condition works immediately by adding multiple filters as above;
an OR condition needs more work, implemented as an in-memory merge of AND-only queries.
You can find details on the solution workflow and code samples in my blog post Relation Index Entities with Python for Google Datastore; a rough sketch of the structure is shown below.
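A minimal sketch of that RIE layout (the Song/SongIndex class names and the naive keyword extraction are illustrative only; the fields come from the question):
from google.appengine.ext import db

class Song(db.Model):
    title = db.StringProperty()
    artist = db.StringProperty()
    rating = db.RatingProperty()

class SongIndex(db.Model):
    # One index entity per song, parented to the song it describes.
    keywords = db.StringListProperty()

def index_song(song):
    keywords = song.title.lower().split() + song.artist.lower().split()
    SongIndex(parent=song, keywords=keywords).put()

def search_songs(term):
    # keys_only query against the index, then fetch the parent Song entities.
    keys = (SongIndex.all(keys_only=True)
            .filter('keywords =', term.lower())
            .fetch(100))
    return db.get([key.parent() for key in keys])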
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine

Search functionality for Django

I'm developing a web app using Django, and I'll need to add search functionality soon. Search will be implemented for two models, one being an extension of the auth user class and another one with the fields name, tags, and description. So I guess nothing too scary here in context of searching text.
For development I am using SQLite and as no database specific work has been done, I am at liberty to use any database in production. I'm thinking of choosing between PostgreSQL or MySQL.
I have gone through several posts on the Internet about search solutions; nevertheless, I'd like to get opinions for my simple case. Here are my questions:
is full-text search overkill in my case?
is it better to rely on the database's full-text search support? If so, which database should I use?
should I use an external search library, such as Whoosh, Sphinx, or Xapian? If so, which one?
EDIT:
tags is a TagField (from the django-tagging app) that sits on an m2m relationship. description is a field that holds HTML and has a max_length of 1024 bytes.
If that field tags means what I think it means, i.e. you plan to store a string which concatenates multiple tags for an item, then you might need full-text search on it... but it's a bad design; rather, you should have a many-many relationship between items and a tags table (in another table, ItemTag or something, with 2 foreign keys that are the primary keys of the items table and tags table).
I can't tell whether you need full-text search on description as I have no indication of what it is -- nor whether you need the reasonable but somewhat rudimentary full-text search that MySQL 5.1 and PostgreSQL 8.3 provide, or the more powerful one in e.g. sphinx... maybe talk a bit more about the context of your app and why you're considering full-text search?
Edit: so it seems the only possible need for full-text search might be on description, and that looks like it's probably limited enough that either MySQL 5.1 or PostgreSQL 8.3 will serve it well. Me, I have a sweet spot for PostgreSQL (even though I'm reasonably expert at MySQL too), but that's a general preference, not specifically connected to full-text search issues. This blog does provide one reason to prefer PostgreSQL: you can have full-text search and still be transactional, while in MySQL full-text indexing only works on MyISAM tables, not InnoDB [[except if you add sphinx, of course]] (also see this follow-on for a bit more on full-text search in PostgreSQL and Lucene). Still, there are of course other considerations involved in picking a DB, and I don't think you'll be doing terribly with either (unless having to add sphinx for full-text plus transactions is a big problem).
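For reference, on current Django versions PostgreSQL's full-text search is also exposed directly through django.contrib.postgres.search; a minimal sketch, with a hypothetical Item model standing in for your two models:
from django.contrib.postgres.search import SearchVector

from myapp.models import Item  # hypothetical model with name/description fields

results = Item.objects.annotate(
    search=SearchVector('name', 'description'),
).filter(search='django')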
Django has full text searching support in its QuerySet filters. Right now, if you only have two models that need searching, just make a view that searches the fields on both:
search_string = "+Django -jazz Python"
first_models = FirstModel.objects.filter(headline__search=search_string)
second_models = SecondModel.objects.filter(headline__search=search_string)
You could further filter them to make sure the results are unique, if necessary.
Additionally, there is a regex filter that may be even better for dealing with your html fields and tags since the regex can instruct the filter on exactly how to process any delimiters or markup.
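For example, a case-insensitive regex lookup against the HTML description might look like this (the pattern is only illustrative, and SecondModel/description stand in for your actual model and field):
# Matches descriptions containing an <h1>..<h6> heading, case-insensitively.
results = SecondModel.objects.filter(description__iregex=r'<h[1-6][^>]*>')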
Whether you need an external library depends on your needs. How much traffic are we talking about? The external libraries are generally better when it comes to performance, but as always there are advantages and disadvantages. I am using Sphinx with django-sphinx plugin, and I would recommend it if you will be doing a lot of searching.
Haystack looks promising. And it supports Whoosh on the back end.
