How to implement full text search in Django?

I would like to implement a search function in a Django blogging application. The status quo is that I have a list of strings supplied by the user, and the queryset is narrowed down by each string so that it includes only objects matching that string.
See:
if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                Q(title__icontains=string) |
                Q(text__icontains=string) |
                Q(tags__name__exact=string)
            )
        return archive_index(request, queryset=posts, date_field='date')
Now, what if I wanted to combine the search words with a logical OR instead of a logical AND? How would I do that? Is there a way to do it with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full text search like this, or would you recommend using a search engine like Solr, Whoosh or Xapian? What are their benefits?

I suggest you adopt a search engine.
We've used Haystack search, a modular search application for Django supporting many search engines (Solr, Xapian, Whoosh, etc.)
Advantages:
Faster
Performs search queries without even hitting the database
Highlights searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc.
Disadvantages:
Search indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.

Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects; AND would be &.
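If you want to choose the combinator at runtime, one common pattern (a sketch, not from the original answer; build_query is a hypothetical helper) is to collect one Q object per word and reduce them with operator.or_ or operator.and_:

import operator
from functools import reduce
from django.db.models import Q

def build_query(words, use_or=True):
    # one Q per search word, each matching title, text, or tag
    # (assumes at least one word; reduce() needs a non-empty list)
    q_objects = [
        Q(title__icontains=word) |
        Q(text__icontains=word) |
        Q(tags__name__exact=word)
        for word in words
    ]
    combinator = operator.or_ if use_or else operator.and_
    return reduce(combinator, q_objects)

# in the view from the question:
posts = Post.objects.filter(build_query(form.cleaned_data['query'].split()))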
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.

Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full text search engine (or whatever you call it), the words in your text are indexed every time you insert new records, so queries will be a lot faster, especially as your database grows.
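As an aside, if you are on PostgreSQL, newer Django versions (1.10+) ship full text search support in django.contrib.postgres that uses the database's own text indexing instead of icontains scans. A minimal sketch:

from django.contrib.postgres.search import SearchVector

# builds a tsvector over both fields and matches the query against it
posts = (Post.objects
         .annotate(search=SearchVector('title', 'text'))
         .filter(search='keyword'))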

Solr is very easy to set up and integrate with Django. Haystack makes it even simpler.
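For reference, a minimal Haystack setup looks roughly like this (the blog app name and PostIndex class are illustrative; Whoosh is shown because it needs no external server, and Solr is configured the same way):

# settings.py
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': '/path/to/whoosh_index',
    },
}

# blog/search_indexes.py
from haystack import indexes
from blog.models import Post

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    # the document field is rendered from a template listing the model fields
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return Post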

For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index accordingly.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google about the changed pages and Google will do all the hard work (indexing, parsing the queries, etc.). On top of that, most people are used to using Google for search, and it will keep your site current in global Google searches, too.
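Telling search engines about changed pages is what Django's django.contrib.sitemaps framework does; a minimal sketch (the Post fields are taken from the question's code):

from django.contrib.sitemaps import Sitemap
from blog.models import Post

class PostSitemap(Sitemap):
    changefreq = 'daily'

    def items(self):
        return Post.objects.all()

    def lastmod(self, obj):
        # 'date' is the date_field used in the question's archive_index call
        return obj.date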

I think full text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into making a custom full text search rather than installing an application to perform the search for you. An application would add more dependencies, maintenance, and extra effort when storing data. By building the search yourself, you can also add nice custom features. For example, if a query exactly matches one title, you can redirect the user to that page instead of showing the results. Another would be to allow title: or author: prefixes on keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated_weight = 0
            for item, weight in self.data.items():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight


def search(searchtext):
    candidates = WeightedGroup()

    for arg in shlex.split(searchtext):  # shlex understands quotes

        # Search TITLE
        # order by date so we get most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')
        arg_hits = query.count()  # count is cheap

        if arg_hits == 0 or arg_hits > 1000:
            continue  # skip keywords with no hits or with too many hits

        # Each of these is expensive, as it transfers data
        # from the db and builds a Python object,
        for post in query[:50]:  # so we limit it to 50, for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)

    # TODO add searches for other areas
    # Weight might also be adjusted by the number of hits within the text,
    # or perhaps you can find other metrics to value a post higher,
    # like number of views

    # candidates can contain a lot of stuff now, show most relevant only
    sorted_result = Post.objects.filter(post_id__in=candidates.list(20))
    return sorted_result

Related

Does gdata-python-client allow fulltext queries with multiple terms?

I'm attempting to search for contacts via the Google Contacts API, using multiple search terms. Searching by a single term works fine and returns contact(s):
query = gdata.contacts.client.ContactsQuery()
query.text_query = '1048'
feed = gd_client.GetContacts(q=query)
for entry in feed.entry:
    # Do stuff
However, I would like to search by multiple terms:
query = gdata.contacts.client.ContactsQuery()
query.text_query = '1048 1049 1050'
feed = gd_client.GetContacts(q=query)
When I do this, no results are returned, and I've found so far that spaces are being replaced by + signs:
https://www.google.com/m8/feeds/contacts/default/full?q=3066+3068+3073+3074
I'm digging through the gdata-client-python code right now to find where it's building the query string, but wanted to toss the question out there as well.
According to the docs, both types of search are supported by the API, and I've seen some similar docs when searching through related APIs (Docs, Calendar, etc):
https://developers.google.com/google-apps/contacts/v3/reference#contacts-query-parameters-reference
Thanks!
Looks like I was mistaken in my understanding of the gdata query string functionality.
https://developers.google.com/gdata/docs/2.0/reference?hl=en#Queries
'The service returns all entries that match all of the search terms (like using AND between terms).'
Helps to read the docs and understand them!
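So the API ANDs the terms together. If OR semantics are needed, one workaround (a sketch using only the calls shown above, not a built-in gdata feature) is to issue one query per term and merge the results client-side:

import gdata.contacts.client

def get_contacts_matching_any(gd_client, terms):
    # OR semantics: query each term separately, dedupe by atom entry id
    seen = {}
    for term in terms:
        query = gdata.contacts.client.ContactsQuery()
        query.text_query = term
        feed = gd_client.GetContacts(q=query)
        for entry in feed.entry:
            seen[entry.id.text] = entry
    return seen.values()

contacts = get_contacts_matching_any(gd_client, ['1048', '1049', '1050'])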

GAE Python: why is searching through the datastore so slow? What are good search query algorithms?

I'm using the Google App Engine datastore and have around 1500 blog posts in it.
Using ndb
class BlogPost(ndb.Model):
    title = ndb.StringProperty(required=True)
    content = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
So I'm using
words = self.request.get("q")
search_words = words.split()
query = libs.blogs_cache()  # returns a list of blogs from memcache
search_results = [blog for blog in query for word in search_words
                  if word.lower() in blog.title.lower()]
This is an example I use for the time being. But unfortunately, it is extremely slow (around 6 seconds) because it has to go through every single record to find the results. If you use multiple words, the number of searches multiplies.
So my question is: what are some ways to speed up search on Google App Engine? Any examples and pointers would be appreciated. Thanks in advance.
I think for this type of search you should use the Google App Engine Search API:
https://cloud.google.com/appengine/docs/python/search/
Just feed the data into search documents and you can then query them.
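A rough sketch of what that looks like (the index name 'blog_posts' is illustrative, and post stands for a BlogPost entity from the question's model):

from google.appengine.api import search

index = search.Index(name='blog_posts')

# at write time, index each post once
index.put(search.Document(
    doc_id=str(post.key.id()),
    fields=[search.TextField(name='title', value=post.title),
            search.TextField(name='content', value=post.content)]))

# at query time, the service matches against its own index
results = index.search(words)
matching_ids = [doc.doc_id for doc in results]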
If there are not too many words in search_words, you can make an IN query on the title:
search_words = [word.lower() for word in words.split()]
search_results = BlogPost.query(BlogPost.title.IN(search_words)).fetch()
Notice that this matches the title exactly, which might not be what you want, and if you need to query against lowercase blog titles, you probably also have to make a ComputedProperty for that.
I think @omair_77's answer is likely best, but an alternative to consider, if the blog posts and the search lists are small enough, is a computed property:
class BlogPost(ndb.Model):
    title = ndb.StringProperty(required=True)
    content = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
    # repeated=True because the computed value is a list of words
    words = ndb.ComputedProperty(lambda self: self.content.lower().split(),
                                 repeated=True)
Now, BlogPost.words.IN(words.lower().split()) will give you the desired semantics -- all blogs containing at least one of the words in space-separated string words, case-insensitive.
If you need to ignore punctuation, you'll likely want regular expressions (re.findall(r'\w+', whatever.lower())) instead of the simple split calls, but the general ideas in GAE terms are the same: computed properties can be used in queries, and the IN operator locates entities with at least one "hit" -- and it does so rapidly, using indices on the "back-end side" of things.

Flask template streaming with Jinja

I have a Flask application. On a particular view, I show a table with about 100k rows in total. It's understandably taking a long time for the page to load, and I'm looking for ways to improve it. So far I've determined that I query the database and get a result fairly quickly; I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods=['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    match_show_option=True,
                                    match_form=form,
                                    matches=matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank with nothing there. I've also tried the solution with streaming the data to a json and loading it via ajax, but nothing seems to be in the json file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
        app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was for pagination. This solution works well, except I need to be able to sort through the entire dataset by headers, and couldn't find a way to do that without rendering the whole dataset in the table then using JQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging on building 100K+ Match items and storing them in memory. You will want to stream the results from the DB as well, using yield_per. However, only Postgres+psycopg2 support the necessary stream_results option (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# stream ten results at a time
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
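A minimal sketch of that (the page query argument, template name and per_page value are illustrative, and standard Flask imports of request and render_template are assumed):

@app.route('/thing/matches')
def r_matches():
    page = request.args.get('page', 1, type=int)
    pagination = (Match.query
                  .filter(Match.name == g.name)
                  .order_by(Match.name)
                  .paginate(page=page, per_page=100))
    # pagination.items holds only this page's rows;
    # pagination.has_next and pagination.next_num drive the page links
    return render_template('matches.html', pagination=pagination)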
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Haystack + Xapian: Can't get autocomplete functionality working

I'm trying to get autocomplete working on my server for search. Here is an example of one of my indexer classes:
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    artist_name = indexes.CharField(model_attr='clean_artist_name', null=True)
    submitted_date = indexes.DateTimeField(model_attr='submitted_date')
    total_count = indexes.IntegerField(model_attr='total_count')

    # This is used for autocomplete
    content_auto = indexes.NgramField(use_template=True)

    def get_model(self):
        return Artist

    def index_queryset(self, using=None):
        """Used when the entire index of a model is updated."""
        return self.get_model().objects.filter(date_submitted__lte=datetime.now())

    def get_updated_field(self):
        return "last_data_change"
The text and content_auto fields are populated using templates, which in the case of Artists contain just the artist name. According to the docs, something like this should work for autocomplete:
objResultSet = SearchQuerySet().models(Artist).autocomplete(content_auto=search_term)
However, trying this with the string "bill w" returns Bill Stephney as the top result and then Bill Withers as the second result. This is because Bill Stephney has more records in the database, but Stephney shouldn't be matching this query: once the "w" is detected it should only match Bill Withers (and other Bill Ws). I've also tried wildcards:
objResultSet = SearchQuerySet().models(Artist).filter(content_auto=search_term + '*')
and
objResultSet = SearchQuerySet().models(Artist).filter(text=AutoQuery(search_term + '*'))
but the wildcard seems to cause a load of problems, with the development server hanging and eventually stopping due to a Write Failed: Broken Pipe error with a cryptic stack trace, all of which is within the Python framework. Has anyone managed to get this working properly? Is NgramField the right type to use? I've tried using EdgeNgramField but that gave me similar results.
I believe the Haystack documentation recommends EdgeNgramField for "standard text," which I assume is English. They recommend NgramField for Asian languages or if you want to match across word boundaries. I.e., I think you want your content_auto to use EdgeNgramField:
content_auto = indexes.EdgeNgramField(use_template=True)
Also, since n-grams are not exactly wildcard searches (in the way we use * [the asterisk] in shell script glob matches, for example), you should not use * in your filter.
One thing I have found that makes a difference in the search results is the set of parameters you can tweak in the backend engine -- there are settings for the n-gram tokenizer and n-gram filter. Depending on the search engine backend you're using, changing the min_gram values will affect the results you get in your matches.
I've only used the elasticsearch backend, so I don't know if other backends are as sensitive to these n-gram settings as the solr/elasticsearch ones. Basically, I created a custom backend based on the default one that comes with haystack and tweaked the min_gram values to test the matches. The higher the value you set, the more "accurate" the match is, since it has to match a longer token.
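As a sketch of that approach (the filter name haystack_edgengram follows Haystack's Elasticsearch backend defaults; the min_gram value is just an example):

from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend, ElasticsearchSearchEngine)

class CustomNgramBackend(ElasticsearchSearchBackend):
    def __init__(self, connection_alias, **connection_options):
        super(CustomNgramBackend, self).__init__(
            connection_alias, **connection_options)
        # require at least two characters before an edge n-gram matches
        self.DEFAULT_SETTINGS['settings']['analysis']['filter'][
            'haystack_edgengram']['min_gram'] = 2

class CustomNgramEngine(ElasticsearchSearchEngine):
    backend = CustomNgramBackend

You would then point HAYSTACK_CONNECTIONS['default']['ENGINE'] at CustomNgramEngine.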
See this question on using a backend with custom n-gram settings for elasticsearch:
EdgeNgramField min and max letters in django haystack

django-haystack - filter based on query along with query for search term

I am able to search using ?q='search term'. But my requirement is that, within the search results, I should be able to order by price, filter by another field, and so on.
Will provide more information if necessary.
You should look into faceting, which enables you to search on other fields of a model. Basically it comes down to defining the facets and then enabling the user to search within them, in addition to the textual keyword search you're doing now.
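In Haystack terms that looks roughly like this (the Product model and price field are assumed from your requirement, not from your code):

# search_indexes.py -- mark the extra field as faceted
from haystack import indexes
from myapp.models import Product

class ProductIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    price = indexes.DecimalField(model_attr='price', faceted=True)

    def get_model(self):
        return Product

# at query time: facet on price, then narrow to a selected value
from haystack.query import SearchQuerySet

sqs = SearchQuerySet().filter(content='search term').facet('price')
counts = sqs.facet_counts()          # available values with counts
narrowed = sqs.narrow('price_exact:9.99')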
Assuming you are using a SearchView, override its get_results method to do the extra processing you need on the SearchQuerySet, like:
class MySearchView(SearchView):
    # ...
    def get_results(self):
        results = super(MySearchView, self).get_results()
        order = self.request.GET.get('order')
        if order:
            results = results.order_by(order)
        return results
