Haystack + Xapian: Can't get autocomplete functionality working - python

I'm trying to get autocomplete working on my server for search. Here is an example of one of my indexer classes:
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
text = indexes.CharField(document=True, use_template=True)
artist_name = indexes.CharField(model_attr='clean_artist_name', null=True)
submitted_date = indexes.DateTimeField(model_attr='submitted_date')
total_count = indexes.IntegerField(model_attr='total_count')
# This is used for autocomplete
content_auto = indexes.NgramField(use_template=True)
def get_model(self):
return Artist
def index_queryset(self, using=None):
""" Used when the entire index of a model is updated. """
return self.get_model().objects.filter(date_submitted__lte=datetime.now())
def get_updated_field(self):
return "last_data_change"
The text and content_auto fields are populated using templates which, in the case of Artsts, is just the artist name. According to the docs, something like this should work for autocomplete:
objResultSet = SearchQuerySet().models(Artist).autocomplete(content_auto=search_term)
However, trying this with the string "bill w" returns Bill Stephney as the top result and then Bill Withers as the second result. This is because Bill Stephney has more records in the database, but Stephney shouldn't be matching this query: once the "w" is detected it should only match Bill Withers (and other Bill Ws). I've also tried wildcards:
objResultSet = SearchQuerySet().models(Artist).filter(content_auto=search_term + '*')
and
objResultSet = SearchQuerySet().models(Artist).filter(text=AutoQuery(search_term + '*'))
but the wildcard seems to cause a load of problems, with the development server hanging and eventually stopping due to a Write Failed: Broken Pipe error with a cryptic stack trace, all of which is within the Python framework. Has anyone managed to get this working properly? Is NgramField the right type to use? I've tried using EdgeNgramField but that gave me similar results.

I believe the Haystack documentation recommends EdgeNgramField for "standard text," which I assume is English. They recommend NgramField for Asian languages or if you want to match across word boundaries. I.e., I think you want your content_auto to use EdgeNgramField:
content_auto = indexes.EdgeNgramField(use_template=True)
Also, since n-grams are not exactly wildcard searches (in the way we use * [the asterisk] in shell script glob matches, for example), you should not use * in your filter.
One thing I have found that makes a difference in the search results are the parameters you can tweak in the backend engine -- there are settings for the n-gram tokenizer and n-gram filter. Depending on the search engine backend you're using, changing the min_gram values will affect the results you get in your matches.
I've only used the elasticsearch backend so I don't know if other backends are as sensitive to these n-gram settings as the solr/elasticsearch ones. Basically, I created a custom backend based on the default one that comes with haystack and tweaked the min_gram values to test the matches. The higher value you set the more "accurate" the match is since it has to match a longer token.
See this question on using a backend with custom n-gram settings for elasticsearch:
EdgeNgramField min and max letters in django haystack

Related

Django-Haystack(elasticsearch) Autocomplete giving results for substring in search term

I have a search index with elasticsearch as backend:
class MySearchIndex(indexes.SearchIndex, indexes.Indexable):
...
name = indexes.CharField(model_attr='name')
name_auto = indexes.NgramField(model_attr='name')
...
Suppose I have following values in elasticsearch:
Cable
Magnet
Network
Internet
Switch
When I execute search for netw, it returned Magnet & Internet also along with Network. Using some other test cases I think haystack is searching for substring also, like net in netw as you see in above example.
Here is the code:
sqs = sqs.filter(category='cat_name').using(using)
queried = sqs.autocomplete(name_auto=q)
Also tried with:
queried = sqs.autocomplete(name_auto__contains=q)
How can I resolve this and make it working to return only those results that contains exact search term ?
Using django-haystack==2.4.1 Django==1.9.1 elasticsearch==1.9.0
Customize your elasticsearch backend settings with django-hesab
The default settings of django-hesab will return the exact search result.

django-haystack - filter based on query along with query for search term

I am able to search using ?q='search term'. But my requirement is, among the searched terms, I should be able to order them by price etc. filter by another field etc.
Will provide more information if necessary.
You should look into faceting which enables you to search on other fields of a model. Basically it comes down to defining the facets and then enabling the user to search for them, in addition to textual search as you're doing now with keywords.
Assuming you are using a SearchView, override the get_results method to do the extra processing you need on the SearchQuerySet like:
Class MySearchView(SearchView)
#...
def get_results(self):
results = super(MySearchView, self).get_results()
order = self.request.GET.get('order')
if order:
results = results.order_by(order)
return results

Django-haystack: How to enable highlighting in my setup?

I'm currently adding search functionality to my Django application using django-haystack v2.0.0-beta and Whoosh as the back end. Creating the index and returning the search results works fine so far. Now I want to enable the highlighting feature but I don't get it to work.
I'm using a highly customized setup for which the haystack documentation is not a great help. My Django application is a pure AJAX application, i.e., all requests between client and server are handled asynchronously by using jQuery and $.ajax(). That's why I have written a custom Django view that creates the haystack search queryset manually and dumps the search results into a JSON object. All of this works fine, but the addition of highlighting does not work. Here is my code that I have so far:
search_indexes.py
class CrawledWebpageIndex(indexes.SearchIndex, indexes.Indexable):
text = indexes.CharField(document=True, use_template=True)
def get_model(self):
return CrawledWebpage # This is my Django model
forms.py
class HaystackSearchForm(forms.Form):
q = forms.CharField(
max_length=100,
label='Enter your search query')
views.py (I adopted some code from this post as it looked reasonable to me but it's probably wrong)
def return_search_results_ajax(request):
haystack_search_form = HaystackSearchForm(request.POST)
response = {}
if haystack_search_form.is_valid():
search_query = haystack_search_form.cleaned_data['q']
sqs = SearchQuerySet().filter(content=search_query)
highlighted_search_form = HighlightedSearchForm(request.POST, searchqueryset=sqs, load_all=True)
search_results = highlighted_search_form.search()
# Here I extract those fields of my model that should be displayed as results
webpage_urls = [result.object.url for result in search_results[:10]]
response['webpage_urls'] = webpage_urls
return HttpResponse(json.dumps(response), mimetype='application/json')
This code works fine as far as the search results are returned properly. But when I try to access the highlighted text snippet for a search result, for example for the first one:
print search_results[0].highlighted
Then I always get an empty string as the result: {'text': ['']}
Can anyone help me to get the highlighting feature working? Thank you very much in advance.
It looks like this is possibly a Haystack bug that has gone unresolved for a long time: http://github.com/toastdriven/django-haystack/issues/310
http://github.com/toastdriven/django-haystack/issues/273
http://github.com/toastdriven/django-haystack/issues/582
As an alternative, you could use Haystack's highlighting functionality instead of Whoosh's to highlight the results yourself. For example, once you get your search results in sqs, you could do
from haystack.utils import Highlighter
highlighter = Highlighter(search_query)
print highlighter.highlight(sqs[0].text)
to get the highlighted text of the first result. See http://django-haystack.readthedocs.org/en/latest/highlighting.html for the documentation.
I'm not familiar with Haystack but could it be because you're using HaystackSearchForm in one place and HighlightedSearchForm in another?

Putting together Haystack records without templates

The haystack documentation (link below) makes this statement:
Additionally, we're providing use_template=True on the text field.
This allows us to use a data template (rather than error prone
concatenation) to build the document the search engine will use in
searching.
How would one go about using concatenation to build the document? I couldn't find an example.
It may have something to do with overriding the prepare method (second link). But in the example given in the documentation the prepare method is used together with a template, so the two might also be orthogonal.
https://github.com/toastdriven/django-haystack/blob/master/docs/tutorial.rst
http://django-haystack.readthedocs.org/en/latest/searchindex_api.html#advanced-data-preparation
You can see how it works in the Haystack source. Basically, the default implementation of the prepare method on SearchField (the base class for Haystack's fields) calls prepare_template if use_template is True.
If you don't want to use a template, you can indeed use concatenation - it's as simple as just joining the data you want together, separated by something (here I've used a newline):
def prepare_myfield(self, obj):
return self.cleaned_data['field1'] + '\n' + self.cleaned_data['field2']
etc.

How to implement full text search in Django?

I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
form = SearchForm(request.POST)
if form.is_valid():
posts = Post.objects.all()
for string in form.cleaned_data['query'].split():
posts = posts.filter(
Q(title__icontains=string) |
Q(text__icontains=string) |
Q(tags__name__exact=string)
)
return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want do concatenate each word that is searched for by a logical AND but with a logical OR? How would I do that? Is there a way to do that with Django's own Queryset methods or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full text search like this or would you recommend using a search engine like Solr, Whoosh or Xapian. What are their benefits?
I suggest you to adopt a search engine.
We've used Haystack search, a modular search application for django supporting many search engines (Solr, Xapian, Whoosh, etc...)
Advantages:
Faster
perform search queries even without querying the database.
Highlight searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc...
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) run as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND - you're using \ to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full text search engine (or whatever you call it), text (words) is (are) indexed every time you insert new records. So queries will be a lot faster especially when your database grows.
SOLR is very easy to setup and integrate with Django. Haystack makes it even simpler.
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google the changed pages and Google will do all the hard work (indexing, parsing the queries, etc). On top of that, most people are used to use Google to search plus it will keep your site current in the global Google searches, too.
I think full text search on an application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage I think it might be more affordable to put some time into making an custom full text search rather than installing an application to perform the search for you. And application would create more dependency, maintenance and extra effort when storing data. By making your search yourself and you can build in nice custom features. Like for example, if your text exactly matches one title you can direct the user to that page instead of showing the results. Another would be to allow title: or author: prefixes to keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex
class WeightedGroup:
def __init__(self):
# using a dictionary will make the results not paginate
# but it will be a lot faster when storing data
self.data = {}
def list(self, max_len=0):
# returns a sorted list of the items with heaviest weight first
res = []
while len(self.data) != 0:
nominated_weight = 0
for item, weight in self.data.iteritems():
if weight > nominated_weight:
nominated = item
nominated_weight = weight
self.data.pop(nominated)
res.append(nominated)
if len(res) == max_len:
return res
return res
def append(self, weight, item):
if item in self.data:
self.data[item] += weight
else:
self.data[item] = weight
def search(searchtext):
candidates = WeightedGroup()
for arg in shlex.split(searchtext): # shlex understand quotes
# Search TITLE
# order by date so we get most recent posts
query = Post.objects.filter_by(title__icontains=arg).order_by('-date')
arg_hits = query.count() # count is cheap
if arg_hits > 1000:
continue # skip keywords which has too many hits
# Each of these are expensive as it would transfer data
# from the db and build a python object,
for post in query[:50]: # so we limit it to 50 for example
# more hits a keyword has the lesser it's relevant
candidates.append(100.0 / arg_hits, post.post_id)
# TODO add searchs for other areas
# Weight might also be adjusted with number of hits within the text
# or perhaps you can find other metrics to value an post higher,
# like number of views
# candidates can contain a lot of stuff now, show most relevant only
sorted_result = Post.objects.filter_by(post_id__in=candidates.list(20))

Categories

Resources