Django-haystack: How to enable highlighting in my setup? - python

I'm currently adding search functionality to my Django application using django-haystack v2.0.0-beta and Whoosh as the back end. Creating the index and returning the search results works fine so far. Now I want to enable the highlighting feature, but I can't get it to work.
I'm using a highly customized setup for which the haystack documentation is not a great help. My Django application is a pure AJAX application, i.e., all requests between client and server are handled asynchronously using jQuery and $.ajax(). That's why I have written a custom Django view that builds the Haystack SearchQuerySet manually and dumps the search results into a JSON object. All of this works fine, but the highlighting does not. Here is the code I have so far:
search_indexes.py
class CrawledWebpageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return CrawledWebpage  # This is my Django model
forms.py
class HaystackSearchForm(forms.Form):
    q = forms.CharField(
        max_length=100,
        label='Enter your search query')
views.py (I adapted some code from this post as it looked reasonable to me, but it's probably wrong)
def return_search_results_ajax(request):
    haystack_search_form = HaystackSearchForm(request.POST)
    response = {}
    if haystack_search_form.is_valid():
        search_query = haystack_search_form.cleaned_data['q']
        sqs = SearchQuerySet().filter(content=search_query)
        highlighted_search_form = HighlightedSearchForm(request.POST, searchqueryset=sqs, load_all=True)
        search_results = highlighted_search_form.search()
        # Here I extract those fields of my model that should be displayed as results
        webpage_urls = [result.object.url for result in search_results[:10]]
        response['webpage_urls'] = webpage_urls
    return HttpResponse(json.dumps(response), mimetype='application/json')
This code works in the sense that the search results are returned properly. But when I try to access the highlighted text snippet for a search result, for example the first one:
print search_results[0].highlighted
Then I always get an empty string as the result: {'text': ['']}
Can anyone help me to get the highlighting feature working? Thank you very much in advance.

It looks like this is possibly a Haystack bug that has gone unresolved for a long time: http://github.com/toastdriven/django-haystack/issues/310
http://github.com/toastdriven/django-haystack/issues/273
http://github.com/toastdriven/django-haystack/issues/582
As an alternative, you could use Haystack's highlighting functionality instead of Whoosh's to highlight the results yourself. For example, once you get your search results in sqs, you could do
from haystack.utils import Highlighter
highlighter = Highlighter(search_query)
print highlighter.highlight(sqs[0].text)
to get the highlighted text of the first result. See http://django-haystack.readthedocs.org/en/latest/highlighting.html for the documentation.
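Applied to the AJAX view from the question, that could look roughly like this. This is an untested sketch: it assumes the indexed text field is stored (Haystack's Whoosh backend stores fields by default), it reuses HaystackSearchForm from the question's forms.py, and the 'highlighted_snippets' key is just a name made up for the JSON payload:
import json

from django.http import HttpResponse
from haystack.query import SearchQuerySet
from haystack.utils import Highlighter

def return_search_results_ajax(request):
    haystack_search_form = HaystackSearchForm(request.POST)
    response = {}
    if haystack_search_form.is_valid():
        search_query = haystack_search_form.cleaned_data['q']
        sqs = SearchQuerySet().filter(content=search_query)
        highlighter = Highlighter(search_query)
        results = sqs[:10]
        response['webpage_urls'] = [result.object.url for result in results]
        # build the snippets in Python instead of relying on the search backend
        response['highlighted_snippets'] = [
            highlighter.highlight(result.text) for result in results
        ]
    return HttpResponse(json.dumps(response), mimetype='application/json')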

I'm not familiar with Haystack but could it be because you're using HaystackSearchForm in one place and HighlightedSearchForm in another?

Related

Elastic search DSL python issue

I have been using the ElasticSearch DSL python package to query my elastic search database. The querying method is very intuitive but I'm having issues retrieving the documents. This is what I have tried:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=[{"host": 'xyz', "port": 9200}], timeout=400)
s = Search(using=es, index="xyz-*").query("match_all")
response = s.execute()
for hit in response:
    print hit.title
The error I get:
AttributeError: 'Hit' object has no attribute 'title'
I googled the error and found another SO question: How to access the response object using elasticsearch DSL for python
The solution mentions:
for hit in response:
    print hit.doc.firstColumnName
Unfortunately, I had the same issue again with 'doc'. I was wondering what the correct way to access my document was?
Any help would really be appreciated!
I've run into the same issue and found different versions of this, but it seems to depend on the version of the elasticsearch-dsl library you're using. You might explore the response object and its sub-objects. For instance, using version 5.3.0, I see the expected data using the loops below.
for hit in RESPONSE.hits._l_:
    print(hit)
or
for hit in RESPONSE.hits.hits:
    print(hit)
NOTE: these are limited to 10 data elements, which is Elasticsearch's default result size.
print(len(RESPONSE.hits.hits))
10
print(len(RESPONSE.hits._l_))
10
This doesn't match the amount of overall hits if I print the number of hits using print('Total {} hits found.\n'.format(RESPONSE.hits.total))
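If the 10-element cap is the problem, one workaround (not part of the original answer, but based on elasticsearch-dsl's documented slicing and scan() helpers) is to ask for more hits explicitly. A rough sketch, reusing the es client and index pattern from the question:
from elasticsearch_dsl import Search

s = Search(using=es, index="xyz-*").query("match_all")  # es is the client from the question
response = s[0:100].execute()   # slicing sets the request size, here up to 100 hits
for hit in s.scan():            # or stream every matching document regardless of page size
    print(hit.to_dict())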
Good luck!
From version 6 onwards the response no longer returns your populated Document class; your fields come back as an AttrDict, which is basically a dictionary.
To solve this you need to have a Document class representing the document you want to parse. Then you need to parse the hit dictionary with your document class using the .from_es() method.
Like I answered here.
https://stackoverflow.com/a/64169419/5144029
Also have a look at the Document class here
https://elasticsearch-dsl.readthedocs.io/en/7.3.0/persistence.html
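A minimal sketch of that approach (the BlogPost class and its fields are invented for illustration and should be adjusted to your actual mapping; es is the client from the question):
from elasticsearch_dsl import Document, Date, Search, Text

class BlogPost(Document):              # hypothetical mapping for the index
    title = Text()
    published = Date()

    class Index:
        name = 'xyz-*'

response = Search(using=es, index='xyz-*').query('match_all').execute()
for raw_hit in response.hits.hits:                  # raw hits, not BlogPost instances
    post = BlogPost.from_es(raw_hit.to_dict())      # parse each hit with the Document class
    print(post.title)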

Haystack + Xapian: Can't get autocomplete functionality working

I'm trying to get autocomplete working on my server for search. Here is an example of one of my indexer classes:
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    artist_name = indexes.CharField(model_attr='clean_artist_name', null=True)
    submitted_date = indexes.DateTimeField(model_attr='submitted_date')
    total_count = indexes.IntegerField(model_attr='total_count')
    # This is used for autocomplete
    content_auto = indexes.NgramField(use_template=True)

    def get_model(self):
        return Artist

    def index_queryset(self, using=None):
        """ Used when the entire index of a model is updated. """
        return self.get_model().objects.filter(date_submitted__lte=datetime.now())

    def get_updated_field(self):
        return "last_data_change"
The text and content_auto fields are populated using templates which, in the case of Artists, is just the artist name. According to the docs, something like this should work for autocomplete:
objResultSet = SearchQuerySet().models(Artist).autocomplete(content_auto=search_term)
However, trying this with the string "bill w" returns Bill Stephney as the top result and then Bill Withers as the second result. This is because Bill Stephney has more records in the database, but Stephney shouldn't be matching this query: once the "w" is detected it should only match Bill Withers (and other Bill Ws). I've also tried wildcards:
objResultSet = SearchQuerySet().models(Artist).filter(content_auto=search_term + '*')
and
objResultSet = SearchQuerySet().models(Artist).filter(text=AutoQuery(search_term + '*'))
but the wildcard seems to cause a load of problems: the development server hangs and eventually stops with a Write Failed: Broken Pipe error and a cryptic stack trace, all of it inside the Python framework. Has anyone managed to get this working properly? Is NgramField the right field type to use? I've tried EdgeNgramField, but that gave me similar results.
I believe the Haystack documentation recommends EdgeNgramField for "standard text," which I assume is English. They recommend NgramField for Asian languages or if you want to match across word boundaries. I.e., I think you want your content_auto to use EdgeNgramField:
content_auto = indexes.EdgeNgramField(use_template=True)
Also, since n-grams are not exactly wildcard searches (in the way we use * [the asterisk] in shell script glob matches, for example), you should not use * in your filter.
One thing I have found that makes a difference in the search results is the set of parameters you can tweak in the backend engine -- there are settings for the n-gram tokenizer and n-gram filter. Depending on the search engine backend you're using, changing the min_gram values will affect the results you get in your matches.
I've only used the Elasticsearch backend, so I don't know if other backends are as sensitive to these n-gram settings as the Solr/Elasticsearch ones. Basically, I created a custom backend based on the default one that comes with Haystack and tweaked the min_gram values to test the matches. The higher the value you set, the more "accurate" the match is, since it has to match a longer token.
See this question on using a backend with custom n-gram settings for elasticsearch:
EdgeNgramField min and max letters in django haystack
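For reference, a sketch of such a custom backend for the Elasticsearch engine. The class names here are invented, and the exact DEFAULT_SETTINGS keys can differ between Haystack versions, so treat it as an outline rather than a drop-in implementation:
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchEngine,
)

class ConfigurableElasticBackend(ElasticsearchSearchBackend):
    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
            connection_alias, **connection_options)
        # require at least 2 characters before an edge n-gram token is produced
        ngram_filters = self.DEFAULT_SETTINGS['settings']['analysis']['filter']
        ngram_filters['haystack_edgengram']['min_gram'] = 2

class ConfigurableElasticEngine(ElasticsearchSearchEngine):
    backend = ConfigurableElasticBackend
Then point HAYSTACK_CONNECTIONS['default']['ENGINE'] at ConfigurableElasticEngine in settings.py and rebuild the index to see how the new min_gram value affects the matches.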

wikitools, wikipedia and python

Does anybody have experience getting a Wikipedia page using wikitools for Python (and Django)? I am trying to get the article, but I only get the first few lines and that's it. I need to fetch the whole article and I can't seem to figure it out. The documentation is not very helpful either. My code is:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php?title=Some_Title&action=raw&maxlag=-1")
wikipage = page.Page(wikiobj, url, section='content')
wikidata = wikipage.getWikiText(True).decode('utf-8', 'replace')
Any help will be appreciated.
I'm using wikitools in my project, not for getting the text on a page, but I initialize the wiki object in a different way:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")
wikipage = page.Page(wikiobj, title="Some_Title")
You don't need to supply any query string after api.php in the Wiki class.
Next, look at the definition of Page class:
__init__(self, site, title=False, check=True, followRedir=True, section=False, sectionnumber=False, pageid=False, namespace=False)
So you need to supply the title to the constructor of the Page class (you supplied a url, which is not one of its parameters).
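Putting the two changes together, something like this should fetch the whole article (untested; getWikiText() is the same call used in the question):
from wikitools import wiki, page

wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")   # plain API endpoint, no query string
wikipage = page.Page(wikiobj, title="Some_Title")          # pass the title, not a URL
wikidata = wikipage.getWikiText()                          # full wikitext of the article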

whoosh MultifieldParser field search or query parser concatenation

I'm trying to use Whoosh to add search functionality to my blog app on App Engine, but I don't understand some of it.
The blog entries are indexed with title, content and status fields.
I would like to have different types of results on the public page than on the admin page, but without needing multiple indexes.
On the front page I want visitors to be able to search only visible entries, on the title and content fields, while in the admin I also want to search draft entries.
Can I concatenate searches using QueryParser so I can search on multiple fields?
How could I filter on status:visible with MultifieldParser?
EDIT
I didn't test it yet, but I got an answer on the whoosh mailing list:
# Create a parser that will search in title and content
qp = qparser.MultifieldParser(["title", "content"], ix.schema)
# Parse the user query
q = qp.parse(user_query_string)
# If request is not admin, filter on status:visible
filterq = query.Term("status", u"visible") if not is_admin else None
# Get search results
results = searcher.search(q, filter=filterq)
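For completeness, a rough sketch of how that snippet could be wired up (assuming the index lives in an "indexdir" directory and has the title, content and status fields from the question; user_query_string and is_admin come from the request handler):
from whoosh import index, qparser, query

ix = index.open_dir("indexdir")
qp = qparser.MultifieldParser(["title", "content"], ix.schema)
q = qp.parse(user_query_string)
# non-admins only see entries whose status field is "visible"
filterq = query.Term("status", u"visible") if not is_admin else None

with ix.searcher() as searcher:
    results = searcher.search(q, filter=filterq)
    for hit in results:
        print(hit["title"])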
I know this is not strictly an answer, but Google added a full text search API similar to Whoosh. Perhaps you should try it.
https://developers.google.com/appengine/docs/python/search/overview
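For what it's worth, a small sketch of that API (the index name and field names are made up; this is the App Engine Python search service from that era):
from google.appengine.api import search

index = search.Index(name='blog_posts')            # hypothetical index name
index.put(search.Document(
    doc_id='post-1',
    fields=[search.TextField(name='title', value='Hello'),
            search.TextField(name='content', value='My first post')]))

results = index.search('title:hello')
for doc in results:
    print(doc.doc_id)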

How to implement full text search in Django?

I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                Q(title__icontains=string) |
                Q(text__icontains=string) |
                Q(tags__name__exact=string)
            )
        return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want to combine each searched word with a logical AND but with a logical OR instead? How would I do that? Is there a way to do that with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full text search like this or would you recommend using a search engine like Solr, Whoosh or Xapian. What are their benefits?
I suggest you adopt a search engine.
We've used Haystack search, a modular search application for Django supporting many search engines (Solr, Xapian, Whoosh, etc.)
Advantages:
Faster
Performs search queries without even querying the database
Highlights searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc.
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
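To answer the OR-across-words part directly: you can build one combined Q expression and pass it to a single filter() call. A sketch using the same model and fields as the question (the form variable is the validated SearchForm from the question's view):
import operator
from functools import reduce

from django.db.models import Q

terms = form.cleaned_data['query'].split()
per_term = [
    Q(title__icontains=term) | Q(text__icontains=term) | Q(tags__name__exact=term)
    for term in terms
]
# join the per-term expressions with OR instead of chaining filter() calls (which ANDs them)
posts = Post.objects.filter(reduce(operator.or_, per_term)).distinct()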
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full text search engine (or whatever you call it), the text (the words) is indexed every time you insert new records. So queries will be a lot faster, especially as your database grows.
SOLR is very easy to setup and integrate with Django. Haystack makes it even simpler.
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google the changed pages and Google will do all the hard work (indexing, parsing the queries, etc). On top of that, most people are used to using Google for search, plus it will keep your site current in global Google searches, too.
I think full text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into building a custom full text search rather than installing an application to perform the search for you. An application would add another dependency, maintenance and extra effort when storing data. By building the search yourself, you can also add nice custom features. For example, if your text exactly matches one title, you can direct the user to that page instead of showing the results. Another would be to allow title: or author: prefixes to keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with the heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated_weight = 0
            for item, weight in self.data.iteritems():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight

def search(searchtext):
    candidates = WeightedGroup()
    for arg in shlex.split(searchtext):  # shlex understands quotes
        # Search TITLE
        # order by date so we get the most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')
        arg_hits = query.count()  # count is cheap
        if arg_hits > 1000:
            continue  # skip keywords which have too many hits
        # Each of these is expensive as it transfers data
        # from the db and builds a python object,
        for post in query[:50]:  # so we limit it to 50 for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)
    # TODO add searches for other areas
    # Weight might also be adjusted with the number of hits within the text
    # or perhaps you can find other metrics to value a post higher,
    # like number of views
    # candidates can contain a lot of stuff now, show most relevant only
    sorted_result = Post.objects.filter(post_id__in=candidates.list(20))
    return sorted_result
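With the return at the end of search(), a view could then call it directly, for example:
posts = search(request.GET.get('q', ''))  # the 20 most relevant posts for the query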
