Integrate extracted PDF content with django-haystack

I have extracted PDF/DOCX content with Solr, and I have succeeded in running some search queries against it using this Solr URL:
http://localhost:8983/solr/select?q=Lycee
I would like to run the same kind of query through django-haystack. I found this document, which discusses the issue:
https://github.com/toastdriven/django-haystack/blob/master/docs/rich_content_extraction.rst
But there is no "FileIndex" class in django-haystack (2.0.0-beta). How can I integrate this kind of search with django-haystack?

The "FileIndex" referenced in the documentation is a hypothetical subclass of haystack.indexes.SearchIndex. Here is an example:
from django.template import Context, loader
from haystack import indexes
from myapp.models import MyFile
class FileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    owner = indexes.CharField(model_attr='owner__name')
    def get_model(self):
        return MyFile
    def index_queryset(self, using=None):
        return self.get_model().objects.all()
    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)
        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()
        extracted_data = self.backend.extract_file_contents(file_obj)
        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating.
        # (Newer Django versions expect a plain dict here instead of a Context.)
        t = loader.select_template(('search/indexes/myapp/myfile_text.txt', ))
        data['text'] = t.render(Context({'object': obj,
                                         'extracted': extracted_data}))
        return data
So extracted_data would be replaced with whatever process you came up with to extract the PDF/DOCX content. You would then update your template to include that data.
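As a rough illustration of what the index template ends up doing with the extraction result, the sketch below combines a model field with the extracted text in plain Python. It assumes the dictionary shape described in the Haystack rich content extraction document (a 'contents' key and a 'metadata' key, or None when nothing could be extracted). The function build_document_text and its parameters are hypothetical names used only for this example, not part of Haystack.

```python
def build_document_text(title, extracted):
    """Join a title with extracted file contents into one searchable string.

    `extracted` is assumed to look like the result of the backend's
    extract_file_contents(): a dict with 'contents' and 'metadata' keys,
    or None when nothing could be extracted.
    """
    parts = [title]
    if extracted:
        # Strip surrounding whitespace and skip empty extractions.
        contents = (extracted.get('contents') or '').strip()
        if contents:
            parts.append(contents)
    return '\n'.join(parts)


if __name__ == '__main__':
    sample = {'contents': ' Lycee timetable 2013 \n', 'metadata': {}}
    print(build_document_text('Timetable', sample))
    print(build_document_text('Empty file', None))
```

In a real index the same combination would normally happen in the template (for example by printing the object's fields followed by the extracted contents), so this helper is only a way to see the data flow without a running Solr instance.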

Related

Django-Haystack not finding any field

I am trying to write a small search engine with django-haystack and Whoosh.
I adapted their tutorial, created my index from a JSON file, and queried it successfully with QueryParser. Now I am trying to use their view.
When I try to access the search URL at http://127.0.0.1:8000/my_search/ I get this error:
The index 'PaperIndex' must have one (and only one) SearchField with document=True.
If I remove search_indexes.py I can access the search page, but then of course it does not work, as it has no indexes to search.
From debugging, it seems that it does not pick up any fields, although it does see the class.
I tried several things but nothing worked.
My search_indexes.py:
from haystack import indexes
from my_search.models import Paper
class PaperIndex(indexes.SearchIndex, indexes.Indexable):
    """
    This is a search index
    """
    title = indexes.CharField(model_attr='title'),
    abstract = indexes.CharField(document=True,
                                 use_template=False, model_attr='abstract'),
    def get_model(self):
        return Paper
    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects  # .filter(pub_date__lte=datetime.datetime.now())
my models.py:
from django.db import models
class Paper(models.Model):
    paper_url = models.CharField(max_length=200),
    title = models.CharField(max_length=200),
    abstract = models.TextField()
    authors = models.CharField(max_length=200),
    date = models.DateTimeField(max_length=200),
    def __unicode__(self):
        return self.title
thanks!

Haystack uses a separate field for the document=True field.
The text = indexes.CharField(document=True) field does not exist on the model; Haystack fills it with the searchable text for each object.
You can populate it by defining a prepare_text() method on the index. Alternatively, use the template approach (use_template=True), which is simply a .txt file that lists model attributes in Django template syntax.
Also remove the trailing commas after the field declarations: a trailing comma turns the attribute into a one-element tuple, so Haystack does not see it as a field.
class PaperIndex(indexes.SearchIndex, indexes.Indexable):
    """
    This is a search index
    """
    text = indexes.CharField(document=True)
    title = indexes.CharField(model_attr='title')
    abstract = indexes.CharField(model_attr='abstract')
    def get_model(self):
        return Paper
    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.all()  # .filter(pub_date__lte=datetime.datetime.now())
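One detail worth checking in the index from the question: each field declaration there ends with a trailing comma. In Python that makes the attribute a one-element tuple rather than a field object, which would explain why the class is found but no fields are. The sketch below shows the effect with a toy metaclass that collects field instances; Field and CollectFields are stand-ins written for this example and are not Haystack's own classes.

```python
class Field:
    """Stand-in for a search field such as indexes.CharField."""

    def __init__(self, document=False):
        self.document = document


class CollectFields(type):
    """Toy metaclass that gathers Field instances declared on a class."""

    def __new__(mcs, name, bases, attrs):
        # Only attributes that really are Field objects are collected.
        attrs['fields'] = {
            key: value for key, value in attrs.items()
            if isinstance(value, Field)
        }
        return super().__new__(mcs, name, bases, attrs)


class BrokenIndex(metaclass=CollectFields):
    title = Field(),                  # trailing comma: this is a tuple
    abstract = Field(document=True),  # also a tuple, so it is skipped


class FixedIndex(metaclass=CollectFields):
    title = Field()
    abstract = Field(document=True)


print(type(BrokenIndex.title).__name__)  # tuple
print(sorted(BrokenIndex.fields))        # []
print(sorted(FixedIndex.fields))         # ['abstract', 'title']
```

With no collected fields there is also no field with document=True, which matches the error message quoted in the question. The same trailing commas appear in the question's models.py and have the same effect on the model fields.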

Django Haystack: How to index field from another class

I have a Django model Story, which I can index successfully using templates. However, there is another model, Reviews, with a static method that takes a Story object and returns its rating as an integer. How can I index Story on ratings as well?
{{ object.story_name }}
{{Reviews.ratings(object)}}
I tried to call this method in the template story_text.txt, but that results in an error:
django.template.exceptions.TemplateSyntaxError: Could not parse the remainder: '(object)'....
Edit:
I tried the line below in the template. It does not raise any error while building the index, but how can I now refer to this field when searching with SearchQuerySet?
Reviews.average_start_rating( {{object}} )

I am confused. I don't think you can use syntax like {{ Reviews.rating object }} with Django's template engine. If that is possible, it is news to me.
Why not pass what you want to show in the template through the context in the first place?
{{ object }} can be rendered because the context contains object. For example, if you use UpdateView (a class-based view), it puts object in the context automatically.
class Example(UpdateView):
    model = yourClass
    form_class = yourFormClass
    template_name = yourhtml
    success_url = your_success_url  # URL to redirect to after success
You can use {{ object }} in yourhtml.html because of UpdateView. The pk is supplied in the URLconf with a pattern like (?P<pk>[0-9]+).
Without UpdateView, you can do it like this:
class anotherExample(View):
    def get(self, request, *args, **kwargs):
        return render(request, 'yourhtml.html', {"object": Class.objects.get(id=self.kwargs['pk'])})
In a form view, you can use:
def get_context_data(self, **kwargs):
    context = super().get_context_data(**kwargs)
    context['object'] = Class.objects.get(id= ... )
    return context
My idea is to pass the story object, together with the review object that has a foreign key to that story, in the context.

I was able to get it working using Haystack's advanced data preparation (see "Advanced Data Preparation" in the Haystack docs).
With an additional field, you can write a prepare method for that field. The only remaining issue is that I can order results by this field but cannot search on it.
class StoryIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    ratings = indexes.FloatField()
    def prepare_ratings(self, obj):
        return Reviews.ratings(obj)
    def get_model(self):
        return Story

Instead of using a template for the text field, here you can use the prepare or prepare_FOO methods:
class StoryIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True)
    # text = indexes.CharField(document=True, use_template=True)
    # ratings = indexes.FloatField()
    def prepare_text(self, obj):
        return "\n".join(f"{col}" for col in [obj.story_name, Reviews.ratings(obj)])
    def get_model(self):
        return Story
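To make the prepare_FOO naming convention more concrete, here is a toy model of the dispatch in plain Python. It is a sketch of the idea only, not Haystack's implementation: ToyIndex, ToyStoryIndex, the Story stand-in, and the averaged scores are all hypothetical. For each field name the base class takes a default value and then lets a method called prepare_<name>, if one exists, override it.

```python
class ToyIndex:
    """Toy stand-in for a search index; not Haystack's implementation."""

    field_names = ('text', 'ratings')

    def prepare(self, obj):
        data = {}
        for name in self.field_names:
            # Default: read the attribute of the same name, if any.
            data[name] = getattr(obj, name, None)
            # A prepare_<name> method, when defined, overrides the default.
            hook = getattr(self, 'prepare_%s' % name, None)
            if hook is not None:
                data[name] = hook(obj)
        return data


class Story:
    """Plain object standing in for the Story model."""

    def __init__(self, story_name, scores):
        self.story_name = story_name
        self.scores = scores


class ToyStoryIndex(ToyIndex):
    def prepare_ratings(self, obj):
        # Hypothetical rating: the mean of the story's scores.
        return sum(obj.scores) / len(obj.scores)

    def prepare_text(self, obj):
        # Put the rating into the document text so it is searchable too.
        return '\n'.join(str(col) for col in [obj.story_name, self.prepare_ratings(obj)])


print(ToyStoryIndex().prepare(Story('Dune', [4, 5])))
```

The point of the sketch is that the rating can live in two places at once: as its own field (useful for ordering or filtering) and inside the document text (useful for free-text matching).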

mentions/internal links in Django

I have a bunch of models. All of them have a get_absolute_url method and a text field. I want to create internal links in the text field, just as Wikipedia does.
Wikipedia's internal links only refer to other pages. I need to link to all of my models.
I could define a pattern for internal links and replace it with a hardcoded URL, but that is not a good idea because the links can change. It would be best if I could refer to get_absolute_url.
Another option would be to use a template tag that turns a specific pattern into links.
How should this be done? Are there any open source projects in which this has already been done?

I wanted to solve this same problem just a few days ago, and I did it with a template filter. My links are relative URLs, not absolute, but you could tweak that pretty easily, and you could also tweak the regex pattern to match whatever link markup you prefer.
With the filter, the link is only looked up at display time, so if your view's URL has changed, the reverse() lookup picks up the new URL automatically.
I also use Markdown to process my description fields, so I make the filter return a Markdown-formatted link instead of HTML, but you could tweak that too. If you use Markdown, you'd want to put this filter first.
So to display a description TextField with internal links, the template would contain something like this:
{{ entity.description|internal_links|markdown }}
(See the Django docs on writing your own custom filters for more details on writing and registering filters.)
As for the specific filter itself, I did it like this:
import re
from django import template
from django.core.urlresolvers import reverse  # use django.urls on Django 2.0+
from django.shortcuts import get_object_or_404
from my.views import *  # expected to provide Entity
register = template.Library()
@register.filter
def internal_links(value):
    """
    Takes a markdown textfield, and filters
    for internal links in the format:
    {{film:alien-1979}}
    ...where "film" is the designation for a link type (model),
    and "alien-1979" is the slug for a given object
    NOTE: Process BEFORE markdown, as it will resolve
    to a markdown-formatted linked name:
    [Alien](http://opticalpodcast.com/cinedex/film/alien-1979/)
    :param value:
    :return:
    """
    try:
        pattern = r'{{\S+:\S+}}'
        p = re.compile(pattern)
        # Replace the captured pattern(s) with the markdown link
        return p.sub(localurl, value)
    except Exception:
        # If the link lookup fails, just display the original text
        return value
def localurl(match):
    string = match.group()
    # Strip off the {{ and }}
    string = string[2:-2]
    # Separate the link type and the slug
    link_type, link_slug = string.split(":")
    link_view = ''
    # Figure out what view we need to display
    # for the link type
    if link_type == 'film':
        link_view = 'film_detail'
    elif link_type == 'person':
        link_view = 'person_detail'
    else:
        raise Exception("Unknown link type.")
    link_url = reverse(link_view, args=(link_slug,))
    entity = get_object_or_404(Entity, slug=link_slug)
    markdown_link = "[" + entity.name + "](" + link_url + ")"
    return markdown_link
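The substitution step can be tried without Django. The sketch below keeps the same {{type:slug}} marker format but swaps the reverse() call and the database lookup for a hypothetical in-memory table (PAGES), so it only illustrates the regex callback technique; the names LINK_RE, PAGES, and to_markdown_link are made up for this example.

```python
import re

# Matches markers such as {{film:alien-1979}} and captures type and slug.
LINK_RE = re.compile(r'\{\{(\w+):([\w-]+)\}\}')

# Hypothetical lookup table; in the filter above this role is played by
# reverse() and a database query.
PAGES = {
    ('film', 'alien-1979'): ('Alien', '/cinedex/film/alien-1979/'),
}


def to_markdown_link(match):
    """Return a markdown link for a known marker, else the marker unchanged."""
    entry = PAGES.get((match.group(1), match.group(2)))
    if entry is None:
        return match.group(0)
    name, url = entry
    return '[%s](%s)' % (name, url)


def internal_links(text):
    """Replace every recognised marker in text with a markdown link."""
    return LINK_RE.sub(to_markdown_link, text)


print(internal_links('See {{film:alien-1979}} and {{film:unknown}}.'))
```

One design difference from the filter above: here an unknown marker is left in place while the other markers are still converted, whereas the try/except in the filter returns the whole text unchanged as soon as any single lookup fails. Which behaviour is preferable depends on how visible broken markers should be.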

Django Haystack - Indexing single field

I am using Django Haystack for search.
I only want to target the title field of my model when searching for results.
At present however, it returns results if the search term is in any of the fields in my model.
For example: searching xyz gives results where xyz is in the bio field.
This should not happen. I only want results where xyz is in the title field, ignoring every field other than Artist.title when searching.
artists/models.py :
class Artist(models.Model):
    title = models.CharField(max_length=255)
    slug = models.SlugField(max_length=100)
    strapline = models.CharField(max_length=255)
    image = models.ImageField(upload_to=get_file_path, storage=s3, max_length=500)
    bio = models.TextField()
artists/search_indexes.py
from haystack import indexes
from app.artists.models import Artist
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True, model_attr='title')
    def get_model(self):
        return Artist
I guess thinking of it like a SQL query:
SELECT * FROM artists WHERE title LIKE '%{search_term}%'
UPDATE
Following suggestion to remove use_template=True, my search_indexes.py now looks like:
from haystack import indexes
from app.artists.models import Artist
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    def get_model(self):
        return Artist
But I am having the same problem. (Have tried python manage.py rebuild_index)
This is my Haystack settings if that makes any difference:
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.simple_backend.SimpleEngine',
    },
}

model_attr and use_template don't work together. In this case, as you're querying for a single model attribute there's no need to use a template. Templates in search indexes are purely meant to group data.
Thus, you end up with:
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    def get_model(self):
        return Artist

If you don't have any other use case for your index (i.e. searches that should match terms elsewhere), simply don't use a template at all: set use_template to False and drop your search template, and you're done. Note that when use_template is True, the model_attr parameter is ignored. Also, you may not need a full-text search engine at all in that case; you could just use Django's standard QuerySet lookup API, i.e. Artist.objects.filter(title__icontains=searchterm).
Otherwise, if you still need a 'full' document index for other searches and only want to restrict this one search to the title, you can add another indexes.CharField (with document=False, model_attr='title') for the title and search only on that field. How to do so is documented in Haystack's SearchQuerySet API documentation.

From the Docs
Additionally, we’re providing use_template=True on the text field. This allows us to use a data template (rather than error prone concatenation) to build the document the search engine will use in searching. You’ll need to create a new template inside your template directory called search/indexes/myapp/note_text.txt and place the following inside:
{{ object.title }}
{{ object.user.get_full_name }}
{{ object.body }}
So I guess in this template you can declare which fields should be indexed and searched.
Another way is to override prepare(self, obj) on the index class and explicitly define the fields that need to be indexed and searched.
Or just use model_attr.

Basically, your search_indexes.py file is written incorrectly. It should look like this:
from haystack import indexes
from app.artists.models import Artist
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title', null=True)
    def get_model(self):
        return Artist
    def index_queryset(self, using=None):
        return self.get_model().objects.all()
Then you have to create a template in your app. The directory structure would be:
templates/search/indexes/artists/artist_text.txt
Add the following to the artist_text.txt file:
{{ object.title }}
Now run python manage.py rebuild_index.
After that, it will return results only for title.

GAE + NDB + Blobstore + Google High Performance Image Serving

I'm making an app to upload text and images. I've read a lot about Blobstore and Google High Performance Image Serving, and I finally found a way to use them together.
What I want to know is whether this is done well or could be done better, and also whether it is better to save the serving_url in the model or to compute it every time I want to show the images on the page.
There are only two models: User and Picture.
This is the code (summarized; ignore my custom.PageHandler, which only has functions to render pages easily, check form values, etc.):
class User(ndb.Model):
    """ A User """
    username = ndb.StringProperty(required=True)
    password = ndb.StringProperty(required=True)
    email = ndb.StringProperty(required=True)
class Picture(ndb.Model):
    """ All pictures that a User has uploaded """
    title = ndb.StringProperty(required=True)
    description = ndb.StringProperty(required=True)
    blobKey = ndb.BlobKeyProperty(required=True)
    servingUrl = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
    user = ndb.KeyProperty(kind=User)
# This class shows the user pics
class List(custom.PageHandler):
    def get(self):
        # Get the actual user pics
        pics = Picture.by_user(self.user.key)
        for pic in pics:
            pic.servingUrl = images.get_serving_url(pic.blobKey, size=90, crop=True)
        self.render_page("myPictures.htm", data=pics)
# Get and post for the send page
class Send(custom.PageHandler, blobstore_handlers.BlobstoreUploadHandler):
    def get(self):
        uploadUrl = blobstore.create_upload_url('/addPic')
        self.render_page("addPicture.htm", form_action=uploadUrl)
    def post(self):
        # Create a dictionary with the values, we will need in case of error
        templateValues = self.template_from_request()
        # Test if all data form is valid
        testErrors = check_fields(self)
        if testErrors[0]:
            # No errors, save the object
            try:
                # Get the file and upload it
                uploadFiles = self.get_uploads('picture')
                # Get the key returned from blobstore, for the first element
                blobInfo = uploadFiles[0]
                # Add the key to the template
                templateValues['blobKey'] = blobInfo.key()
                # Save all
                pic = Picture.save(self.user.key, **templateValues)
                if pic is None:
                    logging.error('Picture save error.')
                self.redirect("/myPics")
            except:
                self.render_page("customMessage.htm", custom_msg=_("Problems while uploading the picture."))
        else:
            # Errors, render the page again, with the values, and showing the errors
            templateValues = custom.prepare_errors(templateValues, testErrors[1])
            # The session for upload a file must be new every reload page
            templateValues['form_action'] = blobstore.create_upload_url('/addPic')
            self.render_page("addPicture.htm", **templateValues)
Basically, I list all the pics, showing each image in a jinja2 template with these lines:
{% for line in data %}
<tr>
<td class="col-center-data"><img src="{{ line.servingUrl }}"></td>
So in the List class I compute each serving URL and attach it temporarily to the model. I don't know whether it is a good idea to save it directly in the model, because I don't know whether the URL can change over time. Is the URL permanent for the image? If so, I can save it instead of computing it, right?
The Send class only shows a form to upload the image and saves the data to the model. I always generate a new form_action link when re-rendering the page, because the docs say to. Is that right?
The code works, but I want to know the best way to do it in terms of performance and resource use.

You're right. You do want to save the result of get_serving_url() instead of calling it repeatedly. It stays the same.
Note that there's also delete_serving_url() for when you're done with the URL, such as when you delete the blob.
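A minimal sketch of the "compute once, then reuse" idea, written in plain Python so it runs without the App Engine SDK. The Picture class here is a stand-in for the ndb model, ensure_serving_url is a hypothetical helper, and make_url is an injected callable standing in for images.get_serving_url; none of these names come from the SDK.

```python
class Picture:
    """Plain-Python stand-in for the ndb Picture model."""

    def __init__(self, blob_key, serving_url=None):
        self.blob_key = blob_key
        self.serving_url = serving_url


def ensure_serving_url(pic, make_url):
    """Return pic.serving_url, computing and storing it on first use.

    make_url stands in for images.get_serving_url and is injected so the
    sketch runs without the App Engine SDK.
    """
    if pic.serving_url is None:
        pic.serving_url = make_url(pic.blob_key)
    return pic.serving_url


calls = []


def fake_get_serving_url(blob_key):
    # Records each call so we can see how often the URL is generated.
    calls.append(blob_key)
    return 'https://example.invalid/img/%s' % blob_key


pic = Picture('abc123')
print(ensure_serving_url(pic, fake_get_serving_url))
print(ensure_serving_url(pic, fake_get_serving_url))
print(len(calls))
```

In the real handler the entity would also have to be saved after the first call (for example with put()) for the stored URL to persist between requests; alternatively the URL could be generated once in the upload handler, right after the blob key is known.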
