I'm trying to use Whoosh to add search functionality to my blog app on App Engine, but there are a few things I don't understand.
The blog entries are indexed with title, content and status fields.
I would like the public page and the admin page to return different kinds of results, without having to maintain multiple indexes.
On the front page I want visitors to search only visible entries, on the title and content fields; in the admin I also want to search draft entries.
Can I combine searches using QueryParser so that I can search on multiple fields?
How can I filter on status:visible with MultifieldParser?
EDIT
I haven't tested it yet, but I got an answer on the Whoosh mailing list:
from whoosh import qparser, query

# Create a parser that will search in title and content
qp = qparser.MultifieldParser(["title", "content"], ix.schema)
# Parse the user query
q = qp.parse(user_query_string)
# If request is not admin, filter on status:visible
filterq = query.Term("status", u"visible") if not is_admin else None
# Get search results
results = searcher.search(q, filter=filterq)
I know this is not strictly an answer, but Google has added a full-text Search API similar to Whoosh. Perhaps you should try it.
https://developers.google.com/appengine/docs/python/search/overview
I am struggling to understand how to create a link in Django's templates.
In my views.py file I have a list of dicts (one per row, so effectively a table). One of the fields may or may not contain links.
Views.py
# tableToView is a pd.DataFrame()
actualTable = []
for i in range(tableToView.shape[0]):
    temp = tableToView.iloc[i]
    actualTable.append(dict(temp))
# Return 'actualTable' (one dict per row) to be printed in table format in the template
return render(response, "manual/manualinputs.html", {'defaultDates':defaultDates, 'prevFilterIn':prevFilterIn, 'actualTable':actualTable, 'DbFailure':DbFailure})
So imagine my actualTable has a field 'News' that may or may not contain a link, e.g.:
'hello there, go to https://www.google.com/'
How do I get the template to print the same string, but with the address rendered as an actual link?
I will also extend this to things other than addresses (say, Twitter hashtags, by parsing the text).
Should I do this in the view or in the template?
I did try to go the views way:
I replace
'hello there, go to https://www.google.com/'
with
'hello there, go to <a href="https://www.google.com/">https://www.google.com/</a>'
But I get the actual tags printed out, which is not what I want...
Any ideas?
If you change it in the view, you need to print it with |safe in the template.
So it would be
{{table.column|safe}}; this way your link will be rendered as a link and not as a plain string.
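Since |safe switches off Django's auto-escaping for that value, the view has to escape everything that is not meant to be markup. Here is a minimal sketch of that view-side substitution using only the standard library; linkify and URL_PATTERN are hypothetical names, and the URL pattern is deliberately naive (it would swallow trailing punctuation, for example):

```python
import html
import re

# Hypothetical helper, not part of Django: matches http(s) URLs up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def linkify(text):
    """Return HTML where URLs become anchors and all other text is escaped."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        # Escape the plain text that precedes the URL.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append('<a href="{0}">{0}</a>'.format(url))
        last = match.end()
    # Escape whatever follows the last URL.
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify("hello there, go to https://www.google.com/"))
```

The result of such a helper is what you would pass through |safe. For plain URLs, Django also ships a built-in urlize template filter, which may be enough without touching the view; a custom helper like this is mostly useful once you extend it to hashtags and similar.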
I am trying to create a search page where buttons can be clicked to filter the posts, as on the Splice Sounds page (I think you need an account to view it, so I'll add screenshots).
To do this I think I need to pass a list so that I can filter by it, but I can't find a way to do this.
Having a GET form for each genre (created in a for loop) would let me filter by one genre at a time, but I want to filter by multiple genres at once, so that won't work.
The site I linked to passes the genres/tags in the URL, so how could I do this in Django?
Also: I could make separate URL paths and link to those, but then I would have to do this for every combination of genres/tags, which would be too many, so I can't do that.
The site passes tags through the URL like this: https://splice.com/sounds/search?sound_type=sample&tag=drums,kicks
Here is some relevant code:
This is how I want to filter, which is why I need to pass a list of args:
for arg in args:
    Posts = Posts.filter(genres=arg)
urls
urlpatterns = [
    path('', views.find, name='find'),
    path('searchgenres=<genres_selected>', views.find_search, name='find_search'),
]
EDIT: I have tried this in many ways, such as using AJAX, but I couldn't get that to work well.
EDIT 2: I have changed the question to How To Pass Only Selected Arguments Through URL
To pass a list into a request you could:
Use html checkboxes and in views aggregate them into a list
Use a single textbox and parse in views
If you obtain the selected values as a list, you could use Post.objects.filter(genre__in=genres).
It might also be helpful to know that Django allows for complex lookups with Q objects (from django.db.models import Q). The | character represents OR, which allows complex filtering. For instance:
Posts.objects.filter(Q(genre='Pop') | Q(genre='Rock') | Q(genre='Jazz'))
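To make the "parse in views" option concrete, here is a small standard-library sketch that turns a Splice-style URL into a list of tags. The selected_tags name and its param argument are made up for illustration; the function accepts both the comma-separated form and a repeated parameter:

```python
from urllib.parse import parse_qs, urlsplit


def selected_tags(url, param="tag"):
    """Collect the values of `param`, accepting both ?tag=a,b and ?tag=a&tag=b."""
    query = parse_qs(urlsplit(url).query)
    tags = []
    for value in query.get(param, []):
        # parse_qs does not split on commas, so do that here and drop empty pieces.
        tags.extend(part.strip() for part in value.split(",") if part.strip())
    return tags


print(selected_tags("https://splice.com/sounds/search?sound_type=sample&tag=drums,kicks"))
```

Inside a Django view the raw values would normally come from request.GET.getlist('tag') rather than from parsing the URL by hand. Note that a genre__in lookup keeps posts matching any of the tags, whereas chaining .filter() once per tag, as in the question, keeps only posts matching all of them.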
I'm currently adding search functionality to my Django application using django-haystack v2.0.0-beta with Whoosh as the back end. Creating the index and returning the search results works fine so far. Now I want to enable the highlighting feature, but I can't get it to work.
I'm using a highly customized setup for which the haystack documentation is not a great help. My Django application is a pure AJAX application, i.e., all requests between client and server are handled asynchronously by using jQuery and $.ajax(). That's why I have written a custom Django view that creates the haystack search queryset manually and dumps the search results into a JSON object. All of this works fine, but the addition of highlighting does not work. Here is my code that I have so far:
search_indexes.py
class CrawledWebpageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return CrawledWebpage  # This is my Django model
forms.py
class HaystackSearchForm(forms.Form):
    q = forms.CharField(
        max_length=100,
        label='Enter your search query')
views.py (I adopted some code from this post as it looked reasonable to me but it's probably wrong)
def return_search_results_ajax(request):
    haystack_search_form = HaystackSearchForm(request.POST)
    response = {}
    if haystack_search_form.is_valid():
        search_query = haystack_search_form.cleaned_data['q']
        sqs = SearchQuerySet().filter(content=search_query)
        highlighted_search_form = HighlightedSearchForm(request.POST, searchqueryset=sqs, load_all=True)
        search_results = highlighted_search_form.search()
        # Here I extract those fields of my model that should be displayed as results
        webpage_urls = [result.object.url for result in search_results[:10]]
        response['webpage_urls'] = webpage_urls
    return HttpResponse(json.dumps(response), mimetype='application/json')
This code works fine as far as the search results are returned properly. But when I try to access the highlighted text snippet for a search result, for example for the first one:
print search_results[0].highlighted
Then I always get an empty string as the result: {'text': ['']}
Can anyone help me to get the highlighting feature working? Thank you very much in advance.
It looks like this is possibly a Haystack bug that has gone unresolved for a long time: http://github.com/toastdriven/django-haystack/issues/310
http://github.com/toastdriven/django-haystack/issues/273
http://github.com/toastdriven/django-haystack/issues/582
As an alternative, you could use Haystack's highlighting functionality instead of Whoosh's to highlight the results yourself. For example, once you get your search results in sqs, you could do
from haystack.utils import Highlighter
highlighter = Highlighter(search_query)
print highlighter.highlight(sqs[0].text)
to get the highlighted text of the first result. See http://django-haystack.readthedocs.org/en/latest/highlighting.html for the documentation.
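Because the results in the question are serialized to JSON anyway, the highlighting can also be done with a few lines of plain Python before dumping the response. The sketch below is not Haystack's Highlighter and does not reproduce its excerpt trimming; the highlight function and its css_class parameter are hypothetical names chosen for illustration:

```python
import html
import re


def highlight(text, query, css_class="highlighted"):
    """Wrap each case-insensitive occurrence of a query word in a <span>."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return html.escape(text)
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    out = []
    for index, piece in enumerate(pattern.split(text)):
        escaped = html.escape(piece)
        # With one capturing group, re.split puts the matches at odd indexes.
        if index % 2 == 1:
            out.append('<span class="{}">{}</span>'.format(css_class, escaped))
        else:
            out.append(escaped)
    return "".join(out)


print(highlight("Whoosh is a search library", "search"))
```

The returned string could be stored next to each URL in the response dictionary and inserted into the page on the client side.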
I'm not familiar with Haystack but could it be because you're using HaystackSearchForm in one place and HighlightedSearchForm in another?
I'm trying to parse an HTML form using mechanize. The form has an arbitrary number of hidden fields, and the field names and IDs are randomly generated, so I have no obvious way to select them directly. Clearly using a name or ID is out, and because the number of hidden fields varies I cannot select the fields by sequence number either, since that always changes too.
However, there are always two TextControl fields right after each other, and below them a TextareaControl. These are the 3 fields I need access to; basically I need to parse their names and all is well. I've been looking through the mechanize documentation for the past couple of hours and haven't come up with anything that seems able to do this, however simple it should be (to me anyway).
I have come up with an alternate solution that involves making a list of the form controls, iterating through it to find the controls that contain the string 'Text' returning a new list of those, and then finally stripping out the name using a regular expression. While this works it seems unnecessary and I'm wondering if there's a more elegant solution. Thanks guys.
edit: Here's what I'm currently doing to extract that info if anyone's curious. I think I'm probably just going to stick with this. It seems unnecessary but it gets the job done and it's nothing intensive so I'm not worried about efficiency or anything.
def formtextFieldParse(browser):
    '''Expects a mechanize.Browser object with a form already selected. Parses
    through the fields returning a tuple of the name of those fields. There
    SHOULD only be 3 fields. 2 text followed by 1 textarea corresponding to
    Posting Title, Specific Location, and Posting Description'''
    import re
    # Capture the parenthesised part of each printed control
    pattern = r'\(.*\)'
    fields = str(browser).split('\n')
    textfields = []
    for field in fields:
        if 'Text' in field:
            textfields.append(field)
    # [1:-2] drops the leading "(" and the last two characters,
    # which assumes the match looks like "(name=)" with an empty value
    titleFieldName = re.findall(pattern, textfields[0])[0][1:-2]
    locationFieldName = re.findall(pattern, textfields[1])[0][1:-2]
    descriptionFieldName = re.findall(pattern, textfields[2])[0][1:-2]
    # Return the tuple promised by the docstring
    return titleFieldName, locationFieldName, descriptionFieldName
I don't think mechanize has the exact functionality you require; could you use mechanize to get the HTML page, then parse the latter for example with BeautifulSoup?
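As a rough sketch of the "parse the HTML yourself" route, the following uses the standard library's html.parser in place of BeautifulSoup. The class name, the function name and the sample form are all invented for illustration; the page HTML would come from whatever mechanize returned for the request:

```python
from html.parser import HTMLParser


class TextFieldCollector(HTMLParser):
    """Collect the names of text inputs and textareas in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # An <input> without a type attribute defaults to a text field.
        is_text_input = tag == "input" and (attributes.get("type") or "text").lower() == "text"
        if (is_text_input or tag == "textarea") and attributes.get("name"):
            self.names.append(attributes["name"])


def text_field_names(page_html):
    collector = TextFieldCollector()
    collector.feed(page_html)
    return collector.names


# Made-up form with random-looking names, mimicking the situation in the question.
sample = """
<form>
  <input type="hidden" name="x91f" value="1">
  <input type="hidden" name="a77c" value="2">
  <input type="text" name="q1zz">
  <input type="text" name="p8kk">
  <textarea name="d3mm"></textarea>
</form>
"""
print(text_field_names(sample))
```

Hidden fields are skipped regardless of how many there are, so the three names come back in page order and can then be used to fill in the form.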
I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                Q(title__icontains=string) |
                Q(text__icontains=string) |
                Q(tags__name__exact=string)
            )
        return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want to combine the searched words with a logical AND but with a logical OR? How would I do that? Is there a way to do it with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full-text search like this, or would you recommend using a search engine like Solr, Whoosh or Xapian? What are their benefits?
I suggest you adopt a search engine.
We've used Haystack search, a modular search application for django supporting many search engines (Solr, Xapian, Whoosh, etc...)
Advantages:
Faster
Performs search queries even without querying the database
Highlight searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc...
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND within each filter() call: you're using | to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
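To see where the AND in the question comes from, it helps to separate the two levels: the fields are combined with OR for each word, but calling filter() once per word narrows the result each time, which amounts to AND across words. Below is a database-free sketch of both behaviours on made-up data; the posts list and the require_all flag are assumptions for illustration only:

```python
# Made-up stand-in for the Post table.
posts = [
    {"title": "Whoosh on App Engine", "text": "Indexing blog entries"},
    {"title": "Django search", "text": "Filtering with Q objects"},
    {"title": "Gardening", "text": "Nothing about code"},
]


def matches(post, word):
    """One word matches if it occurs in the title OR the text (case-insensitive)."""
    word = word.lower()
    return word in post["title"].lower() or word in post["text"].lower()


def search(posts, query, require_all=True):
    """require_all=True mimics chained filter() calls (AND across words);
    require_all=False keeps a post if any single word matches (OR across words)."""
    words = query.split()
    combine = all if require_all else any
    return [post["title"] for post in posts if combine(matches(post, w) for w in words)]


print(search(posts, "django whoosh", require_all=True))
print(search(posts, "django whoosh", require_all=False))
```

In Django, the OR-across-words variant can be expressed without raw SQL by building one Q object per word, OR-ing them together (for example with functools.reduce and operator.or_), and passing the combined object to a single filter() call.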
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full-text search engine, the words in your text are indexed every time you insert new records, so queries will be a lot faster, especially as your database grows.
Solr is very easy to set up and integrate with Django. Haystack makes it even simpler.
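The speed difference comes from the data structure: an icontains query has to scan every row, while a search engine looks words up in an inverted index that was built at insert time. The toy version below is only meant to illustrate that idea; real engines add tokenization, stemming, ranking and on-disk storage, and the InvertedIndex class is a made-up name:

```python
from collections import defaultdict


class InvertedIndex:
    """Map each lowercased word to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing happens once, when the document is inserted.
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every query word."""
        words = query.lower().split()
        if not words:
            return set()
        # Each lookup is a dictionary access; no document text is scanned.
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = InvertedIndex()
index.add(1, "Whoosh is a pure Python search library")
index.add(2, "Solr runs as a Java servlet")
index.add(3, "Haystack connects Django to Solr or Whoosh")
print(sorted(index.search("whoosh")))
print(sorted(index.search("solr whoosh")))
```

The cost of a query depends on the number of query words and matching documents rather than on the total amount of text stored, which is why the gap widens as the database grows.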
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google which pages changed and Google will do all the hard work (indexing, parsing the queries, etc). On top of that, most people are used to searching with Google, and it will keep your site current in the global Google searches, too.
I think full-text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into writing a custom full-text search rather than installing an application to perform the search for you. An application would add another dependency, more maintenance, and extra effort when storing data. By writing the search yourself you can also build in nice custom features. For example, if the search text exactly matches one title, you can send the user straight to that page instead of showing the results. Another would be to allow title: or author: prefixes on keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with heaviest weight first;
        # max_len=0 means no limit
        res = sorted(self.data, key=self.data.get, reverse=True)
        return res[:max_len] if max_len else res

    def append(self, weight, item):
        # accumulate the weight of items that are hit more than once
        self.data[item] = self.data.get(item, 0) + weight

def search(searchtext):
    candidates = WeightedGroup()
    for arg in shlex.split(searchtext):  # shlex understands quotes
        # Search TITLE
        # order by date so we get most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')
        arg_hits = query.count()  # count is cheap
        if arg_hits > 1000:
            continue  # skip keywords which have too many hits
        # Each of these is expensive as it transfers data
        # from the db and builds a python object,
        for post in query[:50]:  # so we limit it to 50 for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)
        # TODO add searches for other areas
        # Weight might also be adjusted with number of hits within the text
        # or perhaps you can find other metrics to value a post higher,
        # like number of views
    # candidates can contain a lot of stuff now, show most relevant only
    # (note: an __in lookup does not preserve the ranking order)
    sorted_result = Post.objects.filter(post_id__in=candidates.list(20))
    return sorted_result
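The weighting idea can be tried without a database. The sketch below runs the same scheme over an in-memory dictionary of made-up titles, using collections.Counter in place of WeightedGroup; rank and titles are hypothetical names, and the per-term score of 100.0 / hits follows the method above:

```python
import shlex
from collections import Counter

# Made-up post titles keyed by post id.
titles = {
    1: "Full text search with Whoosh",
    2: "Search engines compared",
    3: "Whoosh and App Engine",
}


def rank(searchtext, titles, limit=20):
    """Return post ids ordered by accumulated weight, heaviest first."""
    scores = Counter()
    for term in shlex.split(searchtext):  # quoted phrases stay together
        hits = [pid for pid, title in titles.items() if term.lower() in title.lower()]
        for pid in hits:
            # Rarer terms contribute more weight per hit.
            scores[pid] += 100.0 / len(hits)
    return [pid for pid, _ in scores.most_common(limit)]


print(shlex.split('whoosh "app engine"'))
print(rank('search "app engine"', titles))
```

Here the quoted phrase matches a single title and therefore outweighs the more common word, so that post is ranked first.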