Django Haystack or Sphinx for simple search? - Python

I am creating a food search. I want to simply be able to type a food into a search box and have it return results. I also want to be able to give priority to certain terms so that they rank higher. For example, searching for "orange" would bring up the fruit first as opposed to the juice.
I haven't been able to determine the better search solution for this scenario in Django. Let me know which solution you would recommend.

I'm the current maintainer of Django-SphinxQL, an implementation for Sphinx in Django, and maintainer of the Xapian backend for Haystack.
I recommend using Haystack:
Haystack allows you to choose between different backends, supports most standard search features (e.g. highlighting), and has already stood the test of time as a search layer for Django.
Django-SphinxQL is in pre-alpha (other implementations such as Django-Sphinx have stalled), and only supports a minimal set of functionality.
The only reason I see to choose Sphinx over Haystack (e.g. using Django-SphinxQL) is if you have a specific use case where Sphinx is clearly superior to any Haystack backend.
For instance, Sphinx is known to be very fast at indexing with its plain index type, but that setup requires you to re-index everything whenever the database is updated. This is very convenient for me because I'm using it to index a database that only changes once a day.
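If you do go with Haystack, term weighting like your "orange" example is typically handled with boosting. A minimal sketch, assuming a hypothetical Food model with name and category fields (the field names and boost values are illustrative, and backend support for boosting varies):

from haystack import indexes
from myapp.models import Food  # hypothetical model

class FoodIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    # field-level boost: matches on the name weigh more than matches elsewhere
    name = indexes.CharField(model_attr='name', boost=1.5)

    def get_model(self):
        return Food

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

    def prepare(self, obj):
        # document-level boost: rank whole fruits above juices and the like
        data = super(FoodIndex, self).prepare(obj)
        if obj.category == 'fruit':
            data['boost'] = 1.2
        return data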

Related

Python: Gateway ORM for Models from third party REST APIs

For working with data from a database inside Python programs, we generally use Object Relational Mappers to translate database entries into Python objects we can work with, with SQLAlchemy and Django models probably being the most common and advanced ORMs.
Are there ORMs that do not connect to a database but to a third-party (JSON) REST API instead? I would like a framework that lets me deal with Python objects to perform CRUD operations against the API. It should have all the well-established standard functionality of an ORM, including Unit of Work and Lazy Loading. Actually, I would want my Python code to be agnostic about whether the model is stored in a database or fetched from a third-party API.
It is hard for me to imagine that such a thing does not yet exist, but I am not able to find it. Maybe I don't know the right words to search for?
ORM frameworks are frameworks that connect to databases. From your description, you are talking about the DAO (Data Access Object) pattern, not about an ORM framework. This is a common programming pattern in other languages such as Java.
The right words or searches would be:
Search for the DAO pattern, what to expect from it and how to code it.
Check a couple of links on examples of DAO patterns in Python, such as this one or this other one
Analyze your specific problem. You might not need all the code other solutions offer, and you might be better off writing your own class adjusted to your needs.
Remember KISS and DRY.
PS: Different languages use different paradigms; it is a common error to try to carry patterns and coding habits over from one language to another. So something that is solved one way in e.g. Java might not best be solved the same way in Python. Keep that in mind too.
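As a concrete illustration, a minimal DAO sketch against a hypothetical /users REST endpoint might look like this (the endpoint, fields, and error handling are all assumptions to adapt):

import requests  # third-party HTTP client

class UserDAO:
    """Maps CRUD operations on a REST resource to plain Python calls."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip('/')

    def get(self, user_id):
        resp = requests.get('%s/users/%s' % (self.base_url, user_id))
        resp.raise_for_status()
        return resp.json()

    def create(self, data):
        resp = requests.post('%s/users' % self.base_url, json=data)
        resp.raise_for_status()
        return resp.json()

    def update(self, user_id, data):
        resp = requests.put('%s/users/%s' % (self.base_url, user_id), json=data)
        resp.raise_for_status()
        return resp.json()

    def delete(self, user_id):
        resp = requests.delete('%s/users/%s' % (self.base_url, user_id))
        resp.raise_for_status()

Features like Unit of Work or Lazy Loading would sit on top of such a class, which is exactly why it is worth checking whether you actually need them before coding them.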

What are the current options for storing, indexing, and querying geospatial data on Google App Engine?

I've been 'away' from the GAE platform and community for a while, and recent new features look rather compelling, but I haven't been able to figure out what to do about geodata on GAE-Python. There are several open source libraries available:
geomodel
geodatastore
mutiny
...but they aren't being actively maintained and haven't been updated in quite a while, so I am left with several questions:
Do any of these libraries work with NDB? Is there something else I should try to use? What is the current best practice for geodata on GAE?
BTW, for my own project, I don't need to store anything other than points, and the sort of queries I need to make (at least initially) are 'X closest results to location Y' and 'all points within 1 mile of location Y'.
UPDATED: Based on comments, it looks like geomodel has been patched to work with NDB, and it seems that the new experimental Search API allows for the kinds of searches I need. However, that leads to a new Q: Will the Search API simply obsolete geomodel and similar libraries, or will they still have a use?
To amplify one of the comments above -- the Search API now supports Geosearch.
It can retrieve results within a given radius and sort them by distance, so it should work well for what you want to do.
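For reference, both of those queries map quite directly onto the (then experimental) Search API. A sketch using the google.appengine.api.search module, assuming documents carry a GeoField named 'location' (the index name and coordinates are illustrative):

from google.appengine.api import search

def points_near(lat, lng, radius_m=1609, limit=20):
    # 'X closest to Y' and 'all points within 1 mile of Y' in one query:
    # filter by distance, then sort ascending on the same expression
    dist = 'distance(location, geopoint(%f, %f))' % (lat, lng)
    query = search.Query(
        query_string='%s < %d' % (dist, radius_m),
        options=search.QueryOptions(
            limit=limit,
            sort_options=search.SortOptions(expressions=[
                search.SortExpression(
                    expression=dist,
                    direction=search.SortExpression.ASCENDING,
                    default_value=radius_m)])))
    return search.Index(name='points').search(query)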

Building a DSL query language

I'm working on a project (written in Django) which has only a few entities, but many rows for each entity.
In my application I have several static "reports", written directly in plain SQL. The users can also search the database via a generic filter form. Since the target audience is really tech-savvy, and at some point the filter no longer fits their needs, I am thinking about creating a query language for my database, like YQL or Jira's advanced search.
I found http://sourceforge.net/projects/littletable/ and http://www.quicksort.co.uk/DeeDoc.html, but it seems that they only operate on in-memory objects. Since the database can be too large to hold in memory, I would prefer that the query be translated into SQL (or, better, a Django query) before doing the actual work.
Are there any libraries or best practices for how to do this?
Writing such a DSL is actually surprisingly easy with PLY, and, what ho, there's already an example available for doing just what you want in Django. You see, Django has this fancy thing called a Q object which makes the Django querying side of things fairly easy.
At DjangoCon EU 2012, Matthieu Amiguet gave a session entitled Implementing Domain-specific Languages in Django Applications in which he went through the process, right down to implementing such a DSL as you desire. His slides, which include all you need, are available on his website. The final code (linked to from the last slide, anyway) is available at http://www.matthieuamiguet.ch/media/misc/djangocon2012/resources/compiler.html.
Reinout van Rees also produced some good comments on that session. (He normally does!) These cover a little of the missing context.
In the examples given, you can see something very similar to YQL and JQL:
groups__name="XXX" AND NOT groups__name="YYY"
(modified > 1/4/2011 OR NOT state__name="OK") AND groups__name="XXX"
It can also be tweaked very easily; for example, you might want to use groups.name rather than groups__name (I would). This modification could be made fairly trivially (allow . in the FIELD token, by modifying t_FIELD, and then replacing . with __ before constructing the Q object in p_expression_ID).
So, that satisfies simple querying; it also gives you a good starting point should you wish to make a more complex DSL.
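To make this concrete, here is a minimal sketch of the approach rather than Amiguet's actual code: a toy grammar supporting =, AND, OR, NOT and parentheses, with dotted field names mapped to the ORM's __ notation.

import ply.lex as lex
import ply.yacc as yacc
from django.db.models import Q

tokens = ('FIELD', 'VALUE', 'AND', 'OR', 'NOT', 'EQUALS', 'LPAREN', 'RPAREN')

t_EQUALS = r'='
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t'

def t_AND(t):
    r'AND'
    return t

def t_OR(t):
    r'OR'
    return t

def t_NOT(t):
    r'NOT'
    return t

def t_FIELD(t):
    r'[a-z_][a-z0-9_.]*'
    return t

def t_VALUE(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]  # strip the surrounding quotes
    return t

def t_error(t):
    raise ValueError('Illegal character %r' % t.value[0])

precedence = (
    ('left', 'OR'),
    ('left', 'AND'),
    ('right', 'NOT'),
)

def p_expression_binop(p):
    '''expression : expression AND expression
                  | expression OR expression'''
    p[0] = (p[1] & p[3]) if p[2] == 'AND' else (p[1] | p[3])

def p_expression_not(p):
    'expression : NOT expression'
    p[0] = ~p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_compare(p):
    'expression : FIELD EQUALS VALUE'
    # 'groups.name' becomes the ORM's 'groups__name'
    p[0] = Q(**{p[1].replace('.', '__'): p[3]})

def p_error(p):
    raise ValueError('Syntax error in query')

lexer = lex.lex()
parser = yacc.yacc()

def parse_query(query):
    # e.g. MyModel.objects.filter(parse_query('groups.name="XXX" AND NOT groups.name="YYY"'))
    return parser.parse(query, lexer=lexer)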
I've faced exactly this problem - a large database which needs searching. I made some static reports and several fancy filters using Django (very easy with Django), just like you have.
However, the power users were clamouring for more. I decided that there was already a DSL that they all knew - SQL. The question was how to make it secure enough.
So I used Django permissions to give the power users permission to save SQL queries into a new table. I then made a view for the not-quite-so-power users to run these queries, and made the queries take optional parameters. The queries were run using Python's lower-level DB-API, which Django uses under the hood for its ORM anyway.
The real trick was opening a read-only database connection to run these queries, just to make sure that no updates were ever run. I made the read-only connection by creating a different database user with lower permissions and opening a dedicated connection for that user in the view.
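In Django, one way to get that read-only connection is a second DATABASES alias backed by a restricted user; a sketch (user names, passwords, and engine are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'app_user',
        'PASSWORD': 'secret',
    },
    'readonly': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'report_reader',  # granted SELECT only in the database
        'PASSWORD': 'secret',
    },
}

# views.py (or wherever the stored queries are executed)
from django.db import connections

def run_stored_query(sql, params=None):
    # runs on the read-only alias, so writes fail at the database level
    with connections['readonly'].cursor() as cursor:
        cursor.execute(sql, params or [])
        return cursor.fetchall()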
TL;DR - SQL is the way to go!
Depending on the form of your data, the types of queries your users need to run, and how frequently your data is updated, an alternative to the pure SQL solution suggested by Nick Craig-Wood is to index your data in Solr and then run queries against it.
Solr is an added layer of complexity (configuration, data synchronization) but it is super-fast, can handle large datasets, and provides a (relatively) intuitive query language.
You could also write your own SQL-ish language using pyparsing; there is even a pretty verbose example you could extend.

Django i18n: is there a gettext alternative?

I'm looking for a way to translate my Django project. The built-in mechanism provided with Django is great, but it has several weak points which made me go looking for an alternative.
The project owner must be able to edit every translation, including English (the original). With gettext it is possible to edit translations with tools like Pootle, but the original strings stay hardcoded in source files or templates. There is no way the project owner can change them.
Possible solution is to make gettext translate some unique identifiers, and just translate them to all languages including English, like this:
_('form_submit_button')
But this makes tools like Pootle almost impossible for translators to use.
Question: are there any tools for Django project translation that could fit my needs?
If you use some message IDs, they would either be incomprehensible ("message_2215") or you'd be forced to synchronise the message IDs to the actual messages ("Please press any key" = "please_press_any_key" => "Any key to continue" = "any_key_to_continue"). Either way, real strings are better for the programmers and for the tools.
However, if you employ a separate proof-reader for your strings, you can do the following:
Create an English "translation" file (yes, this works)
Let your proof-reader "translate" from English to English using Pootle or any other tool
Make sure your programmers keep that translation file untranslated by updating the strings in code.
(optional) Create a way to deploy translations independently of your main code so you can fix a typo quickly.
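With stock Django tooling, that workflow amounts to maintaining an English catalogue alongside the real translations. A minimal sketch (paths and commands are the Django defaults; adjust to your layout):

# settings.py
LANGUAGE_CODE = 'en'
USE_I18N = True
LOCALE_PATHS = ('locale',)

# From the shell (not Python):
#   django-admin.py makemessages -l en    # extract strings into locale/en/LC_MESSAGES/django.po
#   django-admin.py compilemessages       # compile the proof-read catalogues
#
# In locale/en/LC_MESSAGES/django.po the proof-reader edits msgstr while
# msgid stays whatever is hardcoded in the source, e.g.:
#   msgid "Please press any key"
#   msgstr "Any key to continue"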
You may be able to use Pootle with the _("message_id") approach, depending on how easy Pootle is to customise (I don't know the internals so I can't say, but IIUC it uses Django where template changes are usually straightforward).
For example, Pootle's translation screens have "Original" and "Translation" sections; you could perhaps adapt the templates to show, under the "Original" section, a "Reference" section which displays some canonical translation using a specific reference language (e.g. English).
Or you may be able to use Pootle's alternative source language functionality, without needing to customise Pootle. You could store the canonical versions of the translations using an unused language code (or a made-up one).
Using identifiers is definitely possible with Gettext, and there are tools which support this. However, it might be unusual for some translators, as they are used to downloading only the .po file for offline translation, which does not work with monolingual translations.
For example, Weblate supports monolingual Gettext files just fine (I'm the author of this tool): https://docs.weblate.org/en/latest/formats.html#monolingual-gettext

Search functionality for Django

I'm developing a web app using Django, and I'll need to add search functionality soon. Search will be implemented for two models, one being an extension of the auth user class and another one with the fields name, tags, and description. So I guess there is nothing too scary here in the context of searching text.
For development I am using SQLite and, as no database-specific work has been done, I am at liberty to use any database in production. I'm thinking of choosing between PostgreSQL and MySQL.
I have gone through several posts on the Internet about search solutions; nevertheless, I'd like to get opinions for my simple case. Here are my questions:
is full-text search overkill in my case?
is it better to rely on the database's full-text search support? If so, which database should I use?
should I use an external search library, such as Whoosh, Sphinx, or Xapian? If so, which one?
EDIT:
tags is a TagField (from the django-tagging app) that sits on an m2m relationship. description is a field that holds HTML and has a max_length of 1024 bytes.
If that tags field means what I think it means, i.e. you plan to store a string which concatenates multiple tags for an item, then you might need full-text search on it... but it's a bad design; rather, you should have a many-to-many relationship between items and a tags table (via another table, ItemTag or something, with two foreign keys that are the primary keys of the items table and the tags table).
I can't tell whether you need full-text search on description, as I have no indication of what it is -- nor whether you need the reasonable but somewhat rudimentary full-text search that MySQL 5.1 and PostgreSQL 8.3 provide, or the more powerful one in e.g. Sphinx... maybe talk a bit more about the context of your app and why you're considering full-text search?
Edit: so it seems the only possible need for full-text search might be on description, and that looks like it's probably limited enough that either MySQL 5.1 or PostgreSQL 8.3 will serve it well. Me, I have a sweet spot for PostgreSQL (even though I'm reasonably expert at MySQL too), but that's a general preference, not specifically connected to full-text search issues. This blog post does provide one reason to prefer PostgreSQL: you can have full-text search and still be transactional, while in MySQL full-text indexing only works on MyISAM tables, not InnoDB [[except if you add Sphinx, of course]] (also see this follow-on for a bit more on full-text search in PostgreSQL and Lucene). Still, there are of course other considerations involved in picking a DB, and I don't think you'll be doing terribly with either (unless having to add Sphinx for full-text plus transactions is a big problem).
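A present-day footnote: since Django 1.10, the ORM exposes PostgreSQL's full-text search directly through django.contrib.postgres, so the transactional full-text option no longer requires raw SQL. A minimal sketch, assuming a hypothetical Item model with name and description fields:

from django.contrib.postgres.search import SearchQuery, SearchVector
from myapp.models import Item  # hypothetical model

# annotate each row with a search vector over both fields, then match
results = Item.objects.annotate(
    search=SearchVector('name', 'description'),
).filter(search=SearchQuery('orange'))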
Django has full text searching support in its QuerySet filters. Right now, if you only have two models that need searching, just make a view that searches the fields on both:
search_string = "+Django -jazz Python"
# __search delegates to the database's full-text index (at the time, MySQL-only)
first_models = FirstModel.objects.filter(headline__search=search_string)
second_models = SecondModel.objects.filter(headline__search=search_string)
You could further filter them to make sure the results are unique, if necessary.
Additionally, there is a regex filter that may be even better for dealing with your HTML fields and tags, since the regex can instruct the filter on exactly how to process any delimiters or markup.
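For instance (the field name and pattern are illustrative, and note that the regex dialect is your database's, not Python's):

# case-insensitive match on list items inside the stored HTML
matches = FirstModel.objects.filter(description__iregex=r'<li>[^<]*orange[^<]*</li>')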
Whether you need an external library depends on your needs. How much traffic are we talking about? The external libraries are generally better when it comes to performance, but as always there are advantages and disadvantages. I am using Sphinx with the django-sphinx plugin, and I would recommend it if you will be doing a lot of searching.
Haystack looks promising, and it supports Whoosh as a backend.
