Considering my users can save data as "café" or "cafe", I need to be able to search on those fields with an accent-insensitive query.
I've found https://github.com/djcoin/django-unaccent/, but I have no idea whether it is possible to implement something similar in SQLAlchemy.
I'm using PostgreSQL, so a solution specific to this database works for me, but a generic solution would be even better.
Thanks for your help.
First install the unaccent extension in PostgreSQL: create extension unaccent;
Next, declare the SQL function unaccent in Python:
from sqlalchemy.sql.functions import ReturnTypeFromArgs

class unaccent(ReturnTypeFromArgs):
    pass
and use it like this:
for place in session.query(Place).filter(unaccent(Place.name) == "cafe").all():
    print(place.name)
Make sure you have the correct indexes if you have a large table, otherwise this will result in a full table scan.
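For example, a minimal sketch of such an index, assuming an Engine named engine and a place table with a name column. PostgreSQL will not accept unaccent() directly in an expression index because the function is not marked IMMUTABLE; a common workaround is an immutable SQL wrapper (immutable_unaccent is a name made up here):

from sqlalchemy import text

with engine.begin() as conn:
    # Immutable wrapper so the function is allowed in an expression index.
    conn.execute(text(
        "CREATE OR REPLACE FUNCTION immutable_unaccent(text) RETURNS text "
        "AS $$ SELECT unaccent($1) $$ LANGUAGE sql IMMUTABLE"
    ))
    # Expression index on the unaccented name.
    conn.execute(text(
        "CREATE INDEX ix_place_name_unaccent ON place (immutable_unaccent(name))"
    ))

Note that the planner will only use this index if the query filters on the same expression, so you would declare and call immutable_unaccent from Python the same way as unaccent above.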
A simple and database-agnostic solution is to write the field(s) that can have accents twice, once with and once without accents. Then you can conduct your searches on the unaccented version.
To generate the unaccented version of a string you can use Unidecode.
To automatically assign the unaccented version to the database when a record is inserted or updated you can use the default and onupdate clauses in the Column definition. For example, using Flask-SQLAlchemy you could do something like this:
from unidecode import unidecode

def unaccent(context):
    return unidecode(context.current_parameters['some_string'])

class MyModel(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    some_string = db.Column(db.String(128))
    some_string_unaccented = db.Column(db.String(128), default=unaccent, onupdate=unaccent, index=True)
Note how I only indexed the unaccented field, because that is the one on which the searches will be made.
Of course before you can search you also have to unaccent the value you are searching for. For example:
def search(text):
    return MyModel.query.filter_by(some_string_unaccented=unidecode(text)).all()
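For example, a quick usage sketch (the literal values are made up):

m = MyModel(some_string=u'café au lait')
db.session.add(m)
db.session.commit()  # the default callable fills in some_string_unaccented

print(m.some_string_unaccented)  # 'cafe au lait'
print(search(u'café au lait'))   # finds m through the unaccented column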
You can apply the same technique to full text search, if necessary.
I have the following model for an Oracle database, which is not a part of my Django project:
class ResultsData(models.Model):
    RESULT_DATA_ID = models.IntegerField(primary_key=True, db_column="RESULT_DATA_ID")
    RESULT_XML = models.TextField(blank=True, null=True, db_column="RESULT_XML")

    class Meta:
        managed = False
        db_table = '"schema_name"."results_data"'
The RESULT_XML field in the database itself is declared as Oracle's XMLType. I chose to represent it as a TextField in the Django model, since it has no character limit.
When I try to fetch some data with that model, I get the following error:
DatabaseError: ORA-19011: Character string buffer too small
I figure it is because of the volume of data stored in the RESULT_XML field, since when I just pull a record with .values("RESULT_DATA_ID"), it works fine.
Any ideas on how I can work around this problem? Googling for answers did not yield anything so far.
UPDATED ANSWER
I have found a much better way of dealing with the issue - I wrote a custom field Transform, which generates the Oracle SQL query I was after:
OracleTransforms.py
from django.db.models import TextField
from django.db.models.lookups import Transform

class CLOBVAL(Transform):
    '''
    Oracle-specific transform for XMLType field, which returns string data exceeding
    buffer size (ORA-19011: Character string buffer too small) as a character LOB type.
    '''
    function = None
    lookup_name = 'clobval'

    def as_oracle(self, compiler, connection, **extra_context):
        return super().as_sql(
            compiler, connection,
            template='(%(expressions)s).GETCLOBVAL()',
            **extra_context
        )

# Needed for CLOBVAL to work as a .values('field_name__clobval') lookup in Django ORM queries
TextField.register_lookup(CLOBVAL)
With the above, I can now just write a query as follows:
from .OracleTransforms import CLOBVAL
ResultsData.objects.filter(RESULT_DATA_ID=some_id).values('RESULT_DATA_ID', 'RESULT_XML__clobval')
or
ResultsData.objects.filter(RESULT_DATA_ID=some_id).values('RESULT_DATA_ID', XML=CLOBVAL('RESULT_XML'))
This is the best solution for me, as I do get to keep using QuerySet, instead of RawQuerySet.
The only limitation I see with this solution for now is that I always need to request the XML as .values('RESULT_XML__clobval') (or via XML=CLOBVAL('RESULT_XML')) in my ORM queries, or Oracle will report ORA-19011 again, but I guess this is still a good outcome.
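One way to reduce that repetition is a small QuerySet helper, so callers don't have to remember the transform (a sketch; with_xml is a name I made up, not Django API):

from django.db import models
from .OracleTransforms import CLOBVAL

class ResultsDataQuerySet(models.QuerySet):

    def with_xml(self):
        # Always fetch the XML through GETCLOBVAL() to avoid ORA-19011.
        return self.values('RESULT_DATA_ID', XML=CLOBVAL('RESULT_XML'))

class ResultsData(models.Model):
    # ... fields and Meta as above ...
    objects = ResultsDataQuerySet.as_manager()

Usage then becomes ResultsData.objects.filter(RESULT_DATA_ID=some_id).with_xml().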
OLD ANSWER
So, I have found a way around the problem, thanks to Christopher Jones's suggestion.
ORA-19011 is the error Oracle returns when the data it would send back as a string exceeds the allowed buffer; such data needs to be sent back as a character LOB object instead.
Django has no direct support for that Oracle-specific method (at least I did not find one), so the answer to the problem was a raw Django query:
query = 'select a.RESULT_DATA_ID, a.RESULT_XML.getClobVal() as RESULT_XML FROM SCHEMA_NAME.RESULTS_DATA a WHERE a.RESULT_DATA_ID=%s'
data = ResultsData.objects.raw(query, [id])
This way, you get back a RawQuerySet, which is the less known, less liked cousin of Django's QuerySet. You can iterate through the results, and RESULT_XML will contain a LOB field, which when interrogated converts to a string type.
Handling string-encoded XML data is awkward, so I also employed the xmltodict Python package to get it into a bit more civilized shape.
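For example, a minimal sketch (whether the value arrives as a cx_Oracle LOB or already as a str depends on the driver and field conversion):

import xmltodict

for row in data:
    xml_value = row.RESULT_XML
    if hasattr(xml_value, 'read'):  # cx_Oracle LOB: read it into a string
        xml_value = xml_value.read()
    doc = xmltodict.parse(xml_value)  # nested dicts instead of a raw XML string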
Next, I should probably look for a way to modify Django's getter for the RESULT_XML field only, and have it generate a query to Oracle DB with .getClobVal() method in it, but I will touch on that in a different StackOverflow question: Django - custom getter for 1 field in model
Let's say I have the following models:
class Invoice(models.Model):
    ...

class Note(models.Model):
    invoice = models.ForeignKey(Invoice, related_name='notes', on_delete=models.CASCADE)
    text = models.TextField()
and I want to select Invoices that have some notes. I would write it using annotate/Exists like this:
Invoice.objects.annotate(
    has_notes=Exists(Note.objects.filter(invoice_id=OuterRef('pk')))
).filter(has_notes=True)
This works well enough and filters only Invoices with notes. However, this method results in the field being present in the query result, which I don't need, and means worse performance (SQL has to execute the subquery twice).
I realize I could write this using extra(where=) like this:
Invoice.objects.extra(where=['EXISTS(SELECT 1 FROM note WHERE invoice_id=invoice.id)'])
which would result in the ideal SQL, but in general it is discouraged to use extra / raw SQL.
Is there a better way to do this?
You can remove annotations from the SELECT clause using .values() query set method. The trouble with .values() is that you have to enumerate all names you want to keep instead of names you want to skip, and .values() returns dictionaries instead of model instances.
Django internally keeps track of which annotations to select in QuerySet.query.annotation_select_mask. So you can use it to tell Django which annotations to skip, even without .values():
class YourQuerySet(QuerySet):

    def mask_annotations(self, *names):
        if self.query.annotation_select_mask is None:
            self.query.set_annotation_mask(set(self.query.annotations.keys()) - set(names))
        else:
            self.query.set_annotation_mask(self.query.annotation_select_mask - set(names))
        return self
Then you can write:
invoices = (Invoice.objects
            .annotate(has_notes=Exists(Note.objects.filter(invoice_id=OuterRef('pk'))))
            .filter(has_notes=True)
            .mask_annotations('has_notes')
            )
to skip has_notes from the SELECT clause and still get filtered invoice instances. The resulting SQL query will be something like:
SELECT invoice.id, invoice.foo FROM invoice
WHERE EXISTS(SELECT note.id, note.bar FROM note WHERE note.invoice_id = invoice.id) = True
Just note that annotation_select_mask is internal Django API that can change in future versions without a warning.
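To make mask_annotations available on Invoice.objects, attach the custom QuerySet as the model's manager (standard as_manager() wiring):

class Invoice(models.Model):
    # ... fields as before ...
    objects = YourQuerySet.as_manager()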
Ok, I've just noticed in the Django 3.0 docs that they've updated how Exists works, and it can be used directly in filter():
Invoice.objects.filter(Exists(Note.objects.filter(invoice_id=OuterRef('pk'))))
This will ensure that the subquery will not be added to the SELECT columns, which may result in a better performance.
Changed in Django 3.0:
In previous versions of Django, it was necessary to first annotate and then filter against the annotation. This resulted in the annotated value always being present in the query result, and often resulted in a query that took more time to execute.
Still, if someone knows a better way for Django 1.11, I would appreciate it. We really need to upgrade :(
We can filter for Invoices that have at least one related Note (i.e. where a LEFT OUTER JOIN to Note yields a non-NULL row), and make the query distinct (to avoid returning the same Invoice twice).
Invoice.objects.filter(notes__isnull=False).distinct()
This is a more optimized approach if you want to fetch rows from one table whose primary key is referenced from another table:
Invoice.objects.filter(note__invoice_id=OuterRef('pk'))
We should be able to clear the annotated field using the method below.
Invoice.objects.annotate(
    has_notes=Exists(Note.objects.filter(invoice_id=OuterRef('pk')))
).filter(has_notes=True).query.annotations.clear()
I read about SQLAlchemy's joinedload, as mentioned here, and I'm a little confused about its benefits or special uses over simply joining two tables, as mentioned here.
I would like to know when to use each method. Currently I don't see any benefit to using joinedload; can you please explain the difference, and the use cases where joinedload is preferred?
The SQLAlchemy docs say that joinedload() is not a replacement for join(), and that joinedload() doesn't affect the query result:
Query.join()
Query.options(joinedload())
Say you want to get data that is already related to the data you are querying; fetching this related data won't change the result of the query, it is like an attachment. It's better to look at the SQLAlchemy docs on joinedload.
class User(db.Model):
    ...
    addresses = relationship('Address', backref='user')

class Address(db.Model):
    ...
    user_id = Column(Integer, ForeignKey('users.id'))
The code below queries for a user by a filter and returns that user; optionally, you also get that user's addresses:
user = db.session.query(User).options(joinedload(User.addresses)).filter(User.id == 1).one()
Now lets look at join:
user = db.session.query(User).join(Address).filter(User.id == Address.user_id).one()
Conclusion
The query with joinedload() also gets that user's addresses.
The other query queries both tables and checks the user id in both, so the result depends on both tables. With joinedload(), if the user doesn't have any addresses you still get the user, just with no addresses; with join(), if the user doesn't have an address there will be no result at all.
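A quick sketch of the practical difference, assuming the models above:

# joinedload: the addresses arrive in the same SELECT, so this loop
# triggers no extra lazy-load queries, and an address-less user is still returned.
user = (db.session.query(User)
        .options(joinedload(User.addresses))
        .filter(User.id == 1)
        .one())
for address in user.addresses:
    print(address.user_id)

# join: useful for filtering on the joined table; a user without
# any Address rows would raise NoResultFound here instead of being returned.
user = (db.session.query(User)
        .join(Address)
        .filter(User.id == 1)
        .one())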
I am implementing a search feature for user names. Some names have accented characters, but I want to be able to search for them with the nearest ascii character approximation. For example: Vû Trån would be searchable with Vu Tran.
I found a Python library, called unidecode to handle this conversion. It works as expected and takes my unicode string Vû Trån and returns Vu Tran. Perfect.
The issue arises when I start querying my database – I use SQLAlchemy and Postgres.
Here's my Python query:
Person.query.filter(Person.ascii_name.ilike("%{q}%".format(q=query))).limit(25).all()
ascii_name is the getter for my name column, implemented as follows:
class PersonUtil(object):

    def _get_ascii_name(self):
        return unidecode(unicode(self.name))

class Person(Base, PersonUtil):
    """
    My abbreviated Person class
    """
    __tablename__ = 'person'

    id = Column(BigInteger, ForeignKey('user.id'), primary_key=True)
    first_name = Column(Unicode, nullable=False)
    last_name = Column(Unicode, nullable=False)

    name = column_property(first_name + " " + last_name)
    ascii_name = synonym('name', descriptor=property(fget=PersonUtil._get_ascii_name))
My intent behind this code is that because I store the unicode version of the first and last names in my database, I need to have a way to call unidecode(unicode(name)) when I retrieve the person's name. Hence, I use the descriptor=property(fget=...) so that whenever I call Person.ascii_name, I retrieve the "unidecoded" name attribute. That way, I can simply write Person.ascii_name.ilike("%{my_query}%")... and match the nearest ascii_name to the search query, which is also just ascii characters.
This doesn't fully work. The ilike method with ascii_name works when I do not have any converted characters in the query. For example, the ilike query will work for the name "Bob Smith", but it will not work for "Bøb Smíth". It fails when it encounters the first converted character, which in the case of "Bøb Smíth" is the letter "ø".
I am not sure why this is happening. The ascii_name getter returns my expected string of "Bob Smith" or "Vu Tran", but when coupled with the ilike method, it doesn't work.
Why is this happening? I've not been able to find anything about this issue.
How can I either fix my existing code to make this work, or is there a better way to do this that will work? I would prefer not to have to change my DB schema.
Thank you.
What you want to do simply won't work, because ilike only works on real columns in the database. The column_property and synonym are just syntactic sugar provided by SQLAlchemy to make the front end convenient. If you want to leverage the backend to query with LIKE in the way you intended, the actual values need to be there: I'm afraid you have to generate and store the ascii full name in the database, which means changing your schema to include ascii_name as a real column and making sure it is populated on insert and update. To verify this yourself, dump out the data in the table and see if your manually constructed queries work.
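For example, a minimal sketch of that schema change, using a mapper event to keep the new column populated (the event hook is one option among several; the names follow the question):

from sqlalchemy import BigInteger, Column, ForeignKey, Unicode, event
from unidecode import unidecode

class Person(Base):
    __tablename__ = 'person'

    id = Column(BigInteger, ForeignKey('user.id'), primary_key=True)
    first_name = Column(Unicode, nullable=False)
    last_name = Column(Unicode, nullable=False)
    ascii_name = Column(Unicode, index=True)  # a real, indexed column, so ilike works

@event.listens_for(Person, 'before_insert')
@event.listens_for(Person, 'before_update')
def _fill_ascii_name(mapper, connection, target):
    # Keep the searchable column in sync with the unicode names.
    target.ascii_name = unidecode(u'%s %s' % (target.first_name, target.last_name))

Remember to unidecode the search term as well, e.g. Person.ascii_name.ilike("%{q}%".format(q=unidecode(query))).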
I'm writing an App Engine app that has some input fields.
Are there any concerns I need to take into account about something like this?
You should validate that any input from your users meets your requirements. For example, if you need a positive integer, then make sure that's what you got.
As far as strings, you don't have to worry about SQL (or GQL in this case) injection as long as you don't construct the queries by hand. Instead use the GqlQuery.bind() method, or the methods provided by Query to pass the values (e.g., Query.filter()). Then these classes will take care of formulating the query so you don't need to worry about the syntax (or injection).
Examples (adapted from the docs linked to previously):
# this basic string query is safe
query = Song.all()
query.filter('title =', self.request.get('title'))

# a GqlQuery version of the previous example
query = GqlQuery("SELECT * FROM Song WHERE title = :1", self.request.get('title'))

# sanitize/validate when you have requirements: e.g., year must be a number
query = Song.all()
try:
    year = int(self.request.get('year'))  # make sure we got a number
except ValueError:
    pass  # show error msg
else:
    query.filter('year =', year)
There are a number of forms libraries that do most of the hard work for you - you should use one of them. Django's newforms library is included with App Engine.
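For example, a minimal sketch with the bundled forms API (SongForm and its fields are hypothetical, and the attribute name depends on the Django version: early newforms used clean_data, later versions renamed it to cleaned_data):

from django import newforms as forms

class SongForm(forms.Form):
    title = forms.CharField(max_length=100)
    year = forms.IntegerField()

form = SongForm({'title': self.request.get('title'),
                 'year': self.request.get('year')})
if form.is_valid():
    year = form.clean_data['year']  # already validated as an integer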