Searching across multiple tables (best practices) - python

I have a property management application consisting of these tables:
tenants
landlords
units
properties
vendors-contacts
Basically, I want one search field to search them all rather than having to select which category I am searching. Would this be an acceptable solution (technology-wise)?
Will searching across 5 tables be OK in the long run and not bog down the server? What's the best way of accomplishing this?
Using PostgreSQL

Why not create a view which is a union of the tables which aggregates the columns you want to search on into one, and then search on that aggregated column?
You could do something like this:
select 'tenants:' || t.Id::text, <shared fields> from Tenants as t union
select 'landlords:' || l.Id::text, <shared fields> from Landlords as l union
...
This requires some logic to be embedded from the client querying; it has to know how to fabricate the key that it's looking for in order to search on a single field.
That said, it's probably better if you just have a separate column which contains a "type" value (e.g. landlord, tenant) and then filter on both the type and the ID, as it will be computationally less expensive (and can be optimized better).
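A minimal sketch of the type-column idea, using Python's built-in sqlite3 module in place of PostgreSQL so that it runs as-is. The sample rows, the search_all view, and the search() helper are made up for illustration, and only two of the five tables are shown.

```python
import sqlite3

# Hypothetical schema: two of the five tables, each with a name column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tenants   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE landlords (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO tenants   (id, name) VALUES (1, 'Alice Smith'), (2, 'Bob Jones');
    INSERT INTO landlords (id, name) VALUES (1, 'Carol Smith');

    -- One row per searchable record, tagged with the table it came from.
    CREATE VIEW search_all AS
        SELECT 'tenant' AS type, id, name AS search_text FROM tenants
        UNION ALL
        SELECT 'landlord' AS type, id, name AS search_text FROM landlords;
""")


def search(term):
    """Return (type, id) pairs whose search text contains the term."""
    rows = conn.execute(
        "SELECT type, id FROM search_all WHERE search_text LIKE ? ORDER BY type, id",
        ("%" + term + "%",),
    )
    return rows.fetchall()


results = search("Smith")
```

Each hit carries both the type and the ID, so the client can go straight to the right table without parsing a fabricated key.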

You want to use the built-in full text search or a separate product like Lucene. This is optimised for unstructured searches over heterogeneous data.
Also, don't forget that normal indices cannot be used for something LIKE '%...%'. A full text search engine will also be able to do efficient substring searches.

I would suggest using a specialized full-text indexing tool like Lucene for this. It will probably be easier to get up and running, and the result is faster and more featureful too. Postgres full text indexes will be useful if you also need structured search capability on top of this or transactionality of your search index is important.
If you do want to implement this in the database, something like the following scheme might work, assuming you use surrogate keys:
for each searchable table create a view that has the primary key column of that table, the name of the table and a concatenation of all the searchable fields in that table.
create a functional GIN or GiST index on the underlying table over the to_tsvector() of the exact same concatenation.
create a UNION ALL over all the views to create the searchable view.
After that you can do the searches like this:
SELECT id, table_name, ts_rank_cd(body, query) AS rank
FROM search_view, to_tsquery('search&words') query
WHERE query @@ body
ORDER BY rank DESC
LIMIT 10;
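As a rough sketch of the three steps above, the following Python helper only builds the DDL strings; it does not connect to a database. The column names (name, email, address), the _search and _fts_idx suffixes, and the 'english' text search configuration are assumptions. Table and column names are interpolated directly, so they must come from your own code, never from user input.

```python
def search_ddl(tables, config="english"):
    """Build PostgreSQL DDL for per-table search views, matching
    expression indexes, and one UNION ALL view over all of them."""
    statements, selects = [], []
    for table, columns in tables.items():
        # Concatenate the searchable columns; coalesce() keeps one NULL
        # column from turning the whole concatenation into NULL.
        text = " || ' ' || ".join(f"coalesce({c}, '')" for c in columns)
        vector = f"to_tsvector('{config}', {text})"
        statements.append(
            f"CREATE VIEW {table}_search AS "
            f"SELECT id, '{table}'::text AS table_name, {vector} AS body FROM {table};"
        )
        # The index is built over exactly the same expression as the view.
        statements.append(
            f"CREATE INDEX {table}_fts_idx ON {table} USING gin (({vector}));"
        )
        selects.append(f"SELECT id, table_name, body FROM {table}_search")
    statements.append(
        "CREATE VIEW search_view AS " + " UNION ALL ".join(selects) + ";"
    )
    return statements


# Hypothetical searchable columns for two of the tables.
ddl = search_ddl({"tenants": ["name", "email"], "units": ["address"]})
```

Generating the view and the index from the same string is one way to keep the two expressions identical, which the scheme depends on.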

You should be fine, and there's really no other good (easy) way to do this. Just make sure the fields you are searching on are properly indexed though.

Related

tag search on postgres via sqlalchemy

I'm trying to query a tags column (currently modelled as a character varying array). I would like to find any rows in which the tags column contains the query string as a left-anchored substring, and would like to do so using SQLAlchemy. My research has led me to learn about different ways of optimizing text search, but several lookup methods still require usage of 'unnest'. I am open to changing the column from character varying array to something else (or having a separate, related table for tags), but am also curious about using unnest in SQLAlchemy.
eg.
SELECT * FROM batches, UNNEST(tags) t WHERE t like 'poe%';
works and will find a row where tags column is ['math', 'poetry'].
I haven't found the right way to use unnest in SQLAlchemy's Python ORM. Any help appreciated.

Dynamodb - query if a list contains

I'm fairly new to NoSQL. Using Python/Boto but this is a fairly general question. Currently trying to switch a project from MongoDB to DynamoDB and seeking some advice on DynamoDB and its capacity to query if a list contains a certain string. I have been searching for the past day or so but I'm starting to worry that it doesn't have this facility, other than to use scan, which is terribly slow considering the db will be queried thousands of times on updates. Similar unanswered question here
I understand primary keys can only be N, S or B and not something like String Set (SS) which would have been useful.
The data is fairly simple and would look something like this. I'm looking for the most efficient way to query the db based on the tag attribute for entries that include 'string1' OR 'string2'. Again, I don't want to use scan but am willing to consider normalization of the data structure if there is a best practice in dynamodb.
{
    id: <some number used as a primary key>,
    tags: ['string1', 'string2'...],
    data: {some JSON object}
}
From what I've read, even using global secondary indexes, this doesn't seem possible which is strange since that would make dynamodb only useful for the most simple queries. Hoping I'm missing something.
In MongoDB, you have multikey indices, but not in DynamoDB.
I'd think you'd need to solve it like you would in a relational database: create a many-to-many relation table with tag as your hash key and entry id as your sort key. And find some way to keep your relation table in sync with your entry table.
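To make the relation-table idea concrete, here is a small in-memory sketch in plain Python. It does not talk to DynamoDB at all: the relation dict stands in for a table whose hash key is the tag and whose sort key is the entry id, and the names tag_items, put_entry, and entries_with_any are invented for illustration.

```python
from collections import defaultdict


def tag_items(entry):
    """One relation item per tag: tag is the hash key, entry id the sort key."""
    return [{"tag": tag, "entry_id": entry["id"]} for tag in entry["tags"]]


# In-memory stand-in for the relation table, grouped by hash key.
relation = defaultdict(set)


def put_entry(entry):
    # In a real setup these writes would have to be kept in sync with the
    # writes to the entry table; here both are just Python objects.
    for item in tag_items(entry):
        relation[item["tag"]].add(item["entry_id"])


def entries_with_any(tags):
    """Ids of entries carrying at least one of the tags: one keyed lookup per tag."""
    found = set()
    for tag in tags:
        found |= relation.get(tag, set())
    return sorted(found)


put_entry({"id": 1, "tags": ["string1", "string3"], "data": {}})
put_entry({"id": 2, "tags": ["string2"], "data": {}})
put_entry({"id": 3, "tags": ["string3"], "data": {}})

matches = entries_with_any(["string1", "string2"])
```

The 'string1' OR 'string2' search becomes one lookup per tag on the hash key, with the results merged on the client, instead of a scan over every entry.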

Python sqlite3 user defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize anything other than values.
On the one hand, you don't need to worry about quoting or type conversion for table and column names. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to see what other harm they could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because to some other code it shouldn't be normalized)… functions are great for that. create_function a wrapper, and you can pass parameters to that function when you execute it.
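For the first option, generating the SQL string dynamically, one common way to stay clear of injection is to check the column names against a fixed whitelist and keep binding the search pattern as a parameter. The whitelist is an addition of this sketch, not something the answer above spells out. The column list, the sample rows, and the search_books helper are assumptions; REGEXP is not built into SQLite, so a Python function is registered for it.

```python
import re
import sqlite3

# Hypothetical whitelist: the only column names and sort directions that
# may ever be spliced into the SQL text.
COLUMNS = {"title", "author", "publisher", "price"}
DIRECTIONS = {"ASC", "DESC"}


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): the pattern comes first.
    return int(value is not None and re.search(pattern, str(value)) is not None)


def search_books(conn, category, criteria, order, asc_desc):
    asc_desc = asc_desc.upper()
    if category not in COLUMNS or order not in COLUMNS or asc_desc not in DIRECTIONS:
        raise ValueError("unknown column or sort direction")
    # Names come from the whitelist; the user's pattern is still bound with ?.
    sql = f"SELECT title FROM Books WHERE {category} REGEXP ? ORDER BY {order} {asc_desc}"
    return [row[0] for row in conn.execute(sql, (criteria,))]


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    "CREATE TABLE Books (isbn TEXT PRIMARY KEY, title TEXT, author TEXT, publisher TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO Books VALUES (?, ?, ?, ?, ?)",
    [("1", "Dune", "Herbert", "Ace", 9.5),
     ("2", "Emma", "Austen", "Penguin", 7.0),
     ("3", "Dubliners", "Joyce", "Penguin", 8.0)],
)

titles = search_books(conn, "title", "^Du", "price", "asc")
```

Anything outside the whitelist raises ValueError before any SQL is built, so user input never reaches the query text itself.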

Quicker way of updating subdocuments

My JSON documents (called "i"), have sub documents (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), I have mongo scan through all the documents in the database, then through all the subdocuments, and then find the subdocument it needs to update.
I am having major time issues, as I have ~3000 documents and this is taking about 4 minutes.
I would like to know if there is a quicker way to do this, without mongo having to scan all the documents but by doing it within the loop.
Here is the code:
for i in db.stuff.find():
    for element in i['counts']:
        computed_value = element[a] + element[b]
        db.stuff.update({'id':i['id'], 'counts.timestamp':element['timestamp']},
                        {'$set': {'counts.$.total':computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.
What indexes do you have on your collection? This could probably be sped up by creating an index on your embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, you'd do something like
db.stuff.createIndex( { "counts.timestamp" : 1 });
This will make your searches through embedded documents run much faster.
Your update is based on id (which I assume is different from MongoDB's default _id).
Put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching given criteria? If only for matching documents, use a query operator (with an index if possible).
Don't fetch the full document; fetch only the fields that are being used.
What is your average document size? Use explain and mongostat to understand what the actual bottleneck is.

Python: Dumping Database Data with Peewee

Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an Excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have resulted in unordered results, and all of them are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting database tables in the target databases, then went and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0,column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order that I want stuff in an array, and use that for looping through and populating data. The pro of this is very strict/granular control. The con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weighted value for all possibly encountered fields, then write a method for sorting "_meta.get_field_names()" according to weight. The con of this is that the columns may not be 100% in the right order, such as Name coming before DOB in one DB, while after it in another.
Feel free to tell me I'm doing it all wrong or suggest completely different ways of doing this, I'm all ears. I'm very much new to Python and Peewee (ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, its libraries for Excel would cause me as many problems, and I'd like to take this as a time to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
