Building dynamic SQL queries with psycopg2 and postgresql - python

I'm not really sure of the best way to go about this, or if I'm just asking for a life that's easier than it should be. I have a backend for a web application and I like to write all of the queries in raw SQL. For instance, to get a specific user profile, or a number of user profiles, I have a query like this:
SELECT accounts.id,
       accounts.username,
       accounts.is_brony
  FROM accounts
 WHERE accounts.id IN %(ids)s;
This is really nice because I can get one user profile, or many user profiles with the same query. Now my real query is actually almost 50 lines long. It has a lot of joins and other conditions for this profile.
Let's say I want to get all of the same information for a user profile, but instead of fetching a specific user ID I want a single random user. I don't think it makes sense to copy and paste 50 lines of code just to modify two lines at the end.
SELECT accounts.id,
       accounts.username,
       accounts.is_brony
  FROM accounts
 ORDER BY random()
 LIMIT 1;
Is there some way to use some sort of inheritance in building queries, so that at the end I can modify a couple of conditions while keeping the core similarities the same?
I'm sure I could manage it by concatenating strings and such, but I was curious if there's a more widely accepted method for approaching such a situation. Google has failed me.

The canonical answer is to create a view and use that with different WHERE and ORDER BY clauses in queries.
But, depending on your query and your tables, that might not be a good solution for your special case.
A query that is blazingly fast with WHERE accounts.id IN (1, 2, 3) might perform abysmally with ORDER BY random() LIMIT 1. In that case you'll have to come up with a different query for the second requirement.
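A minimal sketch of the view approach with psycopg2 (the connection string is a placeholder and the view body stands in for the real 50-line query; psycopg2 adapts a Python tuple for the IN clause):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Define the big query once as a view.
    cur.execute("""
        CREATE OR REPLACE VIEW account_profiles AS
        SELECT accounts.id,
               accounts.username,
               accounts.is_brony
          FROM accounts
        -- the ~50 lines of joins and conditions live here, once
    """)
    # Specific users: psycopg2 adapts the tuple into a SQL list.
    cur.execute("SELECT * FROM account_profiles WHERE id IN %(ids)s",
                {"ids": (1, 2, 3)})
    # A single random user, reusing the same view.
    cur.execute("SELECT * FROM account_profiles ORDER BY random() LIMIT 1")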

Related

Django: Fastest way to random query one record using filter

What is the fastest way to query one record from the database that satisfies my filter?
mydb.objects.filter(start__gte='2017-1-1', status='yes').order_by('?')[:1]
This statement will first query thousands of records and then select one, and it is very slow; but I only need one, a random one. What is the fastest way to get it?
Well, I'm not sure you will be able to do exactly what you want. I was running into a similar issue a few months ago and I ended up redesigning my implementation of my backend to make it work.
Essentially, you want to make the query time shorter by having it choose a random record that fulfills both requirements (start__gte='2017-1-1', status='yes'), but like you say, in order for the query to do so, it needs to filter your entire database. This means that you cannot get a "true" random record from the database that also fulfills the filter requirements, because filtering inherently needs to look through all of your records (otherwise it wouldn't really be random, it would just be the first one it finds that fulfills your requirements).
Instead, consider putting all records that have a status='yes' in a separate relation, so that you can pull a random record from there and join with the larger relation. That would make the query time considerably faster (and it's the type of solution I implemented to get my code to work).
If you really want a random record with the correct filter information, you might need to employ some convoluted means.
You could use a custom manager in Django to have it find only one random record, something like this:
from random import randint

from django.db import models
from django.db.models import Count

class UsersManager(models.Manager):
    def random(self):
        count = self.aggregate(count=Count('id'))['count']
        random_index = randint(0, count - 1)
        return self.all()[random_index]

class User(models.Model):
    # Your fields here (whatever they are; it seems start and status are among them)!
    objects = UsersManager()
You can then invoke it just by using:
User.objects.random()
This could be repeated with a check in your code until it returns a random record that fulfills your requirements. I don't think this is necessarily the cleanest or programmatically correct way of implementing this, but I don't think a faster solution exists for your specific issue.
I used this site as a source for this answer, and it has a lot more solid information about using this custom random method! You'll likely have to change the custom manager to serve your own needs, but if you add the random() method to your existing custom manager it should be able to do what you need of it!
Hope it helps!
Using order_by('?') will cause a serious performance issue. A better way is to use something like this: Getting a random row from a relational database.
from random import randint
from django.db.models import Count

count = mydb.objects.filter(start__gte='2017-1-1', status='yes').aggregate(count=Count('id'))['count']
random_index = randint(0, count - 1)
result = mydb.objects.filter(start__gte='2017-1-1', status='yes')[random_index]

Creating an archive - Save results or request them every time?

I'm working on a project that allows users to enter SQL queries with parameters. Each query will be executed on a schedule the user decides (say, every 2 hours for 6 months), and the results will then be sent to their email address.
They'll get it in the form of an HTML-email message, so what the system basically does is run the queries, and generate HTML that is then sent to the user.
I also want to save those results, so that a user can go on our website and look at previous results.
My question is - what data do I save?
Do I save the SQL query with its parameters (i.e. the date parameters, so the user can see the results relevant to that specific date)? This means that when the user clicks on a specific result, I need to execute the query again.
Or do I save the HTML that was generated back then, and simply display it when the user wishes to see this result?
I'd appreciate it if somebody would explain the pros and cons of each solution, and which one is considered the best & the most efficient.
The archive will probably be 1-2 months old, and I can't really predict the amount of rows each query will return.
Thanks!
Specifically regarding retrieving the results of queries that have been run previously, I would suggest saving the results so they can be viewed later, rather than running the queries again and again. The main benefits of this approach are:
You save unnecessary computational work re-running the same queries;
You guarantee that the result set will be the same as the original report. For example, if you save just the SQL, the records queried may have changed since the query was last run, or records may have been added/deleted.
The disadvantage of this approach is that it will probably use more disk space, but this is unlikely to be an issue unless you have queries returning millions of rows (in which case html is probably not such a good idea anyway).
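A minimal sketch of what such an archive record might look like (the model and field names are made up for illustration; JSONField requires Django 3.1 or later):

from django.db import models

class ReportRun(models.Model):
    recipient_email = models.EmailField()
    sql_text = models.TextField()                         # the query that was run
    params = models.JSONField()                           # the parameters it ran with
    rendered_html = models.TextField()                    # the exact HTML that was emailed
    executed_at = models.DateTimeField(auto_now_add=True)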
If I were building this type of application, then:
I would provide some common queries, such as current date, current time, date ranges, time ranges, and others specific to the application, for the user to select easily.
I would also add autocompletion for common keywords.
If the data changes frequently, there is no point in saving the HTML; generating it fresh each time is the better option.
The crucial difference is that if the data changes, a new query will return a different result from what was saved some time ago, so you have to decide whether the user should get up-to-date data or a snapshot of what the data used to be.
If the relevant data does not change, it becomes a matter of how expensive the queries are, how many users will run them, and how often; based on that, you may decide to save the results instead of re-running the queries, to improve performance.

Python sqlite3 user-defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to string formatting and opening myself up to injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize table or column names.
On the one hand, you don't need to worry about quoting or type conversion for table and column names. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to see what other harm they could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
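For the dynamic-string route, the usual safeguard is to check the identifiers against a whitelist and bind only the value; a hedged sketch (the allowed column names are assumptions based on the question):

ALLOWED_COLUMNS = {"isbn", "title", "author", "publisher", "price"}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def search_books(cursor, category, criteria, order, asc_desc):
    # Identifiers come from a fixed whitelist, never verbatim from the user;
    # only the search value is bound as a parameter.
    if category not in ALLOWED_COLUMNS or order not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    direction = asc_desc.upper()
    if direction not in ALLOWED_DIRECTIONS:
        raise ValueError("direction must be ASC or DESC")
    query = "SELECT * FROM Books WHERE {} REGEXP ? ORDER BY {} {}".format(
        category, order, direction)
    cursor.execute(query, [criteria])
    return cursor.fetchall()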
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because to some other code it shouldn't be normalized)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute your query.
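As a side note on create_function: Python's sqlite3 module does not ship a REGEXP implementation, so the REGEXP operator used in the question only works if you register one yourself. A minimal sketch (the database path is a placeholder):

import re
import sqlite3

conn = sqlite3.connect("books.db")  # placeholder path

def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X), i.e. pattern first.
    return value is not None and re.search(pattern, value) is not None

conn.create_function("REGEXP", 2, regexp)

cursor = conn.cursor()
cursor.execute("SELECT * FROM Books WHERE title REGEXP ?", [r"^The"])
print(cursor.fetchall())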

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database with a last-modified date later than the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent when items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
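A hedged sketch of the queue side with pika (RabbitMQ's Python client); the queue name and message shape are made up, and a broker on localhost is assumed:

import json

import pika  # RabbitMQ client; assumes a broker running on localhost

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="new_objects", durable=True)

def publish_new_object(obj_id, obj_type):
    # The code path that saves a new object calls this; a separate worker
    # consumes the queue and runs the stored searches against the object.
    channel.basic_publish(
        exchange="",
        routing_key="new_objects",
        body=json.dumps({"id": obj_id, "type": obj_type}),
    )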
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
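A hedged sketch of that signal-based approach; SavedSearch, load_q() and notify_owner() are hypothetical names, not anything from the question or a library:

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

from searches.models import SavedSearch  # hypothetical model storing pickled Q objects

@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    content_type = ContentType.objects.get_for_model(sender)
    # Only the searches registered for this model type are loaded and checked.
    for search in SavedSearch.objects.filter(content_type=content_type):
        q = search.load_q()  # hypothetical: unpickle the stored Q object
        if sender.objects.filter(q, pk=instance.pk).exists():
            search.notify_owner(instance)  # hypothetical notification hook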

Django Table with Million of rows

I have a project with 2 applications ( books and reader ).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject (20 models with 20 tables of 200,000 rows each: book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of adding these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which identifies the appropriate table) and "book_id" (which identifies the book within the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (eg multi-db), and you should be able to hand-write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like author, and divide by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums, etc. in (for example) the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you --- but well before sharding or horizontal partitioning.
ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.
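A minimal sketch of that advice (field names taken from the question; the Subject model and db_index choice are assumptions):

from django.db import models

class Subject(models.Model):
    name = models.CharField(max_length=40)

class Book(models.Model):
    book_title = models.CharField(max_length=40, db_index=True)
    book_description = models.CharField(max_length=400)
    # Keep the ForeignKey: in the database it is just an integer column
    # plus an index, which is what you would hand-roll anyway.
    book_subject = models.ForeignKey(Subject, on_delete=models.CASCADE)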
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
I'm not familiar with Django, but I have a general understanding of databases.
When you have large databases, it's pretty normal to index them. That way, retrieving data should be pretty quick.
When it comes to associating a book with a reader, you should create another table that links readers to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look into caching with something like memcached. Django supports this quite well.
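For instance, a hedged sketch of Django's per-view caching; the view, URL parameter, and template names are made up, and a memcached backend is assumed to be configured under CACHES in settings:

from django.shortcuts import render
from django.views.decorators.cache import cache_page

from books.models import Book  # assumed app/model layout

@cache_page(60 * 15)  # cache the rendered page for 15 minutes
def book_list(request, subject_id):
    books = Book.objects.filter(book_subject_id=subject_id)[:50]
    return render(request, "books/list.html", {"books": books})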
You can use a server-side DataTable. If you implement server-side processing for the DataTable, you will be able to handle more than 4 million records with responses in less than a second.
