This is simply a question of curiosity. I have a script that loads a specific queryset without evaluating it, and then prints count(). I understand that count() has to go through the table, so depending on its size it could take some time, but it took over a minute to return 0 as the count of an empty queryset. Why is that taking so long? Is it Django or my server?
Notes:
The queryset was all of one type.
It all depends on the query that you're running. If you're running a SELECT COUNT(*) FROM foo on a table that has ten rows, it's going to be very fast; but if your query involves a dozen joins, sub-selects, or filters on un-indexed columns, or if the target table simply has a lot of rows, the query can take an arbitrary amount of time. In all likelihood, the bottleneck is not Django (although its ORM has some quirks), but rather the database and your query. Just because no rows meet the criteria doesn't mean that the database didn't need to examine the other rows in the table.
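If you want to see where the time actually goes, you can inspect the SQL Django sends and ask the database for its plan. A minimal sketch, assuming DEBUG=True (so Django records queries), a placeholder model MyModel, and Django 2.1+ for QuerySet.explain():

from django.db import connection, reset_queries

reset_queries()
qs = MyModel.objects.filter(some_field="value")  # hypothetical filter
print(qs.count())                                # the COUNT(*) query runs here
print(connection.queries[-1]["sql"])             # the exact SQL Django sent
print(connection.queries[-1]["time"])            # how long the database took
print(qs.explain())                              # the database's plan for the SELECT

Comparing the reported time with the EXPLAIN output usually makes it obvious whether the database is scanning far more rows than the result suggests.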
I have two Django models joined using a one-to-one link. Django generated this query to sort and limit (PER_PAGE=20) objects in the ChangeList:
SELECT
"test_model_object"."id",
"test_model_object"."sorting_field_asc",
"test_model_object"."sorting_field_desc",
... about fifteen fields ...
"test_model_one_to_one_element"."id",
"test_model_one_to_one_element"."number"
FROM "test_model_object"
INNER JOIN "test_model_one_to_one_element"
ON ("test_model_object"."test_model_one_to_one_element_id" = "test_model_one_to_one_element"."id")
ORDER BY "test_model_object"."sorting_field_asc" ASC, "test_model_object"."sorting_field_desc" DESC LIMIT 20;
But execution is very slow in PostgreSQL for 1.5 million objects (about six seconds). I assumed the uuid introduced some unwanted adjustments to the sorting process, but a test model with int indexes shows that's not it. What are some solutions (maybe PostgreSQL settings) to speed up this Django PostgreSQL query?
What indexes have you created on the PostgreSQL table?
execution is very slow in PostgreSQL for 1.5 million objects
That is to be expected on millions of records, yes.
This is only a problem if you need the operation to be fast. PostgreSQL and Django can't tell which operations you want to be fast, so apart from primary keys (and, in Django, foreign keys) no indexes are created by default.
So, because you've asked the question, we assume you would rather have that operation be faster. One way is to trade some speed when writing records for speed when querying them.
You can create an index on particular fields, as a way of trading slower write operations for faster query operations. The index will also be used when sorting with ORDER BY.
If it's as simple as wanting to create an index on one field of your Django model, the db_index field option specifies that.
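A minimal sketch of what that could look like for the model in the question (the field names are taken from the query above; the rest are assumptions). db_index=True adds a single-column index, and Meta.indexes can declare a composite index that matches the ORDER BY, including the descending direction:

from django.db import models

class TestModelObject(models.Model):
    sorting_field_asc = models.IntegerField(db_index=True)   # single-column index
    sorting_field_desc = models.IntegerField()
    test_model_one_to_one_element = models.OneToOneField(
        "TestModelOneToOneElement", on_delete=models.CASCADE
    )

    class Meta:
        indexes = [
            # Matches ORDER BY sorting_field_asc ASC, sorting_field_desc DESC,
            # so the LIMIT 20 can be served straight from the index.
            models.Index(fields=["sorting_field_asc", "-sorting_field_desc"]),
        ]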
I know that QuerySets are lazy and are evaluated only under certain conditions, to avoid hitting the database all the time.
What I don't know is whether taking a generic queryset (retrieving all the items) and then using it to construct a more refined queryset (adding a filter, for example) would lead to multiple SQL queries or not.
Example:
all_items = MyModel.objects.all()
subset1 = all_items.filter(**some_conditions)
subset2 = subset1.filter(**other_condition)
1) Would this create 3 different SQL queries?
Or does it all depend on whether the 3 variables are evaluated (for example, by iterating over them)?
2) Is this efficient, or would it be better to fetch all the items, then convert them into a list and filter them in Python?
1) If you evaluate only the final queryset subset2, then only one database query is executed, which is optimal.
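A quick way to see this for yourself, assuming DEBUG=True so Django records the queries it sends (MyModel and the filter conditions are placeholders):

from django.db import connection, reset_queries

reset_queries()
all_items = MyModel.objects.all()               # no query yet
subset1 = all_items.filter(active=True)         # still no query
subset2 = subset1.filter(name__startswith="a")  # still no query
print(len(connection.queries))                  # 0 - nothing has hit the database
items = list(subset2)                           # evaluation happens here
print(len(connection.queries))                  # 1 - a single SELECT with both filters combined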
2) Avoid premature optimization (that is, before measuring on a realistic amount of data, once most of the application code is written). You never know what the most important problem will be once the database gets bigger. For example, if you ask for a subset, the query is usually faster thanks to caching in the database. Memory is also in tension with other optimizations: maybe you won't be able to hold all the data in memory later, and users will access it only one page at a time. Clean, readable code is more important for later optimization than a 20% speedup that must be removed later in order to continue.
Other important paragraphs about (lazy) evaluation of queries:
When QuerySets are evaluated
QuerySets are lazy
Laziness in Django
I'm trying to store some measurement data into my postgresql db using Python Django.
So far so good: I've made a Docker container with Django, and another one with the PostgreSQL server.
However, I am getting close to 2M rows in my measurement table, and queries are starting to get really slow. I'm not really sure why; I'm not doing very intense queries.
This query
SELECT ••• FROM "measurement" WHERE "measurement"."device_id" = 26 ORDER BY "measurement"."measure_timestamp" DESC LIMIT 20
for example takes between 3 and 5 seconds to run, depending on which device I query.
I would expect this to run a lot faster, since I'm not doing anything fancy.
The measurement table
id INTEGER
measure_timestamp TIMESTAMP WITH TIME ZONE
sensor_height INTEGER
device_id INTEGER
with indices on id and measure_timestamp.
The server doesn't look too busy, and even though it only has 512MB of memory, I have plenty left during queries.
I configured the postgresql server with shared_buffers=256MB and work_mem=128MB.
The total database is just under 100MB, so it should easily fit.
If I run the query in PgAdmin, I'm seeing a lot of block I/O, so I suspect it has to read from disk, which is obviously slow.
Could anyone give me a few pointers in the right direction how to find the issue?
EDIT:
Added output of EXPLAIN ANALYZE for a query. I have now added an index on device_id, which helped a lot, but I would expect even quicker query times.
https://pastebin.com/H30JSuWa
Do you have indexes on measure_timestamp and device_id? If the queries always take that form, you might also like multi-column indexes.
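If the table is managed by Django, a composite index matching that query could be declared roughly like this (the model layout and the Device foreign key are assumptions based on the question):

from django.db import models

class Measurement(models.Model):
    measure_timestamp = models.DateTimeField()
    sensor_height = models.IntegerField()
    device = models.ForeignKey("Device", on_delete=models.CASCADE)

    class Meta:
        db_table = "measurement"
        indexes = [
            # Matches WHERE device_id = ? ORDER BY measure_timestamp DESC LIMIT 20,
            # so PostgreSQL can read the newest 20 rows for a device straight from the index.
            models.Index(fields=["device", "-measure_timestamp"]),
        ]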
Please look at the distribution key of your table. It is possible that the data is sparsely distributed, which affects performance. Selecting a proper distribution key is very important when you have 2M records. For more details, read this on why the distribution key is important.
I'm trying to think of an algorithm to solve a problem I have. It's not a homework problem, but something for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and adds on the order of 10^2 new rows every day.
Table B has on the order of 10^6 rows and adds about 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for each row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like a job that runs every ~10 minutes and does this: for every row in A, find every row in B related to it that was created in the last day, week and month (and then sort by count), and save them in a different DB or cache them.
If this is confusing, here's a practical example: say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with the most reviews in the last 4 hours, day, week, etc. New products and reviews are added at a fast pace, and we'd like that list to be as up to date as possible.
The current implementation I have is just a for loop (pseudo-code):

result = {}
for product in db_products:
    # db_reviews stands for the query fetching reviews newer than some_time
    reviews = db_reviews(product_id=product.id, created__gte=some_time)
    result[product.id] = {
        "reviews": reviews,
        "reviews_count": len(reviews),
    }
# sort products by review count, highest first
result = sorted(result.values(), key=lambda entry: entry["reviews_count"], reverse=True)
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
You have two fairly large tables in a database, you need to regularly create aggregates for past time periods (hour, day, week etc.), and you want to store the results in another database.
I will assume that once a time period is past there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing dependent tasks together, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
Write a simple Task class, which defines the required input data, the output data (called a Target), and the process that creates the target output (a minimal sketch follows this list).
Tasks can be parametrized; a typical parameter is a time period (a specific day, hour, week, etc.).
Luigi can stop tasks in the middle and start them again later. It will consider any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to make it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and moved to the final destination only after they are complete.
In databases there are structures to denote that a database import has been completed.
You are free to create your own target implementations (a target has to create something and provide an exists method to check that the result exists).
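As a concrete illustration of the Task/Target idea, here is a minimal sketch; the compute_review_counts helper is hypothetical and stands for whatever query produces your per-product counts:

import datetime
import luigi

class AggregateReviews(luigi.Task):
    # One task instance per day; Luigi reruns it only if the target is missing.
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # The task counts as done as soon as this file exists.
        return luigi.LocalTarget("aggregates/reviews_{:%Y-%m-%d}.csv".format(self.date))

    def run(self):
        rows = compute_review_counts(self.date)  # hypothetical DB query helper
        with self.output().open("w") as out:     # LocalTarget writes via a temp file, then moves it
            for product_id, count in rows:
                out.write("{},{}\n".format(product_id, count))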
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records into a PostgreSQL database. The target will automatically create a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SQLAlchemy, which should cover the database you use. See the class luigi.contrib.sqla.CopyToTable.
The Luigi documentation has a working example of importing data into an SQLite database.
A complete implementation is beyond what is feasible in a StackOverflow answer, but I am sure you will experience the following:
The code to do the task is really clear: no boilerplate, you write only what has to be done.
Nice support for working with time periods, even from the command line; see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating so many tasks that they could overload your servers (the default values are set very reasonably and can be changed).
The option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi, and I have also written tasks that import aggregated data into a database, so I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from too long an execution time for the database query, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query instead; it is often much faster. It should be possible to write an SQL query that counts exactly the records you need and returns the number directly. With GROUP BY you can even get the summary for all products in one run (see the sketch after this list).
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might happen that with an optimized SQL query you get a working solution even without Luigi.
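In Django ORM terms, the count-per-product query could look roughly like the sketch below. Product and Review are the hypothetical models from the example, with Review having a product foreign key and a created timestamp; the counting and sorting then happen inside the database in a single query instead of one query per product.

from datetime import timedelta

from django.db.models import Count
from django.utils import timezone

since = timezone.now() - timedelta(days=1)
top_products = (
    Product.objects
    .filter(review__created__gte=since)       # join only to reviews from the last day
    .annotate(reviews_count=Count("review"))  # GROUP BY product, COUNT(review)
    .order_by("-reviews_count")               # most-reviewed first
)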
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating summary tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
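As a sketch of that idea for PostgreSQL (ON CONFLICT ... DO UPDATE is the PostgreSQL counterpart of MySQL's ON DUPLICATE KEY UPDATE; the review_summary table and its unique (day, product_id) constraint are assumptions), something like this could run in the code path that inserts a new review:

from django.db import connection

def bump_summary(day, product_id):
    # Increment the per-day, per-product review count, inserting the row if it is new.
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO review_summary (day, product_id, review_count)
            VALUES (%s, %s, 1)
            ON CONFLICT (day, product_id)
            DO UPDATE SET review_count = review_summary.review_count + 1
            """,
            [day, product_id],
        )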
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is also indexed, since you are using it in your WHERE clause.
I'm working on a project that allows users to enter SQL queries with parameters. Each SQL query will be executed on a schedule they decide (say every 2 hours for 6 months), and they then get the results back at their email address.
They'll get it in the form of an HTML-email message, so what the system basically does is run the queries, and generate HTML that is then sent to the user.
I also want to save those results, so that a user can go on our website and look at previous results.
My question is - what data do I save?
Do I save the SQL query with those parameters (i.e. the date parameters, so the user can see the results relevant to that specific date)? This means that when the user clicks on this specific result, I need to execute the query again.
Or do I save the HTML that was generated back then, and simply display it when the user wishes to see this result?
I'd appreciate it if somebody would explain the pros and cons of each solution, and which one is considered the best & the most efficient.
The archive will probably be 1-2 months old, and I can't really predict the amount of rows each query will return.
Thanks!
Specifically regarding retrieving the results from queries that have been run previously, I would suggest saving the results so they can be viewed later, rather than running the queries again and again. The main benefits of this approach are:
You save unnecessary computational work re-running the same queries;
You guarantee that the result set will be the same as the original report. For example, if you save just the SQL, the records queried may have changed since the query was last run, or records may have been added or deleted.
The disadvantage of this approach is that it will probably use more disk space, but this is unlikely to be an issue unless you have queries returning millions of rows (in which case HTML is probably not such a good idea anyway).
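If you go the save-the-results route, a minimal sketch of what could be stored per run (the model and field names are only an illustration; JSONField requires Django 3.1+):

from django.db import models

class ReportRun(models.Model):
    # The query and its parameters, kept so the report can be re-run on demand.
    sql = models.TextField()
    parameters = models.JSONField(default=dict)
    # The rendered result that was emailed, so old runs display exactly as sent.
    rendered_html = models.TextField()
    executed_at = models.DateTimeField(auto_now_add=True)

Storing both the parameters and the rendered HTML gives you the snapshot behaviour by default, while still letting a user re-run the query if they want fresh data.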
If I were to create this type of application, then:
I would have some common queries, like get by current date, current time, date ranges, time ranges, and others based on my application, for the user to select easily.
Some autocompletion for common keywords.
If the data changes frequently there is no point in saving the HTML; generating a fresh one is the better option.
The crucial difference is that if the data changes, a new query will return a different result from what was saved some time ago, so you have to decide whether the user should get up-to-date data or a snapshot of what the data used to be.
If the relevant data does not change, it's a matter of how expensive the queries are, how many users will run them and how often; you may then decide to save the results instead of re-running the queries, to improve performance.