I just traced back performance issues to query optimization in sqlite3 (I'm using the Python interface but I'm not sure it's relevant, the problem is probably about the sqlite API).
I describe such an example below.
My table looks like this with millions of records:
CREATE TABLE articles (
id integer primary key,
title text,
)
CREATE TABLE "tags" (
tag text,
article_id integer not null,
foreign key(article_id) references articles(id),
)
Here is the request I want to run:
SELECT tag
FROM "tags" AS t1
INNER JOIN
(
SELECT id
FROM "articles"
LIMIT (?)
) t2
ON t1.article_id = t2.id
I observed that if the limit parameter is hardcoded, the query is way faster (0.01s vs 15s).
Using set_trace_callback, I saw that the query was exactly the same.
However, explain didn't give the same result.
I concluded that the SQLite query optimizer runs before parameter replacement.
Indeed, if I modify the query to be
SELECT t1.tag FROM
(
SELECT id
FROM "articles"
LIMIT (?)
) as t2 CROSS JOIN
"tags" as t1
WHERE
t1.article_id = t2.id
it works with the same speed as hardcoded limit, as CROSS JOIN allows to control the order of the for loops (link).
Thus, I conclude that the query optimizer does not run the nested loop in the same order.
In this toy example, there is an obvious order that is better. However, in other cases, I might want the query optimizer to run after parameter replacement.
Is there a way to do it while still using the safe methods that prevent SQL injection?
The trivial solution would be to validate that limits are integers and use simple string replacement.
Related
This question is a bit related to another question: Get List of Primary Key Columns in Snowflake.
Since INFORMATION_SCHEMA.COLUMNS does not provide the required information regarding the primary keys. And the method proposed by Snowflake itself, where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel.
I was thinking about using SHOW PRIMARY KEYs IN DATABASE. This works great when querying the database from within Snowflake. But as soon as I try to do this in python, I get results for the column name like 'Built-in function id'. Which is not useful when dynamically generating sql statements.
The code I am using is as follows:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE;"
snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE,role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor
primary_keys = curs.execute(SQL_PK).fetchall()
curs.close()
snowflake_service.connection.close()
Is there something I am doing wrong? Is it even possible to do it like this?
Or is the solution that Snowflake provides reliable enough, when sending these queries as one string? Although with many tables, there will be many round trips required to get all the data needed.
where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel
You could search for specific query run using information_schema.query_history_by_session and then refer to resultset using retrieved QUERY_ID.
SHOW PRIMARY KEYS IN DATABASE;
-- find the newest occurence of `SHOW PRIMARY KEYS`:
SET queryId = (SELECT QUERY_ID
FROM TABLE(information_schema.query_history_by_session())
WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
ORDER BY ENDTIME DESC LIMIT 1);
SELECT * FROM TABLE(RESULT_SCAN($queryId));
I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using an loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way do accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
pkey bigint autoincrement primary_key
make varchar(255) not_null
Table: models
pkey bigint autoincrement primary_key
make_key bigint not null
model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. I have found no straightforward way like LAST_INSERT_ID, but that works only for a single-row insert.
So, I have developed the following to do a batch of "normalization" inserts. This is where you a have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: First a batch insert of the "new" strings. Then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I've tried to recreate it with SQLAlchemy this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT
values.`Serial` AS `values_Serial`,
values.`DatumOrig` AS `values_DatumOrig`,
values.`Glucose` AS `values_Glucose`
FROM values)
AS anon_1
Which does not call the count function inline but wraps the select distinct into a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
First off, I think that COUNT(DISTINCT) supporting more than 1 expression is a MySQL extension. You can kind of achieve the same in for example PostgreSQL with ROW values, but the behaviour is not the same regarding NULL. In MySQL if any of the value expressions evaluate to NULL, the row does not qualify. That also leads to what is different between the two queries in the question:
If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify or in other words does not count.
COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.
Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
Producing the multi valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function and implemented closer to the SQL standard. It only accepts a single expression as its (optional) positional argument and the same goes for distinct(). If all else fails, you can always revert to text() fragments, like in this case:
# NOTE: text() fragments are included in the query as is, so if the text originates
# from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).\
select_from(Value).\
scalar()
which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:
from itertools import count
from sqlalchemy import func, distinct as _distinct
def _comma_list(exprs):
# NOTE: Magic number alert, the precedence value must be large enough to avoid
# producing parentheses around the "comma list" when passed to distinct()
ps = count(10 + len(exprs), -1)
exprs = iter(exprs)
cl = next(exprs)
for p, e in zip(ps, exprs):
cl = cl.op(',', precedence=p)(e)
return cl
def distinct(*exprs):
return _distinct(_comma_list(exprs))
session.query(func.count(distinct(
Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()
My application is very database intensive so I'm trying to reduce the load on the database. I am using PostgreSQL as rdbms and python is the programming language.
To reduce the load I am already using a caching mechanism in the application. The caching type I used is a server cache, browser cache.
Currently I'm tuning the PostgreSQL query cache to get it in line with the characteristics of queries being run on the server.
Questions:
Is it possible to fine tune query cache on a per database level?
Is it possible to fine tune query cache on a per table basis?
please provide tutorial to learn query cache in PostgreSQL.
Tuning PostgreSQL is far more than just tuning caches. In fact, the primary high level things are "shared buffers" (think of this as the main data and index cache), and the work_mem.
The shared buffers help with reading and writing. You want to give it a decent size, but it's for the entire cluster.. and you can't really tune it on a per table or especially query basis. Importantly, it's not really storing query results.. it's storing tables, indexes and other data. In an ACID compliant database, it's not very efficient or useful to cache query results.
The "work_mem" is used to sort query results in memory and not have to resort to writing to disk.. depending on your query, this area could be as important as the buffer cache, and easier to tune. Before running a query that needs to do a larger sort, you can issue the set command like "SET work_mem = '256MB';"
As others have suggested you can figure out WHY a query is running slowly using "explain". I'd personally suggest learning the "access path" postgresql is taking to get to your data. That's far more involved and honestly a better use of resources than simply thinking of "caching results".
You can honestly improve things a lot with data design as well and using features such as partitioning, functional indexes, and other techniques.
One other thing is that you can get better performance by writing better queries.. things like "with" clauses can prevent postgres' optimizer from optimizing queries fully.
The optimizer itself also has parameters that can be adjusted-- so that the DB will spend more (or less) time optimizing a query prior to executing it.. which can make a difference.
You can also use certain techniques to write queries to help the optimizer. One such technique is to use bind variables (colon variables)--- this will result in the optimizer getting the same query over and over with different data passed in. This way, the structure doesn't have to be evaluated over and over.. query plans can be cached in this way.
Without seeing some of your queries, your table and index designs, and an explain plan, it's hard to make specific recommendation.
In general, you need to find queries that aren't as performant as you feel they should be and figure out where the contention is. Likely it's disk access, however,the cause is ultimately the most important part.. is it having to go to disk to hold the sort? Is it internally choosing a bad path to get to the data, such that it's reading data that could easily be eliminated earlier in the query process... I've been an oracle certified DBA for over 20 years, and PostgreSQL is definitely different, however, many of the same techniques are used when it comes to diagnosing a query's performance issues. Although you can't really provide hints, you can still rewrite queries or tune certain parameters to get better performace.. in general, I've found postgresql to be easier to tune in the long run. If you can provide some specifics, perhaps a query and explain info, I'd be happy to give you specific recommendations. Sadly, though, "cache tuning" is likely to provide you the speed you're wanting all on its own.
I developed a system for caching results, to speed-up results queried from a web-based solution. I reproduced below in essence what it did:
The following are the generic caching handling tables and functions.
CREATE TABLE cached_results_headers (
cache_id serial NOT NULL PRIMARY KEY,
date timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_access timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
relid regclass NOT NULL,
query text NOT NULL,
rows int NOT NULL DEFAULT 0
);
CREATE INDEX ON cached_results_headers (relid, md5(query));
CREATE TABLE cached_results (
cache_id int NOT NULL,
row_no int NOT NULL
);
CREATE OR REPLACE FUNCTION f_get_cached_results_header (p_cache_table text, p_source_relation regclass, p_query text, p_max_lifetime interval, p_clear_old_data interval) RETURNS cached_results_headers AS $BODY$
DECLARE
_cache_id int;
_rows int;
BEGIN
IF p_clear_old_data IS NOT NULL THEN
DELETE FROM cached_results_headers WHERE date < CURRENT_TIMESTAMP - p_clear_old_data;
END IF;
_cache_id := cache_id FROM cached_results_headers WHERE relid = p_source_relation AND md5(query) = md5(p_query) AND query = p_query AND date > CURRENT_TIMESTAMP - p_max_lifetime;
IF _cache_id IS NULL THEN
INSERT INTO cached_results_headers (relid, query) VALUES (p_source_relation, p_query) RETURNING cache_id INTO _cache_id;
EXECUTE $$ INSERT INTO $$||p_cache_table||$$ SELECT $1, row_number() OVER (), r.r FROM ($$||p_query||$$) r $$ USING _cache_id;
GET DIAGNOSTICS _rows = ROW_COUNT;
UPDATE cached_results_headers SET rows = _rows WHERE cache_id = _cache_id;
ELSE
UPDATE cached_results_headers SET last_access = CURRENT_TIMESTAMP;
END IF;
RETURN (SELECT h FROM cached_results_headers h WHERE cache_id = _cache_id);
END;
$BODY$ LANGUAGE PLPGSQL SECURITY DEFINER;
The following is an example of how to use the tables and functions above, for a given view named my_view with a field key to be selected within a range of integer values. You would replace all the following with your particular needs, and replace my_view with either a table, a view, or a function. Also replace the filtering parameters as required.
CREATE VIEW my_view AS SELECT ...; -- create a query with your data, with one of the integer columns in the result as "key" to filter by
CREATE TABLE cached_results_my_view (
row my_view NOT NULL,
PRIMARY KEY (cache_id, row_no),
FOREIGN KEY (cache_id) REFERENCES cached_results_headers ON DELETE CASCADE
) INHERITS (cached_results);
CREATE OR REPLACE FUNCTION f_get_my_view_cached_rows (p_filter1 int, p_filter2 int, p_row_from int, p_row_to int) RETURNS SETOF my_view AS $BODY$
DECLARE
_cache_id int;
BEGIN
_cache_id := cache_id
FROM f_get_cached_results_header('cached_results_my_view', 'my_view'::regclass,
'SELECT r FROM my_view r WHERE key BETWEEN '||p_filter1::text||' AND '||p_filter2::text||' ORDER BY key',
'15 minutes'::interval, '1 day'::interval); -- cache for 15 minutes max since creation time; delete all cached data older than 1 day old
RETURN QUERY
SELECT (row).*
FROM cached_results_my_view
WHERE cache_id = _cache_id AND row_no BETWEEN p_row_from AND p_row_to
ORDER BY row_no;
END;
$BODY$ LANGUAGE PLPGSQL;
Example: Retrieve rows from 1 to 2000 from cached my_view results filtered by key BETWEEN 30044 AND 10610679. Run a first time and the results of the query will be cached into table cached_results_my_view, and the first 2000 records will be returned. Run it again shortly after and the results will be retrieved from the table cached_results_my_view directly without executing the query.
SELECT * FROM f_get_my_view_cached_rows(30044, 10610679, 1, 2000);
In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.