I am trying to load many millions of data records, from multiple distinct sources, into a PostgreSQL table with the following design:
CREATE TABLE public.variant_fact (
variant_id bigint NOT NULL,
ref_allele text NOT NULL,
allele text NOT NULL,
variant_name text NOT NULL,
start bigint,
stop bigint,
variant_attributes jsonb
);
ALTER TABLE public.variant_fact
ADD CONSTRAINT variant_fact_unique UNIQUE (variant_name, start, stop, allele, ref_allele)
INCLUDE (ref_allele, allele, variant_name, start, stop);
Where "start" and "stop" are foreign keys and "variant_id" is an auto-incrementing primary key. I am running into issues with the loading speed because in order to perform the UPSERT, I need to check the table to see whether an element exists for each element I upload. I am performing the operation in python using psycopg2 using the execute_values method.
insert_query = """
INSERT INTO variant_fact AS v (variant_id, ref_allele, allele, variant_name, start, stop, variant_attributes)
VALUES %s
ON CONFLICT ON CONSTRAINT variant_fact_unique DO UPDATE
SET variant_attributes = excluded.variant_attributes || v.variant_attributes
RETURNING variant_id;
"""
inserted = psycopg2.extras.execute_values(cur=cursor, sql=insert_query, argslist=argslist, template=None, page_size=50000, fetch=fetch)
In my case, argslist is a list of tuples to insert to the database. I have tried to milk this python script for speed, but this UPSERT block is not very performant. Outside of a different schema (perhaps without atomic element records), are there any ways to boost performance for upload? I have already turned off WAL for the table and removed the foreign key constraints for "start" and "stop". Am I missing anything obvious here?
Sorting argslist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already cached pages. Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache-friendly way (although it won't stay clustered very well in the face of new data).
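As a rough sketch of that idea (assuming argslist holds tuples in the same column order as the INSERT, so variant_name and start sit at positions 3 and 4, zero-based):
# Sort so consecutive rows touch neighbouring index pages; map a NULL start
# to a sentinel so Python 3 can compare the keys.
argslist.sort(key=lambda row: (row[3], row[4] if row[4] is not None else -1))

inserted = psycopg2.extras.execute_values(
    cur=cursor, sql=insert_query, argslist=argslist,
    template=None, page_size=50000, fetch=True)  # fetch=True because of RETURNING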
Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.
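If you rebuild the constraint without the INCLUDE list, a minimal one-off sketch would look like this (run it outside the load, on an otherwise idle table; conn is assumed to be the psycopg2 connection behind cursor):
cursor.execute("""
    ALTER TABLE public.variant_fact DROP CONSTRAINT variant_fact_unique;
    ALTER TABLE public.variant_fact
        ADD CONSTRAINT variant_fact_unique
        UNIQUE (variant_name, start, stop, allele, ref_allele);
""")
conn.commit()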
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only ways to speed it up are:
drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall
make sure you have fast storage and enough RAM
Related
I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.
I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:
Add new records to the database table
Compare the old set of data with the new set of data and update the table where necessary
Keep the old records, and either add an "expired" flag, or an "db_expire_date" to keep track of whether a record is active or expired
Add any new columns of data to the database for all records
I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE. But I'm having issues with (2) and (3). I figured out how to update a record, using the following code:
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
cursor.execute("""CREATE TEMP TABLE temptable (""" + schema_list + """) ON COMMIT DROP""")
cursor.executemany("""INSERT INTO temptable (""" + var +""") VALUES ("""+ perc_s + """)""", rows)
cursor.execute("""
UPDATE tracking.test_table
SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
FROM temptable
WHERE temptable.app_id = tracking.test_table.app_id;
""");
cursor.rowcount
conn.commit()
cursor.close()
conn.close()
However, this just updated the record based on the app_id as the primary key.
What I'd like to figure out is how to keep the original record and set it as "expired" and then create a new, updated record. It seems that "app_id" shouldn't be my primary key, so I've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.
I'm just not sure where to go from here. I think that I could probably just use INSERT INTO to send the new records to the database. But I'm not sure how to "expire" the old records that way. Possibly I could use UPDATE to set the older values to "expired". But I am wondering if there is a more straightforward way to do this.
I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks
A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).
As you add your record to the table you do 3 things:
The New record's from-effective-date gets the transaction file's date
The New record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
The most recent prior record (the one you compared values for changes) has its to-effective-date Updated to the transaction file's date minus one day. This has the effect of expiring the old record.
This creates a chain of records with the same source primary key with each one covering a non-overlapping time period. This format is surprisingly easy to select from:
If you want to reproduce the most current transaction file you select Where to-effective-date > Current Date
If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date.
If you want the entire update history for a key you select * Where the key = mykeyvalue Order By from-effective-date.
The only thing that is ugly about this scheme is when columns are added, the comparison test also must be altered to include those new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection meta data for each column in the table, but Python will need to know how comparing a text field might be different from comparing a BLOB, for example.
If you actually care about having a primary key (many data warehouses do not have primary keys) you can define a compound key on the source primary key + one of those effective dates, it doesn't really matter which one.
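For illustration only, here is a rough psycopg2 sketch of the append-and-expire step described above. It skips the column-by-column change comparison, and the effective-date column names are assumptions layered on the question's tracking.test_table:
def append_and_expire(conn, record, file_date):
    # record is a dict of column values; file_date is the transaction file's date
    with conn.cursor() as cur:
        # Expire the most recent prior record for this source key.
        cur.execute("""
            UPDATE tracking.test_table
               SET to_effective_date = %s::date - 1
             WHERE app_id = %s
               AND to_effective_date = DATE '9999-12-31'
        """, (file_date, record["app_id"]))
        # Append the new version with an open-ended to-effective-date.
        cur.execute("""
            INSERT INTO tracking.test_table (app_id, mfg, from_effective_date, to_effective_date)
            VALUES (%s, %s, %s, DATE '9999-12-31')
        """, (record["app_id"], record["mfg"], file_date))
    conn.commit()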
You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.
This means that you're spot on that you need to change your primary key to be more inclusive. Your new identity column doesn't actually help you tell the two versions of a row apart once both are in there, unless you already know which one you are looking for.
I can think of two likely candidates to add to your natural key: date, or batch.
Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.
If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.
Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.
I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/
You do NOT want this:
CREATE TABLE test_table (
app_id INTEGER PRIMARY KEY,
date DATE,
active BOOLEAN
);
Instead, you want this:
CREATE TABLE test_table (
app_id INTEGER,
date DATE,
active BOOLEAN,
PRIMARY KEY (app_id, date)
);
I've added an active column here as well, since you wanted to deactivate rows. This isn't explicitly necessary from what you've described though - you can always assume the most recent upload is active. Or you can expand the columns to have a "active_start" date and an "active_end" date, which will enable another set of queries. But for what you've stated here so far, just the date column should suffice. :)
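If you do go without the active flag and just treat the newest date per app_id as current, a query along these lines would fetch the active rows (PostgreSQL's DISTINCT ON; names taken from the example table):
cursor.execute("""
    SELECT DISTINCT ON (app_id) *
      FROM test_table
     ORDER BY app_id, date DESC
""")
current_rows = cursor.fetchall()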
For step 2)
First, you have to identify whether a record with the same data already exists. For this you can run a SELECT with a WHERE clause before inserting each record and count the number of rows you receive as output. If the count is more than 0, don't insert the record; otherwise insert it.
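A minimal psycopg2 sketch of that check; the compared columns are placeholders, and in practice you would compare every column you care about:
cursor.execute(
    "SELECT COUNT(*) FROM tracking.test_table WHERE app_id = %s AND mfg = %s",
    (record["app_id"], record["mfg"]))
if cursor.fetchone()[0] == 0:
    cursor.execute(
        "INSERT INTO tracking.test_table (app_id, mfg) VALUES (%s, %s)",
        (record["app_id"], record["mfg"]))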
For step 3)
For this, you can add a column named 'db_expire_date', as you mention above, and set the expiration value at the time the record is inserted.
You can also use a column like 'is_expire', but for that you need to add a cron job that periodically updates the value of this column.
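The periodic job could be as small as a single UPDATE, for example (column names assumed):
cursor.execute("""
    UPDATE tracking.test_table
       SET is_expire = TRUE
     WHERE db_expire_date < CURRENT_DATE
       AND is_expire = FALSE
""")
conn.commit()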
This question is a bit related to another question: Get List of Primary Key Columns in Snowflake.
INFORMATION_SCHEMA.COLUMNS does not provide the required information regarding the primary keys, and the method proposed by Snowflake itself, where you describe the table and follow it with a RESULT_SCAN, is unreliable when queries are run in parallel.
I was thinking about using SHOW PRIMARY KEYs IN DATABASE. This works great when querying the database from within Snowflake. But as soon as I try to do this in python, I get results for the column name like 'Built-in function id'. Which is not useful when dynamically generating sql statements.
The code I am using is as follows:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE;"
snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE,role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor
primary_keys = curs.execute(SQL_PK).fetchall()
curs.close()
snowflake_service.connection.close()
Is there something I am doing wrong? Is it even possible to do it like this?
Or is the solution that Snowflake provides reliable enough, when sending these queries as one string? Although with many tables, there will be many round trips required to get all the data needed.
where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel
You could search for the specific query run using information_schema.query_history_by_session and then refer to its result set using the retrieved QUERY_ID.
SHOW PRIMARY KEYS IN DATABASE;
-- find the newest occurrence of `SHOW PRIMARY KEYS`:
SET queryId = (SELECT QUERY_ID
FROM TABLE(information_schema.query_history_by_session())
WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
ORDER BY END_TIME DESC LIMIT 1);
SELECT * FROM TABLE(RESULT_SCAN($queryId));
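Run from Python, the same sequence might look like the sketch below, reusing the question's cursor. Everything has to run on the same session for the session variable and RESULT_SCAN to work, and depending on the connector's paramstyle the % wildcards may need escaping as %%:
curs = snowflake_service.cursor
curs.execute("SHOW PRIMARY KEYS IN DATABASE;")
curs.execute("""
    SET queryId = (SELECT QUERY_ID
                   FROM TABLE(information_schema.query_history_by_session())
                   WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
                   ORDER BY END_TIME DESC LIMIT 1);
""")
primary_keys = curs.execute("SELECT * FROM TABLE(RESULT_SCAN($queryId));").fetchall()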
I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way to accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
pkey bigint autoincrement primary_key
make varchar(255) not_null
Table: models
pkey bigint autoincrement primary_key
make_key bigint not null
model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = [(make,) for make in cars]  # executemany expects a sequence of parameter tuples
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sql_data)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
    for model in cars[car]:
        sql_data.append((this_key, model))
    this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. There is nothing as straightforward as LAST_INSERT_ID, and that works only for a single-row insert anyway.
So, I have developed the following to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
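Not from the answer itself, but as a rough Python sketch of the two-step normalization pattern it links to, using the question's tables (it assumes a UNIQUE index on makes.make so INSERT IGNORE can skip strings that already exist):
# Step 1: batch-insert any new makes, ignoring ones that already exist.
cursor.executemany("INSERT IGNORE INTO makes (make) VALUES (%s)",
                   [(make,) for make in cars])

# Step 2: read the ids back and build the child rows from that map.
cursor.execute("SELECT make, pkey FROM makes")
make_ids = dict(cursor.fetchall())

model_rows = [(make_ids[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)",
                   model_rows)
cursor.execute("COMMIT")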
My application is very database intensive so I'm trying to reduce the load on the database. I am using PostgreSQL as rdbms and python is the programming language.
To reduce the load I am already using a caching mechanism in the application. The caching types I use are a server-side cache and a browser cache.
Currently I'm tuning the PostgreSQL query cache to get it in line with the characteristics of queries being run on the server.
Questions:
Is it possible to fine tune query cache on a per database level?
Is it possible to fine tune query cache on a per table basis?
Please point me to a tutorial for learning about query caching in PostgreSQL.
Tuning PostgreSQL is far more than just tuning caches. In fact, the primary high level things are "shared buffers" (think of this as the main data and index cache), and the work_mem.
The shared buffers help with reading and writing. You want to give it a decent size, but it's for the entire cluster.. and you can't really tune it on a per table or especially query basis. Importantly, it's not really storing query results.. it's storing tables, indexes and other data. In an ACID compliant database, it's not very efficient or useful to cache query results.
The "work_mem" is used to sort query results in memory and not have to resort to writing to disk.. depending on your query, this area could be as important as the buffer cache, and easier to tune. Before running a query that needs to do a larger sort, you can issue the set command like "SET work_mem = '256MB';"
As others have suggested you can figure out WHY a query is running slowly using "explain". I'd personally suggest learning the "access path" postgresql is taking to get to your data. That's far more involved and honestly a better use of resources than simply thinking of "caching results".
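For example, you can look at the access path straight from psycopg2; EXPLAIN (ANALYZE, BUFFERS) actually runs the query, and the table and filter here are placeholders:
cursor.execute(
    "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_table WHERE some_col = %s",
    (some_value,))
for (line,) in cursor.fetchall():
    print(line)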
You can honestly improve things a lot with data design as well and using features such as partitioning, functional indexes, and other techniques.
One other thing is that you can get better performance by writing better queries.. things like "with" clauses can prevent postgres' optimizer from optimizing queries fully.
The optimizer itself also has parameters that can be adjusted-- so that the DB will spend more (or less) time optimizing a query prior to executing it.. which can make a difference.
You can also use certain techniques to write queries to help the optimizer. One such technique is to use bind variables (colon variables)--- this will result in the optimizer getting the same query over and over with different data passed in. This way, the structure doesn't have to be evaluated over and over.. query plans can be cached in this way.
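In PostgreSQL terms that means passing parameters rather than pasting values into the SQL text; with a server-side PREPARE the parsed statement (and eventually a generic plan) can be reused within the session. A small sketch with placeholder table and column names:
# Prepare once per session; each EXECUTE runs the same statement with new parameters.
cursor.execute("PREPARE get_orders (int) AS "
               "SELECT * FROM orders WHERE customer_id = $1")
for cid in (1, 2, 3):
    cursor.execute("EXECUTE get_orders (%s)", (cid,))
    rows = cursor.fetchall()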
Without seeing some of your queries, your table and index designs, and an explain plan, it's hard to make specific recommendation.
In general, you need to find queries that aren't as performant as you feel they should be and figure out where the contention is. Likely it's disk access; however, the cause is ultimately the most important part. Is it having to go to disk to hold the sort? Is it internally choosing a bad path to get to the data, such that it's reading data that could easily be eliminated earlier in the query process? I've been an Oracle certified DBA for over 20 years, and PostgreSQL is definitely different; however, many of the same techniques are used when it comes to diagnosing a query's performance issues. Although you can't really provide hints, you can still rewrite queries or tune certain parameters to get better performance; in general, I've found PostgreSQL to be easier to tune in the long run. If you can provide some specifics, perhaps a query and explain info, I'd be happy to give you specific recommendations. Sadly, though, "cache tuning" on its own is unlikely to provide the speed you're wanting.
I developed a system for caching results, to speed-up results queried from a web-based solution. I reproduced below in essence what it did:
The following are the generic caching handling tables and functions.
CREATE TABLE cached_results_headers (
cache_id serial NOT NULL PRIMARY KEY,
date timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_access timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
relid regclass NOT NULL,
query text NOT NULL,
rows int NOT NULL DEFAULT 0
);
CREATE INDEX ON cached_results_headers (relid, md5(query));
CREATE TABLE cached_results (
cache_id int NOT NULL,
row_no int NOT NULL
);
CREATE OR REPLACE FUNCTION f_get_cached_results_header (p_cache_table text, p_source_relation regclass, p_query text, p_max_lifetime interval, p_clear_old_data interval) RETURNS cached_results_headers AS $BODY$
DECLARE
_cache_id int;
_rows int;
BEGIN
IF p_clear_old_data IS NOT NULL THEN
DELETE FROM cached_results_headers WHERE date < CURRENT_TIMESTAMP - p_clear_old_data;
END IF;
_cache_id := cache_id FROM cached_results_headers WHERE relid = p_source_relation AND md5(query) = md5(p_query) AND query = p_query AND date > CURRENT_TIMESTAMP - p_max_lifetime;
IF _cache_id IS NULL THEN
INSERT INTO cached_results_headers (relid, query) VALUES (p_source_relation, p_query) RETURNING cache_id INTO _cache_id;
EXECUTE $$ INSERT INTO $$||p_cache_table||$$ SELECT $1, row_number() OVER (), r.r FROM ($$||p_query||$$) r $$ USING _cache_id;
GET DIAGNOSTICS _rows = ROW_COUNT;
UPDATE cached_results_headers SET rows = _rows WHERE cache_id = _cache_id;
ELSE
UPDATE cached_results_headers SET last_access = CURRENT_TIMESTAMP;
END IF;
RETURN (SELECT h FROM cached_results_headers h WHERE cache_id = _cache_id);
END;
$BODY$ LANGUAGE PLPGSQL SECURITY DEFINER;
The following is an example of how to use the tables and functions above, for a given view named my_view with a field key to be selected within a range of integer values. You would replace all the following with your particular needs, and replace my_view with either a table, a view, or a function. Also replace the filtering parameters as required.
CREATE VIEW my_view AS SELECT ...; -- create a query with your data, with one of the integer columns in the result as "key" to filter by
CREATE TABLE cached_results_my_view (
row my_view NOT NULL,
PRIMARY KEY (cache_id, row_no),
FOREIGN KEY (cache_id) REFERENCES cached_results_headers ON DELETE CASCADE
) INHERITS (cached_results);
CREATE OR REPLACE FUNCTION f_get_my_view_cached_rows (p_filter1 int, p_filter2 int, p_row_from int, p_row_to int) RETURNS SETOF my_view AS $BODY$
DECLARE
_cache_id int;
BEGIN
_cache_id := cache_id
FROM f_get_cached_results_header('cached_results_my_view', 'my_view'::regclass,
'SELECT r FROM my_view r WHERE key BETWEEN '||p_filter1::text||' AND '||p_filter2::text||' ORDER BY key',
'15 minutes'::interval, '1 day'::interval); -- cache for 15 minutes max since creation time; delete all cached data older than 1 day old
RETURN QUERY
SELECT (row).*
FROM cached_results_my_view
WHERE cache_id = _cache_id AND row_no BETWEEN p_row_from AND p_row_to
ORDER BY row_no;
END;
$BODY$ LANGUAGE PLPGSQL;
Example: Retrieve rows 1 to 2000 from the cached my_view results filtered by key BETWEEN 30044 AND 10610679. Run it a first time and the results of the query will be cached into the table cached_results_my_view, and the first 2000 records will be returned. Run it again shortly after and the results will be retrieved from the table cached_results_my_view directly, without executing the query.
SELECT * FROM f_get_my_view_cached_rows(30044, 10610679, 1, 2000);
In our system, we have 1000+ tables, each of which has a 'date' column containing DateTime objects. I want to get a list containing every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either PostgreSQL or SQLAlchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with SQLAlchemy. For each table, I created a SELECT DISTINCT on the 'date' column, then set that list of selects as the selects property of a CompoundSelect object, and executed it. As one might expect from an ugly brute-force query, it has been running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted distinct dates from each table, and the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
    (SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of each query into a list of all my dates, skipping any that were already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the psql console, but it was easier for me to save locally than to have to query the temp table in the db.
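Roughly, the parallel driver looked like the sketch below; the DSN, the table list, and the worker count are placeholders, and each worker runs the recursive query against one table before the results are merged into a single sorted set:
from concurrent.futures import ThreadPoolExecutor
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@host/dbname")  # placeholder DSN

QUERY = """
WITH RECURSIVE t AS (
    (SELECT date FROM {table} ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM {table} WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
"""

def dates_for(table):
    # Table names come from our own list, so plain string formatting is acceptable here.
    with engine.connect() as conn:
        return [row[0] for row in conn.execute(sqlalchemy.text(QUERY.format(table=table)))]

tables = ["schema.table_a", "schema.table_b"]  # placeholder: the 1000+ table names
with ThreadPoolExecutor(max_workers=8) as pool:
    all_dates = sorted(set(d for dates in pool.map(dates_for, tables) for d in dates))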