attach one unrelated row to MySQL query - python

I have a massive SQL query that returns a very large table of results.
How can I attach one entry to that result, with a simple different query, to skip extra access to the database?
e.g. for columns [a,b,c,d,e,f] say I get a result of 3 rows, I wish to add a row for only when e = 0, and bring its f value, in this example it's 9, ignoring all the other many conditions on the other columns, and leave the rest of the row NULL, the result should be something like that:
|a|b|c|d|e|f|
|1|2|3|4|5|6|
|6|5|4|3|2|1|
|1|3|4|2|5|6|
|N|N|N|N|0|9|

This would probably make more sense to do that in the application that consumes the data.
In MySQL, you could use union all to add the required record to your resultset (which you seem to be aware of, since you tag your question with union).
So something like:
select -- your massive query here
union all
select null, null, null, null, 0, null
union all will generate the record regardless of what the massive query returns. If there is a change that the massive query might also generate the same record, and you want to avoid duplicates, you can use union(but that means more work for your RDMBS, that will need to search both unioned resultsets for duplicates).

You might consider a UNION after your existing query
<your query>
UNION
SELECT 'N','N','N','N',0,'N'
Or if you really mean NULL
<your query>
UNION
SELECT NULL,NULL,NULL,NULL,0,NULL
So after clarification:
<your query>
UNION
SELECT NULL,NULL,NULL,NULL,e,f
FROM <table> t
WHERE t.e = 0
<may require other clauses in the where>

Related

Compare contents of 2 tables with same column names and insert/update on the basis of primary key in Postgres [duplicate]

Several months ago I learned from an answer on Stack Overflow how to perform multiple updates at once in MySQL using the following syntax:
INSERT INTO table (id, field, field2) VALUES (1, A, X), (2, B, Y), (3, C, Z)
ON DUPLICATE KEY UPDATE field=VALUES(Col1), field2=VALUES(Col2);
I've now switched over to PostgreSQL and apparently this is not correct. It's referring to all the correct tables so I assume it's a matter of different keywords being used but I'm not sure where in the PostgreSQL documentation this is covered.
To clarify, I want to insert several things and if they already exist to update them.
PostgreSQL since version 9.5 has UPSERT syntax, with ON CONFLICT clause. with the following syntax (similar to MySQL)
INSERT INTO the_table (id, column_1, column_2)
VALUES (1, 'A', 'X'), (2, 'B', 'Y'), (3, 'C', 'Z')
ON CONFLICT (id) DO UPDATE
SET column_1 = excluded.column_1,
column_2 = excluded.column_2;
Searching postgresql's email group archives for "upsert" leads to finding an example of doing what you possibly want to do, in the manual:
Example 38-2. Exceptions with UPDATE/INSERT
This example uses exception handling to perform either UPDATE or INSERT, as appropriate:
CREATE TABLE db (a INT PRIMARY KEY, b TEXT);
CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
LOOP
-- first try to update the key
-- note that "a" must be unique
UPDATE db SET b = data WHERE a = key;
IF found THEN
RETURN;
END IF;
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO db(a,b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;
END;
$$
LANGUAGE plpgsql;
SELECT merge_db(1, 'david');
SELECT merge_db(1, 'dennis');
There's possibly an example of how to do this in bulk, using CTEs in 9.1 and above, in the hackers mailing list:
WITH foos AS (SELECT (UNNEST(%foo[])).*)
updated as (UPDATE foo SET foo.a = foos.a ... RETURNING foo.id)
INSERT INTO foo SELECT foos.* FROM foos LEFT JOIN updated USING(id)
WHERE updated.id IS NULL;
See a_horse_with_no_name's answer for a clearer example.
Warning: this is not safe if executed from multiple sessions at the same time (see caveats below).
Another clever way to do an "UPSERT" in postgresql is to do two sequential UPDATE/INSERT statements that are each designed to succeed or have no effect.
UPDATE table SET field='C', field2='Z' WHERE id=3;
INSERT INTO table (id, field, field2)
SELECT 3, 'C', 'Z'
WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3);
The UPDATE will succeed if a row with "id=3" already exists, otherwise it has no effect.
The INSERT will succeed only if row with "id=3" does not already exist.
You can combine these two into a single string and run them both with a single SQL statement execute from your application. Running them together in a single transaction is highly recommended.
This works very well when run in isolation or on a locked table, but is subject to race conditions that mean it might still fail with duplicate key error if a row is inserted concurrently, or might terminate with no row inserted when a row is deleted concurrently. A SERIALIZABLE transaction on PostgreSQL 9.1 or higher will handle it reliably at the cost of a very high serialization failure rate, meaning you'll have to retry a lot. See why is upsert so complicated, which discusses this case in more detail.
This approach is also subject to lost updates in read committed isolation unless the application checks the affected row counts and verifies that either the insert or the update affected a row.
With PostgreSQL 9.1 this can be achieved using a writeable CTE (common table expression):
WITH new_values (id, field1, field2) as (
values
(1, 'A', 'X'),
(2, 'B', 'Y'),
(3, 'C', 'Z')
),
upsert as
(
update mytable m
set field1 = nv.field1,
field2 = nv.field2
FROM new_values nv
WHERE m.id = nv.id
RETURNING m.*
)
INSERT INTO mytable (id, field1, field2)
SELECT id, field1, field2
FROM new_values
WHERE NOT EXISTS (SELECT 1
FROM upsert up
WHERE up.id = new_values.id)
See these blog entries:
Upserting via Writeable CTE
WAITING FOR 9.1 – WRITABLE CTE
WHY IS UPSERT SO COMPLICATED?
Note that this solution does not prevent a unique key violation but it is not vulnerable to lost updates.
See the follow up by Craig Ringer on dba.stackexchange.com
In PostgreSQL 9.5 and newer you can use INSERT ... ON CONFLICT UPDATE.
See the documentation.
A MySQL INSERT ... ON DUPLICATE KEY UPDATE can be directly rephrased to a ON CONFLICT UPDATE. Neither is SQL-standard syntax, they're both database-specific extensions. There are good reasons MERGE wasn't used for this, a new syntax wasn't created just for fun. (MySQL's syntax also has issues that mean it wasn't adopted directly).
e.g. given setup:
CREATE TABLE tablename (a integer primary key, b integer, c integer);
INSERT INTO tablename (a, b, c) values (1, 2, 3);
the MySQL query:
INSERT INTO tablename (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
becomes:
INSERT INTO tablename (a, b, c) values (1, 2, 10)
ON CONFLICT (a) DO UPDATE SET c = tablename.c + 1;
Differences:
You must specify the column name (or unique constraint name) to use for the uniqueness check. That's the ON CONFLICT (columnname) DO
The keyword SET must be used, as if this was a normal UPDATE statement
It has some nice features too:
You can have a WHERE clause on your UPDATE (letting you effectively turn ON CONFLICT UPDATE into ON CONFLICT IGNORE for certain values)
The proposed-for-insertion values are available as the row-variable EXCLUDED, which has the same structure as the target table. You can get the original values in the table by using the table name. So in this case EXCLUDED.c will be 10 (because that's what we tried to insert) and "table".c will be 3 because that's the current value in the table. You can use either or both in the SET expressions and WHERE clause.
For background on upsert see How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
I was looking for the same thing when I came here, but the lack of a generic "upsert" function botherd me a bit so I thought you could just pass the update and insert sql as arguments on that function form the manual
that would look like this:
CREATE FUNCTION upsert (sql_update TEXT, sql_insert TEXT)
RETURNS VOID
LANGUAGE plpgsql
AS $$
BEGIN
LOOP
-- first try to update
EXECUTE sql_update;
-- check if the row is found
IF FOUND THEN
RETURN;
END IF;
-- not found so insert the row
BEGIN
EXECUTE sql_insert;
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing and loop
END;
END LOOP;
END;
$$;
and perhaps to do what you initially wanted to do, batch "upsert", you could use Tcl to split the sql_update and loop the individual updates, the preformance hit will be very small see http://archives.postgresql.org/pgsql-performance/2006-04/msg00557.php
the highest cost is executing the query from your code, on the database side the execution cost is much smaller
There is no simple command to do it.
The most correct approach is to use function, like the one from docs.
Another solution (although not that safe) is to do update with returning, check which rows were updates, and insert the rest of them
Something along the lines of:
update table
set column = x.column
from (values (1,'aa'),(2,'bb'),(3,'cc')) as x (id, column)
where table.id = x.id
returning id;
assuming id:2 was returned:
insert into table (id, column) values (1, 'aa'), (3, 'cc');
Of course it will bail out sooner or later (in concurrent environment), as there is clear race condition in here, but usually it will work.
Here's a longer and more comprehensive article on the topic.
I use this function merge
CREATE OR REPLACE FUNCTION merge_tabla(key INT, data TEXT)
RETURNS void AS
$BODY$
BEGIN
IF EXISTS(SELECT a FROM tabla WHERE a = key)
THEN
UPDATE tabla SET b = data WHERE a = key;
RETURN;
ELSE
INSERT INTO tabla(a,b) VALUES (key, data);
RETURN;
END IF;
END;
$BODY$
LANGUAGE plpgsql
Personally, I've set up a "rule" attached to the insert statement. Say you had a "dns" table that recorded dns hits per customer on a per-time basis:
CREATE TABLE dns (
"time" timestamp without time zone NOT NULL,
customer_id integer NOT NULL,
hits integer
);
You wanted to be able to re-insert rows with updated values, or create them if they didn't exist already. Keyed on the customer_id and the time. Something like this:
CREATE RULE replace_dns AS
ON INSERT TO dns
WHERE (EXISTS (SELECT 1 FROM dns WHERE ((dns."time" = new."time")
AND (dns.customer_id = new.customer_id))))
DO INSTEAD UPDATE dns
SET hits = new.hits
WHERE ((dns."time" = new."time") AND (dns.customer_id = new.customer_id));
Update: This has the potential to fail if simultaneous inserts are happening, as it will generate unique_violation exceptions. However, the non-terminated transaction will continue and succeed, and you just need to repeat the terminated transaction.
However, if there are tons of inserts happening all the time, you will want to put a table lock around the insert statements: SHARE ROW EXCLUSIVE locking will prevent any operations that could insert, delete or update rows in your target table. However, updates that do not update the unique key are safe, so if you no operation will do this, use advisory locks instead.
Also, the COPY command does not use RULES, so if you're inserting with COPY, you'll need to use triggers instead.
Similar to most-liked answer, but works slightly faster:
WITH upsert AS (UPDATE spider_count SET tally=1 WHERE date='today' RETURNING *)
INSERT INTO spider_count (spider, tally) SELECT 'Googlebot', 1 WHERE NOT EXISTS (SELECT * FROM upsert)
(source: http://www.the-art-of-web.com/sql/upsert/)
I custom "upsert" function above, if you want to INSERT AND REPLACE :
`
CREATE OR REPLACE FUNCTION upsert(sql_insert text, sql_update text)
RETURNS void AS
$BODY$
BEGIN
-- first try to insert and after to update. Note : insert has pk and update not...
EXECUTE sql_insert;
RETURN;
EXCEPTION WHEN unique_violation THEN
EXECUTE sql_update;
IF FOUND THEN
RETURN;
END IF;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION upsert(text, text)
OWNER TO postgres;`
And after to execute, do something like this :
SELECT upsert($$INSERT INTO ...$$,$$UPDATE... $$)
Is important to put double dollar-comma to avoid compiler errors
check the speed...
According the PostgreSQL documentation of the INSERT statement, handling the ON DUPLICATE KEY case is not supported. That part of the syntax is a proprietary MySQL extension.
I have the same issue for managing account settings as name value pairs.
The design criteria is that different clients could have different settings sets.
My solution, similar to JWP is to bulk erase and replace, generating the merge record within your application.
This is pretty bulletproof, platform independent and since there are never more than about 20 settings per client, this is only 3 fairly low load db calls - probably the fastest method.
The alternative of updating individual rows - checking for exceptions then inserting - or some combination of is hideous code, slow and often breaks because (as mentioned above) non standard SQL exception handling changing from db to db - or even release to release.
#This is pseudo-code - within the application:
BEGIN TRANSACTION - get transaction lock
SELECT all current name value pairs where id = $id into a hash record
create a merge record from the current and update record
(set intersection where shared keys in new win, and empty values in new are deleted).
DELETE all name value pairs where id = $id
COPY/INSERT merged records
END TRANSACTION
CREATE OR REPLACE FUNCTION save_user(_id integer, _name character varying)
RETURNS boolean AS
$BODY$
BEGIN
UPDATE users SET name = _name WHERE id = _id;
IF FOUND THEN
RETURN true;
END IF;
BEGIN
INSERT INTO users (id, name) VALUES (_id, _name);
EXCEPTION WHEN OTHERS THEN
UPDATE users SET name = _name WHERE id = _id;
END;
RETURN TRUE;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
For merging small sets, using the above function is fine. However, if you are merging large amounts of data, I'd suggest looking into http://mbk.projects.postgresql.org
The current best practice that I'm aware of is:
COPY new/updated data into temp table (sure, or you can do INSERT if the cost is ok)
Acquire Lock [optional] (advisory is preferable to table locks, IMO)
Merge. (the fun part)
UPDATE will return the number of modified rows. If you use JDBC (Java), you can then check this value against 0 and, if no rows have been affected, fire INSERT instead. If you use some other programming language, maybe the number of the modified rows still can be obtained, check documentation.
This may not be as elegant but you have much simpler SQL that is more trivial to use from the calling code. Differently, if you write the ten line script in PL/PSQL, you probably should have a unit test of one or another kind just for it alone.
Edit: This does not work as expected. Unlike the accepted answer, this produces unique key violations when two processes repeatedly call upsert_foo concurrently.
Eureka! I figured out a way to do it in one query: use UPDATE ... RETURNING to test if any rows were affected:
CREATE TABLE foo (k INT PRIMARY KEY, v TEXT);
CREATE FUNCTION update_foo(k INT, v TEXT)
RETURNS SETOF INT AS $$
UPDATE foo SET v = $2 WHERE k = $1 RETURNING $1
$$ LANGUAGE sql;
CREATE FUNCTION upsert_foo(k INT, v TEXT)
RETURNS VOID AS $$
INSERT INTO foo
SELECT $1, $2
WHERE NOT EXISTS (SELECT update_foo($1, $2))
$$ LANGUAGE sql;
The UPDATE has to be done in a separate procedure because, unfortunately, this is a syntax error:
... WHERE NOT EXISTS (UPDATE ...)
Now it works as desired:
SELECT upsert_foo(1, 'hi');
SELECT upsert_foo(1, 'bye');
SELECT upsert_foo(3, 'hi');
SELECT upsert_foo(3, 'bye');
PostgreSQL >= v15
Big news on this topic as in PostgreSQL v15, it is possible to use MERGE command. In fact, this long awaited feature was listed the first of the improvements of the v15 release.
This is similar to INSERT ... ON CONFLICT but more batch-oriented. It has a powerful WHEN MATCHED vs WHEN NOT MATCHED structure that gives the ability to INSERT, UPDATE or DELETE on such conditions.
It not only eases bulk changes, but it even adds more control that tradition UPSERT and INSERT ... ON CONFLICT
Take a look at this very complete sample from official page:
MERGE INTO wines w
USING wine_stock_changes s
ON s.winename = w.winename
WHEN NOT MATCHED AND s.stock_delta > 0 THEN
INSERT VALUES(s.winename, s.stock_delta)
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
UPDATE SET stock = w.stock + s.stock_delta
WHEN MATCHED THEN
DELETE;
PostgreSQL v9, v10, v11, v12, v13, v14
If version is under v15 and over v9.5 , probably best choice is to use UPSERT syntax, with ON CONFLICT clause
Here is the example how to do upsert with params and without special sql constructions
if you have special condition (sometimes you can't use 'on conflict' because you can't create constraint)
WITH upd AS
(
update view_layer set metadata=:metadata where layer_id = :layer_id and view_id = :view_id returning id
)
insert into view_layer (layer_id, view_id, metadata)
(select :layer_id layer_id, :view_id view_id, :metadata metadata FROM view_layer l
where NOT EXISTS(select id FROM upd WHERE id IS NOT NULL) limit 1)
returning id
maybe it will be helpful

sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "product_pkey" [duplicate]

I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

How to avoid explicit casting NULL during INSERT in Postgresql

I am writing python scripts to sychronize tables from a MSSQL database to a Postgresql DB. The original author tends to use super wide tables with a lot of regional consecutive NULL holes in them.
For insertion speed, I serialized the records in bulk to string in the following form before execute()
INSERT INTO A( {col_list} )
SELECT * FROM ( VALUES (row_1), (row_2),...) B( {col_list} )
During the row serialization, its not possbile to determin the data type of NULL or None in python. This makes the job complicated. All NULL values in timestamp columns, integer columns etc need explicit type cast into proper types, or Pg complains about it.
Currently I am checking the DB API connection.description property and compare column type_code, for every column and add type casting like ::timestamp as needed.
But this feels cumbersome, with the extra work: the driver already converted the data from text to proper python data type, now I have to redo it for column with those many Nones.
Is there any better way to work around this with elegancy & simplicity ?
If you don't need the SELECT, go with #Nick's answer.
If you need it (like with a CTE to use the input rows multiple times), there are workarounds depending on the details of your use case.
Example, when working with complete rows:
INSERT INTO A -- complete rows
SELECT * FROM (
VALUES ((NULL::A).*), (row_1), (row_2), ...
) B
OFFSET 1;
{col_list} is optional noise in this particular case, since we need to provide complete rows anyway.
Detailed explanation:
Casting NULL type when updating multiple rows
Instead of inserting from a SELECT, you can attach a VALUES clause directly to the INSERT, i.e.:
INSERT INTO A ({col_list})
VALUES (row_1), (row_2), ...
When you insert from a query, Postgres examines the query in isolation when trying to infer the column types, and then tries to coerce them to match the target table (only to find out that it can't).
When you insert directly from a VALUES list, it knows about the target table when performing the type inference, and can then assume that any untyped NULL matches the corresponding column.
You could try to create json from data and then rowset from json using json_populate_record(..).
postgres=# create table js_test (id int4, dat timestamp, val text);
CREATE TABLE
postgres=# insert into js_test
postgres-# select (json_populate_record(null::js_test,
postgres(# json_object(array['id', 'dat', 'val'], array['5', null, 'test']))).*;
INSERT 0 1
postgres=# select * from js_test;
id | dat | val
----+-----+------
5 | | test
You can use json_populate_recordset(..) to do the same with multiple rows in one go. You just pass json value that is array of json. Make sure it isn't array of json.
So this is OK: '[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]'::json
This is not: array['{"id":1,"dat":null,"val":6}'::json,'{"id":3,"val":"tst"}'::json]
select *
from json_populate_recordset(null::js_test,
'[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]')

MySQL: Set "diff" bit when insert different value on duplicate key?

I have a cronjob (J1) which calculate ~1M customers' product category preference every night. Most customers' preference are stable. But there are exceptions and there are new customers every day. I want to know these changes by setting a "diff" bit to 1. Then another cronjob (J2) can do something (e.g. send notification to them) on such customers and set them back to 0.
The table looks like:
CREATE TABLE customers (
customer_id VARCHAR(255),
preference VARCHAR(255),
diff TINYINT(1),
PRIMARY KEY (customer_id),
KEY (diff)
);
AFAIK, INSERT .. ON DUPLICATE KEY doesn't know about whether a non-key value is different. So you can't use something similar to the following, right?
INSERT customers AS ("sql for J1") ON DUPLICATE KEY
_AND_PREFERENCE_DIFFERS_ SET diff=1;
So what's the best way to do it?
a) Rename table customers to customer_yesterday. Create a new table customers by running J1. LEFT JOIN two tables and set diff bit of customers. (Pros: faster? Cons: need to handle all diffs correctly, e.g. cases when a customer doesn't present in today's output)
b) Loop through output of J1 (using python mysql connector), query customer by customer_id, and insert only when value is different or it's a new customer. (Pros: easy to understand logic; Cons: slow?)
Any better solutions?
Update:
As #Barmar asked, let's say sql for J1 is a transaction grouping sql, e.g.
SELECT
customer_id,
GROUP_CONCAT(DISTINCT product_category SEPARATOR ',')
FROM transaction
WHERE date between _30_days_ago_ and _today_;
Make SQL for J1 a query that uses a LEFT JOIN to filter out customers whose preference hasn't changed.
INSERT INTO customers (customer_id, preference)
SELECT t1.*
FROM (
SELECT customer_id,
GROUP_CONCAT(DISTINCT product_category ORDER BY product_category SEPARATOR ',') AS preference
FROM transaction
WHERE date BETWEEN _30_days_ago_ AND _today_) AS t1
LEFT JOIN customers AS c ON t1.customer_id = c.customer_id AND t1.preference = c.preference
WHERE t1.customer_id IS NULL
ON DUPLICATE KEY UPDATE preference = VALUES(preference), diff = 1
I've added an ORDER BY option to GROUP_CONCAT so that it will always return the categoris in a consistent order. Otherwise, it may result in false positives when the order changes.
I feel obliged to point out that storing comma-separated values in a table column is generally poor database design. You should use a many-to-many relationship table instead.

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.

Categories

Resources