I am writing Python scripts to synchronize tables from an MSSQL database to a PostgreSQL DB. The original author tends to use very wide tables with long runs of consecutive NULLs in them.
For insertion speed, I serialize the records in bulk into a string of the following form before calling execute():
INSERT INTO A( {col_list} )
SELECT * FROM ( VALUES (row_1), (row_2),...) B( {col_list} )
During row serialization, it's not possible to determine the data type of NULL / None in Python, which complicates the job: every NULL value in a timestamp column, integer column, etc. needs an explicit cast to the proper type, or Pg complains about it.
Currently I am checking the DB-API connection.description property, comparing the type_code for every column, and adding type casts like ::timestamp as needed.
But this feels cumbersome and duplicates work: the driver already converted the data from text to the proper Python data types, and now I have to redo it for columns with all those Nones.
Is there a better way to work around this with elegance and simplicity?
If you don't need the SELECT, go with @Nick's answer.
If you need it (like with a CTE to use the input rows multiple times), there are workarounds depending on the details of your use case.
Example, when working with complete rows:
INSERT INTO A -- complete rows
SELECT * FROM (
VALUES ((NULL::A).*), (row_1), (row_2), ...
) B
OFFSET 1;
{col_list} is optional noise in this particular case, since we need to provide complete rows anyway.
Detailed explanation:
Casting NULL type when updating multiple rows
Instead of inserting from a SELECT, you can attach a VALUES clause directly to the INSERT, i.e.:
INSERT INTO A ({col_list})
VALUES (row_1), (row_2), ...
When you insert from a query, Postgres examines the query in isolation when trying to infer the column types, and then tries to coerce them to match the target table (only to find out that it can't).
When you insert directly from a VALUES list, it knows about the target table when performing the type inference, and can then assume that any untyped NULL matches the corresponding column.
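From Python, one way to get this direct-VALUES form without hand-building strings is psycopg2's execute_values() helper; it expands a single %s placeholder into a multi-row VALUES list attached straight to the INSERT, so plain None values arrive as untyped NULLs and are coerced against A's columns. A minimal sketch, with an illustrative DSN, column list, and rows:

# Sketch: multi-row INSERT ... VALUES via psycopg2 (DSN, columns and rows are illustrative)
import psycopg2
from psycopg2.extras import execute_values

pg = psycopg2.connect("dbname=target")            # hypothetical connection
rows = [(1, None, "x"), (2, None, None)]          # rows pulled from MSSQL, Nones included
col_list = "id, created_at, label"                # illustrative column list

with pg, pg.cursor() as cur:
    execute_values(
        cur,
        f"INSERT INTO A ({col_list}) VALUES %s",  # %s is expanded into the VALUES rows
        rows,
        page_size=1000,                           # rows per generated statement
    )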
You could try creating JSON from the data and then building a rowset from the JSON using json_populate_record(..).
postgres=# create table js_test (id int4, dat timestamp, val text);
CREATE TABLE
postgres=# insert into js_test
postgres-# select (json_populate_record(null::js_test,
postgres(# json_object(array['id', 'dat', 'val'], array['5', null, 'test']))).*;
INSERT 0 1
postgres=# select * from js_test;
 id | dat | val
----+-----+------
  5 |     | test
You can use json_populate_recordset(..) to do the same with multiple rows in one go. You just pass a single json value that is a JSON array of objects. Make sure it isn't a PostgreSQL array of json values.
So this is OK: '[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]'::json
This is not: array['{"id":1,"dat":null,"val":6}'::json,'{"id":3,"val":"tst"}'::json]
select *
from json_populate_recordset(null::js_test,
'[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]')
I'm trying to figure out how to add a timestamp to my database table. df2 doesn't include any column for time, so I'm trying to create the value either in values_ or when I execute the SQL. I want to use the Redshift GETDATE() function.
values_ = ', '.join([f"('{str(i.columnA)}','{str(i.columnB)}','{str(i.columnC)}','{str(i.columnD)}', 'GETDATE()')" for i in df2.itertuples()])
sqlexecute(f'''insert into table.table2 (columnA, columnB, columnC, columnD, time_)
values
({values_})
;
''')
This is one of several errors I get depending on where I put GETDATE()
FeatureNotSupported: ROW expression, implicit or explicit, is not supported in target list
The "INSERT ... VALUES (...)" construct is for inserting literals into a table and getdate() is not a literal. However, there are a number of ways to get this to work. A couple of easy ways are:
You can make the default value of the column time_ be getdate() and then just use the keyword DEFAULT in the insert values statement. This tells Redshift to use the default for that column (getdate()).
insert into table values ('A', 'B', 3, default)
You could switch to a "INSERT ... SELECT ..." construct which will allow you to have a mix of literals and function calls.
insert into table (select 'A', 'B', 3, getdate())
NOTE: inserting row by row into a table in Redshift can be slow and can make a mess of the table if the number of rows being inserted is large. This is compounded if auto-commit is on, since each insert is committed separately and has to work its way through the commit queue. If you are inserting a large amount of data, you should do this by writing an S3 object and COPYing it to Redshift, or at least by bundling 100+ rows of data into a single insert statement (with auto-commit off, explicitly committing the changes at the end).
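For completeness, a hedged sketch of the second option applied to the question's code (df2 and sqlexecute are assumed to exist, as in the question). GETDATE() is left unquoted and the rows are bundled into a single INSERT ... SELECT, as the note above suggests, so Redshift evaluates the function instead of storing the literal string:

# Sketch only: df2 and sqlexecute come from the question; values are interpolated as in the original
batch = " UNION ALL ".join(
    f"SELECT '{r.columnA}', '{r.columnB}', '{r.columnC}', '{r.columnD}', GETDATE()"
    for r in df2.itertuples()
)
sqlexecute(f'''insert into table.table2 (columnA, columnB, columnC, columnD, time_)
{batch}
;
''')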
When I created the table I added a time_log column using timestamp.
drop table if exists table1;
create table table1(
column1 varchar(255),
column2 varchar(255),
time_log timestamp
);
The issue was that I had parentheses around {values_} in my insert statement. Remove those and it will work:
sqlexecute(f'''insert into table.table2 (columnA, columnB, time_log)
values
{values_}
;
''')
I have a massive SQL query that returns a very large table of results.
How can I attach one extra row to that result with a simple, separate query, to avoid an extra round trip to the database?
e.g., for columns [a,b,c,d,e,f], say I get a result of 3 rows. I want to add a row only for the case where e = 0, bringing along its f value (9 in this example), ignoring all the many conditions on the other columns and leaving the rest of the row NULL. The result should look something like this:
|a|b|c|d|e|f|
|1|2|3|4|5|6|
|6|5|4|3|2|1|
|1|3|4|2|5|6|
|N|N|N|N|0|9|
It would probably make more sense to do this in the application that consumes the data.
In MySQL, you could use union all to add the required record to your resultset (which you seem to be aware of, since you tag your question with union).
So something like:
select -- your massive query here
union all
select null, null, null, null, 0, null
union all will generate the record regardless of what the massive query returns. If there is a chance that the massive query might also generate the same record and you want to avoid duplicates, you can use union (but that means more work for your RDBMS, which will need to search both unioned result sets for duplicates).
You might consider a UNION after your existing query
<your query>
UNION
SELECT 'N','N','N','N',0,'N'
Or if you really mean NULL
<your query>
UNION
SELECT NULL,NULL,NULL,NULL,0,NULL
So after clarification:
<your query>
UNION
SELECT NULL,NULL,NULL,NULL,e,f
FROM <table> t
WHERE t.e = 0
<may require other clauses in the where>
In our system, we have 1000+ tables, each of which has a 'date' column containing a DateTime value. I want to get a list of every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I have very limited knowledge of either PostgreSQL or SQLAlchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with SQLAlchemy. For each table, I created a SELECT DISTINCT on the 'date' column, then set that list of selects as the selects property of a CompoundSelect object and executed it. As one might expect from an ugly brute-force query, it has been running for an hour or so now, and I am unsure whether it has silently broken somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
    tbl text;
BEGIN
    -- Loop over every table in the public schema and return its distinct dates
    FOR tbl IN SELECT 'public.' || quote_ident(tablename) FROM pg_tables WHERE schemaname = 'public' LOOP
        RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
    END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables live in multiple schemas, add your own logic for where the tables are stored, or make the schema name a parameter of the function, call it once per schema, and UNION the results.
Note that you may get duplicate dates from multiple tables. You can weed these duplicates out in the statement that calls the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
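If you are calling this from SQLAlchemy anyway, as in the question, invoking the function is a one-liner; a small sketch (the engine URL is hypothetical), with the de-duplication and ordering still done on the server:

# Sketch: call get_all_dates() from SQLAlchemy (engine URL is a placeholder)
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host/db")
with engine.connect() as conn:
    result = conn.execute(text("SELECT DISTINCT * FROM get_all_dates() ORDER BY 1"))
    all_dates = [row[0] for row in result]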
I ended up reverting to a previous solution of using SQLAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted distinct dates from each table, and the dates were the PK in my set. I ended up using the approach from this wiki page. The code being sent in the query looked like the following:
WITH RECURSIVE t AS (
    (SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL
)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of each query into a single list of all my dates (skipping ones already in the list), then saved that for later use. It possibly takes just as long as running it all in the psql console, but it was easier for me to save the results locally than to query a temp table in the DB.
We are using BigQuery's Python API, specifically the jobs resource, to run a query on an existing BigQuery table, and to export the results by inserting the resulting dataset in a new BigQuery table (destinationTable).
Is there a way to also update the schema of the newly created table and set a specific datatype? By default, all fields are set to a 'string' type, but we need one of the fields to be 'timestamp'.
In order to set the field types of the destination table you need to CAST to the new type in your query, as the result set describes the new field type in the destination table.
Use simple CAST functions to get numbers/dates, like:
SELECT TIMESTAMP(t) AS t FROM (SELECT "2015-01-01 00:00:00" t)
The "unflatten" feature for record types was introduced recently, so you can now transfer a whole record to another table while preserving the RECORD structure. For that you need to set a destination table (and the desired write disposition), set allowLargeResults = TRUE, and set Flatten Results = FALSE (see the last post here, where it is explained). Then you can run a query like this to transfer the whole record to the destination table:
SELECT cell.* FROM publicdata:samples.trigrams LIMIT 0;
I'm using tables from the publicdata:samples dataset which is also available to you, so you can run these tests, too. In the above query 'cell' is a record, and if you set Flatten Results=FALSE, you'll see that 'cell' is still a RECORD in your dest table.
You can remove some fields from your record when transferring data to the dest table. Here is the query that demonstrates this (again, you'd need to run it with Flatten Results=FALSE):
SELECT cell.value, cell.volume_count FROM publicdata:samples.trigrams
LIMIT 0;
After you run the above query, the 'cell' record will only contain the fields you specified.
You can rename an existing field within a record when transferring data to the dest table:
SELECT cell.value AS cell.newvalue FROM publicdata:samples.trigrams
LIMIT 0;
Unfortunately, currently there is no way to add a field to a record, e.g., the following query will create 'url' outside of both 'actor_attributes' and 'repository' records.
SELECT
actor_attributes.blog,
repository.created_at,
repository.url AS actor_attributes.url
FROM publicdata:samples.github_nested
LIMIT 0;
So in order to add a field to a record, you'd need to export your data, process it outside of BigQuery, and then load it back with the new schema.
The field types of the destination table will be automatically set. If you need to transform a string to an integer or timestamp, do so in the query.
This will create a destination table with one column (string):
SELECT x FROM (SELECT "1" x)
This will create a destination table with one column (integer):
SELECT INTEGER(x) AS x FROM (SELECT "1" x)
This will create a destination table with one column (timestamp):
SELECT TIMESTAMP(x) AS x FROM (SELECT "2015-10-21 04:29:00" x)
I am using SQLite with Python. When I insert into table A, I need to feed it an ID from table B. So what I wanted to do is insert default data into B, grab the ID (which is auto-increment), and use it in table A. What's the best way to get back the key from the row I just inserted?
As Christian said, sqlite3_last_insert_rowid() is what you want... but that's the C level API, and you're using the Python DB-API bindings for SQLite.
It looks like the cursor method lastrowid will do what you want (search for 'lastrowid' in the documentation for more information). Insert your row with cursor.execute( ... ), then do something like lastid = cursor.lastrowid to check the last ID inserted.
It worries me that you say you need "an" ID, though... does it not matter which ID you have? Unless you are using the data just inserted into B for something, in which case you need that specific row ID, your database structure is seriously screwed up if any old row ID for table B will do.
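Assuming you do need the ID of the row just inserted into B, here is a minimal sketch of the lastrowid approach with the standard sqlite3 module (table and column names are illustrative):

# Sketch: grab the auto-increment key of the row just inserted into B and reuse it in A
import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS B (id INTEGER PRIMARY KEY, note TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS A (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES B(id))")

cur.execute("INSERT INTO B (note) VALUES (?)", ("default data",))
b_id = cur.lastrowid                               # key of the row just inserted into B
cur.execute("INSERT INTO A (b_id) VALUES (?)", (b_id,))
conn.commit()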
Check out sqlite3_last_insert_rowid() -- it's probably what you're looking for:
Each entry in an SQLite table has a unique 64-bit signed integer key called the "rowid". The rowid is always available as an undeclared column named ROWID, OID, or _ROWID_ as long as those names are not also used by explicitly declared columns. If the table has a column of type INTEGER PRIMARY KEY then that column is another alias for the rowid.
This routine returns the rowid of the most recent successful INSERT into the database from the database connection in the first argument. If no successful INSERTs have ever occurred on that database connection, zero is returned.
Hope it helps! (More info on ROWID is available here and here.)
Simply use:
SELECT last_insert_rowid();
However, if you have multiple connections writing to the database, you might not get back the key that you expect.