How to read Snowflake primary keys into Python

This question is a bit related to another question: Get List of Primary Key Columns in Snowflake.
INFORMATION_SCHEMA.COLUMNS does not provide the required information about primary keys, and the method proposed by Snowflake itself, where you describe the table and then call RESULT_SCAN, is unreliable when queries run in parallel.
I was thinking about using SHOW PRIMARY KEYS IN DATABASE. This works great when querying the database from within Snowflake, but as soon as I try to do this in Python, I get results for the column name like 'Built-in function id', which is not useful when dynamically generating SQL statements.
The code I am using is as follows:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE;"
snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE,role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor
primary_keys = curs.execute(SQL_PK).fetchall()
curs.close()
snowflake_service.connection.close()
Is there something I am doing wrong? Is it even possible to do it like this?
Or is the solution that Snowflake provides reliable enough when the queries are sent as one string? Although with many tables, many round trips would be required to get all the data needed.

where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel
You could search for the specific query run using INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION and then refer to its result set using the retrieved QUERY_ID.
SHOW PRIMARY KEYS IN DATABASE;
-- find the newest occurrence of `SHOW PRIMARY KEYS`:
SET queryId = (SELECT QUERY_ID
               FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION())
               WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
               ORDER BY END_TIME DESC LIMIT 1);
SELECT * FROM TABLE(RESULT_SCAN($queryId));
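From Python there is a shorter route to the same result set: the Snowflake connector exposes the query ID of the last executed statement as cursor.sfqid, which can be passed straight to RESULT_SCAN on the same session. A minimal sketch, assuming a plain snowflake-connector-python cursor rather than the SnowflakeService wrapper from the question:
# Run the SHOW command, then read its output back through RESULT_SCAN
curs.execute("SHOW PRIMARY KEYS IN DATABASE")
query_id = curs.sfqid  # query ID of the SHOW statement we just ran

# Columns produced by SHOW are lowercase, so they must be double-quoted here
result_scan_sql = (
    'SELECT "table_name", "column_name", "key_sequence" '
    "FROM TABLE(RESULT_SCAN('{qid}'))"
).format(qid=query_id)
curs.execute(result_scan_sql)
for table_name, column_name, key_sequence in curs.fetchall():
    print(table_name, column_name, key_sequence)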

Related

Ask sqlite3 to perform query optimization after parameter replacement

I just traced back performance issues to query optimization in sqlite3 (I'm using the Python interface, but I'm not sure that's relevant; the problem is probably in the SQLite API itself).
I describe an example below.
My tables look like this, with millions of records:
CREATE TABLE articles (
    id integer primary key,
    title text
);
CREATE TABLE "tags" (
    tag text,
    article_id integer not null,
    foreign key(article_id) references articles(id)
);
Here is the query I want to run:
SELECT tag
FROM "tags" AS t1
INNER JOIN
(
    SELECT id
    FROM "articles"
    LIMIT (?)
) t2
ON t1.article_id = t2.id
I observed that if the limit parameter is hardcoded, the query is way faster (0.01s vs 15s).
Using set_trace_callback, I saw that the query was exactly the same.
However, explain didn't give the same result.
I concluded that the SQLite query optimizer runs before parameter replacement.
Indeed, if I modify the query to be
SELECT t1.tag
FROM
(
    SELECT id
    FROM "articles"
    LIMIT (?)
) AS t2
CROSS JOIN "tags" AS t1
WHERE t1.article_id = t2.id
it runs at the same speed as with a hardcoded limit, since CROSS JOIN lets you control the order of the nested loops (link).
Thus, I conclude that the query optimizer does not run the nested loops in the same order.
In this toy example, there is an obvious order that is better. However, in other cases, I might want the query optimizer to run after parameter replacement.
Is there a way to do it while still using the safe methods that prevent SQL injection?
The trivial solution would be to validate that limits are integers and use simple string replacement.
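For reference, a minimal sketch of that workaround, validating the limit in Python and inlining it as a literal so SQLite's planner sees the real value (the fetch_tags helper is made up for illustration; table and column names come from the question):
import sqlite3

def fetch_tags(conn, limit):
    """Return tags for the first `limit` articles; conn is a sqlite3.Connection."""
    # Validate the value ourselves; only then is plain string interpolation safe
    if not isinstance(limit, int) or limit < 0:
        raise ValueError("limit must be a non-negative integer")
    # The LIMIT is inlined as a literal, so the planner optimizes for the real value
    query = """
        SELECT t1.tag
        FROM (SELECT id FROM articles LIMIT {}) AS t2
        CROSS JOIN tags AS t1
        WHERE t1.article_id = t2.id
    """.format(limit)
    return conn.execute(query).fetchall()

# e.g. fetch_tags(sqlite3.connect("articles.db"), 1000)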

Inserting into MySQL from Python using mysql.connector module - How to use executemany() to insert rows and child rows?

I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way to accomplish this task?
What if my cars dictionary were empty? No rows would be inserted, so what would last_insert_id() return?
My tables...
Table: makes
pkey bigint autoincrement primary_key
make varchar(255) not_null
Table: models
pkey bigint autoincrement primary_key
make_key bigint not null
model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. There is LAST_INSERT_ID(), but that works only for a single-row insert.
So, I have developed the following to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
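A rough Python sketch of that two-step pattern against the question's tables (this is not the linked article's code; details such as INSERT IGNORE are assumptions):
# Step 1: batch-insert any makes that are not there yet
cursor.executemany(
    "INSERT IGNORE INTO makes (make) VALUES (%s)",
    [(make,) for make in cars],
)
# Step 2: read the generated keys back instead of guessing at them
placeholders = ", ".join(["%s"] * len(cars))
cursor.execute(
    "SELECT make, pkey FROM makes WHERE make IN ({})".format(placeholders),
    list(cars.keys()),
)
make_to_key = dict(cursor.fetchall())
# Build the child rows from the real keys and insert them in one batch
model_rows = [(make_to_key[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)", model_rows)
cnx.commit()  # assuming `cnx` is the mysql.connector connection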

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has a 'date' column containing a DateTime object. I want to get a list containing every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I have very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with sqlalchemy. For each table, I created a SELECT DISTINCT for the 'date' column, then assigned that list of selects to the selects property of a CompoundSelect object and executed it. As one might expect from an ugly brute-force query, it has been running for an hour or so now, and I am unsure whether it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
    tbl name;
BEGIN
    FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
        RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
    END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
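From Python, the question's "list containing every date" can then be pulled in a single round trip; a minimal sketch, assuming plain psycopg2 (the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me host=dbhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # get_all_dates() is the server-side function defined above
    cur.execute("SELECT DISTINCT * FROM get_all_dates() ORDER BY 1")
    all_dates = [row[0] for row in cur.fetchall()]
conn.close()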
Ended up reverting to a previous solution of using SQLAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted distinct dates from each table, and the dates were the PK in my set. I ended up using the approach from this wiki page. The code being sent in the query looked like the following:
WITH RECURSIVE t AS (
    (SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL
)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates (adding each date only if it wasn't already in the list), then saved that for use later. It's possible that it takes just as long as running it all in the psql console, but it was easier for me to save the results locally than to have to query a temp table in the db.
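A rough sketch of that parallelized approach (the engine URL, table list, and worker count are placeholders; the loose-index-scan query above is templated per table):
from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host/dbname")  # placeholder URL

QUERY = """
WITH RECURSIVE t AS (
    (SELECT date FROM {table} ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM {table} WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL
)
SELECT date FROM t WHERE date IS NOT NULL;
"""

def dates_for(table):
    # One connection per worker; the engine's connection pool is thread-safe
    with engine.connect() as conn:
        return {row[0] for row in conn.execute(text(QUERY.format(table=table)))}

tables = ["schema.table_a", "schema.table_b"]  # in practice, the 1000+ table names
with ThreadPoolExecutor(max_workers=8) as pool:
    all_dates = sorted(set().union(*pool.map(dates_for, tables)))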

Cassandra column family not found in pycassa

I'm attempting to connect to Cassandra in order to do a bulk insert. However, when I attempt to connect, I get an error.
The code I'm using:
from pycassa import columnfamily
from pycassa import pool
cassandra_ips = ['<an ip addr>']
conpool = pool.ConnectionPool('my_keyspace', cassandra_ips)
colfam = columnfamily.ColumnFamily(conpool, 'my_table')
However, this fails on the last line with:
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family my_table not found.')
The column family definitely exists:
cqlsh> use my_keyspace
... ;
cqlsh:my_keyspace> desc tables;
my_table
cqlsh:my_keyspace>
And I don't think this is a simple typo in the table name, as I've checked it a dozen times, but also because of this:
In [3]: sys_mgr = pycassa.system_manager.SystemManager(cassandra_ips[0])
In [4]: sys_mgr.get_keyspace_column_families('my_keyspace')
Out[4]: {}
Why is that {}?
If it matters:
The table/column family was created using CQL.
The table is currently empty.
The table was roughly created using:
CREATE TABLE my_table (
user_id int,
year_month int,
t timestamp,
<tons of other attributes>
PRIMARY KEY ((user_id, year_month), t)
) WITH compaction =
{ 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
In order to access CQL3 tables via a thrift-based API such as pycassa, the tables must be created using compact storage.
CREATE TABLE my_table (
...
) WITH COMPACT STORAGE;
With regards to the primary keys, from the docs:
Using the compact storage directive prevents you from defining more than one column that is not part of a compound primary key.
Currently you are using a composite partition key, but enabling compact storage limits us to using a compound partition key. So you will not have to limit it to a single column; it just has to be part of the compound key. One final reference.
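A minimal sketch of the check after recreating the table WITH COMPACT STORAGE (keyspace, table name, and IP are the question's placeholders): once the table is visible over thrift, the SystemManager call from the question should list it and ColumnFamily should open it.
from pycassa import columnfamily, pool, system_manager

cassandra_ips = ['<an ip addr>']

# The same call that returned {} in the question should now list the table
sys_mgr = system_manager.SystemManager(cassandra_ips[0])
print(sys_mgr.get_keyspace_column_families('my_keyspace'))

conpool = pool.ConnectionPool('my_keyspace', cassandra_ips)
colfam = columnfamily.ColumnFamily(conpool, 'my_table')  # should no longer raise NotFoundException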
This can also happen after creating a CF with an uppercase name:
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/ucase-lcase_r.html
I had a strange keyspace structure like this, with quoted CFs:
cqlsh:testkeyspace> DESC TABLES;
"Tabletest" users "PlayerLastStats"
I got an error from pycassa's system_manager.create_column_family(...), but only if the column_validation_classes parameter was present:
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family NewTable not found.')
After renaming all tables to lowercase, everything looked good:
cqlsh:testkeyspace> DESC TABLES;
tabletest newtable users playerlaststats

Creating blank field and receiving the INTEGER PRIMARY KEY with sqlite, python

I am using sqlite with python. When I insert into table A I need to feed it an ID from table B. So what I wanted to do is insert default data into B, grab the id (which is auto increment), and use it in table A. What's the best way to receive the key from the table I just inserted into?
As Christian said, sqlite3_last_insert_rowid() is what you want... but that's the C level API, and you're using the Python DB-API bindings for SQLite.
It looks like the cursor method lastrowid will do what you want (search for 'lastrowid' in the documentation for more information). Insert your row with cursor.execute( ... ), then do something like lastid = cursor.lastrowid to check the last ID inserted.
That you say you need "an" ID worries me, though... does it not matter which ID you have? Unless you are using the data just inserted into B for something (in which case you need that specific row ID), your database structure is seriously screwed up if any old row ID for table B will do.
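A minimal sketch of that pattern with the standard sqlite3 module (the table and column names are made up for illustration):
import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS b (id INTEGER PRIMARY KEY, note TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS a (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES b(id))")

# Insert the parent row first, then reuse its auto-generated key
cur.execute("INSERT INTO b (note) VALUES (?)", ("default data",))
b_id = cur.lastrowid  # ID generated by the INSERT above

cur.execute("INSERT INTO a (b_id) VALUES (?)", (b_id,))
conn.commit()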
Check out sqlite3_last_insert_rowid() -- it's probably what you're looking for:
Each entry in an SQLite table has a unique 64-bit signed integer key called the "rowid". The rowid is always available as an undeclared column named ROWID, OID, or _ROWID_ as long as those names are not also used by explicitly declared columns. If the table has a column of type INTEGER PRIMARY KEY then that column is another alias for the rowid.
This routine returns the rowid of the most recent successful INSERT into the database from the database connection in the first argument. If no successful INSERTs have ever occurred on that database connection, zero is returned.
Hope it helps! (More info on ROWID is available here and here.)
Simply use:
SELECT last_insert_rowid();
However, if you have multiple connections writing to the database, you might not get back the key that you expect.
