I'm migrating tasks that manage BigQuery tables from Google's BigQuery Python client library to their SQLAlchemy plugin for the BigQuery dialect. I've run into several issues while migrating some of the logic:
Query configuration -- in the original Python client, you can pass a configuration instance (e.g., QueryJobConfig, LoadJobConfig, etc.) when issuing a query through the client instance. The SQLAlchemy plugin's documentation says you can pass configuration information when initializing the Engine instance, but it says nothing about running a query through Connection.execute(), Session.query(), or similar methods. Is it possible to pass query configuration the way the client library does when issuing queries through SQLAlchemy Core/ORM?
Example 1:
Say I want to query a dataset in location us-west1, but my default location is us-west3. I also want to time out when there's no response within a minute. In the original library, I can pass the location when calling Client.query(), and set the timeout when fetching the results through result() --
from google.cloud import bigquery
client = bigquery.Client(project='my-project-id', location='us-west3')
query_job = client.query('SELECT id, name FROM my_dataset.my_table;', location='us-west1')
result_set = query_job.result(timeout=60)
But native SQLAlchemy doesn't provide these options when issuing queries.
Example 2:
I want to label a query job through QueryJobConfig, so that later I can query analytics of similar jobs from system table.
from google.cloud import bigquery
client = bigquery.Client(project='my-project-id', location='us-west3')
config = bigquery.job.QueryJobConfig(labels={"query_category": "common"})
query_job = client.query('SELECT id, name FROM my_dataset.my_table;', job_config=config, location='us-west1')
result_set = query_job.result()
Again, the SQLAlchemy plugin's documentation doesn't mention how to pass the configuration when issuing queries. From the documentation, the only place where you can set up job config is create_engine(), for providing default job settings.
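For reference, this is roughly what those engine-level defaults look like; the sqlalchemy-bigquery dialect documents create_engine() keyword arguments such as location and credentials_path, but whether anything can be overridden per query is exactly what I'm asking:

from sqlalchemy import create_engine

# Engine-wide defaults only; I have found no documented per-query equivalent.
engine = create_engine(
    'bigquery://my-project-id',
    location='us-west1',                   # default location for all jobs
    credentials_path='/path/to/key.json',  # optional service account key file
)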
Accessing the underlying query job instance (or at least the job ID) during query execution.
For example, if under some circumstances I need to cancel a query job, with the original Python client I can simply call cancel() on the job instance. At the DBAPI/SQLAlchemy end, though, I can't do that by just calling close() on the connection/cursor instance, because, judging from the underlying implementation, they only mark the connection/cursor as closed without touching the ongoing job. Hence it would be helpful to access the query job directly, or at least to get the job ID so I can spin up a native BigQuery client and grab the details.
Example: Say I don't want to rely on the client library's timeout mechanism and would rather cancel the request myself. In the client library, you can send a cancel request through the QueryJob instance:
import time

from google.cloud import bigquery

client = bigquery.Client(project='my-project-id', location='us-west3')
long_query = ...  # Some long-running query
query_job = client.query(long_query)  # With no timeout

time.sleep(10)
if not query_job.done():
    query_job.cancel()
However, SQLAlchemy doesn't offer a way to cancel a running query, so you need access to the underlying job details (either the job instance or the job ID) in order to cancel the query elsewhere.
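To make "cancel it elsewhere" concrete, this is the kind of fallback I have in mind with the native client, assuming I can recover the job some other way; matching on the query text below is purely illustrative:

from google.cloud import bigquery

client = bigquery.Client(project='my-project-id', location='us-west3')

# Without the job ID from SQLAlchemy, the best I can do is list recent
# running jobs and match on something like the query text or a label.
for job in client.list_jobs(max_results=10, state_filter='running'):
    if job.job_type == 'query' and 'my_dataset.my_table' in (job.query or ''):
        client.cancel_job(job.job_id, location=job.location)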
Related
I'm developing an application in Python which uses Azure Cosmos DB as the main database. At some point in the app, I need to insert bulk data (a batch of items) into Cosmos DB. So far, I've been using Azure Cosmos DB Python SDK for SQL API for communicating with Cosmos DB; however, it doesn't provide a method for bulk data insertion.
As I understand it, these are the insertion methods provided by this SDK, both of which only support single-item inserts and can be very slow when called in a for loop:
.upsert_item()
.create_item()
Is there another way to use this SDK to insert bulk data instead of using the methods above in a for loop? If not, is there an Azure REST API that can handle bulk data insertion?
The Cosmos DB service does not provide this via its REST API. Bulk mode is implemented at the SDK layer and unfortunately, the Python SDK does not yet support bulk mode. It does however support asynchronous IO. Here's an example that may help you.
from azure.cosmos.aio import CosmosClient
import os

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']
DATABASE_NAME = 'myDatabase'
CONTAINER_NAME = 'myContainer'

async def create_products():
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        for i in range(10):
            await container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })
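Because each upsert is awaitable, you can also fan the calls out concurrently instead of awaiting them one by one. A minimal sketch, reusing the URL, KEY, DATABASE_NAME, and CONTAINER_NAME values from above (the batch size and item shape are illustrative only):

import asyncio

async def create_products_concurrently():
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        # Schedule all upserts at once and wait for them together.
        await asyncio.gather(*[
            container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })
            for i in range(10)
        ])

asyncio.run(create_products_concurrently())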
Update: I remembered another way you can do bulk inserts with the Cosmos DB Python SDK, and that is using stored procedures. There are examples of how to write these, including samples that demonstrate passing an array, which is what you want to do. I would also take a look at bounded execution, as you will want to implement that as well. You can learn how to write them here, How to write stored procedures, and how to register and call them here, How to use Stored Procedures. Note: stored procedures can only be called with a partition key value, so you can only do batches within a logical partition. A rough sketch of the register-and-call flow follows.
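This sketch uses the synchronous SDK; the stored procedure body, its name bulkInsert, and the partition key value 'Widget' are all illustrative assumptions, and a production version needs the bounded-execution handling mentioned above:

from azure.cosmos import CosmosClient
import os

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']

client = CosmosClient(URL, credential=KEY)
container = client.get_database_client('myDatabase').get_container_client('myContainer')

# Simplified JavaScript stored procedure that inserts an array of documents.
SPROC_BODY = """
function bulkInsert(docs) {
    var count = 0;
    var collection = getContext().getCollection();
    docs.forEach(function (doc) {
        collection.createDocument(collection.getSelfLink(), doc, function (err) {
            if (err) throw err;
            count++;
        });
    });
    getContext().getResponse().setBody(count);
}
"""

container.scripts.create_stored_procedure(body={'id': 'bulkInsert', 'body': SPROC_BODY})

# All items in one call must share the partition key value passed here.
items = [{'id': 'item{0}'.format(i), 'productName': 'Widget'} for i in range(10)]
container.scripts.execute_stored_procedure(
    sproc='bulkInsert', partition_key='Widget', params=[items])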
So I'm scheduling an AWS Python job (through AWS Glue Python Shell) that is supposed to clone a MySQL RDS database (is taking a snapshot and restoring it the best way?) and then run SQL queries against the database. I have the boto3 library in the Python Shell plus a SQL Python library I loaded. I currently have this code:
import boto3

client = boto3.client('rds')

# Create a snapshot of the database
snapshot_response = client.create_db_snapshot(
    DBSnapshotIdentifier='snapshot-identifier',
    DBInstanceIdentifier='instance-db',
)

# Restore db from snapshot
restore_response = client.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier='restored-db',
    DBSnapshotIdentifier='snapshot-identifier',
)
# Code that will perform sql queries on the restored-db database.
However, client.restore_db_instance_from_db_snapshot fails because it says the snapshot is still being created. So I understand that these calls are asynchronous, but I am not sure how to get this snapshot restore to work, either by making the calls synchronous (not a good idea?) or by some other way. Thanks for the help in advance :).
You can use a waiter:
waiter = client.get_waiter('db_snapshot_available')
Since create_db_snapshot creates a DB instance snapshot (not a cluster snapshot), the instance-level waiter is the one you want. It polls RDS.Client.describe_db_snapshots() every 30 seconds until a successful state is reached. An error is returned after 60 failed checks.
See: class RDS.Waiter.DBSnapshotAvailable
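Putting it together with your snippet, a sketch with the identifiers reused from your code (the WaiterConfig values shown are just the defaults made explicit):

import boto3

client = boto3.client('rds')

client.create_db_snapshot(
    DBSnapshotIdentifier='snapshot-identifier',
    DBInstanceIdentifier='instance-db',
)

# Block until the instance snapshot is available before restoring from it.
client.get_waiter('db_snapshot_available').wait(
    DBSnapshotIdentifier='snapshot-identifier',
    WaiterConfig={'Delay': 30, 'MaxAttempts': 60},
)

client.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier='restored-db',
    DBSnapshotIdentifier='snapshot-identifier',
)

# Optionally wait for the restored instance itself to become available
# before running SQL against it.
client.get_waiter('db_instance_available').wait(
    DBInstanceIdentifier='restored-db',
)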
I need some help understanding connection pooling from both the SQLAlchemy perspective and the Redshift perspective.
Ultimately what I want to achieve:
Maximize query parallelism while protecting Redshift from connection exhaustion
Avoid the connection pool becoming a bottleneck, where Redshift is performing well but the Python application has too few connections and queries start forming a large queue
I have a few instances of the application running in Docker containers, and each instance creates its own SQLAlchemy Engine object with default pooling settings.
engine = create_engine(REDSHIFT_URI, echo=True, echo_pool=True)
I also use the context-manager pattern suggested in the SQLAlchemy documentation and execute each query inside this context. All my queries are aggregation SELECT queries.
@contextmanager
def session_scope():
    session = Session(engine)
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()
I count all Redshift sessions with select count(*) from stv_sessions where user_name != 'rdsdb'. The number of sessions is the same as the number of application containers.
Then, when I start to trigger tons of queries from the application, I don't see an increase in the Redshift session count. I also can't see any messages from SQLAlchemy's pool logging.
From Redshift documentation:
Each session corresponds to a connection. You can view information
about the active user sessions for Amazon Redshift, or you can check
the total number of the connections by using STV_SESSIONS.
From my understanding, SQLAlchemy should open up to 5 connections, but I can't see that in either Redshift or the application logs. Does SQLAlchemy just reuse one connection all the time? Is SQLAlchemy's definition of a connection different from Redshift's?
Where is the flaw in my logic? It looks like either my testing is bad or I've misunderstood one of the concepts.
https://docs.sqlalchemy.org/en/latest/core/pooling.html
Create a Pool
import sqlalchemy.pool as pool
from sqlalchemy import create_engine
import psycopg2

def getconn():
    # Note: psycopg2 takes `user`, not `username`
    c = psycopg2.connect(user='ed', host='127.0.0.1', dbname='test')
    return c

mypool = pool.QueuePool(getconn, max_overflow=10, pool_size=5)
e = create_engine('postgresql://', pool=mypool)
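You usually don't need to build the pool by hand, though: the same knobs can be passed directly to create_engine, and echo_pool will log checkouts and check-ins so pool activity becomes visible. A minimal sketch reusing the REDSHIFT_URI from the question (the numbers shown are just the defaults made explicit):

from sqlalchemy import create_engine

engine = create_engine(
    REDSHIFT_URI,
    pool_size=5,      # connections kept open in the pool (the default)
    max_overflow=10,  # additional connections allowed under load (the default)
    echo_pool=True,   # log pool checkouts and check-ins
)

Note that a pool only opens additional connections when queries actually run concurrently (for example from multiple threads); a single-threaded loop keeps reusing the same pooled connection, which would likely explain a constant session count in stv_sessions.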
I have an MS-SQL database deployed on AWS RDS that I'm writing a Flask front end for.
I've been following some intro Flask tutorials, all of which seem to pass the DB credentials in the connection string URI. I'm following the tutorial here:
https://medium.com/@rodkey/deploying-a-flask-application-on-aws-a72daba6bb80#.e6b4mzs1l
For deployment, do I prompt for the DB login info and add it to the connection string? If so, where? Using SQLAlchemy, I don't see any calls to create_engine in the tutorial's code; I just see an initialization using config.from_object, referencing the config.py where SQLALCHEMY_DATABASE_URI is stored, which points to the DB location. Calling config.update(dict(UID='****', PASSWORD='******')) from my application has no effect, and the config dict doesn't seem to have any applicable entries to set for this purpose. What am I doing wrong?
Or should I be authenticating using Flask-User, and then get rid of the DB level authentication? I'd prefer authenticating at the DB layer, for ease of use.
The tutorial you are using relies on Flask-SQLAlchemy to abstract the database setup, which is why you don't see engine.connect().
Frameworks like Flask-SQLAlchemy are designed around the idea that you create a connection pool to the database on launch and share that pool amongst your various worker threads. You will not be able to use that for what you are doing... it takes care of initializing the session and related setup early in the process.
Because of your requirements, I don't know that you'll be able to make any use of things like connection pooling. Instead, you'll have to handle that yourself. The actual connection isn't too hard...
from sqlalchemy import create_engine

engine = create_engine('dialect://username:password@host/db')
connection = engine.connect()
result = connection.execute("SOME SQL QUERY")
for row in result:
    # Do Something
    pass
connection.close()
The issue is that you're going to have to do that in every endpoint. A database connection isn't something you can store in the session; you'll have to store the credentials there and do a connect/disconnect cycle in every endpoint you write. Worse, you'll have to figure out either encrypted sessions or server-side sessions (without a DB connection!) to keep those credentials in the session from becoming a horrible security leak.
I promise you, it will be easier both now and in the long run to figure out a simple way to authenticate users so that they can share a connection pool that is abstracted out of your app endpoints. But if you HAVE to do it this way, this is how you would do it (and make sure you are closing those connections every time!).
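A rough sketch of that per-endpoint pattern, purely illustrative: the route, the mssql+pymssql URL format, and the session keys are assumptions, and you would still need a login view that puts the credentials into an encrypted or server-side session:

from flask import Flask, session
from sqlalchemy import create_engine, text

app = Flask(__name__)
app.secret_key = 'change-me'  # required for Flask sessions; keep it secret

@app.route('/report')
def report():
    # Credentials were stored in the session by a login view (not shown).
    user = session['db_user']
    password = session['db_password']

    engine = create_engine(
        'mssql+pymssql://{0}:{1}@my-rds-host.example.com/mydb'.format(user, password))
    connection = engine.connect()
    try:
        result = connection.execute(text('SELECT TOP 5 name FROM some_table'))
        names = [row[0] for row in result]
    finally:
        connection.close()   # close the connection every time
        engine.dispose()     # and drop the throwaway engine's pool

    return ', '.join(names)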
I'm experimenting with SqlAlchemy, and trying to get a grasp of how I should treat connection objects.
So sessionmaker returns a session factory (confusingly also called Session in all their documentation), and you use that to create session objects, which sound a lot like database cursors to me.
What is a session object, specifically? Is it as ephemeral as a db cursor, or is it more material (does a session bind exclusively to one of the underlying connections in the engine's connection pool, for example)?
The Session object is not a database cursor; while using the Session you may open and close any number of individual cursors. Within a single session's lifespan you may insert some records, run queries, issue updates, and delete.
There's a FAQ on the session where this topic is addressed; in short, the Session is an in-memory object implementing an identity map pattern which will sync the state of objects as they exist in your application with the database upon commit.
# User here is some SQLAlchemy model
user = session.query(User).filter(User.name == 'John').one()
user.name = 'John Smith'
At this stage, the database still thinks this user's name is John. It will continue to until the session is flushed or committed. Note that under most configurations, any query you run from a session automatically flushes the session so you don't need to worry about this.
Now let's inspect our user to better understand what the session is keeping track of:
> from sqlalchemy import orm
> orm.attributes.get_history(user, 'name')
History(added=['John Smith'], unchanged=(), deleted=['John'])
Watch what happens once we've flushed the session:
> session.flush()
> orm.attributes.get_history(user, 'name')
History(added=(), unchanged=['John Smith'], deleted=())
However, if we do not commit the session but instead roll it back, our change will not stick:
> session.rollback()
> orm.attributes.get_history(user, 'name')
History(added=(), unchanged=['John'], deleted=())
The Session object is a public API for the underlying connection and transaction objects. To understand how connections and transactions work in SQLAlchemy, take a look at the core documentation's section on the topic.
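For illustration, those underlying handles can be inspected directly (method names per SQLAlchemy 1.4+; session is the same object used in the snippets above):

# The Core connection the Session is currently using (one is checked out
# from the engine's pool on first use within a transaction).
conn = session.connection()
print(conn.engine.url)           # which engine/database the session is bound to
print(session.in_transaction())  # True while a transaction is open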
UPDATE: Session persistence
The Session stays open until explicitly closed via Session.close(). Transaction managers often handle this for you automatically in a web application, but failing to close sessions you open in a test suite, for instance, can cause problems due to many open transactions.
The Session holds your changes entirely in Python until it is flushed, either explicitly via Session.flush() or, if autoflush is on, when a query is run. Once flushed, the session emits SQL within a transaction to the database. Repeated flushes simply emit more SQL within that transaction. Appropriate calls to Session.begin and Session.begin_nested can create sub-transactions if your underlying engine/database supports them.
Calls to Session.commit and Session.rollback execute SQL within the currently active transaction.
Turn on echo=True when you initialize your engine and watch the SQL emitted by various Session methods to better understand what's happening.
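To make that concrete, a minimal sketch (the SQLite URL is a placeholder, and User is the same hypothetical model as in the snippet above):

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# echo=True prints every SQL statement the Session emits to the log.
engine = create_engine('sqlite:///example.db', echo=True)

session = Session(engine)
user = session.query(User).filter(User.name == 'John').one()  # SELECT logged
user.name = 'John Smith'
session.flush()     # UPDATE logged inside the open transaction
session.rollback()  # ROLLBACK logged; the change does not persist
session.close()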