I am using the Snowflake database-as-a-service to store and process our data. Because we handle large amounts of data, I want to run a query, get the query ID, and let the query execute asynchronously. Another part of the system will monitor the status of the query by checking the query history table using that query ID.
I am using the Snowflake Python Connector.
Here is a sample of what I have so far:
from __future__ import print_function
import io, os, sys, time, datetime
modules_path = os.path.join(os.path.dirname(__file__), 'modules')
sys.path.append(modules_path)
import snowflake.connector
def async_query(data):
    connection = snowflake.connector.connect(
        user=data['user'],
        password=data['password'],
        account=data['account'],
        region=data['region'],
        database=data['database'],
        warehouse=data['warehouse'],
        schema=data['schema']
    )
    cursor = connection.cursor()
    cursor.execute(data['query'], _no_results=True)
    print(cursor.sfqid)
    return cursor.sfqid
This piece of code seems to be working, i.e. I am getting the query ID, but there is one problem: the SQL query fails with the error "SQL execution canceled." in Snowflake. If I remove the _no_results=True parameter, the query works well, but then I have to wait for it to complete, which is not the desired behaviour.
Any ideas what is causing the "SQL execution canceled" failure?
A little more info: the reason I don't want to wait for it is that I am running the code on AWS Lambda, and Lambda functions have a maximum running time of 5 minutes.
If _no_results=True is not specified, the execution is synchronous, so the application has to wait for the query to finish. If it is specified, the query becomes asynchronous, so the application continues running, but the destructor of the connection closes the session at the end, and all active queries are canceled. That seems to be the cause of "SQL execution canceled".
AWS Lambda limits execution time to 5 minutes, so if the query takes longer than that limit, it won't work.
By the way, _no_results=True is an internal parameter used for SnowSQL, and its behavior is subject to change in the future.
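For the monitoring side of the question, a hedged sketch of the polling loop: the helper below is plain stdlib, and `get_status` is a hypothetical callable standing in for whatever looks the query up by its ID (e.g. a QUERY_HISTORY lookup using the `sfqid` returned by `async_query`).

```python
import time

def wait_for_query(get_status, timeout=240, interval=5):
    """Poll get_status() until it reports a terminal state or we time out.

    get_status is a hypothetical callable returning one of
    'RUNNING', 'SUCCESS', 'FAILED' -- e.g. a QUERY_HISTORY lookup
    keyed on the query ID returned by async_query().
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status in ('SUCCESS', 'FAILED'):
            return status
        time.sleep(interval)
    return 'TIMEOUT'

# Toy usage: a fake status source that finishes on the third poll.
states = iter(['RUNNING', 'RUNNING', 'SUCCESS'])
result = wait_for_query(lambda: next(states), timeout=60, interval=0)
print(result)  # SUCCESS
```

A separate Lambda invocation (or any other worker) can run this loop, so no single function has to stay alive for the whole query.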
Related
I'm currently using mysql.connector in a python Flask project and, after users enter their information, the following query is executed:
"SELECT first, last, email, {} FROM {} WHERE {} <= {} AND ispaired IS NULL".format(key, db, class_data[key], key)
It would pose a problem if this query was executed in 2 threads concurrently, and returned the same row in both threads. I was wondering if there was a way to prevent SELECT mysql queries from executing concurrently, or if this was already the default behavior of mysql.connector? For additional information, all mysql.connector queries are executed after being authenticated with the same account credentials.
It is hard to say from your description, but if you're using Flask, you're most probably using (or will use in production) multiple processes, and you probably have a connection pool (i.e. multiple connections) in each process. So while each connection executes queries sequentially, this query can be run concurrently by multiple connections at the same time.
To prevent your application from obtaining the same row at the same time while handling different requests, you should use transactions and techniques like SELECT ... FOR UPDATE. The exact solution depends on your exact use case.
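One concrete shape for this is to claim the row with a single atomic UPDATE whose WHERE clause re-checks `ispaired IS NULL`, and treat the affected-row count as the verdict. The sketch below uses stdlib sqlite3 purely so it runs anywhere; with MySQL you would either use this same atomic-UPDATE claim or wrap a SELECT ... FOR UPDATE in a transaction. The table and column names are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, ispaired INTEGER)")
conn.execute("INSERT INTO users (email, ispaired) VALUES ('a@example.com', NULL)")
conn.commit()

def try_claim(conn, row_id):
    # The WHERE clause re-checks ispaired IS NULL, so even if two
    # workers race for the same row, only one UPDATE can match it.
    cur = conn.execute(
        "UPDATE users SET ispaired = 1 WHERE id = ? AND ispaired IS NULL",
        (row_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the worker that won the row

print(try_claim(conn, 1))  # first claim succeeds: True
print(try_claim(conn, 1))  # second claim of the same row fails: False
```

The losing thread simply moves on to the next candidate row, so no two requests ever pair the same row.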
I'm running a few large UNLOAD queries from Redshift to S3 from a python script using SQLAlchemy. (along with the sqlalchemy-redshift package)
The first couple work, but the last one, which runs the longest (~30 minutes), is marked Terminated in the Redshift query dashboard. Some data is loaded to S3, but I suspect it's not all of it.
I'm fairly confident the query itself works because I've used it to download locally in the past.
Does SQLAlchemy close queries that take too long? Is there a way to set or lengthen the query-timeout? The script itself continues as if nothing went wrong and the Redshift logs don't indicate a problem either but when a query is marked Terminated it usually means something external has killed the process.
There are two places where you can control timeouts in Redshift:
In the workload manager console, you get an option to specify timeout for each queue.
The ODBC/JDBC driver settings. Update your registry based on the steps in the link below:
http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-firewall-guidance.html
It turned out to be more an issue with sqlalchemy than AWS/Redshift.
SQLAlchemy does not implicitly commit transactions, so if the connection is closed while uncommitted transactions are still open (even if the query itself appears to be finished), all transactions within that connection are marked Terminated.
The solution is to finish your connection, or each transaction, with "commit transaction;":
conn = engine.connect()
conn.execute("""SELECT .... """)
conn.execute("""COMMIT TRANSACTION""")
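An alternative to issuing COMMIT TRANSACTION by hand is SQLAlchemy's `engine.begin()` context manager, which opens a transaction and commits it on clean exit (rolling back on error). A minimal sketch against an in-memory SQLite database, since the shape is the same for Redshift:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")

# engine.begin() commits automatically when the block exits cleanly,
# so long-running statements are never left in an open transaction.
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE t (x INTEGER)"))
    conn.execute(sa.text("INSERT INTO t (x) VALUES (42)"))

with engine.connect() as conn:
    value = conn.execute(sa.text("SELECT x FROM t")).scalar()
print(value)  # 42
```

With this pattern the UNLOAD would run inside a transaction that is guaranteed to be committed before the connection is returned.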
I'm using concurrent.futures module to run jobs in parallel. It runs quite fine.
The start time and completion time gets updated in the mysql database whenever a job starts/ends. Also, each job gets its input files from the database and saves the output files in the database. I'm getting the errors
"Error 2006:MySQL server has gone away"
and
"Error 2013: Lost connection to MySQL server during query" while running the script.
I don't face these errors while running a single job.
Sample Script:
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=pool_size)
futures = []
for i in self.parent_job.child_jobs:
    futures.append(executor.submit(invokeRunCommand, i))

def invokeRunCommand(self):
    self.saveStartTime()
    self.getInputFiles()
    runShellCommand()
    self.saveEndTime()
    self.saveOutputFiles()
I'm using a single database connection and cursor to execute all the queries. Some of the queries are time-consuming. I'm not sure of the reason for hitting this error. Could someone clarify?
-Thanks
Yes: a single connection to the database is not thread-safe, so if you're using the same database connection for multiple threads, things will fail.
If your pseudocode is representative, just open and use a separate database connection for each thread in your invokeRunCommand and things should be fine.
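To make that concrete, here is the per-thread-connection pattern sketched with stdlib sqlite3 (file-backed, since sqlite3 also refuses to share a connection across threads by default) so it runs anywhere; the same shape applies to MySQL connections, and the job table and names are made up for the illustration:

```python
import concurrent.futures
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "jobs.db")
setup = sqlite3.connect(db_path)
setup.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
setup.executemany("INSERT INTO jobs (id, status) VALUES (?, 'new')",
                  [(i,) for i in range(4)])
setup.commit()
setup.close()

def invoke_run_command(job_id):
    # Each worker opens (and closes) its own connection instead of
    # sharing one connection and cursor across threads.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
    finally:
        conn.close()
    return job_id

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(invoke_run_command, range(4)))

check = sqlite3.connect(db_path)
done = check.execute("SELECT COUNT(*) FROM jobs WHERE status = 'done'").fetchone()[0]
check.close()
print(done)  # 4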
[Python/MySQLdb] - CentOS - Linux - VPS
I have a page that parses a large file and queries the database up to 100 times per run. The database is pretty large and I'm trying to reduce the execution time of this script.
My SQL functions are inside a class, currently the connection object is a class variable created when the class is instantiated. I have various fetch and query functions that create a cursor from the connection object every time they are called. Would it be faster to create the cursor when the connection object is created and reuse it or would it be better practice to create the cursor every time it's called?
import MySQLdb as mdb

class parse:
    con = mdb.connect( server, username, password, dbname )
    #cur = con.cursor() ## create here?

    def q( self, q ):
        cur = self.con.cursor() ## it's currently here
        cur.execute( q )
Any other suggestions on how to speed up the script are welcome too. The insert statement is the same for all the queries in the script.
Opening and closing connections is never free; it always costs some performance.
The reason you wouldn't want to just leave the connection open is that if two requests were to come in at the same time the second request would have to wait till the first request had finished before it could do any work.
One way to solve this is to use connection pooling. You create a set of open connections and reuse them: every time you need to run a query you check a connection out of the pool, perform the request, and then put it back into the pool.
Setting all this up can be quite tedious, so I would recommend using SQLAlchemy. It has built in connection pooling, relatively low overhead and supports MySQL.
Since you care about speed, I would use only the core part of SQLAlchemy, since the ORM layer is a bit slower.
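To make the pooling idea concrete, here is a minimal hand-rolled pool using stdlib queue and sqlite3, purely for illustration; SQLAlchemy's built-in QueuePool does the same thing for you, plus connection recycling and overflow handling:

```python
import queue
import sqlite3

class SimplePool:
    """A toy fixed-size connection pool: check out, use, put back."""

    def __init__(self, make_conn, size=5):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_conn())

    def acquire(self):
        return self._pool.get()  # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)

pool = SimplePool(lambda: sqlite3.connect(":memory:", check_same_thread=False),
                  size=2)

conn = pool.acquire()
result = conn.execute("SELECT 1 + 1").fetchone()[0]
pool.release(conn)
print(result)  # 2
```

Because connections stay open between requests, each query pays only the checkout cost, not a full connect/disconnect.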
I'm using psycopg2 for the cherrypy app I'm currently working on and cli & phpgadmin to handle some operations manually. Here's the python code :
#One connection per thread
cherrypy.thread_data.pgconn = psycopg2.connect("...")
...
#Later, an object is created by a thread:
class dbobj(object):
    def __init__(self):
        self.connection = cherrypy.thread_data.pgconn
        self.curs = self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
...
#Then,
try:
    blabla
    self.curs.execute(...)
    self.connection.commit()
except:
    self.connection.rollback()
    lalala
...
#Finally, the destructor is called:
def __del__(self):
    self.curs.close()
I'm having a problem with either psycopg or postgres (although I think the latter is more likely). After having sent a few queries, my connections drop dead. Similarly, phpgadmin usually gets dropped as well; it prompts me to reconnect after I've made several requests. Only the CLI remains persistent.
The problem is that these happen very randomly and I can't even track down the cause. I can either get locked out after a few page requests or never encounter anything after having requested hundreds of pages. The only error I've found in the postgres log after terminating the app is:
...
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
...
I thought of creating a new connection every time a new dbobj instance is created, but I absolutely don't want to do that.
Also, I've read that one may run into similar problems unless all transactions are committed: I use the try/except block for every single INSERT/UPDATE query, but I never use it for SELECT queries, nor do I want to write even more boilerplate code (by the way, do SELECTs need to be committed?). Even if that's the case, why would phpgadmin close down?
max_connections is set to 100 in the .conf file, so I don't think that's the reason either. A single cherrypy worker has only 10 threads.
Does anyone have an idea where I should look first ?
Psycopg2 needs a commit or rollback after every transaction, including SELECT queries, or it leaves the connections "IDLE IN TRANSACTION". This is now a warning in the docs:
Warning: By default, any query execution, including a simple SELECT will start a transaction: for long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat...). For long lived scripts, either ensure to terminate a transaction as soon as possible or use an autocommit connection.
It's a bit difficult to see exactly where you're populating and accessing cherrypy.thread_data. I'd recommend investigating psycopg2.pool.ThreadedConnectionPool instead of trying to bind one conn to each thread yourself.
Even though I don't have any idea why successful SELECT queries block the connection, adding .commit() after pretty much every query that doesn't have to work in conjunction with another solved the problem.
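That "commit on success, rollback on failure" discipline can be factored into a context manager instead of sprinkling .commit() by hand. The sketch below uses stdlib sqlite3 (which also opens implicit transactions) so it runs anywhere, but the same shape works for a psycopg2 connection:

```python
import contextlib
import sqlite3

@contextlib.contextmanager
def transaction(conn):
    # Commit if the block succeeds, roll back if it raises, so the
    # session never lingers "idle in transaction".
    try:
        yield conn.cursor()
    except Exception:
        conn.rollback()
        raise
    else:
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()

with transaction(conn) as cur:
    cur.execute("INSERT INTO t (x) VALUES (1)")

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1
```

psycopg2 connections also support `with conn:` directly, which commits or rolls back the enclosed transaction the same way, and setting `conn.autocommit = True` sidesteps the issue entirely for read-only work.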