Execute .sql file in Python with MySQLdb - python

I have a .sql file containing a bunch of SQL queries, with each query spanning multiple lines. I want to execute these queries in MySQL via Python using MySQLdb.
sqlite3 has "a nonstandard shortcut" for this purpose called executescript(), but there doesn't seem to be any equivalent function in MySQLdb.
I noticed this old question from 2 years ago which asks the same thing, but I found the answers unsatisfying. The answers are basically:
Use subprocess to run the mysql command and send it your .sql file.
This works, but it is rather inelegant, and it introduces unwanted complexity with error handling and such.
If each query is on a single line, just execute each line separately.
But in my case, they span multiple lines, so this won't work.
If each query is not on a single line, somehow join them.
But, how? I mean, I can hack up something easily enough so there's no need for you to reply with half-baked answers here, and maybe that's what I'll end up doing, but is there already an established library that does this? I'd feel more comfortable with a comprehensive and correct solution rather than a hack.

MySQLdb seems to allow this out of the box; you just have to call cursor.nextset() to cycle through the returned result sets.
db = conn.cursor()
db.execute('SELECT 1; SELECT 2;')
more = True
while more:
    print(db.fetchall())
    more = db.nextset()
If you want to be absolutely sure the support for this is enabled, and/or disable the support, you can use something like this:
MYSQL_OPTION_MULTI_STATEMENTS_ON = 0
MYSQL_OPTION_MULTI_STATEMENTS_OFF = 1
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_ON)
# Multiple statement execution here...
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_OFF)
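The same idea can be wrapped in a small helper that sends a whole .sql file in one execute() call and then drains every result set. This is only a sketch under stated assumptions: run_sql_script is a hypothetical name, conn is assumed to be an open MySQLdb connection with multi-statement support enabled, and the file is assumed to contain plain statements only (DELIMITER directives are a feature of the mysql command-line client, not server SQL).

```python
def run_sql_script(conn, path, encoding="utf-8"):
    """Run every statement in a .sql file and return the rows of each result set.

    Assumes `conn` is a DB-API connection (for example from MySQLdb.connect)
    whose cursors accept several ';'-separated statements in one execute().
    """
    with open(path, encoding=encoding) as script_file:
        script = script_file.read()

    result_sets = []
    cursor = conn.cursor()
    try:
        cursor.execute(script)
        while True:
            # description is None for statements that return no rows (INSERT, DDL, ...).
            if cursor.description is not None:
                result_sets.append(cursor.fetchall())
            # nextset() returns a false value once no result set is left.
            if not cursor.nextset():
                break
    finally:
        cursor.close()
    return result_sets
```

Nothing is committed here; whether and when to call conn.commit() is left to the caller.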

Related

Calling several SQL procedures at once with Python

I have made a stored procedure in MySQL which accepts several arguments and does its things.
And while I have no problem to execute the following query in MySQL
CALL my_pr(var1, var2, var3); CALL my_pr(var4, var5, var6);
When I try to execute it (or any other 2 statements at once) via Python I get the following error:
Commands out of sync; you can't run this command now
But when I am trying to execute them one by one - everything works smoothly.
I am adding each statement to a list and then execute it via:
for stm in sql_stms:
    mycursor.execute(stm)
    mydb.commit()
Where I set each stm to be a single query or a multiple statement query in some code above. And my sql_stms contain several INSERT, SELECT and DELETE queries and tens (or sometimes hundreds) of queries for a stored procedure.
My goal is to speed up the running process and currently the slowest part in my code is submitting queries to SQL, so I believe that when I submit multiple queries at once it will work slightly faster.
Any ideas and suggestions are welcomed.
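The cursor is probably not expecting more than one result set. If you are using MySQL Connector/Python, try enabling multi-statement execution; in the versions that support it, multi is an argument of execute() rather than of cursor(), and the call returns an iterator over the results:
results = mycursor.execute(stm, multi=True)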
The interface is not designed to easily get two "result sets" at once.
There is very little advantage in trying to run two statements together. Simply run them one at a time.
You can, on the other hand, build a third SP that makes those two CALLs. But, again, why bother.
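"Commands out of sync" generally means a new statement was sent while results from an earlier one were still unread, and a CALL produces a final status result on top of any rows it returns. A minimal sketch of the one-at-a-time approach, reading everything after each statement and committing once at the end, might look like this. run_statements is a hypothetical helper, and it assumes a DB-API connection whose cursors provide nextset() (MySQLdb does; other drivers may differ).

```python
def run_statements(conn, statements):
    """Execute statements one by one, reading every pending result set.

    Returns one list of rows per statement (empty for statements without rows).
    """
    cursor = conn.cursor()
    collected = []
    try:
        for statement in statements:
            cursor.execute(statement)
            rows = []
            while True:
                # Only statements that produce rows have a description.
                if cursor.description is not None:
                    rows.extend(cursor.fetchall())
                # Move past any further result, e.g. the status result of a CALL.
                if not cursor.nextset():
                    break
            collected.append(rows)
        # One commit for the whole batch instead of one per statement.
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
    return collected
```

Committing once per batch rather than once per statement is the part most likely to help with speed; whether it is acceptable depends on how much work you are willing to lose if one statement fails.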

Multiple write single read SQLite application with Peewee

I'm using an SQLite database with peewee on multiple machines, and I'm encountering various OperationalError and DatabaseError exceptions. It's obviously a multithreading problem, but I'm not at all an expert in this, nor in SQL. Here's my setup and what I've tried.
Settings
I'm using peewee to log machine learning experiments. Basically, I have multiple nodes (like, different computers) which run a python file, and all write to the same base.db file in a shared location. On top of that, I need a single read access from my laptop, to see what's going on. There are at most ~50 different nodes which instantiate the database and write things on it.
What I've tried
At first, I used the SQLite object:
import peewee as pw
db = pw.SqliteDatabase(None)
# ... Define tables Experiment and Epoch
def init_db(file_name: str):
    db.init(file_name)
    db.create_tables([Experiment, Epoch], safe=True)
    db.close()
def train():
    xp = Experiment.create(...)
    # Do stuff
    with db.atomic():
        Epoch.bulk_create(...)
    xp.save()
This worked fine, but some jobs occasionally crashed because the database was locked. Then I learnt that SQLite only handles one write operation at a time, which caused the problem.
So I turned to SqliteQueueDatabase since, according to the documentation, it's useful "if you want simple read and write access to a SQLite database from multiple threads." I also added some keyword arguments that I found in other threads and that were said to be useful.
The code then looked like this:
from playhouse.sqliteq import SqliteQueueDatabase
db = SqliteQueueDatabase(None, autostart=False, pragmas=[('journal_mode', 'wal')],
                         use_gevent=False,)
def init_db(file_name: str):
    db.init(file_name)
    db.start()
    db.create_tables([Experiment, Epoch], safe=True)
    db.connect()
and the same for saving stuff except for the db.atomic part. However, not only do write queries seem to encounter errors, I practically no longer have access to the database for read: it is almost always busy.
My question
What is the right object to use in this case? I thought SqliteQueueDatabase was the perfect fit. Is a pooled database a better fit? I'm also asking because I don't know if I have a good grasp of the threading part: having multiple database objects initialized from multiple machines is different from having a single object on a single machine with multiple threads (like this situation). Right? Is there a good way to handle things then?
Sorry if this question is already answered in another place, and thanks for any help! Happy to provide more code if needed of course.
SQLite only supports a single writer at a time, but in WAL mode multiple readers can have the database open, even while a writer is connected. In peewee you can enable WAL mode like this:
db = SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'})
The other crucial thing, when using multiple writers, is to keep your write transactions as short as possible. Some suggestions can be found here: https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ under the "Transactions, Concurrency and Autocommit" heading.
Also note that SqliteQueueDatabase works well for a single process with multiple threads, but will not help you at all if you have multiple processes.
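To make the two points above concrete (WAL mode and short write transactions), here is a minimal sketch that uses the standard-library sqlite3 module rather than peewee. The helper names, the 30-second timeout, and the epoch table with its experiment and loss columns are all assumptions made for illustration.

```python
import sqlite3

def open_wal_connection(path, timeout_seconds=30.0):
    """Open a SQLite connection in WAL mode; returns (connection, journal mode)."""
    # timeout: how long a blocked connection waits for the lock before
    # raising "database is locked".
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=wal").fetchone()[0]
    return conn, mode

def log_epochs(conn, rows):
    """Insert already-computed rows in one short write transaction."""
    # `with conn` commits on success and rolls back on error, so the write
    # lock is held only while the INSERT runs.
    with conn:
        conn.executemany(
            "INSERT INTO epoch (experiment, loss) VALUES (?, ?)", rows
        )
```

The idea is to do all slow work (training, computing metrics) before entering the transaction. Note that the SQLite documentation states that WAL requires all processes to be on the same host, so it is not suitable for a database file on a network filesystem.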
Indeed, after @BoarGules' comment, I realized that I had confused two very different things:
Having multiple threads on a single machine: here, SqliteQueueDatabase is a very good fit.
Having multiple machines, with one or more threads: that's basically how the internet works.
So I ended up installing PostgreSQL. Here are a few pointers, in case they are useful to people coming after me (for Linux):
Install PostgreSQL. If you don't have root privileges, you can build it from source following chapter 17 of the official documentation, then chapter 19.
You can export an SQLite database with pgloader. But again, if you don't have the right libraries and don't want to build everything, you can do it by hand. I did the following; I'm not sure whether a more straightforward solution exists.
Export your tables as CSV (following @coleifer's comment):
import csv
models = [Experiment, Epoch]
for model in models:
    outfile = '%s.csv' % model._meta.table_name
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        row_iter = model.select().tuples().iterator()
        writer.writerows(row_iter)
Create the tables in the new PostgreSQL database:
db = pw.PostgresqlDatabase('mydb', host='localhost')
db.create_tables([Experiment, Epoch], safe=True)
Copy the CSV tables into the PostgreSQL database with the following command:
COPY epoch("col1", "col2", ...) FROM '/absolute/path/to/epoch.csv' DELIMITER ',' CSV;
and likewise for the other tables.
It worked fine for me, as I had only two tables. It can be annoying if you have more than that. pgloader seems to be a very good solution in that case, if you can install it easily.
Update
At first I could not create objects from peewee. I got an integrity error: the id returned by PostgreSQL (through the RETURNING "epoch"."id" clause) was an id that already existed. As far as I understand, this is because the COPY command does not advance the sequence behind the id column. The sequence therefore returned 1, then 2, and so on, and each insert failed until it reached an id that did not exist yet. To avoid going through all these failed creations, you can reset the sequence directly with:
ALTER SEQUENCE epoch_id_seq RESTART WITH 10000
and replace 10000 with the value from SELECT MAX("id") FROM epoch, +1.
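Instead of looking up the maximum id by hand, the sequence can be set from the table contents in a single statement. The following is a sketch, not a tested migration script: reset_serial_sequence is a hypothetical helper, cursor is assumed to be a psycopg2-style DB-API cursor (%s placeholders), and the table and column names are assumed to be simple lowercase identifiers. It relies on PostgreSQL's setval() and pg_get_serial_sequence() functions.

```python
def reset_serial_sequence(cursor, table, column="id"):
    """Make the column's sequence continue at MAX(column) + 1 after a bulk COPY."""
    # Identifiers cannot be passed as query parameters, so only accept simple names.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError("unexpected identifier: %r" % (name,))
    sql = (
        "SELECT setval(pg_get_serial_sequence(%s, %s), "
        'COALESCE(MAX("{column}"), 0) + 1, false) FROM "{table}"'
    ).format(table=table, column=column)
    # With is_called=false, the next nextval() returns exactly the value set here.
    cursor.execute(sql, (table, column))
    return sql
```

With peewee, a raw cursor for this could be obtained from the database object; run it once per table after the COPY commands.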
I think you can just increase the timeout for SQLite to fix your problem.
The issue here is that the default sqlite timeout for writing is low, and when there is even small amounts of concurrent writes, sqlite will start throwing exceptions. This is common and well known.
The default should be something like 5-10 seconds. If you exceed this timeout then either increase it or chunk up your writes to the db.
Here is an example:
I return a DatabaseProxy here because this proxy allows sqlite to be swapped out for postgres without changing client code.
import atexit
from peewee import DatabaseProxy  # type: ignore
from playhouse.db_url import connect  # type: ignore
from playhouse.sqlite_ext import SqliteExtDatabase  # type: ignore
DB_TIMEOUT = 5
def create_db(db_path: str) -> DatabaseProxy:
    pragmas = (
        # Negative size is per api spec.
        ("cache_size", -1024 * 64),
        # wal speeds up writes.
        ("journal_mode", "wal"),
        ("foreign_keys", 1),
    )
    sqlite_db = SqliteExtDatabase(
        db_path,
        timeout=DB_TIMEOUT,
        pragmas=pragmas)
    sqlite_db.connect()
    atexit.register(sqlite_db.close)
    db_proxy: DatabaseProxy = DatabaseProxy()
    db_proxy.initialize(sqlite_db)
    return db_proxy
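The other suggestion in this answer, chunking up the writes, can be sketched with the standard-library sqlite3 module. The helper names, the chunk size of 500, and the epoch table with its experiment and loss columns are assumptions for illustration only.

```python
import sqlite3
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def insert_in_chunks(conn, rows, size=500):
    """Write rows in several short transactions; returns the number of transactions."""
    transactions = 0
    for chunk in chunked(rows, size):
        # Each `with conn` block commits on exit, releasing the write lock
        # so that other writers get a turn between chunks.
        with conn:
            conn.executemany(
                "INSERT INTO epoch (experiment, loss) VALUES (?, ?)", chunk
            )
        transactions += 1
    return transactions
```

The trade-off is that a failure in a later chunk leaves the earlier chunks committed, so this only suits writes that do not have to be all-or-nothing.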

Impala open connection in python

I'm after a way of querying Impala through Python which enables you to keep a connection open and pass queries to it.
I can connect quite happily to Impala using this sort of code:
import subprocess
sql = 'some sort of sql statement;'
cmds = ['impala-shell','-k','-B','-i','impala.company.corp','-q', sql]
out,err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
print(out.decode())
print(err.decode())
I can also switch out the -q and sql for -f and a file with sql statements as per the documentation here.
When I run this for multiple SQL statements, the name node it uses is the same for all the queries, and it will stop if there is a failure in the code (unless I use the option to continue). This is all expected.
What I'm trying to get to is where I can run a query or two, check the results using some python logic and then continue if it meets my criteria.
I have tried splitting my code into individual queries using sqlparse and running them one by one. This works well in isolation. However, if one statement is drop table if exists x; and the next is create table x (blah string); and x did exist, the second statement runs on a different node that the metadata change from the drop has not reached yet, so it fails with a "table x already exists" or similar error.
I'd think as well as getting round this metadata issue it would just make more sense to keep a connection open to impala whilst I run all the statements but I'm struggling to work this out.
Does anyone have any code that has this functionality?
You may wanna look at impyla, the Impala/Hive python client, if you haven't done so already.
As far as the second part of your question, using Impala's SYNC_DDL option will guarantee that DDL changes are propagated across impalads before next DDL is executed.
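A rough sketch of the "run a statement, check the result in Python, then continue" workflow on a single open connection follows. run_steps is a hypothetical helper written against the generic DB-API, so it is not specific to impyla; the commented connection lines show how it might be used with impyla, and the port and authentication settings there are assumptions to adapt to your cluster.

```python
def run_steps(conn, steps, session_options=()):
    """Run (sql, check) pairs on one cursor; stop at the first failed check.

    `check` is None or a callable that receives the fetched rows and returns
    True to continue. Returns the number of steps that completed.
    """
    cursor = conn.cursor()
    try:
        # Session-level settings, sent once on the same session as the steps.
        for option in session_options:
            cursor.execute(option)
        completed = 0
        for sql, check in steps:
            cursor.execute(sql)
            rows = cursor.fetchall() if cursor.description is not None else []
            if check is not None and not check(rows):
                break
            completed += 1
        return completed
    finally:
        cursor.close()

# Possible use with impyla (assumed installed; adjust host, port and auth):
# from impala.dbapi import connect
# conn = connect(host="impala.company.corp", port=21050, auth_mechanism="GSSAPI")
# run_steps(
#     conn,
#     [("drop table if exists x", None), ("create table x (blah string)", None)],
#     session_options=["SET SYNC_DDL=1"],
# )
```

Because every statement goes through the same cursor, the session options apply to all of them, which is what makes the SYNC_DDL suggestion usable from Python.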

How to execute many SELECT statements at once using python sqlite

I have some business logic that iterates many many times and needs to perform a simple query every time. Rather than make a call to the db every time I would like to store the SELECT statements as an array of strings or something similar and then execute all of the statements at once after the loop. Is this possible with python and sqlite?
The documentation says:
execute() will only execute a single SQL statement. If you try to execute more than one statement with it, it will raise a Warning. Use executescript() if you want to execute multiple SQL statements with one call.
However, executescript() does not allow you to access all the results.
To get multiple query results, you have to do the loop yourself:
def execute_many_selects(cursor, queries):
    return [cursor.execute(query).fetchall() for query in queries]
SQLite is an embedded library, so there is no client/server communication overhead when doing multiple database calls.
I suspect you'd be better off if you work out a "larger" query and then decompose the result set after retrieving the information.
In other words, rather than three calls to the database (one each for Alice, Betty and Claire), use something like:
select stuff from a_table
where person in ('Alice', 'Betty', 'Claire')
and then process the actual data taking person into account.
Obviously, that will only work in the case where you can figure out the query before executing any of the person-based actions, but it looks like that's the case anyway, based on your question.
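As a sketch of that approach with sqlite3, the query below fetches the rows for all people at once and then splits them per person in Python. fetch_by_person is a hypothetical helper, and the table and column names (a_table, person, stuff) simply follow the example above.

```python
import sqlite3
from collections import defaultdict

def fetch_by_person(conn, people):
    """Fetch rows for all people in one query and group them by person."""
    people = list(people)
    if not people:
        return {}
    # One "?" placeholder per value keeps the query parameterized.
    placeholders = ", ".join("?" for _ in people)
    sql = "SELECT person, stuff FROM a_table WHERE person IN (%s)" % placeholders
    grouped = defaultdict(list)
    for person, stuff in conn.execute(sql, people):
        grouped[person].append(stuff)
    return dict(grouped)
```

People without any matching row are simply absent from the returned dictionary. SQLite limits the number of parameters in a single statement, so a very long list of names would need to be split into several queries.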

How, using Python, can I find how long a MySQL query took?

I'm connecting to MySQL with the MySQLdb module. I don't want to use Python's time functions: I want to know how long the query ran within MySQL, i.e. the number I see after I've run a query within MySQL directly.
I do see a thread where this is addressed as something one could eventually dig down to, but I was hoping that since MySQL reports that number, the Python connection would have picked it up somewhere.
Maybe this helps?
SET profiling = 1;
Run your query;
SHOW PROFILES;
See here: http://dev.mysql.com/doc/refman/5.7/en/show-profile.html
Because the above commands will be removed in a future version, Performance Schema can be used instead: http://dev.mysql.com/doc/refman/5.7/en/performance-schema.html and http://dev.mysql.com/doc/refman/5.7/en/performance-schema-query-profiling.html.
The links above give more details on query profiling using Performance Schema.
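From Python, the profiling commands above can be issued through the same cursor as the query itself. The sketch below assumes a MySQLdb-style cursor that returns rows as tuples (not a dictionary cursor); profile_query is a hypothetical helper. It relies on SHOW PROFILES returning rows of (Query_ID, Duration, Query), with the most recent statement last.

```python
def profile_query(cursor, query, params=None):
    """Run `query` with session profiling on; return (rows, duration_in_seconds)."""
    cursor.execute("SET profiling = 1")
    try:
        cursor.execute(query, params)
        rows = cursor.fetchall()
        cursor.execute("SHOW PROFILES")
        # SHOW PROFILES is not profiled itself, so the last row is our query.
        profiles = cursor.fetchall()
        duration = float(profiles[-1][1]) if profiles else None
    finally:
        # Switch profiling off again so later statements are unaffected.
        cursor.execute("SET profiling = 0")
    return rows, duration
```

The duration is the server-side figure, so it excludes network transfer and the time Python spends building the result rows.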
