How to execute many SELECT statements at once using python sqlite

I have some business logic that iterates many many times and needs to perform a simple query every time. Rather than make a call to the db every time I would like to store the SELECT statements as an array of strings or something similar and then execute all of the statements at once after the loop. Is this possible with python and sqlite?

The documentation says:
execute() will only execute a single SQL statement. If you try to execute more than one statement with it, it will raise a Warning. Use executescript() if you want to execute multiple SQL statements with one call.
However, executescript() does not allow you to access all the results.
To get multiple query results, you have to do the loop yourself:
def execute_many_selects(cursor, queries):
    return [cursor.execute(query).fetchall() for query in queries]
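For instance, here's a hypothetical usage with an in-memory database, just to show the shape of the result (a list of fetchall() results, one per query):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])

queries = ['SELECT count(*) FROM t', 'SELECT max(x) FROM t']
print(execute_many_selects(conn.cursor(), queries))
# -> [[(3,)], [(3,)]]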
SQLite is an embedded library, so there is no client/server communication overhead when doing multiple database calls.

I suspect you'd be better off if you work out a "larger" query and then decompose the result set after retrieving the information.
In other words, rather than three calls to the database (one each for Alice, Betty and Claire), use something like:
select stuff from a_table
where person in ('Alice', 'Betty', 'Claire')
and then process the actual data taking person into account.
Obviously, that will only work in the case where you can figure out the query before executing any of the person-based actions, but it looks like that's the case anyway, based on your question.
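For example, here is a minimal sketch of that approach, assuming a cursor already exists and using the table and column names from the example above: run the combined query once, then group the rows by person in Python.

from collections import defaultdict

people = ['Alice', 'Betty', 'Claire']
placeholders = ','.join('?' for _ in people)
cursor.execute(
    'SELECT person, stuff FROM a_table WHERE person IN (%s)' % placeholders,
    people,
)

rows_by_person = defaultdict(list)
for person, stuff in cursor.fetchall():
    rows_by_person[person].append(stuff)

# rows_by_person['Alice'] now holds Alice's rows, and so on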

Related

Calling several SQL procedures at once with Python

I have made a stored procedure in MySQL which accepts several arguments and does its thing.
And while I have no problem executing the following query in MySQL:
CALL my_pr(var1, var2, var3); CALL my_pr(var4, var5, var6);
When I try to execute it (or any other 2 statements at once) via Python I get the following error:
Commands out of sync; you can't run this command now
But when I try to execute them one by one, everything works smoothly.
I am adding each statement to a list and then executing them via:
for stm in sql_stms:
    mycursor.execute(stm)
mydb.commit()
Each stm is set to either a single query or a multi-statement query in some code above, and my sql_stms contains several INSERT, SELECT and DELETE queries plus tens (or sometimes hundreds) of calls to a stored procedure.
My goal is to speed up the process; currently the slowest part of my code is submitting queries to SQL, so I believe that submitting multiple queries at once will be somewhat faster.
Any ideas and suggestions are welcome.
The connector is probably not expecting more than one result set; try enabling multi-statement execution. With mysql-connector-python, multi=True is passed to execute(), which then returns an iterator of results:
for result in mycursor.execute(stm, multi=True):
    if result.with_rows:
        result.fetchall()
The interface is not designed to easily get two "result sets" at once.
There is very little advantage in trying to run two statements together. Simply run them one at a time.
You can, on the other hand, build a third SP that makes those two CALLs. But, again, why bother.

Impala open connection in python

I'm after a way of querying Impala through Python which enables you to keep a connection open and pass queries to it.
I can connect quite happily to Impala using this sort of code:
import subprocess
sql = 'some sort of sql statement;'
cmds = ['impala-shell','-k','-B','-i','impala.company.corp','-q', sql]
out,err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
print(out.decode())
print(err.decode())
I can also switch out the -q and sql for -f and a file with sql statements as per the documentation here.
When I'm running this for multiple sql statements, the name node it uses is the same for all the queries, and it will stop if there is a failure in the code (unless I use the option to continue); this is all expected.
What I'm trying to get to is where I can run a query or two, check the results using some python logic and then continue if it meets my criteria.
I have tried splitting up my code into individual queries using sqlparse and running them one by one. This works well in isolation, but if one statement is a drop table if exists x; and the next is create table x (blah string);, then, if x did actually exist, the second statement may run on a different node that the metadata change from the drop hasn't reached yet, and it fails with a table x already exists or similar error.
As well as getting round this metadata issue, I'd think it would just make more sense to keep a connection open to impala whilst I run all the statements, but I'm struggling to work this out.
Does anyone have any code that has this functionality?
You may want to look at impyla, the Impala/Hive python client, if you haven't done so already.
As for the second part of your question, Impala's SYNC_DDL option will guarantee that DDL changes are propagated across impalads before the next statement is executed.
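For instance, a rough sketch with impyla (the host comes from the snippet above; the port, Kerberos settings and table are assumptions): one connection stays open for all statements, SYNC_DDL makes the DROP visible before the CREATE, and you can check results in Python between queries.

from impala.dbapi import connect

conn = connect(host='impala.company.corp', port=21050,
               auth_mechanism='GSSAPI')        # Kerberos, like -k in impala-shell
cur = conn.cursor()

cur.execute('SET SYNC_DDL=1')                  # wait for DDL changes to propagate
cur.execute('DROP TABLE IF EXISTS x')
cur.execute('CREATE TABLE x (blah STRING)')

cur.execute('SELECT count(*) FROM x')
(count,) = cur.fetchone()
if count == 0:                                 # python logic between queries
    cur.execute("INSERT INTO x VALUES ('hello')")

cur.close()
conn.close()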

Execute .sql file in Python with MySQLdb

I have a .sql file containing a bunch of SQL queries, with each query spanning multiple lines. I want to execute these queries in MySQL via Python using MySQLdb.
sqlite3 has "a nonstandard shortcut" for this purpose called executescript(), but there doesn't seem to be any equivalent function in MySQLdb.
I noticed this old question from 2 years ago which asks the same thing, but I found the answers unsatisfying. The answers are basically:
Use subprocess to run the mysql command and send it your .sql file.
This works, but it is rather inelegant, and it introduces unwanted complexity with error handling and such.
If each query is on a single line, just execute each line separately.
But in my case, they span multiple lines, so this won't work.
If each query is not on a single line, somehow join them.
But, how? I mean, I can hack up something easily enough so there's no need for you to reply with half-baked answers here, and maybe that's what I'll end up doing, but is there already an established library that does this? I'd feel more comfortable with a comprehensive and correct solution rather than a hack.
MySQLdb seems to allow this out of the box; you just have to call cursor.nextset() to cycle through the returned result sets.
db = conn.cursor()
db.execute('SELECT 1; SELECT 2;')
more = True
while more:
    print(db.fetchall())
    more = db.nextset()
If you want to be absolutely sure the support for this is enabled, and/or disable the support, you can use something like this:
MYSQL_OPTION_MULTI_STATEMENTS_ON = 0
MYSQL_OPTION_MULTI_STATEMENTS_OFF = 1
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_ON)
# Multiple statement execution here...
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_OFF)
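Tying the two pieces together for the original question, here is a hedged sketch (the connection parameters and the queries.sql filename are placeholders): read the whole file, send it as one multi-statement call, and walk the result sets.

import MySQLdb

MYSQL_OPTION_MULTI_STATEMENTS_ON = 0

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='test')
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_ON)

with open('queries.sql') as f:
    script = f.read()

cur = conn.cursor()
cur.execute(script)
while True:
    if cur.description is not None:    # this statement produced a result set
        print(cur.fetchall())
    if cur.nextset() is None:          # no more result sets
        break

conn.commit()
conn.close()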

what is the fastest way to update thousands of rows in mysql

let's assume you have a table with 1M rows and growing ...
every five minutes of every day you run a python program which has to update some fields of 50K rows
my question is: what is the fastest way to do the work?
1. run those updates in a loop and fire a commit after the last one is executed?
2. or generate a file and then run it through the command line?
3. or create a temp table with one huge, fast insert and then run a single update against the production table?
4. or use prepared statements?
5. or split it up into 1K updates per execute, to generate smaller log files?
6. or turn off logging while running the updates?
7. or do CASE expressions as in the mysql examples (but this works only up to 255 rows)?
i don't know ... has anyone done something like this? what is the best practice? i need to run it as fast as possible ...
Here are some ways you could speed up your UPDATEs.
When you UPDATE, the table records are just rewritten with new data, whereas REPLACE deletes the row and inserts a whole new one. That's why you should always use INSERT ... ON DUPLICATE KEY UPDATE instead of REPLACE.
The former is an UPDATE operation in case of a key violation, while the latter is a DELETE / INSERT.
Here's an example:
INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
More on this here.
UPDATE1: It's a good idea to do your inserts all in a single query. This should speed up your UPDATEs. See here on how to do that.
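For instance, a hedged sketch of the single-query idea with MySQLdb (the connection parameters, table and column names are made up): apply the changes as batched INSERT ... ON DUPLICATE KEY UPDATE rows instead of one UPDATE per row.

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='test')
cur = conn.cursor()

rows = [(1, 'a'), (2, 'b'), (3, 'c')]   # (id, new_value) pairs to apply

cur.executemany(
    "INSERT INTO my_table (id, field) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE field = VALUES(field)",
    rows,
)
conn.commit()
conn.close()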
UPDATE2: Now that I've had a chance to read your other sub-questions, here's what I know, point by point:
1. Instead of a loop, try to execute all the UPDATEs in a single sql batch with a single commit.
2. Not sure this is going to make any difference; the SQL queries themselves matter more.
3. This is something you could experiment with. Benchmark it. It depends on the size of the TABLE, the INDEXES you have, and whether it's INNODB or MYISAM.
4. No idea about this.
5. Refer to the first point.
6. Yes, this might speed your stuff up slightly. Also check whether you have slow_query_log turned on; it logs all slow queries to a separate logfile. Turn this off too.
7. Again, refer to the first point.
On the query execution process: the server first parses your query and then executes it, so you need to analyse your queries. If the server spends less time parsing, it executes faster; otherwise it is slow.

How do you make Python / PostgreSQL faster?

Right now I have a log parser reading through 515mb of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4ghz C2D, 2GB RAM)
Is it possible for this to go faster or is that a limitation on the language/database?
Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.
Three Things.
One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python. (A sketch of this follows at the end of this answer.)
Two. Don't Update.
Specifically, do not do this. It's bad code for two reasons.
cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)
It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]
And. Never build SQL using string substitution. Completely dumb.
If, for some reason the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame) you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
Three. Use Bind Variables. Always.
cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )
The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
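A minimal sketch of the first point, with table and column names guessed from the snippets above and psycopg2-style placeholders: load the people table into a dict up front and only touch the database for genuinely new names.

cursor.execute("SELECT name, id FROM people")
person_ids = dict(cursor.fetchall())

def person_id_for(name):
    # in-memory lookup instead of a SELECT per log line
    if name not in person_ids:
        cursor.execute("INSERT INTO people (name) VALUES (%(n)s) RETURNING id",
                       {'n': name})
        person_ids[name] = cursor.fetchone()[0]
    return person_ids[name]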
In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single sql statement with bind variables, to be executed with different values. So you could put this before the for loop:
insert_statement="""
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(:person_id,:message_type,:created_at,:channel)
"""
Then, in place of each sql statement you execute, put this:
cursor.execute(insert_statement, person_id='person',message_type='msg',created_at=some_date, channel=3)
This will make things run faster because:
The cursor object won't have to reparse the statement each time
The db server won't have to generate a new execution plan, as it can use the one it created previously.
You won't have to call santitize(), as special characters in the bind variables won't be part of the sql statement that gets executed.
Note: The bind variable syntax I used is Oracle specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
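For psycopg2, a hedged translation of the same statement into its pyformat paramstyle would look roughly like this (the values are the same placeholders used in the snippet above):

insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""

cursor.execute(insert_statement, {
    'person_id': person_id,
    'message_type': 'msg',
    'created_at': some_date,
    'channel': 3,
})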
Other optimizations:
You're incrementing with the "UPDATE people SET chats_count" after each loop iteration. Instead, keep a dictionary mapping person to chat count and then execute one statement per person with the total you've seen (see the sketch after this list). This will be faster than hitting the db after every record.
Use bind variables on ALL your queries, not just the insert statement; I chose that as an example.
Change all the find_*() functions that do db lookups to cache their results so they don't have to hit the db every time.
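Here is a rough sketch of the chat-count suggestion (the loop variable and column names are assumptions): tally in a dict while parsing, then write the totals in one pass at the end.

from collections import defaultdict

chat_counts = defaultdict(int)
for record in records:                  # records: whatever the parsing loop yields
    chat_counts[record.person_id] += 1

cursor.executemany(
    "UPDATE people SET chats_count = chats_count + %(n)s WHERE id = %(id)s",
    [{'id': pid, 'n': n} for pid, n in chat_counts.items()],
)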
psyco optimizes python programs that perform a large number of numeric operations. This script is IO bound rather than CPU bound, so I wouldn't expect it to give you much, if any, optimization.
Use bind variables instead of literal values in the sql statements and create a cursor for each unique sql statement so that the statement does not need to be reparsed the next time it is used. From the python db api doc:
Prepare and execute a database operation (query or command). Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]
A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).
ALWAYS ALWAYS ALWAYS use bind variables.
As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).
Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them in memory while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
In addition to the many fine suggestions Mark Roddy has given, do the following:
don't use readlines(); you can iterate over file objects directly
try to use executemany rather than execute: do batch inserts rather than single inserts; this tends to be faster because there's less overhead, and it also reduces the number of commits (a sketch follows below)
str.rstrip will work just fine instead of stripping off the newline with a regex
Batching the inserts will use more memory temporarily, but that should be fine as long as you don't read the whole file into memory.
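A short sketch combining these points (parse_line, the filename and the table layout are hypothetical): iterate the file lazily, strip newlines with rstrip, and insert in batches with executemany, committing once at the end.

insert_sql = (
    "INSERT INTO chats (person_id, message_type, created_at, channel) "
    "VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)"
)

BATCH_SIZE = 1000
batch = []
with open('logfile.txt') as f:
    for line in f:                                   # no readlines(): iterate lazily
        batch.append(parse_line(line.rstrip('\n')))  # parse_line returns a dict (hypothetical)
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(insert_sql, batch)
            batch = []
if batch:
    cursor.executemany(insert_sql, batch)
conn.commit()                                        # single commit at the end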
