Impala open connection in python - python

I'm after a way of querying Impala through Python which enables you to keep a connection open and pass queries to it.
I can connect quite happily to Impala using this sort of code:
import subprocess
sql = 'some sort of sql statement;'
cmds = ['impala-shell','-k','-B','-i','impala.company.corp','-q', sql]
out,err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
print(out.decode())
print(err.decode())
I can also switch out the -q and sql for -f and a file with sql statements as per the documentation here.
When I run this for multiple SQL statements, the name node it uses is the same for all the queries, and it will stop if there is a failure in the code (unless I use the option to continue). This is all expected.
What I'm trying to get to is where I can run a query or two, check the results using some python logic and then continue if it meets my criteria.
I have tried splitting up my code into individual queries using sqlparse and running them one by one. This works well in isolation, but suppose one statement is drop table if exists x; and the next is create table x (blah string);. If x did actually exist, the second statement runs on a different node that the metadata change from the drop hasn't reached yet, so it fails with a table x already exists or similar error.
I'd think as well as getting round this metadata issue it would just make more sense to keep a connection open to impala whilst I run all the statements but I'm struggling to work this out.
Does anyone have any code that has this functionality?

You may wanna look at impyla, the Impala/Hive python client, if you haven't done so already.
As for the second part of your question, using Impala's SYNC_DDL query option will guarantee that DDL changes are propagated across impalads before the next DDL statement is executed.
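A minimal sketch of how the two suggestions could fit together, assuming impyla is installed and that Kerberos authentication (the -k flag in the question) maps to auth_mechanism="GSSAPI". The helper names open_impala_connection and run_with_checks, the port 21050 and the check callback are illustrative assumptions, not part of impyla. The loop itself only relies on the generic DB-API cursor interface, so the same connection stays open while Python logic decides whether to continue.

```python
def open_impala_connection(host, port=21050):
    """Open one long-lived connection (hypothetical helper; needs impyla)."""
    # Imported lazily so the rest of this sketch works without impyla installed.
    from impala.dbapi import connect
    # Assumption: Kerberos auth, mirroring the -k flag used with impala-shell.
    return connect(host=host, port=port, auth_mechanism="GSSAPI")


def run_with_checks(conn, statements, check, session_setup=()):
    """Run statements one by one on a single open connection.

    check(statement, rows) is called after each statement; returning False
    stops the run. Returns the list of statements that passed the check.
    """
    completed = []
    cur = conn.cursor()
    try:
        # Session-level options, e.g. ("SET SYNC_DDL=1",) for Impala.
        for option in session_setup:
            cur.execute(option)
        for stmt in statements:
            cur.execute(stmt)
            # Only fetch when the statement actually produced a result set.
            rows = cur.fetchall() if cur.description else []
            if not check(stmt, rows):
                break
            completed.append(stmt)
    finally:
        cur.close()
    return completed


# Usage against Impala (not executed here; host name taken from the question):
# conn = open_impala_connection("impala.company.corp")
# run_with_checks(conn, my_statements, my_check, session_setup=("SET SYNC_DDL=1",))
```

Because every statement goes through the same connection, there is no need to spawn impala-shell per query, and the check callback is the place for the "continue only if it meets my criteria" logic.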

Related

Calling several SQL procedures at once with Python

I have made a stored procedure in MySQL which accepts several arguments and does its thing.
I have no problem executing the following query in MySQL:
CALL my_pr(var1, var2, var3); CALL my_pr(var4, var5, var6);
When I try to execute it (or any other 2 statements at once) via Python I get the following error:
Commands out of sync; you can't run this command now
But when I am trying to execute them one by one - everything works smoothly.
I am adding each statement to a list and then execute it via:
for stm in sql_stms:
    mycursor.execute(stm)
mydb.commit()
Where I set each stm to be a single query or a multiple statement query in some code above. And my sql_stms contain several INSERT, SELECT and DELETE queries and tens (or sometimes hundreds) of queries for a stored procedure.
My goal is to speed up the running process and currently the slowest part in my code is submitting queries to SQL, so I believe that when I submit multiple queries at once it will work slightly faster.
Any ideas and suggestions are welcomed.
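The cursor is probably not expecting more than one result set. If you are using mysql.connector, try passing multi=True to execute() (in the 8.x releases it is an argument of execute(), not of cursor()) and iterating over the results:
for result in mycursor.execute(stm, multi=True):
    if result.with_rows:
        result.fetchall()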
The interface is not designed to easily get two "result sets" at once.
There is very little advantage in trying to run two statements together. Simply run them one at a time.
You can, on the other hand, build a third SP that makes those two CALLs. But, again, why bother.

pyhs2 - execute finish but doesn't return

I'm an Italian developer and this is my first message here, so sorry if I'm doing something wrong.
This is my problem with the pyhs2 module:
I successfully build a connection to my Hive database using the pyhs2.connect method, and everything works; the problem is that the same query sometimes freezes, depending on the width of the 'date' clause I use in my query.
Let me explain: if I run cur.execute method with the same query, first with the clause
(date >= '2017-03-01' and date <= '2017-05-10')
then with the clause
(date >= '2017-03-01' and date <= '2017-05-11')
(or even without the 'date' clause)
the first run works and returns the correct results, while the second (or third) hangs until I manually stop the script.
This behavior seems very odd to me, because I know there is data after 05-10. I can also check my database's running applications: after my second query ends successfully, it stays pending in the running applications for a while, and even when it's done, the Python script stays frozen and never returns the results.
I think it might be a timeout problem or something similar, but I've searched both this site and the web in general for solutions and found nothing.
I don't know whether it is a Hive problem (for example, related to the Tez type of its applications) or a pyhs2 problem (perhaps it cannot retrieve a query's results after a certain amount of time), so I ask for your help.
Thanks in advance,
Luca

How, using Python, can I find how long a MySQL query took?

I'm connecting to MySQL with the MySQLdb module. I don't want to use Python's time functions: I want to know how long the query ran within MySQL, i.e. the number I see after I've run a query within MySQL directly.
I do see a thread where this is addressed as something one could eventually dig down to, but I was hoping that since MySQL reports that number, the Python connection would have picked it up somewhere.
Maybe this helps?
SET profiling = 1;
Run your query;
SHOW PROFILES;
See here: http://dev.mysql.com/doc/refman/5.7/en/show-profile.html
Because the above commands will be removed in a future version, the Performance Schema can be used instead: http://dev.mysql.com/doc/refman/5.7/en/performance-schema.html and http://dev.mysql.com/doc/refman/5.7/en/performance-schema-query-profiling.html.
The links above give more details on query profiling using the Performance Schema.
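As a rough sketch, the profiling commands above could be driven from Python through any DB-API connection such as one from MySQLdb. The function name timed_query is hypothetical, and the sketch assumes the default tuple cursor, that SHOW PROFILES returns rows of (Query_ID, Duration, Query), and that its last row belongs to the statement just executed.

```python
def timed_query(conn, sql):
    """Run sql and return (rows, duration_in_seconds) as reported by MySQL.

    Assumption: conn is a DB-API connection to MySQL (for example from
    MySQLdb.connect) and the session is allowed to enable profiling.
    """
    cur = conn.cursor()
    try:
        # Enable per-session profiling, as in the answer above.
        cur.execute("SET profiling = 1")
        cur.execute(sql)
        rows = cur.fetchall()
        # Each profile row is (Query_ID, Duration, Query).
        cur.execute("SHOW PROFILES")
        profiles = cur.fetchall()
    finally:
        cur.close()
    # The most recent profiled statement is the last row.
    duration = float(profiles[-1][1]) if profiles else None
    return rows, duration
```

The duration here is the server-side figure from SHOW PROFILES, so it excludes network and Python overhead, which is what the question asks for.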

Can anyone tell me what's the point of connection.commit() in Python pyodbc?

I used to be able to run this from Python using just an execute statement, which inserts the values 1 and 2 into A and B respectively. But since last week I get no error, and yet nothing happens in my database: no flag, nothing. The values 1, 2 are not inserted or replaced in my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I have added it:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little: why do I need this if I know my connection did not get lost?
New to python - Thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently, for instance when transferring money from one account to another using two update queries?
You can set autocommit to true if you don't want this, but there are not many reasons to do that.
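The money-transfer case can be sketched as follows. This uses the standard library's sqlite3 module instead of pyodbc so that it runs without a database server; the accounts table, the names and the transfer function are made up for illustration. The commit() and rollback() calls play the same role on a pyodbc connection.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    cur = conn.cursor()
    try:
        cur.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        cur.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
        if cur.fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        cur.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        # Both updates become visible together.
        conn.commit()
    except Exception:
        # Undo the first update if anything went wrong before the commit.
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
)
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails and is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The failed transfer leaves no trace: the debit that had already been executed is undone by rollback(), which is exactly what would be lost with autocommit on.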

Execute .sql file in Python with MySQLdb

I have a .sql file containing a bunch of SQL queries, with each query spanning multiple lines. I want to execute these queries in MySQL via Python using MySQLdb.
sqlite3 has "a nonstandard shortcut" for this purpose called executescript(), but there doesn't seem to be any equivalent function in MySQLdb.
I noticed this old question from 2 years ago which asks the same thing, but I found the answers unsatisfying. The answers are basically:
Use subprocess to run the mysql command and send it your .sql file.
This works, but it is rather inelegant, and it introduces unwanted complexity with error handling and such.
If each query is on a single line, just execute each line separately.
But in my case, they span multiple lines, so this won't work.
If each query is not on a single line, somehow join them.
But, how? I mean, I can hack up something easily enough so there's no need for you to reply with half-baked answers here, and maybe that's what I'll end up doing, but is there already an established library that does this? I'd feel more comfortable with a comprehensive and correct solution rather than a hack.
MySQLdb seems to allow this out of the box, you just have to call cursor.nextset() to cycle through the returned result sets.
db = conn.cursor()
db.execute('SELECT 1; SELECT 2;')
more = True
while more:
    print(db.fetchall())
    more = db.nextset()
If you want to be absolutely sure the support for this is enabled, and/or disable the support, you can use something like this:
MYSQL_OPTION_MULTI_STATEMENTS_ON = 0
MYSQL_OPTION_MULTI_STATEMENTS_OFF = 1
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_ON)
# Multiple statement execution here...
conn.set_server_option(MYSQL_OPTION_MULTI_STATEMENTS_OFF)
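Applied to the original question, the nextset() approach might look like the sketch below. The function name run_sql_file is hypothetical, and it assumes that multi-statement support is enabled on the connection as described above and that the whole file can be sent to the server in one execute() call.

```python
def run_sql_file(conn, path):
    """Execute every statement in a .sql file and collect its result sets.

    Assumption: conn is a MySQLdb-style DB-API connection whose cursor
    accepts several semicolon-separated statements in a single execute().
    """
    with open(path, encoding="utf-8") as handle:
        script = handle.read()

    result_sets = []
    cur = conn.cursor()
    try:
        cur.execute(script)
        while True:
            # Statements without a result set (INSERT, DDL) have no description.
            if cur.description is not None:
                result_sets.append(cur.fetchall())
            # nextset() returns a false value once all results are consumed.
            if not cur.nextset():
                break
        conn.commit()
    finally:
        cur.close()
    return result_sets
```

Since the server does the statement splitting, queries that span multiple lines need no special handling on the Python side.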
