I'm an Italian developer and this is my first message here, so sorry if I'm doing something wrong.
This is my problem with the pyhs2 module:
I successfully build a connection to my Hive database using the pyhs2.connect method, and everything works fine; the problem is that the same query sometimes freezes, depending on the width of the 'date' clause I use in it.
Let me explain: if I run the cur.execute method with the same query, first with the clause
(date >= '2017-03-01' and date <= '2017-05-10')
then with the clause
(date >= '2017-03-01' and date <= '2017-05-11')
(or even without the 'date' clause)
the first occurrence works fine and returns the correct results, while the second (or third) stays frozen until I manually stop the script.
This behavior is very weird to me, because I know there is data after 05-10. I can also check my database's running applications: after my second query successfully ends, it stays pending in the running applications for a while, and even when it's done, the Python script stays in a 'frozen' state and never returns the correct results.
I think it might be a timeout problem or something like that, but I've searched both here on this site and on the web in general for a solution and found nothing.
I don't know if it could be a Hive problem (for example, related to the Tez execution engine its applications use) or something in pyhs2 that, after a certain amount of time, is no longer able to retrieve the results of a query, so I'm asking for your help.
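For clarity, this is roughly the pattern I'm using (host, credentials, table name and the exact query are placeholders, not my real ones):
import pyhs2

with pyhs2.connect(host='my-hive-host', port=10000, authMechanism='PLAIN',
                   user='user', password='password', database='default') as conn:
    with conn.cursor() as cur:
        # this one returns the correct results
        cur.execute("SELECT * FROM my_table WHERE date >= '2017-03-01' AND date <= '2017-05-10'")
        print(cur.fetch())
        # this one (or the same query without the date clause) never returns
        cur.execute("SELECT * FROM my_table WHERE date >= '2017-03-01' AND date <= '2017-05-11'")
        print(cur.fetch())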
Thanks in advance,
Luca
Related
I'm after a way of querying Impala through Python which enables you to keep a connection open and pass queries to it.
I can connect quite happily to Impala using this sort of code:
import subprocess
sql = 'some sort of sql statement;'
cmds = ['impala-shell','-k','-B','-i','impala.company.corp','-q', sql]
out,err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
print(out.decode())
print(err.decode())
I can also switch out the -q and sql for -f and a file with sql statements as per the documentation here.
When I run this for multiple SQL statements, the name node it uses is the same for all the queries, and it will stop if there is a failure in the code (unless I use the option to continue); this is all as expected.
What I'm trying to get to is a point where I can run a query or two, check the results using some Python logic, and then continue if they meet my criteria.
I have tried splitting my code into individual queries using sqlparse and running them one by one. This works well in isolation, but if one statement is drop table if exists x; and the next is create table x (blah string);, and x did actually exist, then because the second statement runs on a different node, the metadata change from the drop hasn't reached that node yet and it fails with a 'table x already exists' or similar error.
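Roughly what that splitting attempt looks like (the file name here is a placeholder):
import subprocess
import sqlparse

# split the script into individual statements and run each one in its own
# impala-shell invocation (which is why each may hit a different node)
with open('statements.sql') as f:
    statements = [s.strip() for s in sqlparse.split(f.read()) if s.strip()]

for sql in statements:
    cmds = ['impala-shell', '-k', '-B', '-i', 'impala.company.corp', '-q', sql]
    out, err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
    print(out.decode())
    print(err.decode())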
I'd think that, as well as getting round this metadata issue, it would just make more sense to keep a connection open to Impala whilst I run all the statements, but I'm struggling to work this out.
Does anyone have any code that has this functionality?
You may want to look at impyla, the Impala/Hive Python client, if you haven't done so already.
As for the second part of your question, using Impala's SYNC_DDL option will guarantee that DDL changes are propagated across impalads before the next DDL statement is executed.
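A minimal sketch of both points together, assuming an impyla install and the default impalad port (use auth_mechanism='GSSAPI' in connect() if you authenticate with Kerberos, as your -k flag suggests); the table and queries are just illustrations:
from impala.dbapi import connect

conn = connect(host='impala.company.corp', port=21050)
cur = conn.cursor()

# ask Impala to propagate DDL changes across impalads before returning
cur.execute('SET SYNC_DDL=1')

cur.execute('DROP TABLE IF EXISTS x')
cur.execute('CREATE TABLE x (blah STRING)')

cur.execute('SELECT COUNT(*) FROM x')
print(cur.fetchall())  # check the result in Python and decide whether to continue

cur.close()
conn.close()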
I currently have an application that, at any given time, will INSERT new data into my database. I also have a different Python script that checks my database in an infinite loop for a new entry, and when it finds one it selects it, uses it, then waits again.
I'm wondering if there is any way of doing this more efficiently and more accurately.
Thanks
I currently have a set up like this:
import pyodbc

conn = pyodbc.connect('Driver={ODBC Driver 13 for SQL Server};'
                      'Server=REDACTED;'
                      'Database=REDACTED;'
                      'UID=REDACTED;'
                      'PWD=REDACTED;')
cursor = conn.cursor()
in a loop such as:
while True:
    # check the database for a new entry with a SELECT statement and compare the
    # old list with the current list; if there is a difference, use that new key
    # and process the data
    ...
It currently works fine, but I feel like it is doing a lot of work for nothing. For example, during the week it will only ever really access the database 30-50 times a day, but on the weekend it will do so closer to 100-200 times a day... the only thing is, there is no set number of times it will access it, or when.
Any help would be useful.
Thanks
I’ve never had to do this before. But 3 ideas came to mind.
(1) If your script is writing to another database, then one alternative could be setting up replication of your source database. The DBA of the original database can set that up for you.
(2) If you are willing to overhaul the original database, you could consider a real-time in-memory database (such as Redis); depending on your use case, that might help.
(3) It appears that SQLAlchemy has a built-in event listener. I use SQLAlchemy in Python but have never used this particular feature.
https://docs.sqlalchemy.org/en/13/orm/session_events.html
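A minimal sketch of idea (3), with a made-up Entry model; note that the listener only fires for inserts made through SQLAlchemy in the same process, so the inserting application would have to go through the ORM as well:
from sqlalchemy import Column, Integer, String, event
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Entry(Base):  # hypothetical model for the table the application INSERTs into
    __tablename__ = 'entries'
    id = Column(Integer, primary_key=True)
    payload = Column(String)

@event.listens_for(Entry, 'after_insert')
def on_new_entry(mapper, connection, target):
    # called right after a new Entry row is inserted; process it here
    print('new entry:', target.id, target.payload)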
My basic problem is that I am trying to have two Python programs run simultaneously and have access to the same database table. I feel like this should have a simple solution, but it has passed me by so far.
All my attempts at this have caused the database (SQLite) to be locked and the program to fall over.
I have tried being clever with the timing of how the two programs run so that as one program opens the connection the other closes it, copying data from one database to another, etc., but this just gets horrible and messy very quickly. Also, a big goal in my design is to keep latency to an absolute minimum.
The basic structure is pictured below.
I should add too that program one ('always running and adding to the database') works on a milliseconds timescale.
Program two can be in the multiple seconds range. Obviously none of my solutions have been able to come close to that.
Any help, steps in the right direction or links to further reading is greatly appreciated!
Cheers
Although your title mentions MySQL, in your question you are only using SQLite. Now, SQLite is a perfectly capable database if you only have a single process accessing it, but it is not good for multiple simultaneous access. This is exactly where you need a proper client-server database, like MySQL.
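For illustration, with MySQL each program simply opens its own connection. A rough sketch of the writer side using pymysql (database, table and credentials are placeholders); the reader script connects the same way and just runs its SELECTs:
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='secret',
                       database='mydb', autocommit=True)
with conn.cursor() as cur:
    # each committed INSERT is immediately visible to the other program's connection
    cur.execute("INSERT INTO readings (value) VALUES (%s)", (42,))
conn.close()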
I've got a Python script talking to a MySQL database.
This script has been working fine for months.
All of a sudden it isn't actually adding anything to the tables it's supposed to modify.
The script has a lot of print statements and error handlers and it still runs exactly as if it was working, but nothing shows up in the database.
It even prints out "rows affected: 108" or whatever, but when I go look at the database in phpMyAdmin it says there are zero rows in the table.
The only thing it will do is truncate the tables. There's a section at the beginning that truncates the relevant tables so the script can start filling them up again. If I manually create a new row in a table through phpMyAdmin, that row will disappear when the script runs, like it's properly truncating the tables. But nothing after that does anything. It still runs without errors, but it doesn't actually modify the database.
Thanks. Yeah, for some reason the script was no longer autocommitting by default.
I added "cnx.autocommit(True)" and it's working again.
I used to be able to run and execute Python using simply an execute statement. This would insert the values 1, 2 into A, B accordingly. But starting last week, I get no error, yet nothing happens in my database. No flag, nothing... 1, 2 doesn't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I have added
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little: why do I need this if I know my connection did not get lost?
New to Python. Thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently? For instance, transferring money from one account to another using two UPDATE queries.
You can set autocommit to true if you don't want this behavior, but there aren't many reasons to do that.
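The transfer example as a sketch, using a DB-API connection via pymysql (table, columns and credentials are made up); both UPDATEs become visible together at commit(), and rollback() undoes both if anything fails in between:
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='secret', database='bank')
try:
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
        cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    conn.commit()  # both updates appear in the database at this point
except Exception:
    conn.rollback()  # undo everything since the last commit
    raise
finally:
    conn.close()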