Read from database in chunks with retries - python

I am reading from a Microsoft SQL Server instance. I need to read all data from a table which is quite big (~4 million records), so I'd like to do that in chunks to limit the memory usage of my Python program.
This normally works fine, but now I need to move where this runs, which forces it to go over a not-so-stable connection (I believe a VPN is sometimes throttling the connection). So occasionally I get a connection error in one of the chunks:
sqlalchemy.exc.OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x68 (104) (
SQLGetData)')
The code I run comes down to this:
import pandas as pd
from sqlalchemy import create_engine

connection_string = 'mssql+pyodbc://DB_USER:DB_PASSWORD@DB_HOST/DB_NAME?trusted_connection=no&driver=ODBC+Driver+17+for+SQL+Server'
db = create_engine(connection_string, pool_pre_ping=True)

query = 'SELECT * FROM table'
for chunk in pd.read_sql_query(query, db, chunksize=500_000):
    # do stuff with chunk
What I would like to know: is it possible to add a retry mechanism that can continue with the correct chunk if the connection fails? I've tried a few options, but none of them seem to be able to recover and continue at the same chunk.

query = 'SELECT * FROM table'
is bad practice: always filter by the fields you need and process the table in chunks of 500 records.
https://www.w3schools.com/sql/sql_top.asp
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE condition;
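To tie this back to the original question, here is a minimal sketch of chunked reads that can retry a failed chunk and resume at the same place. It assumes the table has a monotonically increasing key column (called id here) to order by; the connection string is the one from the question, and the retry count and sleep are arbitrary. For very large tables, keyset pagination (WHERE id > last_seen_id) is cheaper than a growing OFFSET, but the idea is the same.
import time

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

engine = create_engine(connection_string, pool_pre_ping=True)
chunk_size = 500_000
offset = 0

while True:
    page_query = (
        "SELECT * FROM table "
        "ORDER BY id "
        f"OFFSET {offset} ROWS FETCH NEXT {chunk_size} ROWS ONLY"
    )
    for attempt in range(5):
        try:
            chunk = pd.read_sql_query(page_query, engine)
            break
        except OperationalError:
            # Connection dropped mid-chunk: wait and re-run the same page,
            # so no rows are skipped or processed twice.
            time.sleep(5)
    else:
        raise RuntimeError("chunk failed after 5 retries")
    if chunk.empty:
        break
    # do stuff with chunk
    offset += chunk_size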

I feel your pain. My VPN is the same. I'm not sure if this is a viable solution for you, but you can try this technique.
import time

retry_flag = True
retry_count = 0
cursor = cnxn.cursor()  # cnxn is an existing pyodbc connection
while retry_flag and retry_count < 5:
    try:
        cursor.execute('SELECT too_id FROM [TTMM].[dbo].[Machines] WHERE MachineID = {}'.format(machineid))
        too_id = cursor.fetchone()[0]
        cursor.execute('INSERT INTO [TTMM].[dbo].[{}](counter, effectively, too_id) VALUES ({},{},{})'.format(machineid, counter, effectively, too_id))
        retry_flag = False
        print("Printed To DB - Counter = ", counter, ", Effectively = ", effectively, ", too_id = ", too_id)
    except Exception as e:
        print(e)
        print("Retry after 5 sec")
        retry_count = retry_count + 1
        # Throw away the broken connection, wait, and reconnect before retrying
        cursor.close()
        cnxn.close()
        time.sleep(5)
        cnxn = pyodbc.connect('DRIVER=FreeTDS;SERVER=*;PORT=*;DATABASE=*;UID=*;PWD=*;TDS_Version=8.7;', autocommit=True)
        cursor = cnxn.cursor()
cursor.close()
How to retry after sql connection failed in python?

Related

Last record from a pyodbc query doesn't process (to messaging system using stomp)

I'm querying a database, which returns 38 records. In the example below, all 38 print, but the last record is not sent to the messaging system; only the first 37 are. What am I missing here?
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};' + connectionString)
cnxn.autocommit = True
cursor = cnxn.cursor()
sql = (sql statement that returns 38 records)
cursor.execute(sql)
print("Connected")
row = cursor.fetchone()
while row:
    conn = stomp.Connection([('this.that.sys', '61616')])
    conn.connect('user', 'pass', wait=True)
    print("Send " + row.id)
    conn.send(destination='test1.topic::test1.test1.queue', body=row.payload)
    row = cursor.fetchone()
conn.disconnect()
This outputs a list of all 38 IDs, but the last payload doesn't publish to the queue.
I did some more testing. This is odd. It seems to depend on the records, which are between 7KB and 15KB. If I try a certain two, only the first publishes. With a certain three, they all publish. With another certain three, the last one again doesn't publish.
How can I debug this?
Edit: I've done more experiments. Does STOMP not guarantee delivery? I don't get any errors, but sometimes the messages just don't arrive, especially the last one. I refactored my code a bit, which seems to maybe help a little. I don't know...
cursor.execute(sql)
print("Connected")
row = cursor.fetchone()
print("First " + row.id)
i = 1
try:
    conn = stomp.Connection([('this.that.sys', '61616')])
    conn.connect('user', 'pass', wait=True)
    while row:
        print(str(i))
        print("Send " + row.id + " " + row.status)
        conn.send('test1.topic::test1.test1.queue', row.payload)
        row = cursor.fetchone()
        i = i + 1
    conn.disconnect()
except Exception as e:
    print("Error: %s" % e)
Is the send method executing asynchronously? If so, then disconnect might be getting called before the send actually happens. Therefore, you might try delaying invocation of the disconnect method.
It's also worth noting that the STOMP protocol supports the receipt header which you can use to get a reply back from the broker ensuring that your SEND frame was processed successfully. I'm not 100% sure that stomp.py supports this, but it should since it's part of the protocol specification.
Also, it might be worth activating trace logging for STOMP on the broker. Then you can see exactly what the broker is receiving from the client (e.g. if it's receiving the DISCONNECT frame prematurely).
Lastly, you definitely shouldn't be connecting & disconnecting to send a single message. That's a well-known anti-pattern and should be avoided if at all possible. Your second code snippet is better in this regard.
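For what it's worth, a rough sketch of the "connect once, send everything, disconnect last" shape described above (the host, port, credentials, destination and pyodbc cursor are the ones from the question; not tested against your broker):
import stomp

conn = stomp.Connection([('this.that.sys', 61616)])
conn.connect('user', 'pass', wait=True)
try:
    row = cursor.fetchone()
    while row:
        conn.send(destination='test1.topic::test1.test1.queue', body=row.payload)
        row = cursor.fetchone()
finally:
    # Disconnect only after every row has been handed to the broker.
    conn.disconnect()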
I would change it:
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};' + connectionString)
cnxn.autocommit = True
cursor = cnxn.cursor()
sql = (sql statement that returns 38 records)
cursor.execute(sql)
print("Connected")
rows = cursor.fetchall()
for row in rows:
    conn = stomp.Connection([('this.that.sys', '61616')])
    conn.connect('user', 'pass', wait=True)
    print("Send " + row.id)
    conn.send(destination='test1.topic::test1.test1.queue', body=row.payload)
conn.disconnect()

Fast Connection to a SQL Server with pyodbc

I am grabbing JSON data from a messaging bus and dumping that JSON into a database. I had this working pretty well with psycopg2, doing ~3000 entries/sec into Postgres. For a number of reasons we've since moved to SQL Server 2016, and my inserts dropped to around 100 per second.
I've got a function called insert_into() that inserts the JSON into the database. All I've really done to my insert_into() function is change the library to pyodbc and the connection string. It seems that my slowdown comes from setting up and then tearing down the connection each time the function is called ('conn' in the code below). If I move the line that sets up the connection outside of my insert_into() function, my speed comes back. I was just wondering two things:
What's the proper way to set up connections like this from a SQL Server perspective?
Is this even the best way to do this in Postgres?
For SQL Server, the server is 2016, using ODBC Driver 17, SQL authentication.
Slow for SQL Server:
def insert_into():
    conn = None
    try:
        conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=server1;DATABASE=json;UID=user;PWD=pass')
        cur = conn.cursor()
        for i in buffer_list:
            command = 'INSERT INTO jsonTable (data) VALUES (%s)' % ("'" + i + "'")
            cur.execute(command)
        cur.close()
        conn.commit()
    except (Exception, pyodbc.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
Fast for SQL Server:
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=server1;DATABASE=json;UID=user;PWD=pass')

def insert_into():
    #conn = None
    try:
        cur = conn.cursor()
        for i in buffer_list:
            command = 'INSERT INTO jsonTable (data) VALUES (%s)' % ("'" + i + "'")
            cur.execute(command)
        cur.close()
        conn.commit()
    except (Exception, pyodbc.DatabaseError) as error:
        print(error)
This daemon runs 24/7 and any advice on setting up a fast connection to MSSQL will be greatly appreciated.
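Not a definitive answer, but a sketch of the usual shape for this: keep one long-lived connection (as in your fast version) and let pyodbc batch the inserts with a parameterized executemany instead of formatting values into the SQL string. fast_executemany needs a reasonably recent pyodbc and the Microsoft ODBC driver; the table name comes from your snippet, and buffer_list is passed in rather than read as a global.
import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=server1;DATABASE=json;UID=user;PWD=pass')

def insert_into(buffer_list):
    cur = conn.cursor()
    cur.fast_executemany = True  # send the batch in bulk instead of one round trip per row
    cur.executemany('INSERT INTO jsonTable (data) VALUES (?)', [(i,) for i in buffer_list])
    conn.commit()
    cur.close()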

DatabaseError: ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')

I am trying to read data from SQL Server into a pandas DataFrame. Below is the code.
def get_data(size):
    con = pyodbc.connect(r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting')
    cur = con.cursor()
    db_cmd = "select distinct top %s * from dbo.KrishAnalyticsAllCalls" % size
    res = cur.execute(db_cmd)
    sql_out = pd.read_sql_query(db_cmd, con, chunksize=10**6)
    frames = [chunk for chunk in sql_out]
    df_sql = pd.concat(frames)
    return df_sql

df = get_data(5000000)
I am getting the following error:
pandas.io.sql.DatabaseError: Execution failed on sql 'select distinct
top 500000 * from dbo.KrishAnalyticsAllCalls': ('HY000', '[HY000]
[Microsoft][ODBC SQL Server Driver]Connection is busy with results for
another hstmt (0) (SQLExecDirectW)')
I had executed the function before and interrupted the execution with ctrl+k because I wanted to make a change in the function. Now, after making the change, when I try to execute the function I get the above error.
How can I kill that connection or IPython kernel, since I don't know of any IPython kernel still executing the query from the function?
I was facing the same issue. It was fixed when I used the fetchall() function. The following is the code that I used.
import pandas as pd
import pypyodbc as pyodbc

def connect(self, query):
    con = pyodbc.connect(self.CONNECTION_STRING)
    cursor = con.cursor()
    print('Connection to db successful')
    cmd = (query)
    results = cursor.execute(cmd).fetchall()
    df = pd.read_sql(query, con)
    return df, results
Using cursor.execute(cmd).fetchall() instead of cursor.execute(cmd) resolved it.
Hope this helps.
The issue is due to the cursor being executed just before the pd.read_sql_query() call.
Pandas uses the connection and the SQL string to get the data; a DB cursor is not required.
#res = cur.execute(db_cmd)
sql_out = pd.read_sql_query(db_cmd, con, chunksize=10**6)
print(sql_out)
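Put back into the question's own function, that advice might look roughly like this (a sketch only; the connection string, table and chunk size are the ones from the question):
import pandas as pd
import pyodbc

def get_data(size):
    con = pyodbc.connect(r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting')
    db_cmd = "select distinct top %s * from dbo.KrishAnalyticsAllCalls" % size
    # pandas issues the query itself, so there is no second active statement
    # competing for results on the same connection
    chunks = pd.read_sql_query(db_cmd, con, chunksize=10**6)
    df_sql = pd.concat(chunks)
    con.close()
    return df_sql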
Most likely you haven't connected to the SQL server yet. Or, you connected in a previous instance for a different SQL query that was run. Either way, you need to re-establish the connection.
import pyodbc as pyodbc
conn = pyodbc.connect('Driver={YOUR_DRIVER};''Server=YOUR_SERVER;''Database=YOUR_DATABASE;''Trusted_Connection=yes')
Then execute your SQL:
sql = conn.cursor()
sql.execute("""ENTER YOUR SQL""")
Then transform into Pandas:
df = pd.DataFrame.from_records(sql.fetchall(),columns=[desc[0] for desc in sql.description])

How can SQLAlchemy be taught to recover from a disconnect?

According to http://docs.sqlalchemy.org/en/rel_0_9/core/pooling.html#disconnect-handling-pessimistic, SQLAlchemy can be instrumented to reconnect if an entry in the connection pool is no longer valid. I created the following test case to test this:
import subprocess

from sqlalchemy import create_engine, event
from sqlalchemy import exc
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    cursor = dbapi_connection.cursor()
    try:
        print "pinging server"
        cursor.execute("SELECT 1")
    except:
        print "raising disconnect error"
        raise exc.DisconnectionError()
    cursor.close()

engine = create_engine('postgresql://postgres@localhost/test')
connection = engine.connect()

subprocess.check_call(['psql', str(engine.url), '-c',
                       "select pg_terminate_backend(pid) from pg_stat_activity " +
                       "where pid <> pg_backend_pid() " +
                       "and datname='%s';" % engine.url.database],
                      stdout=subprocess.PIPE)

result = connection.execute("select 'OK'")
for row in result:
    print "Success!", " ".join(row)
But instead of recovering I receive this exception:
sqlalchemy.exc.OperationalError: (OperationalError) terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Since "pinging server" is printed on the terminal it seems safe to conclude that the event listener is attached. How can SQLAlchemy be taught to recover from a disconnect?
It looks like the checkout listener is only called when you first get a connection from the pool (e.g. your connection = engine.connect() line).
If you subsequently lose your connection, you will have to explicitly replace it, so you could just grab a new one and retry your SQL:
try:
    result = connection.execute("select 'OK'")
except sqlalchemy.exc.OperationalError:  # may need more exceptions here
    connection = engine.connect()  # grab a new connection
    result = connection.execute("select 'OK'")  # and retry
This would be a pain to do around every bit of SQL, so you could wrap database queries using something like:
def db_execute(conn, query):
    try:
        result = conn.execute(query)
    except sqlalchemy.exc.OperationalError:  # may need more exceptions here (or trap all)
        conn = engine.connect()  # replace your connection
        result = conn.execute(query)  # and retry
    return result
The following:
result = db_execute(connection, "select 'OK'")
Should now succeed.
Another option would be to also listen for the invalidate method, and take some action at that time to replace your connection.
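On newer SQLAlchemy (1.2 and later) the pessimistic ping from the linked docs no longer needs a hand-written listener; the pool can test each connection at checkout and transparently replace dead ones. Note this only protects the moment of checkout: a connection that dies mid-query still raises and needs an application-level retry like the wrapper above. A minimal sketch, using the same test database URL:
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres@localhost/test', pool_pre_ping=True)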

MySQLdb with multiple transaction per connection

Is it okay to use a single MySQLdb connection for multiple transactions without closing the connection between them? In other words, something like this:
conn = MySQLdb.connect(host="1.2.3.4", port=1234, user="root", passwd="x", db="test")
for i in range(10):
    try:
        cur = conn.cursor()
        query = "DELETE FROM SomeTable WHERE ID = %d" % i
        cur.execute(query)
        cur.close()
        conn.commit()
    except Exception:
        conn.rollback()
conn.close()
It seems to work okay, but I just wanted to double check.
I think there is a misunderstanding about what constitutes a transaction here.
Your example opens up one connection, then executes one transaction on it. You execute multiple SQL statements in that transaction, but you close it completely after committing. Of course that's more than fine.
Executing multiple transactions (as opposed to just multiple SQL statements) looks like this:
conn = MySQLdb.connect(host="1.2.3.4", port=1234, user="root", passwd="x", db="test")
for j in range(10):
    try:
        for i in range(10):
            cur = conn.cursor()
            query = "DELETE FROM SomeTable WHERE ID = %d" % i
            cur.execute(query)
            cur.close()
        conn.commit()
    except Exception:
        conn.rollback()
conn.close()
The above code commits 10 transactions, each consisting of 10 individual delete statements.
And yes, you should be able to re-use the open connection for that without problems, as long as you don't share that connection between threads.
For example, SQLAlchemy re-uses connections by pooling them, handing out open connections as needed to the application. New transactions and new statements are executed on these connections throughout the lifetime of an application, without needing to be closed until the application is shut down.
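As a rough illustration of that pooling (SQLAlchemy here, with a placeholder URL): each engine.begin() block checks a connection out of the pool, runs one transaction, commits, and hands the connection back for the next caller instead of closing it.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqldb://root:x@1.2.3.4:1234/test")

for i in range(10):
    with engine.begin() as conn:  # one pooled connection, one transaction per block
        conn.execute(text("DELETE FROM SomeTable WHERE ID = :id"), {"id": i})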
It would be better to first build a query string and then execute that single MySQL statement. For example:
query = "DELETE FROM table_name WHERE id IN ("
for i in range(10):
query = query + "'" + str(i) + "', "
query = query[:-2] + ')'
cur = conn.cursor()
cur.execute(query)
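A variant of the same single-statement idea, using placeholders so MySQLdb does the quoting and escaping instead of string concatenation (same table and IDs as above):
ids = list(range(10))
placeholders = ", ".join(["%s"] * len(ids))
query = "DELETE FROM table_name WHERE id IN (%s)" % placeholders

cur = conn.cursor()
cur.execute(query, ids)  # MySQLdb substitutes and escapes each id
cur.close()
conn.commit()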
