I have a python application that is reading from mysql/mariadb, uses that to fetch data from an api and then inserts results into another table.
I set up a module with a function that connects to the database and returns the connection object, which is then passed to other functions/modules. However, I believe this might not be the correct approach. The idea was to have a small module that I could just call whenever I needed to connect to the DB.
Also note that I am using the same connection object during loops (and within the loop passing it to the db_update module), and I call close() when all is done.
I am also getting some warnings from the DB sometimes; those mostly happen at the point where I call db_conn.close(), so I guess I am not handling the connection or session/engine correctly. Also, the connection IDs in the log warnings keep increasing, which is another hint that I am doing it wrong.
[Warning] Aborted connection 351 to db: 'some_db' user: 'some_user' host: '172.28.0.3' (Got an error reading communication packets)
Here is some pseudo code that represents the structure I currently have:
################
## db_connect.py
################
# imports ...
from sqlalchemy import create_engine
def db_connect():
    # get env ...
    db_string = f"mysql+pymysql://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}"
    try:
        engine = create_engine(db_string)
    except Exception as e:
        print(e)
        return None
    db_conn = engine.connect()
    return db_conn
################
## db_update.py
################
# imports ...
from sqlalchemy import text

def db_insert(db_conn, api_result):
    # ...
    ins_qry = "INSERT INTO target_table (attr_a, attr_b) VALUES (:a, :b);"
    ins_qry = text(ins_qry)
    ins_qry = ins_qry.bindparams(a=value_a, b=value_b)
    try:
        db_conn.execute(ins_qry)
    except Exception as e:
        print(e)
        return None
    return True
################
## main.py
################
from sqlalchemy import text
from db_connect import db_connect
from db_update import db_insert
def run():
    try:
        db_conn = db_connect()
        if not db_conn:
            return False
    except Exception as e:
        print(e)
        return False

    qry = """SELECT *
             FROM some_table
             WHERE some_attr IN (:some_value);"""
    qry = text(qry)
    search_run_qry = qry.bindparams(
        some_value='abc'
    )
    result_list = db_conn.execute(search_run_qry).fetchall()

    for result_item in result_list:
        ## do stuff like fetching data from api for every record in the query result
        api_result = get_api_data(...)
        ## insert into db:
        db_ins_status = db_insert(db_conn, api_result)
        ## ...

    db_conn.close()

run()
EDIT: Two more questions:
a) Is it OK, in a loop that does an update on every iteration, to use the same connection, or would it be wiser to instead pass the engine to the run() function and call db_conn = engine.connect() and db_conn.close() just before and after each update (see the sketch below)?
b) I am thinking about using ThreadPoolExecutor instead of the loop for the API calls. Would this have implications on how to use the connection, i.e. can I use the same connection for multiple threads that are doing updates to the same table?
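To illustrate a), this is roughly what I have in mind (just a sketch based on the pseudo code above; as far as I understand, the engine keeps its own connection pool and can be shared across threads, while a single connection should not be used from several threads at once):
from sqlalchemy import create_engine, text

# created once (e.g. in db_connect.py) and passed around instead of a connection
engine = create_engine(db_string)

def db_insert(engine, value_a, value_b):
    ins_qry = text(
        "INSERT INTO target_table (attr_a, attr_b) VALUES (:a, :b);"
    ).bindparams(a=value_a, b=value_b)
    # engine.begin() checks a connection out of the pool, commits on success
    # and returns the connection to the pool when the block ends
    with engine.begin() as db_conn:
        db_conn.execute(ins_qry)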
Note: I am not using the ORM feature mostly because I have a strong DWH/SQL background (though not so much as DBA) and I am used to writing even complex sql queries. I am thinking about switching to just using PyMySQL connector for that reason.
Thanks in advance!
Yes, you can return/pass a connection object as a parameter, but what is the aim of the db_connect method other than testing the connection? As far as I can see it serves no real purpose, so I would recommend doing it the way I have done it before.
I would like to share a code snippet from one of my project.
def create_record(sql_query: str, data: tuple):
    try:
        connection = mysql_obj.connect()
        db_cursor = connection.cursor()
        db_cursor.execute(sql_query, data)
        connection.commit()
        return db_cursor, connection
    except Exception as error:
        print(f'Connection failed error message: {error}')
and then I call a similar function whenever I need to fetch data:
db_cursor, connection, query_data = fetch_data(sql_query, query_data)
and after I am done with everything, I close the connection with the following method:
def close_connection(connection, db_cursor):
    """
    This method used to close SQL server connection
    """
    db_cursor.close()
    connection.close()
and the corresponding call:
close_connection(connection, db_cursor)
I am not sure whether I can share my GitHub here, but please check the link: under model.py you can see the database methods, and to see how they are called, check main.py.
Best,
Hasan.
Related
I'm trying to create a Prefect task that receives as input an instance of PyMySQL connection, such as:
@task
def connect_db():
    connection = pymysql.connect(user=user,
                                 password=password,
                                 host=host,
                                 port=port,
                                 db=db,
                                 connect_timeout=5,
                                 cursorclass=pymysql.cursors.DictCursor,
                                 local_infile=True)
    return connection

@task
def query_db(connection) -> Any:
    query = 'SELECT * FROM myschema.mytable;'
    with connection.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return rows

@task
def get_df(rows) -> Any:
    return pd.DataFrame(rows, dtype=str)

@task
def save_csv(df):
    path = 'mypath'
    df.to_csv(path, sep=';', index=False)

with Flow(FLOW_NAME) as f:
    con = connect_db()
    rows = query_db(con)
    df = get_df(rows)
    save_csv(df)
However, as I try to register the resulting flow, it raises "TypeError: cannot pickle 'socket' object". Going through Prefect's Docs, I've found built-in MySQL Tasks ( https://docs.prefect.io/api/latest/tasks/mysql.html#mysqlexecute), but they open and close connections each time they're called. Is there any way to pass a connection previously opened to a Prefect Task (or implement such thing as a connection manager)?
I tried to replicate your example but it registers fine. The most common way an error like this pops up is if you have a client in the global namespace that the flow uses. Prefect will try to serialize that upon registration. For example, the following code snippet will error if you try to register it:
import pymysql

connection = pymysql.connect(user=user,
                             password=password,
                             host=host,
                             port=port,
                             db=db,
                             connect_timeout=5,
                             cursorclass=pymysql.cursors.DictCursor,
                             local_infile=True)

@task
def query_db(connection) -> Any:
    query = 'SELECT * FROM myschema.mytable;'
    with connection.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return rows

with Flow(FLOW_NAME) as f:
    rows = query_db(connection)
This errors because the connection variable is serialized along with the flow object. You can work around this by storing your Flow as a script. See this link for more information:
https://docs.prefect.io/core/idioms/script-based.html#using-script-based-flow-storage
This will avoid the serialization of the Flow object and create that connection during runtime.
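Roughly, with Prefect 0.14+ / 1.x style storage it looks like the following (treat this as a sketch; the exact storage class and arguments depend on your Prefect version, so check the linked doc):
from prefect.storage import Local

# reference the script itself instead of pickling the Flow object
f.storage = Local(path="/absolute/path/to/this_script.py", stored_as_script=True)
f.register(project_name="my-project")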
If this happens during runtime
If you encounter this error during runtime, there are two possible reasons you can see this. The first is Dask serializing it, and the second is from the Prefect checkpointing.
Dask uses cloudpickle to send the data to the workers across a network. So if you use Prefect with a DaskExecutor, it will use cloudpickle to send the tasks for execution. Thus, task inputs and outputs need to be serializable. In this scenario, you should instantiate the Client and perform the query inside a task (like you saw with the current MySQL Task implementation)
If you use a LocalExecutor, task outputs are serialized by default because checkpointing is on by default. You can toggle this by passing checkpoint=False when you define the task.
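For example, a rough sketch of both options, reusing the PyMySQL placeholders from your snippet:
# Case (2): with a LocalExecutor, turn checkpointing off for the task that
# returns the connection so Prefect does not try to serialize its output
@task(checkpoint=False)
def connect_db():
    return pymysql.connect(user=user, password=password, host=host, port=port,
                           db=db, cursorclass=pymysql.cursors.DictCursor)

# Case (1): with a DaskExecutor, open, use and close the connection inside a
# single task so it never has to cross a task boundary
@task
def query_db() -> Any:
    connection = pymysql.connect(user=user, password=password, host=host, port=port,
                                 db=db, cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cur:
            cur.execute('SELECT * FROM myschema.mytable;')
            return cur.fetchall()
    finally:
        connection.close()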
If you need further help, feel free to join the Prefect Slack channel at prefect.io/slack.
I have started learning how to use psycopg2 with Python. I have quite a few scripts, and in an example scenario there can be up to 150 connections, while by default we cannot have more than 100 connections open at the same time. What I do right now is that whenever I want to run a database query, I connect to the database, do the execution and then close the connection. However, I believe that opening and closing new connections is expensive and connections should be longer-lived.
I have done something like this:
DATABASE_CONNECTION = {
    "host": "TEST",
    "database": "TEST",
    "user": "TEST",
    "password": "TEST"
}
def get_all_links(store):
    """
    Get all links from given store
    :param store:
    :return:
    """
    conn = psycopg2.connect(**DATABASE_CONNECTION)
    sql_update_query = "SELECT id, link FROM public.store_items WHERE store = %s AND visible = %s;"
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    try:
        data_tuple = (store, "yes")
        cursor.execute(sql_update_query, data_tuple)
        test_data = [{"id": links["id"], "link": links["link"]} for links in cursor]
        cursor.close()
        conn.close()
        return test_data
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        cursor.close()
        conn.rollback()
        return 1

def get_all_stores():
    """
    Get all stores in database
    :return:
    """
    conn = psycopg2.connect(**DATABASE_CONNECTION)
    sql_update_query = "SELECT store FROM public.store_config;"
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    try:
        cursor.execute(sql_update_query)
        test_data = [stores["store"] for stores in cursor]
        cursor.close()
        conn.close()
        return test_data
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        cursor.close()
        conn.rollback()
        return 1
I wonder how I can make this as efficient as possible, so that I can have a lot of scripts connected to the database and still not hit the max_connections limit?
I forgot to add that the way I am connecting is that I have multiple scripts, e.g.:
test1.py
test2.py
test3.py
....
....
Every script runs on its own, and they all import database.py, which contains the code I showed above.
UPDATE:
import psycopg2
import psycopg2.extras
from psycopg2 import pool

threaded_postgreSQL_pool = pool.ThreadedConnectionPool(1, 2,
                                                       user="test",
                                                       password="test",
                                                       host="test",
                                                       database="test")
if threaded_postgreSQL_pool:
    print("Connection pool created successfully using ThreadedConnectionPool")

def get_all_stores():
    """
    Get all stores in database
    :return:
    """
    # Use getconn() method to get a connection from the connection pool
    ps_connection = threaded_postgreSQL_pool.getconn()
    sql_update_query = "SELECT store FROM public.store_config;"
    ps_cursor = ps_connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
    try:
        ps_cursor.execute(sql_update_query)
        test_data = [stores["store"] for stores in ps_cursor]
        ps_cursor.close()
        threaded_postgreSQL_pool.putconn(ps_connection)
        print("Put away a PostgreSQL connection")
        return test_data
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        ps_cursor.close()
        ps_connection.rollback()
        return 1
While opening and closing database connections is not free, it is also not all that expensive compared to starting up and stopping the Python interpreter. If all your scripts run independently and briefly, that is probably the first thing you should fix. You have to decide and describe how your scripts are scheduled and invoked before you can know how (and whether) to use a connection pooler.
and as we know, we cannot have more than 100 connections connected at the same time.
100 is the default setting for max_connections, but it is entirely configurable. You can increase it if you want to. If you refactor for performance, you should probably do so in a way that naturally means you don't need to raise max_connections. But refactoring just because you don't want to raise max_connections is letting the tail wag the dog.
You are right, establishing a database connection is expensive; therefore, you should use connection pooling. But there is no need to re-invent the wheel, since psycopg2 has built-in connection pooling:
Use a psycopg2.pool.SimpleConnectionPool or psycopg2.pool.ThreadedConnectionPool (depending on whether you use threading or not) and use the getconn() and putconn() methods to grab or return a connection.
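For example, a minimal sketch with SimpleConnectionPool (the connection parameters are the same placeholders as above):
from psycopg2 import pool

# one pool per process, created once at import time
connection_pool = pool.SimpleConnectionPool(1, 10,
                                            host="TEST", database="TEST",
                                            user="TEST", password="TEST")

def get_all_stores():
    conn = connection_pool.getconn()   # borrow a connection from the pool
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT store FROM public.store_config;")
            return [row[0] for row in cursor.fetchall()]
    finally:
        connection_pool.putconn(conn)  # hand it back instead of closing it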
I have the following:
class FooData(object):
    def __init__(self):
        ...
        try:
            self.my_cnf = os.environ['HOME'] + '/.my.cnf'
            self.my_cxn = mysql.connector.connect(option_files=self.my_cnf)
            self.cursor = self.my_cxn.cursor(dictionary=True)
        except mysql.connector.Error as err:
            if err.errno == 2003:
                self.my_cnf = None
                self.my_cxn = None
                self.cursor = None
I am able to use my_cxn and cursor without any obvious failure. I never explicitly terminate the connection, and have observed the following messages in my mysql error log though...
2017-01-08T15:16:09.355190Z 132 [Note] Aborted connection 132 to db:
'mydatabase' user: 'myusername' host: 'localhost'
(Got an error reading communication packets)
Am I going about this the wrong way? Would it be more efficient for me to initialize my connector and cursor every time I need to run a query?
What do I need to look for in the MySQL config to avoid these aborted connections?
Separately, I also observe these messages in my error logs frequently:
2017-01-06T15:28:45.203067Z 0 [Warning] Changed limits: max_open_files: 1024
(requested 5000)
2017-01-06T15:28:45.205191Z 0 [Warning] Changed limits: table_open_cache: 431
(requested 2000)
Is it related to the above? What does it mean and how can I resolve it?
I tried various solutions involving /lib/systemd/system/mysql.service.d/limits.conf and other configuration settings but couldn't get any of them to work.
It's not a config issue. When you are done with a connection you should close it by explicitly calling close(). It is generally best practice to maintain the connection for a long time, as creating one takes time. It's not possible to tell from your code snippet where the best place to close it would be; it's whenever you're "done" with it, perhaps at the end of your __main__ method. Similarly, you should close the cursor explicitly when you're done with it. Typically that happens after each query.
So, maybe something like:
class FooData(object):
    def __init__(self):
        ...
        self.my_cnf = os.environ['HOME'] + '/.my.cnf'
        self.my_cxn = mysql.connector.connect(option_files=self.my_cnf)

    def execute_some_query(self, query_info):
        """Runs a single query. Thus it creates a cursor to run the
        query and closes it when it's done."""
        # Note that cursor is not a member variable as it's only for the
        # life of this one query
        cursor = self.my_cxn.cursor(dictionary=True)
        cursor.execute(...)
        # All done, close the cursor
        cursor.close()

    def close(self):
        """Users of this class should **always** call close when they are
        done with this class so it can clean up the DB connection."""
        self.my_cxn.close()
You might also look into the Python with statement for a nice way to ensure everything is always cleaned up.
I rewrote my class above to look like this...
class FooData(object):
    def __init__(self):
        self.myconfig = {
            'option_files': os.environ['HOME'] + '/.my.cnf',
            'database': 'nsdata'
        }
        self.mysqlcxn = None

    def __enter__(self):
        try:
            self.mysqlcxn = mysql.connector.connect(**self.myconfig)
        except mysql.connector.Error as err:
            if err.errno == 2003:
                self.mysqlcxn = None
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self.mysqlcxn is not None and self.mysqlcxn.is_connected():
            self.mysqlcxn.close()

    def etl(self):
        ...
I can then use with ... as and ensure that I am cleaning up properly.
with FooData() as obj:
    obj.etl()
The Aborted connection messages are thus properly eliminated.
Oliver Dain's response set me on the right path and Explaining Python's '__enter__' and '__exit__' was very helpful in understanding the right way to implement my Class.
Error: OperationalError: (OperationalError) (2006, 'MySQL server has gone away'). I already received this error when I coded a project with Flask, but I can't understand why I am getting it here.
I have code like this (yes, if the code is small and executes fast, there are no errors):
db_engine = create_engine('mysql://root@127.0.0.1/mind?charset=utf8', pool_size=10, pool_recycle=7200)
Base.metadata.create_all(db_engine)
Session = sessionmaker(bind=db_engine, autoflush=True)
Session = scoped_session(Session)
session = Session()
# there many classes and functions
session.close()
And this code returns the error 'MySQL server has gone away', but only after some time, when I use pauses in my script.
The MySQL I use is from openserver.ru (it's a web server stack similar to WAMP).
Thanks.
Looking at the mysql docs, we can see that there are a bunch of reasons why this error can occur. However, the two main reasons I've seen are:
1) The most common reason is that the connection has been dropped because it hasn't been used in more than 8 hours (default setting)
By default, the server closes the connection after eight hours if nothing has happened. You can change the time limit by setting the wait_timeout variable when you start mysqld
I'll just mention for completeness the two ways to deal with that, but they've already been mentioned in other answers:
A: I have a very long running job and so my connection is stale. To fix this, I refresh my connection:
create_engine(conn_str, pool_recycle=3600) # recycle every hour
B: I have a long running service and long periods of inactivity. To fix this I ping mysql before every call:
create_engine(conn_str, pool_pre_ping=True)
2) My packet size is too large, which should throw this error:
_mysql_exceptions.OperationalError: (1153, "Got a packet bigger than 'max_allowed_packet' bytes")
I've only seen this buried in the middle of the trace, though often you'll only see the generic _mysql_exceptions.OperationalError (2006, 'MySQL server has gone away'), so it's hard to catch, especially if logs are in multiple places.
The above doc says the max packet size is 64MB by default, but it's actually 16MB, which can be verified with SELECT @@max_allowed_packet.
To fix this, decrease packet size for INSERT or UPDATE calls.
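For example, a rough sketch of batching the rows, assuming a DB-API cursor such as PyMySQL's (table and column names are made up):
def insert_in_batches(cursor, rows, batch_size=1000):
    # send the rows in several smaller statements so that no single
    # packet exceeds max_allowed_packet
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cursor.executemany(
            "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
            batch,
        )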
SQLAlchemy now has a great write-up on how you can use pinging to be pessimistic about your connection's freshness:
http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
From there,
from sqlalchemy import exc
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except:
        # optional - dispose the whole pool
        # instead of invalidating one at a time
        # connection_proxy._pool.dispose()

        # raise DisconnectionError - pool will try
        # connecting again up to three times before raising.
        raise exc.DisconnectionError()
    cursor.close()
And a test to make sure the above works:
from sqlalchemy import create_engine

e = create_engine("mysql://scott:tiger@localhost/test", echo_pool=True)
c1 = e.connect()
c2 = e.connect()
c3 = e.connect()
c1.close()
c2.close()
c3.close()
# pool size is now three.

print("Restart the server")
input()

for i in range(10):
    c = e.connect()
    print(c.execute("select 1").fetchall())
    c.close()
From the documentation, you can use the pool_recycle parameter:
from sqlalchemy import create_engine
e = create_engine("mysql://scott:tiger@localhost/test", pool_recycle=3600)
I just faced the same problem and solved it with some effort. I hope my experience is helpful to others.
Following some suggestions, I used a connection pool and set pool_recycle to less than wait_timeout, but it still didn't work.
Then I realized that a global session may just keep using the same connection, so the connection pool didn't help. To avoid a global session, generate a new session for each request and remove it with Session.remove() after processing.
Finally, all is well.
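A rough sketch of that pattern with scoped_session, shown here with a Flask teardown hook since the question mentions Flask (the connection URL and pool settings are just the ones from the question):
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

app = Flask(__name__)
db_engine = create_engine('mysql://root@127.0.0.1/mind?charset=utf8',
                          pool_size=10, pool_recycle=3600)
Session = scoped_session(sessionmaker(bind=db_engine, autoflush=True))

@app.teardown_appcontext
def remove_session(exception=None):
    # return the session's connection to the pool after every request instead
    # of keeping one global session (and its connection) alive forever
    Session.remove()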
One more point to keep in mind is to manually push the Flask application context when initializing the database. This should resolve the issue.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()
app = Flask(__name__)

with app.app_context():
    db.init_app(app)
https://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-optimistic
def sql_read(cls, sql, connection):
    """sql for read action like select
    """
    LOG.debug(sql)
    try:
        result = connection.engine.execute(sql)
        header = result.keys()
        for row in result:
            yield dict(zip(header, row))
    except OperationalError as e:
        LOG.info("recreate pool due to %s" % e)
        connection.engine.pool.recreate()
        result = connection.engine.execute(sql)
        header = result.keys()
        for row in result:
            yield dict(zip(header, row))
    except Exception as ee:
        LOG.error(ee)
        raise SqlExecuteError()
I have a seemingly straight-forward situation, but can't find a straight-forward solution.
I'm using sqlalchemy to query postgres. If a client timeout occurs, I'd like to stop/cancel the long running postgres queries from another thread. The thread has access to the Session or Connection object.
At this point I've tried:
session.bind.raw_connection().close()
and
session.connection().close()
and
session.close()
and
session.transaction.close()
But no matter what I try, the postgres query still continues until its end. I know this from watching pg in top. Shouldn't this be fairly easy to do? Am I missing something? Is this impossible without getting the pid and sending a stop signal directly?
This seems to work well, so far:
def test_close_connection(self):
    import threading
    from psycopg2.extensions import QueryCanceledError
    from sqlalchemy.exc import DBAPIError

    session = Session()
    conn = session.connection()
    sql = self.get_raw_sql_for_long_query()
    seconds = 5
    t = threading.Timer(seconds, conn.connection.cancel)
    t.start()
    try:
        conn.execute(sql)
    except DBAPIError as e:
        if type(e.orig) == QueryCanceledError:
            print('Long running query was cancelled.')
    t.cancel()
source
For those MySQL folks that may have ended up here, a modified version of this answer that kills the query from a second connection can work. Essentially the following, assuming pymysql under the hood:
thread_id = conn1.connection.thread_id()
t = threading.Timer(seconds, lambda: conn2.execute("kill {}".format(thread_id)))
The original connection will raise pymysql.err.OperationalError. See this other answer for a neat way to create a long running query for testing.
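Putting it together for testing, a rough sketch (the connection URL is a placeholder and SELECT SLEEP(60) stands in for the real long-running query):
import threading
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/test")
conn1 = engine.connect()
conn2 = engine.connect()

# id of the MySQL session behind conn1, taken from the underlying PyMySQL connection
thread_id = conn1.connection.thread_id()

# after 5 seconds, kill conn1's query from the second connection
t = threading.Timer(5, lambda: conn2.execute(text("KILL {}".format(thread_id))))
t.start()
try:
    conn1.execute(text("SELECT SLEEP(60)"))
except Exception as e:  # pymysql.err.OperationalError once the query is killed
    print("query was killed:", e)
finally:
    t.cancel()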
I found that MySQL lets you specify query optimizer hints.
One such hint is MAX_EXECUTION_TIME, which specifies how long a query may run before it is terminated.
You can add this in your app.py:
@event.listens_for(engine, 'before_execute', retval=True)
def intercept(conn, clauseelement, multiparams, params):
    from sqlalchemy.sql.selectable import Select

    # check if it's a select statement
    if isinstance(clauseelement, Select):
        # 'froms' represents the list of tables that the statement is querying
        table = clauseelement.froms[0]

        '''Update the timeout here in ms (1s = 1000ms)'''
        timeout_ms = 4000

        # prepend the MAX_EXECUTION_TIME optimizer hint to the statement
        clauseelement = clauseelement.prefix_with(f"/*+ MAX_EXECUTION_TIME({timeout_ms}) */", dialect="mysql")

    return clauseelement, multiparams, params
See also: SQLAlchemy query API not working correctly with hints, and the MySQL reference.