I'm trying to create a Prefect task that receives as input an instance of PyMySQL connection, such as:
@task
def connect_db():
    connection = pymysql.connect(user=user,
                                 password=password,
                                 host=host,
                                 port=port,
                                 db=db,
                                 connect_timeout=5,
                                 cursorclass=pymysql.cursors.DictCursor,
                                 local_infile=True)
    return connection

@task
def query_db(connection) -> Any:
    query = 'SELECT * FROM myschema.mytable;'
    with connection.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return rows

@task
def get_df(rows) -> Any:
    return pd.DataFrame(rows, dtype=str)

@task
def save_csv(df):
    path = 'mypath'
    df.to_csv(path, sep=';', index=False)

with Flow(FLOW_NAME) as f:
    con = connect_db()
    rows = query_db(con)
    df = get_df(rows)
    save_csv(df)
However, as I try to register the resulting flow, it raises "TypeError: cannot pickle 'socket' object". Going through Prefect's Docs, I've found built-in MySQL Tasks ( https://docs.prefect.io/api/latest/tasks/mysql.html#mysqlexecute), but they open and close connections each time they're called. Is there any way to pass a connection previously opened to a Prefect Task (or implement such thing as a connection manager)?
I tried to replicate your example but it registers fine. The most common way an error like this pops up is if you have a client in the global namespace that the flow uses. Prefect will try to serialize that upon registration. For example, the following code snippet will error if you try to register it:
import pymysql

connection = pymysql.connect(user=user,
                             password=password,
                             host=host,
                             port=port,
                             db=db,
                             connect_timeout=5,
                             cursorclass=pymysql.cursors.DictCursor,
                             local_infile=True)

@task
def query_db(connection) -> Any:
    query = 'SELECT * FROM myschema.mytable;'
    with connection.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return rows

with Flow(FLOW_NAME) as f:
    rows = query_db(connection)
This errors because the connection variable is serialized along with the flow object. You can work around this by storing your Flow as a script. See this link for more information:
https://docs.prefect.io/core/idioms/script-based.html#using-script-based-flow-storage
This will avoid the serialization of the Flow object and create that connection during runtime.
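To see where the error message comes from without involving Prefect or a real database, here is a minimal sketch using only the standard library. FakeConnection and try_pickle are hypothetical names made up for this illustration; the only assumption is that a real connection object, like this stand-in, holds an open socket somewhere inside it.

```python
import pickle
import socket


class FakeConnection:
    """Hypothetical stand-in for a DB connection: it simply owns a socket."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def close(self):
        self.sock.close()


def try_pickle(obj):
    """Return the pickling error message, or None if the object pickles fine."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


conn = FakeConnection()
error = try_pickle(conn)   # pickling the object reaches the socket and fails
conn.close()
print(error)
```

Anything that gets pickled together with the flow (or with a task result) hits the same wall as soon as it references such an object.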
If this happens during runtime
If you encounter this error during runtime, there are two likely causes: Dask serializing task data, or Prefect checkpointing.
Dask uses cloudpickle to send data to its workers across the network. So if you use Prefect with a DaskExecutor, it will use cloudpickle to send the tasks for execution, which means task inputs and outputs need to be serializable. In this scenario, you should instantiate the client and perform the query inside a single task (as the current MySQL task implementation does).
If you use a LocalExecutor, task outputs are still serialized because checkpointing is on by default. You can turn it off for a task by passing checkpoint=False when you define it.
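A rough sketch of the "connect and query inside one task" idea, assuming sqlite3 as a stand-in for PyMySQL and a plain function in place of a decorated Prefect task (so it runs without Prefect or a database server). The point is that only plain rows leave the function, and those can be pickled.

```python
import pickle
import sqlite3


def query_db(db_path, query):
    """Open a connection, run the query, and close the connection again."""
    conn = sqlite3.connect(db_path)
    try:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(query).fetchall()
        # Convert to plain dicts so nothing tied to the connection escapes.
        return [dict(row) for row in rows]
    finally:
        conn.close()


rows = query_db(":memory:", "SELECT 1 AS id, 'alice' AS name")
# The result survives a pickle round trip, unlike the connection itself.
restored = pickle.loads(pickle.dumps(rows))
print(restored)
```

The trade-off is one connection per task run, which is the same behaviour as the built-in MySQL tasks mentioned in the question.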
If you need further help, feel free to join the Prefect Slack channel at prefect.io/slack .
Related
Let me start off by saying I am extremely new to Python and Postgresql so I feel like I'm in way over my head. My end goal is to get connected to the dvdrental database in postgresql and be able to access/manipulate the data. So far I have:
created a .config folder with a database.ini inside it that holds my login credentials.
in my src I have a config.py file that uses ConfigParser, see below:
from configparser import ConfigParser

def config(filename='.config/database.ini', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db
then also in my src I have a tasks.py file that has a basic connect function, see below:
import pandas as pd
from clients.config import config
import psycopg

def connect():
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # read connection parameters
        params = config()
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = psycopg.connect(**params)
        # create a cursor
        cur = conn.cursor()
        # execute a statement
        print('PostgreSQL database version:')
        cur.execute('SELECT version()')
        # display the PostgreSQL database server version
        db_version = cur.fetchone()
        print(db_version)
        # close the communication with the PostgreSQL
        cur.close()
    except (Exception, psycopg.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
            print('Database connection closed.')

if __name__ == '__main__':
    connect()
Now this runs and prints out the Postgresql database version which is all well & great but I'm struggling to figure out how to change the code so that it's more generalized and maybe just creates a cursor?
I need the connect function to basically just connect to the dvdrental database and create a cursor so that I can then use my connection to select from the database in other needed "tasks" -- for example I'd like to be able to create another function like the below:
def select_from_table(cursor, table_name, schema):
    cursor.execute(f"SET search_path TO {schema}, public;")
    results = cursor.execute(f"SELECT * FROM {table_name};").fetchall()
    return results
but I'm struggling with how to just create a connection to the dvdrental database & a cursor so that I'm able to actually fetch data and create pandas tables with it and whatnot.
so it would be like
task 1 is connecting to the database
task 2 is interacting with the database (selecting tables and whatnot)
task 3 is converting the result from 2 into a pandas df
thanks so much for any help!! This is for a project in a class I am taking and I am extremely overwhelmed and have been googling-researching non-stop and seemingly end up nowhere fast.
The fact that you established the connection is honestly the hardest step. I know it can be overwhelming but you're on the right track.
Just copy these three lines from connect into the select_from_table method
params = config()
conn = psycopg.connect(**params)
cursor = conn.cursor()
It will look like this (also added conn.close() at the end):
def select_from_table(table_name, schema):
    # the function now opens its own connection, so no cursor argument is needed
    params = config()
    conn = psycopg.connect(**params)
    cursor = conn.cursor()
    cursor.execute(f"SET search_path TO {schema}, public;")
    results = cursor.execute(f"SELECT * FROM {table_name};").fetchall()
    conn.close()
    return results
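If you would rather keep "connect" and "select" as separate steps, one possible shape is a small context manager that hands out a cursor and always closes the connection afterwards. This is only a sketch: it uses sqlite3 from the standard library as a stand-in for psycopg so it runs anywhere, and open_cursor plus the film table contents are made up for the example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def open_cursor(database):
    """Yield a cursor; commit on success and always close the connection."""
    conn = sqlite3.connect(database)
    try:
        yield conn.cursor()
        conn.commit()
    finally:
        conn.close()


def select_from_table(cursor, table_name):
    # Table names cannot be bound as query parameters, so only pass trusted names.
    return cursor.execute(f"SELECT * FROM {table_name}").fetchall()


with open_cursor(":memory:") as cur:
    # Hypothetical sample data standing in for the real database.
    cur.execute("CREATE TABLE film (film_id INTEGER, title TEXT)")
    cur.executemany("INSERT INTO film VALUES (?, ?)", [(1, "Alien"), (2, "Brazil")])
    films = select_from_table(cur, "film")

print(films)
```

The list of rows in films can then be handed to pd.DataFrame in a third step, after the connection is already closed.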
I have a Streamlit app that is connected to a SQL Server database. I tried to create a function that connects the app to the database, but the app crashes and displays the error below:
name con is not defined
Code:
@st.cache(allow_output_mutation=True)  # this is changed to st.experimental_singleton
def connect_db():
    try:
        con = pyodbc.connect(
            driver='ODBC DRIVER 17 FOR SQL SERVER',
            Server='localhost',
            DATABASE='test_db',
            UID='test',
            PWD='test',
        )
        cursor = con.cursor()
        df = pd.read_sql_query('select * from test_db', con)
        data = df
    except Exception as e:
        st.write("error is :{}".format(e))
    return data

def main():
    # call connect_db in order to use its parameters in later queries
    connect_db()
Based on the answer of @InsertCheesyLine, I added this decorator to the function:
@st.experimental_singleton
You should use st.experimental_singleton for database connectors instead.
From the docs - Each singleton object is shared across all users connected to the app. Singleton objects must be thread-safe, because they can be accessed from multiple threads concurrently.
st.cache, on the other hand, works like this: the first time the function is called, Streamlit runs it and stores the result in a local cache. The next time the function is called, if nothing the function depends on has changed, Streamlit knows it can skip executing the function altogether. Instead, it just reads the output from the local cache.
I solved this issue: I just had to return the con variable instead of the data variable, and assign the result of connect_db() to a con variable.
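Roughly, the fix has this shape. The sketch below is not Streamlit code: it assumes functools.lru_cache as a stand-in for the singleton decorator and sqlite3 as a stand-in for pyodbc, purely so it can run on its own.

```python
import sqlite3
from functools import lru_cache


@lru_cache(maxsize=None)  # stand-in for the singleton decorator
def connect_db(database=":memory:"):
    """Create the connection once and hand back the same object afterwards."""
    return sqlite3.connect(database)


def main():
    con = connect_db()  # keep the returned connection in a variable
    return con.execute("SELECT 1 + 1").fetchone()[0]


result = main()
same_object = connect_db() is connect_db()
print(result, same_object)
```

Because connect_db returns the connection, con is defined wherever the function is called, which is what the NameError was complaining about.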
I have a python application that is reading from mysql/mariadb, uses that to fetch data from an api and then inserts results into another table.
I had setup a module with a function to connect to the database and return the connection object that is passed to other functions/modules. However, I believe this might not be a correct approach. The idea was to have a small module that I could just call whenever I needed to connect to the db.
Also note, that I am using the same connection object during loops (and within the loop passing to the db_update module) and call close() when all is done.
I am also getting some warnings from the db sometimes, those mostly happen at the point where I call db_conn.close(), so I guess I am not handling the connection or session/engine correctly. Also, the connection id's in the log warning keep increasing, so that is another hint, that I am doing it wrong.
[Warning] Aborted connection 351 to db: 'some_db' user: 'some_user' host: '172.28.0.3' (Got an error reading communication packets)
Here is some pseudo code that represents the structure I currently have:
################
## db_connect.py
################
# imports ...
from sqlalchemy import create_engine

def db_connect():
    # get env ...
    db_string = f"mysql+pymysql://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}"
    try:
        engine = create_engine(db_string)
    except Exception as e:
        return None
    db_conn = engine.connect()
    return db_conn

################
## db_update.py
################
# imports ...
def db_insert(db_conn, api_result):
    # ...
    ins_qry = "INSERT INTO target_table (attr_a, attr_b) VALUES (:a, :b);"
    ins_qry = text(ins_qry)
    ins_qry = ins_qry.bindparams(a = value_a, b = value_b)
    try:
        db_conn.execute(ins_qry)
    except Exception as e:
        print(e)
        return None
    return True

################
## main.py
################
from sqlalchemy import text
from db_connect import db_connect
from db_update import db_insert

def run():
    try:
        db_conn = db_connect()
        if not db_conn:
            return False
    except Exception as e:
        print(e)
    qry = """SELECT *
             FROM some_table
             WHERE some_attr IN (:some_value);"""
    qry = text(qry)
    search_run_qry = qry.bindparams(
        some_value = 'abc'
    )
    # execute the statement that actually has the parameter bound
    result_list = db_conn.execute(search_run_qry).fetchall()
    for result_item in result_list:
        ## do stuff like fetching data from api for every record in the query result
        api_result = get_api_data(...)
        ## insert into db:
        db_ins_status = db_insert(db_conn, api_result)
        ## ...
    db_conn.close()

run()
EDIT: Another question:
a) Is it ok in a loop, that does an update on every iteration to use the same connection, or would it be wiser to instead pass the engine to the run() function and call db_conn = engine.connect() and db_conn.close() just before and after each update?
b) I am thinking about using ThreadPoolExecutor instead of the loop for the API calls. Would this have implications on how to use the connection, i.e. can I use the same connection for multiple threads that are doing updates to the same table?
Note: I am not using the ORM feature mostly because I have a strong DWH/SQL background (though not so much as DBA) and I am used to writing even complex sql queries. I am thinking about switching to just using PyMySQL connector for that reason.
Thanks in advance!
Yes, you can return the connection object or pass it as a parameter, but what is the purpose of the db_connect method other than testing the connection? As far as I can see it serves no other purpose, so I would recommend doing it the way I have done it before.
I would like to share a code snippet from one of my projects.
def create_record(sql_query: str, data: tuple):
    try:
        connection = mysql_obj.connect()
        db_cursor = connection.cursor()
        db_cursor.execute(sql_query, data)
        connection.commit()
        return db_cursor, connection
    except Exception as error:
        print(f'Connection failed error message: {error}')
and then I use a similar method for another need of mine:
db_cursor, connection, query_data = fetch_data(sql_query, query_data)
and once I am done, I close the connection with this method:
def close_connection(connection, db_cursor):
    """
    This method used to close SQL server connection
    """
    db_cursor.close()
    connection.close()
and the call:
close_connection(connection, db_cursor)
I am not sure whether I can share my GitHub, but please check this link. In model.py you can see the database methods, and main.py shows how they are called.
Best,
Hasan.
I have written a Python tool with a wxPython GUI whose main task is to collect a lot of user input (customer data, product data and so on) and save it to a SQL database. At the moment it uses a local SQLite3 database for testing, and I am now switching to MS Azure so that everybody can work in the same database.
As I now plan to use an MS Azure SQL DB, I have a few questions and I am hoping this is the right place to ask:
What is the best library to connect to Azure via Python? I found pyodbc and pymssql, but I think both need an extra driver installed? Is this true, and is this a problem in real use cases?
I have many modules, like Manage_Customer.py and Manage_Factory.py and so on. In all of them I connect to my database. I have no module that acts as a kind of SQL master and handles some of the overhead.
So my code looks like this most of the time:
import wx
import sqlite3

SQL_PATH = "Database_Test.db"

class ManageCustomerToDB(wx.Dialog):
    def __init__(self, *args, **kw):
        super(ManageCustomerToDB, self).__init__(*args, **kw)

    def InitUI(self):
        # [GUI and so on...]
        # I do this one time inside a module:
        conn = sqlite3.connect(SQL_PATH)
        self.c = conn.cursor()

    # Use functions like the ones below...
    def GetCustomerData(self):
        self.c.execute("SELECT * FROM Customer WHERE CustomerID = ?", (self.tc_customer_id.GetValue(),))
        customer_data = self.c.fetchall()
        # Do something with Customer Data

    def GetPersonData(self):
        self.c.execute("SELECT * FROM Person WHERE PersonID = ?", (self.tc_person_id.GetValue(),))
        person_data = self.c.fetchall()
        # Do something with Person Data
I hope this example shows what I do. Am I making any bigger mistakes?
After a read in SQL, do I not have to close the DB in any way?
Thanks for your help, and let me know if I can improve my question or give more details.
It is not a good idea to create a new connection to Azure SQL for every CRUD operation. This wastes resources, and once the number of accesses gets large enough it will noticeably hurt the performance of the SQL Server instance.
I suggest you use a database connection pool. The pool manager initializes several connections to the SQL Server instance and then reuses them when requested.
There is an existing package you can take advantage of: DBUtils. You can use its PooledDB class together with pyodbc.
A sample showing how a database connection pool works:
import pyodbc
# DBUtils >= 2.0; in older releases this was: from DBUtils.PooledDB import PooledDB
from dbutils.pooled_db import PooledDB

class Database:
    def __init__(self, server, driver, port, database, username, password):
        self.server = server
        self.driver = driver
        self.port = port
        self.database = database
        self.username = username
        self.password = password
        self._CreatePool()

    def _CreatePool(self):
        # extra keyword arguments are passed on to pyodbc.connect()
        self.Pool = PooledDB(creator=pyodbc, mincached=2, maxcached=5,
                             maxshared=3, maxconnections=6, blocking=True,
                             DRIVER=self.driver, SERVER=self.server,
                             PORT=self.port, DATABASE=self.database,
                             UID=self.username, PWD=self.password)

    def _Getconnect(self):
        self.conn = self.Pool.connection()
        cur = self.conn.cursor()
        if not cur:
            # raising a plain string is not valid in Python 3
            raise RuntimeError("connection error")
        else:
            return cur

    # query sql
    def ExecQuery(self, sql):
        cur = self._Getconnect()
        cur.execute(sql)
        relist = cur.fetchall()
        cur.close()
        self.conn.close()  # returns the connection to the pool
        return relist

    # non-query sql
    def ExecNoQuery(self, sql):
        cur = self._Getconnect()
        cur.execute(sql)
        self.conn.commit()
        cur.close()
        self.conn.close()  # returns the connection to the pool

def main():
    server = 'jackdemo.database.windows.net'
    database = 'jackdemo'
    username = 'jack'
    port = 1433
    password = '*********'
    driver = '{ODBC Driver 17 for SQL Server}'
    ms = Database(server=server, driver=driver, port=port, database=database, username=username, password=password)
    resList = ms.ExecQuery("select * from Users")
    print(resList)

if __name__ == '__main__':
    main()
Answers to your questions:
Q1: What is the best library to connect to Azure via Python? I found pyodbc and pymssql, but I think both need an extra driver installed? Is this true, and is this a problem in real use cases?
Answer: Both of them would be OK. ODBC stands for Open Database Connectivity, so it can be used to connect to many databases. I see the Microsoft tutorial uses pyodbc, so it may be the better choice.
Q2: I have many modules, like Manage_Customer.py and Manage_Factory.py and so on. In all of them I connect to my database. I have no module that acts as a kind of SQL master and handles some of the overhead.
Answer: Use a database connection pool.
Q3: After a read in SQL, do I not have to close the DB in any way?
Answer: If you use a database connection pool, the connection is put back into the pool when you call its close() method.
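To make the "put back into the pool" part concrete, here is a toy pool built only on the standard library. It is a sketch of the idea and not how DBUtils is implemented: SimplePool and its methods are made up for this example, and sqlite3 stands in for pyodbc.

```python
import queue
import sqlite3


class SimplePool:
    """Toy pool: a fixed set of connections handed out and taken back."""

    def __init__(self, database, size=2):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    def acquire(self):
        return self._idle.get()  # blocks while every connection is in use

    def release(self, conn):
        self._idle.put(conn)  # the connection stays open for the next caller

    def close_all(self):
        while not self._idle.empty():
            self._idle.get_nowait().close()


pool = SimplePool(":memory:", size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)        # "closing" just puts the connection back
c = pool.acquire()     # the only idle connection is `a`, so it is reused
reused = c is a
value = c.execute("SELECT 40 + 2").fetchone()[0]
pool.release(b)
pool.release(c)
pool.close_all()
print(reused, value)
```

A real pool such as PooledDB adds health checks, limits and thread handling on top of this, which is why using the existing package is preferable to rolling your own.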
My problem is that I am facing
ProgrammingError: copy_from cannot be used with an asynchronous callback.
while trying to use copy_from on a connection that is not async. I should mention that I am creating the connection from a celery task. Can someone give me a clue how sqlalchemy or celery or whatever forces my psycopg2 connection to behave like an async one?
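import psycopg2
from io import StringIO

conn = psycopg2.connect(con_string)
print(conn.async_)  # 0 (spelled conn.async before async became a Python keyword)
cur = conn.cursor()
data = StringIO()
# \\N is the COPY marker for NULL; the backslash must be escaped in Python 3
data.write('\n'.join(['Tom\tJenkins\t37',
                      'Madonna\t\\N\t45',
                      'Federico\tDi Gregorio\t\\N']))
data.seek(0)
cur.copy_from(data, 'test_copy')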
We were facing this error in pgcli; in that case it turned out that a wait_callback was making the connection behave as though it were asynchronous from psycopg2's point of view. This helped:
from contextlib import contextmanager
import psycopg2.extensions

@contextmanager
def _paused_thread():
    try:
        # remember the current wait callback and disable it temporarily
        thread = psycopg2.extensions.get_wait_callback()
        psycopg2.extensions.set_wait_callback(None)
        yield
    finally:
        # restore the original wait callback
        psycopg2.extensions.set_wait_callback(thread)

with _paused_thread():
    cursor.copy_expert('copy mytable to STDOUT', file)