How to pass psycopg2 cursor object to foreachPartition() in pyspark? - python

I'm getting the following error:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'psycopg2.extensions.cursor' object
PicklingError: Could not serialize object: TypeError: cannot pickle 'psycopg2.extensions.cursor' object
while running the below script
def get_connection():
    conn_props = brConnect.value
    print(conn_props)
    # extract values from the broadcast variable
    database = conn_props.get("database")
    user = conn_props.get("user")
    pwd = conn_props.get("password")
    host = conn_props.get("host")
    db_conn = psycopg2.connect(
        host=host,
        user=user,
        password=pwd,
        database=database,
        port=5432
    )
    return db_conn
def process_partition_up(partition, db_cur):
    updated_rows = 0
    try:
        for row in partition:
            process_row(row, myq, db_cur)
    except Exception as e:
        print("Not connected")
    return updated_rows
def update_final(df, db_cur):
    df.rdd.coalesce(2).foreachPartition(lambda x: process_partition_up(x, db_cur))
def etl_process():
    for id in ['003']:
        conn = get_connection()
        for t in ['email_table']:
            query = f'''(select * from public.{t} where id= '{id}') as tab'''
            df_updated = load_data(query)
            if df_updated.count() > 0:
                q1 = insert_ops(df_updated, t)  # assume this function returns an insert query
                query_props = q1
                sc = spark.sparkContext
                brConnectQ = sc.broadcast(query_props)
                db_conn = get_connection()
                db_cur = db_conn.cursor()
                update_final(df_updated, db_cur)
                conn.commit()
        conn.close()
Explanation:
Here etl_process() internally calls get_connection(), which returns a psycopg2 connection object. After that it calls update_final(), which takes the dataframe and the psycopg2 cursor object as arguments.
update_final() then calls process_partition_up() on each partition (df.rdd.coalesce(2).foreachPartition), passing it the partition and the psycopg2 cursor object.
But after passing the psycopg2 cursor object to process_partition_up(), I don't get a cursor object on the other side; instead I get the error above.
Can anyone help me out to resolve this error?
Thank you.

I think that you don't understand what's happening here.
You are creating a database connection in your driver (etl_process), and then trying to ship that live connection from the driver, across your network, to an executor to do the work (your lambda in foreachPartition is executed on the executor).
That is what Spark is telling you with "cannot pickle 'psycopg2.extensions.cursor'": it can't serialize your live database connection to ship it to an executor.
You need to call conn = get_connection() from inside process_partition_up; this initializes the connection to the database from inside the executor (along with any other bookkeeping you need to do).
FYI: the worst part, which I want to call out, is that this code will work on your local machine, because there the same process acts as both the executor and the driver.
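The distinction can be sketched without Spark at all: plain connection parameters pickle fine and are safe to broadcast, while a live DB-API connection does not. In this sketch sqlite3 stands in for psycopg2, and the partition-processing body is stubbed out:

```python
import pickle
import sqlite3

# Plain connection parameters are picklable, so they are safe to broadcast
# from the driver to the executors (sqlite3 stands in for psycopg2 here).
conn_params = {"database": ":memory:"}
pickle.dumps(conn_params)  # fine

# A live DB-API connection wraps sockets and C-level handles; pickling it
# fails, which is exactly the error Spark reports for the psycopg2 cursor.
try:
    pickle.dumps(sqlite3.connect(":memory:"))
except TypeError as exc:
    print(exc)

def process_partition_up(partition):
    # Open the connection *inside* the function that runs on the executor,
    # using only the picklable parameters captured from the driver.
    db_conn = sqlite3.connect(conn_params["database"])
    db_cur = db_conn.cursor()
    for row in partition:
        pass  # the real per-row work (process_row) would go here
    db_conn.commit()
    db_conn.close()

process_partition_up(iter([("a",), ("b",)]))  # runs locally for illustration
```

Because process_partition_up now closes only over picklable data, cloudpickle can ship it to the executors without error.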

Related

Error: 'Authentication plugin 'caching_sha2_password' is not supported when running executable of Python Module using pyinstaller

I'm attempting to run an executable, "main.exe", that was built from three python modules (with "main.py" being the main script) using the pyinstaller module. The command that was used to build an executable from the scripts is,
pyinstaller --onefile main.py
This script invokes functions from the "tictactoe_office_release.py" script which establishes connection to a MySQL 8.0.31 server database for performing CRUD operations. When running the executable from the command line, I receive the following string of errors:
Error: 'Authentication plugin 'caching_sha2_password' is not supported'
Traceback (most recent call last):
File "main.py", line 24, in <module>
File "tictactoe_office_release.py", line 42, in __init__
File "mysql_python_lib.py", line 124, in __init__
File "mysql_python_lib.py", line 96, in read_query
AttributeError: 'NoneType' object has no attribute 'cursor'
[25743] Failed to execute script 'main' due to unhandled exception!
It is important to note, however, that my main.py script executes without errors when run outside of the executable. Now, I have troubleshot the errors using numerous comments from Authentication Plugin 'caching_sha2_password' is not supported, including the following:
1) uninstalling 'mysql-connector' and installing 'mysql-connector-python'
2) setting the 'auth_plugin' parameter to 'mysql_native_password' in the 'mysql.connector.connect()' function calls
3) modifying the MySQL encryption by running
ALTER USER 'root'@'localhost' IDENTIFIED WITH caching_sha2_password BY 'Panther021698';
but I am receiving the same error after I re-build and run the executable.
The relevant code in my "tictactoe_office_release.py" module that depicts the function definitions for enabling communication between the Python interpreter and the MySQL server, and database is provided below:
from distutils.util import execute
import mysql.connector
from mysql.connector import Error
from mysql.connector.locales.eng import client_error

class mysql_python_connection:
    ''' Provide class definition for creating connection to MySQL server,
    initializing database, and executing queries '''

    def __init__(self):
        self.host_name = "localhost"
        self.user_name = "root"
        self.passwd = "Panther021698"

    def create_server_connection(self):
        ''' This function establishes a connection between the Python
        interpreter and the MySQL Community Server that we are attempting
        to connect to '''
        self.connection = None  # Close any existing connections
        try:
            self.connection = mysql.connector.connect(
                host = self.host_name,
                user = self.user_name,
                passwd = self.passwd
            )
            print("MySQL connection successful")
        except Error as err:
            print(f"Error: '{err}'")

    def create_database(self, query):
        ''' This function initializes a new database
        on the connected MySQL server '''
        cursor = self.connection.cursor()
        try:
            cursor.execute(query)
            print("Database created successfully")
        except Error as err:
            print(f"Error: '{err}'")

    def create_db_connection(self, db_name):
        ''' This function establishes a connection between
        the MySQL Community Server and a database that we
        are initializing on the server '''
        self.connection = None  # Close any existing connections
        self.db_name = db_name
        try:
            self.connection = mysql.connector.connect(
                host = self.host_name,
                user = self.user_name,
                passwd = self.passwd,
                database = self.db_name
            )
            print("MySQL Database connection successful")
        except Error as err:
            print(f"Error: '{err}'")

    def execute_query(self, query):
        ''' This function takes SQL queries stored
        in Python as strings and passes them
        to the "cursor.execute()" method to
        execute them on the server '''
        cursor = self.connection.cursor()
        try:
            cursor.execute(query)
            self.connection.commit()  # Implements commands detailed in SQL queries
            print(query + " Query successful")
        except Error as err:
            print(f"Error: '{err}'")

    def read_query(self, query):
        ''' This function reads and returns data from
        a MySQL database using the specified query '''
        cursor = self.connection.cursor()
        print("cursor datatype is ")
        print(type(cursor))
        #result = None
        try:
            cursor.execute(query)
            result = cursor.fetchall()
            return result
        except Error as err:
            print(f"Error: '{err}'")
Additionally, my MySQL environment variables are provided in the image below.

Why do I need to renew the connection to the base?

I'm writing a program that, with the help of pyodbc, connects to the database several times and performs selects.
Unfortunately, I have to re-establish the connection before calling each of my methods.
Why doesn't a single connection work across the methods?
# create object (connect to DB)
conn = db.db_connect()
# Call method with my select
weak_password_list = db.Find_LoginsWithWeakPassword(conn)

# I need to connect again
conn = db.db_connect()
# Call method with my select
logins_with_expired_password = db.LoginsWithExpiredPassword(conn)

# And again...
conn = db.db_connect()
# Call method with my select
logins_with_expiring_password = db.Find_LoginsWithExpiringPassword(conn)

######################################################

def db_connect(self):
    try:
        conn = pyodbc.connect('Driver={SQL Server};'
                              'Server=' + self.server_name + ';'
                              'Database=' + self.database_name + ';'
                              'Trusted_Connection=' + self.trusted_connection + ';')
    except Exception as e:
        conn = ""
        self.print_error("Failed to connect to the database.", e)
    return conn
############################
def Find_LoginsWithWeakPassword(self, conn):
    try:
        cursor = conn.cursor()
        query_result = cursor.execute('''SELECT * FROM table_name''')
    except Exception as e:
        query_result = ""
        self.print_error("Select failed in Find_LoginsWithWeakPassword", e)
    return query_result
If I only connect once, the second and subsequent select methods have no effect.
Why?
When you call
weak_password_list = db.Find_LoginsWithWeakPassword(conn)
the function returns the pyodbc Cursor object returned by .execute():
<pyodbc.Cursor object at 0x012F4F60>
You are not calling .fetchall() (or similar) on it, so the connection has an open cursor with unconsumed results. If you do your next call
logins_with_expired_password = db.LoginsWithExpiredPassword(conn)
without first (implicitly) closing the existing connection by clobbering it, then .execute() will fail with
('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
TL;DR: Consume your result sets before calling another function, either by having the functions themselves call .fetchall() or by calling .fetchall() on the cursor objects that they return.
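A minimal sketch of the fetch-inside-the-function pattern, shown with sqlite3 so it runs anywhere (with pyodbc the shape is identical; the logins table and its columns are made up for illustration):

```python
import sqlite3

def find_logins_with_weak_password(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM logins WHERE weak = 1")
    rows = cursor.fetchall()  # consume the result set before returning
    cursor.close()            # the connection is now free for the next query
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (name TEXT, weak INTEGER)")
conn.executemany("INSERT INTO logins VALUES (?, ?)",
                 [("alice", 1), ("bob", 0)])

# The same connection now serves call after call, with no reconnecting.
print(find_logins_with_weak_password(conn))  # [('alice',)]
print(find_logins_with_weak_password(conn))  # [('alice',)]
```

Because each function fully drains and closes its cursor, the connection never ends up "busy with results for another hstmt".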

Sqlite - Cannot operate on a closed database

I am trying to insert a small set of rows into sqlite using python and getting an error "Cannot operate on a closed database"
This is my code snippet:
import sqlite3
from sqlite3 import Error

db_file = "/home/sosuser/mediaserver/camera.db"

def create_connection(db_file):
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        print(sqlite3.version)
    except Error as e:
        print(e)
    finally:
        if conn:
            conn.close()
    return conn

def create_task(conn, task):
    sql = ''' INSERT INTO camerainfo(id, cameraid, maplat, maplong, name)
              VALUES(?,?,?,?,?) '''
    cur = conn.cursor()
    cur.execute(sql, task)

def prepare_data(conn):
    for cam in range(len(camID)):
        print(camID[cam])
        task = (cam, camID[cam], '12.972442', '77.580643', 'testCAM')
        create_task(conn, task)
    conn.commit()
    conn.close()

conn = create_connection(db_file)
prepare_data(conn)
Get the following error -
Traceback (most recent call last):
File "dumpCamera1.py", line 92, in <module>
prepare_data(conn)
File "dumpCamera1.py", line 86, in prepare_data
create_task(conn, task)
File "dumpCamera1.py", line 79, in create_task
cur = conn.cursor()
sqlite3.ProgrammingError: Cannot operate on a closed database.
Not sure where my connection is being closed. I might have done something very silly, but I would appreciate any pointers.
Thanks.
The finally clause in the create_connection function closes the connection before it's returned.
It looks as if you are trying to build a kind of context manager for the connection. An sqlite3 Connection is already usable as a context manager, but note that it manages transactions, not the connection itself: on leaving the with block it commits (or rolls back if an exception was raised); it does not close the connection.
You can do
with sqlite3.connect(db_file) as conn:
    print(sqlite3.version)
    prepare_data(conn)
The transaction is committed automatically on exiting the context manager (and your prepare_data already calls conn.close()). You can trap errors raised inside the context manager by wrapping it in a try / except block.
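Putting it together, here is a corrected sketch. It uses an in-memory database and a made-up camera row in place of the asker's camID data, and contextlib.closing so the connection is guaranteed to be closed as well as committed:

```python
import sqlite3
from contextlib import closing

def create_connection(db_file):
    # Return the open connection; do NOT close it here.
    return sqlite3.connect(db_file)

def create_task(conn, task):
    sql = '''INSERT INTO camerainfo(id, cameraid, maplat, maplong, name)
             VALUES(?,?,?,?,?)'''
    conn.execute(sql, task)

# closing() guarantees conn.close() at the end; the inner `with conn`
# block commits on success and rolls back if an exception escapes.
with closing(create_connection(":memory:")) as conn:
    conn.execute("CREATE TABLE camerainfo (id, cameraid, maplat, maplong, name)")
    with conn:
        create_task(conn, (0, "cam0", "12.972442", "77.580643", "testCAM"))
    print(conn.execute("SELECT count(*) FROM camerainfo").fetchone()[0])  # 1
```

Separating "commit the transaction" (with conn) from "close the connection" (closing) keeps both lifetimes explicit.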

How to set a query timeout in sqlalchemy using Oracle database?

I want to set a query timeout in sqlalchemy. I have an Oracle database.
I have tried the following code:
import sqlalchemy
engine = sqlalchemy.create_engine('oracle://db', connect_args={'querytimeout': 10})
I got the following error:
TypeError: 'querytimeout' is an invalid keyword argument for this function
I would like a solution looking like:
connection.execute('query').set_timeout(10)
Maybe it is possible to set the timeout in the SQL query itself? I found how to do it in PL/SQL, but I need plain SQL.
How can I set a query timeout?
The only way you can set a connection timeout for the Oracle engine from SQLAlchemy is to create and configure a sqlnet.ora file.
Linux
Create a sqlnet.ora file in the folder
/opt/oracle/instantclient_19_9/network/admin
Windows
For Windows, create the \network\admin folder under the Instant Client directory, e.g.
C:\oracle\instantclient_19_9\network\admin
Example sqlnet.ora file
SQLNET.INBOUND.CONNECT_TIMEOUT = 120
SQLNET.SEND_TIMEOUT = 120
SQLNET.RECV_TIMEOUT = 120
You can find more parameters here: https://docs.oracle.com/cd/E11882_01/network.112/e10835/sqlnet.htm
The way to do it in Oracle is via the Resource Manager.
timeout decorator
Get your session handle as you normally would. (Notice that the session has not actually connected yet.) Then, test the session in a function that is decorated with wrapt_timeout_decorator.timeout.
#!/usr/bin/env python3
from time import time
from cx_Oracle import makedsn
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import text
from wrapt_timeout_decorator import timeout

class ConnectionTimedOut(Exception):
    pass

class Blog:
    def __init__(self):
        self.port = None

    def connect(self, connection_timeout):
        @timeout(connection_timeout, timeout_exception=ConnectionTimedOut)
        def test_session(session):
            session.execute(text('select dummy from dual'))
        session = sessionmaker(bind=self.engine())()
        test_session(session)
        return session

    def engine(self):
        return create_engine(
            self.connection_string(),
            max_identifier_length=128
        )

    def connection_string(self):
        driver = 'oracle'
        username = 'USR'
        password = 'solarwinds123'
        return '%s://%s:%s@%s' % (
            driver,
            username,
            password,
            self.dsn()
        )

    def dsn(self):
        host = 'hn.com'
        dbname = 'ORCL'
        print('port: %s expected: %s' % (
            self.port,
            'success' if self.port == 1530 else 'timeout'
        ))
        return makedsn(host, self.port, dbname)

    def run(self):
        self.port = 1530
        session = self.connect(connection_timeout=4)
        for r in session.execute(text('select status from v$instance')):
            print(r.status)
        self.port = 1520
        session = self.connect(connection_timeout=4)
        for r in session.execute(text('select status from v$instance')):
            print(r.status)

if __name__ == '__main__':
    Blog().run()
In this example, the network is firewalled with port 1530 open. Port 1520 is blocked and leads to a TCP connection timeout. Output:
port: 1530 expected: success
OPEN
port: 1520 expected: timeout
Traceback (most recent call last):
File "./blog.py", line 68, in <module>
Blog().run()
File "./blog.py", line 62, in run
session = self.connect(connection_timeout=4)
File "./blog.py", line 27, in connect
test_session(session)
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrapt_timeout_decorator.py", line 123, in wrapper
return wrapped_with_timeout(wrap_helper)
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrapt_timeout_decorator.py", line 131, in wrapped_with_timeout
return wrapped_with_timeout_process(wrap_helper)
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrapt_timeout_decorator.py", line 145, in wrapped_with_timeout_process
return timeout_wrapper()
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrap_function_multiprocess.py", line 43, in __call__
self.cancel()
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrap_function_multiprocess.py", line 51, in cancel
raise_exception(self.wrap_helper.timeout_exception, self.wrap_helper.exception_message)
File "/home/exagriddba/lib/python3.8/site-packages/wrapt_timeout_decorator/wrap_helper.py", line 178, in raise_exception
raise exception(exception_message)
__main__.ConnectionTimedOut: Function test_session timed out after 4.0 seconds
Caution
Do not decorate the function that calls sessionmaker, or you will get:
_pickle.PicklingError: Can't pickle <class 'sqlalchemy.orm.session.Session'>: it's not the same object as sqlalchemy.orm.session.Session
SCAN
This implementation is a "connection timeout" without regard to underlying cause. The client could time out before trying all available SCAN listeners.
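If pulling in wrapt_timeout_decorator is undesirable, the same bound-the-wait idea can be sketched with the standard library alone. One caveat: a thread-based wrapper abandons the hung call rather than cancelling it, so it only limits how long the caller waits. The helper name run_with_timeout below is made up for this sketch:

```python
import concurrent.futures as cf
import time

class ConnectionTimedOut(Exception):
    pass

def run_with_timeout(fn, timeout, *args, **kwargs):
    # Run fn in a worker thread; give up waiting after `timeout` seconds.
    # The worker thread itself is abandoned, not killed.
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args, **kwargs)
        try:
            return future.result(timeout=timeout)
        except cf.TimeoutError:
            raise ConnectionTimedOut(f"timed out after {timeout}s")
    finally:
        pool.shutdown(wait=False)

print(run_with_timeout(lambda: "OPEN", 1.0))  # returns promptly
try:
    run_with_timeout(time.sleep, 0.2, 2)      # simulates a hung connect
except ConnectionTimedOut as exc:
    print(exc)
```

In practice fn would be the session-testing call (e.g. executing `select dummy from dual`); the wrapper raises ConnectionTimedOut instead of blocking on an unresponsive listener.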

sqlobject: No connection has been defined for this thread or process

I'm using sqlobject in Python. I connect to the database with
conn = connectionForURI(connStr)
conn.makeConnection()
This succeeds, and I can do queries on the connection:
g_conn = conn.getConnection()
cur = g_conn.cursor()
cur.execute(query)
res = cur.fetchall()
This works as intended. However, I also defined some classes, e.g:
class User(SQLObject):
    class sqlmeta:
        table = "gui_user"
    username = StringCol(length=16, alternateID=True)
    password = StringCol(length=16)
    balance = FloatCol(default=0)
When I try to do a query using the class:
User.selectBy(username="foo")
I get an exception:
...
File "c:\python25\lib\site-packages\SQLObject-0.12.4-py2.5.egg\sqlobject\main.py", line 1371, in selectBy
conn = connection or cls._connection
File "c:\python25\lib\site-packages\SQLObject-0.12.4-py2.5.egg\sqlobject\dbconnection.py", line 837, in __get__
return self.getConnection()
File "c:\python25\lib\site-packages\SQLObject-0.12.4-py2.5.egg\sqlobject\dbconnection.py", line 850, in getConnection
"No connection has been defined for this thread "
AttributeError: No connection has been defined for this thread or process
How do I define a connection for a thread? I just realized I can pass a connection keyword argument (giving it conn) to make it work, but how do I get it to work without doing that?
Do:
from sqlobject import sqlhub, connectionForURI
sqlhub.processConnection = connectionForURI(connStr)
