I am currently querying data into a dataframe via the pandas.io.sql.read_sql() command. I wanted to parallelize the calls, similar to what this talk advocates: Embarrassingly parallel database calls with Python (PyData Paris 2015).
Something like (very general):
pools = [ThreadedConnectionPool(1,20,dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)
Is something like that possible?
Yes, this should work, although with the caveat that you'll need to change parallel_connection.py from the talk that you cite. In that code there's a fetchall function which executes each of the cursors in parallel, then combines the results. This is the core of what you'll change:
Old Code:
def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))
New Code:
def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  # store a tuple of (sql, conn) instead of a cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    return pd.concat(results)  # DataFrame.append was removed in pandas 2.0; concat does the same job
Repo: https://github.com/godatadriven/ParallelConnection
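If you don't need the full ParallelConnection machinery, a plain thread pool gives a similar effect, since database drivers release the GIL while waiting on I/O. Here is a minimal sketch, assuming you already have a list of (sql, conn) pairs for the shards (the pairs and their setup are placeholders, not code from the talk):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def parallel_read_sql(sql_connections):
    # Run pd.read_sql over each (sql, conn) pair in its own thread,
    # then glue the partial frames together.
    with ThreadPoolExecutor(max_workers=len(sql_connections)) as executor:
        frames = list(executor.map(lambda sc: pd.read_sql(sc[0], sc[1]),
                                   sql_connections))
    return pd.concat(frames, ignore_index=True)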
I have this code segment in Python2:
def super_cool_method():
    con = psycopg2.connect(**connection_stuff)
    cur = con.cursor(cursor_factory=DictCursor)
    cur.execute("Super duper SQL query")
    rows = cur.fetchall()
    for row in rows:
        # do some data manipulation on row
        pass
    return rows
that I'd like to write some unit tests for. I'm wondering how to use mock.patch to patch out the cursor and connection variables so that they return a fake set of data. I've tried the following segment of code for my unit tests, but to no avail:
@mock.patch("psycopg2.connect")
@mock.patch("psycopg2.extensions.cursor.fetchall")
def test_super_awesome_stuff(self, a, b):
    testing = super_cool_method()
But I seem to get the following error:
TypeError: can't set attributes of built-in/extension type 'psycopg2.extensions.cursor'
You have a series of chained calls, each returning a new object. If you mock just the psycopg2.connect() call, you can follow that chain of calls (each producing mock objects) via .return_value attributes, which reference the returned mock for such calls:
@mock.patch("psycopg2.connect")
def test_super_awesome_stuff(self, mock_connect):
    expected = [['fake', 'row', 1], ['fake', 'row', 2]]
    mock_con = mock_connect.return_value       # result of psycopg2.connect(**connection_stuff)
    mock_cur = mock_con.cursor.return_value    # result of con.cursor(cursor_factory=DictCursor)
    mock_cur.fetchall.return_value = expected  # return this when calling cur.fetchall()
    result = super_cool_method()
    self.assertEqual(result, expected)
Because you hold references to the mock connect function, as well as to the mock connection and cursor objects, you can then also assert that they were called correctly:
mock_connect.assert_called_with(**connection_stuff)
mock_con.cursor.assert_called_with(cursor_factory=DictCursor)
mock_cur.execute.assert_called_with("Super duper SQL query")
If you don't need to test these, you could just chain up the return_value references to go straight to the result of the cursor() call on the connection object:
@mock.patch("psycopg2.connect")
def test_super_awesome_stuff(self, mock_connect):
    expected = [['fake', 'row', 1], ['fake', 'row', 2]]
    mock_connect.return_value.cursor.return_value.fetchall.return_value = expected
    result = super_cool_method()
    self.assertEqual(result, expected)
Note that if you are using the connection as a context manager to automatically commit the transaction, and you use as to bind the object returned by __enter__() to a new name (so with psycopg2.connect(...) as conn: # ...), then you'll need to inject an additional __enter__.return_value into the call chain:
mock_con_cm = mock_connect.return_value # result of psycopg2.connect(**connection_stuff)
mock_con = mock_con_cm.__enter__.return_value # object assigned to con in with ... as con
mock_cur = mock_con.cursor.return_value # result of con.cursor(cursor_factory=DictCursor)
mock_cur.fetchall.return_value = expected # return this when calling cur.fetchall()
The same applies to the result of with conn.cursor() as cursor:, the conn.cursor.return_value.__enter__.return_value object is assigned to the as target.
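Putting the two together, here is a minimal sketch, assuming a variant of super_cool_method that uses both with psycopg2.connect(...) as con: and with con.cursor(...) as cur: (that variant is not shown above):

@mock.patch("psycopg2.connect")
def test_super_cool_method_with_cms(self, mock_connect):
    expected = [['fake', 'row', 1]]
    # object bound by "with psycopg2.connect(...) as con:"
    mock_con = mock_connect.return_value.__enter__.return_value
    # object bound by "with con.cursor(...) as cur:"
    mock_cur = mock_con.cursor.return_value.__enter__.return_value
    mock_cur.fetchall.return_value = expected
    self.assertEqual(super_cool_method(), expected)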
Since the cursor is the return value of con.cursor, you only need to mock the connection, then configure it properly. For example,
query_result = [("field1a", "field2a"), ("field1b", "field2b")]
with mock.patch('psycopg2.connect') as mock_connect:
    # connect() is called, so the cursor hangs off return_value
    mock_connect.return_value.cursor.return_value.fetchall.return_value = query_result
    super_cool_method()
The following answer is a variation of the above answers. I was using the django.db.connections cursor object, so the following code worked for me:
@patch('django.db.connections')
def test_supercool_method(self, mock_connections):
    query_result = [("field1a", "field2a"), ("field1b", "field2b")]
    mock_connections.__getitem__.return_value.cursor.return_value.__enter__.return_value.fetchall.return_value = query_result
    result = supercool_method()
    self.assertIsInstance(result, list)
#patch("psycopg2.connect")
async def test_update_task_after_launch(fake_connection):
"""
"""
fake_update_count =4
fake_connection.return_value = Mock(cursor=lambda : Mock(execute=lambda x,y :"",
fetch_all=lambda:['some','fake','rows'],rowcount=fake_update_count,close=lambda:""))
I've been trying to test the method below, especially the if block, and have tried multiple things like patching and mocking various combinations of pyodbc, but I've not been able to mock the if condition.
def execute_read(self, query):
    dbconn = pyodbc.connect(self.connection_string, convert_unicode=True)
    with dbconn.cursor() as cur:
        cursor = cur.execute(query)
        if not cursor.messages:
            res = cursor.fetchall()
        else:
            raise Exception(cursor.messages[0][1])
    return res
# unit test method
@patch.object(pyodbc, 'connect')
def test_execute_read(self, pyodbc_mock):
    pyodbc_mock.return_value = MagicMock()
    self.assertIsNotNone(execute_read('query'))
I've read the unittest.mock docs, but I haven't found a way to get the above if condition covered. Thank you.
You would want to patch the Connection class (given the Cursor object is immutable) and supply a return value for covering the if block. Something that may look like:
with patch("pyodbc.Connection") as conn:
    conn.cursor().messages = []
    ...
Tried this with sqlite3 and that worked for me.
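A fuller sketch along those lines, patching pyodbc.connect so the if branch can be driven (self.reader stands for whatever object under test carries execute_read and connection_string; the name is illustrative):

@patch("pyodbc.connect")
def test_execute_read_success(self, connect_mock):
    # cursor yielded by "with dbconn.cursor() as cur:"
    mock_cur = connect_mock.return_value.cursor.return_value.__enter__.return_value
    mock_cur.execute.return_value = mock_cur  # pyodbc's execute() returns the cursor itself
    mock_cur.messages = []                    # empty list -> the "if" branch runs
    mock_cur.fetchall.return_value = [("row1",), ("row2",)]
    self.assertEqual(self.reader.execute_read("query"), [("row1",), ("row2",)])

Setting mock_cur.messages to a non-empty list of tuples instead would drive the else branch and let you assert the raised Exception.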
Here's an example of using patch.object, something I wrote for frappe/frappe:
def test_db_update(self):
    with patch.object(Database, "sql") as sql_called:
        frappe.db.set_value(
            self.todo1.doctype,
            self.todo1.name,
            "description",
            f"{self.todo1.description}-edit by `test_for_update`",
        )
        first_query = sql_called.call_args_list[0].args[0]
        second_query = sql_called.call_args_list[1].args[0]
        self.assertTrue(sql_called.call_count == 2)
        self.assertTrue("FOR UPDATE" in first_query)
I want to get all the data in the Cassandra table "user". I have 840,000 users and I don't want to pull them all into one Python list; I want to get the users in packs of 100. In the Cassandra docs (https://datastax.github.io/python-driver/query_paging.html) I see that I can use fetch_size, but in my Python code I have a database object that contains all the CQL instructions:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

class Database:
    def __init__(self):
        self.cluster = Cluster(['192.168.1.1', '192.168.1.2'])
        self.session = self.cluster.connect()

    def get_users(self):
        users_list = []
        query = "SELECT * FROM users"
        statement = SimpleStatement(query, fetch_size=10)
        for user_row in self.session.execute(statement):
            users_list.append(user_row.name)
        return users_list
Currently get_users returns a very big list of user names, but I want to turn get_users into a "generator": instead of getting every user name in one list from a single call, I want to call get_users many times and have each call return a list of at most 100 users. For example:
list1 = database.get_users()
list2 = database.get_users()
...
listn = database.get_users()
list1 contains the first 100 users from the query, list2 the second 100, and listn the last (<=100) elements. Is this possible? Thanks in advance for your answers.
According to Paging Large Queries:
Whenever there are no more rows in the current page, the next page
will be fetched transparently.
So, if you execute your code like this, you will still get the whole result set, but it is fetched page by page in a transparent manner. In order to achieve what you want, you need to use callbacks. You can also find some sample code at the link above. I added the full code below for reference.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from threading import Event

class PagedResultHandler(object):

    def __init__(self, future):
        self.error = None
        self.finished_event = Event()
        self.future = future
        self.future.add_callbacks(
            callback=self.handle_page,
            errback=self.handle_error)

    def handle_page(self, rows):
        for row in rows:
            process_row(row)
        if self.future.has_more_pages:
            self.future.start_fetching_next_page()
        else:
            self.finished_event.set()

    def handle_error(self, exc):
        self.error = exc
        self.finished_event.set()

def process_row(user_row):
    print(user_row.name, user_row.age, user_row.email)

cluster = Cluster()
session = cluster.connect()
query = "SELECT * FROM myschema.users"
statement = SimpleStatement(query, fetch_size=5)
future = session.execute_async(statement)
handler = PagedResultHandler(future)
handler.finished_event.wait()
if handler.error:
    raise handler.error
cluster.shutdown()
Moving to the next page is done in handle_page when start_fetching_next_page is called. If you replace the if statement with self.finished_event.set(), you will see that the iteration stops after the first 5 rows, as defined in fetch_size.
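If you would rather get the names back in packs of 100 per call, as the question asks, one possible sketch (not from the linked docs) is a generator method on the question's Database class that leans on the driver's transparent paging:

def get_users_batched(self, batch_size=100):
    # Yield lists of at most batch_size user names; the driver fetches
    # pages of batch_size rows under the hood.
    statement = SimpleStatement("SELECT * FROM users", fetch_size=batch_size)
    batch = []
    for user_row in self.session.execute(statement):
        batch.append(user_row.name)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial pack (<= batch_size)
        yield batch

Each next() on the generator then plays the role of one get_users() call in the question:

users_pages = database.get_users_batched()
list1 = next(users_pages)  # first 100 names
list2 = next(users_pages)  # second 100 names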
Problem: I'm trying to extract values from a MySQL query to use later within the code. I have set up a MySQL function (connector_mysql) that I use to connect to and run the MySQL commands. The problem I'm having is returning these values back out of the MySQL function to the rest of the code. There are two scenarios that I've tried which I think should work but don't when run. Full code sample below...
1. In the MySQL function, result acts like a dictionary when using just:

result = cursor.fetchone()
return result
And in the main function:
result1 = connector_mysql(subSerialNum, ldev, today_date)
print(result1)
This appears to work, and when printing result1 I get a dictionary-looking output:
{'ldev_cap': '0938376656', 'ldev_usedcap': '90937763873'}
HOWEVER...
I can't then use dictionary methods to get or separate the values out, e.g.

used_capacity = result1['ldev_cap']

which I would have expected to make used_capacity equal 0938376656. Instead I get an error saying the object is not subscriptable:
File "/Users/graham/PycharmProjects/VmExtrat/VmExtract.py", line 160, in openRead
used_capacity = result1['ldev_cap']
TypeError: 'NoneType' object is not subscriptable
2. In the MySQL function, the result acts like a dictionary if I manipulate it and try to return multiple values with the tuple concept, using:
cursor.execute(sql_query, {'serial': subSerialNum, 'lun': ldev, 'todayD': today_date})
result = cursor.fetchone()
while result:
    ldev_cap = result['ldev_cap']
    ldev_usdcap = result['ldev_usdcap']
    return ldev_cap, ldev_usdcap
Here, result acts like a dictionary and I'm able to assign a parameter to the key, like:

ldev_cap = result['ldev_cap']

and if you print ldev_cap you get the correct figure.
If I return a single figure, the main function line

result1 = connector_mysql(subSerialNum, ldev, today_date)

works. HOWEVER...
When I then try to return multiple parameters from the MySQL function by doing

return ldev_cap, ldev_usdcap

and, in the main function:

capacity, usd_capacity = connector_mysql(subSerialNum, ldev, today_date)

I get errors again:
File "/Users/graham/PycharmProjects/VmExtrat/VmExtract.py", line 156, in openRead
capacity, usd_capacity = connector_mysql(subSerialNum, ldev, today_date)
TypeError: 'NoneType' object is not iterable
I think I'm doing the right thing with the dictionary and the tuple, but I'm obviously missing something or not doing it correctly, as I need to do this with 4-5 parameters per SQL query. I didn't want to run multiple queries for the same thing just to get the individual parameters out. Any help or suggestions would be greatly welcomed. Full code below.
main code:
capacity, usd_capacity = connector_mysql(subSerialNum, ldev, today_date)
print(capacity)
print(usd_capacity)
def connector_mysql(subSerialNum, ldev, today_date):
    import pymysql.cursors
    db_server = 'localhost'
    db_name = 'CBDB'
    db_pass = 'secure_password'
    db_user = 'user1'
    sql_query = (
        "SELECT ldev_cap, ldev_usdcap FROM Ldevs WHERE sub_serial=%(serial)s "
        "and ldev_id=%(lun)s and data_date=%(todayD)s")
    connection = pymysql.connect(host=db_server,
                                 user=db_user,
                                 password=db_pass,
                                 db=db_name,
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql_query, {'serial': subSerialNum, 'lun': ldev, 'todayD': today_date})
            result = cursor.fetchone()
            # return result  # used for returning dict
            while result:
                ldev_cap = result['ldev_cap']
                ldev_usdcap = result['ldev_usdcap']
                print(result)
                return ldev_cap, ldev_usdcap
    finally:
        connection.close()
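Worth noting: cursor.fetchone() returns None when the query matches no row, so the while body never runs and connector_mysql falls off the end, implicitly returning None; that would produce exactly the two TypeErrors above. A minimal sketch of a guard, keeping the question's names (the error message is illustrative):

result = cursor.fetchone()
if result is None:  # no matching row; don't fall through and return None
    raise LookupError("no Ldevs row for %s/%s on %s"
                      % (subSerialNum, ldev, today_date))
return result['ldev_cap'], result['ldev_usdcap']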
I am wondering what the proper way is to use the MySQL pool with celery tasks. At the moment, this is what the relevant portion of my tasks module looks like:
from start import celery
import PySQLPool as pool

dbcfg = config.get_config('inputdb')
input_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')

dbcfg = config.get_config('outputdb')
output_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')

@celery.task
def fetch():
    ic = pool.getNewQuery(input_db)
    oc = pool.getNewQuery(output_db)
    count = 1
    for e in get_new_stuff():
        # do stuff with new stuff
        # read the db with ic
        # write to db using oc
        # commit from time to time
        if count % 1000:
            pool.commitPool()
    # commit whatever's left
    pool.commitPool()
On one machine there can be at most 4 fetch() tasks running at the same time (1 per core). I notice, however, that sometimes a task will hang, and I suspect it is due to MySQL. Any tips on how to use MySQL with celery? Thank you!
I am also using celery and PySQLPool.
maria = PySQLPool.getNewConnection(username=app.config["MYSQL_USER"],
                                   password=app.config["MYSQL_PASSWORD"],
                                   host=app.config["MYSQL_HOST"],
                                   db='configuration')

def myfunc(param1, param2):
    query = PySQLPool.getNewQuery(maria, True)
    try:
        sSql = """
            SELECT * FROM table
            WHERE col1 = %s AND col2 = %s
        """
        tDatas = (param1, param2)
        query.Query(sSql, tDatas)
        return query.record
    except Exception as e:
        logger.info(e)
        return False

@celery.task
def fetch():
    myfunc('hello', 'world')
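One thing worth double-checking with this pattern (a guess about the hang, not something from the answer above): maria is created at import time, before celery forks its worker processes, so the same MySQL connection can end up shared across processes, which is a classic cause of hanging tasks. A sketch of creating the pooled connection lazily, once per worker, using only the PySQLPool calls already shown:

_maria = None

def get_maria():
    # Create the pooled connection on first use inside the worker process,
    # not at import time (i.e. before celery forks).
    global _maria
    if _maria is None:
        _maria = PySQLPool.getNewConnection(
            username=app.config["MYSQL_USER"],
            password=app.config["MYSQL_PASSWORD"],
            host=app.config["MYSQL_HOST"],
            db='configuration')
    return _maria

@celery.task
def fetch():
    query = PySQLPool.getNewQuery(get_maria(), True)
    # ... run queries as in myfunc() above ...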