Why is Twisted's adbapi failing to recover data from within unittests?

Overview
Context
I am writing unit tests for some higher-order logic that depends on writing to an SQLite3 database. For this I am using twisted.trial.unittest and twisted.enterprise.adbapi.ConnectionPool.
Problem statement
I am able to create a persistent sqlite3 database and store data therein. Using sqlitebrowser, I am able to verify that the data has been persisted as expected.
The issue is that calls to t.e.a.ConnectionPool.run* (e.g.: runQuery) return an empty set of results, but only when called from within a TestCase.
Notes and significant details
The problem I am experiencing occurs only within Twisted's trial framework. My first attempt at debugging was to pull the database code out of the unit test and place it into an independent test/debug script. Said script works as expected while the unit test code does not (see examples below).
Case 1: misbehaving unit test
init.sql
This is the script used to initialize the database. There are no (apparent) errors stemming from this file.
CREATE TABLE ajxp_changes ( seq INTEGER PRIMARY KEY AUTOINCREMENT, node_id NUMERIC, type TEXT, source TEXT, target TEXT, deleted_md5 TEXT );
CREATE TABLE ajxp_index ( node_id INTEGER PRIMARY KEY AUTOINCREMENT, node_path TEXT, bytesize NUMERIC, md5 TEXT, mtime NUMERIC, stat_result BLOB);
CREATE TABLE ajxp_last_buffer ( id INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, location TEXT, source TEXT, target TEXT );
CREATE TABLE ajxp_node_status ("node_id" INTEGER PRIMARY KEY NOT NULL , "status" TEXT NOT NULL DEFAULT 'NEW', "detail" TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, type text, message text, source text, target text, action text, status text, date text);
CREATE TRIGGER LOG_DELETE AFTER DELETE ON ajxp_index BEGIN INSERT INTO ajxp_changes (node_id,source,target,type,deleted_md5) VALUES (old.node_id, old.node_path, "NULL", "delete", old.md5); END;
CREATE TRIGGER LOG_INSERT AFTER INSERT ON ajxp_index BEGIN INSERT INTO ajxp_changes (node_id,source,target,type) VALUES (new.node_id, "NULL", new.node_path, "create"); END;
CREATE TRIGGER LOG_UPDATE_CONTENT AFTER UPDATE ON "ajxp_index" FOR EACH ROW BEGIN INSERT INTO "ajxp_changes" (node_id,source,target,type) VALUES (new.node_id, old.node_path, new.node_path, CASE WHEN old.node_path = new.node_path THEN "content" ELSE "path" END);END;
CREATE TRIGGER STATUS_DELETE AFTER DELETE ON "ajxp_index" BEGIN DELETE FROM ajxp_node_status WHERE node_id=old.node_id; END;
CREATE TRIGGER STATUS_INSERT AFTER INSERT ON "ajxp_index" BEGIN INSERT INTO ajxp_node_status (node_id) VALUES (new.node_id); END;
CREATE INDEX changes_node_id ON ajxp_changes( node_id );
CREATE INDEX changes_type ON ajxp_changes( type );
CREATE INDEX changes_node_source ON ajxp_changes( source );
CREATE INDEX index_node_id ON ajxp_index( node_id );
CREATE INDEX index_node_path ON ajxp_index( node_path );
CREATE INDEX index_bytesize ON ajxp_index( bytesize );
CREATE INDEX index_md5 ON ajxp_index( md5 );
CREATE INDEX node_status_status ON ajxp_node_status( status );
test_sqlite.py
This is the unit test class that fails unexpectedly. TestStateManagement.test_db_clean passes, indicating that the tables were properly created. TestStateManagement.test_inode_create_file fails, reporting that zero results were retrieved.
import os.path as osp
from shutil import rmtree
from tempfile import mkdtemp

from twisted.enterprise import adbapi
from twisted.internet import defer
from twisted.trial.unittest import TestCase

import sqlengine  # see below


class TestStateManagement(TestCase):

    def setUp(self):
        self.meta = mkdtemp()
        self.db = adbapi.ConnectionPool(
            "sqlite3", osp.join(self.meta, "db.sqlite"), check_same_thread=False,
        )
        self.stateman = sqlengine.StateManager(self.db)
        with open("init.sql") as f:
            script = f.read()
        self.d = self.db.runInteraction(lambda c, s: c.executescript(s), script)

    def tearDown(self):
        self.db.close()
        del self.db
        del self.stateman
        del self.d
        rmtree(self.meta)

    @defer.inlineCallbacks
    def test_db_clean(self):
        """Canary test to ensure that the db is initialized in a blank state"""
        yield self.d  # wait for db to be initialized

        q = "SELECT name FROM sqlite_master WHERE type='table' AND name=?;"
        for table in ("ajxp_index", "ajxp_changes"):
            res = yield self.db.runQuery(q, (table,))
            self.assertTrue(
                len(res) == 1,
                "table {0} does not exist".format(table)
            )

    @defer.inlineCallbacks
    def test_inode_create_file(self):
        yield self.d  # wait for db to be initialized

        path = osp.join(self.ws, "test.txt")
        with open(path, "wt") as f:
            pass

        inode = mk_dummy_inode(path)
        yield self.stateman.create(inode, directory=False)

        entry = yield self.db.runQuery("SELECT * FROM ajxp_index")
        emsg = "got {0} results, expected 1.  Are canary tests failing?"
        lentry = len(entry)
        self.assertTrue(lentry == 1, emsg.format(lentry))
sqlengine.py
These are the artefacts being tested by the above unit tests.
from twisted.logger import Logger  # import assumed; not shown in the original snippet


def values_as_tuple(d, *param):
    """Return the values for each key in `param` as a tuple"""
    return tuple(map(d.get, param))


class StateManager:
    """Manages the SQLite database's state, ensuring that it reflects the state
    of the filesystem.
    """

    log = Logger()

    def __init__(self, db):
        self._db = db

    def create(self, inode, directory=False):
        params = values_as_tuple(
            inode, "node_path", "bytesize", "md5", "mtime", "stat_result"
        )
        directive = (
            "INSERT INTO ajxp_index (node_path,bytesize,md5,mtime,stat_result) "
            "VALUES (?,?,?,?,?);"
        )
        return self._db.runOperation(directive, params)
Case 2: bug disappears outside of twisted.trial
#! /usr/bin/env python
import os.path as osp
from tempfile import mkdtemp

from twisted.enterprise import adbapi
from twisted.internet.task import react
from twisted.internet.defer import inlineCallbacks

INIT_FILE = "example.sql"


def values_as_tuple(d, *param):
    """Return the values for each key in `param` as a tuple"""
    return tuple(map(d.get, param))


def create(db, inode):
    params = values_as_tuple(
        inode, "node_path", "bytesize", "md5", "mtime", "stat_result"
    )
    directive = (
        "INSERT INTO ajxp_index (node_path,bytesize,md5,mtime,stat_result) "
        "VALUES (?,?,?,?,?);"
    )
    return db.runOperation(directive, params)


def init_database(db):
    with open(INIT_FILE) as f:
        script = f.read()
    return db.runInteraction(lambda c, s: c.executescript(s), script)


@react
@inlineCallbacks
def main(reactor):
    meta = mkdtemp()
    db = adbapi.ConnectionPool(
        "sqlite3", osp.join(meta, "db.sqlite"), check_same_thread=False,
    )
    yield init_database(db)

    # Let's make sure the tables were created as expected and that we're
    # starting from a blank slate
    res = yield db.runQuery("SELECT * FROM ajxp_index LIMIT 1")
    assert not res, "database is not empty [ajxp_index]"
    res = yield db.runQuery("SELECT * FROM ajxp_changes LIMIT 1")
    assert not res, "database is not empty [ajxp_changes]"

    # The details of this are not important.  Suffice to say they (should)
    # conform to the DB schema for ajxp_index.
    test_data = {
        "node_path": "/this/is/some/arbitrary/path.ext",
        "bytesize": 0,
        "mtime": 179273.0,
        "stat_result": b"this simulates a blob of raw binary data",
        "md5": "d41d8cd98f00b204e9800998ecf8427e",  # arbitrary
    }

    # store the test data in the ajxp_index table
    yield create(db, test_data)

    # test if the entry exists in the db
    entry = yield db.runQuery("SELECT * FROM ajxp_index")
    assert len(entry) == 1, "got {0} results, expected 1".format(len(entry))

    print("OK")
Closing remarks
Again, upon checking with sqlitebrowser, it seems as though the data is being written to db.sqlite, so this looks like a retrieval problem. From here, I'm sort of stumped... any ideas?
EDIT
This code will produce an inode that can be used for testing.
from os import stat        # assumed import; omitted in the original snippet
from pickle import dumps   # assumed import; omitted in the original snippet


def mk_dummy_inode(path, isdir=False):
    return {
        "node_path": path,
        "bytesize": osp.getsize(path),
        "mtime": osp.getmtime(path),
        "stat_result": dumps(stat(path), protocol=4),
        "md5": "directory" if isdir else "d41d8cd98f00b204e9800998ecf8427e",
    }

Okay, it turns out that this is a bit of a tricky one. Running the tests in isolation (as was posted to this question) makes it such that the bug only rarely occurs. However, when running in the context of an entire test suite, it fails almost 100% of the time.
I added yield task.deferLater(reactor, .00001, lambda: None) after writing to the db and before reading from the db, and this solves the issue.
From there, I suspected this might be a race condition stemming from the connection pool and SQLite's limited tolerance for concurrency. I tried setting the ConnectionPool's cp_min and cp_max parameters to 1, and this also solved the issue.
In short: it seems as though sqlite doesn't play very nicely with multiple connections, and that the appropriate fix is to avoid concurrency to the extent possible.
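For reference, a minimal sketch of the single-connection pool described above, using adbapi's cp_min/cp_max keyword arguments (the rest of setUp stays the same):
self.db = adbapi.ConnectionPool(
    "sqlite3",
    osp.join(self.meta, "db.sqlite"),
    check_same_thread=False,
    # a single connection means every query runs sequentially on the same
    # SQLite connection, avoiding the race described above
    cp_min=1,
    cp_max=1,
)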

If you take a look at your setUp function, you're calling self.db.runInteraction(...), which returns a Deferred. As you've noted, you assume that setUp waits for that Deferred to finish. However, this is not the case, and it's a trap most of us fall victim to (myself included). I'll be honest with you: for situations like this, especially for unit tests, I just execute the synchronous code outside the TestCase class to initialize the database. For example:
def init_db():
    import sqlite3
    conn = sqlite3.connect('db.sqlite')
    c = conn.cursor()
    with open("init.sql") as f:
        c.executescript(f.read())


init_db()  # call outside test case


class TestStateManagement(TestCase):
    """
    My test cases
    """
Alternatively, you could decorate setUp and yield the runInteraction(...) call (a sketch follows below), but something tells me that it wouldn't work... In any case, it's surprising that no errors were raised.
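For completeness, a minimal sketch of that alternative, reusing the imports from the question's test module and relying on trial waiting for the Deferred returned by an inlineCallbacks-decorated setUp (untested, as noted above):
class TestStateManagement(TestCase):

    @defer.inlineCallbacks
    def setUp(self):
        self.meta = mkdtemp()
        self.db = adbapi.ConnectionPool(
            "sqlite3", osp.join(self.meta, "db.sqlite"), check_same_thread=False,
        )
        self.stateman = sqlengine.StateManager(self.db)
        with open("init.sql") as f:
            script = f.read()
        # trial waits on the Deferred that setUp returns, so the schema
        # should be in place before any test method runs.
        yield self.db.runInteraction(lambda c, s: c.executescript(s), script)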
PS
I've been eyeballing this question for a while and it's been in the back of my head for days now. A potential reason for this finally dawned on me at nearly 1am. However, I'm too tired/lazy to actually test this out :D but it's a pretty damn good hunch. I'd like to commend you on your level of detail in this question.

Related

Python sqlalchemy and mySQL stored procedure always returns 0 (out param only)

I am trying to get the ROW_COUNT() from a MySQL stored procedure into Python.
Here is what I have so far, but I don't know what I am missing.
DELIMITER //
CREATE OR REPLACE PROCEDURE sp_refresh_mytable(
OUT row_count INT
)
BEGIN
DECLARE exit handler for SQLEXCEPTION
BEGIN
ROLLBACK;
END;
DECLARE exit handler for SQLWARNING
BEGIN
ROLLBACK;
END;
DECLARE exit handler FOR NOT FOUND
BEGIN
ROLLBACK;
END;
START TRANSACTION;
DELETE FROM mytable;
INSERT INTO mytable
(
col1
, col2
)
SELECT
col1
, col2
FROM othertable
;
SET row_count = ROW_COUNT();
COMMIT;
END //
DELIMITER ;
If I call this via plain SQL as follows, I get the correct row_count for the insert operation (e.g. 26 rows inserted):
CALL sp_refresh_mytable(@rowcount);
select @rowcount as t;
-- output: 26
Then in Python/SQLAlchemy:
def call_procedure(engine, function_name, params=None):
    connection = engine.raw_connection()
    try:
        cursor = connection.cursor()
        result = cursor.callproc('sp_refresh_mytable', [0])
        ## try result outputs
        resultfetch = cursor.fetchone()
        logger.info(result)
        logger.info(result[0])
        logger.info(resultfetch)
        cursor.close()
        connection.commit()
        connection.close()
        logger.info(f"Running procedure {function_name} success!")
        return result
    except Exception as e:
        logger.error(f"Running procedure {function_name} failed!")
        logger.exception(e)
        return None
    finally:
        connection.close()
So I tried logging different variations of getting the out value, but it is always 0 or None.
[INFO] db_update [0]
[INFO] db_update 0
[INFO] db_update None
What am I missing?
Thanks!
With the help of this answer I found the following solution that worked for me.
a) Working solution using engine.raw_connection() and cursor.callproc:
def call_procedure(engine, function_name):
    connection = engine.raw_connection()
    try:
        cursor = connection.cursor()
        cursor.callproc(function_name, [0])
        cursor.execute(f"""SELECT @_{function_name}_0""")
        results = cursor.fetchone()  ## returns a tuple e.g. (285,)
        rows_affected = results[0]
        cursor.close()
        connection.commit()
        logger.info(f"Running procedure {function_name} success!")
        return rows_affected
    except Exception as e:
        logger.error(f"Running procedure {function_name} failed!")
        logger.exception(e)
        return None
    finally:
        connection.close()
And with this answer I found this solution also:
b) Instead of using a raw connection, this worked as well:
def call_procedure(engine, function_name, params=None):
    try:
        with engine.begin() as db_conn:
            db_conn.execute(f"""CALL {function_name}(@out)""")
            results = db_conn.execute('SELECT @out').fetchone()  ## returns a tuple e.g. (285,)
            rows_affected = results[0]
            logger.debug(f"Running procedure {function_name} success!")
            return rows_affected
    except Exception as e:
        logger.error(f"Running procedure {function_name} failed!")
        logger.exception(e)
        return None
    finally:
        if db_conn: db_conn.close()
If there are any advantages or drawbacks of using one of these methods over the other, please let me know in a comment.
I just wanted to add another piece of code, since I was trying to get callproc to work (using sqlalchemy) with multiple in- and out-params.
For this case I went with the callproc method on a raw connection [solution a) in my previous answer], since this function accepts the params as a list.
It could probably be done more elegantly or more pythonically in places, but it was mainly about getting it to work, and I will probably turn this into a function so I can use it generically for calling a stored procedure with multiple in and out params.
I included comments in the code below to make it easier to understand what is going on.
In my case I decided to put the out-params in a dict so I can pass it along to the calling app in case I need to react to the results. Of course you could also include the in-params which could make sense for error logging maybe.
from copy import copy
from pprint import pformat

## some in params
function_name = 'sp_upsert'
in_param1 = 'my param 1'
in_param2 = 'abcdefg'
in_param3 = 'some-name'
in_param4 = 'some display name'
in_params = [in_param1, in_param2, in_param3, in_param4]

## out params
out_params = [
    'out1_row_count'
    , 'out2_row_count'
    , 'out3_row_count'
    , 'out4_row_count_ins'
    , 'out5_row_count_upd'
]

params = copy(in_params)
## adding the out params as integers from the out_params indices
params.extend([i for i, x in enumerate(out_params)])
## the params list will look like
## ['my param 1', 'abcdefg', 'some-name', 'some display name', 0, 1, 2, 3, 4]
logger.info(params)

## build query to get results from callproc (including in and out params)
res_qry_params = []
for i in range(len(params)):
    res_qry_params.append(f"@_{function_name}_{i}")
res_qry = f"SELECT {', '.join(res_qry_params)}"
## the query to fetch the results (in and out params) will look like
## SELECT @_sp_upsert_0, @_sp_upsert_1, @_sp_upsert_2, @_sp_upsert_3, @_sp_upsert_4,
##        @_sp_upsert_5, @_sp_upsert_6, @_sp_upsert_7, @_sp_upsert_8
logger.info(res_qry)

try:
    connection = engine.raw_connection()
    ## calling the sp
    cursor = connection.cursor()
    cursor.callproc(function_name, params)
    ## get the results (includes in and out params); the 0/1 at the end are the row_counts from the sp
    ## fetchone is enough since all results come back as one record like
    ## ('my param 1', 'abcdefg', 'some-name', 'some display name', 1, 0, 1, 1, 0)
    cursor.execute(res_qry)
    results = cursor.fetchone()
    logger.info(results)
    ## adding just the out params to a dict
    res_dict = {}
    for i, element in enumerate(out_params):
        res_dict.update({
            element: results[i + len(in_params)]
        })
    ## the result dict in this case only contains the out param results and will look like
    ## { 'out1_row_count': 1,
    ##   'out2_row_count': 0,
    ##   'out3_row_count': 1,
    ##   'out4_row_count_ins': 1,
    ##   'out5_row_count_upd': 0}
    logger.info(pformat(res_dict, indent=2, sort_dicts=False))
    cursor.close()
    connection.commit()
    logger.debug(f"Running procedure {function_name} success!")
except Exception as e:
    logger.error(f"Running procedure {function_name} failed!")
    logger.exception(e)
Just to complete the picture, here is a shortened version of my stored procedure. After BEGIN I declare some error handlers and set the out params to a default of 0; otherwise they could come back as NULL/None if the procedure never sets them (e.g. because no insert was made):
DELIMITER //
CREATE OR REPLACE PROCEDURE sp_upsert(
IN in_param1 VARCHAR(32),
IN in_param2 VARCHAR(250),
IN in_param3 VARCHAR(250),
IN in_param4 VARCHAR(250),
OUT out1_row_count INTEGER,
OUT out2_row_count INTEGER,
OUT out3_row_count INTEGER,
OUT out4_row_count_ins INTEGER,
OUT out5_row_count_upd INTEGER
)
BEGIN
-- declare variables, do NOT declare the out params here!
DECLARE dummy INTEGER DEFAULT 0;
-- declare error handlers (e.g. continue handler for not found)
DECLARE CONTINUE HANDLER FOR NOT FOUND SET dummy = 1;
-- set out params defaulting to 0
SET out1_row_count = 0;
SET out2_row_count = 0;
SET out3_row_count = 0;
SET out4_row_count_ins = 0;
SET out5_row_count_upd = 0;
-- do inserts and updates and set the outparam variables accordingly
INSERT INTO some_table ...;
SET out1_row_count = ROW_COUNT();
-- commit if no errors
COMMIT;
END //
DELIMITER ;

How to change one value in one place and use it in a couple of functions?

I'm writing test automation for an API in BDD behave. I need a switch between environments. Is there any way to change one value in a single place without adding this value to every function? Example:
I've tried adding the value to every function, but it makes the whole project very complicated.
headers = {
    'Content-Type': 'application/json',
    'country': 'fi'
}
What I want is to switch only the country value in headers, e.g. from 'fi' to 'es',
and then have all functions switch themselves to the es environment, e.g.:
def sending_post_request(endpoint, user):
    url = fi_api_endpoints.api_endpoints_list.get(endpoint)
    personalId = {'personalId': user}
    json_post = requests.post(url,
                              headers=headers,
                              data=json.dumps(personalId)
                              )
    endpoint_message = json_post.text
    server_status = json_post.status_code


def phone_number(phone_number_status):
    if phone_number_status == 'wrong':
        cursor = functions_concerning_SQL_conection.choosen_db('fi_sql_identity')
        cursor.execute("SELECT TOP 1 PersonalId from Registrations where PhoneNumber is NULL")
        result = cursor.fetchone()
        user_with_no_phone_number = result[0]
        return user_with_no_phone_number
    else:
        cursor = functions_concerning_SQL_conection.choosen_db('fi_sql_identity')
        cursor.execute("SELECT TOP 1 PersonalId from Registrations where PhoneNumber is not NULL")
        result = cursor.fetchone()
        user_with_phone_number = result[0]
        return user_with_phone_number
And when I change from 'fi' to 'es' in headers, I want:
fi_sql_identity to change to es_sql_identity, and
url = fi_api_endpoints.api_endpoints_list.get(endpoint) to change to
url = es_api_endpoints.api_endpoints_list.get(endpoint)
Thanks, and please help.
With respect to your original question, a solution for this case is a closure:
def f(x):
    def long_calculation(y):
        return x * y
    return long_calculation


# create different functions without dispatching multiple times
g = f(val_1)
h = f(val_2)

g(val_3)
h(val_3)
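As a rough illustration of how that closure idea could map onto your region switch (make_request_sender is a hypothetical factory; fi_api_endpoints, es_api_endpoints, headers, requests and json are the names from your own code):
def make_request_sender(region):
    # Resolve the region-specific resources once, then capture them in the closure.
    endpoints = fi_api_endpoints if region == 'fi' else es_api_endpoints
    db_name = '{0}_sql_identity'.format(region)

    def sending_post_request(endpoint, user):
        url = endpoints.api_endpoints_list.get(endpoint)
        personalId = {'personalId': user}
        return requests.post(url, headers=headers, data=json.dumps(personalId))

    return sending_post_request, db_name


send_fi, fi_db = make_request_sender('fi')
send_es, es_db = make_request_sender('es')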
Well, the problem is that you hardcode everything. With the update, you can simplify your function as:
def phone_number(phone_number_status, db_name='fi_sql_identity'):
    cursor = functions_concerning_SQL_conection.choosen_db(db_name)
    if phone_number_status == 'wrong':
        sql = "SELECT TOP 1 PersonalId from Registrations where PhoneNumber is NULL"
    else:
        sql = "SELECT TOP 1 PersonalId from Registrations where PhoneNumber is not NULL"
    cursor.execute(sql)
    result = cursor.fetchone()
    return result[0]
Also, please don't write code like this:
# WRONG
fi_db_conn.send_data()
But use a parameter:
region = 'fi' # or "es"
db_conn = initialize_conn(region)
db_conn.send_data()
And use a config file to store your endpoints with respect to your region, e.g. consider YAML:
# config.yml
es:
  db_name: es_sql_identity
fi:
  db_name: fi_sql_identity
Then use them in Python:
import yaml

with open('config.yml') as f:
    config = yaml.safe_load(f)

region = 'fi'
db_name = config[region]['db_name']  # "fi_sql_identity"
# status = ...
result = phone_number(status, db_name)
See additional useful link for using YAML.
First, provide an encapsulation of how to access the resources of a region by giving this encapsulation a region parameter. It may also be a good idea to provide this functionality as a behave fixture.
CASE 1: region parameter needs to vary between features / scenarios
For example, this means that SCENARIO_1 needs region="fi" and SCENARIO_2 needs region="es".
Use a fixture and a fixture-tag with a region parameter (a sketch follows below).
In this case you either need to write separate scenarios for each region (BAD TEST REUSE),
or use a ScenarioOutline as a template to let behave generate the tests for you (for example by using a fixture-tag with a region parameter value).
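A rough sketch of such a region fixture selected by tag, following behave's fixture-registry pattern (the tag names and region_* fixtures are illustrative, not from the question):
# environment.py
from behave import fixture
from behave.fixture import use_fixture_by_tag


@fixture
def region_fi(context):
    context.region = "fi"
    yield context.region


@fixture
def region_es(context):
    context.region = "es"
    yield context.region


fixture_registry = {
    "fixture.region.fi": region_fi,
    "fixture.region.es": region_es,
}


def before_tag(context, tag):
    if tag.startswith("fixture."):
        return use_fixture_by_tag(tag, context, fixture_registry)
Scenarios or features then opt in with the @fixture.region.fi or @fixture.region.es tag.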
CASE 2: region parameter is constant for all features / scenarios (during test-run)
You can support multiple test-runs with different region parameters by using a userdata parameter.
Look at behave userdata concept.
This allows you to run behave -D region=fi ... and behave -D region=es ...
This case provides better reuse of the test suite: a large part of it becomes a common suite that is applied to all regions (see the sketch below).
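A minimal sketch of reading that userdata value once in environment.py (the attribute names on context are illustrative):
# environment.py
def before_all(context):
    # run as: behave -D region=es ...   (falls back to "fi" when not given)
    context.region = context.config.userdata.get("region", "fi")
    context.db_name = "{0}_sql_identity".format(context.region)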
HINT: Your code examples are too specific ("fi" based) which is a BAD-SMELL.

SQLAlchemy 'on_conflict_do_update' does not update

I have the following code which I would like to do an upsert:
def add_electricity_reading(
    *, period_usage, period_started_at, is_estimated, customer_pk
):
    from sqlalchemy.dialects.postgresql import insert

    values = dict(
        customer_pk=customer_pk,
        period_usage=period_usage,
        period_started_at=period_started_at,
        is_estimated=is_estimated,
    )
    insert_stmt = insert(ElectricityMeterReading).values(**values)
    do_update_stmt = insert_stmt.on_conflict_do_update(
        constraint=ElectricityMeterReading.__table_args__[0].name,
        set_=dict(
            period_usage=period_usage,
            period_started_at=period_started_at,
            is_estimated=is_estimated,
        )
    )

    conn = DBSession.connection()
    conn.execute(do_update_stmt)
    return DBSession.query(ElectricityMeterReading).filter_by(**dict(
        period_usage=period_usage,
        period_started_at=period_started_at,
        customer_pk=customer_pk,
        is_estimated=is_estimated,
    )).one()
def test_updates_existing_record_for_started_at_if_already_exists():
    started_at = datetime.now(timezone.utc)
    existing = add_electricity_reading(
        period_usage=0.102,
        customer_pk=customer.pk,
        period_started_at=started_at,
        is_estimated=True,
    )
    started_at = existing.period_started_at
    reading = add_electricity_reading(
        period_usage=0.200,
        customer_pk=customer.pk,
        period_started_at=started_at,
        is_estimated=True,
    )

    # existing record was updated
    assert reading.period_usage == 0.200
    assert reading.id == existing.id
In my test I add a record with period_usage=0.102 and then execute the function again, this time with period_usage=0.2. When the final query at the bottom returns the record, the period_usage is still 0.102.
Any idea why this could be happening?
This behaviour is explained in "Session Basics" under "What does the Session do?" The session holds references to objects it has loaded in a structure called the identity map, and so ensures that only 1 unique object per primary key value exists at a time during a session's lifetime. You can verify this with the following assertion in your own code:
assert existing is reading
The Core insert (or update) statements you are executing do not keep the session in sync with the changes taking place in the database the way for example Query.update() does. In order to fetch the new values you can expire the ORM loaded state of the unique object:
DBSession.expire(existing) # or reading, does not matter
# existing record was updated
assert reading.period_usage == 0.200
assert reading.id == existing.id

Python Cassandra: get a big result of SELECT * as a generator (without storing the result in RAM)

I want to get all the data in the Cassandra table "user".
I have 840000 users and I don't want to load them all into a Python list;
I want to get the users in batches of 100.
In the Cassandra docs https://datastax.github.io/python-driver/query_paging.html
I see that I can use fetch_size, but in my Python code I have a database object that contains all the CQL instructions:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement


class Database:

    def __init__(self, name, salary):
        self.cluster = Cluster(['192.168.1.1', '192.168.1.2'])
        self.session = self.cluster.connect()

    def get_users(self):
        users_list = []
        query = "SELECT * FROM users"
        statement = SimpleStatement(query, fetch_size=10)
        for user_row in self.session.execute(statement):
            users_list.append(user_row.name)
        return users_list
Currently get_users returns a very big list of user names, but I want to turn get_users into a "generator".
I don't want all the user names in one list from a single call to get_users; I want to call get_users repeatedly and get back a list of at most 100 users on each call.
For example:
list1 = database.get_users()
list2 = database.get_users()
...
listn = database.get_users()
list1 would contain the first 100 users of the query,
list2 the second 100 users,
and listn the last elements of the query (<= 100).
Is this possible?
Thanks in advance for your answers.
According to Paging Large Queries:
Whenever there are no more rows in the current page, the next page
will be fetched transparently.
So, if you execute your code like this, you will still get the whole result set, but it is paged transparently.
In order to achieve what you want, you need to use callbacks. You can also find some code samples at the link above.
I added below the full code for reference.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from threading import Event


class PagedResultHandler(object):

    def __init__(self, future):
        self.error = None
        self.finished_event = Event()
        self.future = future
        self.future.add_callbacks(
            callback=self.handle_page,
            errback=self.handle_error)

    def handle_page(self, rows):
        for row in rows:
            process_row(row)

        if self.future.has_more_pages:
            self.future.start_fetching_next_page()
        else:
            self.finished_event.set()

    def handle_error(self, exc):
        self.error = exc
        self.finished_event.set()


def process_row(user_row):
    print(user_row.name, user_row.age, user_row.email)


cluster = Cluster()
session = cluster.connect()

query = "SELECT * FROM myschema.users"
statement = SimpleStatement(query, fetch_size=5)
future = session.execute_async(statement)

handler = PagedResultHandler(future)
handler.finished_event.wait()
if handler.error:
    raise handler.error
cluster.shutdown()
Moving to the next page is done in handle_page when start_fetching_next_page is called.
If you replace the if statement with self.finished_event.set(), you will see that the iteration stops after the first 5 rows, as defined by fetch_size.
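If a plain generator is enough and you don't need callbacks, a minimal sketch of a get_users-style method on the question's Database class can rely on that transparent paging; fetch_size only controls how many rows are pulled per round trip, while the caller still iterates row by row:
def iter_users(self, fetch_size=100):
    # Rows are fetched from Cassandra in pages of `fetch_size`,
    # but yielded to the caller one at a time.
    statement = SimpleStatement("SELECT * FROM users", fetch_size=fetch_size)
    for user_row in self.session.execute(statement):
        yield user_row.name

# usage: for name in database.iter_users(): ...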

Cassandra multiprocessing can't pickle _thread.lock objects

I tried to use Cassandra and multiprocessing to insert rows (dummy data) concurrently based on the examples in
http://www.datastax.com/dev/blog/datastax-python-driver-multiprocessing-example-for-improved-bulk-data-throughput
This is my code
import itertools
import time
from multiprocessing import Pool

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args


class QueryManager(object):

    concurrency = 100  # chosen to match the default in execute_concurrent_with_args

    def __init__(self, session, process_count=None):
        self.pool = Pool(processes=process_count, initializer=self._setup, initargs=(session,))

    @classmethod
    def _setup(cls, session):
        cls.session = session
        cls.prepared = cls.session.prepare("""
            INSERT INTO test_table (key1, key2, key3, key4, key5) VALUES (?, ?, ?, ?, ?)
            """)

    def close_pool(self):
        self.pool.close()
        self.pool.join()

    def get_results(self, params):
        results = self.pool.map(
            _multiprocess_write,
            (params[n:n + self.concurrency] for n in range(0, len(params), self.concurrency))
        )
        return list(itertools.chain(*results))

    @classmethod
    def _results_from_concurrent(cls, params):
        return [results[1] for results in execute_concurrent_with_args(cls.session, cls.prepared, params)]


def _multiprocess_write(params):
    return QueryManager._results_from_concurrent(params)


if __name__ == '__main__':
    processes = 2

    # connect cluster
    cluster = Cluster(contact_points=['127.0.0.1'], port=9042)
    session = cluster.connect()

    # database name is a concatenation of client_id and system_id
    keyspace_name = 'unit_test_0'

    # drop keyspace if it already exists in a cluster
    try:
        session.execute("DROP KEYSPACE IF EXISTS " + keyspace_name)
    except:
        pass

    create_keyspace_query = "CREATE KEYSPACE " + keyspace_name \
        + " WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};"
    session.execute(create_keyspace_query)

    # use a session's keyspace
    session.set_keyspace(keyspace_name)

    # drop table if it already exists in the keyspace
    try:
        session.execute("DROP TABLE IF EXISTS " + "test_table")
    except:
        pass

    # create a table for invoices in the keyspace
    create_test_table = "CREATE TABLE test_table("
    keys = "key1 text,\n" \
           "key2 text,\n" \
           "key3 text,\n" \
           "key4 text,\n" \
           "key5 text,\n"
    create_test_table += keys
    create_test_table += "PRIMARY KEY (key1))"
    session.execute(create_test_table)

    qm = QueryManager(session, processes)

    params = list()
    for row in range(100000):
        key = 'test' + str(row)
        params.append([key, 'test', 'test', 'test', 'test'])

    start = time.time()
    rows = qm.get_results(params)
    delta = time.time() - start
    # `log` and `fm` are logging helpers from the question's project (not shown here)
    log.info(fm('Cassandra inserts 100k dummy rows for ', delta, ' secs'))
When I executed the code, I got the following error:
TypeError: can't pickle _thread.lock objects
which pointed at
self.pool = Pool(processes=process_count, initializer=self._setup, initargs=(session,))
That would suggest you're trying to serialize a lock over IPC boundaries. I think it might be because you're supplying a Session object as an argument to the worker initialization function. Make the init function create a new session in each worker process (see the "Session per Process" section in the blog post you cited).
I know this already has an answer, but I wanted to highlight a couple of changes in the cassandra-driver package that mean this code still doesn't work properly with Python 3.7 and cassandra-driver 3.18.0.
If you look at the blog post that is linked, the __init__ function doesn't pass in the session; it passes a cluster object. Even the cluster cannot be sent as an initarg anymore because it contains a lock. You need to create it inside the _setup(cls) classmethod.
Second, execute_concurrent_with_args returns a ResultSet now and that also cannot be serialized. The older version of the cassandra-driver package just returned a list of objects.
To fix the above code change these 2 sections:
First, the __init__ and _setup methods
def __init__(self, process_count=None):
    self.pool = Pool(processes=process_count, initializer=self._setup)

@classmethod
def _setup(cls):
    cluster = Cluster()
    cls.session = cluster.connect()
    cls.prepared = cls.session.prepare("""
        INSERT INTO test_table (key1, key2, key3, key4, key5) VALUES (?, ?, ?, ?, ?)
        """)
Second, the _results_from_concurrent method
@classmethod
def _results_from_concurrent(cls, params):
    return [list(results[1]) for results in execute_concurrent_with_args(cls.session, cls.prepared, params)]
Lastly, if you are interested in a gist for the multiprocess_execute.py in the original DataStax blog post that works with python3 and cassandra-driver 3.18.0, you can find that here: https://gist.github.com/jWolo/6127b2e57c7e24740afd7a4254cc00a3
