I store some numbers in a MySQL database using the SQLAlchemy ORM. When I fetch them afterward, they come back truncated so that only 6 significant digits are preserved, losing a lot of precision on my float values. I suppose there is an easy way to fix this, but I can't find how. For example, the following code:
import sqlalchemy as sa
from sqlalchemy.pool import QueuePool
import sqlalchemy.ext.declarative as sad

Base = sad.declarative_base()
Session = sa.orm.scoped_session(sa.orm.sessionmaker())

class Test(Base):
    __tablename__ = "test"
    __table_args__ = {'mysql_engine': 'InnoDB'}
    no = sa.Column(sa.Integer, primary_key=True)
    x = sa.Column(sa.Float)

a = 43210.123456789
b = 43210.0
print a, b, a - b

dbEngine = sa.create_engine("mysql://chore:BlockWork33!@localhost", poolclass=QueuePool, pool_size=20,
                            pool_timeout=180)
Session.configure(bind=dbEngine)
session = Session()

dbEngine.execute("CREATE DATABASE IF NOT EXISTS test")
dbEngine.execute("USE test")
Base.metadata.create_all(dbEngine)

try:
    session.add_all([Test(x=a), Test(x=b)])
    session.commit()
except:
    session.rollback()
    raise

[(a,), (b,)] = session.query(Test.x).all()
print a, b, a - b
produces
43210.1234568 43210.0 0.123456788999
43210.1 43210.0 0.0999999999985
and I need a solution so that it produces
43210.1234568 43210.0 0.123456788999
43210.1234568 43210.0 0.123456788999
Per our discussion in the comments: using sa.types.Float(precision=[precision here]) instead of plain sa.Float lets you specify the precision; note, however, that sa.Float(Precision=32) (with a capital P) has no effect. See the documentation for more information.
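For example, declaring the column with an explicit precision might look like this (a minimal sketch based on the Test model above; 32 is only an illustrative value, and on MySQL you may want an even larger precision or the dialect-specific DOUBLE type):

import sqlalchemy as sa

class Test(Base):
    __tablename__ = "test"
    __table_args__ = {'mysql_engine': 'InnoDB'}
    no = sa.Column(sa.Integer, primary_key=True)
    # lowercase 'precision' is the keyword that takes effect; 'Precision' is ignored
    x = sa.Column(sa.Float(precision=32))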
I am trying to create a function which produces a statement equivalent to datetime.utcnow() + timedelta(days=5, minutes=4). I want to be able to call it like utc_after(days=5, minutes=4).
It should be similar to utcnow(), as described in the SQLAlchemy documentation.
Here is an example of what I have working so far (one dialect only, for brevity):
from sqlalchemy.sql.expression import FunctionElement
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import DateTime

class utc_after(FunctionElement):
    type = DateTime()
    name = 'utc_after'

@compiles(utc_after, 'sqlite')
def sqlite_utc_after(element, compiler, **kwargs):
    days, hours = list(element.clauses)
    return "datetime('now', '+%s day', '+%s hours')" % (days.value, hours.value)
It works. I can use it as in:
week_after_submission = db.Column(db.DateTime, default=utc_after(7, 0))
Obviously this is only a stub of the final code (it needs +/- formatting, minutes, seconds and so on).
The question is: how can I use keyword arguments from FunctionElement so I could have:
next_week = db.Column(db.DateTime, default=utc_after(days=7))
When I specify utc_after(days=7), element.clauses is empty.
So far I have tried using kwargs from sqlite_utc_after (these are empty), digging through the element's properties, and searching for clues in the documentation, without results.
Set the keyword argument as a property.
from sqlalchemy import Table, Column, String, MetaData, func, select
from sqlalchemy.sql.functions import GenericFunction
from sqlalchemy.ext.compiler import compiles

class last_value(GenericFunction):
    def __init__(self, *clauses, ignore_nulls=False, **kwargs):
        # stash the keyword argument on the element so the compiler can read it
        self.ignore_nulls = ignore_nulls
        super().__init__(*clauses, **kwargs)

@compiles(last_value)
def visit_last_value(element, compiler, **kwargs):
    clauses = element.clauses
    ignore_nulls = element.ignore_nulls
    ignore_nulls_clause = ' ignore nulls' if ignore_nulls else ''
    return 'last_value(%s%s)' % (compiler.process(clauses), ignore_nulls_clause)

def test_last_value():
    m = MetaData()
    t = Table('test', m, Column('a', String), Column('b', String))
    c = func.last_value(t.c.a, ignore_nulls=True)
    s = select([c])
    assert ' '.join(str(s).split()) == 'SELECT last_value(test.a ignore nulls) AS last_value_1 FROM test'
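The same keyword-as-property approach can be adapted to the original utc_after stub; here is a rough, untested sketch for the SQLite dialect only, mirroring the code above:

from sqlalchemy.sql.functions import GenericFunction
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import DateTime

class utc_after(GenericFunction):
    type = DateTime()
    name = 'utc_after'

    def __init__(self, *clauses, days=0, hours=0, **kwargs):
        # keep the keyword arguments as plain properties on the element
        self.days = days
        self.hours = hours
        super().__init__(*clauses, **kwargs)

@compiles(utc_after, 'sqlite')
def sqlite_utc_after(element, compiler, **kwargs):
    # read the values back from the properties instead of element.clauses
    return "datetime('now', '+%d days', '+%d hours')" % (element.days, element.hours)

# usage: next_week = db.Column(db.DateTime, default=utc_after(days=7))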
Overview
Context
I am writing unit tests for some higher-order logic that depends on writing to an SQLite3 database. For this I am using twisted.trial.unittest and twisted.enterprise.adbapi.ConnectionPool.
Problem statement
I am able to create a persistent sqlite3 database and store data therein. Using sqlitebrowser, I am able to verify that the data has been persisted as expected.
The issue is that calls to t.e.a.ConnectionPool.run* (e.g.: runQuery) return an empty set of results, but only when called from within a TestCase.
Notes and significant details
The problem I am experiencing occurs only within Twisted's trial framework. My first attempt at debugging was to pull the database code out of the unit test and place it into an independent test/debug script. Said script works as expected while the unit test code does not (see examples below).
Case 1: misbehaving unit test
init.sql
This is the script used to initialize the database. There are no (apparent) errors stemming from this file.
CREATE TABLE ajxp_changes ( seq INTEGER PRIMARY KEY AUTOINCREMENT, node_id NUMERIC, type TEXT, source TEXT, target TEXT, deleted_md5 TEXT );
CREATE TABLE ajxp_index ( node_id INTEGER PRIMARY KEY AUTOINCREMENT, node_path TEXT, bytesize NUMERIC, md5 TEXT, mtime NUMERIC, stat_result BLOB);
CREATE TABLE ajxp_last_buffer ( id INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, location TEXT, source TEXT, target TEXT );
CREATE TABLE ajxp_node_status ("node_id" INTEGER PRIMARY KEY NOT NULL , "status" TEXT NOT NULL DEFAULT 'NEW', "detail" TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, type text, message text, source text, target text, action text, status text, date text);
CREATE TRIGGER LOG_DELETE AFTER DELETE ON ajxp_index BEGIN INSERT INTO ajxp_changes (node_id,source,target,type,deleted_md5) VALUES (old.node_id, old.node_path, "NULL", "delete", old.md5); END;
CREATE TRIGGER LOG_INSERT AFTER INSERT ON ajxp_index BEGIN INSERT INTO ajxp_changes (node_id,source,target,type) VALUES (new.node_id, "NULL", new.node_path, "create"); END;
CREATE TRIGGER LOG_UPDATE_CONTENT AFTER UPDATE ON "ajxp_index" FOR EACH ROW BEGIN INSERT INTO "ajxp_changes" (node_id,source,target,type) VALUES (new.node_id, old.node_path, new.node_path, CASE WHEN old.node_path = new.node_path THEN "content" ELSE "path" END);END;
CREATE TRIGGER STATUS_DELETE AFTER DELETE ON "ajxp_index" BEGIN DELETE FROM ajxp_node_status WHERE node_id=old.node_id; END;
CREATE TRIGGER STATUS_INSERT AFTER INSERT ON "ajxp_index" BEGIN INSERT INTO ajxp_node_status (node_id) VALUES (new.node_id); END;
CREATE INDEX changes_node_id ON ajxp_changes( node_id );
CREATE INDEX changes_type ON ajxp_changes( type );
CREATE INDEX changes_node_source ON ajxp_changes( source );
CREATE INDEX index_node_id ON ajxp_index( node_id );
CREATE INDEX index_node_path ON ajxp_index( node_path );
CREATE INDEX index_bytesize ON ajxp_index( bytesize );
CREATE INDEX index_md5 ON ajxp_index( md5 );
CREATE INDEX node_status_status ON ajxp_node_status( status );
test_sqlite.py
This is the unit test class that fails unexpectedly. TestStateManagement.test_db_clean passes, indicating that the tables were properly created. TestStateManagement.test_inode_create_file fails, reporting that zero results were retrieved.
import os.path as osp
from tempfile import mkdtemp
from shutil import rmtree

from twisted.internet import defer
from twisted.trial.unittest import TestCase
from twisted.enterprise import adbapi

import sqlengine  # see below
# mk_dummy_inode is defined in the EDIT below

class TestStateManagement(TestCase):
    def setUp(self):
        self.meta = mkdtemp()
        self.db = adbapi.ConnectionPool(
            "sqlite3", osp.join(self.meta, "db.sqlite"), check_same_thread=False,
        )
        self.stateman = sqlengine.StateManager(self.db)
        with open("init.sql") as f:
            script = f.read()
        self.d = self.db.runInteraction(lambda c, s: c.executescript(s), script)

    def tearDown(self):
        self.db.close()
        del self.db
        del self.stateman
        del self.d
        rmtree(self.meta)

    @defer.inlineCallbacks
    def test_db_clean(self):
        """Canary test to ensure that the db is initialized in a blank state"""
        yield self.d  # wait for db to be initialized
        q = "SELECT name FROM sqlite_master WHERE type='table' AND name=?;"
        for table in ("ajxp_index", "ajxp_changes"):
            res = yield self.db.runQuery(q, (table,))
            self.assertTrue(
                len(res) == 1,
                "table {0} does not exist".format(table)
            )

    @defer.inlineCallbacks
    def test_inode_create_file(self):
        yield self.d
        path = osp.join(self.ws, "test.txt")
        with open(path, "wt") as f:
            pass
        inode = mk_dummy_inode(path)
        yield self.stateman.create(inode, directory=False)
        entry = yield self.db.runQuery("SELECT * FROM ajxp_index")
        emsg = "got {0} results, expected 1. Are canary tests failing?"
        lentry = len(entry)
        self.assertTrue(lentry == 1, emsg.format(lentry))
sqlengine.py
These are the artefacts being tested by the above unit tests.
from twisted.logger import Logger

def values_as_tuple(d, *param):
    """Return the values for each key in `param` as a tuple"""
    return tuple(map(d.get, param))

class StateManager:
    """Manages the SQLite database's state, ensuring that it reflects the state
    of the filesystem.
    """

    log = Logger()

    def __init__(self, db):
        self._db = db

    def create(self, inode, directory=False):
        params = values_as_tuple(
            inode, "node_path", "bytesize", "md5", "mtime", "stat_result"
        )
        directive = (
            "INSERT INTO ajxp_index (node_path,bytesize,md5,mtime,stat_result) "
            "VALUES (?,?,?,?,?);"
        )
        return self._db.runOperation(directive, params)
Case 2: bug disappears outside of twisted.trial
#! /usr/bin/env python
import os.path as osp
from tempfile import mkdtemp

from twisted.enterprise import adbapi
from twisted.internet.task import react
from twisted.internet.defer import inlineCallbacks

INIT_FILE = "example.sql"

def values_as_tuple(d, *param):
    """Return the values for each key in `param` as a tuple"""
    return tuple(map(d.get, param))

def create(db, inode):
    params = values_as_tuple(
        inode, "node_path", "bytesize", "md5", "mtime", "stat_result"
    )
    directive = (
        "INSERT INTO ajxp_index (node_path,bytesize,md5,mtime,stat_result) "
        "VALUES (?,?,?,?,?);"
    )
    return db.runOperation(directive, params)

def init_database(db):
    with open(INIT_FILE) as f:
        script = f.read()
    return db.runInteraction(lambda c, s: c.executescript(s), script)

@react
@inlineCallbacks
def main(reactor):
    meta = mkdtemp()
    db = adbapi.ConnectionPool(
        "sqlite3", osp.join(meta, "db.sqlite"), check_same_thread=False,
    )
    yield init_database(db)

    # Let's make sure the tables were created as expected and that we're
    # starting from a blank slate
    res = yield db.runQuery("SELECT * FROM ajxp_index LIMIT 1")
    assert not res, "database is not empty [ajxp_index]"
    res = yield db.runQuery("SELECT * FROM ajxp_changes LIMIT 1")
    assert not res, "database is not empty [ajxp_changes]"

    # The details of this are not important. Suffice to say they (should)
    # conform to the DB schema for ajxp_index.
    test_data = {
        "node_path": "/this/is/some/arbitrary/path.ext",
        "bytesize": 0,
        "mtime": 179273.0,
        "stat_result": b"this simulates a blob of raw binary data",
        "md5": "d41d8cd98f00b204e9800998ecf8427e",  # arbitrary
    }

    # store the test data in the ajxp_index table
    yield create(db, test_data)

    # test if the entry exists in the db
    entry = yield db.runQuery("SELECT * FROM ajxp_index")
    assert len(entry) == 1, "got {0} results, expected 1".format(len(entry))

    print("OK")
Closing remarks
Again, upon checking with sqlitebrowser, it seems as though the data is being written to db.sqlite, so this looks like a retrieval problem. From here, I'm sort of stumped... any ideas?
EDIT
This code will produce an inode that can be used for testing.
import os.path as osp
from os import stat
from pickle import dumps

def mk_dummy_inode(path, isdir=False):
    return {
        "node_path": path,
        "bytesize": osp.getsize(path),
        "mtime": osp.getmtime(path),
        "stat_result": dumps(stat(path), protocol=4),
        "md5": "directory" if isdir else "d41d8cd98f00b204e9800998ecf8427e",
    }
Okay, it turns out that this is a bit of a tricky one. When the tests are run in isolation (as posted in this question), the bug only rarely occurs; however, when they are run as part of the entire test suite, it fails almost 100% of the time.
I added yield task.deferLater(reactor, .00001, lambda: None) after writing to the db and before reading from the db, and this solves the issue.
From there, I suspected this might be a race condition stemming from the connection pool and sqlite's limited tolerance for concurrency. I tried setting the cp_min and cp_max parameters of ConnectionPool to 1, and this also solved the issue.
In short: it seems as though sqlite doesn't play very nicely with multiple connections, and that the appropriate fix is to avoid concurrency to the extent possible.
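For reference, here is roughly what those two workarounds look like against the test code above (a sketch only; the tiny delay value is the arbitrary one from my experiment):

from twisted.internet import defer, reactor, task
from twisted.enterprise import adbapi

# Workaround 1: limit the pool to a single connection so sqlite
# never sees concurrent access
db = adbapi.ConnectionPool(
    "sqlite3", "db.sqlite", check_same_thread=False,
    cp_min=1, cp_max=1,
)

# Workaround 2: yield a tiny deferLater between writing and reading
@defer.inlineCallbacks
def write_then_read(stateman, db, inode):
    yield stateman.create(inode, directory=False)
    yield task.deferLater(reactor, .00001, lambda: None)  # let the write settle
    entry = yield db.runQuery("SELECT * FROM ajxp_index")
    defer.returnValue(entry)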
If you take a look at your setUp function, the self.db.runInteraction(...) call returns a deferred. As you've noted, you assume that setUp waits for that deferred to finish; however, this is not the case, and it's a trap that most fall victim to (myself included). I'll be honest with you: for situations like this, especially for unit tests, I just execute the synchronous code outside the TestCase class to initialize the database. For example:
def init_db():
    import sqlite3
    conn = sqlite3.connect('db.sqlite')
    c = conn.cursor()
    with open("init.sql") as f:
        c.executescript(f.read())

init_db()  # call outside test case

class TestStateManagement(TestCase):
    """
    My test cases
    """
Alternatively, you could decorate the setup and yield runOperation(...) but something tells me that it wouldn't work... In any case, it's surprising that no errors were raised.
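For completeness, a minimal sketch of what that decorated setUp might look like (untested, reusing the imports and sqlengine module from the question's test_sqlite.py; it relies on trial waiting on the deferred that an inlineCallbacks-decorated setUp returns):

from twisted.internet import defer
from twisted.trial.unittest import TestCase
from twisted.enterprise import adbapi

class TestStateManagement(TestCase):
    @defer.inlineCallbacks
    def setUp(self):
        self.meta = mkdtemp()
        self.db = adbapi.ConnectionPool(
            "sqlite3", osp.join(self.meta, "db.sqlite"), check_same_thread=False,
        )
        self.stateman = sqlengine.StateManager(self.db)
        with open("init.sql") as f:
            script = f.read()
        # yield here so the schema exists before any test method runs
        yield self.db.runInteraction(lambda c, s: c.executescript(s), script)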
PS
I've been eyeballing this question for a while and it's been in the back of my head for days now. A potential reason for this finally dawned on me at nearly 1am. However, I'm too tired/lazy to actually test this out :D but it's a pretty damn good hunch. I'd like to commend you on your level of detail in this question.
I am struggling to create an iterator from a SQLAlchemy query.
Here is what I have tried so far.
First, create a table:
from sqlalchemy import create_engine, Column, MetaData, Table, Integer, String

engine = create_engine('sqlite:///test90.db')
conn = engine.connect()
metadata = MetaData()

myTable = Table('myTable', metadata,
                Column('Doc_id', Integer, primary_key=True),
                Column('Doc_Text', String))
metadata.create_all(engine)

conn.execute(myTable.insert(), [{'Doc_id': 1, 'Doc_Text': 'first sentence'},
                                {'Doc_id': 2, 'Doc_Text': 'second sentence'},
                                {'Doc_id': 3, 'Doc_Text': 'third sentence'},
                                {'Doc_id': 4, 'Doc_Text': 'fourth sentence'}
                                ])
I have read everything I could find on iterators but still do not get it.
Here is the class I created to get an iterator, but it does not work
(it overflows, although I specify a break):
from sqlalchemy import create_engine

class RecordsIterator:
    def __init__(self, xDB, xSQL):
        self.engine = create_engine(xDB)
        self.conn = self.engine.connect()
        self.xResultCollection = self.conn.execute(xSQL)

    def __iter__(self):
        return self

    def next(self):
        while self.xResultCollection.closed is False:
            xText = (self.xResultCollection.fetchone())[1]
            xText = xText.encode('utf-8')
            yield xText.split()
            if not self.xResultCollection:
                break

x1 = RecordsIterator(xDB='sqlite:///test91.db', xSQL='select * from myTable')
In case you are wondering why I am not just using a generator: I need to feed the iterator into gensim.Word2Vec and, unfortunately, it does not accept a generator.
import gensim
gensim.models.Word2Vec(x1)
Thanks in advance
Your check if not self.xResultCollection will always be False, as the truth value of the result object will always be True.
In your next method you have a while loop and a yield, which shouldn't really be needed: next should just return one element, so there is no need for a loop (or a generator) there.
As self.xResultCollection is itself an iterable, you could just do:
from sqlalchemy import create_engine

class RecordsIterator:
    def __init__(self, xDB, xSQL):
        self.engine = create_engine(xDB)
        self.conn = self.engine.connect()
        self.resultIterator = iter(self.conn.execute(xSQL))

    def __iter__(self):
        return self

    def next(self):
        # return the second column of the next row, split into tokens
        return next(self.resultIterator)[1].encode('utf-8').split()
For those interested in using this with gensim:
It turns out that the problem was that gensim wants an iterable that it can iterate over more than once (iterating over the results of a query cursor consumes them).
See the discussions here.
This is what seems to work for me:
import gensim
from sqlalchemy import create_engine

xDB = 'sqlite:///test91.db'
xSQL = 'select * from myTable'
engine = create_engine(xDB)
conn = engine.connect()
xResultIterator = conn.execute(xSQL)

class MyIterator(object):
    def __init__(self, xResults, xNrCol):
        self.xResults = xResults
        self.xNrCol = xNrCol

    def __iter__(self):
        for xRecord in self.xResults:
            xText = (xRecord[self.xNrCol]).lower().encode('utf8')
            xToken = xText.split()
            if not xToken:
                continue
            yield xToken
        self.xResults = conn.execute(xSQL)  ### THIS SEEMS TO FIX IT

# to use
q1 = MyIterator(xResultIterator, xNrCol=1)
model = gensim.models.Word2Vec(sentences=q1, min_count=1)
And here is the vocabulary:
model.vocab.keys()
I ran this on a PostgreSQL database with 1 million entries (titles of scientific papers) in about 90 seconds without problems.
I hope this will help someone else.
I am new to Python. I am trying to write a little random time generator which generates random times between a given start variable and a given end variable, for 1000 records, and I then have to save those 1000 records into a database.
So far I have the following code.
SQL.py
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///sql.sqlite')
Base = declarative_base()
Session = sessionmaker(bind=engine)
session = Session()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    time = Column(Integer, default=None, index=True)

Base.metadata.create_all(engine)
Random.py
import datetime
import time
import random

MINTIME = datetime.datetime(2010,8,6,8,14,59)
MAXTIME = datetime.datetime(2013,8,6,8,14,59)
RECORDS = 1000

for RECORD in range(RECORDS):
    RANDOMTIME = random.randint(MINTIME, MAXTIME)
    print RANDOMTIME
It produces this traceback:
TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'int'
What am I doing wrong? If possible, please suggest a refactored approach.
Basically, the issue is: random.randint expects integers, so you should give it integers and convert the result back to a datetime as you require.
There could be a more efficient way, but here is one approach:
import datetime
import time
import random

MINTIME = datetime.datetime(2010,8,6,8,14,59)
MAXTIME = datetime.datetime(2013,8,6,8,14,59)
RECORDS = 1000

# convert the datetime bounds to integer timestamps
mintime_ts = int(time.mktime(MINTIME.timetuple()))
maxtime_ts = int(time.mktime(MAXTIME.timetuple()))

for RECORD in range(RECORDS):
    # pick a random timestamp in the range and convert it back to a datetime
    random_ts = random.randint(mintime_ts, maxtime_ts)
    RANDOMTIME = datetime.datetime.fromtimestamp(random_ts)
    print RANDOMTIME
The problem is:
RANDOMTIME = random.randint(MINTIME, MAXTIME)
randint expects two integers, but you are providing two dates.
You can do the following, considering that the time of day is the same for MINTIME and MAXTIME:
for RECORD in range(RECORDS):
    n = random.randint(0, (MAXTIME - MINTIME).days)
    RANDOMTIME = MINTIME + datetime.timedelta(days=n)
Use the start and end dates to determine the number of seconds between them, generate a random number between 0 and that, and add it to the start date, e.g.:
from datetime import datetime, timedelta
from random import randint

MINTIME = datetime(2010,8,6,8,14,59)
MAXTIME = datetime(2013,8,6,8,14,59)
PERIOD = int((MAXTIME - MINTIME).total_seconds())

for n in range(1000):
    dt = MINTIME + timedelta(seconds=randint(0, PERIOD))
    # dt is e.g. 2012-12-10 18:34:23
    # etc...
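If you also need to save the generated values with the User model from SQL.py above, a sketch might look like this (assuming SQL.py is importable as a module named SQL; note that the time column is an Integer, so this stores integer timestamps rather than datetime objects):

import datetime
import random
import time
from SQL import session, User  # hypothetical import of the SQL.py module above

MINTIME = datetime.datetime(2010,8,6,8,14,59)
MAXTIME = datetime.datetime(2013,8,6,8,14,59)
mintime_ts = int(time.mktime(MINTIME.timetuple()))
maxtime_ts = int(time.mktime(MAXTIME.timetuple()))

for _ in range(1000):
    # store each random time as an integer timestamp to match the Integer column
    session.add(User(time=random.randint(mintime_ts, maxtime_ts)))
session.commit()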
My program is sucking up a megabyte of memory every few seconds. I read that Python doesn't see cursors during garbage collection, so I have a feeling that I might be doing something wrong with my use of pyodbc and SQLAlchemy and maybe not closing something somewhere?
import atexit
import pyodbc
from sqlalchemy import create_engine, MetaData, Table

# Set up SQL Connection
def connect():
    conn_string = 'DRIVER={FreeTDS};Server=...;Database=...;UID=...;PWD=...'
    return pyodbc.connect(conn_string)

metadata = MetaData()
e = create_engine('mssql://', creator=connect)
c = e.connect()
metadata.bind = c
log_table = Table('Log', metadata, autoload=True)

...

atexit.register(cleanup)

# Core Loop
line_c = 0
inserts = []
insert_size = 2000

while True:
    #line = sys.stdin.readline()
    line = reader.readline()
    line_c += 1
    m = line_regex.match(line)
    if m:
        fields = m.groupdict()
        ...
        inserts.append(fields)
        if line_c >= insert_size:
            c.execute(log_table.insert(), inserts)
            line_c = 0
            inserts = []
Should I maybe move the metadata block (or part of it) into the insert block and close the connection after each insert?
Edit:
Q: Does it ever stabilize?
A: Only if you count Linux blowing away the process :-) (The graph excludes Buffers/Cache from Memory Usage.)
I would not necessarily blame SQLAlchemy. It could also be a problem with the underlying driver. In general, memory leaks are hard to detect. In any case, you should ask on the SQLAlchemy mailing list, where the core developer Michael Bayer responds to almost every question... you will probably have a better chance of getting real help there.