SQLAlchemy causing memory leaks - python

App startup memory
Partition of a set of 249162 objects. Total size = 28889880 bytes.
Index Count % Size % Cumulative % Referrers by Kind (class / dict of class)
0 77463 31 5917583 20 5917583 20 types.CodeType
1 30042 12 3774404 13 9691987 34 function
2 51799 21 3070789 11 12762776 44 tuple
3 15106 6 2061017 7 14823793 51 dict of type
4 5040 2 1928939 7 16752732 58 function, tuple
5 6627 3 1459448 5 18212180 63 type
6 5227 2 1346136 5 19558316 68 dict of module
7 16466 7 1026538 4 20584854 71 dict (no owner)
8 734 0 685897 2 21270751 74 dict of module, tuple
9 420 0 626760 2 21897511 76 function, module
App memory after 100 subsequent calls (interacting with SQLAlchemy)
Partition of a set of 628910 objects. Total size = 107982928 bytes.
Index Count % Size % Cumulative % Referrers by Kind (class / dict of class)
0 23373 4 27673632 26 27673632 26 sqlalchemy.sql.schema.Column
1 141175 22 20904408 19 48578040 45 dict of sqlalchemy.sql.schema.Column
2 78401 12 5984371 6 54562411 51 types.CodeType
3 34133 5 4239726 4 58802137 54 function
4 64371 10 3661978 3 62464115 58 tuple
5 20034 3 2971710 3 65435825 61 dict of sqlalchemy.sql.schema.Table
6 13356 2 2297232 2 67733057 63 sqlalchemy.sql.base.ColumnCollection
7 15924 3 2133374 2 69866431 65 dict of type
8 5095 1 1946855 2 71813286 67 function, tuple
9 8714 1 1793696 2 73606982 68 type
Helper function that reports memory usage (via guppy/heapy):
def heap_results():
    from guppy import hpy
    hp = hpy()
    h = hp.heap()
    return Response(response=str(h.bytype),
                    status=200,
                    mimetype='application/json')
Our SQLAlchemy setup is fairly straightforward: using db.Model, we create an ORM class and define each table inside a method of that class.
We call gc.collect() just before returning the final response to the user, and we also call db.session.flush(), db.session.expunge_all(), and db.session.close().
We have tried removing the db.session.* calls as well as the gc.collect(); nothing changes.
A time-series graph of our app's memory usage shows the memory only resetting back to a stable level when the application is restarted.
Code that reconnects to the database (HAProxy is a helper class that simulates HAProxy by determining which host is currently the master node):
def reconnect():
    hostnames = [Settings.SECRETS.get('PATRONI_HOST_C', ''),
                 Settings.SECRETS.get('PATRONI_HOST_E', '')]
    try:
        master_node = HAProxy(hostnames=hostnames)
    except (ValueError, TypeError, BaseException) as e:
        # send an alert here though, use the informant!
        raise e
    else:
        if master_node in ['None', None]:
            raise ValueError("Failed to determine which server is acting as the master node")
        my_app.config['SQLALCHEMY_DATABASE_URI'] = "postgresql://{}:{}@{}/{}".format(
            Settings.SECRETS['PATRONI_USER'],
            Settings.SECRETS['PATRONI_PASSWORD'],
            master_node,
            Settings.SECRETS['PATRONI_DB'])
        my_app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {
            'pool_recycle': 1800
        }
        new_db = SQLAlchemy(my_app)
        new_db_orm = DBORM(new_db)
        return new_db_orm
What the DBORM (modified to hide full functionality) looks like:
class DBORM(object):
    def __init__(self, database):
        self.database = database
        self.owner_model = self.create_owner_model()

    def create_owner_model(self):
        db = self.database

        class OwnerModel(db.Model):
            __tablename__ = "owners"
            owner_id = db.Column(UUID(as_uuid=True), unique=True,
                                 nullable=False, primary_key=True)
            client_owner = db.Column(db.String(255), unique=False, nullable=False)
            admin_owner = db.Column(db.String(255), unique=False, nullable=False)

            @staticmethod
            def owner_validation(owner_model, owner=None):
                if owner is not None:
                    owner_model = OwnerModel.get_owner_by_id(owner_id=owner_id,
                                                             return_as_model=True)
                    if owner_model is not None:
                        client_owner = owner_model.client_owner
                        admin_owner = owner_model.admin_owner
                        if client_owner is None and admin_owner is None:
                            return False
                        elif client_owner.lower() == owner.lower():
                            return True
                        elif admin_owner.lower() == owner.lower():
                            return True
                        else:
                            return False
                    else:
                        return None
                else:
                    return None

        return OwnerModel
Example of using OwnerModel from the API:
@api.route('/owners/{owner_id}')
def my_function(owner_id):
    try:
        dborm = reconnect()
    except (AttributeError, KeyError, ValueError, BaseException) as e:
        logger.error(f'Unable to get an owner model.')
        logger.info(f'Garbage collector, collected: {gc.collect()}')
        return Response(response=Exception.database_reconnect_failure(),
                        status=503,
                        mimetype='application/json')
    else:
        response = dborm.get_owner_by_id(owner_id=owner_id)
        logger.info(f'Garbage collector, collected: {gc.collect()}')
        return Response(response=json.dumps(response),
                        status=200,
                        mimetype='application/json')

SQLAlchemy MetaData holds references to Table objects, and the declarative base class also has an internal registry used for lookups, for example as context for lazily evaluated relationship() arguments. When you repeatedly create new versions of model classes, which also creates the required metadata such as Table objects, you likely consume more and more memory if the references are kept around. The fact that Column objects dominate your memory usage supports this, in my view.
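To see the mechanism in isolation, here is a small sketch (plain SQLAlchemy 1.4+ imports, not your Flask setup; the kept list merely stands in for whatever holds on to each reconnect() result) showing how every repetition leaves another full Table/Column graph alive:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

kept = []  # stands in for whatever keeps each "reconnect" result reachable

for _ in range(100):
    Base = declarative_base()  # fresh registry and MetaData each time

    class Owner(Base):
        __tablename__ = "owners"
        owner_id = Column(Integer, primary_key=True)
        client_owner = Column(String(255))

    kept.append(Base)  # 100 independent Table/Column graphs are now in memory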
You should aim to create your models and their metadata just once during your application's life cycle; you only need to be able to change the connection parameters dynamically. SQLAlchemy versions up to 1.3 provide the creator argument of the Engine for exactly this, and version 1.4 introduced the DialectEvents.do_connect() event hook for even finer control.
Using creator:
import psycopg2

db = SQLAlchemy()
dborm = DBORM(db)

def initialize(app):
    """
    Setup `db` configuration and initialize the application. Call this once and
    once only, before your application starts using the database.

    The `creator` creates a new connection every time the connection pool
    requires one, due to all connections being in use, or old ones having been
    recycled, etc.
    """
    # Placeholder that lets the Engine know which dialect it will be speaking
    app.config['SQLALCHEMY_DATABASE_URI'] = "postgresql+psycopg2://"
    app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {
        'pool_recycle': 1800,
        'creator': lambda: psycopg2.connect(
            dbname=Settings.SECRETS['PATRONI_DB'],
            user=Settings.SECRETS['PATRONI_USER'],
            password=Settings.SECRETS['PATRONI_PASSWORD'],
            host=HAProxy([
                Settings.SECRETS.get('PATRONI_HOST_C', ''),
                Settings.SECRETS.get('PATRONI_HOST_E', ''),
            ]))
    }
    db.init_app(app)

class OwnerModel(db.Model):
    __tablename__ = "owners"
    ...
Note that you need to change DBORM to use the global db and model classes, and that your controllers no longer call reconnect(), which does not exist anymore, but just use db, dborm, and the classes directly as well.
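For SQLAlchemy 1.4+, the DialectEvents.do_connect() hook mentioned above gives the same kind of control. Below is a minimal sketch, assuming the Settings and HAProxy helpers from the question are available and that db.engine is reachable once the app is initialized (depending on your Flask-SQLAlchemy version this may require an application context); it is a sketch, not a drop-in replacement:
from sqlalchemy import event

db = SQLAlchemy()

def initialize(app):
    # Dialect-only URL; the real connection parameters are injected per connect.
    app.config['SQLALCHEMY_DATABASE_URI'] = "postgresql+psycopg2://"
    db.init_app(app)

    with app.app_context():
        engine = db.engine

    @event.listens_for(engine, "do_connect")
    def receive_do_connect(dialect, conn_rec, cargs, cparams):
        # Re-resolve the master node each time the pool opens a brand new
        # DBAPI connection, then let the dialect connect as usual.
        cparams["host"] = HAProxy([
            Settings.SECRETS.get('PATRONI_HOST_C', ''),
            Settings.SECRETS.get('PATRONI_HOST_E', ''),
        ])
        cparams["dbname"] = Settings.SECRETS['PATRONI_DB']
        cparams["user"] = Settings.SECRETS['PATRONI_USER']
        cparams["password"] = Settings.SECRETS['PATRONI_PASSWORD']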

Related

SQLAlchemy is slow when running a query for the first time

I'm using SQLAlchemy (2.0.3) with Python 3.10. After a fresh container boot it takes ~2.2 s to execute a specific query, while all subsequent calls of the same query take ~70 ms. I'm using PostgreSQL, and the raw query takes 40-70 ms to execute in DataGrip.
Here is the code:
self._Session = async_sessionmaker(self._engine, expire_on_commit=False)
...
@property
def session(self):
    return self._Session
...
async with PostgreSQL().session.begin() as session:
    total_functions = aliased(db_models.Function)
    finished_functions = aliased(db_models.Function)
    failed_functions = aliased(db_models.Function)

    stmt = (
        select(
            db_models.Job,
            func.count(distinct(total_functions.id)).label("total"),
            func.count(distinct(finished_functions.id)).label("finished"),
            func.count(distinct(failed_functions.id)).label("failed")
        )
        .where(db_models.Job.project_id == project_id)
        .outerjoin(db_models.Job.packages)
        .outerjoin(db_models.Package.modules)
        .outerjoin(db_models.Module.functions.of_type(total_functions))
        .outerjoin(finished_functions, and_(
            finished_functions.module_id == db_models.Module.id,
            finished_functions.progress == db_models.FunctionProgress.FINISHED
        ))
        .outerjoin(failed_functions, and_(
            failed_functions.module_id == db_models.Module.id,
            or_(
                failed_functions.state == db_models.FunctionState.FAILED,
                failed_functions.state == db_models.FunctionState.TERMINATED,
            )
        ))
        .group_by(db_models.Job.id)
    )

    start = time.time()
    yappi.set_clock_type("WALL")
    with yappi.run():
        job_infos = await session.execute(stmt)
    yappi.get_func_stats().print_all()
    end = time.time()
Things I have tried and discovered:
The problem is not related to connecting to or querying the database; on service boot I establish the connection and make some other queries.
The problem is most likely not related to the cache. I have disabled the cache with query_cache_size=0; however, I'm not 100% sure that it worked, since the documentation says:
ORM functions related to unit-of-work persistence as well as some attribute loading strategies will make use of individual per-mapper caches outside of the main cache.
Profiler didn't show anything that caught my attention:
..urrency_py3k.py:130 greenlet_spawn 2/1 0.000000 2.324807 1.162403
..rm/session.py:2168 Session.execute 1 0.000028 2.324757 2.324757
..0 _UnixSelectorEventLoop._run_once 11 0.000171 2.318555 0.210778
..syncpg_cursor._prepare_and_execute 1 0.000054 2.318187 2.318187
..cAdapt_asyncpg_connection._prepare 1 0.000020 2.316333 2.316333
..nnection.py:533 Connection.prepare 1 0.000003 2.316154 2.316154
..nection.py:573 Connection._prepare 1 0.000017 2.316151 2.316151
..n.py:359 Connection._get_statement 2/1 0.001033 2.316122 1.158061
..ectors.py:452 EpollSelector.select 11 0.000094 2.315352 0.210487
..y:457 Connection._introspect_types 1 0.000025 2.314904 2.314904
..ction.py:1669 Connection.__execute 1 0.000027 2.314879 2.314879
..ion.py:1699 Connection._do_execute 1 2.314095 2.314849 2.314849
...py:2011 Session._execute_internal 1 0.000034 0.006174 0.006174
I have also seen that one may disable cache per connection:
with engine.connect().execution_options(compiled_cache=None) as conn:
    conn.execute(table.select())
However, I'm working with the ORM layer and I'm not sure how to apply this in my case.
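For reference, SQLAlchemy 1.4+/2.0 also accepts execution options per ORM statement, so something along the lines of the following sketch (same stmt and async session as above, untested here) should disable the compiled cache for a single execute:
# Sketch: disable the compiled cache for this one ORM statement.
job_infos = await session.execute(
    stmt,
    execution_options={"compiled_cache": None},
)

# Or for everything running on this session's connection:
# conn = await session.connection(execution_options={"compiled_cache": None})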
Any ideas where this delay might come from?

Troubleshooting uncooperative For Loop of SQLalchemy results

Looking for a second set of eyes here. I cannot figure out why the following loop will not continue past the first iteration.
The 'servicestocheck' sqlalchemy query returns 45 rows in my test, but I cannot iterate through the results like I'm expecting... and no errors are returned. All of the functionality works on the first iteration.
Anyone have any ideas?
def serviceAssociation(current_contact_id, perm_contact_id):
    servicestocheck = oracleDB.query(PORTAL_CONTACT).filter(
        PORTAL_CONTACT.contact_id == current_contact_id
        ).order_by(PORTAL_CONTACT.serviceID).count()
    print(servicestocheck)  # returns 45 items

    servicestocheck = oracleDB.query(PORTAL_CONTACT).filter(
        PORTAL_CONTACT.contact_id == current_contact_id
        ).order_by(PORTAL_CONTACT.serviceID).all()

    for svc in servicestocheck:
        #
        # Check to see if already exists
        #
        check_existing_association = mysqlDB.query(
            CONTACTTOSERVICE).filter(CONTACTTOSERVICE.contact_id ==
            perm_contact_id, CONTACTTOSERVICE.serviceID ==
            svc.serviceID).first()
        #
        # If no existing association
        #
        if check_existing_association is None:
            print("Prepare Association")
            assoc_contact_id = perm_contact_id
            assoc_serviceID = svc.serviceID
            assoc_role_billing = False
            assoc_role_technical = False
            assoc_role_commercial = False
            if svc.contact_type == 'Billing':
                assoc_role_billing = True
            if svc.contact_type == 'Technical':
                assoc_role_technical = True
            if svc.contact_type == 'Commercial':
                assoc_role_commercial = True
            try:
                newAssociation = CONTACTTOSERVICE(
                    assoc_contact_id, assoc_serviceID,
                    assoc_role_billing, assoc_role_technical,
                    assoc_role_commercial)
                mysqlDB.add(newAssociation)
                mysqlDB.commit()
                mysqlDB.flush()
            except Exception as e:
                print(e)
This function is called from a script, and it is called from within another loop. I can't find any issues with nested loops.
Ended up being an issue with SQLAlchemy ORM (see SqlAlchemy not returning all rows when querying table object, but returns all rows when I query table object column)
I think the issue is that one of the tables above does not have a primary key in real life, and adding a fake one did not help. (I don't have access to the DB to add a key.)
Rather than fight it further... I went ahead and wrote raw SQL to move my project along.
This did the trick:
query = 'SELECT * FROM PORTAL_CONTACT WHERE contact_id = ' + str(current_contact_id) + ' ORDER BY contact_id ASC'
servicestocheck = oracleDB.execute(query)
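If you stay with raw SQL, a bound-parameter version (a sketch, reusing oracleDB from the question) avoids the string concatenation and the easy-to-miss space before ORDER BY:
from sqlalchemy import text

# Same query, but with a bound parameter instead of string concatenation.
query = text(
    "SELECT * FROM PORTAL_CONTACT "
    "WHERE contact_id = :cid "
    "ORDER BY contact_id ASC"
)
servicestocheck = oracleDB.execute(query, {"cid": current_contact_id})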

python cassandra: get a big result of select * as a generator (without storing the result in RAM)

I want to get all the data in the Cassandra table "user". I have 840000 users and I don't want to load all of them into a Python list; I want to get the users in batches of 100.
The Cassandra docs (https://datastax.github.io/python-driver/query_paging.html) say that I can use fetch_size, but in my Python code I have a Database object that contains all the CQL instructions:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

class Database:
    def __init__(self):
        self.cluster = Cluster(['192.168.1.1', '192.168.1.2'])
        self.session = self.cluster.connect()

    def get_users(self):
        users_list = []
        query = "SELECT * FROM users"
        statement = SimpleStatement(query, fetch_size=10)
        for user_row in self.session.execute(statement):
            users_list.append(user_row.name)
        return users_list
Currently get_users returns one very big list of user names, but I want to turn get_users into a "generator": instead of getting all user names in one list from a single call, I want to call get_users repeatedly and have each call return a list of at most 100 users. For example:
list1 = database.get_users()
list2 = database.get_users()
...
listn = database.get_users()
list1 contains the first 100 users of the query, list2 the second 100 users, and listn the last elements of the query (<=100).
Is this possible? Thanks in advance for your answers.
According to Paging Large Queries:
Whenever there are no more rows in the current page, the next page
will be fetched transparently.
So, if you execute your code like this, you will still get the whole result set, but it is fetched page by page in a transparent manner.
In order to achieve what you need, you have to use callbacks. You can also find a code sample at the link above.
I added below the full code for reference.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from threading import Event

class PagedResultHandler(object):
    def __init__(self, future):
        self.error = None
        self.finished_event = Event()
        self.future = future
        self.future.add_callbacks(
            callback=self.handle_page,
            errback=self.handle_error)

    def handle_page(self, rows):
        for row in rows:
            process_row(row)
        if self.future.has_more_pages:
            self.future.start_fetching_next_page()
        else:
            self.finished_event.set()

    def handle_error(self, exc):
        self.error = exc
        self.finished_event.set()

def process_row(user_row):
    print(user_row.name, user_row.age, user_row.email)

cluster = Cluster()
session = cluster.connect()
query = "SELECT * FROM myschema.users"
statement = SimpleStatement(query, fetch_size=5)
future = session.execute_async(statement)
handler = PagedResultHandler(future)
handler.finished_event.wait()
if handler.error:
    raise handler.error
cluster.shutdown()
Moving to the next page happens in handle_page when start_fetching_next_page is called.
If you replace that if statement with just self.finished_event.set(), you will see that the iteration stops after the first 5 rows, as defined by fetch_size.
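Building on the transparent paging quoted at the top of this answer, a generator-style method on the question's Database class (a sketch, untested) can also yield the users in batches of 100 without ever building the full 840000-element list:
def iter_users(self, batch_size=100):
    """Yield lists of at most batch_size user names; the driver fetches
    further pages transparently as the result set is iterated."""
    statement = SimpleStatement("SELECT * FROM users", fetch_size=batch_size)
    batch = []
    for user_row in self.session.execute(statement):
        batch.append(user_row.name)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage (hypothetical):
# for names in database.iter_users(batch_size=100):
#     handle(names)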

Pyqt 4 - QWebView.load(url) leaks memory (not from python)

Basically, I pull a series of links from my database, and want to scrape them for specific links I'm looking for. I then re-feed those links into my link queue that my multiple QWebViews reference, and they continue to pull those down for processing/storage.
My issue is that as this runs for... say 200 or 500 links, it starts to use up more and more RAM.
I have exhaustively looked into this, using heapy, memory_profiler, and objgraph, to figure out what's causing the memory leak... The Python heap's objects stay about the same in both count and size over time. This made me think the C++ objects weren't getting removed. Sure enough, using memory_profiler, the RAM only goes up when the self.load(self.url) lines of code are called. I've tried to fix this, but to no avail.
Code:
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView, QWebSettings
from PyQt4.QtGui import QApplication
from lxml.etree import HTMLParser

# My functions
from util import dump_list2queue, parse_doc

class ThreadFlag:
    def __init__(self, threads, jid, db):
        self.threads = threads
        self.job_id = jid
        self.db_direct = db
        self.xml_parser = HTMLParser()

class WebView(QWebView):
    def __init__(self, thread_flag, id_no):
        super(QWebView, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)
        self.settings().globalSettings().setAttribute(QWebSettings.AutoLoadImages, False)

        # This is actually a dict with a few additional details about the url we want to pull
        self.url = None
        # doing one instance of this to avoid memory leaks
        self.qurl = QUrl()
        # id of the webview instance
        self.id = id_no
        # Status of the webview instance: green means it isn't working, yellow means it is.
        self.status = 'GREEN'
        # Reference to a single universal object all the webview instances can see.
        self.thread_flag = thread_flag

    def handleLoadFinished(self):
        try:
            self.processCurrentPage()
        except Exception as e:
            print(e)

        self.status = 'GREEN'
        if not self.fetchNext():
            # We're finished!
            self.loadFinished.disconnect()
            self.stop()
        else:
            # We're not finished! Do next url.
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

    def processCurrentPage(self):
        self.frame = str(self.page().mainFrame().toHtml().toUtf8())

        # This is the case for the initial web pages I want to gather links from.
        if 'name' in self.url:
            # Parse html string for links I'm looking for.
            new_links = parse_doc(self.thread_flag.xml_parser, self.url, self.frame)
            if len(new_links) == 0:
                return 0
            fkid = self.url['pkid']
            new_links = map(lambda x: (fkid, x['title'], x['url'], self.thread_flag.job_id), new_links)

            # Post links to database, db de-dupes and then repull ones that made it.
            self.thread_flag.db_direct.post_links(new_links)
            added_links = self.thread_flag.db_direct.get_links(self.thread_flag.job_id, fkid)

            # Add the pulled links to central queue all the qwebviews pull from
            dump_list2queue(added_links, self._urls)
            del added_links
        else:
            # Process one of the links I pulled from the initial set of data that was originally in the queue.
            print("Processing target link!")

    # Get next url from the universal queue!
    def fetchNext(self):
        if self._urls and self._urls.empty():
            self.status = 'GREEN'
            return False
        else:
            self.status = 'YELLOW'
            self.url = self._urls.get()
            return True

    def start(self, urls):
        # This is where the reference to the universal queue gets made.
        self._urls = urls
        if self.fetchNext():
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

# uq = central url queue shared between webview instances
# ta = array of webview objects
# tf = thread flag (basically just a custom universal object that all the webviews can access).
# This main "program" is started by another script elsewhere.
def main_program(uq, ta, tf):
    app = QApplication([])
    webviews = ta
    threadflag = tf
    tf.app = app
    print("Beginning the multiple async web calls...")

    # Create n "threads" (really just webviews) that each will make asynchronous calls.
    for n in range(0, threadflag.threads):
        webviews.append(WebView(threadflag, n + 1))
        webviews[n].start(uq)

    app.exec_()
Here's what my memory tools say (they all stay roughly constant throughout the whole program):
RAM: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
2491(MB)
Objgraph most common types:
methoddescriptor 9959
function 8342
weakref 6440
tuple 6418
dict 4982
wrapper_descriptor 4380
getset_descriptor 2314
list 1890
method_descriptor 1445
builtin_function_or_method 1298
Heapy:
Partition of a set of 9879 objects. Total size = 1510000 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 2646 27 445216 29 445216 29 str
1 563 6 262088 17 707304 47 dict (no owner)
2 2267 23 199496 13 906800 60 __builtin__.weakref
3 2381 24 179128 12 1085928 72 tuple
4 212 2 107744 7 1193672 79 dict of guppy.etc.Glue.Interface
5 50 1 52400 3 1246072 83 dict of guppy.etc.Glue.Share
6 121 1 40200 3 1286272 85 list
7 116 1 32480 2 1318752 87 dict of guppy.etc.Glue.Owner
8 240 2 30720 2 1349472 89 types.CodeType
9 42 0 24816 2 1374288 91 dict of class
Your program is indeed experiencing growth due to C++ code, but it is not an actual leak in terms of the creation of objects that are no longer referenced. What is happening, at least in part, is that your QWebView holds a QWebPage which holds a QWebHistory(). Each time you call self.load the history is getting a bit longer.
Note that QWebHistory has a clear() function.
Documentation is available: http://pyqt.sourceforge.net/Docs/PyQt4/qwebview.html#history
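A short sketch of how that could be applied in the WebView class above (whether clearing the history is acceptable depends on your crawl):
# Inside WebView, e.g. right after processing a page in handleLoadFinished:
self.page().history().clear()  # QWebHistory.clear()

# Or cap the history up front, e.g. in __init__:
self.page().history().setMaximumItemCount(1)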

High numerical precision floats with MySQL and the SQLAlchemy ORM

I store some numbers in MySQL using the SQLAlchemy ORM. When I fetch them afterwards, they are truncated so that only 6 significant digits are preserved, losing a lot of precision on my floats. I suppose there is an easy way to fix this, but I can't find how. For example, the following code:
import sqlalchemy as sa
from sqlalchemy.pool import QueuePool
import sqlalchemy.ext.declarative as sad

Base = sad.declarative_base()
Session = sa.orm.scoped_session(sa.orm.sessionmaker())

class Test(Base):
    __tablename__ = "test"
    __table_args__ = {'mysql_engine': 'InnoDB'}
    no = sa.Column(sa.Integer, primary_key=True)
    x = sa.Column(sa.Float)

a = 43210.123456789
b = 43210.0
print a, b, a - b

dbEngine = sa.create_engine("mysql://chore:BlockWork33!@localhost", poolclass=QueuePool, pool_size=20,
                            pool_timeout=180)
Session.configure(bind=dbEngine)
session = Session()

dbEngine.execute("CREATE DATABASE IF NOT EXISTS test")
dbEngine.execute("USE test")
Base.metadata.create_all(dbEngine)

try:
    session.add_all([Test(x=a), Test(x=b)])
    session.commit()
except:
    session.rollback()
    raise

[(a,), (b,)] = session.query(Test.x).all()
print a, b, a - b
produces
43210.1234568 43210.0 0.123456788999
43210.1 43210.0 0.0999999999985
and I would need a solution for it to produce
43210.1234568 43210.0 0.123456788999
43210.1234568 43210.0 0.123456788999
Per our discussion in the comments: sa.types.Float(precision=[precision here]) instead of sa.Float allows you to specify the precision; note that sa.Float(Precision=32) has no effect, because the keyword argument must be spelled precision (lowercase). See the documentation for more information.
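Applied to the model in the question, that looks roughly like the sketch below; precision=53 is chosen because MySQL promotes a FLOAT with a precision above 24 to a double-precision column:
class Test(Base):
    __tablename__ = "test"
    __table_args__ = {'mysql_engine': 'InnoDB'}
    no = sa.Column(sa.Integer, primary_key=True)
    # FLOAT(53) becomes a MySQL DOUBLE, so the full ~15-16 significant
    # digits of a Python float survive the round trip.
    x = sa.Column(sa.types.Float(precision=53))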
