I'm using SQLAlchemy (2.0.3) with Python 3.10. After a fresh container boot it takes ~2.2 s to execute a specific query; all subsequent calls of the same query take ~70 ms. I'm using PostgreSQL, and the raw query takes 40-70 ms to execute in DataGrip.
Here is the code:
self._Session = async_sessionmaker(self._engine, expire_on_commit=False)
...
@property
def session(self):
    return self._Session
...
async with PostgreSQL().session.begin() as session:
    total_functions = aliased(db_models.Function)
    finished_functions = aliased(db_models.Function)
    failed_functions = aliased(db_models.Function)

    stmt = (
        select(
            db_models.Job,
            func.count(distinct(total_functions.id)).label("total"),
            func.count(distinct(finished_functions.id)).label("finished"),
            func.count(distinct(failed_functions.id)).label("failed"),
        )
        .where(db_models.Job.project_id == project_id)
        .outerjoin(db_models.Job.packages)
        .outerjoin(db_models.Package.modules)
        .outerjoin(db_models.Module.functions.of_type(total_functions))
        .outerjoin(finished_functions, and_(
            finished_functions.module_id == db_models.Module.id,
            finished_functions.progress == db_models.FunctionProgress.FINISHED,
        ))
        .outerjoin(failed_functions, and_(
            failed_functions.module_id == db_models.Module.id,
            or_(
                failed_functions.state == db_models.FunctionState.FAILED,
                failed_functions.state == db_models.FunctionState.TERMINATED,
            ),
        ))
        .group_by(db_models.Job.id)
    )

    start = time.time()
    yappi.set_clock_type("WALL")
    with yappi.run():
        job_infos = await session.execute(stmt)
    yappi.get_func_stats().print_all()
    end = time.time()
Things I have tried and discovered:
The problem is not related to connecting to or querying the database itself: on service boot I establish the connection and run some other queries.
The problem is most likely not related to the cache. I disabled the cache with query_cache_size=0, but I'm not 100% sure it took effect, since the documentation says:
ORM functions related to unit-of-work persistence as well as some attribute loading strategies will make use of individual per-mapper caches outside of the main cache.
The profiler didn't show anything that caught my attention:
..urrency_py3k.py:130 greenlet_spawn 2/1 0.000000 2.324807 1.162403
..rm/session.py:2168 Session.execute 1 0.000028 2.324757 2.324757
..0 _UnixSelectorEventLoop._run_once 11 0.000171 2.318555 0.210778
..syncpg_cursor._prepare_and_execute 1 0.000054 2.318187 2.318187
..cAdapt_asyncpg_connection._prepare 1 0.000020 2.316333 2.316333
..nnection.py:533 Connection.prepare 1 0.000003 2.316154 2.316154
..nection.py:573 Connection._prepare 1 0.000017 2.316151 2.316151
..n.py:359 Connection._get_statement 2/1 0.001033 2.316122 1.158061
..ectors.py:452 EpollSelector.select 11 0.000094 2.315352 0.210487
..y:457 Connection._introspect_types 1 0.000025 2.314904 2.314904
..ction.py:1669 Connection.__execute 1 0.000027 2.314879 2.314879
..ion.py:1699 Connection._do_execute 1 2.314095 2.314849 2.314849
...py:2011 Session._execute_internal 1 0.000034 0.006174 0.006174
I have also seen that one can disable the cache per connection:

with engine.connect().execution_options(compiled_cache=None) as conn:
    conn.execute(table.select())

However, I'm working with the ORM layer and I'm not sure how to apply this in my case.
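For reference, this is roughly what I imagined trying from the ORM side. It is only a sketch: I'm assuming the execution options passed to session.execute() get merged down to the underlying connection, and I'm not certain compiled_cache actually takes effect on this path.

async with PostgreSQL().session.begin() as session:
    # assumption: per-statement execution options reach the Connection
    job_infos = await session.execute(
        stmt,
        execution_options={"compiled_cache": None},
    )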
Any ideas where this delay might come from?
I am trying to build a learner which calls a function and stores the weights into the DB. The problem is that it takes at least 30 to 60 seconds to learn, so if I want to store the result I have to wait. I decided to call the storing function with a threading Timer, which should call it after the specified time period.
Example of code:
def learn(myConnection):
    '''
    Derive all the names, images where state = 1
    Learn and Store
    Delete all the column where state is 1
    '''
    id = 0
    with myConnection:
        cur = myConnection.cursor()
        cur.execute("Select name, image FROM images WHERE state = 1")
        rows = cur.fetchall()
        for row in rows:
            print "%s, %s" % (row[0], row[1])
            name = 'images/Output%d.jpg' % (id,)
            names = row[0]
            with open(name, "wb") as output_file:
                output_file.write(row[1])
            unknown_image = face_recognition.load_image_file(name)
            unknown_encoding = face_recognition.face_encodings(unknown_image)[0]
            # here i give a timer and call the function
            threading = Timer(60, storeIntoSQL(names, unknown_encoding))
            threading.start()
            id += 1
The thing that did not work: it behaved as if I had not specified the timer at all. It did not wait 60 seconds; it just ran immediately, as if I had called the function without the timer. Any ideas on how I can make this work, or what alternatives I can use? ... PS: I have already tried time.sleep; it just stops the main thread, and I need the project to keep running while this is training.
Example of the function that is being called:
def storeIntoSQL(name, unknown_face_encoding):
    print 'i am printing'
    # connect to the database
    con = lite.connect('users2.db')
    # store the new person into the database
    with con:
        cur = con.cursor()
        # get the new id
        cur.execute("SELECT DISTINCT id FROM Users")
        rows = cur.fetchall()
        newId = len(rows) + 1
        # store into the database
        query = "INSERT INTO Users VALUES (?,?,?)"
        cur.executemany(query, [(newId, name, r) for r in unknown_face_encoding])
I was also told that mutex synchronization could help, where one thread only runs after the other thread has finished its job, but I am not sure how to implement it and am open to any suggestions.
I would suggest using Python's threading library and putting a time.sleep(60) somewhere inside your function, or in a wrapper function. For example:
import time
import threading

def delayed_func(name, unknown_face_encoding):
    time.sleep(60)
    storeIntoSQL(name, unknown_face_encoding)

timer_thread = threading.Thread(target=delayed_func, args=(name, unknown_face_encoding))
timer_thread.start()
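Alternatively, threading.Timer can still be used. In your original snippet the function runs immediately because storeIntoSQL(names, unknown_encoding) is evaluated first and only its return value is handed to Timer. Passing the callable and its arguments separately gives the intended delay:

from threading import Timer

# Pass the function itself plus its arguments; Timer calls it after 60 seconds.
# Calling storeIntoSQL(...) inline evaluates it right away, which is why no
# delay was observed in the original code.
timer = Timer(60, storeIntoSQL, args=(names, unknown_encoding))
timer.start()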
I have this handler to run some processing on all users of our app. Basically, it takes one batch, processes the records of that batch, then queues a new task for the next batch.
class QueueAllUsers(BaseHandler):
    FETCH_SIZE = 10
    FILTERS = [
        UserConfig.level != 0,
        UserConfig.is_configured == True
    ]
    ORDER = [UserConfig.level, UserConfig._key]

    def get(self):
        cursor_key = self.request.get('cursor')
        cursor = None
        if cursor_key:
            # if `cursor` param is provided, use it
            cursor = Cursor(urlsafe=str(cursor_key))
        q = UserConfig.query(*self.FILTERS).order(*self.ORDER)
        total = q.count()  # 31 total records
        logging.info(total)
        users, next_cursor, more = q.fetch_page(self.FETCH_SIZE,
                                                keys_only=True,
                                                start_cursor=cursor)
        self.process_users(users)
        if more:
            self.queue_next_batch(next_cursor)

    def queue_next_batch(self, next_cursor):
        # call get() again, but this time pass the `cursor` param to process the next batch
        logging.info(next_cursor.urlsafe())
        url = '/queue_all_users?cursor=%s' % (next_cursor.urlsafe())
        taskqueue.add(
            url=url,
            method='get',
            queue_name='cronjobs'
        )

    def process_users(self, users):
        logging.info(len(users))
        # trimmed
But when the task for the 2nd batch runs, NDB throws a BadRequestError saying that the cursor is out of range.
I don't understand why it's out of range? I fetched 10 records out of a total of 31, so the cursor should still be valid.
Note that the error is thrown on the 2nd batch (i.e. records 11-20).
So the flow is like this:
1. Call /queue_all_users to process the first batch (no cursor). Everything works OK.
2. Step 1 creates a task for /queue_all_users?cursor=123456 for the next batch.
3. Call /queue_all_users?cursor=123456 (cursor provided). fetch_page throws BadRequestError.
EDIT: I tried setting FETCH_SIZE to 17, and fetching the 2nd batch worked! It seems anything below 17 causes the error, and 17 or above works. So... what the heck?
I had the same problem. When I ran the first query everything went fine and a cursor was returned. The second query, using the cursor, gave me the error:
BadRequestError: cursor position is outside the range of the original query.
I tried your solution, but it didn't work for me. So I changed the filters in my query and it worked. I don't know why, but maybe it can be a solution for you and others.
My old query was:
page_size = 10
query = Sale.query(ancestor=self.key).filter(ndb.AND(
    Sale.current_status.status != SaleStatusEnum.WAITING_PAYMENT,
    Sale.current_status.status != SaleStatusEnum.WAITING_PAYMENT
)).order(Sale.current_status.status, Sale._key)
query.fetch_page(page_size, start_cursor=cursor)
Then I changed all the "!=" filters to IN operations, like this:
page_size = 10
query = Sale.query(ancestor=self.key).filter(
    Sale.current_status.status.IN([
        SaleStatusEnum.PROCESSING,
        SaleStatusEnum.PAID,
        SaleStatusEnum.SHIPPING,
        SaleStatusEnum.FINALIZED,
        SaleStatusEnum.REFUSED])
).order(Sale.current_status.status, Sale._key)
query.fetch_page(page_size, start_cursor=cursor)
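If the inequality filter is what breaks fetch_page for you as well, the same workaround might apply to the UserConfig query from the question. The level values below are placeholders I made up; the point is only to replace the != filter with an IN over the values you actually use:

# Hypothetical: swap the inequality for an explicit IN over the levels
# that actually occur (the values 1, 2, 3 are placeholders).
FILTERS = [
    UserConfig.level.IN([1, 2, 3]),
    UserConfig.is_configured == True,
]
ORDER = [UserConfig.level, UserConfig._key]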
I am wondering what the proper way is to use a MySQL pool with Celery tasks.
At the moment, this is what the relevant portion of my tasks module looks like:
from start import celery
import PySQLPool as pool

dbcfg = config.get_config('inputdb')
input_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')

dbcfg = config.get_config('outputdb')
output_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')

@celery.task
def fetch():
    ic = pool.getNewQuery(input_db)
    oc = pool.getNewQuery(output_db)
    count = 1
    for e in get_new_stuff():
        # do stuff with new stuff
        # read the db with ic
        # write to db using oc
        count += 1
        # commit from time to time
        if count % 1000 == 0:
            pool.commitPool()
    # commit whatever's left
    pool.commitPool()
On one machine there can be at most 4 fetch() tasks running at the same time (1 per core).
I notice, however, that sometimes a task hangs, and I suspect it is due to MySQL.
Any tips on how to use MySQL with Celery?
Thank you!
I am also using celery and PySQLPool.
maria = PySQLPool.getNewConnection(username=app.config["MYSQL_USER"],
                                   password=app.config["MYSQL_PASSWORD"],
                                   host=app.config["MYSQL_HOST"],
                                   db='configuration')

def myfunc(param1, param2):
    query = PySQLPool.getNewQuery(maria, True)
    try:
        sSql = """
            SELECT * FROM table
            WHERE col1 = %s AND col2 = %s
        """
        tDatas = (param1, param2)
        query.Query(sSql, tDatas)
        return query.record
    except Exception, e:
        logger.info(e)
        return False

@celery.task
def fetch():
    myfunc('hello', 'world')
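Not a definitive fix, but one thing worth trying if tasks keep hanging: create the connection inside the task instead of at module import time, so each Celery worker process gets its own connection rather than sharing one that was created before the workers forked. This is only a rough sketch and reuses the names already shown in the question (config, pool, celery):

@celery.task
def fetch():
    # Assumption: building the connection per task avoids reusing a MySQL
    # connection created at import time, before the worker processes forked.
    dbcfg = config.get_config('inputdb')
    input_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'],
                                     host=dbcfg['host'], port=dbcfg['port'],
                                     db=dbcfg['db'], charset='utf8')
    ic = pool.getNewQuery(input_db)
    # ... same for output_db / oc, then the existing loop and commits ...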
My program is sucking up a meg every few seconds. I read that Python doesn't see cursors during garbage collection, so I have a feeling that I might be doing something wrong with my use of pyodbc and SQLAlchemy and maybe not closing something somewhere.
# Set up SQL connection
def connect():
    conn_string = 'DRIVER={FreeTDS};Server=...;Database=...;UID=...;PWD=...'
    return pyodbc.connect(conn_string)

metadata = MetaData()
e = create_engine('mssql://', creator=connect)
c = e.connect()
metadata.bind = c
log_table = Table('Log', metadata, autoload=True)
...
atexit.register(cleanup)

# Core loop
line_c = 0
inserts = []
insert_size = 2000
while True:
    #line = sys.stdin.readline()
    line = reader.readline()
    line_c += 1
    m = line_regex.match(line)
    if m:
        fields = m.groupdict()
        ...
        inserts.append(fields)
        if line_c >= insert_size:
            c.execute(log_table.insert(), inserts)
            line_c = 0
            inserts = []
Should I maybe move the metadata block or part of it to the insert block and close the connection each insert?
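Concretely, what I had in mind is something like this (untested), opening a connection only for the duration of each bulk insert instead of keeping one connection alive for the life of the process:

# Untested idea: use a short-lived connection per batch instead of the
# long-lived `c` above, so any per-connection/cursor state gets released.
if line_c >= insert_size:
    conn = e.connect()
    try:
        conn.execute(log_table.insert(), inserts)
    finally:
        conn.close()
    line_c = 0
    inserts = []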
Edit:
Q: Does it ever stabilize?
A: Only if you count Linux blowing away the process :-) (Graph does exclude Buffers/Cache from Memory Usage)
I would not necessarily blame SQLAlchemy. It could also be a problem with the underlying driver. In general, memory leaks are hard to track down. In any case, you should ask on the SQLAlchemy mailing list, where the core developer Michael Bayer responds to almost every question... probably a better chance of getting real help there.
I'm not sure if anyone else has this problem, but I'm getting a "Too big query offset" exception when using a cursor for chaining tasks on the App Engine development server (not sure if it happens in production).
The error occurs when requesting a cursor after 4000+ records have been processed in a single query.
I wasn't aware that offsets had anything to do with cursors; perhaps it's just a quirk in the App Engine SDK.
To fix it, either shorten the time allowed before the task is deferred (so fewer records get processed at a time), or, when checking the elapsed time, also check that the number of records processed is still within range, e.g. if time.time() > end_time or count == 2000: then reset the count and defer the task. 2000 is an arbitrary number; I'm not sure what the limit should be.
EDIT:
After making the above-mentioned changes, the task never finishes executing. The with_cursor(cursor) code is being called, but it seems to start at the beginning each time. Am I missing something obvious?
The code that causes the exception is as follows:
The table "Transact" has 4800 rows. The error occurs when transacts.cursor() is called when time.time() > end_time is true. 4510 records have been processed at the time when the cursor is requested, which seems to cause the error (on development server, haven't tested elsewhere).
def some_task(trans):
    tts = db.get(trans)
    for t in tts:
        #logging.info('in some_task')
        pass

def test_cursor(request):
    ret = test_cursor_task()
    return HttpResponse('')

def test_cursor_task(cursor=None):
    startDate = datetime.datetime(2010, 7, 30)
    endDate = datetime.datetime(2010, 8, 30)
    end_time = time.time() + 20.0
    transacts = Transact.all().filter('transactionDate >', startDate).filter('transactionDate <=', endDate)
    count = 0
    if cursor:
        transacts.with_cursor(cursor)
    trans = []
    logging.info('queue_trans')
    for tran in transacts:
        count += 1
        #trans.append(str(tran))
        trans.append(str(tran.key()))
        if len(trans) == 20:
            deferred.defer(some_task, trans, _countdown=500)
            trans = []
        if time.time() > end_time:
            logging.info(count)
            if len(trans) > 0:
                deferred.defer(some_task, trans, _countdown=500)
                trans = []
            logging.info('time limit exceeded setting next call to queue')
            cursor = transacts.cursor()
            deferred.defer(test_cursor_task, cursor)
            logging.info('returning false')
            return False
    return True
Hope this helps someone.
Thanks
Bert
Try this again without using the iter functionality:
#...
CHUNK = 500
objs = transacts.fetch(CHUNK)
for tran in objs:
    do_your_stuff
if len(objs) == CHUNK:
    deferred.defer(my_task_again, cursor=str(transacts.cursor()))
This works for me.
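For what it's worth, applied to the test_cursor_task above it could look roughly like this. It's an untested sketch, and the batching of keys into groups of 20 for some_task is left out for brevity:

def test_cursor_task(cursor=None):
    CHUNK = 500  # fetch a bounded batch instead of iterating the whole query
    startDate = datetime.datetime(2010, 7, 30)
    endDate = datetime.datetime(2010, 8, 30)
    transacts = Transact.all().filter('transactionDate >', startDate) \
                              .filter('transactionDate <=', endDate)
    if cursor:
        transacts.with_cursor(cursor)

    objs = transacts.fetch(CHUNK)
    for tran in objs:
        deferred.defer(some_task, [str(tran.key())], _countdown=500)

    if len(objs) == CHUNK:
        # a full chunk came back, so there may be more rows; the cursor now
        # points just past the records actually fetched
        deferred.defer(test_cursor_task, cursor=str(transacts.cursor()))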