PySQLPool and Celery, proper way to use it?

I am wondering what the proper way is to use the MySQL pool with Celery tasks.
At the moment, this is what the relevant portion of my tasks module looks like:
from start import celery
import PySQLPool as pool

dbcfg = config.get_config('inputdb')
input_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')
dbcfg = config.get_config('outputdb')
output_db = pool.getNewConnection(username=dbcfg['user'], password=dbcfg['passwd'], host=dbcfg['host'], port=dbcfg['port'], db=dbcfg['db'], charset='utf8')

@celery.task
def fetch():
    ic = pool.getNewQuery(input_db)
    oc = pool.getNewQuery(output_db)
    count = 1
    for e in get_new_stuff():
        # do stuff with new stuff
        # read the db with ic
        # write to db using oc
        # commit from time to time
        if count % 1000:
            pool.commitPool()
    # commit whatever's left
    pool.commitPool()
On one machine there can be at most 4 fetch() tasks running at the same time (1 per core).
I notice, however, that sometimes a task will hang, and I suspect it is due to MySQL.
Any tips on how to use MySQL and Celery?
Thank you!

I am also using celery and PySQLPool.
import PySQLPool

maria = PySQLPool.getNewConnection(username=app.config["MYSQL_USER"],
                                   password=app.config["MYSQL_PASSWORD"],
                                   host=app.config["MYSQL_HOST"],
                                   db='configuration')

def myfunc(param1, param2):
    query = PySQLPool.getNewQuery(maria, True)
    try:
        sSql = """
            SELECT * FROM table
            WHERE col1 = %s AND col2 = %s
        """
        # one value per %s placeholder
        tDatas = (param1, param2)
        query.Query(sSql, tDatas)
        return query.record
    except Exception as e:
        logger.info(e)
        return False

@celery.task
def fetch():
    myfunc('hello', 'world')
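
One detail in the question's loop is worth checking: `count` is never incremented, and `count % 1000` is truthy for every value that is not a multiple of 1000, so as written the loop commits on every iteration. Whether that is related to the hang is not clear from the question. Below is a minimal sketch of the "commit every N items, then commit the remainder" pattern. `process_in_batches`, `handle` and `commit` are hypothetical names, and in the question's setting `commit` would presumably be `pool.commitPool`.

```python
def process_in_batches(items, handle, commit, batch_size=1000):
    """Call handle(item) for each item, committing every batch_size items."""
    pending = 0
    for item in items:
        handle(item)
        pending += 1
        if pending == batch_size:
            commit()
            pending = 0
    if pending:
        # commit whatever is left after the last full batch
        commit()


if __name__ == "__main__":
    handled = []
    process_in_batches(range(5), handled.append, lambda: None, batch_size=2)
    print(len(handled))
```

Counting pending items rather than testing a running total with `%` keeps the final commit from firing when there is nothing left to commit.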

Related

How to get execution time of postgres cursor (python) [duplicate]

I'm trying to get the performance statistics on queries executed by psycopg2, but the documentation / examples still seem fuzzy and not as clear as it could be.
I've at least got debugging working through the logger.
What would I need to do to access the performance data for the query? I want to get the query execution time.
Is there a method I can access, or something else I need to initialize to output the query execution time?
Here's a pieced together extract of what I have so far:
import psycopg2
import psycopg2.extensions
from psycopg2.extras import LoggingConnection
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
# set higher up in script
db_settings = {
    "user": user,
    "password": password,
    "host": host,
    "database": dbname,
}
query_txt = "[query_txt_from file]"
conn = psycopg2.connect(connection_factory=LoggingConnection, **db_settings)
conn.initialize(logger)
cur = conn.cursor()
cur.execute(query_txt)
and I get
DEBUG:__main__: [the query executed]

It is easy enough to set a timestamp at the start of execution and calculate the duration at the end. You'll need your own simple subclasses of LoggingConnection and LoggingCursor. See my example code.
This is based on the source of MinTimeLoggingConnection, which you can find in psycopg2/extras.py.
import time
import psycopg2
import psycopg2.extensions
from psycopg2.extras import LoggingConnection, LoggingCursor
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# MyLoggingCursor simply sets self.timestamp at start of each query
class MyLoggingCursor(LoggingCursor):
    def execute(self, query, vars=None):
        self.timestamp = time.time()
        return super(MyLoggingCursor, self).execute(query, vars)

    def callproc(self, procname, vars=None):
        self.timestamp = time.time()
        return super(MyLoggingCursor, self).callproc(procname, vars)

# MyLogging Connection:
# a) calls MyLoggingCursor rather than the default
# b) adds resulting execution (+ transport) time via filter()
class MyLoggingConnection(LoggingConnection):
    def filter(self, msg, curs):
        return msg + " %d ms" % int((time.time() - curs.timestamp) * 1000)

    def cursor(self, *args, **kwargs):
        kwargs.setdefault('cursor_factory', MyLoggingCursor)
        return LoggingConnection.cursor(self, *args, **kwargs)

db_settings = {
....
}
query_txt = "[query_text_from file]"
conn = psycopg2.connect(connection_factory=MyLoggingConnection, **db_settings)
conn.initialize(logger)
cur = conn.cursor()
cur.execute(query_txt)
and you'll get:
DEBUG: __main__:[query] 3 ms
Within your filter() you can change the formatting, or choose not to log the query at all if it took less than some threshold.
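
As a rough sketch of that last point, the helper below builds the log message and returns None for queries faster than a threshold. It is plain Python so it can be tried without a database. `timed_message`, `min_ms` and `now` are hypothetical names, and the idea that a filter() returning None suppresses the log line is an assumption based on how MinTimeLoggingConnection appears to work.

```python
import time


def timed_message(msg, started_at, min_ms=0, now=None):
    """Return msg with the elapsed milliseconds appended, or None if too fast."""
    if now is None:
        now = time.time()
    elapsed_ms = int((now - started_at) * 1000)
    if elapsed_ms < min_ms:
        # signal "do not log this query"
        return None
    if isinstance(msg, bytes):
        # the query text may arrive as bytes on Python 3
        msg = msg.decode("utf-8", "replace")
    return "%s %d ms" % (msg, elapsed_ms)
```

Inside MyLoggingConnection, filter() could then be a one-liner such as `return timed_message(msg, curs.timestamp, min_ms=50)`, where the 50 ms threshold is only an illustrative value.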

python cassandra get big result of select * in generator (without storage result in ram)

I want to read all the data in the Cassandra table "user".
I have 840,000 users and I don't want to load them all into a Python list.
I want to get the users in packs of 100.
In the Cassandra docs (https://datastax.github.io/python-driver/query_paging.html)
I see that I can use fetch_size, but in my Python code I have a database object that contains all the CQL instructions:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

class Database:
    def __init__(self, name, salary):
        self.cluster = Cluster(['192.168.1.1', '192.168.1.2'])
        self.session = self.cluster.connect()

    def get_users(self):
        users_list = []
        query = "SELECT * FROM users"
        statement = SimpleStatement(query, fetch_size=10)
        for user_row in self.session.execute(statement):
            users_list.append(user_row.name)
        return users_list
Currently get_users returns a very big list of user names,
but I want to turn get_users into a "generator".
I don't want to get all the user names in one list from a single call to get_users. Instead, I want to call get_users many times and have each call return a list of at most 100 users.
For example:
list1 = database.get_users()
list2 = database.get_users()
...
listn = database.get_users()
list1 contains the first 100 users in the query
list2 contains the second 100 users in the query
listn contains the last elements in the query (<=100)
Is this possible?
Thanks in advance for your answer.

According to Paging Large Queries:
Whenever there are no more rows in the current page, the next page
will be fetched transparently.
So, if you execute your code like this, you will still get the whole result set, but it is paged transparently.
To achieve what you need, use callbacks. You can also find a code sample at the link above.
I have added the full code below for reference.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from threading import Event

class PagedResultHandler(object):
    def __init__(self, future):
        self.error = None
        self.finished_event = Event()
        self.future = future
        self.future.add_callbacks(
            callback=self.handle_page,
            errback=self.handle_error)

    def handle_page(self, rows):
        for row in rows:
            process_row(row)
        if self.future.has_more_pages:
            self.future.start_fetching_next_page()
        else:
            self.finished_event.set()

    def handle_error(self, exc):
        self.error = exc
        self.finished_event.set()

def process_row(user_row):
    print("%s %s %s" % (user_row.name, user_row.age, user_row.email))

cluster = Cluster()
session = cluster.connect()
query = "SELECT * FROM myschema.users"
statement = SimpleStatement(query, fetch_size=5)
future = session.execute_async(statement)
handler = PagedResultHandler(future)
handler.finished_event.wait()
if handler.error:
    raise handler.error
cluster.shutdown()
Moving to the next page is done in handle_page when start_fetching_next_page is called.
If you replace the if statement with self.finished_event.set(), you will see that the iteration stops after the first 5 rows, as defined in fetch_size.
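
If the goal is simply "lists of at most 100 users, one list at a time", a generator that groups any row iterable into batches may be enough. This is a minimal sketch rather than part of the driver: `in_batches` is a hypothetical helper, and it assumes the result of `session.execute(statement)` can be iterated lazily, which is what the quoted paging documentation describes.

```python
from itertools import islice


def in_batches(rows, batch_size=100):
    """Yield lists of at most batch_size items taken from rows, in order."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # A plain range stands in for the rows returned by the driver.
    for batch in in_batches(range(250), 100):
        print(len(batch))
```

With a real session this would be used as `for batch in in_batches(session.execute(statement)): ...`, so only one batch of rows is held in a list at any time.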

how do i call the function in specified period of time?

I am trying to build a learner that calls a function and stores the weights in the DB. The problem is that learning takes at least 30 to 60 seconds, so I have to wait before I can store anything. I decided to call the function with a threading timer, which should call it after a specified period of time.
Example of code:
def learn(myConnection):
    '''
    Derive all the names,images where state = 1
    Learn and Store
    Delete all the column where state is 1
    '''
    id = 0
    with myConnection:
        cur = myConnection.cursor()
        cur.execute("Select name, image FROM images WHERE state = 1")
        rows = cur.fetchall()
        for row in rows:
            print("%s, %s" % (row[0], row[1]))
            name = 'images/Output%d.jpg' % (id,)
            names = row[0]
            with open(name, "wb") as output_file:
                output_file.write(row[1])
            unknown_image = face_recognition.load_image_file(name)
            unknown_encoding = face_recognition.face_encodings(unknown_image)[0]
            # here i give a timer and call the function
            threading=Timer(60, storeIntoSQL(names,unknown_encoding) )
            threading.start()
            id += 1
This did not work: it behaved as if I had not specified the timer. It did not wait 60 seconds; it ran as if I had called the function directly. Any ideas on how I can make this work, or what alternatives I can use? PS: I have already tried time.sleep, but it just stops the main thread, and I need the project to keep running while this is training.
Example of the function that is being called:
def storeIntoSQL(name,unknown_face_encoding):
    print('i am printing')
    # connect to the database
    con = lite.connect('users2.db')
    # store new person into the database rmena
    with con:
        cur = con.cursor()
        # get the new id
        cur.execute("SELECT DISTINCT id FROM Users ")
        rows = cur.fetchall()
        newId = len(rows)+1
        # store into the Database
        query = "INSERT INTO Users VALUES (?,?,?)"
        cur.executemany(query, [(newId,name,r,) for r in unknown_face_encoding])
    con
I was also told that mutex synchronization could help, where I can make one thread work only once the other thread has finished its job, but I am not sure how to implement it and am open to any suggestions.

I would suggest using Python's threading library and putting a time.sleep(60) inside your function or in a wrapper function. For example:
import time
import threading

def delayed_func(name,unknown_face_encoding):
    time.sleep(60)
    storeIntoSQL(name,unknown_face_encoding)

timer_thread = threading.Thread(target=delayed_func, args=(name,unknown_face_encoding))
timer_thread.start()
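
For reference, the timer in the question fires immediately because `Timer(60, storeIntoSQL(names, unknown_encoding))` calls `storeIntoSQL` right there and hands its return value to the timer. `threading.Timer` expects the function object plus its arguments passed separately. The sketch below shows that form with a short delay; `store` and `stored` are hypothetical stand-ins for `storeIntoSQL` and the database.

```python
import threading


def store(name, encoding, sink):
    """Stand-in for storeIntoSQL: record what would be written."""
    sink.append((name, encoding))


stored = []

# Pass the function object and its arguments separately.
# Writing Timer(0.2, store("alice", ...)) would call store() right away.
timer = threading.Timer(0.2, store, args=("alice", [0.1, 0.2], stored))
timer.start()

# Snapshot taken before the delay has elapsed.
stored_right_after_start = list(stored)

# join() is only here so the example can inspect the result;
# the main thread could keep working instead.
timer.join()
```

In the question's code the equivalent call would be `Timer(60, storeIntoSQL, args=(names, unknown_encoding))`.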

Parallelizing pandas pyodbc SQL database calls

I am currently querying data into a dataframe via the pandas.io.sql.read_sql() command. I wanted to parallelize the calls, similar to what this talk advocates: (Embarrassingly parallel database calls with Python (PyData Paris 2015))
Something like (very general):
pools = [ThreadedConnectionPool(1,20,dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)
Is something like that possible?

Yes, this should work, with the caveat that you'll need to change parallel_connection.py from the talk that you cite. In that code there's a fetchall function which executes each of the cursors in parallel, then combines the results. This is the core of what you'll change:
Old Code:
def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))
New Code:
def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  # Store tuple of sql/conn instead of cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(results)
Repo: https://github.com/godatadriven/ParallelConnection
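
The same fan-out-and-combine idea can be sketched with only the standard library. This is an illustration under assumptions, not the talk's ParallelConnection: sqlite3 stands in for pyodbc, `fetch_rows` and `parallel_fetchall` are hypothetical names, and each worker opens its own connection rather than sharing one across threads.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetch_rows(db_path, sql):
    """Run sql against one database and return all of its rows."""
    # Each worker opens its own connection instead of sharing one.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def parallel_fetchall(db_paths, sql):
    """Run the same query on every database concurrently, then merge."""
    if not db_paths:
        return []
    with ThreadPoolExecutor(max_workers=len(db_paths)) as executor:
        per_db = list(executor.map(lambda path: fetch_rows(path, sql), db_paths))
    # executor.map preserves input order, so the merged order is stable.
    return [row for rows in per_db for row in rows]
```

To get DataFrames instead of row lists, `fetch_rows` could call `pd.read_sql(sql, conn)` and the final step could use `pd.concat(per_db)`.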

Redis still fills up when results_ttl=0, Why?

Question: Why is Redis filling up if the results of jobs are discarded immediately?
I'm using Redis as a queue to create PDFs asynchronously and then save the result to my database. Since it's saved, I don't need to access the object at a later date, so I don't need to keep the result in Redis after it has been processed.
To keep the result from staying in Redis I've set the TTL to 0:
parameter_dict = {
    "order": serializer.object,
    "photo": base64_image,
    "result_ttl": 0
}
django_rq.enqueue(procces_template, **parameter_dict)
The problem is that, although the worker says the job expires immediately:
15:33:35 Job OK, result = John Doe's nail order to 568 Broadway
15:33:35 Result discarded immediately.
15:33:35
15:33:35 *** Listening on high, default, low...
Redis still fills up and throws:
ResponseError: command not allowed when used memory > 'maxmemory'
Is there another parameter that I need to set in Redis / django-rq to keep Redis from filling up if the job result is already not stored?
Update:
Following this post, I suspect the memory might be filling up because of the failed jobs in Redis.
Using this code snippet:
def print_redis_failed_queue():
    q = django_rq.get_failed_queue()
    while True:
        job = q.dequeue()
        if not job:
            break
        print(job)
here is a paste bin of a dump of the keys in Redis:
http://pastebin.com/Bc4bRyRR
It's too long to be practical to post here. Its size seems to support my theory. But using:
def delete_redis_failed_queue():
    q = django_rq.get_failed_queue()
    count = 0
    while True:
        job = q.dequeue()
        if not job:
            print("{} Jobs deleted.".format(count))
            break
        job.delete()
        count += 1
doesn't clear Redis as I expect. How can I get a more accurate dump of the keys in Redis? Am I clearing the jobs correctly?

It turns out Redis was filling up because of orphaned jobs, i.e., jobs that were not assigned to a particular queue.
Although the cause of the orphaned jobs is unknown, the problem is solved with this snippet:
import redis
from rq.queue import Queue, get_failed_queue
from rq.job import Job

redis_conn = redis.Redis()
for i, key in enumerate(redis_conn.keys('rq:job:*')):
    job_number = key.split("rq:job:")[1]
    job = Job.fetch(job_number, connection=redis_conn)
    job.delete()
In my particular situation, calling this snippet (actually the delete_orphaned_jobs() method below) after the completion of each job ensured that Redis would not fill up and that orphaned jobs would be taken care of. For more details on the issue, here's a link to the conversation in the opened django-rq issue.
In the process of diagnosing this issue, I also created a utility class for inspecting and deleting jobs / orphaned jobs with ease:
import os
import logging
import django_rq

logger = logging.getLogger(__name__)

class RedisTools:
    '''
    A set of utility tools for interacting with a redis cache
    '''
    def __init__(self):
        self._queues = ["default", "high", "low", "failed"]
        self.get_redis_connection()

    def get_redis_connection(self):
        redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
        self.redis = redis.from_url(redis_url)

    def get_queues(self):
        return self._queues

    def get_queue_count(self, queue):
        return Queue(name=queue, connection=self.redis).count

    def msg_print_log(self, msg):
        print(msg)
        logger.info(msg)

    def get_key_count(self):
        return len(self.redis.keys('rq:job:*'))

    def get_queue_job_counts(self):
        queues = self.get_queues()
        queue_counts = [self.get_queue_count(queue) for queue in queues]
        return zip(queues, queue_counts)

    def has_orphanes(self):
        # more job keys than queued jobs means some jobs belong to no queue
        job_count = sum([count[1] for count in self.get_queue_job_counts()])
        return job_count < self.get_key_count()

    def print_failed_jobs(self):
        q = django_rq.get_failed_queue()
        while True:
            job = q.dequeue()
            if not job:
                break
            print(job)

    def print_job_counts(self):
        for queue in self.get_queue_job_counts():
            print("{:.<20}{}".format(queue[0], queue[1]))
        print("{:.<20}{}".format('Redis Keys:', self.get_key_count()))

    def delete_failed_jobs(self):
        q = django_rq.get_failed_queue()
        count = 0
        while True:
            job = q.dequeue()
            if not job:
                self.msg_print_log("{} Jobs deleted.".format(count))
                break
            job.delete()
            count += 1

    def delete_orphaned_jobs(self):
        if not self.has_orphanes():
            return self.msg_print_log("No orphan jobs to delete.")
        for i, key in enumerate(self.redis.keys('rq:job:*')):
            job_number = key.split("rq:job:")[1]
            job = Job.fetch(job_number, connection=self.redis)
            job.delete()
            self.msg_print_log("[{}] Deleted job {}.".format(i, job_number))
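
Note that delete_orphaned_jobs() above removes every `rq:job:*` key once it detects that orphans exist, not only the orphaned ones. If queued jobs must be kept, the orphan ids can be worked out first by comparing the job keys against the ids still sitting in queues. The sketch below is plain Python so it runs without Redis. `find_orphan_ids` is a hypothetical helper, and where the queued ids come from (for example each queue's list of job ids) is an assumption.

```python
JOB_KEY_PREFIX = "rq:job:"


def find_orphan_ids(job_keys, queued_ids):
    """Return ids that have a job key but are not in any queue."""
    queued = set(queued_ids)
    orphans = []
    for key in job_keys:
        if isinstance(key, bytes):
            # redis-py returns bytes unless decode_responses=True is set
            key = key.decode("utf-8")
        job_id = key[len(JOB_KEY_PREFIX):]
        if job_id not in queued:
            orphans.append(job_id)
    return orphans
```

Each returned id could then be passed to `Job.fetch(...)` and deleted, as in the method above.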

You can use the "Black Hole" exception handler from http://python-rq.org/docs/exceptions/ with job.cancel():
def black_hole(job, *exc_info):
    # Delete the job hash on redis, otherwise it will stay on the queue forever
    job.cancel()
    return False
