My application receives one or more URLs (typically 3-4) from the user, scrapes certain data from those URLs, and writes that data to the database. However, because the scraping takes a little while, I was thinking of running each scrape in a separate thread so that the scraping + writing to the database can continue in the background and the user does not have to keep waiting.
To implement that, I have (relevant parts only):
@view_config(route_name="add_movie", renderer="templates/add_movie.jinja2")
def add_movie(request):
    post_data = request.POST
    if "movies" in post_data:
        movies = post_data["movies"].split(os.linesep)
        for movie_id in movies:
            movie_thread = Thread(target=store_movie_details, args=(movie_id,))
            movie_thread.start()
    return {}
def store_movie_details(movie_id):
    movie_details = scrape_data(movie_id)
    new_movie = Movie(**movie_details)  # Movie is my model.
    print new_movie  # Works fine.
    print DBSession.add(movies(**movie_details))  # Returns None.
While the print new_movie line does print the correct scraped data, DBSession.add() doesn't work. In fact, it just returns None.
If I remove the threads and just call the method store_movie_details(), it works fine.
What's going on?
Firstly, the SA docs on Session.add() do not mention anything about the method's return value, so I would assume it is expected to return None.
Secondly, I think you meant to add new_movie to the session, not movies(**movie_details), whatever that is :)
Thirdly, the standard Pyramid session (the one configured with ZopeTransactionExtension) is tied to Pyramid's request-response cycle, which may produce unexpected behavior in your situation. You need to configure a separate session which you will need to commit manually in store_movie_details. This session needs to use scoped_session so the session object is thread-local and is not shared across threads.
from sqlalchemy.orm import scoped_session
from sqlalchemy.orm import sessionmaker
session_factory = sessionmaker(bind=some_engine)
AsyncSession = scoped_session(session_factory)
def store_movie_details(movie_id):
    session = AsyncSession()
    movie_details = scrape_data(movie_id)
    new_movie = Movie(**movie_details)  # Movie is my model.
    session.add(new_movie)
    session.commit()
And, of course, this approach is only suitable for very light-weight tasks, and if you don't mind occasionally losing a task (when the webserver restarts, for example). For anything more serious have a look at Celery etc. as Antoine Leclair suggests.
The transaction manager closes the transaction when the response is returned, so by the time your threads run, DBSession no longer has an active transaction. Also, it's probably not a good idea to share a transaction across threads.
This is a typical use case for a worker. Check out Celery or RQ.
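For illustration, the worker approach might look roughly like this with Celery (a minimal sketch, assuming a configured Celery app object named celery_app; the task name is made up, not from the question):

# Hypothetical sketch only: offload the scrape + DB write to a Celery worker
# instead of a thread started from the request. Assumes a configured Celery
# app object named `celery_app`.
@celery_app.task
def store_movie_details_task(movie_id):
    store_movie_details(movie_id)   # scrape and write inside the worker process

# In the Pyramid view, instead of Thread(...).start():
#     store_movie_details_task.delay(movie_id)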
I have a (non-web) application that continuously (once every few seconds) scans a MySQL table for commands to handle certain jobs. Those jobs can take a while and/or may start delayed, so I start a thread for each job. The thread first removes the command from the table, to avoid more threads being created for the same job, and then (possibly after a given delay) executes the job. It is my understanding that contextual/thread-local sessions (https://docs.sqlalchemy.org/en/14/orm/contextual.html) are a good way to handle this.
So far so good. My question is about when and where to use Session.remove(), because while the article says that "if the application thread itself ends, the “storage” for that thread is also garbage collected", I don't think relying on garbage collection is good practice; I should clean up properly myself whenever possible.
# database.py
from sqlalchemy.orm import sessionmaker, scoped_session
session_factory = sessionmaker(bind=some_engine)
Session = scoped_session(session_factory)
# Model definitions below
# command_handler.py
from threading import Thread
from time import sleep
from database import Session
from classes import Helper, Table1, Commands
def job(id):
    session = Session()                              # creating a thread-local session
    cmd = session.query(Commands).get(id)            # for this unique job thread
    command, timestamp = cmd.command, cmd.timestamp  # get contents of the command row
    session.delete(cmd)                              # before deleting it from the db
    session.commit()
    # potential delay via sleep() function here if timestamp is in the future.
    for element in session.query(Table1).all():      # running a static class method
        Helper.some_method(element)                  # on a class that's imported
    Session.remove()                                 # remove the session after the job is done
while True:
    with Session() as session:                        # create a thread-local session
        commands = session.query(Commands).all()      # for the __main__ thread
        for cmd in commands:
            if cmd.command.startswith('SOMECOMMAND'):
                Thread(target=job, args=[cmd.id]).start()  # create a thread to handle the job
            # Other commands can be added here and functions
            # like job could also be imported from other files
    sleep(10)                                         # wait 10 sec before checking again
# helper.py
from database import Session
from classes import Table2
class Helper:
    @classmethod
    def some_method(cls, element):
        session = Session()              # should be the same session object
        row = Table2(name=element.name)  # as in the calling job function (?)
        session.add(row)                 # do some arbitrary stuff
        session.commit()                 # no Session.remove() required here (?)
Now I wonder if this is the correct way to handle it. The way I understand it, the advantage of thread-local sessions is not having to pass them around as function arguments, since the same thread will always get the same session object when Session() is called, right?
If that's right then only one Session.remove() at the end of the job function should be required to clean up the session that was used within that particular thread.
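For what it's worth, the thread-local contract is easy to see in a quick sketch (just an illustration of scoped_session behaviour, using the Session registry from database.py above):

# Within one thread, repeated calls return the same underlying Session;
# Session.remove() closes it and discards it from the registry.
s1 = Session()
s2 = Session()
assert s1 is s2            # same thread -> same session object

Session.remove()           # clean up the thread-local session
s3 = Session()
assert s3 is not s1        # the next call creates a fresh session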
Is there an easy way to check how many sessions the scoped_session registry contains at any given point in time?
(github minimal verifiable complete example at bottom)
I have a Flask endpoint that takes a data submission (just a dict), stores it, spins up a Celery task to process the data later, and returns.
The Celery task's job is to update the fields of a contact using that data, and log the attributes it changes (name, oldvalue, newvalue).
In production, multiple tasks may come into being for the same update-job, and I'd like to prevent them from all submitting updates for the same contact.
I tried to do this using a row-lock on the update-job row. After calculating all of the fields that need to be changed, and changing the Python object itself, the task will then:
1. Wait for a lock on the update-job row.
2. After acquiring the lock, re-check whether the row is still unprocessed.
3. If yes, add the updated contact object, the new change-log objects, and the job object (now marked completed) to the session and commit.
3.5. If no, abandon: do not commit, and end the task.
Unfortunately I see multiple tasks committing their staged changes though.
I had assumed that db.session is a scoped session, and is therefore safe for use by different celery tasks forked from the same worker.
Are they instead performing operations on each other's sessions?
Relevant task function:
@celery.task(name="process_update")
def process_update(update_id):
    print("Starting")
    time.sleep(.5)
    update = UpdateJob.query.get(update_id)

    # Wait for previous updates to finish and abort after timeout
    for attempt in range(40):
        unfinished_changes_to_contact = db.session.query(UpdateJob).filter(
            UpdateJob.contact_id == update.contact_id,
            UpdateJob.id < update.id,
            UpdateJob.contact_was_updated == False  # noqa
        ).order_by(UpdateJob.id.desc()).first()
        if unfinished_changes_to_contact is None:
            break
        time.sleep(.5)
    else:
        print(f"{update.id} waiting for {unfinished_changes_to_contact.id} to finish")
        raise Exception("Already existing update to contact that is not finishing")

    print("We are clear")
    time.sleep(.5)

    contact = db.session.query(Contact).get(update.contact_id)
    for key, value in json.loads(update.data).items():
        setattr(contact, key, value)

    changes = field_changes_from_contact(contact)
    for change in changes:
        change.update_job_id = update.id

    # Did another task already cover this update?
    print("Rechecking. Locking")
    processed_update = UpdateJob.query.with_for_update().get(update_id)
    if processed_update.contact_was_updated is True:
        print(processed_update)
        print("Already done. Abandoning.")
        return

    print("Looks good. Committing.")
    processed_update.contact_was_updated = True
    db.session.add(processed_update)
    db.session.add(contact)
    db.session.bulk_save_objects(changes)
    db.session.commit()
    print("Commited")
In the logfile (https://ideone.com/wXg3SS) I can see that worker-1 was the first to issue an actual SELECT FOR UPDATE statement to the db, and worker-1 was indeed the first to commit. This is correct behavior.
But afterward, worker-2 should acquire the lock, re-check that row, find that its "updated" field is now true, and abandon the whole thing. Instead it commits.
Also, there are many statements being issued to the DB in between these events, and I don't understand why.
MVCE on github
https://github.com/kfieldm/flask-race
I have a Flask web app that might need to execute some heavy SQL queries via SQLAlchemy given user input. I would like to set a timeout for the query, let's say 20 seconds, so that if a query takes more than 20 seconds, the server displays an error message to the user and they can try again later or with smaller inputs.
I have tried with both multiprocessing and threading modules, both with the Flask development server and Gunicorn without success: the server keeps blocking and no error message is returned. You'll find below an excerpt of the code.
How do you handle slow SQL queries in Flask in a user-friendly way?
Thanks.
from multiprocessing import Process

@app.route("/long_query")
def long_query():
    query = db.session.query(User)

    def run_query():
        nonlocal query
        query = query.all()

    p = Process(target=run_query)
    p.start()
    p.join(20)  # timeout of 20 seconds
    if p.is_alive():
        p.terminate()
        return render_template("error.html", message="please try with smaller input")
    return render_template("result.html", data=query)
I would recommend using Celery or something similar (people use python-rq for simple workflows).
Take a look at Flask documentation regarding Celery: http://flask.pocoo.org/docs/0.12/patterns/celery/
As for dealing with the results of a long-running query: you can create an endpoint for requesting task results and have client application periodically checking this endpoint until the results are available.
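A minimal sketch of such a polling endpoint (assuming a Flask app named app and a Celery app named celery; the route and names are illustrative, not from the question):

from celery.result import AsyncResult

@app.route("/results/<task_id>")
def task_results(task_id):
    result = AsyncResult(task_id, app=celery)
    if result.ready():
        # finished: return the (JSON-serialisable) result of the task
        return {"state": result.state, "data": result.get()}
    # still running: the client polls this endpoint again later
    return {"state": result.state}, 202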
As leovp mentioned, Celery is the way to go if you are working on a long-term project. However, if you are working on a small project where you want something easy to set up, I would suggest going with RQ and its Flask plugin. Also, think very seriously about terminating processes while they are querying the database, since they might not be able to clean up after themselves (e.g. releasing locks they hold on the db).
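For completeness, a rough sketch of the RQ variant (assumes Redis running locally and a worker started with rq worker; run_heavy_query is a placeholder, not from the question):

from flask import render_template
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())             # assumes Redis on localhost:6379

@app.route("/long_query")
def long_query():
    # enqueue the placeholder query function and return immediately
    job = q.enqueue(run_heavy_query)
    return render_template("pending.html", job_id=job.get_id())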
Well if you really want to terminate queries on a timeout, I would suggest you use a database that supports it (PostgreSQL is one). I will assume you use PostgreSQL for this section.
from sqlalchemy.interfaces import ConnectionProxy

class ConnectionProxyWithTimeouts(ConnectionProxy):
    def cursor_execute(self, execute, cursor, statement, parameters, context, executemany):
        timeout = context.execution_options.get('timeout', None)
        if timeout:
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO %d;' % int(timeout * 1000))
            c.close()
            ret = execute(cursor, statement, parameters, context)
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO 0')
            c.close()
            return ret
        else:
            return execute(cursor, statement, parameters, context)
Then, when you create the engine, you would pass your own connection proxy:
engine = create_engine(URL, proxy=ConnectionProxyWithTimeouts(), pool_size=1, max_overflow=0)
And you could then query like this:
User.query.execution_options(timeout=20).all()
If you want to use the code above, use it only as a base for your own implementation, since I am not 100% sure it's bug free.
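As a side note, ConnectionProxy has long been deprecated; if the timeout can apply to every query on the engine, a simpler option (a sketch, assuming PostgreSQL with psycopg2; DATABASE_URL is a placeholder) is to set statement_timeout through connect_args:

from sqlalchemy import create_engine

# Every connection from this engine gets a 20 s server-side statement timeout.
engine = create_engine(
    DATABASE_URL,  # placeholder for your PostgreSQL URL
    connect_args={"options": "-c statement_timeout=20000"},  # value in ms
)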
I'm currently using Andy McCurdy's redis.py module to interact with Redis from Django.
I offload certain tasks into the background using Celery.
Here is one of my tasks:
import redis

pool = redis.ConnectionPool(host='XX.XXX.XXX.X', port=6379, db=0, password='password')
r_server = redis.Redis(connection_pool=pool)
pipe = r_server.pipeline()

# The number of seconds for two months
seconds = 5356800

@shared_task
def Activity(userID, object_id, timestamp):
    timestamp = int(timestamp)

    # Create Individual activity for each.
    activity_key = 'invite:activity:%s:%s' % (userID, timestamp)

    mapping = dict(
        user_id=userID,
        object_id=object_id)

    pipe.hmset(activity_key, mapping).expire(activity_key, seconds).execute()
Whenever this task is invoked, I get the following error:
AttributeError: 'bool' object has no attribute 'expire'
What could possibly be causing this?
I later did a test in a python console to see if there was something wrong with my syntax, but everything worked just as I had planned. So what could possibly be causing this error?
UPDATE
I think expire is being called on the result of hmset(activity_key, mapping). This is weird! expire is a method of pipe.
SECOND UPDATE
I found a workaround for the time being. It seems this only occurs within Celery; native Django views and the Python console don't exhibit this behavior. It seems to evaluate the result of each expression before the next call in the chain. If any of you come across this problem, here is a workaround.
pipe.hmset(activity_key, mapping)
pipe.lpush('mylist', 1)
pipe.expire('mylist', 300)
pipe.execute()
This should work and not give you any issues. Happy Coding!
pipe.hmset() does not return the pipeline; you cannot chain the calls here. Instead, a boolean is returned (probably indicating if the hmset() call succeeded).
Call the .expire() and .execute() methods separately:
pipe.hmset(activity_key, mapping)
pipe.expire(activity_key, seconds)
pipe.execute()
I suspect you need to create a pipeline within the Celery task instead of re-using a global here. Move the pipe = r_server.pipeline() to within the activity.
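Putting both points together, the task would look roughly like this (same names as in the question; only the pipeline creation and the un-chained calls change):

@shared_task
def Activity(userID, object_id, timestamp):
    timestamp = int(timestamp)
    activity_key = 'invite:activity:%s:%s' % (userID, timestamp)
    mapping = dict(user_id=userID, object_id=object_id)

    pipe = r_server.pipeline()            # build the pipeline inside the task
    pipe.hmset(activity_key, mapping)
    pipe.expire(activity_key, seconds)
    pipe.execute()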
Scenario:
A .NET-based application server (Wonderware IAS/System Platform) hosts automation objects that communicate with various equipment on the factory floor.
CPython is hosted inside this application server (using Python for .NET).
The automation objects have scripting functionality built-in (using a custom, .NET-based language). These scripts call Python functions.
The Python functions are part of a system to track Work-In-Progress on the factory floor. The purpose of the system is to track the produced widgets along the process, ensure that the widgets go through the process in the correct order, and check that certain conditions are met along the process. The widget production history and widget state is stored in a relational database, this is where SQLAlchemy plays its part.
For example, when a widget passes a scanner, the automation software triggers the following script (written in the application server's custom scripting language):
' widget_id and scanner_id provided by automation object
' ExecFunction() takes care of calling a CPython function
retval = ExecFunction("WidgetScanned", widget_id, scanner_id);
' if the python function raises an Exception, ErrorOccured will be true
' in this case, any errors should cause the production line to stop.
if (retval.ErrorOccured) then
    ProductionLine.Running = False;
    InformationBoard.DisplayText = "ERROR: " + retval.Exception.Message;
    InformationBoard.SoundAlarm = True
end if;
The script calls the WidgetScanned python function:
# pywip/functions.py
from pywip.database import session
from pywip.model import Widget, WidgetHistoryItem
from pywip import validation, StatusMessage
from datetime import datetime
def WidgetScanned(widget_id, scanner_id):
    widget = session.query(Widget).get(widget_id)
    validation.validate_widget_passed_scanner(widget, scanner_id)  # raises exception on error
    widget.history.append(WidgetHistoryItem(timestamp=datetime.now(), action=u"SCANNED", scanner_id=scanner_id))
    widget.last_scanner = scanner_id
    widget.last_update = datetime.now()
    return StatusMessage("OK")
# ... there are a dozen similar functions
My question is: How do I best manage SQLAlchemy sessions in this scenario? The application server is a long-running process, typically running months between restarts. The application server is single-threaded.
Currently, I do it the following way:
I apply a decorator to the functions I make available to the application server:
# pywip/iasfunctions.py
from pywip import functions
from pywip.database import session  # the scoped_session shown below

def ias_session_handling(func):
    def _ias_session_handling(*args, **kwargs):
        try:
            retval = func(*args, **kwargs)
            session.commit()
            return retval
        except:
            session.rollback()
            raise
    return _ias_session_handling
# ... actually I populate this module with decorated versions of all the functions in pywip.functions dynamically
WidgetScanned = ias_session_handling(functions.WidgetScanned)
Question: Is the decorator above suitable for handling sessions in a long-running process? Should I call session.remove()?
The SQLAlchemy session object is a scoped session:
# pywip/database.py
from sqlalchemy.orm import scoped_session, sessionmaker
session = scoped_session(sessionmaker())
I want to keep the session management out of the basic functions. For two reasons:
There is another family of functions, sequence functions. The sequence functions call several of the basic functions. One sequence function should equal one database transaction.
I need to be able to use the library from other environments. a) From a TurboGears web application. In that case, session management is done by TurboGears. b) From an IPython shell. In that case, commit/rollback will be explicit.
(I am truly sorry for the long question. But I felt I needed to explain the scenario. Perhaps not necessary?)
The described decorator is suitable for long-running applications, but you can run into trouble if you accidentally share objects between requests. To make the errors appear earlier and not corrupt anything, it is better to discard the session with session.remove():
try:
    try:
        retval = func(*args, **kwargs)
        session.commit()
        return retval
    except:
        session.rollback()
        raise
finally:
    session.remove()
Or, if you can, use the with context manager:
try:
    with session.registry().transaction:
        return func(*args, **kwargs)
finally:
    session.remove()
By the way, you might want to use .with_lockmode('update') on the query so your validate doesn't run on stale data.
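Applied to the query in WidgetScanned, that would look something like this (with_for_update() is the newer spelling of with_lockmode('update') on current SQLAlchemy versions):

def WidgetScanned(widget_id, scanner_id):
    # SELECT ... FOR UPDATE: lock the widget row for this transaction so the
    # validation does not run against stale data
    widget = session.query(Widget).with_lockmode('update').get(widget_id)
    # ... rest of the function unchanged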
Ask your WonderWare administrator to give you access to the Wonderware Historian; you can track the values of the tags pretty easily via MSSQL calls over SQLAlchemy that you can poll every so often.
Another option is to use the ArchestrA toolkit to listen for the internal tag updates and have a server deployed as a platform in the Galaxy from which you can listen.