Django: AttributeError: 'bool' object has no attribute 'expire' - python

I'm currently using Andy McCurdy's redis.py module to interact with Redis from Django.
I offload certain tasks into the background using Celery.
Here is one of my tasks:
import redis
from celery import shared_task

pool = redis.ConnectionPool(host='XX.XXX.XXX.X', port=6379, db=0, password='password')
r_server = redis.Redis(connection_pool=pool)
pipe = r_server.pipeline()
# The number of seconds for two months
seconds = 5356800

@shared_task
def Activity(userID, object_id, timestamp):
    timestamp = int(timestamp)
    # Create Individual activity for each.
    activity_key = 'invite:activity:%s:%s' % (userID, timestamp)
    mapping = dict(
        user_id = userID,
        object_id = object_id)
    pipe.hmset(activity_key, mapping).expire(activity_key, seconds).execute()
Whenever this task is invoked, I get the following error:
AttributeError: 'bool' object has no attribute 'expire'
What could possibly be causing this?
I later did a test in a python console to see if there was something wrong with my syntax, but everything worked just as I had planned. So what could possibly be causing this error?
UPDATE
I think the expire is evaluating the result of hmset(activity_key, mapping). This is weird! expire is a method for pipe.
SECOND UPDATE
I found a workaround for the time being. It seems this only occurs within Celery; native Django views and the Python console don't exhibit this behavior. The chained call seems to evaluate the result of the expression before it. If any of you come across this problem, here is a workaround.
pipe.hmset(activity_key, mapping)
pipe.lpush('mylist', 1)
pipe.expire('mylist', 300)
pipe.execute()
This should work and not give you any issues. Happy Coding!

pipe.hmset() does not return the pipeline; you cannot chain the calls here. Instead, a boolean is returned (probably indicating if the hmset() call succeeded).
Call the .expire() and .execute() methods separately:
pipe.hmset(activity_key, mapping)
pipe.expire(activity_key, seconds)
pipe.execute()
I suspect you need to create a pipeline within the Celery task instead of re-using a global here. Move the pipe = r_server.pipeline() line into the Activity function.
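One way to read that suggestion is sketched below, as a hedged illustration rather than the asker's final code. The helper name store_activity and the TWO_MONTHS_SECONDS constant are made up for this example. The function takes the Redis client as an argument, builds a fresh pipeline on every call, and queues each command as its own statement instead of chaining.

```python
# Hypothetical helper: the name and signature are assumptions for illustration.
TWO_MONTHS_SECONDS = 62 * 24 * 60 * 60  # 5356800, the value used in the question


def store_activity(r_server, user_id, object_id, timestamp):
    """Store one activity hash and give it a two-month expiry."""
    activity_key = 'invite:activity:%s:%s' % (user_id, int(timestamp))
    mapping = {'user_id': user_id, 'object_id': object_id}

    # Create the pipeline inside the call so no state is shared between tasks.
    pipe = r_server.pipeline()
    # Queue each command separately; do not rely on the return values to chain.
    pipe.hmset(activity_key, mapping)
    pipe.expire(activity_key, TWO_MONTHS_SECONDS)
    return pipe.execute()
```

Inside the Celery task you would call store_activity(r_server, userID, object_id, timestamp), keeping the connection pool global but not the pipeline.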

Related

multiprocessing / psycopg2 TypeError: can't pickle _thread.RLock objects

I followed the below code in order to implement a parallel select query on a postgres database:
https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/
My basic problem is that I have ~6k queries to execute, and I am trying to optimise their execution. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but that query used more than 4GB of RAM on the machine it ran on, so I split it into 6k individual queries, which keep memory usage steady when run synchronously. However, they take a lot longer to run, which is less of an issue for my use case. Even so, I am trying to reduce the time as much as possible.
This is what my code looks like:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()
        return list(records)
However whenever this runs, I get the following error:
TypeError: can't pickle _thread.RLock objects
I've read lots of similar questions regarding the use of multiprocessing and picklable objects, but I can't for the life of me figure out what I am doing wrong.
The pool is generally one per process (which I believe is the best practice) but shared per instance of the connector class, so that it's not creating a pool for each use of the parallel_query method.
The top answer to a similar question:
Accessing a MySQL connection pool from Python multiprocessing
Shows an almost identical implementation to my own, except using MySql instead of Postgres.
Am I doing something wrong?
Thanks!
EDIT:
I've found this answer:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
which is incredibly detailed and looks as though I have misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools. This solution seems good, but it seems like A LOT of code for what I think is a fairly simple problem?
EDIT 2:
So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found it, but it doesn't solve the initial issue of not being able to use imap_unordered due to the pickling error. Very frustrating.
Lastly, I think it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis rq for scheduling, background tasks etc., and a hosted instance of Postgres as the database.
To put it simply, Postgres connections and the SQLAlchemy connection pool are thread-safe, but they are not fork-safe.
If you want to use multiprocessing, you should initialize the engine in each child process after the fork.
You should use multithreading instead if you want to share engines.
Refer to "Thread and process safety" in the psycopg2 documentation:
"libpq connections shouldn’t be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork."
If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once on each child process. Try this:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            #self.pool.close()
            #self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack, if you want to remove this method, you should
        # remove self.pool and just pass pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict
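For comparison, here is a smaller sketch of the same initializer idea using only module-level functions, which sidesteps pickling a bound method altogether. It is an illustration under assumptions, not a drop-in replacement: sqlite3 stands in for Postgres so the example has no external dependencies, and the names init_worker, run_query, run_parallel_queries and the _connection global are invented here. With Postgres you would open the psycopg2 connection or create the engine inside init_worker instead.

```python
import multiprocessing
import sqlite3

# One connection per worker process, created after the process has started.
_connection = None


def init_worker(db_path):
    """Pool initializer: runs once in each child process."""
    global _connection
    _connection = sqlite3.connect(db_path)  # stand-in for psycopg2.connect(db_url)


def run_query(query):
    """Executed in a worker; uses that worker's own connection."""
    return _connection.execute(query).fetchall()


def run_parallel_queries(db_path, queries, processes=2):
    """Run each query in the pool and flatten the per-query row lists."""
    with multiprocessing.Pool(processes, initializer=init_worker,
                              initargs=(db_path,)) as pool:
        per_query_rows = pool.map(run_query, queries)
    return [row for rows in per_query_rows for row in rows]
```

Because only the query strings and the result rows cross the process boundary, nothing holding a lock (engine, pool, logger handlers) needs to be pickled. On platforms that spawn rather than fork, call run_parallel_queries from under an if __name__ == "__main__": guard.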
Now, to address the XY problem.
"Initially it was a single query with the where id in (...) contained all 6k predicate IDs but I ran into issues with the query using up > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries which when synchronously keeps a steady memory usage."
What you may want to do instead is one of these options:
write a subquery that generates all 6000 IDs and use the subquery in your original bulk query.
as above, but write the subquery as a CTE
if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table
However, if you insist on running 6000 IDs through python, then the fastest query is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to try to chunk the queries. Send 500 IDs at once for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while still comfortably within your memory budget.
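The chunking idea can be sketched as follows. This is a minimal illustration: the table name my_table and both function names are hypothetical, and the query relies on psycopg2 adapting a Python list to a Postgres array so that a single %s placeholder can carry a whole chunk.

```python
def chunked(ids, chunk_size):
    """Yield successive lists of at most chunk_size IDs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def build_chunk_queries(ids, chunk_size=500):
    """Return (sql, params) pairs, one per chunk, ready for cursor.execute()."""
    # my_table is a placeholder; substitute the real table and columns.
    query = "SELECT * FROM my_table WHERE id = ANY(%s)"
    return [(query, (chunk,)) for chunk in chunked(ids, chunk_size)]
```

Each pair can be passed as cur.execute(sql, params). With 6000 IDs and a chunk size of 500 that is 12 round trips instead of 6000, and the chunk size is the single knob to tune against the memory budget.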

Pyramid: Multi-threaded Database Operation

My application receives one or more URLs (typically 3-4 URLs) from the user, scrapes certain data from those URLs and writes that data to the database. However, because scraping the data takes a little while, I was thinking of running each scrape in a separate thread, so that scraping and writing to the database can continue in the background and the user does not have to wait.
To implement that, I have (relevant parts only):
@view_config(route_name="add_movie", renderer="templates/add_movie.jinja2")
def add_movie(request):
    post_data = request.POST
    if "movies" in post_data:
        movies = post_data["movies"].split(os.linesep)
        for movie_id in movies:
            movie_thread = Thread(target=store_movie_details, args=(movie_id,))
            movie_thread.start()
    return {}

def store_movie_details(movie_id):
    movie_details = scrape_data(movie_id)
    new_movie = Movie(**movie_details) # Movie is my model.
    print new_movie # Works fine.
    print DBSession.add(movies(**movie_details)) # Returns None.
While printing new_movie shows the correctly scraped data, DBSession.add() doesn't work. In fact, it just returns None.
If I remove the threads and just call the method store_movie_details(), it works fine.
What's going on?
Firstly, the SA docs on Session.add() do not mention anything about the method's return value, so I would assume it is expected to return None.
Secondly, I think you meant to add new_movie to the session, not movies(**movie_details), whatever that is :)
Thirdly, the standard Pyramid session (the one configured with ZopeTransactionExtension) is tied to Pyramid's request-response cycle, which may produce unexpected behavior in your situation. You need to configure a separate session which you will need to commit manually in store_movie_details. This session needs to use scoped_session so the session object is thread-local and is not shared across threads.
from sqlalchemy.orm import scoped_session
from sqlalchemy.orm import sessionmaker

session_factory = sessionmaker(bind=some_engine)
AsyncSession = scoped_session(session_factory)

def store_movie_details(movie_id):
    session = AsyncSession()
    movie_details = scrape_data(movie_id)
    new_movie = Movie(**movie_details) # Movie is my model.
    session.add(new_movie)
    session.commit()
And, of course, this approach is only suitable for very light-weight tasks, and if you don't mind occasionally losing a task (when the webserver restarts, for example). For anything more serious have a look at Celery etc. as Antoine Leclair suggests.
The transaction manager closes the transaction when the response is returned. DBSession has no transaction in the other threads when the response has been returned. Also, it's probably not a good idea to share a transaction across threads.
This is a typical use case for a worker. Check out Celery or RQ.

Reporting yielded results of long-running Celery task

Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
    """ Serve (possibly incomplete) results of a session's latest run. """
    session = request.session
    try: # Load most recent task
        task_result = session["task_result"]
    except KeyError: # Already cleared, or doesn't exist
        if "results" not in session:
            session["status"] = "No job submitted"
    else: # Extract data from Asynchronous Tasks
        session["status"] = task_result.status
        if task_result.ready():
            session["results"] = task_result.get()
            render_task = task_result.children[0]
            # Decorate with rendering results
            session["render_status"] = render_task.status
            if render_task.ready():
                session["results"].render_output = render_task.get()
                del(request.session["task_result"]) # Don't need any more
    return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.backend.mark_as_started(
            report_progress.request.id,
            progress=progress)

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.update_state(state='PROGRESS', meta={'progress': progress})

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
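One detail the view code above glosses over: task.info is only a dict once the task has reported some metadata. Before the first update it is None, and after a failure it holds the raised exception, so indexing it directly can fail. A small defensive reader might look like this. The function name read_progress is an assumption for illustration, and it accepts any object with an info attribute, so it can be tried without a running worker.

```python
def read_progress(task, default=None):
    """Return the last reported 'progress' value, or default if none is available.

    `task` is expected to behave like a Celery AsyncResult: its `info`
    attribute is the metadata dict while the task reports progress, None
    before any state has been stored, or an exception if the task failed.
    """
    info = task.info
    if isinstance(info, dict) and 'progress' in info:
        return info['progress']
    return default
```

In the view this would replace the direct lookup, for example progress = read_progress(AsyncResult(task_id), default=0).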
Celery part:
def long_func(*args, **kwargs):
    i = 0
    while True:
        yield i
        do_something_here(*args, **kwargs)
        i += 1

@task()
def test_yield_task(task_id=None, **kwargs):
    the_progress = 0
    for the_progress in long_func(**kwargs):
        cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % request.session.get('task_id'))
if v:
    do_something()
If you would rather not use the cache, or can't, you can use the database, a file or any other place that both the Celery worker and the web server can access. The cache is the simplest solution, but the workers and the server have to use the same cache.
A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.

Sending signal to long running method in Django

I want to send a "pause" signal to a long running task in Celery and I'm trying to figure out the best way to do it. In the view I can pull an instance of the object from the database and tell that to save, but it's not the same as the instance of the object in Celery. The object doesn't check back to see if it's paused.
Polling the database from within the long-running class and task feels weird and impractical so I'm looking at sending my instance a message. I looked at using pubsub but I would prefer to use Django signals as it's already a Django project. I might be approaching this the wrong way.
Here's an example that does not work:
Models.py
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            if not self.is_paused:
                self.processed_files += 1
                time.sleep(1)
        # Task complete, let's save.
        self.save()
Views.py
def pause_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = True
    lrc.save()
    return HttpResponse(json.dumps({'is_paused': lrc.is_paused}))

def resume_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = False
    lrc.save()
    # Pretend this is a Celery task
    lrc.long_task()
So if I modify models.py to use signals, I can add these lines but it still does not quite work:
pause_signal = django.dispatch.Signal(providing_args=['is_paused'])

@django.dispatch.receiver(pause_signal)
def pause_callback(sender, **kwargs):
    if 'is_paused' in kwargs:
        sender.is_paused = kwargs['is_paused']
        sender.save()
That doesn't affect the instantiated class that's already running either. How can I tell the instance of my model running within the task to pause?
A Celery task runs in a separate process. Django signals implement the standard "observer" pattern, which works within a single thread, so there is no way to organize communication between threads or processes using signals. You need to reload the object from the database to know whether its properties have changed.
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def get_is_paused(self):
        # Re-read the row so a pause set by the view is seen by the running task.
        db_obj = LongRunningClass.objects.get(pk=self.pk)
        return db_obj.is_paused

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            if not self.get_is_paused():
                self.processed_files += 1
                time.sleep(1)
        # Task complete, let's save.
        self.save()
This is not a very good design: you would do better to move long_task elsewhere and operate on a freshly loaded LongRunningClass instance, but it will do the job. You could add some caching (memcached, for example) here if you don't want to hit your database so often.
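As a lighter alternative to a shared cache, the task can simply limit how often it re-reads the flag. The sketch below is an assumption-laden illustration, not part of the original answer: the PauseChecker class, its parameters and the five-second default are all invented here. It wraps any zero-argument callable that loads the flag and re-runs it at most once per interval.

```python
import time


class PauseChecker(object):
    """Cache a 'paused' flag and reload it at most once per min_interval seconds."""

    def __init__(self, load_is_paused, min_interval=5.0, clock=time.time):
        self._load = load_is_paused        # e.g. a function that queries the database
        self._min_interval = min_interval
        self._clock = clock                # injectable so the class is easy to test
        self._last_checked = None
        self._cached = False

    def is_paused(self):
        now = self._clock()
        stale = (self._last_checked is None
                 or now - self._last_checked >= self._min_interval)
        if stale:
            self._cached = bool(self._load())
            self._last_checked = now
        return self._cached
```

Inside long_task you could build checker = PauseChecker(self.get_is_paused) once and call checker.is_paused() in the loop. The trade-off is that a pause request may take up to min_interval seconds to be noticed.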
BTW: I'm not 100% sure, but you may have another design issue here. It is rather rare to have really long-running tasks with this kind of loop. Think about removing the loop from your program (you have queues!). Take a look:
# Schedule this task with celerybeat (every 2 minutes, say) to add XX files for processing every XX minutes
@celery.task
def scheduled_task(lr_pk):
    lr = LongRunningClass.objects.get(pk=lr_pk)
    if not lr.is_paused:
        remaining_files = lr.total_files - lr.processed_files
        for i in xrange(lr.files_per_iteration):
            process_file.delay(lr.pk, i)

@celery.task(rate_limit='1/m', queue='process_file') # processing each file
def process_file(lr_pk, i):
    # do something with i
    lr = LongRunningClass.objects.get(pk=lr_pk)
    lr.processed_files += 1
    lr.save()
To implement this solution you have to set up celerybeat and create a separate queue for this type of task. In return you get a lot of control over your program (rate limits, parallel execution), and your code will not hang in sleep(1). If you create another model for each file, you can track which files have been processed and which have not, handle errors, etc.
Take a look at celery.contrib.abortable -- this is an alternate base class for Celery tasks that implements a signal between caller and task to handle terminations, that could also be used to implement a "pause".
When caller calls abort(), a status is marked in the backend. Task calls self.is_aborted() to see if that special status has been set; and then implements whatever action is appropriate (terminate, pause, ignore etc.). The action is under the task's control; this is not automated task termination.
This could be used as-is if it is sensible for the specific task to interpret the ABORT signal as a request for a pause. Or you could extend the class to provide more signals, not just the existing ABORT.

SQLAlchemy session management in long-running process

Scenario:
A .NET-based application server (Wonderware IAS/System Platform) hosts automation objects that communicate with various equipment on the factory floor.
CPython is hosted inside this application server (using Python for .NET).
The automation objects have scripting functionality built-in (using a custom, .NET-based language). These scripts call Python functions.
The Python functions are part of a system to track Work-In-Progress on the factory floor. The purpose of the system is to track the produced widgets along the process, ensure that the widgets go through the process in the correct order, and check that certain conditions are met along the process. The widget production history and widget state is stored in a relational database, this is where SQLAlchemy plays its part.
For example, when a widget passes a scanner, the automation software triggers the following script (written in the application server's custom scripting language):
' widget_id and scanner_id provided by automation object
' ExecFunction() takes care of calling a CPython function
retval = ExecFunction("WidgetScanned", widget_id, scanner_id);
' if the python function raises an Exception, ErrorOccured will be true
' in this case, any errors should cause the production line to stop.
if (retval.ErrorOccured) then
ProductionLine.Running = False;
InformationBoard.DisplayText = "ERROR: " + retval.Exception.Message;
InformationBoard.SoundAlarm = True
end if;
The script calls the WidgetScanned python function:
# pywip/functions.py
from pywip.database import session
from pywip.model import Widget, WidgetHistoryItem
from pywip import validation, StatusMessage
from datetime import datetime

def WidgetScanned(widget_id, scanner_id):
    widget = session.query(Widget).get(widget_id)
    validation.validate_widget_passed_scanner(widget, scanner) # raises exception on error
    widget.history.append(WidgetHistoryItem(timestamp=datetime.now(), action=u"SCANNED", scanner_id=scanner_id))
    widget.last_scanner = scanner_id
    widget.last_update = datetime.now()
    return StatusMessage("OK")

# ... there are a dozen similar functions
My question is: How do I best manage SQLAlchemy sessions in this scenario? The application server is a long-running process, typically running months between restarts. The application server is single-threaded.
Currently, I do it the following way:
I apply a decorator to the functions I make available to the application server:
# pywip/iasfunctions.py
from pywip import functions
from pywip.database import session

def ias_session_handling(func):
    def _ias_session_handling(*args, **kwargs):
        try:
            retval = func(*args, **kwargs)
            session.commit()
            return retval
        except:
            session.rollback()
            raise
    return _ias_session_handling

# ... actually I populate this module with decorated versions of all the functions in pywip.functions dynamically
WidgetScanned = ias_session_handling(functions.WidgetScanned)
Question: Is the decorator above suitable for handling sessions in a long-running process? Should I call session.remove()?
The SQLAlchemy session object is a scoped session:
# pywip/database.py
from sqlalchemy.orm import scoped_session, sessionmaker
session = scoped_session(sessionmaker())
I want to keep the session management out of the basic functions. For two reasons:
There is another family of functions, sequence functions. The sequence functions call several of the basic functions. One sequence function should equal one database transaction.
I need to be able to use the library from other environments. a) From a TurboGears web application. In that case, session management is done by TurboGears. b) From an IPython shell. In that case, commit/rollback will be explicit.
(I am truly sorry for the long question. But I felt I needed to explain the scenario. Perhaps not necessary?)
The described decorator is suitable for long running applications, but you can run into trouble if you accidentally share objects between requests. To make the errors appear earlier and not corrupt anything it is better to discard the session with session.remove().
try:
    try:
        retval = func(*args, **kwargs)
        session.commit()
        return retval
    except:
        session.rollback()
        raise
finally:
    session.remove()
Or if you can use the with context manager:
try:
    with session.registry().transaction:
        return func(*args, **kwargs)
finally:
    session.remove()
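Put together as a complete decorator, the first variant above might look like the following. This is a sketch under assumptions: the factory name session_handling is invented, and the session is passed in explicitly (rather than imported from pywip.database) so that the same wrapper can be exercised with any object offering commit(), rollback() and remove().

```python
import functools


def session_handling(session):
    """Build a decorator that commits on success, rolls back on error,
    and always discards the scoped session afterwards."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                retval = func(*args, **kwargs)
                session.commit()
                return retval
            except BaseException:
                # Same reach as the bare except in the fragment above.
                session.rollback()
                raise
            finally:
                # Runs after commit or rollback, so no objects leak into the next call.
                session.remove()
        return wrapper
    return decorator
```

Since the sequence functions call several basic functions inside one transaction, only the outermost entry point should be wrapped; the inner functions stay undecorated, which keeps the library usable from TurboGears or an IPython shell.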
By the way, you might want to use .with_lockmode('update') on the query so your validate doesn't run on stale data.
Ask your Wonderware administrator to give you access to the Wonderware Historian; you can then track the values of the tags fairly easily via MSSQL calls over SQLAlchemy, polling every so often.
Another option is to use the ArchestrA toolkit to listen for internal tag updates, with a server deployed as a platform in the galaxy that you can listen from.
