Notify browser/page after long task has ended - python

I am using a Python web server (CherryPy), but I guess the question is more open and fairly general. At the moment, I have an Ajax call through jQuery's load() on a button click, which triggers some computation that ends in file generation.
As soon as the processing starts in a background thread, my load() call puts links on the page to the files that will eventually be generated on the server. There are several files to be generated, and the whole process can take minutes. How would one display the links only as the files become available, progressively, file by file? At the moment, the links are dead until the files actually exist behind them, and I have no way of telling the user when a link becomes live.
UPDATE: Thanks JB Nizet. Now, could anyone advise on writing thread-safe data structures in Python? I don't know much about the subject and don't know where to get started.

Poll the server to get the latest generated files (or the complete list of generated files) every n seconds, and stop the polling once the list is complete, or once the first ajax query (the one which starts the generation process) has completed.
The thread which generates the file should make the list of generated files available in a shared, thread-safe, data-structure.
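For illustration, here is a minimal sketch of both halves with CherryPy and the standard library (the class name, the /progress endpoint and the generated file names are made up for the example, not something from the question): a background thread appends each finished file name to a list guarded by a threading.Lock, and a small JSON endpoint exposes that list so the page can poll it and enable the links one by one.

import json
import threading

import cherrypy

class FileGenerator:
    def __init__(self):
        self._lock = threading.Lock()     # protects the shared list
        self._files = []                  # names of the files generated so far
        self._done = False

    def _generate(self, count):
        for i in range(count):
            name = 'result_%d.csv' % i
            # ... the long computation that actually writes the file goes here ...
            with self._lock:
                self._files.append(name)
        with self._lock:
            self._done = True

    @cherrypy.expose
    def start(self, count=3):
        threading.Thread(target=self._generate, args=(int(count),),
                         daemon=True).start()
        return 'started'

    @cherrypy.expose
    def progress(self):
        # The page polls this every few seconds and enables links as they appear.
        with self._lock:
            payload = {'files': list(self._files), 'done': self._done}
        cherrypy.response.headers['Content-Type'] = 'application/json'
        return json.dumps(payload)

if __name__ == '__main__':
    cherrypy.quickstart(FileGenerator())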

Related

Heroku - downloaded files are taking over 30 seconds to process with AWS

So I am using AWS as a cloud for my website. Its main purpose is to be a storage unit (S3). Everything works great until I have a large file (5 MB or 7 MB) that exceeds Heroku's 30-second time limit and triggers an H12 error.
s3.Object(BUCKET, file_full_name).put(Body=file_to_put)
The problem starts there: this is where I write the file to the cloud, and because it takes too long to write, the site keeps trying to load and never finishes. file_to_put is of bytes type. How can I fix this so I can upload larger files to the cloud?
Note: I also need to read the file, but first I need to fix this problem.
Backend framework: Flask
This is where worker process types and task queues come in (so you can use Celery + Redis with Flask, or something similar).
Basically, you queue up the task of writing the file in a task queue (say Redis), and your web process returns 200 OK to the website visitor immediately. In the meantime, your worker process picks the task up from the queue and starts performing the time-consuming work (writing the file to S3).
On the front end, you'll have to ask the visitor to "come back after some time" or show a wait "spinner" or something that indicates to the visitor that the file is not available yet. Once the file is written, you can send a signal to refresh the page, or you can use JavaScript on the web page to check whether the file is ready, say, every second, or simply ask the visitor to refresh the page after a minute or so.
I know all this might sound complicated, but this is the way it is done. Your web process shouldn't be waiting on long running tasks.
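As a rough sketch of that split (Flask + Celery with a Redis broker and boto3; the bucket name, module layout and the /upload route are placeholders, not taken from the question):

# tasks.py -- assumes a Redis broker running on localhost
import os

import boto3
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379/0')
s3 = boto3.resource('s3')

@celery_app.task
def upload_to_s3(bucket, key, path):
    # Runs in the worker process, so the web request never waits on S3.
    with open(path, 'rb') as fh:
        s3.Object(bucket, key).put(Body=fh.read())
    os.remove(path)

# app.py
import os
import tempfile

from flask import Flask, request
from tasks import upload_to_s3

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    f = request.files['file']
    tmp_path = os.path.join(tempfile.gettempdir(), f.filename)
    f.save(tmp_path)                      # keep large payloads out of the broker
    upload_to_s3.delay('my-bucket', f.filename, tmp_path)
    return 'Upload queued, check back shortly', 202

The worker runs as a separate process, started with celery -A tasks worker, so the web dyno stays well under Heroku's 30-second limit.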

Software Paradigm for Pushing Data Through a System

tl;dr: I wanted your feedback on whether the correct software design pattern to use here would be the Push/Pull Pipeline pattern.
Details:
Let's say I have several software algorithms/blocks which process data coming into a software system:
[Download Data] --> [Pre Process Data] --> [ML Classification] --> [Post Results]
The download data block simply loiters until midnight when new data is available and then downloads new data. The pre-process data simply loiters until newly available downloaded data is present, and then preprocesses the data. The Machine Learning (ML) Classification block simply loiters until new data is available to classify, etc.
The entire system seems to be event driven and I think fits the push/pull paradigm perfectly?
The [Download Data] block would be a producer? The consumers would be all the subsequent blocks, with the exception of the [Post Results] block, which would be a results collector?
Producer = pull
Consumer = pull then push
result collector = pull
I'm working within a python framework. This implementation looked ideal:
https://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/patterns/pushpull.html
https://github.com/ashishrv/pyzmqnotes
Push/Pull Pipeline Pattern
I'm totally open to using another software paradigm other than push/pull if I've missed the mark here. I'm also open to using another repo as well.
Thanks in advance for your help with the above!
I've done similar pipelines many, many times and very much like to break them into blocks like that. Why? Mainly for automatic recovery from any errors. If something gets delayed, it will auto-recover the next hour. If something needs to be fixed mid-pipeline, fix it and name it so it gets picked up next cycle. (That, and the fact that smaller blocks are easier to design, build, and test.)
For example, your [Download Data] should run every hour to look for waiting data: if none, go back to sleep; if some, download it to a file with a name containing a timestamp and state: 2020-0103T2153.downloaded.json. [Pre Process Data] should run every hour to look for files named *.downloaded.json: if none, go back to sleep; if one or more, pre-processes each in increasing timestamp order with output to <same-timestamp>.pre-processed.json. Etc, etc for each step.
Doing it this way meant many unplanned events auto-recovered and nobody would know unless they looked in the log files (you should log each step so you know what happened). Easy to sleep at night :)
In these scenarios, the event driving this is just time-of-day via crontab. When "awoken", each step in the pipeline just looks to see if it has any work waiting for it. Trying to make the file-creation event initiate things was not simple, especially if you need to re-initiate things (you would need to re-create the file).
I wouldn't use a message queue as that's more complicated and more suited when you have to handle incoming messages as they arrive. Your case is more simple batch file processing so keep it simple and sleep at night.
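As a rough sketch of one such stage run hourly from crontab (the preprocess() function and the .done rename are placeholders; the file naming follows the convention above):

# pre_process.py -- e.g. crontab entry: 0 * * * * python pre_process.py
import glob
import json
import os

def preprocess(records):
    # placeholder for the real pre-processing logic
    return records

def main():
    # Oldest first, so the timestamp in the name keeps the pipeline ordered.
    for path in sorted(glob.glob('*.downloaded.json')):
        with open(path) as fh:
            records = json.load(fh)
        out_path = path.replace('.downloaded.json', '.pre-processed.json')
        with open(out_path, 'w') as fh:
            json.dump(preprocess(records), fh)
        os.rename(path, path + '.done')       # so the next run skips it

if __name__ == '__main__':
    main()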

How to show a 'processing' or 'in progress' view while pyramid is running a process?

I've got a simple pyramid app up and running, most of the views are a fairly thin wrapper around an sqlite database, with forms thrown in to edit/add some information.
A couple of times a month a new chunk of data will need to be added to this system (by csv import). The data is saved in an SQL table (the whole process right till commit takes about 4 seconds).
Every time a new chunk of data is uploaded, this triggers a recalculation of other tables in the database. The recalculation process takes a fairly long time (about 21-50 seconds for a month's worth of data).
Currently I just let the browser/client sit there waiting for the process to finish, but I do foresee the calculation process taking more and more time as the system gets more usage. From a UI perspective, this obviously looks like a hung process.
What can I do to indicate to the user:
That the long wait is normal/expected?
How MUCH longer they should have to wait (progress bar etc.)?
Note: I'm not asking about long-polling or websockets here, as this isn't really an interactive application and based on my basic knowledge websockets/async are overkill for my purposes.
I guess a follow-on question at this point: am I doing the wrong thing by running processes in my view functions? I hardly ever see that being done in examples/tutorials around the web. Am I supposed to be using Celery or similar in this situation?
You're right, doing long calculations in a view function is generally frowned upon - I mean, if it's a typical website with random visitors who are able to hang a webserver thread for a minute, then it's a recipe for a DoS vulnerability. But in some situations (internal website, few users, only the admin has access to the "upload csv" form) you may get away with it. In fact, I used to have maintenance scripts which ran for hours :)
The trick here is to avoid browser timeouts - at the moment your client sends the data to the server and just sits there waiting for any reply, without any idea whether its request is being processed or not. Generally, at about 60 seconds the browser (or proxy, or frontend webserver) may become impatient and close the connection. Your server process will then get an error trying to write anything to the already-closed connection and will crash/raise an error.
To prevent this from happening the server needs to write something to the connection periodically, so the client sees that the server is alive and won't close the connection.
"Normal" Pyramid templates are buffered - i.e. the output is not sent to the client until the whole template to generated. Because of that you need to directly use response.app_iter / response.body_file and output some data there periodically.
As an example, you can duplicate the Todo List Application in One File example from Pyramid Cookbook and replace the new_view function with the following code (which itself has been borrowed from this question):
from pyramid.response import Response
from pyramid.view import view_config
import time


@view_config(route_name='new', request_method='GET', renderer='new.mako')
def new_view(request):
    return {}


@view_config(route_name='new', request_method='POST')
def iter_test(request):
    if request.POST.get('name'):
        request.db.execute(
            'insert into tasks (name, closed) values (?, ?)',
            [request.POST['name'], 0])
        request.db.commit()

    def test_iter():
        # app_iter must yield bytes; each chunk is flushed to the client as it
        # is produced, which keeps the connection alive during the long task.
        i = 0
        while True:
            i += 1
            if i == 5:
                yield b'<p>Done! Click here to see the results</p>'
                return
            yield ('<p>working %s...</p>' % i).encode('utf-8')
            print(time.time())
            time.sleep(1)

    return Response(app_iter=test_iter())
(Of course, this solution is not too fancy UI-wise, but you said you didn't want to mess with websockets and Celery.)
So is the long-running process triggered by browser action? I.e., the user is uploading the CSV that gets processed, and then the view is doing the processing right there? For short-ish running browser processes I've used a loading indicator via jQuery or JavaScript, basically popping a modal animated spinner or something while a process runs, then hiding the spinner when it completes.
But if you're getting into longer and longer processes, I think you should really look at some sort of background processing that will offload it from the UI. It doesn't have to be a message-based worker: even something like the end user uploading the file and a "to be processed" entry being set in a database would work. Then you could have a Pyramid script scheduled periodically in the background, polling the status table and running anything it finds. You can move your file processing from the view into a separate method that can be called from the command-line script. Then, when the processing is finished, it can update the status table to indicate it is done, and that feedback could be presented back to the user somewhere, without blocking their UI the whole time.
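For example, the scheduled script could be as simple as the following sketch, run from cron or a Pyramid console script (the uploads table, its columns and process_csv() are illustrative names, not anything from the question):

# process_pending.py -- a sketch of the background poller
import sqlite3

def process_csv(path):
    # call the same code the view used to run inline
    ...

def main():
    db = sqlite3.connect('app.db')
    rows = db.execute(
        "SELECT id, csv_path FROM uploads WHERE status = 'pending'").fetchall()
    for upload_id, csv_path in rows:
        db.execute("UPDATE uploads SET status = 'processing' WHERE id = ?",
                   (upload_id,))
        db.commit()
        process_csv(csv_path)
        db.execute("UPDATE uploads SET status = 'done' WHERE id = ?",
                   (upload_id,))
        db.commit()

if __name__ == '__main__':
    main()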

Patterns for waiting for a server side script to finish in flask. How to handle errors and premature termination

I'm writing a web app in Flask that will be used to manipulate some data server-side. Basically, it accepts a path to a zip, decompresses it and places it in a temporary folder to be cleaned up later, asks you which folders in the zip to use, and then runs a script over the files in that folder. The script could take a long time to run, but I'm not sure how to communicate this to the user. I don't really want to use JavaScript, as I don't know any.
Also, this process may fail. How do I communicate the failure back to the Flask app, and how can I make sure that, if the Python script does fail, it cleans up after itself by deleting the uncompressed temporary files?
Are there any good patterns, examples or packages that handle such tasks in Flask?
Celery is the canonical answer to all "how do I manage long-running jobs in Python web apps" questions.
But it's going to be difficult to avoid JavaScript. The basic idea is to offload the job to a separate process - i.e. Celery - and set a flag somewhere (probably in the database) when it's finished. But if you want your front end to know when that flag is set, it's probably going to have to make repeated calls to the back end to check the status, and that does mean JavaScript.
If you use a library like jQuery though, it's only going to be a few lines. It's worth learning in any case.
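To make that concrete on the Python side, here is a hedged sketch assuming your long job is a Celery task called process_zip in a tasks module (both names are made up); the front end only needs a few lines of jQuery to poll /status/<task_id> until the state is SUCCESS or FAILURE:

from flask import Flask, jsonify
from tasks import celery_app, process_zip

app = Flask(__name__)

@app.route('/start/<path:zip_path>')
def start(zip_path):
    # Kick off the job and hand the task id back to the page.
    task = process_zip.delay(zip_path)
    return jsonify(task_id=task.id)

@app.route('/status/<task_id>')
def status(task_id):
    # The page polls this URL until the state is SUCCESS or FAILURE; on
    # failure, result.info holds the exception raised by the task.
    result = celery_app.AsyncResult(task_id)
    return jsonify(state=result.state,
                   info=str(result.info) if result.info else None)

Cleanup of the temporary folder is best handled inside the task itself, in a try/finally around the processing, so the files are removed whether the script succeeds or fails.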

Save a deque in a text file

I am writing a crawler in Python. So that Ctrl+C does not cause my crawler to start over on the next run, I need to save the processing deque in a text file (one item per line) and update it every iteration; the update operation needs to be super fast. In order not to reinvent the wheel, I am asking whether there is an established module for this.
As an alternative, you could set up an exit function, and pickle the deque on exit.
Exit function
Pickle
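A minimal sketch of that combination using the standard-library atexit and pickle modules (the file name and variable names are just for the example):

import atexit
import pickle
from collections import deque

QUEUE_FILE = 'queue.pickle'

try:
    with open(QUEUE_FILE, 'rb') as fh:
        queue = pickle.load(fh)        # resume where the previous run stopped
except (OSError, EOFError):
    queue = deque()

def save_queue():
    with open(QUEUE_FILE, 'wb') as fh:
        pickle.dump(queue, fh)

# Runs on normal interpreter exit, including an uncaught Ctrl+C, but not if
# the process is killed by a signal or via os._exit().
atexit.register(save_queue)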
You should be able to use pickle to serialize your lists.
I am not sure if I understood the question right; I am just curious, so here are a few questions and suggestions:
Are you planning to catch the Ctrl+C interrupt and save the deque at that point?
What happens if the crawler crashes for some arbitrary reason like an unhandled exception or a crash? You lose the queue status and start over again?
From the documentation:
Note: The exit function is not called when the program is killed by a signal, when a Python fatal internal error is detected, or when os._exit() is called.
What happens when you visit the same URI again? Are you maintaining a visited list or something?
I think you should be maintaining some kind of visit and session information / status for each URI you crawl.
You can use the visit information to decide to crawl a URI or not when you visit the same URI next time.
The other info - session information - for the last session with that URI will help in picking up only the incremental stuff, and if the page has not changed there is no need to pick it up again, saving some DB I/O costs, duplicates, etc.
That way you won't have to worry about the Ctrl+C or a crash. If the crawler goes down for any reason, let's say after crawling 60K posts with 40K more left, then the next time the crawler fills the queue - though the queue may be huge - it can check whether it has already visited a URI and what the state of the page was when it was crawled, as an optimization: does the page require a new pick-up because it has changed, or not?
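As a small illustration of that bookkeeping (the dictionary layout, the 24-hour threshold and the hash choice are arbitrary; in practice you would persist this in a database):

import hashlib
import time

# visited[uri] -> {'last_crawl': epoch seconds, 'digest': hash of the body}
visited = {}

def should_fetch(uri, min_age=24 * 3600):
    info = visited.get(uri)
    return info is None or time.time() - info['last_crawl'] > min_age

def record_visit(uri, body):
    digest = hashlib.sha1(body.encode('utf-8')).hexdigest()
    changed = visited.get(uri, {}).get('digest') != digest
    visited[uri] = {'last_crawl': time.time(), 'digest': digest}
    return changed      # only re-process the page if its content changed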
I hope that is of some help.
Some things that come to my mind:
leave the file handle open (don't close the file every time you write something)
or write to the file every n items and catch a termination signal to write out the items that have not been written yet (a rough sketch of this follows below)
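A rough sketch of that second option (the file name, the flush threshold and the plain-text format are arbitrary choices):

import signal
import sys

FLUSH_EVERY = 100
buffer = []
out = open('queue.txt', 'a')          # keep the handle open between writes

def flush():
    out.write(''.join(item + '\n' for item in buffer))
    out.flush()
    buffer.clear()

def handle_exit(signum, frame):
    flush()                           # write whatever has not been written yet
    out.close()
    sys.exit(0)

signal.signal(signal.SIGINT, handle_exit)     # Ctrl+C
signal.signal(signal.SIGTERM, handle_exit)

def save_item(item):
    buffer.append(item)
    if len(buffer) >= FLUSH_EVERY:
        flush()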
