I am using AWS on my website, mainly as a storage unit (S3). Everything works great until I upload a large file (5 MB or 7 MB) that exceeds Heroku's 30-second time limit and triggers an H12 error.
s3.Object(BUCKET, file_full_name).put(Body=file_to_put)
The problem starts here, where I write the file to the cloud. Because it takes too long to write, the site keeps trying to load and never finishes. file_to_put is of type bytes. How can I fix this so I can upload larger files to the cloud?
Note: I also need to read the file, but first I need to fix this problem.
Backend framework: Flask
This is where worker process types and task queues come in (so you can use Celery + Redis with Flask, or something similar).
Basically, you queue up the task of writing the file in a task queue (say, Redis-backed) and your web process returns 200 OK to the website visitor immediately. In the meantime, your worker process picks the task off the queue and performs the time-consuming work (writing the file to S3).
On the front-end, you'll have to ask the visitor to "come back after some time", or show a wait "spinner" or something else that indicates the file is not available yet. Once the file is written, you can send a signal to refresh the page, use JavaScript on the page to check every second or so whether the file is ready, or simply ask the visitor to refresh the page after a minute.
I know all this might sound complicated, but this is how it is done: your web process shouldn't be waiting on long-running tasks.
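Here is a minimal sketch of that split, assuming Celery with a Redis broker and boto3; the bucket, route, and task names are illustrative, not from the original post.

import base64

import boto3
from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery = Celery(app.name, broker="redis://localhost:6379/0")

@celery.task
def upload_to_s3(bucket, key, b64_data):
    # The slow S3 write now happens in the worker process, not in the web dyno.
    s3 = boto3.resource("s3")
    s3.Object(bucket, key).put(Body=base64.b64decode(b64_data))

@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files["file"]
    # Celery's default JSON serializer can't carry raw bytes, so base64-encode them;
    # for really big files a presigned S3 upload from the browser would be better.
    payload = base64.b64encode(uploaded.read()).decode("ascii")
    # Queue the work and return immediately, well under Heroku's 30-second limit.
    upload_to_s3.delay("my-bucket", uploaded.filename, payload)
    return jsonify({"status": "queued"}), 202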
Short description:
Dataflow is processing the same input element many times, even at the same time in parallel (so this is not Dataflow's built-in fail-and-retry mechanism, because the previous attempt didn't fail).
Long description:
The pipeline receives a Pub/Sub message that contains the path to a GCS file.
In the next step (a DoFn class) this file is opened and read line by line, so for very big files this is a long process and can take up to 1 hour per file.
Very often these big files are being processed at the same time by several workers.
I can see from the log messages that the first process has already loaded 500k rows, another one 300k rows, and a third has just started, and all of them relate to the same file and are based on the same Pub/Sub message (the same message_id).
The Pub/Sub queue chart also looks bad: those messages are never acked, so the unacked-messages chart does not decrease.
Any idea what is going on? Have you experienced something similar?
I want to underline that this is not an issue related to the fail-and-retry process.
If the first process fails and a second one starts for the same file, that is fine and expected.
What is unexpected is those two processes being alive at the same time.
When a file is added to Cloud Storage and an automatic notification is fired to Pub/Sub, multiple notification types can be sent:
- OBJECT_FINALIZE: Sent when a new object (or a new generation of an existing object) is successfully created in the bucket. This includes copying or rewriting an existing object. A failed upload does not trigger this event.
- OBJECT_METADATA_UPDATE: Sent when the metadata of an existing object changes.
...
See the pubsub-notifications doc.
You can access the attributes of the PubsubMessage in Beam and keep only the messages whose eventType attribute has the value OBJECT_FINALIZE.
In this case, only one message per file will be treated by your Dataflow job, and the DoFn will then open the file and process its elements only once.
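For example, a sketch in the Beam Python SDK (the topic name is a placeholder), assuming the GCS notifications are read together with their attributes:

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    paths = (
        p
        | ReadFromPubSub(topic="projects/my-project/topics/gcs-notifications",
                         with_attributes=True)
        # Keep only OBJECT_FINALIZE notifications so each new file is handled once.
        | beam.Filter(lambda msg: msg.attributes.get("eventType") == "OBJECT_FINALIZE")
        # Rebuild the gs:// path from the notification attributes.
        | beam.Map(lambda msg: "gs://%s/%s" % (msg.attributes["bucketId"],
                                               msg.attributes["objectId"]))
    )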
Here is a likely possibility:
- Reading the file is being "fused" with reading the message from Cloud Pub/Sub, so the hour of processing happens before the result is saved to Dataflow's internal storage and the message can be acked.
- Since your processing is so long, Cloud Pub/Sub will deliver the message again.
- There is no way for Dataflow to cancel your DoFn processing, so you will see both of them processing at the same time, even though one of them is expired and will be rejected when processing is complete.
What you really want is for the large file reads to be split and parallelized. Beam can do this easily (and currently I believe is the only framework that can). You pass the filenames to TextIO.readFiles() transform and the reading of each large file will be split and performed in parallel, and there will be enough checkpointing that the pubsub message will be ACKed before it expires.
One thing you might try is to put a Reshuffle in between the PubsubIO.read() and your processing.
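Here is a sketch in the Beam Python SDK, where ReadAllFromText plays the role of the Java TextIO.readFiles() mentioned above; the topic name is a placeholder.

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    lines = (
        p
        | ReadFromPubSub(topic="projects/my-project/topics/gcs-notifications")
        | beam.Map(lambda payload: payload.decode("utf-8"))  # message body is the gs:// path
        # Reshuffle breaks fusion, so the Pub/Sub read can be checkpointed and the
        # message acked before the long file read starts.
        | beam.Reshuffle()
        # The read of each large file is then split and parallelized by the runner.
        | ReadAllFromText()
    )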
I deployed a Django app on Heroku. I have a function (inside views) in my app that takes some time (3 to 5 minutes) before it returns.
The problem is that this function doesn't return when the app is deployed to Heroku. On my PC it works fine.
Heroku is not giving me useful feedback. There is no 'timeout' or anything in the logs.
Three to five minutes is way too long for a request to take. Heroku will kill such requests:
Best practice is to get the response time of your web application to be under 500ms, this will free up the application for more requests and deliver a high quality user experience to your visitors. Occasionally a web request may hang or take an excessive amount of time to process by your application. When this happens the router will terminate the request if it takes longer than 30 seconds to complete.
I'm not sure why you aren't seeing timeouts in the logs, but if you truly need that much time to compute something you'll need to do it asynchronously.
There are lots of ways to do that, e.g. you could queue the work and then respond immediately with a "loading" state, then poll the back-end and update the view when the result is ready.
Start by reading Worker Dynos, Background Jobs and Queueing and then decide how you wish to proceed. We can't tell you the "right" way of doing this; it's something you need to decide about your application.
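As a minimal sketch of the queue-then-poll idea, assuming Celery is already configured for the project (all names here are illustrative, not from the original question):

from celery import shared_task
from django.http import JsonResponse

@shared_task
def crunch_numbers(payload):
    # ... the 3-5 minute computation that used to live in the view ...
    return {"done": True}

def start_job(request):
    result = crunch_numbers.delay(request.POST.dict())
    # Respond immediately; the browser shows a "loading" state and polls job_status.
    return JsonResponse({"task_id": result.id}, status=202)

def job_status(request, task_id):
    result = crunch_numbers.AsyncResult(task_id)
    if result.ready():
        return JsonResponse({"state": "done", "result": result.result})
    return JsonResponse({"state": "pending"})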
I've got a simple Pyramid app up and running; most of the views are fairly thin wrappers around an SQLite database, with forms thrown in to edit/add some information.
A couple of times a month a new chunk of data needs to be added to this system (by CSV import). The data is saved in an SQL table (the whole process, right up to the commit, takes about 4 seconds).
Every time a new chunk of data is uploaded, it triggers a recalculation of other tables in the database. The recalculation takes fairly long (about 21-50 seconds for a month's worth of data).
Currently I just let the browser/client sit there waiting for the process to finish, but I foresee the calculation taking more and more time as the system gets more usage. From a UI perspective, this obviously looks like a hung process.
What can I do to indicate to the user:
- that the long wait is normal/expected?
- how much longer they should have to wait (progress bar, etc.)?
Note: I'm not asking about long-polling or websockets here, as this isn't really an interactive application and based on my basic knowledge websockets/async are overkill for my purposes.
I guess a follow-on question at this point: am I doing the wrong thing by running these processes in my view functions? I hardly ever see that being done in examples/tutorials around the web. Am I supposed to be using Celery or something similar in this situation?
You're right, doing long calculations in a view function is generally frowned upon. I mean, if it's a typical website with random visitors who are able to hang a webserver thread for a minute, then it's a recipe for a DoS vulnerability. But in some situations (internal website, few users, only the admin has access to the "upload csv" form) you may get away with it. In fact, I used to have maintenance scripts which ran for hours :)
The trick here is to avoid browser timeouts. At the moment your client sends the data to the server and just sits there waiting for a reply, without any idea whether the request is being processed or not. Generally, at around 60 seconds the browser (or a proxy, or the frontend webserver) may become impatient and close the connection. Your server process will then get an error when trying to write anything to the already-closed connection and will crash/raise an error.
To prevent this from happening the server needs to write something to the connection periodically, so the client sees that the server is alive and won't close the connection.
"Normal" Pyramid templates are buffered - i.e. the output is not sent to the client until the whole template to generated. Because of that you need to directly use response.app_iter / response.body_file and output some data there periodically.
As an example, you can duplicate the Todo List Application in One File example from Pyramid Cookbook and replace the new_view function with the following code (which itself has been borrowed from this question):
from pyramid.response import Response
from pyramid.view import view_config


@view_config(route_name='new', request_method='GET', renderer='new.mako')
def new_view(request):
    return {}


@view_config(route_name='new', request_method='POST')
def iter_test(request):
    import time
    if request.POST.get('name'):
        request.db.execute(
            'insert into tasks (name, closed) values (?, ?)',
            [request.POST['name'], 0])
        request.db.commit()

    def test_iter():
        i = 0
        while True:
            i += 1
            if i == 5:
                yield b'<p>Done! Click here to see the results</p>'
                return  # a plain return ends the generator and the streamed response
            # WSGI servers expect bytes, so encode each chunk before yielding it
            yield ('<p>working %s...</p>' % i).encode('utf-8')
            print(time.time())
            time.sleep(1)

    return Response(app_iter=test_iter())
(Of course, this solution is not too fancy UI-wise, but you said you didn't want to mess with websockets and Celery.)
So, is the long-running process triggered by a browser action? I.e., the user uploads the CSV, which gets processed, and the view does the processing right there? For short-ish browser-triggered processes I've used a loading indicator via jQuery or JavaScript, basically popping up a modal animated spinner while the process runs and hiding it when it completes.
But if you're getting into longer and longer processes, I think you should really look at some sort of background processing that offloads the work from the UI. It doesn't have to be a message-based worker; it can be as simple as the end user uploading the file and a "to be processed" entry being set in a database. Then you could have a Pyramid script scheduled to run periodically in the background, polling the status table and processing anything it finds. You can move the file processing that currently lives in the view into a separate method, and that method can be called from the command-line script. When the processing is finished, the script updates the status table to indicate it is done, and that feedback can be presented back to the user somewhere, without blocking their UI the whole time. (A rough sketch of such a polling script follows.)
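Here is a rough sketch of the "status table + scheduled script" idea; the table, column, and function names are made-up stand-ins for the app's real processing code.

import sqlite3
import time

def recalculate_tables(csv_path):
    # Placeholder: the slow recalculation currently done in the view would move here.
    pass

def process_pending(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, csv_path FROM import_jobs WHERE status = 'pending'").fetchall()
        for job_id, csv_path in rows:
            conn.execute("UPDATE import_jobs SET status = 'running' WHERE id = ?", (job_id,))
            conn.commit()
            recalculate_tables(csv_path)
            conn.execute("UPDATE import_jobs SET status = 'done' WHERE id = ?", (job_id,))
            conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    # Run this from cron, or loop with a sleep; the web view only inserts 'pending'
    # rows and reads the status back to show the user.
    while True:
        process_pending()
        time.sleep(30)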
I have an app on GAE that takes CSV input from a web form and stores it to a blob, does some work using input from the CSV file to obtain new information, then uses csv.writer on self.response.out to write a new CSV file and prompt the user to download it. It works well, but my problem is that if it takes over 60 seconds it times out. I've tried to set up the "do some work" part as a task in the task queue, and it would work, except that I can't make the user wait while it is running, there's no way to automatically call the POST that writes out the new CSV file when the task queue completes, and having the user periodically push a button to see if it is done is less than optimal.
Is there a better solution to a problem like this, other than using the task queue and having the user manually push a button periodically to see if the task is complete?
You have many options:
Use a timer in your client to check periodically (e.g. every 15 seconds) whether the file is ready. This is the simplest option and requires only a few lines of code (see the sketch after this list).
Use the Channel API. It's elegant, but it's overkill unless you face similar problems frequently.
Email the results to the user.
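A rough server-side sketch of the first option (model and handler names are made up); a client-side timer would hit /status/<job_id> every 15 seconds or so and show the download link once done is true.

import json
import webapp2
from google.appengine.ext import ndb

class ExportJob(ndb.Model):
    # The task-queue task sets done=True once it has finished writing the CSV.
    done = ndb.BooleanProperty(default=False)

class StatusHandler(webapp2.RequestHandler):
    def get(self, job_id):
        job = ExportJob.get_by_id(int(job_id))
        self.response.headers["Content-Type"] = "application/json"
        self.response.out.write(json.dumps({
            "done": bool(job and job.done),
            "download_url": "/download/%s" % job_id if job and job.done else None,
        }))

app = webapp2.WSGIApplication([(r"/status/(\d+)", StatusHandler)])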
If your problem is the 60-second limit for requests, you could consider using App Engine Modules, which allow you to control the scaling type of a module/version (a sample module configuration is sketched below). Basically, there are three scaling types available.
Manual Scaling
Such a module runs continuously. Requests can run indefinitely.
Basic Scaling
Such a module creates an instance when the application receives a request. The instance will be turned down when the app becomes idle. Requests can run indefinitely.
Automatic Scaling
The same scaling policy that App Engine has used since its inception. It is based on request rate, response latencies, and other application metrics. There is a 60-second deadline for HTTP requests.
You can find more details here.
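As a sketch, a separate module configured for basic scaling might look like the following YAML (the module name and handler script are placeholders); requests routed to this module are not bound by the 60-second deadline.

# worker.yaml
module: worker
runtime: python27
api_version: 1
threadsafe: true

basic_scaling:
  max_instances: 2
  idle_timeout: 10m

handlers:
- url: /.*
  script: worker.app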
I am using a Python web server (CherryPy), but I guess the question is more open and fairly general. At the moment I have an Ajax call (through a jQuery load) on a button click that triggers some computation, ending in the generation of files.
At the moment, as soon as the processing starts in a background thread, my load call returns to the page the links to the files that will eventually be generated on the server. There are several files to generate, and the whole process can take minutes. How would one manage to display links to the files only as they become available, progressively, file by file? At the moment the links are dead until there are files behind them, and I have no way of telling the user when the links become live.
UPDATE: Thanks JB Nizet. Now, could anyone advise on writing thread-safe data structures in Python? I don't know much about the subject and don't know where to get started.
Poll the server to get the latest generated files (or the complete list of generated files) every n seconds, and stop polling once the list is complete, or once the first Ajax query (the one which starts the generation process) has completed.
The thread which generates the files should make the list of generated files available in a shared, thread-safe data structure.
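For instance, a minimal sketch assuming CherryPy: the generating thread appends finished file names to a lock-protected list, and a JSON endpoint exposes the current list for polling (all names here are illustrative).

import threading
import cherrypy

class ReadyFiles(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._names = []

    def add(self, filename):
        with self._lock:
            self._names.append(filename)

    def snapshot(self):
        with self._lock:
            return list(self._names)

READY = ReadyFiles()

class Api(object):
    @cherrypy.expose
    @cherrypy.tools.json_out()
    def ready_files(self):
        # The page polls this every few seconds and turns links live as names appear.
        return READY.snapshot()

def generate_files():
    for name in ("report1.csv", "report2.csv"):  # placeholders for the real generation
        # ... long computation that produces the file named `name` ...
        READY.add(name)

# The Ajax-triggered handler would kick off the work with:
# threading.Thread(target=generate_files).start()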