Django, sleep() pauses all processes, but only if no GET parameter? - python

Using Django (hosted by Webfaction), I have the following code
import time
def my_function(request):
time.sleep(10)
return HttpResponse("Done")
This is executed via Django when I go to my url, www.mysite.com
I enter the url twice, immediately after each other. The way I see it, both of these should finish after 10 seconds. However, the second call waits for the first one and finishes after 20 seconds.
If, however, I enter some dummy GET parameter, www.mysite.com?dummy=1 and www.mysite.com?dummy=2 then they both finish after 10 seconds. So it is possible for both of them to run simultaneously.
It's as though the scope of sleep() is somehow global?? Maybe entering a parameter makes them run as different processes instead of the same???
It is hosted by Webfaction. httpd.conf has:
KeepAlive Off
Listen 30961
MaxSpareThreads 3
MinSpareThreads 1
ServerLimit 1
SetEnvIf X-Forwarded-SSL on HTTPS=1
ThreadsPerChild 5
I do need to be able to use sleep() and trust that it isn't stopping everything. So, what's up and how to fix it?
Edit: Webfaction runs this using Apache.

As Gjordis pointed out, sleep will pause the current thread. I have looked at Webfaction and it looks like their are using WSGI for running the serving instance of Django. This means, every time a request comes in, Apache will look at how many worker processes (that are processes that each run a instance of Django) are currently running. If there are none/to view it will spawn additonally workers and hand the requests to them.
Here is what I think is happening in you situation:
first GET request for resource A comes in. Apache uses a running worker (or starts a new one)
the worker sleeps 10 seconds
during this, a new request for resource A comes in. Apache sees it is requesting the same resource and sends it to the same worker as for request A. I guess the assumption here is that a worker that recently processes a request for a specific resource it is more likely that the worker has some information cached/preprocessed/whatever so it can handle this request faster
this results in a 20 second block since there is only one worker that waits 2 times 10 seconds
This behavior makes complete sense 99% of the time so it's logical to do this by default.
However, if you change the requested resource for the second request (by adding GET parameter) Apache will assume that this is a different resource and will start another worker (since the first one is already "busy" (Apache can not know that you are not doing any hard work). Since there are now two worker, both waiting 10 seconds the total time goes down to 10 seconds.
Additionally I assume that something is **wrong** with your design. There are almost no cases which I can think of where it would be sensible to not respond to a HTTP request as fast as you can. After all, you want to serve as many requests as possible in the shortest amount of time, so sleeping 10 seconds is the most counterproductive thing you can do. I would recommend the you create a new question and state what you actual goal is that you are trying to achieve. I'm pretty sure there is a more sensible solution to this!

Assuming you run your Django-server just with run() , by default this makes a single threaded server. If you use sleep on a single threaded process, the whole application freezes for that sleep time.

It may simply be that your browser is queuing the second request to be performed only after the first one completes. If you are opening your URLs in the same browser, try using the two different ones (e.g. Firefox and Chrome), or try performing requests from the command line using wget or curl instead.

Related

Handling time consuming requests in Flask-UWSGI app

Am running an app with Flask , UWSGI and Nginx. My UWSGI is set to spawn out 4 parallel processes to handle multiple requests at the same time. Now I have one request that takes lot of time and that changes important data concerning the application. So, when one UWSGI process is processing that request and say all others are also busy, the fifth request would have to wait. The problem here is I cannot change this request to run in an offline mode as it changes important data and the user cannot simply remain unknown about it. What is the best way to handle this situation ?
As an option you can do the following:
Separate the heavy logic from the function which is being called
upon #route and move it into a separate place (a file, another
function, etc)
Introduce Celery to run that pieces of heavy logic
(it will be processed in a separate thread from the #route-decorated functions).
A quick way of doing this is using Redis as a message broker.
Schedule the time-consuming functions from your #route-decorated
functions in Celery (it is possible to pass parameters as well)
This way the HTTP requests won't be blocked for the complete function execution time.

how to write endless loop crawler in python?

EDITED:
I have a crawler.py that crawls certain sites every 10 minutes and sends me some emails regarding these site. The crawler is ready and working locally.
How can I adjust it so that the following two things will happen :
It will run in endless loop on the hosting that I'll upload it to?
Sometimes I will be able to stop it ( e.g. for debugging).
At first, I thought of doing endless loop e.g.
crawler.py:
while True:
doCarwling()
sleep(10 minutes)
However, according to answers I got below, this would be impossible since hosting providers kill processes after a while (just for the question sake, let's assume proccesses are killed every 30 min). Therefore, my endless loop process would be killed at some point.
Therefore, I have thought pf a different solution:
Lets assume that my crawler is located at "www.example.com\crawler.py" and each time it is accessed, it executes the function run():
run()
doCarwling()
sleep(10 minutes)
call URL "www.example.com\crawler.py"
Thus, there will be no endless loop. In fact, every time my crawler runs, it would also access the URL which will execute the same crawler again. Therefore, there would be no endless loop, no process with a long-running time, and my crawler will continue operating forever.
Will my idea work?
Are there any hidden drawbacks I haven't thought of?
Thanks!
Thanks
As you stated in the comments, you are running on a public shared server like GoDaddy and so on. Therefore cron is not available there and long running scripts are usually forbidden - your process would be killed even if you were using sleep.
Therefore, the only solution I see is to use an external server on which you have to control to connect to your public server and run the script, every 10 minutes. One solution could be using cron on your local machine to connect with wget or curl to a specific page on your host. **
Maybe you can find on-line services that allow running a script periodically, and use those, but I know none.
** Bonus: you can get the results directly as response without having to send yourself an email.
Update
So, in your updated question you propose yo use your script to call itself with an HTTP request. I thought of it before, but I didn't consider it in my previous answer because I believe it won't work (in general).
My concern is: will the server kill a script if the HTTP connection requesting it is closed before the script terminates?
In other words: if you open yoursite.com/script.py and it takes 60 seconds to run, and you close the connection with the server after 10 seconds, will the script run till its regular end?
I thought that the answer was obviously "no, the script will be killed", therefore that method would be useless, because you should guarantee that a script calling itself via a HTTP request stays alive longer than the called script. I did a little experiment using flask, and it proved me wrong:
from flask import Flask
app = Flask(__name__)
#app.route('/')
def hello_world():
import time
print('Script started...')
time.sleep(5)
print('5 seconds passed...')
time.sleep(5)
print('Script finished')
return 'Script finished'
if __name__ == '__main__':
app.run()
If I run this script and make an HTTP request to localhost:5000, and close the connection after 2 seconds, the scripts continues to run until the end and the messages are still printed.
Therefore, with flask, if you can do an asynchronous request to yourself, you should be able to have an "infinite loop" script.
I don't know the behavior on other servers, though. You should make a test.
Control
Assuming your server allows you to do a GET request and have the script running even if the connection is closed, you have few things to take care of, for example that your script still has to run fast enough to complete during the maximum server time allowance, and that to make your script run every 10 minutes, with a maximum allowance of 1 minute, you have to count every time 10 calls.
In addition, this mechanism has to be controlled, because you cannot interrupt it for debug as you requested. At least, not directly.
Therefore, I suggest you to use files: use a file to split your crawling in smaller steps, each capable to finish in less than one minute, and then continue again when the script is called again.
Use a file to count how many times the script is called, before actually doing the crawling. This is necessary if, for example, the script is allowed to live 90 seconds, but you want to crawl every 10 hours.
Use a file to control the script: store a boolean flag that you use to stop the recursion mechanism if you need to.
If you're using Linux you should just do a cron job for your script. Info: http://code.tutsplus.com/tutorials/scheduling-tasks-with-cron-jobs--net-8800
If you are running linux I would setup and upstart script http://upstart.ubuntu.com/getting-started.html to turn it into a service.
It offers a lot of advantages like:
-Starting at system boot
-Auto restart on crashes
-Manageable: service mycrawler restart
...
Or if you would prefer to have it run every 10 minutes forget about the endless loop and do a cronjob http://en.wikipedia.org/wiki/Cron

Do I need celery when I am using gevent?

I am working on a django web app that has functions (say for e.g. sync_files()) that take a long time to return. When I use gevent, my app does not block when sync_file() runs and other clients can connect and interact with the webapp just fine.
My goal is to have the webapp responsive to other clients and not block. I do not expect a zillion users to connect to my webapp (perhaps max 20 connections), and I do not want to set this up to become the next twitter. My app is running on a vps, so I need something light weight.
So in my case listed above, is it redundant to use celery when I am using gevent? Is there a specific advantage to using celery? I prefer not to use celery since it is yet another service that will be running on my machine.
edit: found out that celery can run the worker pool on gevent. I think I am a litle more unsure about the relationship between gevent & celery.
In short you do need a celery.
Even if you use gevent and have concurrency, the problem becomes request timeout. Lets say your task takes 10 minutes to run however the typical request timeout is about up to a minute. So what will happen if you trigger the task directly within a view is that the server will start processing it however after a minute a client (browser) will probably disconnect the connection since it will think the server is offline. As a result, your data can become corrupt since you cannot be guaranteed what will happen when connection will close. Celery solves this because it will trigger a background process which will process the task independent of the view. So the user will get the view response right away and at the same time the server will start processing the task. That is a correct pattern to handle any scenarios which require lots of processing.

Time out issues with chrome and flask

I have a web application which acts as an interface to an offsite server which runs a very long task. The user enters information and hits submit and then chrome waits for the response, and loads a new webpage when it receives it. However depending on the network, input of the user, the task can take a pretty long time and occasionally chrome loads a "no data received page" before the data is returned (though the task is still running).
Is there a way to put either a temporary page while my task is thinking or simply force chrome to continue waiting? Thanks in advance
While you could change your timeout on the server or other tricks to try to keep the page "alive", keep in mind that there might be other parts of the connection that you have no control over that could timeout the request (such as the timeout value of the browser, or any proxy between the browser and server, etc). Also, you might need to constantly up your timeout value if the task takes longer to complete (becomes more advanced, or just slower because more people use it).
In the end, this sort of problem is typically solved by a change in your architecture.
Use a Separate Process for Long-Running Tasks
Rather than submitting the request and running the task in the handling view, the view starts the running of the task in a separate process, then immediately returns a response. This response can bring the user to a "Please wait, we're processing" page. That page can use one of the many push technologies out there to determine when the task was completed (long-polling, web-sockets, server-sent events, an AJAX request every N seconds, or the dead-simplest: have the page reload every 5 seconds).
Have your Web Request "Kick Off" the Separate Process
Anyway, as I said, the view handling the request doesn't do the long action: it just kicks off a background process to do the task for it. You can create this background process dispatch yourself (check out this Flask snippet for possible ideas), or use a library like Celery or (RQ).
Once the task is complete, you need some way of notifying the user. This will be dependent on what sort of notification method you picked above. For a simple "ajax request every N seconds", you need to create a view that handles the AJAX request that checks if the task is complete. A typical way to do this is to have the long-running task, as a last step, make some update to a database. The requests for checking the status can then check this part of the database for updates.
Advantages and Disadvantages
Using this method (rather than trying to fit the long-running task into a request) has a few benefits:
1.) Handling long-running web requests is a tricky business due to the fact that there are multiple points that could time out (besides the browser and server). With this method, all your web requests are very short and much less likely to timeout.
2.) Flask (and other frameworks like it) is designed to only support a certain number of threads that can respond to web queries. Assume it has 8 threads: if four of them are handling the long requests, that only leaves four requests to actually handle more typical requests (like a user getting their profile page). Half of your web server could be tied up doing something that is not serving web content! At worse, you could have all eight threads running a long process, meaning your site is completely unable to respond to web requests until one of them finishes.
The main drawback: there is a little more set up work in getting a task queue up and running, and it does make your entire system slightly more complex. However, I would highly recommend this strategy for long-running tasks that run on the web.
I believe this is due to your web server (apache in most cases) which has a timeout to small. Try to increase this number
For apache, have a look at the timeout option
EDIT: I don't think you can do set this time out in Chrome (see this topic on google forums even though it's really old)
In firefox, on the about:config page, type timeout and you'll have some options you can set. I have no idea about Internet Explorer.
Let's assume:
This is not a server issue, so we don't have to go fiddle with Apache, nginx, etc. timeout settings.
The delay is minutes, not hours or days, just to make the scenario manageable.
You control the web page on which the user hits submit, and from which user interaction is managed.
If those obtain, I'd suggest not using a standard HTML form submission, but rather have the submit button kick off a JavaScript function to oversee processing. It would put up a "please be patient...this could take a little while" style message, then use jQuery.ajax, say, to call the long-time-taking server with a long timeout value. jQuery timeouts are measured in milliseconds, so 60000 = 60 seconds. If it's longer than that, increase your specified timeout accordingly. I have seen reports that not all clients will allow super-extra-long timeouts (e.g. Safari on iOS apparently has a 60-second limitation). But in general, this will give you a platform from which to manage the interactions (with your user, with the slow server) rather than being at the mercy of simple web form submission.
There are a few edge cases here to consider. The web server timeouts may indeed need to be adjusted upward (Apache defaults to 300 seconds aka 5 minutes, and nginx less, IIRC). Your client timeouts (on iOS, say) may have maximums too low for the delays you're seeing. Etc. Those cases would require either adjusting at the server, or adopting a different interaction strategy. But an AJAX-managed interaction is where I would start.

A daemon to call a function every 2 minutes with start and stop capablities

I am working on a django web application.
A function 'xyx' (it updates a variable) needs to be called every 2 minutes.
I want one http request should start the daemon and keep calling xyz (every 2 minutes) until I send another http request to stop it.
Appreciate your ideas.
Thanks
Vishal Rana
There are a number of ways to achieve this. Assuming the correct server resources I would write a python script that calls function xyz "outside" of your django directory (although importing the necessary stuff) that only runs if /var/run/django-stuff/my-daemon.run exists. Get cron to run this every two minutes.
Then, for your django functions, your start function creates the above mentioned file if it doesn't already exist and the stop function destroys it.
As I say, there are other ways to achieve this. You could have a python script on loop waiting for approx 2 minutes... etc. In either case, you're up against the fact that two python scripts run on two different invocations of cpython (no idea if this is the case with mod_wsgi) cannot communicate with each other and as such IPC between python scripts is not simple, so you need to use some sort of formal IPC (like semaphores, files etc) rather than just common variables (which won't work).
Probably a little hacked but you could try this:
Set up a crontab entry that runs a script every two minutes. This script will check for some sort of flag (file existence, contents of a file, etc.) on the disk to decide whether to run a given python module. The problem with this is it could take up to 1:59 to run the function the first time after it is started.
I think if you started a daemon in the view function it would keep the httpd worker process alive as well as the connection unless you figure out how to send a connection close without terminating the django view function. This could be very bad if you want to be able to do this in parallel for different users. Also to kill the function this way, you would have to somehow know which python and/or httpd process you want to kill later so you don't kill all of them.
The real way to do it would be to code an actual daemon in w/e language and just make a system call to "/etc/init.d/daemon_name start" and "... stop" in the django views. For this, you need to make sure your web server user has permission to execute the daemon.
If the easy solutions (loop in a script, crontab signaled by a temp file) are too fragile for your intended usage, you could use Twisted facilities for process handling and scheduling and networking. Your Django app (using a Twisted client) would simply communicate via TCP (locally) with the Twisted server.

Categories

Resources