How to write an endless loop crawler in Python? - python

EDITED:
I have a crawler.py that crawls certain sites every 10 minutes and sends me some emails regarding these sites. The crawler is ready and working locally.
How can I adjust it so that the following two things will happen:
It will run in an endless loop on the hosting that I'll upload it to.
Sometimes I will be able to stop it (e.g. for debugging).
At first, I thought of doing an endless loop, e.g.
crawler.py:
import time

while True:
    doCrawling()
    time.sleep(10 * 60)   # 10 minutes
However, according to the answers I got below, this would be impossible, since hosting providers kill processes after a while (just for the question's sake, let's assume processes are killed every 30 minutes). Therefore, my endless-loop process would be killed at some point.
So I have thought of a different solution:
Let's assume that my crawler is located at "www.example.com/crawler.py" and that each time it is accessed, it executes the function run():
import time
from urllib.request import urlopen

def run():
    doCrawling()
    time.sleep(10 * 60)                            # 10 minutes
    urlopen('http://www.example.com/crawler.py')   # call the URL to trigger the next run
Thus, there is no endless loop: every time my crawler runs, it also accesses the URL, which executes the same crawler again. There is no single long-running process, and yet my crawler keeps operating forever.
Will my idea work?
Are there any hidden drawbacks I haven't thought of?
Thanks!

As you stated in the comments, you are running on a public shared host like GoDaddy, so cron is not available there and long-running scripts are usually forbidden - your process would be killed even if you were using sleep.
Therefore, the only solution I see is to use an external server that you control to connect to your public server and run the script every 10 minutes. One option is to use cron on your local machine to hit a specific page on your host with wget or curl. **
Maybe you can find online services that allow running a script periodically and use those, but I don't know of any.
** Bonus: you can get the results directly in the response, without having to send yourself an email.
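For illustration, here is a minimal sketch of that external trigger, meant to be run by cron on a machine you control (the file name, path and schedule are my assumptions; the URL is your example one):

# trigger_crawler.py - run from cron on your own machine, e.g. every 10 minutes:
#     */10 * * * * /usr/bin/python3 /home/user/trigger_crawler.py
from urllib.request import urlopen

with urlopen('http://www.example.com/crawler.py', timeout=120) as response:
    # Bonus: the crawl results come back directly in the response body.
    print(response.read().decode())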
Update
So, in your updated question you propose to have your script call itself with an HTTP request. I thought of it before, but I didn't consider it in my previous answer because I believed it wouldn't work (in general).
My concern is: will the server kill a script if the HTTP connection requesting it is closed before the script terminates?
In other words: if you open yoursite.com/script.py and it takes 60 seconds to run, and you close the connection with the server after 10 seconds, will the script still run to its regular end?
I thought that the answer was obviously "no, the script will be killed", and therefore that the method would be useless: you would have to guarantee that a script calling itself via an HTTP request stays alive longer than the called script. I did a little experiment using Flask, and it proved me wrong:
from flask import Flask
import time

app = Flask(__name__)

@app.route('/')
def hello_world():
    print('Script started...')
    time.sleep(5)
    print('5 seconds passed...')
    time.sleep(5)
    print('Script finished')
    return 'Script finished'

if __name__ == '__main__':
    app.run()
If I run this script, make an HTTP request to localhost:5000 and close the connection after 2 seconds, the script continues to run until the end and the messages are still printed.
Therefore, with Flask, if you can make an asynchronous request to yourself, you should be able to have an "infinite loop" script.
I don't know the behavior on other servers, though. You should make a test.
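For what it's worth, a rough sketch of what the "asynchronous request to yourself" could look like from inside the Flask view; the URL and the helper name are just illustrative:

import threading
from urllib.request import urlopen

def schedule_next_run(url='http://www.example.com/crawler.py'):
    # Fire the next call in a background thread and return immediately, so the
    # current request is not held open while the next one starts.
    def _trigger():
        try:
            # A short timeout is enough: we only need the server to *start*
            # handling the request, not to wait for it to finish.
            urlopen(url, timeout=5)
        except Exception:
            pass
    threading.Thread(target=_trigger, daemon=True).start()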
Control
Assuming your server allows you to make a GET request and keeps the script running even after the connection is closed, there are a few things to take care of. For example, your script still has to run fast enough to complete within the maximum time the server allows, and to make it run every 10 minutes with a maximum allowance of 1 minute, you have to count the calls and only crawl on every 10th one.
In addition, this mechanism has to be controllable, because otherwise you cannot interrupt it for debugging as you requested - at least, not directly.
Therefore, I suggest you use files. Use a file to split your crawling into smaller steps, each able to finish in less than one minute, and to record where to continue when the script is called again.
Use a file to count how many times the script has been called before actually doing the crawling. This is necessary if, for example, the script is allowed to live 90 seconds but you want to crawl every 10 hours.
Use a file to control the script: store a boolean flag that you use to stop the recursion mechanism if you need to.
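Putting those uses of files together, a minimal sketch could look like this (the file name, the 10-call threshold and the crawl_one_step() stub are my assumptions, not your code):

import json
import os

STATE_FILE = 'crawler_state.json'   # holds the stop flag and the call counter

def crawl_one_step():
    pass   # placeholder: one small crawling step that finishes well within the limit

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {'enabled': True, 'calls': 0}

def run():
    state = load_state()
    if not state['enabled']:        # set "enabled" to false in the file to stop the recursion
        return
    state['calls'] += 1
    if state['calls'] >= 10:        # e.g. called every minute, crawl only on every 10th call
        state['calls'] = 0
        crawl_one_step()
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)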

If you're using Linux you should just set up a cron job for your script. Info: http://code.tutsplus.com/tutorials/scheduling-tasks-with-cron-jobs--net-8800

If you are running Linux I would set up an upstart script (http://upstart.ubuntu.com/getting-started.html) to turn it into a service.
It offers a lot of advantages like:
- Starting at system boot
- Auto restart on crashes
- Manageable: service mycrawler restart
...
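For reference, a rough sketch of the long-running script such a service would supervise, with a SIGTERM handler so that service mycrawler stop can end it cleanly (doCrawling is the asker's placeholder, stubbed here):

import signal
import time

running = True

def handle_sigterm(signum, frame):
    # Ask the loop to stop; it exits after the current pass (and sleep) finishes.
    global running
    running = False

signal.signal(signal.SIGTERM, handle_sigterm)

def doCrawling():
    pass   # placeholder for the existing crawl-and-email logic

while running:
    doCrawling()
    time.sleep(10 * 60)   # 10 minutes between passes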
Or, if you would prefer to have it run every 10 minutes, forget about the endless loop and set up a cron job: http://en.wikipedia.org/wiki/Cron

Related

AutobahnPython + Twisted 'Publish' floods messages after script is finished

I have a Python script that sometimes runs a process that lasts ~5-60 seconds. During this time, ten calls to session.publish() are ignored until the script is done. As soon as the script finishes, all ten messages are published in a flood.
I have corroborated this by running the Crossbar.io router in debug mode: it shows the log entries corresponding to the published messages only after the run is over, not during it as expected.
The script in question is long, complex and includes a combined frontend and backend for Crossbar/Twisted/AutobahnPython. I feel I would risk misreporting the problem if I tried to condense and include it here.
What reasons are there for publish to not happen instantaneously?
A couple of unsuccessful tries so far:
Source: Twisted needs 'non-blocking code'. So I tried to incorporate reactor.callLater, but without success (I also don't really know how to do this for a publish event).
I looked into the idea of using Pool to spawn workers to perform the publish.
The AutobahnPython repo doesn't seem to have any examples that really include this kind of situation.
Thanks!
What reasons are there for publish to not happen instantaneously?
The reactor has to get a chance to run for I/O to happen. The example code doesn't let the reactor run because it keeps execution in a while loop in user code for a long time.
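For illustration, here is a minimal sketch (not the asker's code) of one way to keep the blocking work off the reactor thread so that each publish() is flushed immediately; the topic name and the step function are assumptions:

import time

from autobahn.twisted.wamp import ApplicationSession
from twisted.internet.defer import inlineCallbacks
from twisted.internet.threads import deferToThread


class PublisherSession(ApplicationSession):

    @inlineCallbacks
    def onJoin(self, details):
        for step in range(10):
            # The blocking work runs in a thread pool, so the reactor stays
            # free and each publish() below goes out to the router right away.
            result = yield deferToThread(self.blocking_step, step)
            self.publish('com.example.progress', result)   # hypothetical topic

    def blocking_step(self, step):
        time.sleep(5)                  # stands in for the ~5-60 s of real work
        return 'step %d done' % step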

(ndb, python, gae) - cron job timeout using more than one module

Is there something special that I need to do when working with cron jobs across separate modules? I can't figure out why a request to the cron job at localhost:8083/tasks/crontask (localhost:8083 runs the workers module), which is supposed to just print a simple line, doesn't print anything to the console, even though the request is reported as successful when I go to http://localhost:8000/cron and hit the run button.
If I refresh the page localhost:8083/tasks/crontask as a way of triggering the cron job, it times out.
Again, if I go to localhost:8001 and hit the run button, it says the request to /tasks/crontask was successful, but it doesn't print to the console like it's supposed to.
In send_notifications_handler.py within the workers/handlers directory
class CronTaskHandler(BaseApiHandler):
    def get(self):
        print "hello, this is a cron job"
in cron.yaml outside the workers module
cron:
- description: something
  url: /tasks/crontask
  schedule: every 1 minutes
  target: workers
in __init__.py in the workers/handlers directory
from send_notifications_handler import CronTaskHandler

# --- Packaging
__all__ = [
    CounterWorker,
    DeleteGamesCronHandler,
    CelebrityCountsCronTaskHandler,
    QuestionTypeCountsCronHandler,
    CronTaskHandler
]
in workers/routes.py
Route('/tasks/crontask', handlers.CronTaskHandler, methods=['GET']),
//++++++++++++++++++++ Updates / resolution +++++++++++++
The print statement is fine and does print to the console
Yes, the cron job will fire once when using the dev server, although it doesn't repeat
The problem was that _ah/start in that module was routed to a pull queue that never stops. Removing the pull queue fixed the issue.
That is actually the expected behavior when executing cron jobs locally.
If you take a look at the docs, they say the following:
The development server doesn't automatically run your cron jobs. You can use your local desktop's cron or scheduled tasks interface to trigger the URLs of your jobs with curl or a similar tool.
You will need to manually execute cron jobs on local server by visiting http://localhost:8000/cron, as you mentioned in your post.
/++++++++++++++++++++ Updates / resolution +++++++++++++
The print statement is fine and does print to the console
Yes, the cron job will fire once when using the dev server, although it doesn't repeat, which is normal behavior for dev servers
The problem was that _ah/start in that module was routed to a pull queue that never stops. Removing the pull queue fixed the issue.
Thanks for suggestions

Restart python script if not running/stopped/error with simple cron job

Summary: I have a Python script which collects tweets using the Twitter API, and I have a PostgreSQL database in the backend which stores all the streamed tweets. I have custom code which overcomes the rate-limit issue, and I made it run 24/7 for months.
Issue: Sometimes the streaming breaks and sleeps for the given seconds, but that is not helpful. I do not want to check it manually.
def on_error(self, status):   # tweepy method
    self.mailMeIfError(['me <me@localhost>'], 'listen.py <root@localhost>',
                       'Error occurred in on_error method', str(status))
    time.sleep(300)
    return True
Assume mailMeIfError is a method which takes care of sending me a mail.
I want a simple cron script which always checks the process and restarts the Python script if it is not running, has errored, or has broken. I have gone through some answers on Stack Overflow where they use the process ID. In my case the process ID still exists, because the script just sleeps when there is an error.
Thanks in advance.
Using Process ID is much easier and safer. Try using watchdog.
This can all be done in your one script. Cron would need to be configured to start your script periodically, say every minute. The start of your script then just needs to determine if it is the only copy of itself running on the machine. If it spots that another copy is running, it silently terminates; otherwise it continues to run.
This behaviour is called a singleton pattern. There are a number of ways to achieve this; see for example Python: single instance of program
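One common way to implement that check on Linux is an abstract-namespace Unix socket that doubles as the lock; a minimal sketch (the lock name and the run_listener() entry point are illustrative, not from the question):

import socket
import sys

def already_running(name='twitter_stream_lock'):
    # Binding an abstract-namespace Unix socket fails if another copy already
    # holds it, and the "lock" disappears automatically when that process dies.
    global _lock_socket
    _lock_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        _lock_socket.bind('\0' + name)
        return False
    except socket.error:
        return True

def run_listener():
    pass   # placeholder for the real streaming work

if __name__ == '__main__':
    if already_running():
        sys.exit(0)    # another copy is active; exit silently
    run_listener()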

Django, sleep() pauses all processes, but only if no GET parameter?

Using Django (hosted by Webfaction), I have the following code
import time
from django.http import HttpResponse

def my_function(request):
    time.sleep(10)
    return HttpResponse("Done")
This is executed via Django when I go to my url, www.mysite.com
I enter the URL twice, immediately one after the other. The way I see it, both of these should finish after 10 seconds. However, the second call waits for the first one and finishes after 20 seconds.
If, however, I enter some dummy GET parameter, www.mysite.com?dummy=1 and www.mysite.com?dummy=2 then they both finish after 10 seconds. So it is possible for both of them to run simultaneously.
It's as though the scope of sleep() is somehow global?? Maybe entering a parameter makes them run as different processes instead of the same???
It is hosted by Webfaction. httpd.conf has:
KeepAlive Off
Listen 30961
MaxSpareThreads 3
MinSpareThreads 1
ServerLimit 1
SetEnvIf X-Forwarded-SSL on HTTPS=1
ThreadsPerChild 5
I do need to be able to use sleep() and trust that it isn't stopping everything. So, what's up and how to fix it?
Edit: Webfaction runs this using Apache.
As Gjordis pointed out, sleep will pause the current thread. I have looked at Webfaction and it looks like they are using WSGI to run the serving instance of Django. This means that every time a request comes in, Apache will look at how many worker processes (processes that each run an instance of Django) are currently running. If there are none or too few, it will spawn additional workers and hand the requests to them.
Here is what I think is happening in your situation:
first GET request for resource A comes in. Apache uses a running worker (or starts a new one)
the worker sleeps 10 seconds
during this, a new request for resource A comes in. Apache sees it is requesting the same resource and sends it to the same worker as the first request. I guess the assumption here is that a worker that recently processed a request for a specific resource is more likely to have some information cached/preprocessed/whatever, so it can handle the new request faster
this results in a 20-second block, since there is only one worker that waits 2 times 10 seconds
This behavior makes complete sense 99% of the time so it's logical to do this by default.
However, if you change the requested resource for the second request (by adding a GET parameter), Apache will assume it is a different resource and will start another worker, since the first one is already "busy" (Apache cannot know that you are not doing any hard work). Since there are now two workers, both waiting 10 seconds, the total time goes down to 10 seconds.
Additionally, I assume that something is wrong with your design. There are almost no cases I can think of where it would be sensible not to respond to an HTTP request as fast as you can. After all, you want to serve as many requests as possible in the shortest amount of time, so sleeping for 10 seconds is the most counterproductive thing you can do. I would recommend that you create a new question and state what the actual goal is that you are trying to achieve. I'm pretty sure there is a more sensible solution to this!
Assuming you run your Django server just with run(), by default this creates a single-threaded server. If you use sleep in a single-threaded process, the whole application freezes for that sleep time.
It may simply be that your browser is queuing the second request to be performed only after the first one completes. If you are opening your URLs in the same browser, try using two different ones (e.g. Firefox and Chrome), or try performing the requests from the command line using wget or curl instead.
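A quick way to rule the browser out is to fire both requests concurrently from a small script and time them; a sketch, using the example URL from the question:

import threading
import time
from urllib.request import urlopen

def hit(url):
    start = time.time()
    urlopen(url).read()
    print('%s finished after %.1f seconds' % (url, time.time() - start))

url = 'http://www.mysite.com/'                 # the example URL from the question
threads = [threading.Thread(target=hit, args=(url,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()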

A daemon to call a function every 2 minutes with start and stop capabilities

I am working on a Django web application.
A function 'xyz' (it updates a variable) needs to be called every 2 minutes.
I want one HTTP request to start the daemon, which keeps calling xyz (every 2 minutes) until I send another HTTP request to stop it.
Appreciate your ideas.
Thanks
Vishal Rana
There are a number of ways to achieve this. Assuming the correct server resources, I would write a Python script that calls function xyz "outside" of your Django directory (although importing the necessary stuff) and that only does the work if /var/run/django-stuff/my-daemon.run exists. Get cron to run this every two minutes.
Then, for your Django side, your start function creates the above-mentioned file if it doesn't already exist, and the stop function deletes it.
As I say, there are other ways to achieve this. You could have a Python script on a loop, waiting roughly 2 minutes... etc. In either case, you're up against the fact that two Python scripts running in two different invocations of CPython (no idea if this is the case with mod_wsgi) cannot share common variables, so IPC between them is not simple - you need some form of formal IPC (semaphores, files, etc.) rather than plain shared variables (which won't work).
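A minimal sketch of that flag-file arrangement (the paths, view names and the xyz stub are my assumptions, not from the question):

# run_xyz.py - invoked by cron every two minutes, e.g.
#     */2 * * * * /usr/bin/python /home/user/run_xyz.py
# It only does the work if the flag file exists.
import os

FLAG = '/var/run/django-stuff/my-daemon.run'

def xyz():
    pass   # placeholder for the real function that updates the variable

if __name__ == '__main__':
    if os.path.exists(FLAG):
        xyz()

# views.py - the Django views that toggle the flag file.
from django.http import HttpResponse

def start(request):
    open(FLAG, 'a').close()    # create the flag; the cron script will now call xyz
    return HttpResponse('started')

def stop(request):
    if os.path.exists(FLAG):
        os.remove(FLAG)
    return HttpResponse('stopped')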
Probably a little hacky, but you could try this:
Set up a crontab entry that runs a script every two minutes. This script checks for some sort of flag (file existence, contents of a file, etc.) on disk to decide whether to run a given Python module. The problem with this is that it could take up to 1:59 to run the function the first time after it is started.
I think if you started a daemon in the view function, it would keep the httpd worker process alive, as well as the connection, unless you figure out how to close the connection without terminating the Django view function. This could be very bad if you want to be able to do this in parallel for different users. Also, to kill the function this way, you would somehow have to know which python and/or httpd process you want to kill later, so you don't kill all of them.
The real way to do it would be to code an actual daemon in whatever language and just make a system call to "/etc/init.d/daemon_name start" and "... stop" in the Django views. For this, you need to make sure your web server user has permission to execute the daemon.
If the easy solutions (loop in a script, crontab signaled by a temp file) are too fragile for your intended usage, you could use Twisted facilities for process handling and scheduling and networking. Your Django app (using a Twisted client) would simply communicate via TCP (locally) with the Twisted server.
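The core of the Twisted approach would be a LoopingCall; a minimal sketch, with xyz stubbed in for the real function:

from twisted.internet import reactor, task

def xyz():
    print('updating the variable...')   # placeholder for the real work

loop = task.LoopingCall(xyz)
loop.start(120)          # call xyz now, then every 120 seconds
reactor.run()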
