Using apscheduler in a periodic short-running script - python

My (simplified and generalized) need is the following:
Given is a main program which starts my Python script every minute. By design, this script is intended to be a short-runner.
Its goal is to call different web hooks - not on every call, but in individual intervals, e.g. every 5 minutes. (The script can read the configuration from a file.)
So it is not possible for me to use apscheduler in a permanently running program.
On the contrary, the program itself must determine on each run which API calls are overdue and have to be made now.
Is it possible to use apscheduler for this?
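Whether or not apscheduler is used, the "determine on each run which calls are overdue" step could look roughly like the following. This is a minimal sketch, assuming the script keeps the last call time of each hook in a small JSON state file between runs. The function names, the state file, and the `call_hook` callable are all hypothetical and not part of apscheduler.

```python
import json
import time
from pathlib import Path


def load_state(path):
    """Return {hook_name: last_run_epoch_seconds}, or {} on the first run."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def due_hooks(intervals, state, now):
    """Names of hooks whose interval (in seconds) has elapsed since their last run."""
    return [name for name, every in intervals.items()
            if now - state.get(name, 0.0) >= every]


def run_once(intervals, state_path, call_hook, now=None):
    """One short run: call every overdue hook, then persist the new timestamps."""
    now = time.time() if now is None else now
    state = load_state(state_path)
    called = due_hooks(intervals, state, now)
    for name in called:
        call_hook(name)      # hypothetical callable that performs the web hook request
        state[name] = now
    Path(state_path).write_text(json.dumps(state))
    return called
```

The main program would call something like `run_once({"hook_a": 300}, "state.json", call_hook)` once per minute. Because start times jitter by a few seconds, an interval that is an exact multiple of the run period may occasionally slip by one run.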

Related

How do I stop Airflow from triggering my python scripts?

I am currently using Airflow to run a DAG (say dag.py) which has a few tasks, and then, it has a python script to execute (done via bash_operator). The python script (say report.py) basically takes data from a cloud (s3) location as a dataframe, does a few transformations, and then sends them out as a report over email.
But the issue I'm having is that Airflow is running this Python script, report.py, every time Airflow scans the repository for changes (i.e. every 2 minutes). So the script is being run every 2 minutes (and hence the email is being sent out every 2 minutes!).
Is there any workaround for this? Can we use something other than a bash operator (bear in mind that we need to do a few dataframe transformations before sending out the report)?
Thanks!
Just make sure you do everything serious in the tasks, not in the Python script. The script will be executed often by the scheduler, but it should simply create tasks and build dependencies between them. The actual work is done in the 'execute' methods of the tasks.
For example, rather than sending the email in the script, you should add an 'EmailOperator' as a task with the right dependencies, so that the execute method of the operator runs not when the file is parsed by the scheduler, but when all its dependencies (other tasks) have completed.
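The difference between parse time and execute time can be illustrated with a toy model. This is deliberately not the Airflow API: `ToyTask` and `build_dag` are hypothetical stand-ins that only show why work placed inside a task's execute method does not run each time the file is parsed.

```python
class ToyTask:
    """Stand-in for an operator: holds a callable, runs it only in execute()."""

    def __init__(self, name, work):
        self.name = name
        self.work = work
        self.upstream = []

    def set_upstream(self, other):
        self.upstream.append(other)

    def execute(self):
        return self.work()


def build_dag(sent):
    """What the DAG file should do on every parse: only wire tasks together."""
    transform = ToyTask("transform", lambda: "report body")
    email = ToyTask("email", lambda: sent.append("report sent"))
    email.set_upstream(transform)
    return [transform, email]
```

Calling `build_dag` repeatedly (as a scheduler parsing the file would) sends nothing; the email side effect happens only when `execute()` is called on the email task.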

How can I schedule or queue api calls to maintain rate limit?

I am trying to continuously crawl a large amount of information from a site using the REST API they provide. I have the following constraints:
• Stay within the API limit (5 calls/sec)
• Utilise the full limit (making exactly 5 calls per second, 5*60 calls per minute)
• Each call will be made with different parameters (params will be fetched from a DB or in-memory cache)
• Calls will be made from AWS EC2 (or GAE) and processed data will be stored in AWS RDS/DynamoDB
For now I am just using a scheduled task that runs a python script every minute- and the script makes 10-20 api calls-> processes response-> stores data to DB. I want to scale this procedure (make 5*60= 300 calls per minute) and make it manageable via code (pushing new tasks, pause/resuming them easily, monitoring failures, changing call frequency).
My question is- what are the best available tools to achieve this? Any suggestion/guidance/link is appreciated.
I do know the names of some task queuing frameworks like Celery/RabbitMQ/Redis, but I do not know much about them. However, I am willing to learn one or each of those if these are the best tools to solve my problem; I want to hear from SO veterans before jumping in ☺
Also please let me know if there's any other AWS service I should look to use (SQS or AWS Data Pipeline?) to make any step easier.
You needn't add an external dependency just for rate-limiting, as your use case is rather straightforward.
I can think of two options:
Modify the script (that currently wakes up every minute and makes 10-20 API calls) to wake up every second and make 5 calls (sequentially or in parallel).
In your current design, your API calls might not be properly distributed across 1 minute, i.e. you might be making all your 10-20 calls in the first, say, 20 seconds.
If you change that script to run every second, your API call rate will be more balanced.
Change your Python script to a long running daemon, and use a Rate Limiter library, such as this. You can configure the latter to make 1 call per x seconds.
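For the long-running variant, the spacing logic itself is small. Below is a minimal sketch of an interval-based limiter written from scratch, not the library the answer refers to; the class name `IntervalLimiter` and its parameters are hypothetical. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class IntervalLimiter:
    """Allow at most `rate` calls per second by spacing calls 1/rate seconds apart."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call slot, then reserve the following one."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_gap
```

With `limiter = IntervalLimiter(5)`, calling `limiter.wait()` before each API request keeps the daemon at or below 5 calls per second.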

Methods to schedule a task prior to runtime

What are the best methods to set a .py file to run at one specific time in the future? Ideally, I'd like to do everything within a single script.
Details: I often travel for business so I built a program to automatically check me in to my flights 24 hours prior to takeoff so I can board earlier. I currently am editing my script to input my confirmation number and then setting up cron jobs to run said script at the specified time. Is there a better way to do this?
Options I know of:
• current method
• put code in the script to delay until x time. Run the script immediately after booking the flight and it would stay open until the specified time, then check me in and close. This would prevent me from shutting down my computer, though, and my machine is prone to overheating.
Ideal method: input my confirmation number & flight date, run the script, have it set up whatever cron automatically, be done with it. I want to make sure whatever method I use doesn't include keeping a script open and running in the background.
cron is best for jobs that you want to repeat periodically. For one-time jobs, use at or batch.
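To keep everything inside one script, the check-in script could queue itself with at. The following is a minimal sketch, assuming a Unix system with the at utility installed and its daemon running; the helper names and the example command line are hypothetical. It relies on the POSIX `-t` option, which takes a `[[CC]YY]MMDDhhmm` timestamp, and on at reading the job's commands from standard input.

```python
import subprocess
from datetime import datetime


def build_at_command(when):
    """Argument list for POSIX `at -t`, which takes a [[CC]YY]MMDDhhmm timestamp."""
    return ["at", "-t", when.strftime("%Y%m%d%H%M")]


def schedule_once(command, when):
    """Queue `command` (a shell command line) to run once at `when`; needs `at` installed."""
    subprocess.run(build_at_command(when), input=command + "\n", text=True, check=True)
```

A call such as `schedule_once("python3 checkin.py ABC123", datetime(2030, 1, 2, 3, 4))` would then queue a one-time job, and nothing has to stay running in the meantime.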

Restart python script if not running/stopped/error with simple cron job

Summary: I have a Python script which collects tweets using the Twitter API, and I have a PostgreSQL database in the backend which stores all the streamed tweets. I have custom code which overcomes the rate-limit issue, and I made it run 24/7 for months.
Issue: Sometimes streaming breaks and sleeps for given secs but it is not helpful. I do not want to check it manually.
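import time

def on_error(self, status):  # tweepy method
    self.mailMeIfError(['me <me#localhost>'], 'listen.py <root#localhost>', 'Error occurred in on_error method', str(status))
    time.sleep(300)  # back off for 5 minutes before resuming
    return True  # returning True keeps the stream running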
Assume mailMeIfError is a method which takes care of sending me a mail.
I want a simple cron script which always checks the process and restarts the Python script if it is not running, has hit an error, or has broken. I have gone through some answers on Stack Overflow that use the process ID. In my case the process ID still exists, because the script sleeps on error.
Thanks in advance.
Using Process ID is much easier and safer. Try using watchdog.
This can all be done in your one script. Cron would need to be configured to start your script periodically, say every minute. The start of your script then just needs to determine if it is the only copy of itself running on the machine. If it spots that another copy is running, it just silently terminates. Else it continues to run.
This behaviour is called the Singleton pattern. There are a number of ways to achieve it; see, for example, "Python: single instance of program".
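One common way to implement the "am I the only copy?" check is an exclusive, non-blocking lock on a file. This is a minimal sketch, assuming a Unix system (it uses `fcntl.flock`); the function name and lock path are hypothetical.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open lock file if this is the only holder, else None.

    The caller must keep the returned file object alive for the whole run;
    the kernel releases the lock when the process exits, even after a crash.
    """
    lock_file = open(lock_path, "a")
    try:
        # Non-blocking exclusive lock: fails at once if another copy holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file
```

At the top of the script, `if acquire_single_instance_lock("/tmp/listen.lock") is None: sys.exit(0)` would make a cron-started duplicate terminate silently. Note that this only detects whether another copy is alive; it does not detect a copy that is alive but stalled.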

A daemon to call a function every 2 minutes with start and stop capabilities

I am working on a django web application.
A function 'xyz' (it updates a variable) needs to be called every 2 minutes.
I want one HTTP request to start the daemon, which keeps calling xyz (every 2 minutes) until I send another HTTP request to stop it.
Appreciate your ideas.
Thanks
Vishal Rana
There are a number of ways to achieve this. Assuming the correct server resources I would write a python script that calls function xyz "outside" of your django directory (although importing the necessary stuff) that only runs if /var/run/django-stuff/my-daemon.run exists. Get cron to run this every two minutes.
Then, for your django functions, your start function creates the above mentioned file if it doesn't already exist and the stop function destroys it.
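The flag-file approach described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `start`, `stop`, and `cron_tick` are hypothetical helper names, the flag path is passed in by the caller, and `xyz` stands for the function from the question.

```python
from pathlib import Path


def start(flag):
    """What the Django 'start' view would do: create the flag file."""
    flag = Path(flag)
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch(exist_ok=True)


def stop(flag):
    """What the Django 'stop' view would do: remove the flag file."""
    Path(flag).unlink(missing_ok=True)  # missing_ok needs Python 3.8+


def cron_tick(flag, xyz):
    """Body of the script cron runs every two minutes: call xyz only while the flag exists."""
    if Path(flag).exists():
        xyz()
        return True
    return False
```

The web server user needs write permission on the flag's directory, and the cron user needs read access to it.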
As I say, there are other ways to achieve this. You could have a Python script looping and waiting approximately 2 minutes between calls, and so on. In either case, you're up against the fact that two Python scripts running in two different invocations of CPython (no idea if this is the case with mod_wsgi) cannot communicate with each other through shared variables. IPC between Python scripts is not simple, so you need some sort of formal IPC (semaphores, files, etc.) rather than just common variables, which won't work.
Probably a little hacked but you could try this:
Set up a crontab entry that runs a script every two minutes. This script will check for some sort of flag (file existence, contents of a file, etc.) on the disk to decide whether to run a given python module. The problem with this is it could take up to 1:59 to run the function the first time after it is started.
I think if you started a daemon in the view function it would keep the httpd worker process alive as well as the connection unless you figure out how to send a connection close without terminating the django view function. This could be very bad if you want to be able to do this in parallel for different users. Also to kill the function this way, you would have to somehow know which python and/or httpd process you want to kill later so you don't kill all of them.
The real way to do it would be to code an actual daemon in whatever language and just make a system call to "/etc/init.d/daemon_name start" and "... stop" in the Django views. For this, you need to make sure your web server user has permission to execute the daemon.
If the easy solutions (loop in a script, crontab signaled by a temp file) are too fragile for your intended usage, you could use Twisted facilities for process handling and scheduling and networking. Your Django app (using a Twisted client) would simply communicate via TCP (locally) with the Twisted server.
