apscheduler scheduler timeout - python

I have a problem with Python's APScheduler.
I'm running a task that pulls data from a database. The database's response time varies because of different operations hitting it from different sources, and predicting when its response time will be low is not possible.
For example, when running
scheduler.add_interval_job(self.readFromDb, start_date=now(), seconds=60)
the seconds parameter stops the task if it hasn't finished and starts the next one.
Is there a way of changing the seconds parameter dynamically? Or should I use the default value of 0?
Cheers

The "seconds" parameter does not in any way limit how long the job can take, and it certainly does not terminate it prematurely. However, it will, with the default settings, prevent another instance of the job from being spawned if the previous instance is taking longer than the specified interval (60 seconds here). The way I see it, you have two options here:
Ignore the fact that a new instance of the task sometimes fails to start
Increase the max_instances parameter from the default of 1 so more than one instance of the task can run concurrently
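If you go with the second option, the change is small. A minimal sketch, assuming the APScheduler 3.x add_job API (the older add_interval_job call from the question should also accept max_instances as a keyword option, but check the docs for your version):

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()

def read_from_db():
    ...  # the long-running database pull

# Allow up to 3 overlapping runs instead of skipping a run while the
# previous one is still busy; misfire_grace_time gives a late run some
# slack before it is discarded.
scheduler.add_job(read_from_db, 'interval', seconds=60,
                  max_instances=3, misfire_grace_time=30)

scheduler.start()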

Related

Run multiple schedule jobs at same time using Python Schedule

I am using the cx_Oracle and schedule modules in Python. The following is pseudocode.
import schedule
import cx_Oracle

def db_operation(query):
    '''
    Some DB operations like:
    1. Get a connection
    2. Execute the query
    3. Commit the result (in the case of DML operations)
    '''

schedule.every().hour.at(":10").do(db_operation, query='some_query_1')  # runs at the 10th minute of every hour
schedule.every().day.at("13:10").do(db_operation, query='some_query_2')  # runs at 1:10 p.m. every day
Both of the scheduled jobs above call the same function (which does some DB operations) and will coincide at 13:10.
Questions:
How does the scheduler handle this scenario of running 2 jobs at the same time? Does it put them in some sort of queue and run them one by one even though the time is the same, or do they run in parallel?
Which one gets picked first? And if I want the first job to have priority over the second, how do I do that?
Also, importantly, only one of these should be accessing the database at a time, otherwise it may lead to inconsistent data. How do I take care of this scenario? Is it possible to take some sort of lock while the function runs, or should the table be locked somehow?
I took a look at the code of schedule and I have come to the following conclusions:
The schedule library does not run jobs in parallel or concurrently. Jobs that are due are therefore processed one after the other, sorted by their due time: the job whose due time lies furthest in the past is executed first.
If jobs are due at the same time, schedule executes them in FIFO order with respect to when the jobs were created. So in your example, some_query_1 would be executed before some_query_2.
Question three is therefore self-explanatory: since only one function can be executed at a time, the two jobs cannot get in each other's way.
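For illustration, this is the usual run loop for schedule (a minimal sketch, not from the question): because run_pending() calls each due job in turn on a single thread, the two queries can never hit the database at the same time.

import time
import schedule

while True:
    schedule.run_pending()  # runs every due job sequentially, oldest due time first
    time.sleep(1)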

Airflow / Python - How to resume DAG flow based on external process

Using Airflow 1.8.0 and Python 2.7.
Having the following DAG (simplified):
(Phase 1)-->(Phase 2)
In phase 1 I trigger an asynchronous process that is time-consuming and can run for up to 2 days; when the process ends, it writes a payload to S3. During that period I want the DAG to wait, and to continue to phase 2 only once the S3 payload exists.
I thought of 2 solutions:
When phase 1 starts, pause the DAG using the experimental REST API and resume it once the process ends.
Wait using an operator that checks for the S3 payload every X minutes.
I can't use option 1 since my admin does not allow use of the experimental API, and option 2 seems like bad practice (checking every X minutes).
Are there any other options to solve my task?
I think option (2) is the "correct way"; you may optimize it a bit:
BaseSensorOperator supports poke_interval, so it should be usable with S3KeySensor to increase the time between tries:
poke_interval: Time in seconds that the job should wait in between each try.
Additionally, you could try to use mode and switch it to reschedule:
mode: How the sensor operates. Options are: { poke | reschedule }, default is poke.
When set to poke the sensor is taking up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is required. Note that the sensor will hold onto a worker slot and a pool slot for the duration of the sensor's runtime in this mode.
When set to reschedule the sensor task frees the worker slot when the criteria is not yet met and it's rescheduled at a later time. Use this mode if the time before the criteria is met is expected to be quite long. The poke interval should be more than one minute to prevent too much load on the scheduler.
Not sure about Airflow 1.8.0 - couldn't find the old documentation (I assume poke_interval is supported, but not mode).
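For illustration, a hedged sketch of what such a sensor could look like (the import path differs between Airflow versions, the bucket/key names are placeholders, and mode is not available on 1.8.0):

from airflow.sensors.s3_key_sensor import S3KeySensor  # path differs on older/newer versions

wait_for_payload = S3KeySensor(
    task_id='wait_for_payload',
    bucket_name='my-bucket',              # hypothetical bucket
    bucket_key='path/to/payload.json',    # hypothetical key written by phase 1
    poke_interval=10 * 60,                # check every 10 minutes instead of the default
    timeout=2 * 24 * 60 * 60,             # give up after the 2-day worst case
    mode='reschedule',                    # free the worker slot between pokes (newer Airflow only)
    dag=dag,
)

Phase 2 would then simply be set downstream of wait_for_payload.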

Django run tasks (possibly) in the far future

Suppose I have a model Event. I want to send a notification (email, push, whatever) to all invited users once the event has elapsed. Something along the lines of:
class Event(models.Model):
    start = models.DateTimeField(...)
    end = models.DateTimeField(...)
    invited = models.ManyToManyField(User)

    def onEventElapsed(self):
        for user in self.invited.all():
            my_notification_backend.sendMessage(target=user, message="Event has elapsed")
Now, of course, the crucial part is to invoke onEventElapsed whenever timezone.now() >= event.end.
Keep in mind, end could be months away from the current date.
I have thought about two basic ways of doing this:
Use a periodic cron job (say, every five minutes or so) which checks if any events have elapsed within the last five minutes and executes my method.
Use Celery and schedule onEventElapsed using the eta parameter so it runs in the future (from within the model's save method).
Considering option 1, a potential solution could be django-celery-beat. However, it seems a bit odd to run a task at a fixed interval just for sending notifications. In addition I came up with a (potential) issue that would (probably) result in a not-so-elegant solution:
Checking every five minutes for events that have elapsed in the previous five minutes seems shaky; maybe some events are missed (or others get their notifications sent twice?). Potential workaround: add a boolean field to the model that is set to True once notifications have been sent.
Then again, option 2 also has its problems:
One has to manually handle the situation when an event's start/end datetime is moved. When using Celery, one would have to store the task ID (easy, of course), revoke the task once the dates have changed, and issue a new one. But I have read that Celery has (design-specific) problems when dealing with tasks that run far in the future (see the open issue on GitHub), and I realize how this happens and why it is anything but trivial to solve.
Now, I have come across some libraries which could potentially solve my problem:
celery_longterm_scheduler (but does this mean I cannot use Celery as I did before, because of the different Scheduler class? This also ties into the possible usage of django-celery-beat... Using either of the two frameworks, is it still possible to queue jobs that are just a bit longer-running, but not months away?)
django-apscheduler, which uses APScheduler. However, I was unable to find any information on how it would handle tasks that run in the far future.
Is there a fundamental flaw in the way I am approaching this? I'm glad for any input you might have.
Note: I know this is likely to be somewhat opinion-based, but maybe there is a very basic thing I have missed, regardless of what some might consider ugly or elegant.
We're doing something like this at the company I work for, and the solution is quite simple.
Have a cron job / celery beat task that runs every hour to check whether any notifications need to be sent.
Then send those notifications and mark them as done. This way, even if your notification time is years ahead, it will still be sent. Using ETA is NOT the way to go for a very long wait time; your cache / AMQP broker might lose the data.
You can reduce the interval depending on your needs, but do make sure the runs don't overlap.
If one hour is too coarse a granularity, what you can do instead is run a scheduler task every hour. The logic would be something like:
Run a task (let's call it the scheduler task) hourly, via celery beat, that gets all notifications that need to be sent in the next hour.
Schedule those notifications via apply_async(eta=...) - this does the actual sending.
Using that methodology gets you the best of both worlds (eta and beat).
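As a rough sketch of that pattern (the Notification model, its due_at/sent fields and the task names are placeholders, not from the question):

from datetime import timedelta

from celery import shared_task
from django.utils import timezone

@shared_task
def schedule_upcoming_notifications():
    # Run hourly via celery beat: pick up everything due within the next hour
    # and hand each item to apply_async with its exact eta.
    # Notification is a hypothetical model with due_at (DateTimeField) and sent (BooleanField).
    now = timezone.now()
    upcoming = Notification.objects.filter(sent=False,
                                           due_at__lt=now + timedelta(hours=1))
    for notification in upcoming:
        send_notification.apply_async(args=[notification.pk],
                                      eta=notification.due_at)

@shared_task
def send_notification(notification_pk):
    notification = Notification.objects.get(pk=notification_pk)
    if notification.sent:  # guard against double sending
        return
    # ... send the actual message here ...
    notification.sent = True
    notification.save(update_fields=["sent"])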

Celery + Python: Queue time consuming tasks within another task

I want to query an API (which is time-consuming) with lots of items (~100), but not all at once; instead I want a little delay between the queries.
What I currently have is a task that is executed asynchronously, iterates over the queries, and waits some time after each iteration:
import time

from celery import shared_task

@shared_task
def query_api_multiple(values):
    delay_between_queries = 1
    query_results = []
    for value in values:
        time.sleep(delay_between_queries)
        response = query_api(value)
        if response['result']:
            query_results.append(response)
    return query_results
My question is: when multiple of those requests come in, will the second one be executed after the first has finished, or while the first is still running? And if they are not executed one after the other by default, how can I achieve that?
You should not use time.sleep but rate limit your task instead:
Task.rate_limit
Set the rate limit for this task type (limits the number of tasks that can be run in a given time frame). The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600 ms between starting two tasks on the same worker instance.
So if you want to limit it to 1 query per second, try this:
@shared_task(rate_limit='1/s')
def query_api_multiple(values):
    ...
Yes, if you create multiple tasks then they may run at the same time.
You can rate limit on a per-task-type basis with Celery if you want to limit the number of tasks that run per period of time. Alternatively, you could implement a rate-limiting pattern using something like Redis, combined with Celery retries, if you need more flexibility than what Celery provides out of the box.
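If you go down that road, one possible shape of the Redis-lock-plus-retry pattern (the lock name, timeout and task layout are illustrative, not from the question; query_api is the function from the question above):

import redis
from celery import shared_task

r = redis.Redis()

@shared_task(bind=True, max_retries=None)
def query_api_single(self, value):
    # Only one task may hold the lock at a time; anyone else retries shortly after.
    lock = r.lock('query_api_lock', timeout=10)
    if not lock.acquire(blocking=False):
        raise self.retry(countdown=1)
    try:
        return query_api(value)
    finally:
        lock.release()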

How could I do this kind of scheduled task in AppEngine Python?

I have a form with a text field for entering a number without decimal places, representing an amount of minutes that will be added to the current time and inserted into a table named Alarm.
When the resulting time comes, my web app must make an insert into another table.
For example, if the user enters 20 minutes and the current time is 22:10, the resulting time will be 22:30 and will be inserted into the Alarm table. So, when 22:30 arrives, a new insert will have to be made into the other table.
How can I do this on AppEngine using Python?
Depending on your requirements, you may also want to consider using Tasks with an eta or countdown.
If you plan to allow users to cancel the action, you'd need to use some type of no-op marker the task checks for before adding to the "other" table. Or, make the task check the Alarm table before performing the add.
Also, note that countdown / eta are not precise; they are more like polite requests. So if your queues are backing up with tasks, your adds will happen later than they are supposed to. (Though cron, particularly 1-minute jobs, also periodically suffers timing issues.)
The advantage of this method is that you don't have to figure out how to avoid missing work. Each task represents one add (or a related set of adds). Also, if a write fails the task will retry, which is nice.
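For reference, a minimal sketch of the task-queue approach on the classic App Engine Python runtime; the handler URL and parameter name are placeholders:

from google.appengine.api import taskqueue

def schedule_alarm(alarm_id, minutes_from_now):
    # Enqueue a task that App Engine will deliver roughly `minutes_from_now`
    # minutes later; the handler at /tasks/fire_alarm then does the insert
    # into the other table.
    taskqueue.add(
        url='/tasks/fire_alarm',           # hypothetical handler route
        params={'alarm_id': alarm_id},
        countdown=minutes_from_now * 60,   # or pass an explicit eta datetime
    )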
Cron may be a better solution for your particular problem though.
You've said that you're storing the target time in the Alarm table. So, your cron just has to run every minute (or every 5 or 10, depending on the resolution of your alarms) and check if there's an alarm matching the current time, and if so then do the action.
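And a rough sketch of the cron variant, assuming an NDB Alarm model with a fire_at datetime and a handled flag (names are illustrative only); a cron.yaml entry would call this every minute or so:

from datetime import datetime

from google.appengine.ext import ndb

class Alarm(ndb.Model):
    fire_at = ndb.DateTimeProperty()
    handled = ndb.BooleanProperty(default=False)

def check_alarms():
    # Called by the cron handler: fire everything that is now due and has
    # not been handled yet, then mark it so it is not fired twice.
    due = Alarm.query(Alarm.handled == False,
                      Alarm.fire_at <= datetime.utcnow()).fetch()
    for alarm in due:
        # ... perform the insert into the other table here ...
        alarm.handled = True
        alarm.put()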
