Overlapping cron job that runs the same Django management command: problematic? - python

I have a recurring cron job that runs a Django management command. The command interacts with the ORM, sends email with sendmail, and sends SMS with Twilio. It's possible that the cron jobs will begin to overlap. In other words, the job (that runs this command) might still be executing when the next job starts to run. Will this cause any issues?
(I don't want to wait for the management command to finish executing before running the management command again with cron).
EDIT:
The very beginning of the management command gets a timestamp of when the command was run. At a minimum, this timestamp needs to be accurate. It would be nice if the rest of the command didn't wait for the previous cron job to finish running, but that's non-critical.
EDIT 2:
The cron job only reads from the DB, it doesn't write to it. The application has to continue to work while the cron job is running. The application reads and writes from the DB.

My understanding of cron is that it will fork off each job as a background process, allowing multiple jobs to run at the same time. This can be problematic if the second job depends on the first job being done (e.g. if the second job runs a daily report of aggregated data produced by the first job). If you don't want them to run concurrently, there are workarounds; for example:
How to prevent the cron job execution, if it is already running.
Will Cron start a new job if the current job is not complete?

Yes, this could definitely cause issues: you have a race condition. If you wish, you could acquire a lock around the critical section, which would prevent the next invocation from entering that code until the first invocation of the command has finished. You may be able to use a row lock or a table lock on the underlying data.
Let's presume you're using MySQL, which has its own lock syntax (this is DB dependent), and that you have this model:
class Email(models.Model):
    sent = models.BooleanField(default=False)
    subj = models.CharField(max_length=140)
    msg = models.TextField()
You can create a lock object like this:
from django.db import connection
[...]
class EmailLocks(object):
    """Context manager that holds a MySQL write lock on the email table."""
    def __init__(self):
        self.c = connection.cursor()

    def __enter__(self):
        # blocks until no other connection holds a lock on the table
        self.c.execute('''lock tables my_app_email write''')

    def __exit__(self, *err):
        # runs on normal exit and when an exception propagates
        self.c.execute('unlock tables')
Then lock all of your critical sections like:
with EmailLocks():
    # read the email table and decide if you need to process it
    for e in Email.objects.filter(sent=False):
        # send the email
        # mark the email as sent
        e.sent = True
        e.save()
The lock object will automatically unlock the table on exit. Also, if you throw an exception in your code, the table will still be unlocked.
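To tie this back to the original question, here is a minimal sketch of a management command built around that context manager. The app name my_app, the module my_app/locks.py, and the command name process_emails are all assumptions; note that the timestamp is taken before the lock, so it stays accurate even if this run has to wait for a previous invocation.
# my_app/management/commands/process_emails.py  (hypothetical layout)
from django.core.management.base import BaseCommand
from django.utils import timezone

from my_app.models import Email
from my_app.locks import EmailLocks  # the context manager shown above


class Command(BaseCommand):
    help = "Send pending emails, one run at a time"

    def handle(self, *args, **options):
        started_at = timezone.now()  # the run timestamp, recorded before any blocking
        with EmailLocks():
            for e in Email.objects.filter(sent=False):
                # send the email here, then mark it as sent
                e.sent = True
                e.save()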

So you have a cron job that runs a Django management command and you don't want the runs to overlap.
You can use flock, which takes an exclusive lock on a lock file for the duration of the command. With -n, if the second cron job starts before the first one has finished, flock sees that the lock is already held and exits without running the command.
Below is the crontab entry I used:
* * * * * /usr/bin/flock -n /tmp/fcj.lockfile /usr/bin/python /home/txuser/dev/Project1/projectnew/manage.py flocktest
There is a lot more you can do with flock; see its man page for more on this.

Related

Should I use an infinite loop or a cron job for a python crawler?

I have written a crawler in Python that goes over 60 websites, parses the HTML, and saves the data to a database.
Right now I am using a cron job to run the crawler every 15 minutes. The problem is that I have no way to tell how long the crawler will take to finish (it may sometimes take more than 15 minutes), and I don't want to start another crawler while one is already running.
I have been wondering whether I would be better off using an infinite loop and making the crawler a permanent process that is always running (but how would I make sure the crawler doesn't fail and exit, and how would I restart it every time it exits?).
Which is more efficient: an infinite loop or a cron job?
You could do an infinite loop in a bash script like this:
#!/bin/bash
while ((1)) ; do
    python3 -u /path/to/file.py > /path/to/logs.txt
    sleep 2
done
It will execute the script, and once the script ends (with an error or not), it will execute it again.
https://unix.stackexchange.com/questions/521497/how-should-i-run-a-cron-command-which-has-forever-loop
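If you prefer to keep everything in Python, a rough equivalent of the bash loop above (the paths are taken from the snippet; everything else is just a sketch) is:
import subprocess
import time

while True:
    # run the crawler and wait for it to exit, whether it finishes cleanly or crashes
    subprocess.call(["python3", "-u", "/path/to/file.py"])
    time.sleep(2)  # short pause before restarting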
We can add a control to the cron job's Python script to keep track of its running status (e.g. crawl start time and end time) in the database. A control structure like this may be easier to maintain:
# query crawlStartTime and crawlEndTime from the DB
# ...
if crawlEndTime >= crawlStartTime:  # the previous crawl job is finished
    # update DB: set crawlStartTime = now
    # do crawling tasks ...
    # crawl finished; update DB: set crawlEndTime = now
else:  # the previous crawl job is not finished
    # do not crawl
    # in case the job has been running for far too long
    if now - crawlStartTime >= threshold:
        # send an alert, kill the process, or reset the time records
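A runnable sketch of that idea, assuming a single-row control table named crawl_control and using sqlite3 purely for illustration (swap in your real database driver):
import sqlite3
import time

THRESHOLD = 3600  # assumed upper bound, in seconds, for one crawl


def try_start_crawl(conn):
    # Return True and record the start time if no crawl is in progress.
    now = time.time()
    start, end = conn.execute(
        "SELECT crawl_start, crawl_end FROM crawl_control").fetchone()
    if end >= start:  # the previous crawl finished
        conn.execute("UPDATE crawl_control SET crawl_start = ?", (now,))
        conn.commit()
        return True
    if now - start >= THRESHOLD:  # the previous crawl looks stuck
        pass  # send an alert, kill the old process, or reset the times here
    return False


def finish_crawl(conn):
    conn.execute("UPDATE crawl_control SET crawl_end = ?", (time.time(),))
    conn.commit()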

How to run an Abaqus job just after another has ended?

I wrote a script to create several models in Abaqus and then run the created jobs using a simple Python loop, but when running the script the program submits all the jobs at the same time, and the computer doesn't have enough memory, so it aborts the jobs. I want to know how to create a script where each job is submitted only after the previous one has ended.
It depends on how you are invoking Abaqus. If you are creating the Abaqus processes directly you can add the -interactive argument to your command so it doesn't run the solver in a background process and return immediately. For example:
abq2018 -j my_job_name -interactive
On the other hand, if you are using the Abaqus API and the Job object to create and run jobs you can use the waitForCompletion method to wait until a Job completes. Here is the excerpt from the Abaqus documentation:
waitForCompletion(): This method interrupts the execution of the script until the end of the analysis. If you call the waitForCompletion method and the status member is neither SUBMITTED nor RUNNING, Abaqus assumes the analysis has either completed or aborted and returns immediately.
Here's a short example of how to create Job objects and use the waitForCompletion method:
from abaqus import *
# Create a Job from a Model definition
j1 = mdb.Job(name='my_job_name', model=mdb.models['my_model_name'])
# or create a Job from an existing input file
j2 = mdb.JobFromInputFile(name='my_job_name', inputFileName='my_job_name.inp')
# Submit the first job - this returns immediately
j1.submit()
# Now wait for the first job - this will block until the job completes
j1.waitForCompletion()
# Same process for the second Job
j2.submit()
j2.waitForCompletion()
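If you have several jobs, the same pattern extends naturally to a loop. Here is a short sketch (the job names are placeholders, and it assumes the same Abaqus session and existing input files as the example above) that submits each job only after the previous one has completed:
# assumes job-1.inp, job-2.inp, ... already exist next to the script
for name in ['job-1', 'job-2', 'job-3']:
    job = mdb.JobFromInputFile(name=name, inputFileName=name + '.inp')
    job.submit()
    job.waitForCompletion()  # block here before submitting the next job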
I developed a graphical user interface to improve the queueing of Abaqus analyses.
It's available on GitHub here.
Install:
1. Download or clone the package from GitHub.
2. Find the core.py file and edit the lines 31, 36, 37 and 38 according to your machine and your config.
3. Run it.
If you have any issues using it, please submit them on the GitHub repository.

(ndb, python, gae) - cron job timeout using more than one module

Is there something special that I need to do when working with cron jobs for separate modules? I can't figure out why a request to the cron job at localhost:8083/tasks/crontask (localhost:8083 runs the workers module), which is supposed to just print a simple line, doesn't print to the console, even though it says the request was successful when I go to http://localhost:8000/cron and hit the run button. Even that still doesn't make it print to the console.
If I refresh the page localhost:8083/tasks/crontask as a way of triggering the cron job, it times out.
Again, if I go to localhost:8001 and hit the run button, it says the request to /tasks/crontask was successful, but it doesn't print to the console like it's supposed to.
In send_notifications_handler.py, in the workers/handlers directory:
class CronTaskHandler(BaseApiHandler):
    def get(self):
        print "hello, this is a cron job"
In cron.yaml, outside the workers module:
cron:
- description: something
  url: /tasks/crontask
  schedule: every 1 minutes
  target: workers
In __init__.py in the workers/handlers directory:
from send_notifications_handler import CronTaskHandler

#--- Packaging
__all__ = [
    CounterWorker,
    DeleteGamesCronHandler,
    CelebrityCountsCronTaskHandler,
    QuestionTypeCountsCronHandler,
    CronTaskHandler,
]
In workers/routes.py:
Route('/tasks/crontask', handlers.CronTaskHandler, methods=['GET']),
Updates / resolution:
The print statement is fine and does print to the console.
The cron job will fire once when using the dev server, although it doesn't repeat, which is normal behavior for the dev server.
The problem was that _ah/start in that module was routed to a pull queue that never stops; removing the pull queue fixed the issue.
That is actually the expected behavior when executing cron jobs locally.
If you take a look at the docs, they say the following:
The development server doesn't automatically run your cron jobs. You can use your local desktop's cron or scheduled tasks interface to trigger the URLs of your jobs with curl or a similar tool.
You will need to manually execute cron jobs on the local server by visiting http://localhost:8000/cron, as you mentioned in your post.
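If you want to trigger the handler from a script rather than the browser, a tiny Python 2 sketch (the port is whatever the dev server assigns to the workers module):
import urllib2

# equivalent to hitting the URL with curl, as the docs suggest
print urllib2.urlopen('http://localhost:8083/tasks/crontask').read()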

Restart python script if not running/stopped/error with simple cron job

Summary: I have a Python script which collects tweets using the Twitter API, and I have a PostgreSQL database in the backend which stores all the streamed tweets. I have custom code which overcomes the rate-limit issue, and I have had it running 24/7 for months.
Issue: Sometimes the streaming breaks and sleeps for the given number of seconds, but that doesn't help. I don't want to have to check it manually.
def on_error(self, status):  # tweepy callback
    self.mailMeIfError(['me <me@localhost>'], 'listen.py <root@localhost>',
                       'Error occurred in on_error method', str(status))
    time.sleep(300)
    return True
Assume mailMeIfError is a method which takes care of sending me a mail.
I want a simple cron script which always checks the process and restarts the Python script if it is not running, has stopped, or has hit an error. I have gone through some answers on Stack Overflow that use the process ID. In my case the process ID still exists even when things break, because the script just sleeps on error.
Thanks in advance.
Using Process ID is much easier and safer. Try using watchdog.
This can all be done in your one script. Cron would need to be configured to start your script periodically, say every minute. The start of your script then just needs to determine whether it is the only copy of itself running on the machine. If it spots that another copy is running, it silently terminates; otherwise it continues to run.
This behaviour is called the singleton pattern. There are a number of ways to achieve this, for example Python: single instance of program.
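One minimal way to implement that check, sketched here with fcntl and an assumed lock-file path:
import fcntl
import sys

lock_file = open('/tmp/twitter_stream.lock', 'w')
try:
    # non-blocking exclusive lock; raises IOError if another copy holds it
    fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit(0)  # another copy is already running, so terminate silently

# ... the rest of the streaming script runs here ...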

Creating a processing queue in python

I have an email account set up that triggers a Python script whenever it receives an email. The script goes through several functions, which can take about 30 seconds, and writes an entry into a MySQL database.
Everything runs smoothly until a second email is sent less than 30 seconds after the first. The second email is processed correctly, but the first email creates a corrupted entry in the database.
I'm looking to hold the email data,
msg=email.message_from_file(sys.stdin)
in a queue if the script has not finished processing the prior email.
I'm using python 2.5.
Can anyone recommend a package/script that would accomplish this?
I find this a simple way to avoid running a cronjob while the previous cronjob is still running.
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
This will raise an IOError that I then handle by having the process kill itself.
See http://docs.python.org/library/fcntl.html#fcntl.lockf for more info.
Anyway, you can easily use the same idea to allow only a single job to run at a time, which really isn't the same as a queue (since any process waiting could potentially acquire the lock), but it achieves what you want.
import fcntl
import time
fd = open('lock_file', 'w')
fcntl.lockf(fd, fcntl.LOCK_EX)
# optionally write pid to another file so you have an indicator
# of the currently running process
print 'Hello'
time.sleep(1)
You could also just use http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes, which does exactly what you want.
While Celery is a very fine piece of software, using it in this scenario is akin to driving in a nail with a sledgehammer. At a conceptual level, you are looking for a job queue (which is what celery provides) but the e-mail inbox you are using to trigger the script is also a capable job-queue.
The more direct solution is to have the Python worker script poll the mail server itself (using the built-in poplib, for example), retrieve all new mail every few seconds, and then process any new e-mails one at a time. This will serialize the work your script is doing, thereby preventing two copies from running at once.
For example, you would wrap your existing script in a function like this (from the documentation linked above):
import getpass, poplib
from time import sleep
M = poplib.POP3('localhost')
M.user(getpass.getuser())
M.pass_(getpass.getpass())
while True:
    numMessages = len(M.list()[1])
    for i in range(numMessages):
        email = '\n'.join(M.retr(i+1)[1])
        # This is what your script normally does:
        do_work_for_message(email)
    sleep(5)
I would look into http://celeryproject.org/
I'm fairly certain that will meet your needs exactly.
