I am working with Scrapy 0.20 on Python 2.7.
Question
What is the best Python scheduler?
My need
I need to run my spider, which is a Python script, every 3 hours.
What I have tried
I tried using the Windows scheduler that comes with Windows 7, and it works well. I am able to run a Python script every 3 hours, but I may deploy my Python script on a Linux server, so I may not be able to use this option.
I created a Java application using Quartz Scheduler. It works well, but this is a third-party library, which my manager may refuse.
I created a Windows service and made it fire the script every three hours. It works, but I may deploy my Python script on a Linux server, so I may not be able to use this option.
I am asking about the best practice to fire a Python script on a schedule.
I tried using the Windows scheduler that comes with Windows 7, and it works well.
So that works fine for you already. Good, no need to change your script to do scheduling work yourself.
but I may deploy my Python script on a Linux server, so I may not be able to use this option.
On Linux, you can use cron jobs to achieve this.
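For example, a crontab entry that runs the spider every 3 hours could look like this (the interpreter and script paths are just placeholders):

# run every 3 hours, at minute 0
0 */3 * * * /usr/bin/python /path/to/run_spider.py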
The other way would be to simply keep your script running the whole time, but have it sleep for the three hours in which it has nothing to do. That way you don't need to set up anything on the target machine; you just run the script in the background, and it will keep running and doing its job.
This is exactly how job schedulers work, by the way. They are launched early when the operating system starts, then they just keep running forever, and every short time interval (a minute or so) they check whether any job on their list needs to run now. If that's the case, they spawn a new process and run the job.
So if you wanted to make such a scheduler in Python, you would just keep it running forever, and once every time interval (in your case 3 hours because you only have a single job anyway), you start your job. That can be in a separate process, in a separate thread, or indirectly in a separate thread using asynchronous functions.
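A minimal sketch of that idea in plain Python 2 (the crawl command below is only a placeholder for however you normally start your spider):

import subprocess
import time

def run_job():
    # placeholder: start the spider however you normally do
    subprocess.call(["scrapy", "crawl", "myspider"])

while True:
    run_job()                # do the work
    time.sleep(3 * 60 * 60)  # then wait 3 hours before the next run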
The best way to deploy/schedule your Scrapy project is to use the scrapyd server.
You should install scrapyd.
sudo apt-get install scrapyd
You change your project config file to something like this:
[deploy:somename]
url = http://localhost:6800/ ## this is the default
project = scrapy_project
You deploy your project under the scrapyd server:
scrapy deploy somename
You change your poll interval in /etc/scrapyd/conf.d/default-000 to 3 hours (the default is 5 seconds):
poll_interval = 10800
You schedule your spider with something like:
curl http://localhost:6800/schedule.json -d project=scrapy_project -d spider=myspider
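If you would rather trigger it from Python than from curl, a rough equivalent of the call above (assuming the default endpoint) is:

import urllib
import urllib2

# same parameters as the curl command above
data = urllib.urlencode({"project": "scrapy_project", "spider": "myspider"})
response = urllib2.urlopen("http://localhost:6800/schedule.json", data)
print response.read()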
You can use the web service to monitor your jobs:
http://localhost:6800/
PS: I only tested it under Ubuntu, so I am not sure whether a Windows version exists. If not, you can install a VM with Ubuntu to launch the spiders.
Well, there's always the charming sched module (docs), which provides a generic scheduling interface.
Give it a time function and a sleep function, and it'll give you back a pretty nice and extensible scheduler.
It's not system-level, but if you can run it as a service, it should suffice.
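For instance, a small sketch that uses sched to run a job every 3 hours (the job body is only a placeholder):

import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def job():
    print "running the job..."   # placeholder for the real work
    # schedule the next run in 3 hours
    scheduler.enter(3 * 60 * 60, 1, job, ())

scheduler.enter(0, 1, job, ())   # first run right away
scheduler.run()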
Related
I have a project in which one of the tests consists of running a process indefinitely in order to collect data on the program execution.
It's a Python script that runs locally on a Linux machine, but I'd like for other people in my team to have access to it as well because there are specific moments where the process needs to be restarted.
Is there a way to set up a workflow on this machine that when dispatched, stops and restarts the process?
You can execute commands on your Linux host via GH Actions and SSH. Take a look at this action.
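As a rough sketch only (the SSH action, secret names, and restart command below are assumptions, not the specific action linked above), a manually dispatched workflow could look something like this:

name: restart-collector
on: workflow_dispatch          # run it by hand from the Actions tab

jobs:
  restart:
    runs-on: ubuntu-latest
    steps:
      - name: Restart the process over SSH
        uses: appleboy/ssh-action@master   # assumption: one commonly used SSH action
        with:
          host: ${{ secrets.SSH_HOST }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            # assumption: the long-running process is wrapped in a systemd unit on the host
            sudo systemctl restart data-collector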
I have a Python script that reads data from an OPCDA server and then pushes it to InfluxDB.
So basically it connects to the OPCDA using the OpenOPC library and to InfluxDB using the InfluxDB Python client and then starts an infinite while loop that runs every 5 seconds to read and push data to the database.
I have installed the script as a service using NSSM. What is the best practice to ensure that the script is running 24/7? How do I avoid crashes?
Should I daemonize the script?
Thank you in advance,
Bnjroos
I suggest at least adding logging at the script level. You could also use custom exit codes from Python so NSSM knows to report a failure. Your failures would most likely happen when connecting to your services (i.e. the network is down or something similar), so you could catch those exceptions and exit with a custom code for NSSM to restart the service. If it's running every 5 seconds, you would probably know very soon.
Ensuring availability and avoiding crashes is about your code more than infrastructure, hence the above recommendations.
I believe using NSSM (for scheduling and such) is better than daemonizing, since daemonizing basically re-implements NSSM's functionality inside your script and potentially adds more code that may fail.
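A minimal sketch of that advice (the log file name, exit code value, and read_and_push helper are arbitrary placeholders, not anything NSSM requires):

import logging
import sys
import time

logging.basicConfig(filename="collector.log", level=logging.INFO)

EXIT_CONNECTION_LOST = 3   # arbitrary non-zero code so NSSM sees a failure and can restart us

def read_and_push():
    # hypothetical helper standing in for the OpenOPC read + InfluxDB write
    pass

while True:
    try:
        read_and_push()
    except Exception:
        logging.exception("lost connection; exiting so the service manager restarts the script")
        sys.exit(EXIT_CONNECTION_LOST)
    time.sleep(5)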
I have coded a Python script for Twitter automation using Tweepy. When I run it on my own Linux machine as python file.py, it runs successfully and keeps running, because I have specified repeated tasks inside the script and I don't want to stop it either. But since it is on my local machine, the script might get stopped when my internet connection is off or at night, so I can't keep the script running on my PC the whole day.
So is there any way, website, or method where I could deploy my script and make it execute forever? I have heard about cron jobs in cPanel, which can help with repeated tasks, but in my case I want to keep my script running on the machine until I close it myself.
Are there any such solutions? Most of the Twitter bots I see run forever, meaning their script is being executed somewhere 24x7. This is what I want to know: how is that possible?
As mentioned by Jon and Vincent, it's better to run the code from a cloud service. But either way, I think what you're looking for is what to put into the terminal to run the code even after you close the terminal. This is what worked for me:
nohup python code.py &
You can add a systemd .service file, which can have the added benefit of:
logging (compressed logs at a central place, or over network to a log server)
disallowing access to /tmp and /home-directories
restarting the service if it fails
starting the service at boot
setting capabilities (ref setcap/getcap), disallowing file access if the process only needs network access, for instance
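A minimal unit file sketch along those lines (the unit name and paths are placeholders):

# /etc/systemd/system/twitter-bot.service
[Unit]
Description=Twitter automation script
After=network-online.target

[Service]
ExecStart=/usr/bin/python /home/me/file.py
# restart the service if it fails
Restart=on-failure
# give the process a private /tmp and limit access to /home
PrivateTmp=true
ProtectHome=read-only

[Install]
WantedBy=multi-user.target

Once enabled with systemctl enable --now twitter-bot, it starts at boot and is restarted automatically if it dies.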
I have a simple Python script to send data from a Windows 7 box to a remote computer via SFTP. The script is set to continuously send a single file every 5 minutes. This all works fine, but I'm worried about the off chance that the process stops or fails and the customer doesn't notice the data files have stopped coming in. I've found several ways to monitor Python processes in an Ubuntu/Unix environment but nothing for Windows.
If there are no other mitigating factors in your design or requirements, my suggestion would be to simplify the script so that it doesn't do the polling; it simply sends the file when invoked, and use Windows Scheduler to invoke the script on whatever schedule you need. By relying on a core Windows service, you can factor that complexity out of your script.
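For example, a one-line schtasks command (the task name and script path are placeholders) that runs the simplified send script every 5 minutes would be something like:

schtasks /Create /TN "SendSftpFile" /SC MINUTE /MO 5 /TR "python C:\scripts\send_file.py"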
You can check out restartme; the following link shows how you can use it:
http://www.howtogeek.com/130665/quickly-and-automatically-restart-a-windows-program-when-it-crashes/
I need to install on one of my Windows PC's some software that will periodically send a short HTTP POST request to my remote development server. The request is always the same and should get sent every minute.
What would you recommend as the best approach for that?
The things I considered are:
1. Creating a Windows service
2. Using a script in python (I have cygwin installed)
3. Scheduled task using a batch file (although I don't want the black cmd window to pop up in my face every minute)
Thanks for any additional ideas or hints on how to best implement it.
import urllib
import time

url = "http://example.com/endpoint"   # placeholder: your development server's URL
post_data = "key=value"               # placeholder: url-encoded POST body

while True:
    urllib.urlopen(url, post_data)    # Python 2: passing data makes this a POST
    time.sleep(60)
If you have Cygwin, you probably have cron; run a Python script from your crontab.
This is trivially easy with a scheduled task, which is the native Windows way to schedule tasks! There's no need for Cygwin or Python or anything like that.
I have such a task running on my machine which pokes my Wordpress blog every few hours. The script is just a .bat file which calls wget. The task is configured to "Run whether user is logged on or not" which ensures that it runs when I'm not logged on. There's no "black cmd window".
You didn't say which version of Windows you are on and if you are on XP (unlucky for you if you are) then the configuration is probably different since the scheduled task interface changed quite a bit when Vista came out.