Running a Python web scraper every hour [duplicate]

I'm looking for a library in Python which will provide at- and cron-like functionality.
I'd quite like to have a pure Python solution, rather than relying on tools installed on the box; this way I can run on machines with no cron.
For those unfamiliar with cron: you can schedule tasks based upon an expression like:
0 2 * * 7 /usr/bin/run-backup # run the backups at 0200 on Every Sunday
0 9-17/2 * * 1-5 /usr/bin/purge-temps # run the purge temps command, every 2 hours between 9am and 5pm on Mondays to Fridays.
The cron time expression syntax is less important, but I would like to have something with this sort of flexibility.
If there isn't something that does this for me out-the-box, any suggestions for the building blocks to make something like this would be gratefully received.
Edit
I'm not interested in launching processes, just "jobs" also written in Python - python functions. By necessity I think this would be a different thread, but not in a different process.
To this end, I'm looking for the expressivity of the cron time expression, but in Python.
Cron has been around for years, but I'm trying to be as portable as possible. I cannot rely on its presence.

If you're looking for something lightweight, check out schedule:
import schedule
import time
def job():
    print("I'm working...")
schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
while True:
    schedule.run_pending()
    time.sleep(1)
Disclosure: I'm the author of that library.

You could just use normal Python argument passing syntax to specify your crontab. For example, suppose we define an Event class as below:
from datetime import datetime, timedelta
import time
# Some utility classes / functions first
class AllMatch(set):
    """Universal set - match everything"""
    def __contains__(self, item): return True
allMatch = AllMatch()
def conv_to_set(obj): # Allow single integer to be provided
    if isinstance(obj, int): # Python 3: int also covers Python 2's long
        return set([obj]) # Single item
    if not isinstance(obj, set):
        obj = set(obj)
    return obj
# The actual Event class
class Event(object):
    def __init__(self, action, min=allMatch, hour=allMatch,
                 day=allMatch, month=allMatch, dow=allMatch,
                 args=(), kwargs={}):
        self.mins = conv_to_set(min)
        self.hours = conv_to_set(hour)
        self.days = conv_to_set(day)
        self.months = conv_to_set(month)
        self.dow = conv_to_set(dow)
        self.action = action
        self.args = args
        self.kwargs = kwargs
    def matchtime(self, t):
        """Return True if this event should trigger at the specified datetime"""
        return ((t.minute in self.mins) and
                (t.hour in self.hours) and
                (t.day in self.days) and
                (t.month in self.months) and
                (t.weekday() in self.dow))
    def check(self, t):
        if self.matchtime(t):
            self.action(*self.args, **self.kwargs)
(Note: Not thoroughly tested)
Then your CronTab can be specified in normal python syntax as:
c = CronTab(
    Event(perform_backup, 0, 2, dow=6),
    Event(purge_temps, 0, range(9,18,2), dow=range(0,5))
)
This way you get the full power of Python's argument mechanics (mixing positional and keyword args, and using symbolic names for weekdays and months).
The CronTab class would be defined as simply sleeping in minute increments, and calling check() on each event. (There are probably some subtleties with daylight saving time / timezones to be wary of though). Here's a quick implementation:
class CronTab(object):
    def __init__(self, *events):
        self.events = events
    def run(self):
        t = datetime(*datetime.now().timetuple()[:5])
        while True:
            for e in self.events:
                e.check(t)
            t += timedelta(minutes=1)
            while datetime.now() < t:
                # Use total_seconds() clamped at 0: if the minute boundary passes
                # between the check and this line, .seconds would wrap to ~86399.
                time.sleep(max(0.0, (t - datetime.now()).total_seconds()))
A few things to note: Python's weekday() is zero-indexed starting at Monday (0 = Monday, 6 = Sunday), whereas cron counts from Sunday (0 = Sunday), and range excludes the last element, hence cron's "1-5" (Monday to Friday) becomes range(0,5), i.e. [0,1,2,3,4]. Months are 1-12 in both. If you prefer cron syntax, parsing it shouldn't be too difficult, however.
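As a rough illustration of the cron-syntax parsing mentioned above, here is a minimal sketch that expands one cron field such as "9-17/2" into a set of integers that could be passed to Event. The name parse_cron_field and its lo/hi parameters are my own (hypothetical), and it only handles *, single numbers, A-B ranges, comma lists and /step; month or weekday names and the Sunday-vs-Monday weekday shift are not handled.

```python
def parse_cron_field(field, lo, hi):
    """Expand one cron field (e.g. "9-17/2") into a set of ints in [lo, hi].

    Hypothetical helper: supports "*", "N", "A-B", comma lists and "/step" only.
    """
    values = set()
    for part in field.split(","):
        # Split off an optional "/step" suffix
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if not (lo <= start <= end <= hi):
            raise ValueError("field %r out of range %d-%d" % (part, lo, hi))
        values.update(range(start, end + 1, step))
    return values


hours = parse_cron_field("9-17/2", 0, 23)
minutes = parse_cron_field("*/15", 0, 59)
```

The resulting sets can be handed straight to conv_to_set, since it accepts any set.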

More or less the same as above, but concurrent using gevent :)
"""Gevent based crontab implementation"""
from datetime import datetime, timedelta
import gevent
# Some utility classes / functions first
def conv_to_set(obj):
    """Converts to set allowing single integer to be provided"""
    if isinstance(obj, int): # Python 3: int also covers Python 2's long
        return set([obj]) # Single item
    if not isinstance(obj, set):
        obj = set(obj)
    return obj
class AllMatch(set):
    """Universal set - match everything"""
    def __contains__(self, item):
        return True
allMatch = AllMatch()
class Event(object):
    """The Actual Event Class"""
    def __init__(self, action, minute=allMatch, hour=allMatch,
                 day=allMatch, month=allMatch, daysofweek=allMatch,
                 args=(), kwargs={}):
        self.mins = conv_to_set(minute)
        self.hours = conv_to_set(hour)
        self.days = conv_to_set(day)
        self.months = conv_to_set(month)
        self.daysofweek = conv_to_set(daysofweek)
        self.action = action
        self.args = args
        self.kwargs = kwargs
    def matchtime(self, t1):
        """Return True if this event should trigger at the specified datetime"""
        return ((t1.minute in self.mins) and
                (t1.hour in self.hours) and
                (t1.day in self.days) and
                (t1.month in self.months) and
                (t1.weekday() in self.daysofweek))
    def check(self, t):
        """Check and run action if needed"""
        if self.matchtime(t):
            self.action(*self.args, **self.kwargs)
class CronTab(object):
    """The crontab implementation"""
    def __init__(self, *events):
        self.events = events
    def _check(self):
        """Check all events in separate greenlets"""
        t1 = datetime(*datetime.now().timetuple()[:5])
        for event in self.events:
            gevent.spawn(event.check, t1)
        t1 += timedelta(minutes=1)
        s1 = (t1 - datetime.now()).seconds + 1
        print("Checking again in %s seconds" % s1)
        job = gevent.spawn_later(s1, self._check)
    def run(self):
        """Run the cron forever"""
        self._check()
        while True:
            gevent.sleep(60)
import os
def test_task():
    """Just an example that sends a bell and asd to all terminals"""
    os.system('echo asd | wall')
cron = CronTab(
    Event(test_task, 22, 1),
    Event(test_task, 0, range(9,18,2), daysofweek=range(0,5)),
)
cron.run()

None of the listed solutions even attempt to parse a complex cron schedule string. So, here is my version, using croniter. Basic gist:
schedule = "*/5 * * * *" # Run every five minutes
nextRunTime = getNextCronRunTime(schedule)
while True:
    roundedDownTime = roundDownTime()
    if (roundedDownTime == nextRunTime):
        ####################################
        ### Do your periodic thing here. ###
        ####################################
        nextRunTime = getNextCronRunTime(schedule)
    elif (roundedDownTime > nextRunTime):
        # We missed an execution. Error. Re initialize.
        nextRunTime = getNextCronRunTime(schedule)
    sleepTillTopOfNextMinute()
Helper routines:
import time
from croniter import croniter
from datetime import datetime, timedelta
# Round time to the nearest multiple of dateDelta (one minute by default).
# The main loop wakes just after the top of a minute, so this lands on that minute.
def roundDownTime(dt=None, dateDelta=timedelta(minutes=1)):
    roundTo = dateDelta.total_seconds()
    if dt is None:
        dt = datetime.now()
    seconds = (dt - dt.min).seconds
    rounding = (seconds+roundTo/2) // roundTo * roundTo
    return dt + timedelta(0,rounding-seconds,-dt.microsecond)
# Get next run time from now, based on schedule specified by cron string
def getNextCronRunTime(schedule):
    return croniter(schedule, datetime.now()).get_next(datetime)
# Sleep till the top of the next minute
def sleepTillTopOfNextMinute():
    t = datetime.utcnow()
    sleeptime = 60 - (t.second + t.microsecond/1000000.0)
    time.sleep(sleeptime)

I know there are a lot of answers, but another solution could be to go with decorators. This is an example that repeats a function every day at a specific time. The cool thing about doing it this way is that you only need to add the syntactic sugar to the function you want to schedule:
@repeatEveryDay(hour=6, minutes=30)
def sayHello(name):
    print(f"Hello {name}")
sayHello("Bob") # Now this function will be invoked every day at 6.30 a.m
And the decorator will look like:
import datetime
import functools
import time
def repeatEveryDay(hour, minutes=0, seconds=0):
    """
    Decorator that will run the decorated function every day at that hour, minutes and seconds.
    :param hour: 0-23
    :param minutes: 0-59 (Optional)
    :param seconds: 0-59 (Optional)
    """
    def decoratorRepeat(func):
        @functools.wraps(func)
        def wrapperRepeat(*args, **kwargs):
            def getLocalTime():
                return datetime.datetime.fromtimestamp(time.mktime(time.localtime()))
            # Interval between calls: one day, as the docstring describes
            td = datetime.timedelta(days=1)
            # Get the datetime of the first function call
            if wrapperRepeat.nextSent is None:
                now = getLocalTime()
                wrapperRepeat.nextSent = datetime.datetime(now.year, now.month, now.day, hour, minutes, seconds)
                if wrapperRepeat.nextSent < now:
                    wrapperRepeat.nextSent += td
            # Waiting till next day
            while getLocalTime() < wrapperRepeat.nextSent:
                time.sleep(1)
            # Call the function
            func(*args, **kwargs)
            # Get the datetime of the next function call
            wrapperRepeat.nextSent += td
            wrapperRepeat(*args, **kwargs)
        wrapperRepeat.nextSent = None
        return wrapperRepeat
    return decoratorRepeat

I like how the pycron package solves this problem.
import pycron
import time
while True:
    if pycron.is_now('0 2 * * 0'): # True Every Sunday at 02:00
        print('running backup')
        time.sleep(60) # The process should take at least 60 sec
                       # to avoid running twice in one minute
    else:
        time.sleep(15) # Check again in 15 seconds

There isn't a "pure python" way to do this because some other process would have to launch python in order to run your solution. Every platform will have one or twenty different ways to launch processes and monitor their progress. On unix platforms, cron is the old standard. On Mac OS X there is also launchd, which combines cron-like launching with watchdog functionality that can keep your process alive if that's what you want. Once python is running, then you can use the sched module to schedule tasks.
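For what it's worth, here is a minimal sketch of the standard-library sched module mentioned above. The job function and the short delays are just for illustration; real jobs would use longer delays, and a repeating job would have to re-enter itself, since sched only queues one-shot events.

```python
import sched
import time

# scheduler(timefunc, delayfunc): monotonic clock plus a blocking sleep
scheduler = sched.scheduler(time.monotonic, time.sleep)
ran = []

def job(name):
    # Record the call; a real job would do its work here
    ran.append(name)

# enter(delay_seconds, priority, action, argument=...)
scheduler.enter(0.02, 1, job, argument=("second",))
scheduler.enter(0.01, 1, job, argument=("first",))
scheduler.run()  # blocks until the queue is empty
```

Events run in order of their due time, not in the order they were queued.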

Another trivial solution would be:
from aqcron import At
from time import sleep
from datetime import datetime
# Event scheduling
event_1 = At( second=5 )
event_2 = At( second=[0,20,40] )
while True:
    now = datetime.now()
    # Event check
    if now in event_1: print("event_1")
    if now in event_2: print("event_2")
    sleep(1)
And the class aqcron.At is:
# aqcron.py
class At(object):
    def __init__(self, year=None, month=None,
                 day=None, weekday=None,
                 hour=None, minute=None,
                 second=None):
        loc = locals()
        loc.pop("self")
        self.at = dict((k, v) for k, v in loc.items() if v is not None)
    def __contains__(self, now):
        for k in self.at.keys():
            value = getattr(now, k)
            if callable(value): # datetime.weekday is a method, not an attribute
                value = value()
            try:
                if value not in self.at[k]: return False
            except TypeError:
                if self.at[k] != value: return False
        return True

I don't know if something like that already exists. It would be easy to write your own with time, datetime and/or calendar modules, see http://docs.python.org/library/time.html
The only concern for a python solution is that your job needs to be always running and possibly be automatically "resurrected" after a reboot, something for which you do need to rely on system dependent solutions.

Related

How to slow down asynchronous API calls to match API limits?

I have a list of ~300K URLs for an API I need to get data from.
The API limit is 100 calls per second.
I have made a class for the asynchronous requests, but it works too fast and I am hitting an error on the API.
How do I slow down the asynchronous requests so that I make at most 100 calls per second?
import grequests
lst = ['url.com','url2.com']
class Test:
    def __init__(self):
        self.urls = lst
    def exception(self, request, exception):
        print ("Problem: {}: {}".format(request.url, exception))
    def fetch_all(self): # originally named `async`, which is a reserved word since Python 3.7
        return grequests.map((grequests.get(u) for u in self.urls), exception_handler=self.exception, size=5)
    def collate_responses(self, results):
        return [x.text for x in results]
test = Test()
#here we collect the results returned by the async function
results = test.fetch_all()
response_text = test.collate_responses(results)

The first step that I took was to create an object that can distribute a maximum of n coins every t ms.
import time
class CoinsDistribution:
    """Object that distributes a maximum of maxCoins every timeLimit ms"""
    def __init__(self, maxCoins, timeLimit):
        self.maxCoins = maxCoins
        self.timeLimit = timeLimit
        self.coin = maxCoins
        self.time = time.perf_counter()
    def getCoin(self):
        if self.coin <= 0 and not self.restock():
            return False
        self.coin -= 1
        return True
    def restock(self):
        t = time.perf_counter()
        if (t - self.time) * 1000 < self.timeLimit:
            return False
        self.coin = self.maxCoins
        self.time = t
        return True
Now we need a way of forcing functions to be called only if they can get a coin.
To do that we can write a decorator that could be used like this:
@limitCalls(callLimit=1, timeLimit=1000)
def uniqFunctionRequestingServer1():
    return 'response from s1'
But sometimes multiple functions send requests to the same server, so we want them to get coins from the same CoinsDistribution object.
Therefore, another use of the decorator is to supply the CoinsDistribution object:
server_2_limit = CoinsDistribution(3, 1000)
@limitCalls(server_2_limit)
def sendRequestToServer2():
    return 'it worked !!'
@limitCalls(server_2_limit)
def sendAnOtherRequestToServer2():
    return 'it worked too !!'
We now have to create the decorator; it can take either a CoinsDistribution object or enough data to create a new one.
import functools
def limitCalls(obj=None, *, callLimit=100, timeLimit=1000):
    if obj is None:
        obj = CoinsDistribution(callLimit, timeLimit)
    def limit_decorator(func):
        @functools.wraps(func)
        def limit_wrapper(*args, **kwargs):
            if obj.getCoin():
                return func(*args, **kwargs)
            return 'limit reached, please wait'
        return limit_wrapper
    return limit_decorator
And it's done! Now you can limit the number of calls to any API that you use, and you can build a dictionary to keep track of your CoinsDistribution objects if you have to manage a lot of them (for different API endpoints or different APIs).
Note: Here I have chosen to return an error message if there are no coins available. You should adapt this behaviour to your needs.

You can just keep track of how much time has passed and decide if you want to do more requests or not.
This will print 100 numbers per second, for example:
from datetime import datetime
import time
start = datetime.now()
time.sleep(1)
counter = 0
while (True):
    end = datetime.now()
    s = (end-start).seconds
    if (counter >= 100):
        if (s <= 1):
            time.sleep(1) # You can keep track of the time and sleep less, actually
        start = datetime.now()
        counter = 0
    print(counter)
    counter += 1

This other question in SO shows exactly how to do this. By the way, what you need is usually called throttling.
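To give an idea of what throttling can look like in practice, here is a minimal sketch that blocks until the next call is allowed instead of rejecting it. The Throttle class and its parameter names are my own invention, not taken from the linked question or from grequests; the clock and sleep arguments exist only so the behaviour can be checked without real waiting.

```python
import time

class Throttle:
    """Allow at most `rate` calls per `per` seconds by sleeping when needed."""

    def __init__(self, rate, per=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = per / rate      # minimum spacing between two calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()     # earliest time the next call may start

    def wait(self):
        now = self.clock()
        delay = self.next_allowed - now
        if delay > 0:
            self.sleep(delay)           # too early: wait out the remainder
        self.next_allowed = max(now, self.next_allowed) + self.interval


throttle = Throttle(rate=100, per=1.0)
for i in range(5):
    throttle.wait()   # a request to the API would go here
```

This spaces calls evenly (one every 10 ms for 100 calls per second) rather than allowing bursts, which is the more conservative choice when the API counts calls strictly.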

python threading for elevator simulation

I am trying to make an elevator simulation because of an interesting problem I saw on CareerCup. My problem is that I want the elevator to "take time" to move from one floor to another. Right now it just instantly moves to the next floor in its "to visit" list. I'm not sure how to program it so that "pickup requests" can be coming in while the elevator is moving. I think this may require threading, and the time.sleep() function. How do I make one thread that makes random requests to the elevator, and another thread that has the elevator trying to meet all of the requests? This is what I have so far:
import time
from random import *
import math
class Elevator:
    def __init__(self, num_floors):
        self.going_up = False
        self.going_down = False
        self.docked = True
        self.curr_floor = 0
        self.num_floors = num_floors
        self.floors_to_visit = []
        self.people_waiting = []
    def print_curr_status(self):
        for i in range(self.num_floors):
            if i == self.curr_floor:
                print('. []')
            else:
                print('.')
        print ("to_visit: ", self.floors_to_visit)
    def handle_call_request(self, person):
        if not self.going_up and not self.going_down:
            self.floors_to_visit = [person.curr_floor] + self.floors_to_visit
            self.going_up = True
            self.docked = False
            self.people_waiting.append(person)
        else:
            self.floors_to_visit.append(person.curr_floor)
            self.people_waiting.append(person)
    def handle_input_request(self, floor_num):
        self.floors_to_visit.append(floor_num)
    def go_to_next(self):
        if not self.floors_to_visit:
            self.print_curr_status()
            return
        self.curr_floor = self.floors_to_visit.pop(0)
        for i,person in enumerate(self.people_waiting):
            if person.curr_floor == self.curr_floor:
                person.riding = True
                person.press_floor_num()
                self.people_waiting.pop(i)
        return
class Person:
    def __init__(self, assigned_elevator, curr_floor):
        self.curr_floor = curr_floor
        self.desired_floor = math.floor(random() * 10)
        self.assigned_elevator = assigned_elevator
        self.riding = False
    def print_floor(self):
        print(self.desired_floor)
    def call_elevator(self):
        self.assigned_elevator.handle_call_request(self)
    def press_floor_num(self):
        self.assigned_elevator.handle_input_request(self.desired_floor)
my_elevator = Elevator(20)
while True:
    for i in range(3):
        some_person = Person(my_elevator, math.floor(random() * 10))
        some_person.call_elevator()
    my_elevator.go_to_next()
    my_elevator.print_curr_status()
    time.sleep(1)

No threading is necessary. You can introduce two new variables: one keeping track of the time the elevator started and one for the time an elevator ride should take. Then just check when the elevator has run long enough. You can do this by calling time.time(); it returns the time in seconds since January 1, 1970 (since you're only interested in the difference, the epoch doesn't matter; you just need a function whose value increases with time). Note that on some systems this function can't give a resolution better than 1 second. If you feel it's too inaccurate on your machine, you could use datetime.
class Elevator:
    def __init__(self, num_floors):
        self.start_time = 0
        self.ride_duration = 1
        ...
    def call_elevator(self):
        self.start_time = time.time()
        self.assigned_elevator.handle_call_request(self)
    def go_to_next(self):
        if time.time() - self.start_time < self.ride_duration:
            return # Do nothing.
        else:
            ...
You'll probably need to refactor the code to suit your needs and add some logic on what to do when the elevator is in use, etc.
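To make the idea a bit more concrete, here is a stripped-down sketch of a non-blocking elevator that accepts requests while it is "moving". TimedElevator, request and update are names I made up for this example and are not part of the question's code; I also use time.monotonic() instead of time.time(), on the assumption that only elapsed time matters.

```python
import time

class TimedElevator:
    """Hypothetical, reduced elevator: one queued floor per `ride_duration` seconds."""

    def __init__(self, ride_duration=1.0, clock=time.monotonic):
        self.ride_duration = ride_duration
        self.clock = clock
        self.curr_floor = 0
        self.floors_to_visit = []
        self.departed_at = None   # None means the elevator is idle

    def request(self, floor):
        # Requests can arrive at any time, including mid-ride
        self.floors_to_visit.append(floor)

    def update(self):
        """Call this often from the main loop; it never blocks."""
        now = self.clock()
        if self.departed_at is None:
            if self.floors_to_visit:
                self.departed_at = now   # start a ride towards the next floor
            return
        if now - self.departed_at >= self.ride_duration:
            self.curr_floor = self.floors_to_visit.pop(0)   # ride finished
            self.departed_at = None
```

The main loop would then alternate between generating requests and calling update(), with a short time.sleep() so it doesn't spin.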

timeit eats return value

I want to measure execution time of a function on the cheap, something like this:
import time
def my_timeit(func, *args, **kwargs):
    t0 = time.time()
    result = func(*args, **kwargs)
    delta = time.time() - t0
    return delta, result
def foo():
    time.sleep(1.23)
    return 'potato'
delta, result = my_timeit(foo)
But I want to use timeit, profile or other built-in to handle whatever are the common pitfalls due to platform differences, and it would probably be also better to get the actual execution time not the wall time.
I tried using timeit.Timer(foo).timeit(number=1) but the interface seems to obscure the return value.

This is my current attempt. But I would welcome any suggestions, because this feels too hacky and could probably do with improvement.
import functools
import time
from timeit import Timer
def my_timeit(func, *args, **kwargs):
    output_container = []
    def wrapper():
        output_container.append(func(*args, **kwargs))
    timer = Timer(wrapper)
    delta = timer.timeit(1)
    return delta, output_container.pop()
def foo():
    time.sleep(1.111)
    return 'potato'
delta, result = my_timeit(foo)
edit: adapted to work as a decorator below:
def timeit_decorator(the_func):
    @functools.wraps(the_func)
    def my_timeit(*args, **kwargs):
        output_container = []
        def wrapper():
            output_container.append(the_func(*args, **kwargs))
        timer = Timer(wrapper)
        delta = timer.timeit(1)
        my_timeit.last_execution_time = delta
        return output_container.pop()
    return my_timeit
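A possibly simpler alternative, as a sketch: timeit's default timer is time.perf_counter, so you can call that clock directly and keep the return value, and swap in time.process_time if you want CPU time rather than wall time. The helper name timed_call and its clock keyword are my own; the keyword would clash with a timed function that itself takes a clock argument.

```python
import time

def timed_call(func, *args, clock=time.perf_counter, **kwargs):
    """Return (elapsed_seconds, result) for a single call of func."""
    start = clock()
    result = func(*args, **kwargs)
    return clock() - start, result

def foo():
    return sum(range(100000))

# Wall-clock time, using the same clock timeit uses by default
wall, result = timed_call(foo)
# CPU time of this process only (time spent sleeping is not counted)
cpu, _ = timed_call(foo, clock=time.process_time)
```

Unlike timeit, this does not disable garbage collection during the measurement, so for micro-benchmarks timeit is still the safer choice.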

How about
>>time python yourprogram.py < input.txt
This is the output for a python script I ran
[20:13:29] praveen:jan$ time python mtrick.py < input_mtrick.txt
3 3 9
1 2 3 4
real 0m0.067s
user 0m0.016s
sys 0m0.012s

Get python unit test duration in seconds

Is there any way to get the total amount of time that "unittest.TextTestRunner().run()" has taken to run a specific unit test.
I'm using a for loop to test modules against certain scenarios (some having to be used and some not, so they run a few times), and I would like to print the total time it has taken to run all the tests.
Any help would be greatly appreciated.

UPDATED, thanks to @Centralniak's comment.
How about simply:
from datetime import datetime
tick = datetime.now()
# run the tests here
tock = datetime.now()
diff = tock - tick # the result is a datetime.timedelta object
print(diff.total_seconds())
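To fill in the "run the tests here" step, here is a minimal sketch that times one unittest.TextTestRunner().run() call. ExampleTests is a made-up test case, and I use time.perf_counter() rather than datetime since only the elapsed time is needed; the runner writes to an in-memory stream just to keep the example quiet.

```python
import io
import time
import unittest

class ExampleTests(unittest.TestCase):   # hypothetical test case
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests)
# Drop the stream argument to get the usual report on stderr
runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)

start = time.perf_counter()
result = runner.run(suite)
elapsed = time.perf_counter() - start
print("ran %d test(s) in %.3f s" % (result.testsRun, elapsed))
```

If the runner is called inside a for loop, summing the elapsed values gives the total across all runs.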

You could record start time in the setup function and then print elapsed time in cleanup.

Following Eric's one-line answer I have a little snippet I work with here:
from datetime import datetime
import unittest
class SomeTests(unittest.TestCase):
    """
    ... write the rest yourself! ...
    """
    def setUp(self):
        self.tick = datetime.now()
    def tearDown(self):
        self.tock = datetime.now()
        diff = self.tock - self.tick
        # total_seconds() includes whole seconds; .microseconds alone would drop them
        print(diff.total_seconds() * 1000, "ms")
    # all the other tests below
This works fine enough for me, for now, but I want to fix some minor formatting issues. The result ok is now on the next line, and FAIL has priority. This is ugly.

I do this exactly as Eric postulated -- here's a decorator I use for tests (often more functional-test-y than strict unit tests)...
# -*- coding: utf-8 -*-
from __future__ import print_function
import time
from functools import wraps
from pprint import pprint
WIDTH = 60
print_separator = lambda fill='-', width=WIDTH: print(fill * width)
def timedtest(function):
    """
    Functions so decorated will print the time they took to execute.
    Usage:
        import unittest
        class MyTests(unittest.TestCase):
            @timedtest
            def test_something(self):
                assert something is something_else
                # … etc
                # An optional return value is pretty-printed,
                # along with the timing values:
                return another_thing
    """
    @wraps(function)
    def wrapper(*args, **kwargs):
        print()
        print("TESTING: %s(…)" % getattr(function, "__name__", "<unnamed>"))
        print_separator()
        print()
        t1 = time.time()
        out = function(*args, **kwargs)
        t2 = time.time()
        dt = str((t2 - t1) * 1.00)
        dtout = dt[:(dt.find(".") + 4)]
        print_separator()
        if out is not None:
            print('RESULTS:')
            pprint(out, indent=4)
        print('Test finished in %s seconds' % dtout)
        print_separator('=')
        return out
    return wrapper
That's the core of it -- from there, if you want, you can stash the times in a database for analysis, or draw graphs, et cetera. A decorator like this (using @wraps(…) from the functools module) won't interfere with any of the dark magic that unit-test frameworks occasionally resort to.

Besides using datetime, you could also use time
from time import time
t0 = time()
# do your stuff here
print(time() - t0) # it will show in seconds

Python, function quit if it has been run the last 5 minutes

I have a Python script that gets data from a USB weather station; at the moment it puts the data into MySQL whenever data is received from the station.
I have a MySQL class with an insert function. What I want is for the function to check whether it has been run in the last 5 minutes and, if it has, quit.
Could not find any code on the internet that does this.
Maybe I need to have a sub-process, but I am not familiar with that at all.
Does anyone have an example that I can use?

Use this timeout decorator.
import signal
class TimeoutError(Exception):
    def __init__(self, value = "Timed Out"):
        self.value = value
    def __str__(self):
        return repr(self.value)
def timeout(seconds_before_timeout):
    def decorate(f):
        def handler(signum, frame):
            raise TimeoutError()
        def new_f(*args, **kwargs):
            old = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds_before_timeout)
            try:
                result = f(*args, **kwargs)
            finally:
                signal.signal(signal.SIGALRM, old)
                signal.alarm(0)
            return result
        new_f.__name__ = f.__name__ # func_name in Python 2
        return new_f
    return decorate
Usage:
import time
@timeout(5)
def mytest():
    print("Start")
    for i in range(1,10):
        time.sleep(1)
        print("%d seconds have passed" % i)
if __name__ == '__main__':
    mytest()

Probably the most straight-forward approach (you can put this into a decorator if you like, but that's just cosmetics I think):
import time
import datetime
class MySQLWrapper:
    def __init__(self, min_period_seconds):
        self.min_period = datetime.timedelta(seconds=min_period_seconds)
        self.last_calltime = datetime.datetime.now() - self.min_period
    def insert(self, item):
        now = datetime.datetime.now()
        if now-self.last_calltime < self.min_period:
            print("not insert")
        else:
            self.last_calltime = now
            print("insert", item)
m = MySQLWrapper(5)
m.insert(1) # insert 1
m.insert(2) # not insert
time.sleep(5)
m.insert(3) # insert 3
As a side note: have you come across RRDTool during your web search for related stuff? It apparently does what you want to achieve, i.e.
a database to store the most recent values of arbitrary resolution/update frequency.
extrapolation/interpolation of values if updates are too frequent or missing.
generates graphs from the data.
An approach could be to store all data you can get into your MySQL database and forward a subset to such RRDTool database to generate a nice time series visualization of it. Depending on what you might need.

import time
def timeout(f, k, n):
    last_time = [time.time()]
    count = [0]
    def inner(*args, **kwargs):
        distance = time.time() - last_time[0]
        if distance > k:
            # Interval elapsed: start a new window and count this call
            last_time[0] = time.time()
            count[0] = 1
            return f(*args, **kwargs)
        elif count[0] >= n:
            # Already ran n times in the current window
            return False
        else:
            count[0] += 1
            return f(*args, **kwargs)
    return inner
timed = timeout(lambda x, y : x + y, 300, 1)
print(timed(2, 4))
First argument is the function you want run, second is the time interval, and the third is the number of times it's allowed to run in that time interval.

Each time the function is run save a file with the current time. When the function is run again check the time stored in the file and make sure it is old enough.
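A minimal sketch of that file-based idea, in case it helps. The function name ran_recently, the stamp file location and the now parameter are all my own choices for illustration; it stores time.time() because the value has to stay comparable across separate runs of the script.

```python
import os
import tempfile
import time

def ran_recently(stamp_path, min_interval, now=None):
    """Return True if stamp_path records a run less than min_interval seconds ago.

    Otherwise record `now` as the latest run and return False.
    """
    now = time.time() if now is None else now
    try:
        with open(stamp_path) as fh:
            last = float(fh.read().strip())
    except (OSError, ValueError):
        last = None          # no usable stamp yet
    if last is not None and now - last < min_interval:
        return True
    with open(stamp_path, "w") as fh:
        fh.write(repr(now))
    return False

stamp = os.path.join(tempfile.mkdtemp(), "last_insert.txt")  # hypothetical location
first = ran_recently(stamp, 300, now=1000.0)    # no stamp yet
second = ran_recently(stamp, 300, now=1100.0)   # 100 s after the first
third = ran_recently(stamp, 300, now=1400.0)    # 400 s after the first
```

The insert function would then start with something like "if ran_recently(path, 300): return".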

Just derive a new class and override the insert function. In the overriding function, check the last insert time and call the parent's insert method only if more than five minutes have passed, and of course update the most recent insert time.
