Does APScheduler Need a Function to Run? - python

I'm new to coding and this is my first project. So far I've pieced together what I have through Googling, tutorials and Stack Overflow.
I'm trying to add data from a pandas DataFrame of scraped RSS feeds to a remote SQL database, then host the script on Heroku or AWS and have it run every hour.
Someone on here recommended that I use APScheduler, as in this post.
I'm struggling though, as there aren't any 'for dummies' tutorials around APScheduler. This is what I've created so far.
I guess my question is: does my script need to be in a function for APScheduler to trigger it, or can it work another way?
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', minutes=1)
sched.configure()
sched.start()

import pandas as pd
from pandas.io import sql
import feedparser
import time

rawrss = ['http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
          'https://www.yahoo.com/news/rss/',
          'http://www.huffingtonpost.co.uk/feeds/index.xml',
          'http://feeds.feedburner.com/TechCrunch/',
          'https://www.uktech.news/feed'
          ]

time = time.strftime('%a %H:%M:%S')
summary = 'text'
posts = []

for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((time, post.title, post.link, summary))

df = pd.DataFrame(posts, columns=['article_time','article_title','article_url', 'article_summary']) # pass data to init
df.set_index(['article_time'], inplace=True)

import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8', encoding = 'utf-8')

engine.execute("INSERT INTO rsstracker VALUES('%s', '%s', '%s','%s')" % (time, post.title, post.link, summary))

df.to_sql(con=engine, name='rsstracker', if_exists='append') #, flavor='mysql'

Yes. What you want to be executed must be a function (or another callable, like a method). The decorator syntax (@sched.…) needs a function definition (def …) to which the decorator is applied. The code in your example doesn't compile.
Then it's a blocking scheduler, meaning that once you call sched.start() this method doesn't return (unless you stop the scheduler in some scheduled code), and nothing after the call is executed.
Imports should go to the top; then it's easier to see what the module depends on. And don't import things you don't actually use.
I'm not sure why you import and use pandas for data that doesn't really need DataFrame objects. You also import SQLAlchemy without actually using anything the library offers, and you format values as strings into an SQL query, which is dangerous!
Using SQLAlchemy for the database access, it may look like this:
#!/usr/bin/env python
# coding: utf-8
from __future__ import absolute_import, division, print_function
from time import strftime

import feedparser
from apscheduler.schedulers.blocking import BlockingScheduler
from sqlalchemy import create_engine, MetaData

sched = BlockingScheduler()

RSS_URLS = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    'https://www.uktech.news/feed',
]


@sched.scheduled_job('interval', minutes=1)
def process_feeds():
    time = strftime('%a %H:%M:%S')
    summary = 'text'
    engine = create_engine(
        'mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8'
    )
    metadata = MetaData(engine, reflect=True)
    rsstracker = metadata.tables['rsstracker']
    for url in RSS_URLS:
        feed = feedparser.parse(url)
        for post in feed.entries:
            (
                rsstracker.insert()
                .values(
                    time=time,
                    title=post.title,
                    url=post.link,
                    summary=summary,
                )
                .execute()
            )


def main():
    sched.configure()
    sched.start()


if __name__ == '__main__':
    main()
The time column seems a bit odd, I would have expected a TIMESTAMP or DATETIME here and not a string that throws away much of the information, just leaving the abbreviated week day and the time.
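For instance, a minimal sketch of what storing a real timestamp could look like, assuming a hypothetical DATETIME column named fetched_at (not part of the original table):

from datetime import datetime

from sqlalchemy import MetaData, create_engine

engine = create_engine(
    'mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8'
)
metadata = MetaData(engine, reflect=True)
rsstracker = metadata.tables['rsstracker']

# Assumes the table has a hypothetical `fetched_at` DATETIME column
# instead of the string-typed `time` column used above.
rsstracker.insert().values(
    fetched_at=datetime.utcnow(),
    title='Example title',
    url='http://example.com/article',
    summary='text',
).execute()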

Related

how to create a custom metric in data dog using aws lambda

I am using AWS Lambda to calculate the number of days until RDS maintenance. I get an integer and I would like to send this to Datadog so that it creates a metric. I am new to Datadog and not sure how to do it. I have already created a Lambda layer for Datadog. Here is my code; since my Lambda is doing a lot of other stuff, I will only include the problematic block.
import boto3
import json
import collections
import datetime
from dateutil import parser
import time
from datetime import timedelta
from datetime import timezone
from datadog_lambda.metric import lambda_metric
from datadog_lambda.wrapper import datadog_lambda_wrapper
import urllib.request, urllib.error, urllib.parse
import os
import sys
from botocore.exceptions import ClientError
from botocore.config import Config
from botocore.session import Session
import tracemalloc
<lambda logic, only including else block at which point i want to send the data>
print("WARNING! ForcedApplyDate is: ", d_fapply )
rdsForcedDate = parser.parse(d_fapply)
print (rdsForcedDate)
current_dateTime = datetime.datetime.now(timezone.utc)
difference = rdsForcedDate - current_dateTime
#print (difference)
if difference < timedelta(days=7):
rounded = difference.days
print (rounded)
lambda_metric(
"rds_maintenance.days", # Metric name
rounded, # Metric value
tags=['key:value', 'key:value'] # Associated tags
)
Here I would like to send the number of days, which could be 5, 10, 15, any number. I have also added the Lambda extension layer; the function runs perfectly, but I don't see the metric in Datadog.
I have also tried using this
from datadog import statsd

if difference < timedelta(days=7):
    try:
        rounded = difference.days
        statsd.gauge('rds_maintenance_alert', round(rounded, 2))
        print("data sent to datadog")
    except Exception as e:
        print(f"Error sending metric to Datadog: {e}")
Again, I don't get an error for this block either, but I can't see the metric. The Datadog API key and site are set in the Lambda environment variables.
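One thing worth checking (an assumption, not something confirmed in the question): the datadog_lambda library buffers metrics and flushes them when the handler is wrapped with datadog_lambda_wrapper. A minimal sketch of that pattern, with the handler name and tag values as placeholders:

from datadog_lambda.metric import lambda_metric
from datadog_lambda.wrapper import datadog_lambda_wrapper


@datadog_lambda_wrapper  # wraps the handler so buffered metrics are flushed at the end of the invocation
def lambda_handler(event, context):
    days_left = 5  # placeholder for the computed difference.days value
    lambda_metric(
        "rds_maintenance.days",            # metric name
        days_left,                         # metric value
        tags=["service:rds", "env:test"],  # placeholder tags
    )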

How to update RSS Feed every 5 seconds in Python using Flask

I did a lot of research and nothing relevant worked. Basically I am trying to scrape RSS Feed and populate the data in a table format on a webpage created using Python Flask. I have scraped the data in a dictionary form. But it does not fetch the data in real-time (or every 5 seconds) on the webpage.
Here is the code for scraping the RSS feed with feedparser, rss_feed.py:
import feedparser
import time

def feed_data():
    RSSFeed = feedparser.parse("https://www.upwork.com/ab/feed/jobs/rss?sort=recency&paging=0%3B10&api_params=1&q=&securityToken=2c2762298fe1b719a51741dbacb7d4f5c1e42965918fbea8d2bf1185644c8ab2907f418fe6b1763d5fca3a9f0e7b34d2047f95b56d12e525bc4ba998ae63f0ff&userUid=424312217100599296&orgUid=424312217104793601")
    feed_dict = {}
    for i in range(len(RSSFeed.entries)):
        feed_list = []
        feed_list.append(RSSFeed.entries[i].title)
        feed_list.append(RSSFeed.entries[i].link)
        feed_list.append(RSSFeed.entries[i].summary)
        published = RSSFeed.entries[i].published
        feed_list.append(published[:len(published)-6])
        feed_dict[i] = feed_list
    return feed_dict

if __name__ == '__main__':
    while True:
        feed_dict = feed_data()
        #print(feed_dict)
        #print("==============================")
        time.sleep(5)
Using the time.sleep() works on this script. But when I import it in the app.py, it fails to reload every 5 seconds. Here is the code to run the Flask app, app.py:
from flask import Flask, render_template
import rss_feed

feed_dict = rss_feed.feed_data()

app = Flask(__name__)

@app.route("/")
def hello():
    return render_template('home.html', feed_dict=feed_dict)
I tried using BackgroundScheduler from APScheduler as well. Nothing seems to be working. Feedparser's 'etag' and 'modified' are not being recognized for some reason (are they deprecated?). I even tried using the 'refresh' attribute in the meta tag, but that of course only reloads the Jinja2 template and does not re-run the code itself:
<meta http-equiv="refresh" content="5">
I am really stuck on this.
Here is a link to the (half complete) app: https://rss-feed-scraper.herokuapp.com/
Your
feed_dict = rss_feed.feed_data()
is at module level.
When Python starts, it executes this line once and won't run it again until you restart your app.
If you are interested in this topic, google 'runtime vs. compile time' in Python.
That said, I'd suggest that you do the polling with a JavaScript function which polls the remote RSS feed every 5 seconds.
This would look something like
setInterval(function(){
    // code goes here that will be run every 5 seconds
}, 5000);
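For the server side of that approach, one option (a sketch, assuming a hypothetical /feed route that is not in the original app) is to expose the feed as JSON and let the JavaScript above fetch it every 5 seconds:

from flask import Flask, jsonify, render_template

import rss_feed

app = Flask(__name__)

@app.route("/")
def hello():
    # Render the page once; the JavaScript polls /feed for fresh data.
    return render_template('home.html', feed_dict=rss_feed.feed_data())

@app.route("/feed")  # hypothetical endpoint, not part of the original app
def feed():
    # Re-parse the RSS feed on every request so each poll returns current data.
    return jsonify(rss_feed.feed_data())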
I tried a bunch of things but this is what I found was the easiest solution to this problem:
from threading import Timer

from flask import Flask, render_template
import rss_feed

app = Flask(__name__)

feed_dict = {}

def update_data(interval):
    # Schedule the next refresh, then update the module-level dictionary.
    Timer(interval, update_data, [interval]).start()
    global feed_dict
    feed_dict = rss_feed.feed_data()

update_data(5)

@app.route("/")
def hello():
    #feed_dict = rss_feed.feed_data()
    #feed_dict=feed_data()
    # time.sleep(5)
    return render_template('home.html', feed_dict=feed_dict)
A simple update_data() solved the whole problem; it did not need any additional module, JavaScript, AJAX, etc.

API Python script slows down after a while, how to make the code faster?

I am currently running a Python script to collect concert data via the Songkick API. Although the script works, after a while it slows down tremendously. I am looking for the most efficient way to solve this.
Below you can find my script:
import urllib2
import requests
import json
from tinydb import TinyDB, Query

db = TinyDB('concerts_songkick.json')

#retrieve concert data for every artist in artistid.txt
def load_events():
    MIN_DATE = "2015-05-26"
    MAX_DATE = "2017-04-25"
    API_KEY = "##############"
    with open('artistid.txt', 'r') as f:
        for a in f:
            artist = a.strip()
            print(artist)
            url_base = 'http://api.songkick.com/api/3.0/artists/{}/gigography.json?apikey={}&min_date={}&max_date={}'
            url = url_base.format(artist, API_KEY, MIN_DATE, MAX_DATE)
            try:
                r = requests.get(url)
                resp = r.json()
                if(resp['resultsPage']['totalEntries']):
                    results = resp['resultsPage']['results']['event']
                    for x in results:
                        print(x)
                        db.insert(x)
            except:
                print('cannot fetch url', url);

load_events()
db.close()
print("End of script")
Check the CPU utilisation and Memory Consumption of the script:
watch -n 0.5 ps -ur
Check exactly which part of the code is clogging the memory. Use Python Memory Profiler.
...
from memory_profiler import profile

@profile
def load_events():
    ...
And log the memory consumption over time with the command:
python your_script.py > memory_profile_logging.log
This should give you a pretty good idea about what you need to optimise.
Since you are performing I/O-bound operations, try using grequests to fetch data from the API asynchronously.
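A rough sketch of how that could look, assuming the same gigography URL pattern and TinyDB database as in the question (the batch size of 10 is an arbitrary choice):

import grequests  # gevent-based, asynchronous wrapper around requests
from tinydb import TinyDB

MIN_DATE = "2015-05-26"
MAX_DATE = "2017-04-25"
API_KEY = "##############"
URL_BASE = 'http://api.songkick.com/api/3.0/artists/{}/gigography.json?apikey={}&min_date={}&max_date={}'

db = TinyDB('concerts_songkick.json')

with open('artistid.txt') as f:
    artists = [line.strip() for line in f if line.strip()]

urls = [URL_BASE.format(artist, API_KEY, MIN_DATE, MAX_DATE) for artist in artists]

# Issue the requests concurrently, with at most 10 in flight at a time.
responses = grequests.map((grequests.get(u) for u in urls), size=10)

for url, r in zip(urls, responses):
    if r is None or r.status_code != 200:
        print('cannot fetch url', url)
        continue
    resp = r.json()
    if resp['resultsPage']['totalEntries']:
        for event in resp['resultsPage']['results']['event']:
            db.insert(event)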

Retrieving Twitter data on the fly

Our company is trying to read in all live streams of data entered by random users, i.e., a random user sends off a tweet saying "ABC company".
Seeing as how you could use a Twitter client to search for said text, I labour under the assumption that it's possible to aggregate all such tweets without using a client, i.e., stream them live to a file without relying on hashtags.
What's the best way to do this? And if you've done this before, could you share your script? I reckon the simplest way would be a Ruby/Python script left running, but my understanding of Ruby/Python is limited at best.
Kindly help?
Here's a bare minimum:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import twitter
from threading import *
from os import _exit, urandom
from time import sleep
from logger import *
import unicodedata

## Based on: https://github.com/sixohsix/twitter

class twitt(Thread):
    def __init__(self, tags = None, *args, **kwargs):
        self.consumer_key = '...'
        self.consumer_secret = '...'
        self.access_key = '...'
        self.access_secret = '...'
        self.encoding = 'iso-8859-15'
        self.args = args
        self.kwargs = kwargs
        self.searchapi = twitter.Twitter(domain="search.twitter.com").search
        Thread.__init__(self)
        self.start()

    def search(self, tag):
        try:
            return self.searchapi(q=tag)['results']
        except:
            return {}

    def run(self):
        while 1:
            sleep(3)
To use it, do something like:
if __name__ == "__main__":
    t = twitt()
    print t.search('#DHSupport')
    t.alive = False
Note: The only reason this is threaded is because it's just a piece of code I had lying around from other projects; it gives you an idea of how to work with the API and perhaps build a background service to fetch search results on Twitter.
There's a lot of crap in my original code, so the structure might look a bit odd.
Note that you don't really need the consumer keys etc. for just a search, but you will need an OAuth login for more features such as posting or checking messages.
The only two things you really need are:
import twitter
print twitter.Twitter(domain="search.twitter.com").search(q='#hashtag')['results']

What is the easiest way of deleting all my blobstore data?

What is your best way to remove all of the blobs from the blobstore? I'm using Python.
I have quite a lot of blobs and I'd like to delete them all. I'm currently doing the following:
class deleteBlobs(webapp.RequestHandler):
    def get(self):
        all = blobstore.BlobInfo.all();
        more = (all.count() > 0)
        blobstore.delete(all);
        if more:
            taskqueue.add(url='/deleteBlobs', method='GET');
Which seems to be using tons of CPU and (as far as I can tell) doing
nothing useful.
I use this approach:
import datetime
import logging
import re
import urllib

from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template
from google.appengine.api import taskqueue
from google.appengine.api import users


class IndexHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello. Blobstore is being purged.\n\n')
        try:
            query = blobstore.BlobInfo.all()
            index = 0
            to_delete = []
            blobs = query.fetch(400)
            if len(blobs) > 0:
                for blob in blobs:
                    blob.delete()
                    index += 1
                hour = datetime.datetime.now().time().hour
                minute = datetime.datetime.now().time().minute
                second = datetime.datetime.now().time().second
                self.response.out.write(str(index) + ' items deleted at ' + str(hour) + ':' + str(minute) + ':' + str(second))
                if index == 400:
                    self.redirect("/purge")
        except Exception, e:
            self.response.out.write('Error is: ' + repr(e) + '\n')
            pass


APP = webapp.WSGIApplication(
    [
        ('/purge', IndexHandler),
    ],
    debug=True)


def main():
    util.run_wsgi_app(APP)


if __name__ == '__main__':
    main()
My experience is that more than 400 blobs at once will fail, so I let it reload for every 400. I tried blobstore.delete(query.fetch(400)), but I think there's a bug right now. Nothing happened at all, and nothing was deleted.
You're passing the query object to the delete method, which will iterate over it fetching it in batches, then submit a single enormous delete. This is inefficient because it requires multiple fetches, and won't work if you have more results than you can fetch in the available time or with the available memory. The task will either complete once and not require chaining at all, or more likely, fail repeatedly, since it can't fetch every blob at once.
Also, calling count executes the query just to determine the count, which is a waste of time since you're going to try fetching the results anyway.
Instead, you should fetch results in batches using fetch, and delete each batch. Use cursors to set the next batch and avoid the need for the query to iterate over all the 'tombstoned' records before finding the first live one, and ideally, delete multiple batches per task, using a timer to determine when you should stop and chain the next task.
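A rough sketch of that pattern, kept to one batch per task for brevity, and assuming BlobInfo.all() supports the db.Query-style with_cursor()/cursor() methods and a hypothetical /deleteblobs task URL:

from google.appengine.api import taskqueue
from google.appengine.ext import blobstore
from google.appengine.ext import webapp

BATCH_SIZE = 400  # the batch size reported to work in the answer above


class DeleteBlobsHandler(webapp.RequestHandler):
    def post(self):
        query = blobstore.BlobInfo.all()
        cursor = self.request.get('cursor')
        if cursor:
            query.with_cursor(cursor)  # resume where the previous batch stopped
        blobs = query.fetch(BATCH_SIZE)
        if not blobs:
            return  # nothing left, stop chaining tasks
        # Delete the whole batch in one call instead of one RPC per blob.
        blobstore.delete([blob.key() for blob in blobs])
        # Chain the next task, passing the cursor so the query skips tombstoned rows.
        taskqueue.add(url='/deleteblobs', params={'cursor': query.cursor()})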
