Optimise Python function fetching multi-level JSON attributes

I have a three-level JSON structure and I am fetching the values of some attributes from each of the three levels. At the moment the execution time of my code is poor: it takes about 2-3 minutes to get the results onto my web page, and I will have a much larger JSON payload to deal with in production.
I am new to Python and Flask and haven't done much web programming. Please suggest ways I could optimise the code below. Thanks, any help is much appreciated.
import json
import urllib2
import flask
from flask import request

def Backend():
    url = 'http://localhost:8080/surveillance/api/v1/cameras/'
    response = urllib2.urlopen(url).read()
    response = json.loads(response)
    components = list(response['children'])
    urlComponentChild = []
    for component in components:
        urlComponent = str(url + component + '/')
        responseChild = urllib2.urlopen(urlComponent).read()
        responseChild = json.loads(responseChild)
        camID = str(responseChild['id'])
        camName = str(responseChild['name'])
        compChildren = responseChild['children']
        compChildrenName = list(compChildren)
        for compChild in compChildrenName:
            href = str(compChildren[compChild]['href'])
            ID = str(compChildren[compChild]['id'])
            urlComponentChild.append([href, ID])
    myList = []
    for each in urlComponentChild:
        response = urllib2.urlopen(each[0]).read()
        response = json.loads(response)
        url = each[0] + '/recorder'
        responseRecorder = urllib2.urlopen(url).read()
        responseRecorder = json.loads(responseRecorder)
        username = str(response['subItems']['surveillance:config']['properties']['username'])
        password = str(response['subItems']['surveillance:config']['properties']['password'])
        manufacturer = str(response['properties']['Manufacturer'])
        model = str(response['properties']['Model'])
        status = responseRecorder['recording']
        myList.append([each[1], username, password, manufacturer, model, status])
    return myList

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """Displays the index page accessible at '/'."""
    if request.method == 'GET':
        return flask.render_template('index.html', response=Backend())

if __name__ == '__main__':
    APP.debug = True
    APP.run(port=62000)

OK, caching. What we're going to do is start returning values to the user instantly based on data we already have, rather than generating new data on every request. This means the user might get slightly less up-to-date data than is theoretically possible, but it means the data they do receive arrives as quickly as your system allows.
So we'll keep your Backend function as it is. Like I said, you could certainly speed it up with multithreading (if you're still interested in that, the 10-second version is that I would use grequests to asynchronously fetch data from a list of URLs; a rough sketch follows).
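A minimal grequests sketch, purely illustrative and untested against your API (fetch_all is my own name for it):

import grequests

def fetch_all(urls):
    # Build the requests without sending them, then fire them off concurrently
    pending = (grequests.get(u) for u in urls)
    responses = grequests.map(pending)
    # Responses come back in the same order as urls; failed requests are None
    return [r.json() if r is not None else None for r in responses]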
But, rather than call it in response to the user every time a user requests data, we'll just call it routinely every once in a while. This is almost certainly something you'd want to do eventually anyway, because it means you don't have to generate brand new data for each user, which is extremely wasteful. We'll just keep some data on hand in a variable, update that variable as often as we can, and return whatever's in that variable every time we get a new request.
from threading import Thread
from time import sleep

data = None

def Backend():
    .....

def main_loop():
    global data
    while True:
        sleep(LOOP_DELAY_TIME_SECONDS)  # set this to however often you want to refresh
        data = Backend()

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """Displays the index page accessible at '/'."""
    if request.method == 'GET':
        # Return whatever data we currently have cached
        return flask.render_template('index.html', response=data)

if __name__ == '__main__':
    data = Backend()  # Grab data before starting the server so we never return None to the user
    Thread(target=main_loop).start()  # Loop and grab new data in the background
    APP.debug = True
    APP.run(port=62000)
DISCLAIMER: I've used Flask and threading before for a few projects, but I am by no means an expert on them or on web development at all. Test this code before using it for anything important (or better yet, find someone who knows what they're doing before using it for anything important).
Edit: data will have to be a global, sorry about that - hence the disclaimer.

Related

I am unable to load full pickle list on azure webapp

I have an Azure Web App stacked on Python, running a Flask app that calls a function which returns a list of country names saved in a pickle file. Let's say I have a total of 100 countries, so whenever I run the app it reads 100 countries from that pickle file, but sometimes it gets stuck at 98 or 99 countries, so I'm not sure where I am losing 1 or 2 countries from that list. This issue only happens on the Azure Web App; otherwise it retrieves the full 100 countries. Below is the code I'm using to load the pickle file holding the country list of 100:
import os
import pickle

path = os.getcwd() + '\\'

def example():
    country_list = pickle.load(open(path + "support_file/country_list.p", "rb"))
    print(len(country_list))
    return country_list
Here is my flask app.py to call the function:
from other_file import example
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST", "GET"])
def query():
    if request.method == "POST":
        return example()
    else:
        return "Hello!"

if __name__ == "__main__":
    app.run()
The list is then used in a function, and my output depends on all the elements of this list, so if an element or two goes missing while loading the pickle my output changes. I'm not missing these elements consistently; it happens maybe 1 in every 20 runs. Is this a problem with the Azure Web App, or is something wrong with my pickle? I tried to recreate the pickle but the same problem keeps coming up once in a while.
It seems pickle.load reads until its buffer is full, so you have to iterate as below until it raises an EOFError. Unfortunately, I could not find a more graceful way to run the loop than catching the exception. You might also want to cache the list instead of unpickling it on every request, to improve performance.
with open(os.getcwd() + '/support_file/country_list.p', 'rb') as f:
    country_list = []
    while True:
        try:
            country_list.append(pickle.load(f))
        except EOFError:
            break
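For the caching part, one simple approach (purely illustrative, not Azure-specific, and the names are mine) is to load the list once at module level and reuse it on every request:

import os
import pickle

_country_cache = None  # loaded lazily, then reused for every request

def get_country_list():
    global _country_cache
    if _country_cache is None:
        _country_cache = []
        with open(os.path.join(os.getcwd(), 'support_file', 'country_list.p'), 'rb') as f:
            while True:
                try:
                    _country_cache.append(pickle.load(f))
                except EOFError:
                    break
    return _country_cache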

How can I get my site to rerun some python flask code every time someone visits or every few minutes?

This is my first time developing a website and I am running into some issues. I have some Python code that scrapes some data with beautifulsoup4 and displays the number on my site with Flask. However, I found that my site does not update the values automatically at all; it only updates when I manually make my host reload.
How can I make my Python script re-scrape every time a visitor visits my site, or just every 5 minutes or so? Any help would be greatly appreciated!
Host: PythonAnywhere
Here is my current backend python code:
import bs4 as bs
import urllib.request
from flask import Flask, render_template

app = Flask(__name__)

link = urllib.request.urlopen('https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx')
soup = bs.BeautifulSoup(link, 'lxml')
body = soup.find('body')  # get the body so you can do soup.find_all() inside it
tables = soup.find_all('table')
for table in tables:
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        if row.count('Bucks') > 0:
            print(row[1])
            # bucksnum holds the number of cases in Bucks county
            bucksnum = str(row[1])

data = bucksnum

# this is the part that connects the flask file to the html file
@app.route("/")
def home():
    return render_template("template.html", data=data)

@app.route("/")
def index():
    return bucksnum

if __name__ == '__main__':
    app.run(host='0.0.0.0')
    index()
Your application only gathers the data once, when it starts up. If you want it to grab the data every single time someone visits the page, you could move the code that grabs and processes the table data into the relevant view function, i.e. the one under the @app.route('/route') decorator, and it will run every time the page is visited.
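As a rough sketch of that idea (reusing the question's URL and parsing logic, so treat it as untested):

import bs4 as bs
import urllib.request
from flask import Flask, render_template

app = Flask(__name__)

def scrape_bucks_cases():
    # Re-fetch and re-parse the page; returns None if the row isn't found
    link = urllib.request.urlopen(
        'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx')
    soup = bs.BeautifulSoup(link, 'lxml')
    for table in soup.find_all('table'):
        for tr in table.find_all('tr'):
            row = [td.text for td in tr.find_all('td')]
            if 'Bucks' in row:
                return str(row[1])
    return None

@app.route("/")
def home():
    # The scrape runs on every visit, so the value is always fresh
    return render_template("template.html", data=scrape_bucks_cases())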
You could also use a scheduler; check this thread that discusses a similar issue. You can use it to call a function that updates your data at whatever interval you like.
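A sketch of the scheduler route, using APScheduler as one possible choice (scrape_bucks_cases is a hypothetical helper standing in for the scraping code above, and whether a background scheduler runs at all depends on your host):

from apscheduler.schedulers.background import BackgroundScheduler

cached_value = None  # whatever the last scrape produced

def refresh():
    global cached_value
    cached_value = scrape_bucks_cases()  # hypothetical helper wrapping the scraping code

scheduler = BackgroundScheduler()
scheduler.add_job(refresh, 'interval', minutes=5)  # re-scrape every 5 minutes
scheduler.start()
refresh()  # prime the cache once at startup

The view function would then just return cached_value instead of scraping inline.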

send data to html dashboard in python flask while having dictionary in a function parameter

I want to send data to HTML in the Flask framework. I have a function which receives a dictionary as a parameter, then several functions are applied to that dictionary, and I want to render the final result to an HTML page. I have been at it for 48 hours, trying different blogs, but haven't found a precise solution.
imports ...
from other_file import other_functions
from other_file import other_functions_2
from other_file import other_functions_3

app = Flask(__name__, template_folder='templates/')

@app.route("/dashboard")
def calculate_full_eva_web(input:dict):
    calculate_gap = other_functions(input)
    calculate_matrix = other_functions_2(input)
    average = other_functions_3(input)
    data = {'calculate_gap': calculate_gap, 'calculate_matrix': calculate_matrix, 'average': average}
    return render_template('pages/dashboard.html', data=data)

if __name__ == "__main__":
    app.run(debug=True)
TRACEBACK
Methods that Flask can route to don't take dictionaries as inputs, and the arguments they do take need to be matched by a pattern in the route. (See https://flask.palletsprojects.com/en/1.1.x/api/#url-route-registrations)
You'd get the same error if you changed

@app.route("/dashboard")
def calculate_full_eva_web(input:dict):

to

@app.route("/dashboard")
def calculate_full_eva_web(input):

Your path forward depends on how you want to pass data when you make the request. You can pass key/value pairs via URL parameters and retrieve them via the request.args object. That might be close enough to what you want. (You'll need to remove the argument declaration from calculate_full_eva_web().)
Something like:

from flask import request

@app.route('/dashboard')
def calculate_full_eva_web():
    input = request.args
    ...
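A slightly fuller version of that sketch (assuming the inputs can be expressed as query-string parameters, and reusing the question's helper names, so treat it as illustrative rather than definitive):

from flask import Flask, request, render_template
from other_file import other_functions, other_functions_2, other_functions_3

app = Flask(__name__, template_folder='templates/')

@app.route('/dashboard')
def calculate_full_eva_web():
    # e.g. /dashboard?score=5&name=foo -> {'score': '5', 'name': 'foo'}
    input_data = request.args.to_dict()
    data = {
        'calculate_gap': other_functions(input_data),
        'calculate_matrix': other_functions_2(input_data),
        'average': other_functions_3(input_data),
    }
    return render_template('pages/dashboard.html', data=data)

Note that query-string values always arrive as strings, so the helper functions may need to convert them.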

unable to save data in datastore but no errors

I'm building a web crawler. Some of the data I put into the Datastore gets saved, other data does not get saved, and I have no idea what the problem is.
Here is my crawler class:
class Crawler(object):
    def get_page(self, url):
        try:
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})  # yessss!!! with the header, I am able to download pages
            #response = urlfetch.fetch(url, method='GET')
            #return response.content
            #except urlfetch.InvalidURLError as iu:
            #    return iu.message
            response = urllib2.urlopen(req)
            return response.read()
        except urllib2.HTTPError as e:
            return e.reason

    def get_all_links(self, page):
        return re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)

    def union(self, lyst1, lyst2):
        try:
            for elmt in lyst2:
                if elmt not in lyst1:
                    lyst1.append(elmt)
            return lyst1
        except e:
            return e.reason

    # function that crawls the web for links starting from the seed
    # returns a dictionary of index and graph
    def crawl_web(self, seed="http://tonaton.com/"):
        query = Listings.query()  # create a listings object from storage
        if query.get():
            objListing = query.get()
        else:
            objListing = Listings()
            objListing.toCrawl = [seed]
            objListing.Crawled = []
        start_time = datetime.datetime.now()
        while datetime.datetime.now() - start_time < datetime.timedelta(0, 5):  # tocrawl (to crawl can take forever)
            try:
                #while True:
                page = objListing.toCrawl.pop()
                if page not in objListing.Crawled:
                    content = self.get_page(page)
                    add_page_to_index(page, content)
                    outlinks = self.get_all_links(content)
                    graph = Graph()  # create a graph object with the url
                    graph.url = page
                    graph.links = outlinks  # save all outlinks as the value part of the graph url
                    graph.put()
                    self.union(objListing.toCrawl, outlinks)
                    objListing.Crawled.append(page)
            except:
                return False
        objListing.put()  # save to database
        return True  # return true if it works
The classes that define the various ndb models are in this Python module:
import os
import urllib
from google.appengine.ext import ndb
import webapp2

class Listings(ndb.Model):
    toCrawl = ndb.StringProperty(repeated=True)
    Crawled = ndb.StringProperty(repeated=True)

# let's see how this works
class Index(ndb.Model):
    keyword = ndb.StringProperty()  # keyword part of the index
    url = ndb.StringProperty(repeated=True)  # value part of the index

#class Links(ndb.Model):
#    links = ndb.JsonProperty(indexed=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()
    links = ndb.StringProperty(repeated=True)
It used to work fine when I had JsonProperty in place of StringProperty(repeated=True), but JsonProperty is limited to 1500 bytes so I hit an error at some point.
Now, when I run the crawl_web member function, it actually crawls, but when I check the Datastore only the Index entity is created. No Graph, no Listings. Please help, thanks.
Putting your code together, adding the missing imports, and logging the exception, eventually shows the first killer problem:
Exception Indexed value links must be at most 500 characters
and indeed, by logging outlinks one can easily see that several of them are far longer than 500 characters -- therefore they can't be items in an indexed property, such as a StringProperty. After changing each repeated StringProperty to a repeated TextProperty (so it does not get indexed and thus has no 500-characters-per-item limitation), the code runs for a while (making a few instances of Graph) but eventually dies with:
An error occured while connecting to the server: Unable to fetch URL: https://sb':'http://b')+'.scorecardresearch.com/beacon.js';document.getElementsByTagName('head')[0].appendChild(s); Error: [Errno 8] nodename nor servname provided, or not known
and indeed, it's pretty obvious that the alleged "link" is actually a bunch of Javascript and as such cannot be fetched.
So, essentially, the core bug in your code is not at all related to app engine, but rather, the issue is that your regular expression:
'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
does not properly extract outgoing links given a web page containing Javascript as well as HTML.
There are many issues with your code, but to this point they're just slowing it down or making it harder to understand, not killing it -- what's killing it is using that regular expression pattern to try and extract links from the page.
Check out retrieve links from web page using python and BeautifulSoup -- most answers suggest, for the purpose of extracting links from a page, using BeautifulSoup, which may perhaps be a problem in app engine, but one shows how to do it with just Python and REs.
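For illustration, a BeautifulSoup-based replacement for get_all_links (a sketch, assuming the library is available in your App Engine environment) could look like:

from bs4 import BeautifulSoup

def get_all_links(self, page):
    # Only follow real anchor hrefs instead of anything that merely looks like a URL
    soup = BeautifulSoup(page, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith(('http://', 'https://'))]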

Retrieving Twitter data on the fly

Our company is trying to read in all live streams of data entered by random users, e.g. a random user sends off a tweet saying "ABC company".
Since you can use a Twitter client to search for said text, I am working on the assumption that it is possible to aggregate all such tweets without using a client, i.e. stream them live to a file, without relying on hashtags.
What's the best way to do this? And if you've done this before, could you share your script? I reckon the simplest way would be a Ruby/Python script left running, but my understanding of Ruby/Python is limited at best.
Kindly help?
Here's a bare minimum:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import twitter
from threading import *
from os import _exit, urandom
from time import sleep
from logger import *
import unicodedata

## Based on: https://github.com/sixohsix/twitter

class twitt(Thread):
    def __init__(self, tags=None, *args, **kwargs):
        self.consumer_key = '...'
        self.consumer_secret = '...'
        self.access_key = '...'
        self.access_secret = '...'
        self.encoding = 'iso-8859-15'
        self.args = args
        self.kwargs = kwargs
        self.searchapi = twitter.Twitter(domain="search.twitter.com").search
        Thread.__init__(self)
        self.start()

    def search(self, tag):
        try:
            return self.searchapi(q=tag)['results']
        except:
            return {}

    def run(self):
        while 1:
            sleep(3)
To use it, do something like:
if __name__ == "__main__":
    t = twitt()
    print t.search('#DHSupport')
    t.alive = False
Note: the only reason this is threaded is that it's a piece of code I had lying around from other projects; it gives you an idea of how to work with the API and perhaps build a background service to fetch search results on Twitter.
There's a lot of cruft in my original code, so the structure might look a bit odd.
Note that you don't really need the consumer keys etc. for just a search, but you will need an OAuth login for more features such as posting or checking messages.
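For completeness, the OAuth variant with the same sixohsix library looks roughly like this (keys are placeholders, and posting is just one example of an authenticated call):

from twitter import Twitter, OAuth

t = Twitter(auth=OAuth('access_key', 'access_secret',
                       'consumer_key', 'consumer_secret'))
t.statuses.update(status='Hello from the API')  # requires an authorised app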
The only two things you really need for a plain search are:
import twitter
print twitter.Twitter(domain="search.twitter.com").search(q='#hashtag')['results']
