What is the easiest way of deleting all my blobstore data? - python

What is the best way to remove all of the blobs from the blobstore? I'm using Python.
I have quite a lot of blobs and I'd like to delete them all. I'm
currently doing the following:
class deleteBlobs(webapp.RequestHandler):
    def get(self):
        all = blobstore.BlobInfo.all()
        more = (all.count() > 0)
        blobstore.delete(all)
        if more:
            taskqueue.add(url='/deleteBlobs', method='GET')
Which seems to be using tons of CPU and (as far as I can tell) doing
nothing useful.

I use this approach:
import datetime
import logging
import re
import urllib
from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template
from google.appengine.api import taskqueue
from google.appengine.api import users

class IndexHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello. Blobstore is being purged.\n\n')
        try:
            query = blobstore.BlobInfo.all()
            index = 0
            to_delete = []
            # Fetch at most 400 blobs per request and delete them one by one
            blobs = query.fetch(400)
            if len(blobs) > 0:
                for blob in blobs:
                    blob.delete()
                    index += 1
            hour = datetime.datetime.now().time().hour
            minute = datetime.datetime.now().time().minute
            second = datetime.datetime.now().time().second
            self.response.out.write(str(index) + ' items deleted at ' + str(hour) + ':' + str(minute) + ':' + str(second))
            # A full batch means there may be more blobs left, so reload
            if index == 400:
                self.redirect("/purge")
        except Exception, e:
            self.response.out.write('Error is: ' + repr(e) + '\n')
            pass

APP = webapp.WSGIApplication(
    [
        ('/purge', IndexHandler),
    ],
    debug=True)

def main():
    util.run_wsgi_app(APP)

if __name__ == '__main__':
    main()
My experience is that more than 400 blobs at once will fail, so I let it reload for every 400. I tried blobstore.delete(query.fetch(400)), but I think there's a bug right now. Nothing happened at all, and nothing was deleted.

You're passing the query object to the delete method, which will iterate over it fetching it in batches, then submit a single enormous delete. This is inefficient because it requires multiple fetches, and won't work if you have more results than you can fetch in the available time or with the available memory. The task will either complete once and not require chaining at all, or more likely, fail repeatedly, since it can't fetch every blob at once.
Also, calling count executes the query just to determine the count, which is a waste of time since you're going to try fetching the results anyway.
Instead, you should fetch results in batches using fetch, and delete each batch. Use cursors to set the next batch and avoid the need for the query to iterate over all the 'tombstoned' records before finding the first live one, and ideally, delete multiple batches per task, using a timer to determine when you should stop and chain the next task.
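As a rough illustration of that batch, cursor and timer loop, here is a minimal sketch in plain Python. It deliberately does not call the App Engine API: fetch_batch and delete_batch are hypothetical callables you would supply, and the batch size and time budget are assumed values, not recommendations from the answer.

```python
import time


def purge_in_batches(fetch_batch, delete_batch, batch_size=100,
                     time_budget=20.0, clock=time.monotonic):
    """Delete results batch by batch until none remain or time runs out.

    fetch_batch(cursor, limit) -> (keys, next_cursor)   # hypothetical callable
    delete_batch(keys)                                  # hypothetical callable
    Returns (deleted_count, cursor, finished).
    """
    deadline = clock() + time_budget
    cursor = None
    deleted = 0
    while clock() < deadline:
        # Resume from the cursor so already-deleted records are not rescanned
        keys, cursor = fetch_batch(cursor, batch_size)
        if not keys:
            return deleted, cursor, True   # nothing left, no need to chain
        delete_batch(keys)
        deleted += len(keys)
    return deleted, cursor, False          # out of time: chain another task
```

In a real handler, fetch_batch would presumably wrap the query's cursor and fetch calls and delete_batch would wrap blobstore.delete. When finished comes back False, the handler would enqueue the next task and pass the cursor along.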

How does Django handle multiple requests?

This is not a duplicate of this question
I am trying to understand how Django handles multiple requests. According to this answer, Django is supposed to block parallel requests. But I have found this is not exactly true, at least for Django 3.1. I am using Django's built-in server.
In my code (view.py) I have a blocking code block that is only triggered in a particular situation. In that case the request takes a very long time to complete. This is the code for view.py:
from django.shortcuts import render
import numpy as np

def insertionSort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i-1
        while j >=0 and key < arr[j] :
            arr[j+1] = arr[j]
            j -= 1
        arr[j+1] = key

def home(request):
    a = request.user.username
    print(a)
    id = int(request.GET.get('id',''))
    if id ==1:
        arr = np.arange(100000)
        arr = arr[::-1]
        insertionSort(arr)
        # print ("Sorted array is:")
        # for i in range(len(arr)):
        #     print ("%d" %arr[i])
    return render(request,'home/home.html')
So the blocking code block only executes for id=1; all other requests are supposed to work normally.
Now, what I found is that if I make two requests in parallel, one with id=1 and another with id=2, the second request does not really get blocked but takes longer to get data from Django. It takes ~2.5s to complete if there is another blocking request running in parallel. Otherwise, it takes ~0.02s.
These are my python codes to make the request:
malicious request:
import time
from concurrent.futures import as_completed
from pprint import pprint
from requests_futures.sessions import FuturesSession

session = FuturesSession()
futures=[session.get(f'http://127.0.0.1:8000/?id=1') for i in range(3)]
start = time.time()
for future in as_completed(futures):
    resp = future.result()
    # pprint({
    #     'url': resp.request.url,
    #     'content': resp.json(),
    # })
    roundtrip = time.time() - start
    print (roundtrip)
Normal request:
import logging
import threading
import time
import requests
if __name__ == "__main__":
    # start = time.time()
    while(True):
        print(requests.get("http://127.0.0.1:8000/?id=2").elapsed.total_seconds())
        time.sleep(2)
I will be grateful if anyone can explain how Django is serving the parallel requests in this case.
Django's development server (runserver) is multithreaded by default; there is a --nothreading option to turn that off when you start the server. From what you described, it's possible the blocking task finished in about 2 seconds. An easier way to test is to use time.sleep(10) as the blocking code.
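To see the difference threading makes without involving Django at all, here is a small standard-library sketch. It is not Django's runserver; it only mimics the thread-per-request idea with http.server. The handler, the SLOW_SECONDS value and the fast_request_latency helper are all made up for this illustration.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer

SLOW_SECONDS = 1.0  # stand-in for the blocking branch (id=1)

# Opener that ignores any proxy settings, so requests go straight to localhost
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("id=1"):
            time.sleep(SLOW_SECONDS)  # simulate the slow request
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the output quiet


def fast_request_latency(server_cls):
    """Time an id=2 request issued while an id=1 request is in flight."""
    server = server_cls(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    base = "http://127.0.0.1:%d/?id=" % server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    slow = threading.Thread(target=lambda: OPENER.open(base + "1").read())
    slow.start()
    time.sleep(0.3)  # give the slow request time to reach the server
    start = time.monotonic()
    OPENER.open(base + "2").read()
    elapsed = time.monotonic() - start
    slow.join()
    server.shutdown()
    server.server_close()
    return elapsed


if __name__ == "__main__":
    print("threaded:     %.2fs" % fast_request_latency(ThreadingHTTPServer))
    print("non-threaded: %.2fs" % fast_request_latency(HTTPServer))
```

With the threaded server the id=2 request should return almost immediately, while the non-threaded one has to wait for the sleeping request to finish. Note that this sketch sleeps rather than burning CPU. In the question the slow branch is CPU-bound Python code, so the two threads would also compete for the interpreter lock, which may be why the second request was slowed down rather than blocked.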

I am unable to load full pickle list on azure webapp

I have an Azure Web App on the Python stack running a Flask app. The app calls a function that returns a list of country names, which I have saved in a pickle file. Let's say I have a total of 100 countries. Whenever I run the app it should read 100 countries from that pickle file, but sometimes it stops at 98 or 99 countries, and I'm not sure where I am losing 1 or 2 countries from the list. This issue only happens on Azure Web App; everywhere else it retrieves the full 100 countries. Below is the code I'm using to load the pickle file containing the list of 100 countries:
import os
import pickle

path=os.getcwd()+'\\'

def example():
    country_list=pickle.load(open(path+"support_file/country_list.p","rb"))
    print(len(country_list))
    return country_list
Here is my flask app.py to call the function:
from other_file import example
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST", "GET"])
def query():
    if request.method == "POST":
        return example()
    else:
        return "Hello!"

if __name__ == "__main__":
    app.run()
The above list is then used in a function, and my output depends on all the elements of this list. If an element or two goes missing while loading the pickle, my output changes. The elements are not missing consistently; it happens roughly 1 in every 20 times. Is this a problem with Azure Web App, or is something wrong with my pickle? I tried to recreate the pickle, but the same problem keeps coming up once in a while.
It seems pickle.load reads until its buffer is full. So you would have to iterate as below until it raises an EOFError. Unfortunately, I could not find a graceful way to run the loop without catching the exception. You might also want to cache the list instead of unpickling it on every request, to improve performance.
with open(os.getcwd()+'/support_file/country_list.p','rb') as f:
    country_list = []
    while True:
        try:
            country_list.append(pickle.load(f))
        except EOFError:
            break
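The caching suggestion could look something like the sketch below. It wraps the same read-until-EOFError loop in a function and memoizes the result with functools.lru_cache. The function names are invented for this example, and it assumes the file does not change while the app is running.

```python
import pickle
from functools import lru_cache


def load_all_pickled(path):
    """Read every pickled object in the file, one pickle.load call each."""
    items = []
    with open(path, "rb") as f:
        while True:
            try:
                items.append(pickle.load(f))
            except EOFError:
                break  # reached the end of the file
    return items


@lru_cache(maxsize=None)
def cached_country_list(path):
    """Unpickle once per process; later calls reuse the stored tuple."""
    # A tuple keeps callers from accidentally mutating the shared cached value
    return tuple(load_all_pickled(path))
```

A Flask view would then call cached_country_list(path) instead of unpickling on every request. Each worker process keeps its own copy, so the file is read once per worker rather than once per request.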

Optimise python function fetching multi-level json attributes

I have a 3 level json file. I am fetching the values of some of the attributes from each of the 3 levels of json. At the moment, the execution time of my code is pathetic as it is taking about 2-3 minutes to get the results on my web page. I will be having a much larger json file to deal with in production.
I am new to Python and Flask and haven't done much web programming. Please suggest ways I could optimise the code below. Thanks for the help, much appreciated.
import json
import urllib2
import flask
from flask import request

def Backend():
    url = 'http://localhost:8080/surveillance/api/v1/cameras/'
    response = urllib2.urlopen(url).read()
    response = json.loads(response)
    components = list(response['children'])
    urlComponentChild = []
    # Level 2: one request per camera component
    for component in components:
        urlComponent = str(url + component + '/')
        responseChild = urllib2.urlopen(urlComponent).read()
        responseChild = json.loads(responseChild)
        camID = str(responseChild['id'])
        camName = str(responseChild['name'])
        compChildren = responseChild['children']
        compChildrenName = list(compChildren)
        for compChild in compChildrenName:
            href = str(compChildren[compChild]['href'])
            ID = str(compChildren[compChild]['id'])
            urlComponentChild.append([href,ID])
    myList = []
    # Level 3: two more requests per child (details and recorder status)
    for each in urlComponentChild:
        response = urllib2.urlopen(each[0]).read()
        response = json.loads(response)
        url = each[0] + '/recorder'
        responseRecorder = urllib2.urlopen(url).read()
        responseRecorder = json.loads(responseRecorder)
        username = str(response['subItems']['surveillance:config']['properties']['username'])
        password = str(response['subItems']['surveillance:config']['properties']['password'])
        manufacturer = str(response['properties']['Manufacturer'])
        model = str(response['properties']['Model'])
        status = responseRecorder['recording']
        myList.append([each[1],username,password,manufacturer,model,status])
    return myList

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """ Displays the index page accessible at '/'
    """
    if request.method == 'GET':
        return flask.render_template('index.html', response = Backend())

if __name__ == '__main__':
    APP.debug=True
    APP.run(port=62000)
Ok, caching. So what we're going to do is start returning values to the user instantly based on data we already have, rather than generating new data every time. This means that the user might get slightly less up to date data than is theoretically possible to get, but it means that the data they do receive they receive as quickly as is possible given the system you're using.
So we'll keep your backend function as it is. Like I said, you could certainly speed it up with multithreading (If you're still interested in that, the 10 second version is that I would use grequests to asynchronously get data from a list of urls).
But, rather than call it in response to the user every time a user requests data, we'll just call it routinely every once in a while. This is almost certainly something you'd want to do eventually anyway, because it means you don't have to generate brand new data for each user, which is extremely wasteful. We'll just keep some data on hand in a variable, update that variable as often as we can, and return whatever's in that variable every time we get a new request.
from threading import Thread
from time import sleep

# Assumed refresh interval; the original snippet did not define this value
LOOP_DELAY_TIME_SECONDS = 60

data = None

def Backend():
    .....

def main_loop():
    global data
    while True:
        sleep(LOOP_DELAY_TIME_SECONDS)
        data = Backend()

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """ Displays the index page accessible at '/'
    """
    if request.method == 'GET':
        # Return whatever data we currently have cached
        return flask.render_template('index.html', response = data)

if __name__ == '__main__':
    data = Backend() # Need to make sure we grab data before we start the server so we never return None to the user
    Thread(target=main_loop).start() #Loop and grab new data at every loop
    APP.debug=True
    APP.run(port=62000)
DISCLAIMER: I've used Flask and threading before for a few projects, but I am by no means an expert on them or on web development. Test this code before using it for anything important (or better yet, find someone who knows what they're doing before using it for anything important).
Edit: data will have to be a global, sorry about that - hence the disclaimer
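For the multithreading side mentioned above, here is a minimal sketch using the standard library's thread pool rather than grequests (a swapped-in technique, not what the answer proposed). It is written for Python 3, whereas the question's code is Python 2. fetch_all and its fetch argument are hypothetical names: fetch is any function you supply that downloads and parses one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Inside Backend, the per-camera and per-child requests could each be collected into a list of URLs and passed through a helper like this, so the waiting time for the HTTP calls overlaps instead of adding up. The max_workers value is a guess and would need tuning against what the camera API tolerates.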

Get Latest Commit URL from PyGithub Efficiently

I'm using this function to get the latest commit url using PyGithub:
from github import Github

def getLastCommitURL():
    encrypted = 'mypassword'
    # naiveDecrypt defined elsewhere
    g = Github('myusername', naiveDecrypt(encrypted))
    org = g.get_organization('mycompany')
    code = org.get_repo('therepo')
    commits = code.get_commits()
    last = commits[0]
    return last.html_url
It works but it seems to make Github unhappy with my IP address and give me a slow response for the resulting url. Is there a more efficient way for me to do this?
This wouldn't work if you had no commits in the past 24 hours. But if you do, it seems to return faster and will request fewer commits, according to the Github API documentation:
from datetime import datetime, timedelta
from github import Github

def getLastCommitURL():
    encrypted = 'mypassword'
    # naiveDecrypt defined elsewhere
    g = Github('myusername', naiveDecrypt(encrypted))
    org = g.get_organization('mycompany')
    code = org.get_repo('therepo')
    # limit to commits in past 24 hours
    since = datetime.now() - timedelta(days=1)
    commits = code.get_commits(since=since)
    last = commits[0]
    return last.html_url
You could directly make a request to the api.
from urllib.request import urlopen
import json

def get_latest_commit(owner, repo):
    # per_page=1 asks the API for only the most recent commit
    url = 'https://api.github.com/repos/{owner}/{repo}/commits?per_page=1'.format(owner=owner, repo=repo)
    response = urlopen(url).read()
    data = json.loads(response.decode())
    return data[0]

if __name__ == '__main__':
    commit = get_latest_commit('mycompany', 'therepo')
    print(commit['html_url'])
In this case you would only be making one request to the API instead of 3, and you would only get the last commit instead of all of them. It should be faster as well.

Retrieving Twitter data on the fly

Our company is trying to read in all live streams of data entered by random users, i.e., a random user sends off a tweet saying "ABC company".
Seeing as how you could use a Twitter client to search for said text, I labour under the assumption that it's possible to aggregate all such tweets without using a client, i.e., streaming them live into a file, without relying on hashtags.
What's the best way to do this? And if you've done this before, could you share your script? I reckon the simplest way would be via ruby/python script left running, but my understanding of ruby/python is limited at best.
Kindly help?
Here's a bare minimum:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import twitter
from threading import *
from os import _exit, urandom
from time import sleep
from logger import *
import unicodedata

## Based on: https://github.com/sixohsix/twitter
class twitt(Thread):
    def __init__(self, tags = None, *args, **kwargs):
        self.consumer_key = '...'
        self.consumer_secret = '...'
        self.access_key = '...'
        self.access_secret = '...'
        self.encoding = 'iso-8859-15'
        self.args = args
        self.kwargs = kwargs
        self.searchapi = twitter.Twitter(domain="search.twitter.com").search
        self.alive = True  # set to False from outside to stop the thread
        Thread.__init__(self)
        self.start()

    def search(self, tag):
        try:
            return self.searchapi(q=tag)['results']
        except:
            return {}

    def run(self):
        # Idle until alive is cleared, so the thread can actually be stopped
        while self.alive:
            sleep(3)
To use it, do something like:
if __name__ == "__main__":
    t = twitt()
    print t.search('#DHSupport')
    t.alive = False
Note: The only reason this is threaded is that it's a piece of code I had lying around from other projects. It gives you an idea of how to work with the API and perhaps build a background service to fetch search results from Twitter.
There's a lot of crap in my original code, so the structure might look a bit odd.
Note that you don't really need the consumer keys etc. for just a search, but you will need an OAuth login for more features such as posting or checking messages.
The only two things you really need are:
import twitter
print twitter.Twitter(domain="search.twitter.com").search(q='#hashtag')['results']
