For the last month I've had a bit of a problem with a quite basic datastore query. It involves two db.Model classes, one referring to the other through a db.ReferenceProperty.
The problem is that, according to the admin logs, the request takes about 2-4 seconds to complete. I stripped it down to a bare form and a list to display the results.
The put works fine, but the get accumulates (in my opinion) way too much CPU time.
The get looks like this:
outputData['items'] = {}
labelsData = Label.all()
for label in labelsData:
    labelItem = label.item.name
    if labelItem not in outputData['items']:
        outputData['items'][labelItem] = { 'item' : labelItem, 'labels' : [] }
    outputData['items'][labelItem]['labels'].append(label.text)
path = os.path.join(os.path.dirname(__file__), 'index.html')
self.response.out.write(template.render(path, outputData))
And the models:
class Item(db.Model):
    name = db.StringProperty()

class Label(db.Model):
    text = db.StringProperty()
    lang = db.StringProperty()
    item = db.ReferenceProperty(Item)
I've tried structuring it a number of different ways, e.g. instead of the ReferenceProperty, storing all Label keys in the Item model as a db.ListProperty.
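For reference, a minimal sketch of that alternative layout (the property name here is just illustrative):

class Item(db.Model):
    name = db.StringProperty()
    # Keys of the Labels attached to this Item, instead of a ReferenceProperty on Label
    label_keys = db.ListProperty(db.Key)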
My test data is just 10 rows in Item and 40 in Label.
So my question: is it a fool's errand to try to optimize this because the high CPU usage comes down to the datastore itself, or have I just screwed up somewhere in the code?
..fredrik
EDIT:
I got a great response from djidjadji on the Google App Engine mailing list.
The new code looks like this:
outputData['items'] = {}
labelsData = Label.all().fetch(1000)
labelItems = db.get([Label.item.get_value_for_datastore(label) for label in labelsData])
for label, labelItem in zip(labelsData, labelItems):
    name = labelItem.name
    try:
        outputData['items'][name]['labels'].append(label.text)
    except KeyError:
        outputData['items'][name] = { 'item' : name, 'labels' : [label.text] }
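As I understand the fix, get_value_for_datastore() returns the stored key without dereferencing the ReferenceProperty, so the single batch db.get() replaces one datastore round trip per Label.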
There are certainly things you can do to optimize your code. For example, you're iterating over a query, which is less efficient than fetching the query results and iterating over those.
I'd recommend using Appstats to profile your app, and check out the Patterns of Doom series of posts.
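For completeness, wiring up Appstats on the old Python runtime is roughly this (a sketch of the standard recording middleware; the stats UI is then typically served at /_ah/stats):

# appengine_config.py
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    # Wrap the WSGI app so Appstats records every datastore/RPC call
    return recording.appstats_wsgi_middleware(app)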
Don't just try things. That's guessing. You'll only be right some of the time. Don't ask other people to guess either, for the same reason.
Be right every time.
Just pause the code several times and look at the call stack. That will tell you exactly what's going on.
I am learning Python 3 and have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
Next I have, I think, 'summaryguid', and I need to query a different API and return a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically call a GUID in the URL and return the correct information, but I haven't yet figured out how to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl", authmethod)
if summary.ok:
    fulldata = summary.json()
    for appsummary in fulldata["static-analysis"]["modules"]["module"]:
        print(appsummary["compiler"])
I would prefer not to have someone just type out the right answer yet; please just drop a few hints and let me continue to work through it logically, so I learn how to deal with what I assume is a common issue in the future. My thought right now is that I need to move my second if up into my initial block and continue the logic in that space, but I am stuck there.
You are on the right track! Here is a hint: the second API request can be nested inside the loop that iterates through the list of applications from the first API call. That way the second API call is made once for each application, and you can gather the information you need.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
Below is my current code. It connects successfully to the organization. How can I fetch the results of a query in Azure like they do here? I know this was solved, but there isn't an explanation, and there's quite a big gap in what they're doing.
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
from azure.devops.v5_1.work_item_tracking.models import Wiql
personal_access_token = 'xxx'
organization_url = 'zzz'
# Create a connection to the org
credentials = BasicAuthentication('', personal_access_token)
connection = Connection(base_url=organization_url, creds=credentials)
wit_client = connection.clients.get_work_item_tracking_client()
results = wit_client.query_by_id("my query ID here")
P.S. Please don't link me to the github or documentation. I've looked at both extensively for days and it hasn't helped.
Edit: I've added the results line that successfully gets the query. However, it returns a WorkItemQueryResult object, which is not exactly what is needed. I need a way to view the columns of the query and the results for those columns.
So I've figured this out in probably the most inefficient way possible, but hope it helps someone else and they find a way to improve it.
The issue with the WorkItemQueryResult object stored in the variable results is that it doesn't expose the contents of the work items.
So the goal is to be able to use the get_work_item method, which requires an id; you can get that (in a rather roundabout way) through item.target.id from the result's work_item_relations. The code below is added on.
for item in results.work_item_relations:
    id = item.target.id
    work_item = wit_client.get_work_item(id)
    fields = work_item.fields
This gets the id of every work item in your result and then grants access to the fields of that work item, which you can read with fields.get("System.Title"), etc.
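To view the query's columns next to the results, something along these lines should work (a sketch; it assumes results.columns exposes the reference names of the columns selected by the query, e.g. "System.Title"):

for item in results.work_item_relations:
    work_item = wit_client.get_work_item(item.target.id)
    for column in results.columns:
        # Print this work item's value for each column the query selects
        print(column.name, '=', work_item.fields.get(column.reference_name))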
I've been hitting my head against the wall because my Google App Engine python project has a very simple NDB projection query which works fine on my local machine, but mysteriously fails when deployed to production.
Adding to the mystery... as a test I added an identical projection on another property, and it works in both dev and production! Could anyone help please?! Here are more details:
I have the following entity that represents an expense:
class Entry(ndb.Model):
    datetime = ndb.DateTimeProperty(indexed=True, required=True)
    amount = ndb.IntegerProperty(indexed=False, required=True)
    payee = ndb.StringProperty(indexed=True, required=True)
    comment = ndb.StringProperty(indexed=False)
    # ...
Later on in the code I am doing a projection on Entry.payee (to get a list of all payees). As a test I also added a projection on Entry.datetime:
log_msg = ''  # For passing debug info to the browser
payeeObjects = Entry.query(ancestor=exp_traq_key(exp_traq_name), projection=[Entry.payee]).fetch()
payees = []
for obj in payeeObjects:
    payees.append(obj.payee)
log_msg += '%d payees: %s' % (len(payees), str(payees))

log_msg += ' ------------------- '  # a visual separator

dtObjects = Entry.query(ancestor=exp_traq_key(exp_traq_name), projection=[Entry.datetime]).fetch()
dts = []
for obj in dtObjects:
    dts.append(obj.datetime)
log_msg += '%d datetimes: %s' % (len(dts), str(dts))
# ...other code, including passing log_msg down to the client
In the dev environment the output shows both the list of payees and the list of datetimes in the console. When deployed to App Engine, however, I can't get it to return a list of payees: it keeps returning an empty list, even though in dev it returns the list fine.
I've ensured that I have the indexes properly set up on GAE:
Please help!
2018-12-05 Update:
I added a couple more entries in production and they got picked up! See screenshot. But the older entries are still not being returned.
My immediate reaction is that the datastore index needs to be "refreshed" somehow so it can "see" old entries. BUT the thing is I removed and recreated the index yesterday, which means it should have old entries... So still need help resolving this mystery!
I figured it out. Darn it wasn't intuitive at all. I wish GAE documentation was better on this point...
My datastore in production contains a lot of previously created entries. As part of my latest code where I'm trying to do the projection on Entry.payee, I had to change the definition of Entry.payee from unindexed to indexed, like so:
payee = ndb.StringProperty(indexed=True, required=True) # Originally was indexed=False
So now all those entries sitting in the datastore are being ignored by the projection query because the index on payee ignores those entries.
So what I need to do now is somehow migrate all those old entities to be indexed=True.
Update - here's how I did this migration. Turned out simpler than expected.
def runPayeeTypeMigration(exp_traq_name):
    entries = Entry.query(ancestor=exp_traq_key(exp_traq_name)).fetch()
    for entry in entries:
        entry.put()
This works by reading all entries through the updated model definition (the one where Entry.payee is indexed=True) and writing them back to the datastore, so that each entity is now indexed.
I have a bunch of wordlists on a server of mine, and I've been planning to make a simple open-source JSON API that returns whether a password is on the list [1], as a method of validation. I'm doing this in Python with Flask, and it literally just returns whether the input is present.
One small problem: the wordlists total about 150 million entries, and 1.1GB of text.
My API (minimal) is below. Is it more efficient to store every row in MongoDB and look up repeatedly, or to store the entire thing in memory using a singleton, and populate it on startup when I call app.run? Or are the differences subjective?
Furthermore, is it even good practice to do the latter? I'm thinking the lookups might start to become taxing if I open this to the public. I've also had someone suggest a Trie for efficient searching.
Update: I've done a bit of testing, and document searching is painfully slow with such a high number of records. Is it justifiable to use a database with proper indexes for a single column of data that needs to be efficiently searched?
from flask import Flask, request, redirect
from flask.views import MethodView
from flask.ext.pymongo import PyMongo
import json

app = Flask(__name__)
mongo = PyMongo(app)

class HashCheck(MethodView):
    def post(self):
        # Error-handling + test cases to come. Negate is for bool.
        return json.dumps({'result':
            not mongo.db.passwords.find_one({'pass': request.form["password"]})})

    def get(self):
        return redirect('/')

if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
[1]: I'm a security nut. I'm using it in my login forms to reject common input. One of the wordlists is UNIQPASS.
What I would suggest is a hybrid approach. As requests come in, do two checks: the first against a local in-memory cache, the second against the MongoDB store. If the first misses but the second hits, add the word to the in-memory cache. Over time the application will "fault in" the most common "bad passwords"/records (a rough sketch follows the two advantages below).
This has two advantages:
1) The common words are rejected very fast from within memory.
2) The startup cost is close to zero and amortized over many queries.
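A minimal sketch of that hybrid lookup, assuming mongo is the PyMongo handle from the question and the cache is a plain module-level set:

bad_password_cache = set()

def is_known_bad(password):
    if password in bad_password_cache:
        return True  # fast in-memory hit
    if mongo.db.passwords.find_one({'_id': password}) is not None:
        bad_password_cache.add(password)  # fault the word into the cache
        return True
    return False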
When storing the word list in MongoDB, I would make the _id field hold each word. By default you get an ObjectId, which is a complete waste in this case. We can then also leverage the automatic index on _id. I suspect the poor performance you saw was due to there not being an index on the 'pass' field. You can also try adding one on the 'pass' field with:
mongo.db.passwords.create_index("pass")
To complete the _id scenario: to insert a word:
mongo.db.passwords.insert( { "_id" : "password" } );
Queries then look like:
mongo.db.passwords.find( { "_id" : request.form["password"] } )
As @Madarco mentioned, you can also shave another bit off the query time by ensuring results are returned straight from the index, limiting the returned fields to just the _id field ({ "_id" : 1 }):
mongo.db.passwords.find( { "_id" : request.form["password"] }, { "_id" : 1} )
HTH - Rob
P.S. I am not a Python/Pymongo expert so might not have the syntax 100% correct. Hopefully it is still helpful.
Given that your list is totally static and fits in memory, I don't see a compelling reason to use a database.
I agree that a Trie would be efficient for your goal. A hash table would work too.
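For illustration, a minimal in-memory variant, assuming the wordlist is a plain text file with one password per line (the file name is made up):

with open('wordlist.txt') as fh:
    PASSWORDS = set(line.strip() for line in fh)  # loaded once at startup

def is_listed(password):
    return password in PASSWORDS  # O(1) average-case membership test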
PS: it's too bad about Python's Global Interpreter Lock. If you used a language with real multithreading, you could take advantage of the unchanging data structure and run the server across multiple cores with shared memory.
I would suggest checking out and trying Redis as an option too. It's fast, very fast, and has nice Python bindings. I would try to create a Redis set from the word list, then use the SISMEMBER command to check whether a word is in the set. SISMEMBER is an O(1) operation, so it should be faster than a Mongo query.
That's assuming you want the whole list in memory, of course, and that you are willing to do away with Mongo...
Here's more info on Redis's SISMEMBER, and on the Python bindings for Redis.
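A rough illustration with the redis-py client (the key name 'passwords' and the connection details are assumptions):

import redis

r = redis.Redis(host='localhost', port=6379)

# One-time load, e.g. r.sadd('passwords', *chunk_of_words), ideally batched with a pipeline.

def is_listed(password):
    # SISMEMBER is an O(1) membership test against the Redis set
    return r.sismember('passwords', password)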
I'd recommend kyotocabinet, it's very fast. I've used it in similar circumstances:
import sys
import json

import kyotocabinet as kyc
from flask import Flask, request, redirect
from flask.views import MethodView

app = Flask(__name__)

dbTree = kyc.DB()
if not dbTree.open('./passwords.kct', kyc.DB.OREADER):
    print >>sys.stderr, "open error: " + str(dbTree.error())
    raise SystemExit

class HashCheck(MethodView):
    def post(self):
        # Error-handling + test cases to come.
        return json.dumps({'result':
            dbTree.check(request.form["password"]) > 0})

    def get(self):
        return redirect('/')

if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
I've created the model for counting the number of views of my page:
class RequestCounter(models.Model):
    count = models.IntegerField(default=0)

    def __unicode__(self):
        return str(self.count)
For incrementing the counter I use:
def inc_counter():
    counter = RequestCounter.objects.get_or_create(id=1)[0]
    counter.count = F('count') + 1
    counter.save()
Then I show the number of the page views on my page and it works fine.
But now I need to cache my counter for some time. I use:
def get_view_count():
    view_count = cache.get('v_count')
    if view_count is None:
        cache.set('v_count', RequestCounter.objects.filter(id=1)[0], 15)
        view_count = cache.get('v_count')
    return view_count
After this I'm passing the result of get_view_count to my template.
So I expect my counter to stand still for 15 seconds and then change to a new value. But that isn't quite what happens: when I test this from my virtual Ubuntu machine it alternates, for example, between 55 and 56, and after 15 seconds it changes and alternates between 87 and 88.
The values are always alternating, and they don't differ much from each other.
If I try this locally on Windows, the counter seems fine, until I open more than one browser.
I've got no idea what to do with it. Do you see what the problem could be?
P.S. I tried using caching in the templates and got the same result.
What CACHE_BACKEND are you using? If it's locmem:// and you're running Apache, you'll have a separate cache for each Apache child process, which would explain the differing results. I had this a while ago and it was a subtle one to work out. I'd recommend switching to memcached if you're not already on it, as that avoids the multiple-caches problem.
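For reference, a minimal sketch of the switch in settings.py (host and port are assumptions; older Django uses the CACHE_BACKEND string, Django 1.3+ uses the CACHES dict):

# Older-style setting:
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

# Django 1.3+ equivalent:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}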