I have a bunch of wordlists on a server of mine, and I've been planning to make a simple open-source JSON API that returns whether a password is on the list [1], as a method of validation. I'm doing this in Python with Flask, and literally just returning whether the input is present.
One small problem: the wordlists total about 150 million entries, and 1.1GB of text.
My (minimal) API is below. Is it more efficient to store every row in MongoDB and look it up repeatedly, or to store the entire thing in memory using a singleton, populating it on startup when I call app.run? Or are the differences subjective?
Furthermore, is it even good practice to do the latter? I'm thinking the lookups might start to become taxing if I open this to the public. I've also had someone suggest a Trie for efficient searching.
Update: I've done a bit of testing, and document searching is painfully slow with such a high number of records. Is it justifiable to use a database with proper indexes for a single column of data that needs to be efficiently searched?
from flask import Flask, redirect, request
from flask.views import MethodView
from flask_pymongo import PyMongo  # flask.ext.* imports are deprecated
import json

app = Flask(__name__)
mongo = PyMongo(app)

class HashCheck(MethodView):

    def post(self):
        # find_one returns the matching document or None, so the negation is a real bool
        return json.dumps({'result':
            not mongo.db.passwords.find_one({'pass': request.form["password"]})})
        # Error-handling + test cases to come. Negate is for bool.

    def get(self):
        return redirect('/')

if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
[1]: I'm a security nut. I'm using it in my login forms and rejecting common input. One of the wordlists is UNIQPASS.
What I would suggest is a hybrid approach. As requests come in, do two checks: the first against a local in-memory cache and the second against the MongoDB store. If the first misses but the second hits, add the word to the in-memory cache. Over time the application will "fault in" the most common "bad passwords"/records.
This has two advantages:
1) The common words are rejected very fast from within memory.
2) The startup cost is close to zero and amortized over many queries.
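A minimal sketch of that hybrid lookup, assuming the _id scheme described below and a plain Python set as the in-memory cache (names here are illustrative, not from the original code):

# Hypothetical sketch of the cache-then-MongoDB lookup
known_bad = set()

def is_bad_password(word):
    if word in known_bad:                    # fast path: already faulted into memory
        return True
    doc = mongo.db.passwords.find_one({'_id': word}, {'_id': 1})
    if doc is not None:                      # cache the hit for subsequent requests
        known_bad.add(word)
        return True
    return False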
When storing the word list in MongoDB I would make the _id field hold each word. By default you will get an ObjectId, which is a complete waste in this case. We can then also leverage the automatic index on _id. I suspect the poor performance you saw was due to there not being an index on the 'pass' field. You can try adding one on the 'pass' field with:
mongo.db.passwords.create_index("pass")
To complete the _id scenario: to insert a word:
mongo.db.passwords.insert({ "_id" : "password" })
Queries then look like:
mongo.db.passwords.find( { "_id" : request.form["password"] } )
As @Madarco mentioned, you can also shave another bit off the query time by ensuring results are returned from the index, by limiting the returned fields to just the _id field ({ "_id" : 1 }).
mongo.db.passwords.find( { "_id" : request.form["password"] }, { "_id" : 1} )
HTH - Rob
P.S. I am not a Python/Pymongo expert so might not have the syntax 100% correct. Hopefully it is still helpful.
Given that your list is totally static and fits in memory, I don't see a compelling reason to use a database.
I agree that a Trie would be efficient for your goal. A hash table would work too.
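As a rough illustration, a hash-based lookup could be as simple as loading the wordlists into a Python set at startup (the file names below are placeholders, not the actual lists):

# Sketch: build an in-memory set once at startup
passwords = set()
for path in ('uniqpass.txt', 'other_wordlist.txt'):   # placeholder file names
    with open(path) as f:
        passwords.update(line.strip() for line in f)

def is_known_password(word):
    return word in passwords   # average O(1) membership test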
PS: it's too bad about Python's Global Interpreter Lock. If you used a language with real multithreading, you could take advantage of the unchanging data structure and run the server across multiple cores with shared memory.
I would suggest checking out and trying redis as an option too. It's fast, very fast, and has nice Python bindings. I would create a set in redis from the word list, then use the SISMEMBER command to check whether a word is in the set. SISMEMBER is an O(1) operation, so it should be faster than a Mongo query.
That's assuming you want the whole list in memory, of course, and that you are willing to do away with Mongo...
Here's more info on redis's SISMEMBER, and the Python bindings for redis.
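A rough sketch with the redis-py client (the key name and file name below are just placeholders):

import redis

r = redis.Redis(host='localhost', port=6379)

# One-time load: add every word to a redis set
with open('wordlist.txt') as f:
    for line in f:
        r.sadd('passwords', line.strip())

# Per-request check: O(1) membership test
def is_known_password(word):
    return r.sismember('passwords', word)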
I'd recommend kyotocabinet; it's very fast. I've used it in similar circumstances:
import sys
import json

import kyotocabinet as kyc
from flask import Flask, redirect, request
from flask.views import MethodView

dbTree = kyc.DB()
if not dbTree.open('./passwords.kct', kyc.DB.OREADER):
    print >>sys.stderr, "open error: " + str(dbTree.error())
    raise SystemExit

app = Flask(__name__)

class HashCheck(MethodView):

    def post(self):
        # DB.check returns the value size for an existing key, or -1 if the key is absent
        return json.dumps({'result':
            dbTree.check(request.form["password"]) >= 0})
        # Error-handling + test cases to come.

    def get(self):
        return redirect('/')

if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
Related
I am learning Python 3 and I have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
Next I have what I think is 'summaryguid', and I need to query a different API and return a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically call a GUID in the URL and return the correct information but I haven't yet figured out how to get it to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl", authmethod)
if summary.ok:
    fulldata = summary.json()
    for appsummary in fulldata["static-analysis"]["modules"]["module"]:
        print(appsummary["compiler"])
I would prefer that someone not just type out the right answer yet, but instead drop a few hints and let me continue to work through it logically, so I learn how to deal with what I assume is a common issue in the future. My thought right now is that I need to move my second if up into my initial block and continue the logic in that space, but I am stuck there.
You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications in the first API call. By doing so, you can get the information you require by making the second API call for each application.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
This question is partly about how to test external dependencies (a.k.a. integration tests) and partly about how to implement that in Python for SQL with BigQuery specifically. Answers that only address "this is how you should do integration tests" are also very welcome.
In my project I have two different datasets
'project_1.production.table_1'
'project_1.development.table_1'
When running my tests I would like to call the development environment. But how do I separate it properly from my production code? I don't want to clutter my production code with test (set-up) code.
Production code looks like:
def find_data(variable_x: str) -> DataFrame:
    query = '''
        SELECT *
        FROM `project_1.production.table_1`
        WHERE foo = @variable_x
        '''
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                # The parameter name must match the @variable_x placeholder in the query
                name='variable_x', type_="STRING", value=variable_x
            )
        ]
    )
    df = self.client.query(
        query=query, job_config=job_config).to_dataframe()
    return df
Solution 1 : Environment variables for the dataset
The python-dotenv module can be used to differentiate production from development, as I already do for some parts of my code. The problem is that BigQuery does not allow the dataset to be parameterized (to prevent SQL injection, I think). See the running parameterized queries docs.
From the docs
Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
So using an environment variable as the dataset name is not possible.
Solution 2 : Environment variable for flow control
I could add an if production == True check and select the dataset accordingly. However, this puts test/debug code in my production code, which I would like to avoid as much as possible.
from os import getenv

def find_data(variable_x: str) -> DataFrame:
    load_dotenv()
    PRODUCTION = getenv("PRODUCTION")
    if PRODUCTION == "True":  # getenv returns a string, not a bool
        *Execute query on project_1.production.table_1*
    else:
        *Execute query on project_1.development.table_1*
    job_config = (*snip*)
    df = (*snip*)
    return df
Solution 3 : Mimic function in testcode
Make a copy of the production code and set up the test code so that the development dataset is called.
This leads to duplication (one copy in the production code and one in the test code), and if the implementation of the function changes over time, the two copies can drift apart. So I don't think this solution is 'Embracing Change'.
Solution 4 : Skip testing this function
Perhaps this function does not need to be called at all in my test code. I could just take a snippet of the result of this query and inject that data into the tests that depend on it. However, then I need to adjust my architecture a bit.
The above solutions don't satisfy me completely. I wonder if there is another way to solve this issue or if one of the above solutions is acceptable?
It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want to be. You could replace the first part of your function with the following code:
query = '''
    SELECT *
    FROM `{table}`
    WHERE foo = @variable_x
    '''.format(table=getenv("DATA_TABLE"))
This works because the query is just a string, and you can do whatever you want with it before you pass it on to the BigQuery library. str.format lets us replace values inside a string, which is exactly what we need (see this article for a more in-depth explanation of str.format).
Important security note: it is in general a bad security practice to manipulate SQL queries as plain strings (as we are doing here), but since you control the environment variables of the application it should be safe in this particular case.
Note: Using flask_sqlalchemy here
I'm working on adding versioning to multiple services on the same DB. To make sure it works, I'm adding unit tests that confirm I get an error (for this case my error should be StaleDataError). For other services in other languages, I pulled the same object twice from the DB, updated one instance, saved it, updated the other instance, then tried to save that as well.
However, because SQLAlchemy adds a fake-cache layer between the DB and the service, when I update the first object it automatically updates the other object I hold in memory. Does anyone have a way around this? I created a second session (that solution had worked in other languages) but SQLAlchemy knows not to hold the same object in two sessions.
I was able to manually test it by putting time.sleep() halfway through the test and manually changing data in the DB, but I'd like a way to test this using just the unit code.
Example code:
def test_optimistic_locking(self):
    c = Customer(formal_name='John', id=1)
    db.session.add(c)
    db.session.flush()

    cust = Customer.query.filter_by(id=1).first()
    db.session.expire(cust)
    same_cust = Customer.query.filter_by(id=1).first()
    db.session.expire(same_cust)

    same_cust.formal_name = 'Tim'
    db.session.add(same_cust)
    db.session.flush()
    db.session.expire(same_cust)

    cust.formal_name = 'Jon'
    db.session.add(cust)
    with self.assertRaises(StaleDataError):
        db.session.flush()
    db.session.rollback()
It actually is possible; you need to create two separate sessions. See the unit tests of SQLAlchemy itself for inspiration. Here's a code snippet from one of our unit tests, written with pytest:
def test_article__versioning(connection, db_session: Session):
    article = ProductSheetFactory(title="Old Title", version=1)
    db_session.refresh(article)
    assert article.version == 1

    db_session2 = Session(bind=connection)
    article2 = db_session2.query(ProductSheet).get(article.id)
    assert article2.version == 1

    article.title = "New Title"
    article.version += 1
    db_session.commit()
    assert article.version == 2

    with pytest.raises(sqlalchemy.orm.exc.StaleDataError):
        article2.title = "Yet another title"
        assert article2.version == 1
        article2.version += 1
        db_session2.commit()
Hope that helps. Note that we use "version_id_generator": False in the model; that's why we increment the version ourselves. See the docs for details.
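For context, a rough sketch of how such a model could be configured (the class and column names are illustrative, not the actual model from this answer):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ProductSheet(Base):
    __tablename__ = "product_sheet"

    id = Column(Integer, primary_key=True)
    title = Column(String)
    version = Column(Integer, nullable=False)

    # Use this column for optimistic locking, but increment it manually
    __mapper_args__ = {
        "version_id_col": version,
        "version_id_generator": False,
    }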
For anyone who comes across this question: my current hypothesis is that it can't be done. SQLAlchemy is incredibly powerful and, given that the functionality is so good that we can't test this line, we should trust that it works as expected.
I have many routes on blueprints that do something along these lines:
# charts.py
@charts.route('/pie')
def pie():
    # ...
    return jsonify(my_data)
The data comes from a CSV which is grabbed once every x hours by a script which is separate from the application. The application reads this using a class which is then bound to the blueprint.
# __init__.py
from flask import Blueprint
from helpers.csv_reader import CSVReader
chart_blueprint = Blueprint('chart_blueprint', __name__)
chart_blueprint.data = CSVReader('statistics.csv')
from . import charts
My goal is to cache several of the route's responses, as the data does not change. The more challenging problem, however, is being able to explicitly purge the cache when my fetch script finishes.
How would one go about this? I'm a bit lost, but I imagine I will need to register a before_request handler on my blueprints.
ETag and Expires were made for exactly this:
class CSVReader(object):
    def read_if_reloaded(self):
        # Do your thing here
        self.expires_on = self.calculate_next_load_time()
        self.checksum = self.calculate_current_checksum()

@charts.route('/pie')
def pie():
    # Clients send the previously returned ETag back in the If-None-Match header
    if request.headers.get('If-None-Match') == charts.data.checksum:
        return ('', 304, {})
    # ...
    response = jsonify(my_data)
    response.headers['Expires'] = charts.data.expires_on
    response.headers['ETag'] = charts.data.checksum
    return response
Sean's answer is great for clients that come back and request the same information before the next batch is read in, but it does not help for clients that are coming in cold.
For new clients you can use cache servers such as redis or memcached to store the pre-calculated results. These servers are very simple key-value stores, but they are very fast. You can even set how long the values remain valid before they expire.
Cache servers help if calculating the result is time-consuming or computationally expensive, but if you are simply returning items from a file they will not make a dramatic improvement.
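For illustration, a minimal sketch with the redis-py client (the key name, timeout, and compute_pie_data helper are hypothetical):

import json
import redis

cache = redis.Redis(host='localhost', port=6379)

def cached_pie_data():
    cached = cache.get('pie-data')            # hypothetical key name
    if cached is not None:
        return json.loads(cached)
    data = compute_pie_data()                 # hypothetical expensive computation
    # Keep the value for an hour; the fetch script could also delete the key to purge it
    cache.set('pie-data', json.dumps(data), ex=3600)
    return data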
Here is a Flask pattern for using the werkzeug cache interface, and here is a link to the Flask-Cache extension.
For the last month I've had a bit of a problem with a quite basic datastore query. It involves two db.Models, with one referring to the other via a db.ReferenceProperty.
The problem is that, according to the admin logs, the request takes about 2-4 seconds to complete. I stripped it down to a bare form and a list to display the results.
The put works fine, but the get accumulates (in my opinion) way too much CPU time.
# The GET looks like this:
outputData['items'] = {}
labelsData = Label.all()
for label in labelsData:
    labelItem = label.item.name
    if labelItem not in outputData['items']:
        outputData['items'][labelItem] = {'item': labelItem, 'labels': []}
    outputData['items'][labelItem]['labels'].append(label.text)

path = os.path.join(os.path.dirname(__file__), 'index.html')
self.response.out.write(template.render(path, outputData))
# And the models:
class Item(db.Model):
    name = db.StringProperty()

class Label(db.Model):
    text = db.StringProperty()
    lang = db.StringProperty()
    item = db.ReferenceProperty(Item)
I've tried to do it a number of different ways, e.g. instead of a ReferenceProperty, storing all Label keys in the Item model as a db.ListProperty.
My test data is just 10 rows in Item and 40 in Label.
So my question: is it a fool's errand to try to optimize this, since the high CPU usage is due to problems with the datastore, or have I just screwed up somewhere in the code?
..fredrik
EDIT:
I got a great response from djidjadji on the Google App Engine mailing list.
The new code looks like this:
outputData['items'] = {}
labelsData = Label.all().fetch(1000)
labelItems = db.get([Label.item.get_value_for_datastore(label) for label in labelsData])
for label, labelItem in zip(labelsData, labelItems):
    name = labelItem.name
    try:
        outputData['items'][name]['labels'].append(label.text)
    except KeyError:
        outputData['items'][name] = {'item': name, 'labels': [label.text]}
There are certainly things you can do to optimize your code. For example, you're iterating over a query, which is less efficient than fetching the query results and iterating over those.
I'd recommend using Appstats to profile your app, and check out the Patterns of Doom series of posts.
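For reference, enabling Appstats on the old Python runtime is roughly a matter of adding an appengine_config.py like the sketch below (the Appstats console itself also has to be enabled via the appstats builtin in app.yaml):

# appengine_config.py -- sketch of enabling Appstats recording
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    # Wrap the WSGI app so each request is recorded for the Appstats console
    app = recording.appstats_wsgi_middleware(app)
    return app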
Don't just try things. That's guessing. You'll only be right some of the time. Don't ask other people to guess either, for the same reason.
Be right every time.
Just pause the code several times and look at the call stack. That will tell you exactly what's going on.