Select different dataset when testing | Separate test from production - python

This question is partly about how to test external dependencies (a.k.a. integration tests) and partly about how to implement that in Python, specifically for SQL on BigQuery. Answers that only address 'this is how you should do integration tests' are very welcome.
In my project I have two different datasets
'project_1.production.table_1'
'project_1.development.table_1'
When running my tests I would like to query the development environment. But how do I separate it properly from my production code? I don't want to clutter my production code with test (set-up) code.
Production code looks like:
def find_data(self, variable_x: str) -> DataFrame:
    query = '''
        SELECT *
        FROM `project_1.production.table_1`
        WHERE foo = @variable_x
    '''
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                name='variable_x', type_="STRING", value=variable_x
            )
        ]
    )
    df = self.client.query(
        query=query, job_config=job_config).to_dataframe()
    return df
Solution 1 : Environment variables for the dataset
The python-dotenv module can be used to differentiate production from development, as I already do for some parts of my code. The problem is that BigQuery does not allow parameterizing the dataset (to prevent SQL injection, I think). See the running parameterized queries docs.
From the docs
Parameters cannot be used as substitutes for identifiers, column
names, table names, or other parts of the query.
So supplying the dataset name from an environment variable as a query parameter is not possible.
Solution 2 : Environment variable for flow control
I could add an if production == True check and select the dataset accordingly. However, this puts test/debug code into my production code, which I would like to avoid as much as possible.
from os import getenv

from dotenv import load_dotenv


def find_data(variable_x: str) -> DataFrame:
    load_dotenv()
    PRODUCTION = getenv("PRODUCTION")
    if PRODUCTION == "True":
        # Execute query on project_1.production.table_1
        ...
    else:
        # Execute query on project_1.development.table_1
        ...
    job_config = ...  # (snip)
    df = ...  # (snip)
    return df
Solution 3 : Mimic function in testcode
Make a copy of the production code and set up the test code so that the development dataset is called.
This leads to duplication of code (one copy in production code and one in test code). The duplication will lead to a mismatch if the implementation of the function changes over time, so I don't think this solution is 'Embracing Change'.
Solution 4 : Skip testing this function
Perhaps this function does not need to be called at all in my test code. I could take a snippet of the result of this query and use that result as a 'data injection' into the tests that depend on it, as sketched below. However, that means I need to adjust my architecture a bit.
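A sketch of what that could look like, e.g. patching find_data in the dependent test with a canned DataFrame (the module path and build_report below are hypothetical):
from unittest.mock import patch

import pandas as pd

# Snapshot of a real query result, checked into the test data.
CANNED_RESULT = pd.DataFrame({"foo": ["bar"], "value": [42]})

def test_report_uses_query_result():
    # Replace the BigQuery-backed function with the canned snapshot.
    with patch("my_package.queries.find_data", return_value=CANNED_RESULT):
        report = build_report("bar")  # hypothetical code under test
    assert report.total == 42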
The above solutions don't satisfy me completely. I wonder if there is another way to solve this issue or if one of the above solutions is acceptable?

It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want. You could replace the first part of your function with the following code:
query = '''
    SELECT *
    FROM `{table}`
    WHERE foo = @variable_x
'''.format(table=getenv("DATA_TABLE"))
This works because the query is just a string, and you can do whatever you want with it before you pass it on to the BigQuery library. Python's str.format lets us replace values inside a string, which is exactly what we need (see this article for a more in-depth explanation of str.format).
Important security note: it is generally bad security practice to manipulate SQL queries as plain strings (as we are doing here), but since you control the environment variables of the application, it should be safe in this particular case.
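Putting it together, a minimal sketch of the function from the question with the table name taken from an environment variable (the DATA_TABLE variable name and the module-level client are assumptions for illustration):
from os import getenv

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

def find_data(variable_x: str) -> pd.DataFrame:
    # The table identifier comes from the environment (e.g. set per environment
    # via python-dotenv); the user-supplied value stays a query parameter.
    query = '''
        SELECT *
        FROM `{table}`
        WHERE foo = @variable_x
    '''.format(table=getenv("DATA_TABLE", "project_1.development.table_1"))
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                name="variable_x", type_="STRING", value=variable_x
            )
        ]
    )
    return client.query(query, job_config=job_config).to_dataframe()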

Related

How can I write a unit test for a function that makes a network request without changing its interface?

I read in Working Effectively with Legacy Code (the book) that unit tests run fast; if they don't run fast, they aren't unit tests. A test is not a unit test if: 1. it talks to a database; 2. it communicates across a network; 3. it touches the file system; 4. you have to do special things to your environment (such as editing configuration files) to run it.
I have a function that is downloading the zip from the internet and then converting it into a python object for a particular class.
import typing as t

import requests


def get_book_objects(date: str) -> t.List[Book]:
    # download the zip with the date from the endpoint
    res = requests.get(f"HTTP-URL-{date}")
    # code to read the response content in BytesIO and then use the ZipFile module
    # to extract data.
    # parse the data and return a list of Book objects
    return books
Let's say I want to write a unit test for the function get_book_objects. How am I supposed to write a unit test without making a network request? I would prefer a file system read over a network request, because it is much faster, even though it is also written that a good unit test does not touch the file system; I'd be fine with that.
So even if I want to write a unit test where I provide a local zip file, I have to modify the existing function to open the file from the local file system, or add an additional parameter to the function so I can pass a zip file path from the unit test.
What will you do to write a good unit test in this kind of situation?
What will you do to write a good unit test in this kind of situation?
In the TDD world, the usual answer would be to delegate the work to a more easily tested component.
Consider:
def get_book_objects(date: str) -> t.List[Book]:
    # This is the piece that makes get_book_objects hard
    # to isolate
    http_get = requests.get

    # download the zip with the date from the endpoint
    res = http_get(f"HTTP-URL-{date}")
    # code to read the response content in BytesIO and then use the ZipFile module
    # to extract data.
    # parse the data and return a list of Book objects
    return books
which might then become something like
def get_book_objects(date: str) -> t.List[Book]:
    # This is the piece that makes get_book_objects hard
    # to isolate
    http_get = requests.get
    return get_book_objects_v2(http_get, date)


def get_book_objects_v2(http_get, date: str) -> t.List[Book]:
    # download the zip with the date from the endpoint
    res = http_get(f"HTTP-URL-{date}")
    # code to read the response content in BytesIO and then use the ZipFile module
    # to extract data.
    # parse the data and return a list of Book objects
    return books
get_book_objects is still hard to test, but it is also "so simple that there are obviously no deficiencies". On the other hand, get_book_objects_v2 is easy to test, because your test can control what callable is passed to the subject, and can use any reasonable substitute you like.
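For example, a test for get_book_objects_v2 can pass in a stub instead of requests.get. The sketch below is hypothetical: it assumes the elided parsing reads a books.csv entry with a title column out of the zip, and that Book exposes a title attribute.
import io
import zipfile

def test_get_book_objects_v2_parses_zip():
    # Build an in-memory zip file standing in for the downloaded payload.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("books.csv", "title\nDune\n")

    class FakeResponse:
        content = buffer.getvalue()

    def fake_http_get(url):
        # No network: just return the canned response.
        return FakeResponse()

    books = get_book_objects_v2(fake_http_get, "2021-01-01")
    assert [book.title for book in books] == ["Dune"]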
What we've done is shift most of the complexity/risk into a "unit" that is easier to test. For the function that is still hard to test, we'll use other techniques.
When authors talk about tests "driving" the design, this is one example - we're treating "complicated code needs to be easy to test" as a constraint on our design.
You've already identified the correct reference (Working Effectively with Legacy Code). The material you want is the discussion of seams.
A seam is a place where you can alter behavior in your program without editing in that place.
(In my edition of the book, the discussion begins in Chapter 4).

Converting Intersystems cache objectscript into a python function

I am accessing an InterSystems Caché 2017.1.xx instance through a Python process to get various attributes about the database in order to monitor it.
One of the items I want to monitor is license usage. I wrote an ObjectScript script in a Terminal window to get license usage by user:
s Rset=##class(%ResultSet).%New("%SYSTEM.License.UserListAll")
s r=Rset.Execute()
s ncol=Rset.GetColumnCount()
While (Rset.Next()) {f i=1:1:ncol w !,Rset.GetData(i)}
But I have been unable to determine how to convert this script into a Python equivalent. I am using the intersys.pythonbind3 package for connecting to and accessing the Caché instance. I have been able to write Python functions that access most everything else in the instance, but I cannot figure out how to translate this one piece to Python (3.7).
The following should work (based on the documentation):
query = intersys.pythonbind3.query(database)
query.prepare_class("%SYSTEM.License", "UserListAll")
query.execute()
# Fetch each row in the result set, and print the
# name and value of each column in a row:
while 1:
    cols = query.fetch([None])
    if len(cols) == 0:
        break
    print(str(cols[0]))
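For completeness, the query object above needs a database handle; with the Python binding the connection is typically set up along the following lines (the host, namespace and credentials are placeholders to adjust for your instance):
import intersys.pythonbind3

# Connect to the instance and obtain a database handle for the query object.
conn = intersys.pythonbind3.connection()
conn.connect_now('localhost[1972]:%SYS', '_SYSTEM', 'SYS', None)
database = intersys.pythonbind3.database(conn)
query = intersys.pythonbind3.query(database)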
Also, note that InterSystems IRIS, the successor to Caché, now has Python as an embedded language. See more in the docs.
It turns out the noted query "UserListAll" is not defined as an SqlProc in the library, so it cannot be called directly this way. Resolving this would require an ObjectScript routine that wraps the query, plus the use of a result set (or similar) from Python to fetch the results. So I am marking this as resolved.
Not sure which Python interface you're using for Caché/IRIS, but this open-source third-party one is worth investigating for the kind of things you're trying to do:
https://github.com/chrisemunt/mg_python

How to prevent LDAP-injection in ldap3 for python3

I'm writing some python3 code using the ldap3 library and I'm trying to prevent LDAP-injection. The OWASP injection-prevention cheat sheet recommends using a safe/parameterized API(among other things). However, I can't find a safe API or safe method for composing search queries in the ldap3 docs. Most of the search queries in the docs use hard-coded strings, like this:
conn.search('dc=demo1,dc=freeipa,dc=org', '(objectclass=person)')
and I'm trying to avoid the need to compose queries in a manner similar to this:
conn.search(search, '(accAttrib=' + accName + ')')
Additionally, there seems to be no mention of 'injection' or 'escaping' or similar concepts in the docs. Does anyone know if this is missing in this library altogether or if there is a similar library for Python that provides a safe/parameterized API? Or has anyone encountered and solved this problem before?
A final point: I've seen the other StackOverflow questions that point out how to use whitelist validation or escaping as a way to prevent LDAP-injection and I plan to implement them. But I'd prefer to use all three methods if possible.
I was a little surprised that the documentation doesn't seem to mention this. However, there is a utility function, escape_filter_chars, which I believe is what you are looking for:
from ldap3.utils import conv
attribute = conv.escape_filter_chars("bar)", encoding=None)
query = "(foo={0})".format(attribute)
conn.search(search, query)
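If you also want the whitelist validation you mentioned, one option is a small helper that validates and then escapes the value before building the filter (the allowed-characters pattern below is just an illustrative assumption; adjust it to your account names):
import re

from ldap3.utils import conv

def build_account_filter(acc_name: str) -> str:
    # Whitelist validation first: reject anything outside the expected alphabet.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", acc_name):
        raise ValueError("invalid account name")
    # Then escape any remaining filter metacharacters as a second layer.
    return "(accAttrib={0})".format(conv.escape_filter_chars(acc_name, encoding=None))

conn.search(search, build_account_filter(accName))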
I believe the best way to prevent LDAP injection in python3 is to use the Abstraction Layer.
Example code:
from ldap3 import ObjectDef, Reader

# First create a connection to ldap to use.
# I use a function that creates my connection (abstracted here)
conn = self.connect_to_ldap()
o = ObjectDef('groupOfUniqueNames', conn)
query = 'Common Name: %s' % cn
r = Reader(conn, o, 'dc=example,dc=org', query)
r.search()
Notice how the query is abstracted? The search would raise an error if someone tried to inject a filter here. Also, this search is protected by a Reader instead of a Writer cursor. The ldap3 documentation covers all of this.

multi-processing with spark(PySpark) [duplicate]

This question already has an answer here:
How to run independent transformations in parallel using PySpark?
(1 answer)
Closed 5 years ago.
The use case is the following:
I have a large dataframe with a 'user_id' column (every user_id can appear in many rows). I have a list of users, my_users, which I need to analyse.
Groupby, filter and aggregate could be a good idea, but the available aggregation functions included in pyspark did not fit my needs. In this pyspark version, user-defined aggregation functions are still not fully supported, and I decided to leave that for now.
Instead, I simply iterate over the my_users list, filter each user in the dataframe, and analyse. To optimize this procedure, I decided to use a Python multiprocessing pool, with one task per user in my_users.
The function that does the analysis (and is passed to the pool) takes two arguments: the user_id and a path to the main dataframe (in Parquet format), on which I perform all the computations. In the function, I load the dataframe and work on it (a DataFrame can't be passed as an argument itself).
I get all sorts of weird errors, on some of the processes (different in each run), that look like:
PythonUtils does not exist in the JVM (when reading the 'parquet' dataframe)
KeyError: 'c' not found (also, when reading the 'parquet' dataframe. What is 'c' anyway??)
When I run it without any multiprocessing, everything runs smoothly, but slowly.
Any ideas where these errors are coming from?
I'll put some code sample just to make things clearer:
from multiprocessing import Pool

PYSPARK_SUBMIT_ARGS = '--driver-memory 4g --conf spark.driver.maxResultSize=3g --master local[*] pyspark-shell'  # if it's relevant
# ....


def users_worker(df_path, user_id):
    df = spark.read.parquet(df_path)  # The problem is here!
    # the analysis of user_id in df is here


def users_worker_wrapper(args):
    return users_worker(*args)


def analyse():
    # ...
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    users_pool = Pool(processes=len(my_users))
    users_pool.map(users_worker_wrapper, users_worker_args)
    users_pool.close()
    users_pool.join()
Indeed, as @user6910411 commented, when I changed the Pool to a ThreadPool (from the multiprocessing.pool module), everything worked as expected and these errors were gone.
The root reasons for the errors themselves are also clear now, if you want me to share them, please comment below.
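For reference, the fix amounted to swapping the process pool for a thread pool, so all workers run in the same process and share the one SparkContext instead of forking new Python processes; a sketch against the code above:
from multiprocessing.pool import ThreadPool

# Same map/close/join API as multiprocessing.Pool, but thread-based.
users_worker_args = [(df_path, user_id) for user_id in my_users]
users_pool = ThreadPool(processes=len(my_users))
users_pool.map(users_worker_wrapper, users_worker_args)
users_pool.close()
users_pool.join()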

Lookup speed: State or Database?

I have a bunch of wordlists on a server of mine, and I've been planning to make a simple open-source JSON API that returns whether a password is on the list [1], as a method of validation. I'm doing this in Python with Flask, and it literally just returns whether the input is present.
One small problem: the wordlists total about 150 million entries, and 1.1GB of text.
My API (minimal) is below. Is it more efficient to store every row in MongoDB and look up repeatedly, or to store the entire thing in memory using a singleton, and populate it on startup when I call app.run? Or are the differences subjective?
Furthermore, is it even good practice to do the latter? I'm thinking the lookups might start to become taxing if I open this to the public. I've also had someone suggest a Trie for efficient searching.
Update: I've done a bit of testing, and document searching is painfully slow with such a high number of records. Is it justifiable to use a database with proper indexes for a single column of data that needs to be efficiently searched?
import json

from flask import Flask, redirect, request
from flask.views import MethodView
from flask_pymongo import PyMongo

app = Flask(__name__)
mongo = PyMongo(app)


class HashCheck(MethodView):

    def post(self):
        # Error-handling + test cases to come. Negate is for bool.
        return json.dumps({
            'result': not mongo.db.passwords.find_one({'pass': request.form["password"]})
        })

    def get(self):
        return redirect('/')


if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
[1]: I'm a security nut. I'm using it in my login forms and rejecting common input. One of the wordlists is UNIQPASS.
What I would suggest is a hybrid approach. As requests are made, do two checks: the first against a local cache, and the second against the MongoDB store. If the first fails but the second succeeds, add the entry to the in-memory cache. Over time the application will "fault in" the most common "bad passwords"/records.
This has two advantages:
1) The common words are rejected very fast from within memory.
2) The startup cost is close to zero and amortized over many queries.
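A minimal sketch of that hybrid lookup, assuming an in-process set as the cache and the _id scheme described just below:
# In-process cache of passwords already confirmed to be on the list.
known_bad = set()

def is_listed(password):
    # 1) Fast path: check the in-memory cache first.
    if password in known_bad:
        return True
    # 2) Fall back to MongoDB, returning only the indexed _id field.
    if mongo.db.passwords.find_one({"_id": password}, {"_id": 1}) is not None:
        known_bad.add(password)
        return True
    return False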
When storing the word list in MongoDB, I would make the _id field hold each word. By default you get an ObjectId, which is a complete waste in this case. We can then also leverage the automatic index on _id. I suspect the poor performance you saw was due to there not being an index on the 'pass' field. You can also try adding one on the 'pass' field with:
mongo.db.passwords.create_index("pass")
To complete the _id scenario: to insert a word:
mongo.db.passwords.insert( { "_id" : "password" } );
Queries then look like:
mongo.db.passwords.find( { "_id" : request.form["password"] } )
As @Madarco mentioned, you can also shave another bit off the query time by ensuring results are returned from the index, by limiting the returned fields to just the _id field ({ "_id" : 1 }):
mongo.db.passwords.find( { "_id" : request.form["password"] }, { "_id" : 1} )
HTH - Rob
P.S. I am not a Python/Pymongo expert so might not have the syntax 100% correct. Hopefully it is still helpful.
Given that your list is totally static and fits in memory, I don't see a compelling reason to use a database.
I agree that a Trie would be efficient for your goal. A hash table would work too.
PS: it's too bad about Python's Global Interpreter Lock. If you used a language with real multithreading, you could take advantage of the unchanging data structure and run the server across multiple cores with shared memory.
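If you go the in-memory route, a plain Python set already gives average O(1) membership checks and the whole list loads once at startup; a rough sketch, assuming one word per line in a wordlist.txt file:
# Load the word list once at startup; lookups afterwards are O(1) on average.
with open("wordlist.txt", encoding="utf-8") as fh:
    passwords = {line.rstrip("\n") for line in fh}

def is_listed(password):
    return password in passwords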
I would suggest checking out and trying Redis as an option too. It's fast, very fast, and has nice Python bindings. I would create a set in Redis from the word list, then use the SISMEMBER command to check whether the word is in the set. SISMEMBER is an O(1) operation, so it should be faster than a Mongo query.
That's assuming you want the whole list in memory, of course, and that you are willing to do away with Mongo...
Here's more info on Redis's SISMEMBER and the Python bindings for Redis.
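A rough sketch of that approach with the redis-py client (the key name and file path are assumptions):
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# One-time load of the word list into a Redis set
# (for 150M entries you would batch this, e.g. with a pipeline).
with open("wordlist.txt", encoding="utf-8") as fh:
    for line in fh:
        r.sadd("passwords", line.rstrip("\n"))

def is_listed(password):
    # SISMEMBER is O(1) per lookup.
    return r.sismember("passwords", password)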
I'd recommend kyotocabinet, it's very fast. I've used it in similar circumstances:
import json
import sys

import kyotocabinet as kyc
from flask import Flask, redirect, request
from flask.views import MethodView

app = Flask(__name__)

dbTree = kyc.DB()
if not dbTree.open('./passwords.kct', kyc.DB.OREADER):
    print("open error: " + str(dbTree.error()), file=sys.stderr)
    raise SystemExit


class HashCheck(MethodView):

    def post(self):
        # Error-handling + test cases to come.
        return json.dumps({
            'result': dbTree.check(request.form["password"]) > 0
        })

    def get(self):
        return redirect('/')


if __name__ == "__main__":
    app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
    app.run(host="0.0.0.0", debug=True)
