Transparently encoding/decoding `$` and `.` when inserting/retrieving documents in MongoDB - python

I'm working on an API written in Python that accepts JSON payloads from clients, applies some validation and stores the payloads in MongoDB so that they can be processed asynchronously.
However, I'm running into some trouble with payloads that (legitimately) include keys that start with `$` and/or include `.`. According to the MongoDB documentation, my best bet is to escape these characters:
In some cases, you may wish to build a BSON object with a user-provided key. In these situations, keys will need to substitute the reserved $ and . characters. Any character is sufficient, but consider using the Unicode full width equivalents: U+FF04 (i.e. “＄”) and U+FF0E (i.e. “．”).
Fair enough, but here's where it gets interesting. I'd like this process to be transparent to the application, so:
The keys should be unescaped when retrieving the documents...
...but only keys that needed to be escaped in the first place.
For example, suppose a (nefarious) user sends a JSON payload that includes a key like \ff04mixed.chars. When the application gets this document from the storage backend, this key should be converted back into \ff04mixed.chars, not $mixed.chars.
My primary concern here is information leakage; I don't want somebody to discover that the application requires special treatment for $ and . characters. The bad guys probably know how to exploit MongoDB way better than I know how to secure it, and I don't want to take any chances.

Here's the approach I ended up going with:
Before inserting a document into Mongo, run it through a SONManipulator that searches for and escapes any illegal keys in the document.
The original keys get stored as a separate attribute in the document so that we can restore them later.
After retrieving a document from Mongo, run it through the SONManipulator to extract the original keys and restore them.
Here's an abbreviated example:
# Example of a document with naughty keys.
document = {
    '$foo': 'bar',
    '$baz': 'luhrmann'
}
##
# Before inserting the document, we must first run it through our
# SONManipulator.
manipulator = KeyEscaper()
escaped = manipulator.transform_incoming(document, collection.name)
# Now we can insert the document.
document_id = collection.insert_one(escaped).inserted_id
##
# Later, we retrieve the document.
raw = collection.find_one({'_id': document_id})
# Run the document through our KeyEscaper to restore the original
# keys.
unescaped = manipulator.transform_outgoing(raw, collection.name)
assert unescaped == document
The actual document stored in MongoDB looks like this:
{
    "_id": ObjectId('582cebe5cd9b344c814d98e3'),
    "__escaped__1": "luhrmann",
    "__escaped__0": "bar",
    "__escaped__": {
        "__escaped__1": ["$baz", {}],
        "__escaped__0": ["$foo", {}]
    }
}
Note the __escaped__ attribute that contains the original keys so that they can be restored when the document is retrieved.
This makes querying against the escaped keys a little tricky, but that's infinitely preferable to not being able to store the document in the first place.
Full code with unit tests and example usage:
https://gist.github.com/todofixthis/79a2f213989a3584211e49bfba582b40
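The escape-and-restore idea described above can be illustrated with a much smaller, standalone sketch. This is not the code from the gist: the function names are hypothetical, the storage layout is simplified (the original key is stored as a plain string rather than the `[key, {}]` pairs shown above), and only nested dicts are handled, not lists of dicts. Keys that already start with `__escaped__` are escaped too, so that a user-supplied key cannot collide with the bookkeeping attribute.

```python
ESCAPED_KEY = "__escaped__"


def needs_escaping(key):
    # Keys MongoDB rejects, plus anything that could clash with our own
    # bookkeeping attribute.
    return key.startswith("$") or "." in key or key.startswith(ESCAPED_KEY)


def escape_keys(doc):
    """Return a copy of doc in which problem keys are replaced by placeholders."""
    out, originals = {}, {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = escape_keys(value)
        if needs_escaping(key):
            placeholder = "%s%d" % (ESCAPED_KEY, len(originals))
            originals[placeholder] = key
            key = placeholder
        out[key] = value
    if originals:
        # Remember the original keys so they can be restored exactly.
        out[ESCAPED_KEY] = originals
    return out


def unescape_keys(doc):
    """Reverse escape_keys(), restoring only keys that were actually escaped."""
    originals = doc.get(ESCAPED_KEY, {})
    out = {}
    for key, value in doc.items():
        if key == ESCAPED_KEY:
            continue
        if isinstance(value, dict):
            value = unescape_keys(value)
        out[originals.get(key, key)] = value
    return out


document = {"$foo": "bar", "nested": {"a.b": 1, "plain": 2}}
stored = escape_keys(document)
restored = unescape_keys(stored)
```

Because the original key is stored verbatim, a key that merely looks like an escaped one is returned exactly as it was submitted.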

Related

For the deidentify_with_fpe() Python API wrapper for google DLP what are the arguments needed to pass through?

I am working through the Google Cloud DLP API documentation available here; this question is specifically about deidentify_with_fpe().
My question is: what is the format of the arguments that need to be passed to the function for it to return anonymised data? My code at the moment is:
def deidentify_with_fpe(
    string,
    info_types,
    alphabet=1,
    project='XXXX-data-development',
    surrogate_type=None,
    key_name='projects/XXXX-data-development/locations/global/keyRings/google-dlp-test-global/cryptoKeys/google-dlp-test-key-global',
    wrapped_key=WRAPPED
):
    "read file in for wrapped key"
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string using Format Preserving Encryption (FPE).
    Args:
        project: The Google Cloud project id to use as a parent resource.
        item: The string to deidentify (will be treated as text).
        alphabet: The set of characters to replace sensitive ones with. For
            more information, see https://cloud.google.com/dlp/docs/reference/
            rest/v2beta2/organizations.deidentifyTemplates#ffxcommonnativealphabet
        surrogate_type: The name of the surrogate custom info type to use. Only
            necessary if you want to reverse the deidentification process. Can
            be essentially any arbitrary string, as long as it doesn't appear
            in your dataset otherwise.
        key_name: The name of the Cloud KMS key used to encrypt ('wrap') the
            AES-256 key. Example:
            key_name = 'projects/YOUR_GCLOUD_PROJECT/locations/YOUR_LOCATION/
            keyRings/YOUR_KEYRING_NAME/cryptoKeys/YOUR_KEY_NAME'
        wrapped_key: The encrypted ('wrapped') AES-256 key to use. This key
            should be encrypted using the Cloud KMS key specified by key_name.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp
    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient(credentials='/Users/callumsmyth/virtual_envs/google_dlp_test/XXXX.json')
    dlp = dlp_client.from_service_account_json('/Users/callumsmyth/virtual_envs/google_dlp_test/XXXX.json')
    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)
    # The wrapped key is base64-encoded, but the library expects a binary
    # string, so decode it here.
    import base64
    # wrapped_key = base64.b64decode(wrapped_key)
    # Construct FPE configuration dictionary
    crypto_replace_ffx_fpe_config = {
        "crypto_key": {
            "kms_wrapped": {
                "wrapped_key": wrapped_key,
                "crypto_key_name": key_name,
            }
        },
        "common_alphabet": alphabet,
    }
    # Add surrogate type
    if surrogate_type:
        crypto_replace_ffx_fpe_config["surrogate_info_type"] = {
            "name": surrogate_type
        }
    # Construct inspect configuration dictionary
    inspect_config = {
        "info_types": [{"name": info_type} for info_type in info_types]
    }
    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": crypto_replace_ffx_fpe_config
                    }
                }
            ]
        }
    }
    # Convert string to item
    item = {"value": string}
    # Call the API
    response = dlp.deidentify_content(
        parent,
        inspect_config=inspect_config,
        deidentify_config=deidentify_config,
        item=item,
    )
    # Print results
    print(response.item.value)
Where
with open('mysecret.txt.encrypted', 'rb') as f:
    WRAPPED = f.read()
and the mysecret.txt.encrypted was generated by this command in the terminal
--keyring google-dlp-test-global --key google-dlp-test-key-global \
--plaintext-file google-token.txt \
--ciphertext-file mysecret.txt.encrypted
The google-token.txt file was generated from here.
The error I am getting when calling deidentify_with_fpe('My name is john smith', ['FIRST_NAME']) is as follows:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered."
debug_error_string = "{"created":"#1581675678.839972000","description":"Error received from peer ipv4:216.58.213.10:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.","grpc_status":3}"
which is a direct cause of:
InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.
So I think my issue is to do with the key before it is encrypted. I can't see anywhere in the documentation that explains how to source that key or how to pass it into the function.
I appreciate this is a lengthy submission, and any response would be appreciated. I've spent too long trying to do this and feel like I'm close to getting it to work.
The error:
“google.api_core.exceptions.InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.”
This is a generic error raised when free-form text de-identification fails because of transformation errors. Unfortunately, it seems the Python library does not expose the error details.
As per the service documentation [1], the detected tokens must be at least two characters long:
The input value:
- Must be at least two characters long (or the empty string).
- Must be encoded as ASCII.
- Comprised of the characters specified by an "alphabet," which is the set of between 2 and 64 allowed characters in the input value. For more information, see the alphabet field in CryptoReplaceFfxFpeConfig.
[1] https://cloud.google.com/dlp/docs/transformations-reference#fpe
Change the alphabet argument from 1 to one of the options below:
Comprised of the characters specified by alphabet. Valid options:
NUMERIC
HEXADECIMAL
UPPER_CASE_ALPHA_NUMERIC
ALPHA_NUMERIC
The input value:
Must be at least two characters long (or the empty string).
Must be comprised of the characters specified by an alphabet. The alphabet can be comprised of between 2 and 95 characters. (An alphabet with 95 characters includes all printable characters in the US-ASCII character set.)
If your input is in the form 111-222-333 then your custom alphabet field should be: "customAlphabet": "-0123456789"
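The constraints quoted in both answers can be checked locally before calling the API, which helps narrow down why a transformation fails. The following is only a sketch: `fits_alphabet` and `COMMON_ALPHABETS` are hypothetical helpers, not part of the google-cloud-dlp library, and the character sets assigned to the named alphabets reflect my reading of the DLP reference (digits; digits plus A-F; digits plus upper case; digits plus both cases).

```python
import string

# Assumed character sets for the named alphabets listed above.
COMMON_ALPHABETS = {
    "NUMERIC": string.digits,
    "HEXADECIMAL": string.digits + "ABCDEF",
    "UPPER_CASE_ALPHA_NUMERIC": string.digits + string.ascii_uppercase,
    "ALPHA_NUMERIC": string.digits + string.ascii_uppercase + string.ascii_lowercase,
}


def fits_alphabet(value, alphabet):
    """Check a value against a named alphabet or a custom alphabet string."""
    allowed = COMMON_ALPHABETS.get(alphabet, alphabet)
    if value == "":
        # The empty string is explicitly allowed.
        return True
    # Otherwise: at least two characters, all drawn from the alphabet.
    return len(value) >= 2 and all(ch in allowed for ch in value)


name_vs_numeric = fits_alphabet("john", "NUMERIC")
name_vs_alnum = fits_alphabet("john", "ALPHA_NUMERIC")
dashed_vs_custom = fits_alphabet("111-222-333", "-0123456789")
```

Under these assumptions, a detected first name such as "john" does not fit a purely numeric alphabet, which is consistent with the advice to choose a different alphabet.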

Python Dictionary JSON Key, Val Extraction

I'm having a hard time understanding what is going on with this Walmart API, and I can't seem to iterate through the keys and values the way I want. I get different errors depending on how I approach the problem.
import requests
import json
import urllib
response=requests.get("https://grocery.walmart.com/v0.1/api/stores/4104/departments/1256653758154/aisles/1256653758260/products?count=60&start=0")
info = json.loads(response.text)
print(info)
I'm not sure if I'm playing with a dictionary or a JSON object.
I'm thrown off because the API itself has no quotes over key/val.
When I do a json.loads it comes in but only comes in with single quotes.
I've tried going at it with for-loops but can only traverse the top layer and nothing else. My overall goal is to retrieve the info from the API link, turn it into JSON and be able to grab whichever key/val I need from it.
I'm not sure if I'm playing with a dictionary or a JSON object.
Python has no concept of a "JSON Object". It's a dictionary.
I'm thrown off because the API itself has no quotes over key/val.
Yes, it does:
{"aisleName":"Organic Dairy, Eggs & Meat","productCount":17,"products":[{"data":
When I do a json.loads it comes in but only comes in with single quotes
Because it's a Python dictionary, and the repr() of dict uses single quotes.
Try print(info['aisleName']) for example
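To get below the top layer, the parsed result can be walked recursively, since `json.loads` returns ordinary dicts and lists. Here is a minimal sketch; the `sample` string is a made-up stand-in shaped like the response fragment above (the fields inside `data` are assumptions), and `walk` is a hypothetical helper.

```python
import json

# Hypothetical sample shaped like the API response; "name" is an assumed field.
sample = (
    '{"aisleName": "Organic Dairy, Eggs & Meat", "productCount": 2, '
    '"products": [{"data": {"name": "Milk"}}, {"data": {"name": "Eggs"}}]}'
)


def walk(node, path=()):
    """Yield (path, value) for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, path + (index,))
    else:
        yield path, node


info = json.loads(sample)
leaves = list(walk(info))

# Direct access works the same way once the structure is known.
first_product = info["products"][0]["data"]["name"]
```

Each path is a tuple of dict keys and list indexes, so `("products", 0, "data", "name")` corresponds to `info["products"][0]["data"]["name"]`.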

Representation of python dictionaries with unicode in database queries

I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages. So messages will be sent as strings and then loaded into python dictionaries. This means that the representation, as a python dictionary, afterwards will look something like:
{u"mykey": u"myVal"}
The system has no problem handling such structures in itself; the trouble starts when I make a database query to store this structure.
I'm using pyOrient towards OrientDB. The command ends up something like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
More explicitly stating that the first code block shows the object as a dict AFTER being dumped / loaded with json to avoid confusion.
Dargolith:
OK, based on your last response, it seems you are simply looking for code that will dump a Python expression in a way that lets you control how unicode and other data types print. Here is a very simple function that provides this control. There are ways to make it more efficient (for example, by using a string buffer rather than all of the recursive string concatenation happening here). Still, it is a very simple function, and as it stands its execution time is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control of how each data type prints.
def expr_to_str(thing):
    if hasattr(thing, 'keys'):
        pairs = ['%s:%s' % (expr_to_str(k),expr_to_str(v)) for k,v in thing.iteritems()]
        return '{%s}' % ', '.join(pairs)
    if hasattr(thing, '__setslice__'):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % (', '.join(parts),)
    if isinstance(thing, basestring):
        return "'%s'" % (str(thing),)
    return str(thing)
print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}
I went on to use json.dumps() as sobolevn suggested in the comment. I didn't think of that one at first since I wasn't really using json in the driver. It turned out however that json.dumps() provided exactly the formats I needed on all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.
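Putting the json.dumps() approach together with the command from the question gives something like the sketch below. It is written for Python 3 rather than the Python 2.7 used above, `build_create_vertex` is a hypothetical helper rather than a pyorient function, and it assumes, as reported in the answer above, that OrientDB accepts the double-quoted form that json.dumps() emits. The class name is interpolated as-is, so it should come from trusted code, not user input.

```python
import json


def build_create_vertex(class_name, data):
    """Build a CREATE VERTEX command with the data serialized as JSON."""
    # json.dumps drops the u'' prefixes and handles arbitrary nesting.
    return "CREATE VERTEX %s SET data = %s" % (class_name, json.dumps(data))


command = build_create_vertex("TestVertex", {"mykey": "myVal"})
nested = build_create_vertex("TestVertex", {"key2": ["val21", "val22", 1]})
```

The resulting string would then be passed to the driver's command call in place of the hand-built one.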

Python grabbing JSON from POST method

I have an Android app that originally posted some strings in JSON format to a Python CGI script, which all worked fine. The problem is that when the JSON object contains lists, Python (using simplejson) still treats each list as one big string when it receives it.
Here is a text dump of the json once it reaches python before I parse it:
{"Prob1":"[1, 2, 3]","Name":"aaa","action":1,"Prob2":"[20, 20, 20]","Tasks":"[1 task, 2 task, 3 task]","Description":""}
If we look at the "Tasks" key, the list after it is clearly a single string, with all the elements treated as one string (i.e. no quotes around each element). It's the same for Prob1 and Prob2. action, Name etc. are all fine. I'm not sure if this is what Python is expecting, but I'm guessing not?
Just in case the Android data was to blame, I added quotes around each element of the ArrayList like this:
Tasks.add('"'+row.get(1).toString()+'"'); instead of Tasks.add(row.get(1).toString());
On the webserver it's now received as
{"Prob1":"[1, 2, 3]","Name":"aaa","action":1,"Prob2":"[20, 20, 20]","Tasks":"[\"1 task\", \"2 task\", \"3 task\"]","Description":""}
but I still get the same problem; when I iterate through "Tasks" in a loop, it loops through each individual character as if the whole thing were a string :/
Since I don't know what the JSON structure should look like before it gets to Python, I'm wondering whether it's a problem with the Android app sending the data or with my Python code interpreting it, though from the looks of that dump I've been guessing it's the sending.
In the Android App I'm sending one big JSONObject containing "Tasks" and the associated arraylist as one of the key value pairs... is this correct? or should JSONArray be involved anywhere?
Thanks for any help everyone, I'm new to the whole JSON thing as well as to Android/Java (And only really a novice at Python too..). I can post additional code if anyone needs it, I just didn't want to lengthen the post too much
EDIT:
when I add
json_data=json_data.replace(r'"[','[')
json_data=json_data.replace(r']"',']')
json_data=json_data.replace(r'\"','"')
to the python it WORKS!!!! but that strikes me as a bit nasty and just papering over a crack..
Tasks is just a big string. To be a valid list, it would have to be ["1 task", "2 task", "3 task"]
Same goes for Prob1 and Prob2. To be a valid list, the brackets should not be enclosed in quotes.
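The difference is easy to see from the Python side. The sketch below uses a shortened copy of the payload from the question (Python 3, standard `json` module rather than simplejson): the string-encoded list parses as a `str`, while a real JSON array parses as a `list`. When the embedded string happens to be valid JSON, as in the quoted variant, a second `json.loads` recovers the list, but fixing the sender is the cleaner option.

```python
import json

# As received in the question: "Tasks" is a JSON string that contains a list.
received = r'{"Name":"aaa","action":1,"Tasks":"[\"1 task\", \"2 task\", \"3 task\"]"}'
doc = json.loads(received)
tasks_is_string = isinstance(doc["Tasks"], str)

# The embedded string is itself valid JSON, so it can be decoded a second time.
tasks = json.loads(doc["Tasks"])

# What the client should send instead: a real JSON array, no surrounding quotes.
expected = '{"Name":"aaa","action":1,"Tasks":["1 task","2 task","3 task"]}'
fixed_doc = json.loads(expected)
```

With the array form, iterating over `fixed_doc["Tasks"]` yields one task per step instead of one character per step.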

json object as get parameter

I'm writing an API for a Mongo database. I need to pass a JSON object as a GET parameter:
example.com/api/obj/list/1/?find={"foo":"bar"}
How should I organize this better?
I thought about using JSON-like objects without quotes and spaces, for example:
{$or:[{a:foo+bar},{b:2}]}
So are there any tools to parse it in Python/Django?
It should be fine as long as the JSON objects aren't too big, they don't contain sensitive data (it sucks to see your password in your browser history) and you URL-escape them.
Unfortunately, you have to take shortcuts if you want to have a human-readable JSON parameter. All JSON brackets ({, }, [, ]) are recommended for escaping. You don't have to escape them, but you are taking a risk if you don't. More annoying is the :, which is ubiquitous in JSON and must be escaped.
If you want human-readable query strings, then the sensible solution is to encode all query parameters explicitly. A compromise that might work quite well is to unpack the top-level JSON object into explicit query parameters, each of which remains JSON-encoded. Going a small step further, you could drop any top-level delimiters that remain, e.g.:
JSON: {"foo":"bar", "items":[1, 2, 3], "staff":{"id":432, "first":"John", "last":"Doe"}}
Query: foo=bar&items=1,2,3&staff="id"%3A432,"first"%3A"John","last"%3A"Doe"
Since you know that foo is a string, items is an array and staff is an object, you can rehydrate the JSON syntax correctly before sending the lot to a JSON parser.
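For the simpler option of passing the whole JSON object in one URL-escaped parameter, the standard library is enough. This is a minimal sketch in Python 3: `build_url` and `parse_find` are hypothetical helper names, and the parameter name `find` is taken from the question. In Django the framework decodes the query string itself, so only the `json.loads` step would be needed on the value of the `find` parameter.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, find):
    """Serialize the filter compactly and URL-escape it as a query parameter."""
    encoded = urlencode({"find": json.dumps(find, separators=(",", ":"))})
    return base + "?" + encoded


def parse_find(url):
    """Recover the filter object from the 'find' query parameter."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["find"][0])


criteria = {"$or": [{"a": "foo bar"}, {"b": 2}]}
url = build_url("http://example.com/api/obj/list/1/", criteria)
round_tripped = parse_find(url)
```

The escaped URL is not pleasant to read, which is the trade-off described above, but it survives the round trip without a custom parser.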
