I am caching some JSON data, and in storage it is represented as a JSON-encoded string. The server performs no work on the JSON before sending it to the client, other than collating multiple cached objects, like this:
import json

def get_cached_items():
    item1 = cache.get(1)
    item2 = cache.get(2)
    return json.dumps(dict(item1=item1, item2=item2, msg="123"))
There may be other items included with the return value, in this case represented by msg="123".
The issue is that the cached items end up double-escaped. Ideally, the library would allow a pass-through of the string without escaping it.
I have looked at the documentation for the default argument of json.dumps, as that seems to be the place to address this, and searched Google/SO, but found no useful results.
It would be unfortunate, from a performance perspective, if I had to decode the JSON of each cached item just to send it to the browser. It would be unfortunate from a complexity perspective not to be able to use json.dumps.
My inclination is to write a class that stores the cached string, and when the default handler encounters an instance of this class, it uses the string without performing escaping. I have yet to figure out how to achieve this, though, and I would be grateful for thoughts and assistance.
EDIT For clarity, here is an example of the proposed default technique:
class RawJSON(object):
    def __init__(self, str):
        self.str = str

class JSONEncoderWithRaw(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, RawJSON):
            return o.str  # but avoid call to `encode_basestring` (or ASCII equiv.)
        return super(JSONEncoderWithRaw, self).default(o)
Here is a degenerate example of the above:
>>> class M():
...     str = ''
>>> m = M()
>>> m.str = json.dumps(dict(x=123))
>>> json.dumps(dict(a=m), default=lambda o: o.str)
'{"a": "{\\"x\\": 123}"}'
The desired output would include the unescaped string m.str, being:
'{"a": {"x": 123}}'
It would be good if the json module did not encode/escape the return value of the default parameter, or if that could be avoided. In the absence of a method via the default parameter, one may have to achieve the objective here by overloading the encode and iterencode methods of JSONEncoder, which brings challenges in terms of complexity, interoperability, and performance.
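For illustration, here is a sketch of one way to overload encode(): have default() emit a unique placeholder, then splice the raw JSON back in afterwards. The RawEncoder name and the placeholder scheme are invented for this example; note that iterencode() (streaming) would bypass the splice.

import json

class RawJSON(object):
    def __init__(self, s):
        self.s = s

class RawEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, RawJSON):
            # emit a unique placeholder; it is spliced out again in encode()
            token = '__RAW_%d__' % id(o)
            self._raw[token] = o.s
            return token
        return super(RawEncoder, self).default(o)

    def encode(self, o):
        self._raw = {}
        out = super(RawEncoder, self).encode(o)
        # replace each quoted placeholder with the raw, pre-encoded JSON
        for token, raw in self._raw.items():
            out = out.replace('"%s"' % token, raw)
        return out

print(RawEncoder().encode({'a': RawJSON('{"x": 123}')}))
# -> {"a": {"x": 123}}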
A quick-n-dirty way is to patch json.encoder.encode_basestring*() functions:
import json

class RawJson(unicode):
    pass

# patch the json.encoder module-level helper functions
for name in ['encode_basestring', 'encode_basestring_ascii']:
    def encode(o, _encode=getattr(json.encoder, name)):
        # pass RawJson instances through untouched; escape everything else
        return o if isinstance(o, RawJson) else _encode(o)
    setattr(json.encoder, name, encode)
print(json.dumps([1, RawJson(u'["abc", 2]'), u'["def", 3]']))
# -> [1, ["abc", 2], "[\"def\", 3]"]
If you are caching JSON strings, you need to first decode them to python structures; there is no way for json.dumps() to distinguish between normal strings and strings that are really JSON-encoded structures:
return json.dumps({'item1': json.loads(item1), 'item2': json.loads(item2), 'msg': "123"})
Unfortunately, there is no option to include already-converted JSON data in this; the default function is expected to return Python values. You extract data from whatever object that is passed in and return a value that can be converted to JSON, not a value that is already JSON itself.
The only other approach I can see is to insert "template" values, then use string replacement techniques to manipulate the JSON output to replace the templates with your actual cached data:
json_data = json.dumps({'item1': '==item1==', 'item2': '==item2==', 'msg': "123"})
return json_data.replace('"==item1=="', item1).replace('"==item2=="', item2)
A third option is to cache item1 and item2 in non-serialized form, as a Python structure instead of a JSON string.
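For example (a sketch, assuming your cache exposes a set() counterpart to the get() used in the question):

import json

cache.set(1, {'x': 123})  # store the parsed dict, not a JSON string

def get_cached_items():
    item1 = cache.get(1)  # already a Python structure
    item2 = cache.get(2)
    # serialize exactly once, on the way out
    return json.dumps(dict(item1=item1, item2=item2, msg="123"))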
You can use the better-maintained simplejson instead of json, which provides this functionality.
import simplejson as json
from simplejson.encoder import RawJSON
print(json.dumps([1, RawJSON(u'["abc", 2]'), u'["def", 3]']))
# -> [1, ["abc", 2], "[\"def\", 3]"]
You get simplicity of code, plus all the C optimisations of simplejson.
Related
I'm trying to dump some Python objects out into YAML.
Currently, regardless of YAML library (pyyaml, oyaml, or ruamel) I'm having an issue where calling .dump(MyObject) gives me correct YAML, but seems to add a lot of metadata about the Python objects that I don't want, in a form that looks like:
!!python/object:MyObject and other similar strings.
I do not need to be able to rebuild the objects from the YAML, so I am fine for this metadata to be removed completely.
Other questions on SO indicate that the common solution to this is to use safe_dump instead of dump.
However, safe_dump does not seem to work for nested objects (or objects at all), as it throws this error:
yaml.representer.RepresenterError: ('cannot represent an object', MyObject)
I see that the common workaround here is to manually specify Representers for the objects that I am trying to dump. My issue here is that my Objects are generated code that I don't have control over. I will also be dumping a variety of different objects.
Bottom line: Is there a way to dump nested objects using .dump, but where the metadata isn't added?
Although the words "correct YAML" are not really accurate, and would be better phrased as "YAML output looking like you want it, except for the tag information", this fortunately gives some information on how you want your YAML to look, as there are an infinite number of ways to dump objects.
If you dump an object using ruamel.yaml:
import sys
import ruamel.yaml
class MyObject:
    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.c = [a, b]
data = dict(x=MyObject(42, -1))
yaml = ruamel.yaml.YAML(typ='unsafe')
yaml.dump(data, sys.stdout)
this gives:
x: !!python/object:__main__.MyObject
  a: 42
  b: -1
  c: [42, -1]
You have a tag !!python/object:__main__.MyObject (yours might differ depending on where the class is defined, etc.), and each attribute of the class is dumped as a key of a mapping.
There are multiple ways on how to get rid of the tag in that dump:
Registering classes
Add a classmethod named to_yaml() to each of your classes and register those classes. You have to do this for each of your classes, but doing so allows you to use the safe-dumper. An example of how to do this can be found in the documentation.
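A minimal sketch of that registration approach, assuming ruamel.yaml's documented to_yaml() convention; here the instance is represented as a plain mapping so no class tag is emitted:

import sys
import ruamel.yaml

class MyObject:
    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.c = [a, b]

    @classmethod
    def to_yaml(cls, representer, node):
        # represent the instance as a plain mapping, so no class tag appears
        return representer.represent_mapping(u'tag:yaml.org,2002:map', node.__dict__)

yaml = ruamel.yaml.YAML(typ='safe')
yaml.register_class(MyObject)
yaml.dump(dict(x=MyObject(42, -1)), sys.stdout)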
Post-process
It is fairly easy to post-process the output and remove the tags, which for objects always occur on the line before the mapping; you can delete everything from !!python to the end of the line:
def strip_python_tags(s):
    result = []
    for line in s.splitlines():
        idx = line.find("!!python/")
        if idx > -1:
            line = line[:idx]
        result.append(line)
    return '\n'.join(result)
yaml.encoding = None
yaml.dump(data, sys.stdout, transform=strip_python_tags)
and that gives:
x:
  a: 42
  b: -1
  c: [42, -1]
As anchors are dumped before the tag, this "stripping from !!python until end-of-line" also works when you dump objects that have multiple references.
Change the dumper
You can also change the unsafe dumper routine for mappings to recognise the tag used for objects and change it to the "normal" tag for a dict/mapping (for which normally no tag is output):
yaml.representer.org_represent_mapping = yaml.representer.represent_mapping

def my_represent_mapping(tag, mapping, flow_style=None):
    if tag.startswith("tag:yaml.org,2002:python/object"):
        tag = u'tag:yaml.org,2002:map'
    return yaml.representer.org_represent_mapping(tag, mapping, flow_style=flow_style)

yaml.representer.represent_mapping = my_represent_mapping
yaml.dump(data, sys.stdout)
and that gives once more:
x:
  a: 42
  b: -1
  c: [42, -1]
These last two methods work for all instances of all Python classes that you define without extra work.
Fast and hacky:
"\n".join([re.sub(r" ?!!python/.*$", "", l) for l in yaml.dump(obj).splitlines()]
"\n".join(...) – concat list to string agin
yaml.dump(obj).splitlines() – create list of lines of yaml
re.sub(r" ?!!python/.*$", "", l) – replace all yaml python tags with empty string
I'm trying to interpret data from the Twitch API with Python. This is my code:
from twitch.api import v3
import json
streams = v3.streams.all(limit=1)
list = json.loads(streams)
print(list)
Then, when running, I get:
TypeError, "the JSON object must be str, not 'dict'"
Any ideas? Also, is this a method in which I would actually want to use data from an API?
Per the documentation, json.loads() will parse a string into a JSON hierarchy (which is often a dict). Therefore, if you don't pass a string to it, it will fail.
json.loads(s, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Deserialize s (a str instance containing a JSON document) to a Python object using this conversion table.

The other arguments have the same meaning as in load(), except encoding, which is ignored and deprecated.

If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised.
From the Twitch API we see that the object being returned by all() is a V3Query. Looking at the source and documentation for that, we see it is meant to return a list. Thus, you should treat that as a list rather than a string that needs to be decoded.
Specifically, the V3Query is a subclass of ApiQuery, in turn a subclass of JsonQuery. That class explicitly runs the query and passes a function over the results, get_json. That source explicitly calls json.loads()... so you don't need to! Remember: never be afraid to dig through the source.
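A quick way to see for yourself what you already have (a sketch that makes no assumptions about the exact payload shape):

from twitch.api import v3

streams = v3.streams.all(limit=1)  # already parsed; no json.loads() needed
print(type(streams))   # inspect what you actually got back
print(streams)         # then index into it as a normal Python structure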
After streams = v3.streams.all(limit=1), try using:
streams = json.dumps(streams)
as streams should then be a JSON string in the form:
'{"key":value}'
instead of just dict form:
{"key":value}
Some background first: I have a few rather simple data structures which are persisted as json files on disk. These json files are shared between applications of different languages and different environments (like web frontend and data manipulation tools).
For each of the files I want to create a Python "POPO" (Plain Old Python Object), and a corresponding data mapper class for each item should implement some simple CRUD-like behavior (e.g. save will serialize the class and store it as a json file on disk).
I think a simple mapper (which only knows about basic types) will work. However, I'm concerned about security. Some of the json files will be generated by a web frontend, so there's a possible security risk if a user feeds me some bad json.
Finally, here is the simple mapping code (found at How to convert JSON data into a Python object):
class User(object):
    def __init__(self, name, username):
        self.name = name
        self.username = username

import json
j = json.loads(your_json)
u = User(**j)
What possible security issues do you see?
NB: I'm new to Python.
Edit: Thanks all for your comments. I've found out that I have one json where I have 2 arrays, each holding a map. Unfortunately, this starts to get cumbersome as I add more of these.
I'm extending the question to mapping a json input to a recordtype. The original code is from here: https://stackoverflow.com/a/15882054/1708349.
Since I need mutable objects, I'd change it to use a namedlist instead of a namedtuple:
import json
from namedlist import namedlist
data = '{"name": "John Smith", "hometown": {"name": "New York", "id": 123}}'
# Parse JSON into an object with attributes corresponding to dict keys.
x = json.loads(data, object_hook=lambda d: namedlist('X', d.keys())(*d.values()))
print x.name, x.hometown.name, x.hometown.id
Is it still safe?
There's not much wrong that can happen in the first case. You're limiting what arguments can be provided and it's easy to add validation/conversion right after loading from JSON.
The second example is a bit worse. Packing things into records like this will not help you in any way. You don't inherit any methods, because each type you define is new. You can't compare values easily, because dicts are not ordered. You don't know if you have all arguments handled, or if there is some extra data, which can lead to hidden problems later.
So in summary: with User(**data), you're pretty safe. With namedlist there's space for ambiguity and you don't really gain anything. (compared to bare, parsed json)
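To make "validation right after loading" concrete, here is a sketch that rejects missing or extra fields before constructing the object; the load_user name is invented for this example, and it reuses the User class from above:

import json

REQUIRED = {'name', 'username'}

def load_user(raw):
    j = json.loads(raw)
    # refuse payloads with missing or unexpected keys
    if set(j) != REQUIRED:
        raise ValueError('unexpected or missing fields: %s' % sorted(set(j) ^ REQUIRED))
    return User(**j)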
If you blindly accept a user's JSON input without a sanity check, you are at risk of becoming a JSON injection victim.
See a detailed explanation of the JSON injection attack here: https://www.acunetix.com/blog/web-security-zone/what-are-json-injections/
Besides the security vulnerability, parsing JSON into a Python object this way is not type-safe.
With your example User class, I would assume you expect both the name and username fields to be strings. What if the json input is like this:
{
    "name": "my name",
    "username": 1
}
j = json.loads(your_json)
u = User(**j)
type(u.username) # int
You end up with an object whose field has an unexpected type.
One way to ensure type safety is to validate the input against a JSON schema. More about JSON Schema: https://json-schema.org/
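For instance, a sketch using the third-party jsonschema package (an assumption; any JSON Schema validator works the same way):

import json
import jsonschema

user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "username": {"type": "string"},
    },
    "required": ["name", "username"],
    "additionalProperties": False,
}

j = json.loads('{"name": "my name", "username": 1}')
jsonschema.validate(j, user_schema)  # raises ValidationError: 1 is not of type 'string'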
I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages. So messages will be sent as strings and then loaded into python dictionaries. This means that the representation, as a python dictionary, afterwards will look something like:
{u"mykey": u"myVal"}
Handling such structures is no problem in itself for the system, but the trouble starts when I make a database query to store the structure.
I'm using pyOrient to talk to OrientDB. The command ends up something like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
To avoid confusion: the first code block shows the object as a dict AFTER being dumped/loaded with json.
Dargolith:
OK, based on your last response, it seems you are simply looking for code that will dump a Python expression in a way that lets you control how unicode and other data types print. Here is a very simple function that provides this control. There are ways to make this function more efficient (for example, by using a string buffer rather than doing all of the recursive string concatenation happening here). Still, this is a very simple function, and as it stands its execution is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control of how each data type prints.
def expr_to_str(thing):
    if hasattr(thing, 'keys'):
        pairs = ['%s:%s' % (expr_to_str(k), expr_to_str(v)) for k, v in thing.iteritems()]
        return '{%s}' % ', '.join(pairs)
    if hasattr(thing, '__setslice__'):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % (', '.join(parts),)
    if isinstance(thing, basestring):
        return "'%s'" % (str(thing),)
    return str(thing)
print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}
I went on to use json.dumps() as sobolevn suggested in the comment. I didn't think of that one at first since I wasn't really using json in the driver. It turned out however that json.dumps() provided exactly the formats I needed on all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.
I'm finding this to be a difficult question to put into words, hence the examples. However, I'm basically being given arbitrary format strings, and I need to go (efficiently) fetch the appropriate values from a database, in order to build a relevant mapping object dynamically.
Given a format string expecting a mapping object, e.g.:
>>> 'Hello, %(first-name)s!' % {'first-name': 'Dolph'}
'Hello, Dolph!'
I'm looking for an implementation of 'infer_field_names()' below:
>>> infer_field_names('Hello, %(first-name)s! You are #%(customer-number)d.')
['first-name', 'customer-number']
I know I could write regex (or even try to parse exception messages!), but I'm hoping there's an existing API call I can use instead..?
Based on the string Formatter docs, I thought this would work:
>>> import string
>>> format_string = 'Hello, %(first-name)s! You are #%(customer-number)d.'
>>> [x[1] for x in string.Formatter().parse(format_string)]
[None]
But that doesn't quite return what I would expect (a list of field_names, per the docs).
When using the % operator to format strings, the right operand doesn't have to be a dictionary -- it only has to be some object mapping the required field names to the values that are supposed to be substituted. So all you need to do is write a class with a redefined __getitem__() that retrieves the values from the database.
Here's a pointless example:
class Mapper(object):
    def __getitem__(self, item):
        return item

print 'Hello, %(first-name)s!' % Mapper()
prints
Hello, first-name!
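Building on that idea, here is a sketch of the requested infer_field_names(): record every key the % operator asks for (FieldCollector is a name invented for this example). As an aside, string.Formatter().parse() only understands {}-style replacement fields, which is why the %-style string above yields [None].

class FieldCollector(object):
    def __init__(self):
        self.fields = []
    def __getitem__(self, key):
        # remember each field name the format string requests
        self.fields.append(key)
        return 0  # a value most conversions (%s, %d, %f) accept

def infer_field_names(format_string):
    collector = FieldCollector()
    format_string % collector  # formatting result is discarded
    return collector.fields

print infer_field_names('Hello, %(first-name)s! You are #%(customer-number)d.')
# -> ['first-name', 'customer-number']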