lxml: enforcing a specific order for attributes

I have an XML writing script that outputs XML for a specific 3rd party tool.
I've used the original XML as a template to make sure that I'm building all the correct elements, but the final XML does not appear like the original.
I write the attributes in the same order, but lxml is writing them in its own order.
I'm not sure, but I suspect that the 3rd party tool expects attributes to appear in a specific order, and I'd like to resolve this issue so I can see whether it's the attribute order that's making it fail, or something else.
Source element:
<FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="text/x-test-signature">
My source script:
sig.fileformat = etree.SubElement(sig.fileformats, "FileFormat", ID = str(db.ID), Name = db.name, PUID="fileSig/{}".format(str(db.ID)), Version = "", MIMEType = "")
My resultant XML:
<FileFormat MIMEType="" PUID="fileSig/19" Version="" Name="Printer Info File" ID="19">
Is there a way of constraining the order they are written?

It looks like lxml serializes attributes in the order you set them:
>>> from lxml import etree as ET
>>> x = ET.Element("x")
>>> x.set('a', '1')
>>> x.set('b', '2')
>>> ET.tostring(x)
'<x a="1" b="2"/>'
>>> y= ET.Element("y")
>>> y.set('b', '2')
>>> y.set('a', '1')
>>> ET.tostring(y)
'<y b="2" a="1"/>'
Note that when you pass attributes using the ET.SubElement() constructor, Python collects the keyword arguments into a dictionary and passes that dictionary to lxml. On older Python versions this loses any ordering you had in the source file, since dictionaries (before Python 3.7) and keyword arguments (before Python 3.6, see PEP 468) were unordered; their order was determined by string hash values, which could differ from platform to platform or, in fact, from execution to execution.

OrderedDict of attributes
As of lxml 3.3.3 (perhaps also in earlier versions) you can pass an OrderedDict of attributes to the lxml.etree.(Sub)Element constructor and the order will be preserved when using lxml.etree.tostring(root):
from collections import OrderedDict

sig.fileformat = etree.SubElement(sig.fileformats, "FileFormat", OrderedDict([("ID",str(db.ID)), ("Name",db.name), ("PUID","fileSig/{}".format(str(db.ID))), ("Version",""), ("MIMEType","")]))
Note that the ElementTree API (xml.etree.ElementTree) historically did not preserve attribute order even if you provided an OrderedDict to the xml.etree.ElementTree.(Sub)Element constructor! (Python 3.8 later changed the stdlib to serialize attributes in the order they were created.)
UPDATE: Also note that using the **extra parameter of the lxml.etree.(Sub)Element constructor for specifying attributes does not preserve attribute order:
>>> from lxml.etree import Element, tostring
>>> from collections import OrderedDict
>>> root = Element("root", OrderedDict([("b","1"),("a","2")])) # attrib parameter
>>> tostring(root)
b'<root b="1" a="2"/>' # preserved
>>> root = Element("root", b="1", a="2") # **extra parameter
>>> tostring(root)
b'<root a="2" b="1"/>' # not preserved

Attribute ordering and readability
As the commenters have mentioned, attribute order has no semantic significance in XML, which is to say it doesn't change the meaning of an element:
<tag attr1="val1" attr2="val2"/>
<!-- means the same thing as: -->
<tag attr2="val2" attr1="val1"/>
There is an analogous characteristic in SQL, where column order doesn't change
the meaning of a table definition. XML attributes and SQL columns are a set
(not an ordered set), and so all that can "officially" be said about either
one of those is whether the attribute or column is present in the set.
That said, order definitely makes a difference to human readability. In situations where constructs like this are authored and read as text (e.g. source code), a careful ordering makes a lot of sense to me.
Typical parser behavior
Any XML parser that treated attribute order as significant would be out of compliance with the XML standard. That doesn't mean it can't happen, but in my experience it is certainly unusual. Still, depending on the provenance of the tool you mention, it's a possibility that may be worth testing.
As far as I know, lxml has no mechanism for specifying the order attributes appear in serialized XML, and I would be surprised if it did.
In order to test the behavior I'd be strongly inclined to just write a text-based template to generate enough XML to test it out:
id = 1
name = 'Development Signature'
puid = 'dev/1'
version = '1.0'
mimetype = 'text/x-test-signature'
template = ('<FileFormat ID="%d" Name="%s" PUID="%s" Version="%s" '
            'MIMEType="%s">')
xml = template % (id, name, puid, version, mimetype)

I have seen order matter where the consumer of the XML is expecting canonicalized XML. Canonical XML specifies that the attributes be sorted:
in increasing lexicographic order with namespace URI as the primary
key and local name as the secondary key (an empty namespace URI is
lexicographically least). (section 2.6 of https://www.w3.org/TR/xml-c14n2/)
So if your application is expecting the kind of order you would get out of canonical XML, lxml does support output in canonical form via the method= argument to tostring() (see the heading C14N at https://lxml.de/api.html).
For example:
from lxml import etree as ET
element = ET.Element('Test', B='beta', Z='omega', A='alpha')
val = ET.tostring(element, method="c14n")
print(val)
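For reference, running that snippet should print the attributes in sorted order; canonical form also expands empty elements into explicit start/end tag pairs, so the output looks like:
b'<Test A="alpha" B="beta" Z="omega"></Test>'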

lxml uses libxml2 under the hood. It preserves attribute order, which means for an individual element you can sort them like this:
x = etree.XML('<x a="1" b="2" d="4" c="3"><y></y></x>')
sorted_attrs = sorted(x.attrib.items())
x.attrib.clear()
x.attrib.update(sorted_attrs)
Not very helpful if you want them all sorted though. If you want them all sorted you can use the c14n2 output method (XML Canonicalisation Version 2):
>>> x = etree.XML('<x a="1" b="2" d="4" c="3"><y></y></x>')
>>> etree.tostring(x, method="c14n2")
b'<x a="1" b="2" c="3" d="4"><y></y></x>'
That will sort the attributes. Unfortunately it has the downside of ignoring pretty_print, which isn't great if you want human-readable XML.
If you use c14n2 then lxml will use custom Python serialisation code to write the XML, which itself calls sorted(x.attrib.items()) for all attributes. If you don't, then it will instead call into libxml2's xmlNodeDumpOutput() function, which doesn't support sorting attributes but does support pretty-printing.
Therefore the only solution is to manually walk the XML tree and sort all the attributes, like this:
from lxml import etree
x = etree.XML('<x a="1" b="2" d="4" c="3"><y z="1" a="2"><!--comment--></y></x>')
for el in x.iter(etree.Element):
    sorted_attrs = sorted(el.attrib.items())
    el.attrib.clear()
    el.attrib.update(sorted_attrs)
etree.tostring(x, pretty_print=True)
# b'<x a="1" b="2" c="3" d="4">\n <y a="2" z="1">\n <!--comment-->\n </y>\n</x>\n'

You can wrap each string in a class that compares using a fixed index (so sorting keeps whatever order you assign) but prints and behaves like the underlying string.
Here is an example:
class S:
    def __init__(self, _idx, _obj):
        self._obj = (_idx, _obj)

    def get_idx(self):
        return self._obj[0]

    def __le__(self, other):
        return self._obj[0] <= other.get_idx()

    def __lt__(self, other):
        return self._obj[0] < other.get_idx()

    def __str__(self):
        return self._obj[1].__str__()

    def __repr__(self):
        return self._obj[1].__repr__()

    def __eq__(self, other):
        if isinstance(other, str):
            return self._obj[1] == other
        elif isinstance(other, S):
            return self._obj[0] == other.get_idx() and self.__str__() == other.__str__()
        else:
            return self._obj[0] == other.get_idx() and self._obj[1] == other

    def __add__(self, other):
        return self._obj[1] + other

    def __hash__(self):
        return self._obj[1].__hash__()

    def __getitem__(self, item):
        return self._obj[1].__getitem__(item)

    def __radd__(self, other):
        return other + self._obj[1]
list_sortable = ['c', 'b', 'a']
list_not_sortable = [S(0, 'c'), S(0, 'b'), S(0, 'a')]
print("list_sortable ---- Before sort ----")
for ele in list_sortable:
print(ele)
print("list_not_sortable ---- Before sort ----")
for ele in list_not_sortable:
print(ele)
list_sortable.sort()
list_not_sortable.sort()
print("list_sortable ---- After sort ----")
for ele in list_sortable:
print(ele)
print("list_not_sortable ---- After sort ----")
for ele in list_not_sortable:
print(ele)
running result:
list_sortable ---- Before sort ----
c
b
a
list_not_sortable ---- Before sort ----
c
b
a
list_sortable ---- After sort ----
a
b
c
list_not_sortable ---- After sort ----
c
b
a

Related

How can I replace OrderedDict with dict in a Python AST before literal_eval?

I have a string with Python code in it that I could evaluate as Python with literal_eval if it only had instances of OrderedDict replaced with {}.
I am trying to use ast.parse and ast.NodeTransformer to do the replacement, but when I catch the node with nodetype == 'Name' and node.id == 'OrderedDict', I can't find the list that is the argument in the node object, so I can't replace it with a Dict node.
Is this even the right approach?
Some code:
from ast import NodeTransformer, parse

py_str = "[OrderedDict([('a', 1)])]"

class Transformer(NodeTransformer):
    def generic_visit(self, node):
        nodetype = type(node).__name__
        if nodetype == 'Name' and node.id == 'OrderedDict':
            pass  # ???
        return NodeTransformer.generic_visit(self, node)

t = Transformer()
tree = parse(py_str)
t.visit(tree)
The idea is to replace all OrderedDict nodes, represented as ast.Call having specific attributes (which can be seen from ordered_dict_conditions below), with ast.Dict nodes whose key / value arguments are extracted from the ast.Call arguments.
import ast

class Transformer(ast.NodeTransformer):
    def generic_visit(self, node):
        # Need to call super() in any case to visit child nodes of the current one.
        super().generic_visit(node)
        ordered_dict_conditions = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == 'OrderedDict'
            and len(node.args) == 1
            and isinstance(node.args[0], ast.List)
        )
        if ordered_dict_conditions:
            return ast.Dict(
                [x.elts[0] for x in node.args[0].elts],
                [x.elts[1] for x in node.args[0].elts]
            )
        return node

def transform_eval(py_str):
    return ast.literal_eval(Transformer().visit(ast.parse(py_str, mode='eval')).body)

print(transform_eval("[OrderedDict([('a', 1)]), {'k': 'v'}]"))  # [{'a': 1}, {'k': 'v'}]
print(transform_eval("OrderedDict([('a', OrderedDict([('b', 1)]))])"))  # {'a': {'b': 1}}
Notes
Because we want to replace the innermost node first, we place a call to super() at the beginning of the function.
Whenever an OrderedDict node is encountered, the following things are used:
node.args is a list containing the arguments to the OrderedDict(...) call.
This call has a single argument: a list containing key-value pairs as tuples. It is accessible via node.args[0] (an ast.List), and node.args[0].elts holds the tuples that list wraps.
So node.args[0].elts[i] are the different ast.Tuples (for i in range(len(node.args[0].elts))), whose elements are in turn accessible via their own .elts attribute.
Finally, node.args[0].elts[i].elts[0] are the keys and node.args[0].elts[i].elts[1] are the values used in the OrderedDict call.
The latter keys and values are then used to create a fresh ast.Dict instance which is then used to replace the current node (which was ast.Call).
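If it helps to see the structure these notes describe, you can dump the parsed tree yourself (output abbreviated and reformatted here; the exact fields vary a little between Python versions):
import ast

tree = ast.parse("OrderedDict([('a', 1)])", mode='eval')
print(ast.dump(tree.body))
# Call(func=Name(id='OrderedDict', ctx=Load()),
#      args=[List(elts=[Tuple(elts=[Constant(value='a'), Constant(value=1)],
#                              ctx=Load())],
#                 ctx=Load())],
#      keywords=[])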
You could use the ast.NodeVisitor class to observe the OrderedDict tree in order to build the {} tree manually from the encountered nodes, using the parsed nodes from an empty dict as a basis.
import ast

class Builder(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self._tree = ast.parse('[{}]')
        self._list_node = self._tree.body[0].value
        self._dict_node = self._list_node.elts[0]
        self._new_item = False

    def visit_Tuple(self, node):
        self._new_item = True
        self.generic_visit(node)

    def visit_Str(self, node):
        if self._new_item:
            self._dict_node.keys.append(node)
        self.generic_visit(node)

    def visit_Num(self, node):
        if self._new_item:
            self._dict_node.values.append(node)
            self._new_item = False
        self.generic_visit(node)

    def literal_eval(self):
        return ast.literal_eval(self._list_node)

builder = Builder()
builder.visit(ast.parse("[OrderedDict([('a', 1)])]"))
print(builder.literal_eval())
Note that this only works for the simple structure of your example which uses str as keys and int as values. However extensions to more complex structures should be possible in a similar fashion.
Instead of using ast for parsing and transforming the expression you could also use a regular expression for doing that. For example:
>>> re.sub(
... r"OrderedDict\(\[((\(('[a-z]+'), (\d+)\)),?\s*)+\]\)",
... r'{\3: \4}',
... "[OrderedDict([('a', 1)])]"
... )
"[{'a': 1}]"
The above expression is based on the example string of the OP and considers single quoted strings as keys and positive integers as values, but of course it can be extended to more complex cases.

Convert python objects to python AST-nodes

I have a need to dump the modified Python object back into source, so I am trying to find something that converts a real Python object into an ast.Node (to use later with the astor lib to dump source).
Example of usage I want, Python 2:
import ast
import importlib
import astor
m = importlib.import_module('something')
# modify an object
m.VAR.append(123)
ast_nodes = some_magic(m)
source = astor.dump(ast_nodes)
Please help me to find that some_magic
There's no way to do what you want, because that's not how ASTs work.
When the interpreter runs your code, it will generate an AST out of the source files, and interpret that AST to generate python objects.
What happens to those objects once they've been generated has nothing to do with the AST.
It is however possible to get the AST of what generated the object in the first place.
The module inspect lets you get the source code of some python objects:
import ast
import importlib
import inspect
m = importlib.import_module('pprint')
s = inspect.getsource(m)
a = ast.parse(s)
print(ast.dump(a))
# Prints the AST of the pprint module
But getsource() is aptly named.
If I were to change the value of some variable (or any other object) in m, it wouldn't change its source code.
Even if it was possible to regenerate an AST out of an object, there wouldn't be a single solution some_magic() could return.
Imagine I have a variable x in some module, that I reassign in another module:
# In some_module.py
x = 0
# In __main__.py
m = importlib.import_module('some_module')
m.x = 1 + 227
Now, the value of m.x is 228, but there's no way to know what kind of expression led to that value (well, without reading the AST of __main__.py but this would quickly get out of hand). Was it a mere literal? The result of a function call?
If you really have to get a new AST after modifying some value of a module, the best solution would be to transform the original AST by yourself.
You can find where your identifier got its value, and replace the value of the assignment with whatever you want.
For instance, in my small example x = 0 is represented by the following AST:
Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=0))
And to get the AST matching the reassignment I did in __main__.py, I would have to change the value of the above Assign node as the following:
value=BinOp(left=Num(n=1), op=Add(), right=Num(n=227))
If you'd like to go that way, I recommend you check Python's documentation of the AST node transformer (ast.NodeTransformer), as well as this excellent manual that documents all the nodes you can meet in Python ASTs Green Tree Snakes - the missing Python AST docs.
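As a minimal sketch of that transformation (the class name ReassignX is just illustrative, and ast.unparse assumes Python 3.9+; on older versions astor can fill the same role):
import ast

class ReassignX(ast.NodeTransformer):
    def visit_Assign(self, node):
        # Replace the value of any `x = ...` assignment with `1 + 227`
        if any(isinstance(t, ast.Name) and t.id == 'x' for t in node.targets):
            node.value = ast.BinOp(left=ast.Constant(1), op=ast.Add(),
                                   right=ast.Constant(227))
        return node

tree = ast.fix_missing_locations(ReassignX().visit(ast.parse("x = 0")))
print(ast.unparse(tree))  # x = 1 + 227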
What Vladimir is asking about is certainly useful for compiler optimizations. Indeed, there are ways to accomplish that using the ast library. Here is a simple example demonstrating evaluation of constant functions:
from ast import *
import numpy as np

PURE_FUNS = {'arange': np.arange}

PROG = '''
A=arange(5)
B=[0, 1, 2, 3, 4]
A[2:3] = 1
C = [A[1], 2, m]
'''

def py_to_ast(o):
    if type(o) == np.ndarray:
        return List(elts=[py_to_ast(e) for e in o], ctx=Load())
    elif type(o) == np.int64:
        return Constant(value=o)
    # Add elifs for more types here
    else:
        assert False

class EvalPureFuns(NodeTransformer):
    def visit_Call(self, node):
        is_const_args = all(type(a) == Constant for a in node.args)
        if node.func.id in PURE_FUNS and is_const_args:
            res = eval(unparse(node), PURE_FUNS)
            return py_to_ast(res)
        return node

node = parse(PROG)
node = EvalPureFuns().visit(node)
print(unparse(node))
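Running this under Python 3.9+ (where ast.unparse is available) should fold the constant arange(5) call into a literal list, printing something like:
A = [0, 1, 2, 3, 4]
B = [0, 1, 2, 3, 4]
A[2:3] = 1
C = [A[1], 2, m]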

Adding comments to YAML produced with PyYaml

I'm creating Yaml documents from my own python objects using PyYaml.
for example my object:
class MyObj(object):
name = "boby"
age = 34
becomes:
boby:
  age: 34
So far so good.
But I have not found a way to programmatically add comments to the produced yaml so it will look like:
boby: # this is the name
  age: 34 # in years
Looking at PyYaml documentation and also at the code, I found no way of doing so.
Any suggestions?
You probably have some representer for the MyObj class, as by default dumping (print(yaml.dump(MyObj()))) with PyYAML will give you:
!!python/object:__main__.MyObj {}
PyYAML can only do one thing with the comments in your desired output: discard them. If you were to read that desired output back in, you would end up with a dict containing a dict ({'boby': {'age': 34}}); you would not get a MyObj() instance, because there is no tag information.
The enhanced version for PyYAML that I developed (ruamel.yaml) can read in YAML with comments, preserve the comments and write comments when dumping.
If you read your desired output, the resulting data will look (and act) like a dict containing a dict, but in reality there is more complex data structure that can handle the comments. You can however create that structure when ruamel.yaml asks you to dump an instance of MyObj and if you add the comments at that time, you will get your desired output.
from __future__ import print_function

import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap

class MyObj():
    name = "boby"
    age = 34

    def convert_to_yaml_struct(self):
        x = CommentedMap()
        a = CommentedMap()
        x[self.name] = a
        x.yaml_add_eol_comment('this is the name', 'boby', 11)
        a['age'] = self.age
        a.yaml_add_eol_comment('in years', 'age', 11)
        return x

    @staticmethod
    def yaml_representer(dumper, data, flow_style=False):
        assert isinstance(dumper, ruamel.yaml.RoundTripDumper)
        return dumper.represent_dict(data.convert_to_yaml_struct())

ruamel.yaml.RoundTripDumper.add_representer(MyObj, MyObj.yaml_representer)

ruamel.yaml.round_trip_dump(MyObj(), sys.stdout)
Which prints:
boby:      # this is the name
  age: 34  # in years
There is no need to wait with creating the CommentedMap instances until you want to represent the MyObj instance. I would e.g. make name and age into properties that get/set values from/on the appropriate CommentedMap. That way you could more easily add the comments before the yaml_representer static method is called to represent the MyObj instance.
Here is a solution I came up with; it's a bit complex but less complex than ruamel, as it works entirely with the plain PyYAML API, and does not round trip comments (so it would not be an appropriate answer to this other question). It's probably not as robust overall yet, as I have not tested extensively, but it seems good-enough for my use case, which is that I want dicts/mappings to be able to have comments, both for the entire mapping, as well as per-item comments.
I believe that round-tripping comments--in this limited context--would also be possible with a similar approach, but I have not tried it, as it's not currently a use-case I have.
Finally, while this solution does not implement adding per-item comment to items in lists/sequences (as this is not something I need at the moment) it could easily be extended to do so.
First, as in ruamel, we need a sort of CommentedMapping class, which associates comments with each key in a Mapping. There are many possible approaches to this; mine is just one:
from collections.abc import Mapping, MutableMapping

class CommentedMapping(MutableMapping):
    def __init__(self, d, comment=None, comments={}):
        self.mapping = d
        self.comment = comment
        self.comments = comments

    def get_comment(self, *path):
        if not path:
            return self.comment

        # Look the key up in self (recursively) and raise a
        # KeyError or other exception if such a key does not
        # exist in the nested structure
        sub = self.mapping
        for p in path:
            if isinstance(sub, CommentedMapping):
                # Subvert comment copying
                sub = sub.mapping[p]
            else:
                sub = sub[p]

        comment = None
        if len(path) == 1:
            comment = self.comments.get(path[0])
        if comment is None:
            comment = self.comments.get(path)
        return comment

    def __getitem__(self, item):
        val = self.mapping[item]
        if (isinstance(val, (dict, Mapping)) and
                not isinstance(val, CommentedMapping)):
            comment = self.get_comment(item)
            comments = {k[1:]: v for k, v in self.comments.items()
                        if isinstance(k, tuple) and len(k) > 1 and k[0] == item}
            val = self.__class__(val, comment=comment, comments=comments)
        return val

    def __setitem__(self, item, value):
        self.mapping[item] = value

    def __delitem__(self, item):
        del self.mapping[item]
        for k in list(self.comments):
            if k == item or (isinstance(k, tuple) and k and k[0] == item):
                del self.comments[k]

    def __iter__(self):
        return iter(self.mapping)

    def __len__(self):
        return len(self.mapping)

    def __repr__(self):
        return f'{type(self).__name__}({self.mapping}, comment={self.comment!r}, comments={self.comments})'
This class has both a .comment attribute, so that it can carry an overall comment for the mapping, and a .comments attribute containing per-key comments. It also allows adding comments for keys in nested dicts, by specifying the key path as a tuple. E.g. comments={('c', 'd'): 'comment'} allows specifying a comment for the key 'd' in the nested dict at 'c'. When getting items from CommentedMapping, if the item's value is a dict/Mapping, it is also wrapped in a CommentedMapping in such a way that preserves its comments. This is useful for recursive calls into the YAML representer for nested structures.
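As a quick illustration of the comment lookup (a sketch against the class above, not part of the original code):
cm = CommentedMapping(
    {'a': 1, 'c': {'d': 3}},
    comment='top-level comment',
    comments={'a': 'a comment', ('c', 'd'): 'd comment'})
cm.get_comment()          # 'top-level comment'
cm.get_comment('a')       # 'a comment'
cm['c'].get_comment('d')  # 'd comment' (the nested dict is re-wrapped on access)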
Next we need to implement a custom YAML Dumper, which takes care of the full process of serializing an object to YAML. A Dumper is a complicated class composed from four other classes: an Emitter, a Serializer, a Representer, and a Resolver. Of these we only have to implement the first three; Resolvers are more concerned with, e.g., how implicit scalars like 1 get resolved to the correct type, as well as determining the default tags for various values, so they're not really involved here.
First we implement the representer. The representer is responsible for recognizing different Python types and mapping them to their appropriate nodes in the native YAML data structure/representation graph. Namely, these include scalar nodes, sequence nodes, and mapping nodes. For example, the base Representer class includes a representer for Python dicts which converts them to a MappingNode (each item in the dict in turn consists of a pair of ScalarNodes, one for each key and one for each value).
In order to attach comments to entire mappings, as well as to each key in a mapping, we introduce two new Node types which are not formally part of the YAML specification:
from yaml.node import Node, ScalarNode, MappingNode

class CommentedNode(Node):
    """Dummy base class for all nodes with attached comments."""

class CommentedScalarNode(ScalarNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None, style=None,
                 comment=None):
        super().__init__(tag, value, start_mark, end_mark, style)
        self.comment = comment

class CommentedMappingNode(MappingNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None,
                 flow_style=None, comment=None, comments={}):
        super().__init__(tag, value, start_mark, end_mark, flow_style)
        self.comment = comment
        self.comments = comments
We then add a CommentedRepresenter which includes code for representing a CommentedMapping as a CommentedMappingNode. In fact, it just reuses the base class's code for representing a mapping, but converts the returned MappingNode to a CommentedMappingNode. It also converts each key from a ScalarNode to a CommentedScalarNode. We base it on SafeRepresenter here since I don't need serialization of arbitrary Python objects:
from yaml.representer import SafeRepresenter

class CommentedRepresenter(SafeRepresenter):
    def represent_commented_mapping(self, data):
        node = super().represent_dict(data)
        comments = {k: data.get_comment(k) for k in data}
        value = []
        for k, v in node.value:
            if k.value in comments:
                k = CommentedScalarNode(
                    k.tag, k.value,
                    k.start_mark, k.end_mark, k.style,
                    comment=comments[k.value])
            value.append((k, v))
        node = CommentedMappingNode(
            node.tag,
            value,
            # commented dicts must be in block style; this could be
            # implemented differently for flow-style maps, but for my case I
            # only want block style, and it makes things much simpler
            flow_style=False,
            comment=data.get_comment(),
            comments=comments
        )
        return node

    yaml_representers = SafeRepresenter.yaml_representers.copy()
    yaml_representers[CommentedMapping] = represent_commented_mapping
Next we need to implement a subclass of Serializer. The serializer is responsible for walking the representation graph of nodes and, for each node, outputting one or more events to the emitter, which is a complicated (and sometimes difficult to follow) state machine that receives a stream of events and outputs the appropriate YAML markup for each event (e.g. there is a MappingStartEvent which, when received, will output a { if it's a flow-style mapping, and/or add the appropriate level of indentation for subsequent output, up to the corresponding MappingEndEvent).
Point being, the new serializer must output events representing comments, so that the emitter can know when it needs to emit a comment. This is handled simply by adding a CommentEvent and emitting one every time a CommentedMappingNode or CommentedScalarNode is encountered in the representation:
from yaml import Event
from yaml.serializer import Serializer

class CommentEvent(Event):
    """
    Simple stream event representing a comment to be output to the stream.
    """
    def __init__(self, value, start_mark=None, end_mark=None):
        super().__init__(start_mark, end_mark)
        self.value = value

class CommentedSerializer(Serializer):
    def serialize_node(self, node, parent, index):
        if (node not in self.serialized_nodes and
                isinstance(node, CommentedNode) and
                not (isinstance(node, CommentedMappingNode) and
                     isinstance(parent, CommentedMappingNode))):
            # Emit CommentEvents, but only if the current node is not a
            # CommentedMappingNode nested in another CommentedMappingNode (in
            # which case we would have already emitted its comment via the
            # parent mapping)
            self.emit(CommentEvent(node.comment))
        super().serialize_node(node, parent, index)
Next, the Emitter needs to be subclassed to handle CommentEvents. This is perhaps the trickiest part, since, as I wrote above, the emitter is a bit complex and fragile, and written in such a way that it's difficult to modify the state machine (I am tempted to rewrite it more clearly, but don't have time right now). So I experimented with a number of different solutions.
The key method here is Emitter.emit, which processes the event stream and calls "state" methods that perform some action depending on what state the machine is in, which is in turn affected by what events appear in the stream. An important realization is that the stream processing is suspended in many cases while waiting for more events to come in--this is what the Emitter.need_more_events method is responsible for. In some cases, before the current event can be handled, more events need to come in first. For example, in the case of MappingStartEvent at least 3 more events need to be buffered on the stream: the first key/value pair, and possibly the next key. The Emitter needs to know, before it can begin formatting a map, if there are one or more items in the map, and possibly also the length of the first key/value pair. The number of events required before the current event can be handled is hard-coded in the need_more_events method.
The problem is that this does not account for the now-possible presence of CommentEvents on the event stream, which should not impact processing of other events. Therefore the Emitter.need_events method is overridden to account for the presence of CommentEvents. E.g. if the current event is MappingStartEvent, and there are 3 subsequent events buffered, if one of those is a CommentEvent we can't count it, so we'll need at minimum 4 events (in case the next one is one of the expected events in a mapping).
Finally, every time a CommentEvent is encountered on the stream, we forcibly break out of the current event processing loop to handle writing the comment, then pop the CommentEvent off the stream and continue as if nothing happened. This is the end result:
import textwrap
from yaml.emitter import Emitter

class CommentedEmitter(Emitter):
    def need_more_events(self):
        if self.events and isinstance(self.events[0], CommentEvent):
            # If the next event is a comment, always break out of the event
            # handling loop so that we divert it for comment handling
            return True
        return super().need_more_events()

    def need_events(self, count):
        # Hack-y: the minimal number of queued events needed to start
        # a block-level event is hard-coded, and does not account for
        # possible comment events, so here we increase the necessary
        # count for every comment event
        comments = [e for e in self.events if isinstance(e, CommentEvent)]
        return super().need_events(count + min(count, len(comments)))

    def emit(self, event):
        if self.events and isinstance(self.events[0], CommentEvent):
            # Write the comment, then pop it off the event stream and continue
            # as normal
            self.write_comment(self.events[0].value)
            self.events.pop(0)
        super().emit(event)

    def write_comment(self, comment):
        indent = self.indent or 0
        width = self.best_width - indent - 2  # 2 for the comment prefix '# '
        lines = ['# ' + line for line in textwrap.wrap(comment, width)]
        for line in lines:
            if self.encoding:
                line = line.encode(self.encoding)
            self.write_indent()
            self.stream.write(line)
            self.write_line_break()
I also experimented with different approaches to the implementation of write_comment. The Emitter base class has its own method (write_plain) which can handle writing text to the stream with appropriate indentation and line-wrapping. However, it's not quite flexible enough to handle something like comments, where each line needs to be prefixed with something like '# '. One technique I tried was monkey-patching the write_indent method to handle this case, but in the end it was too ugly. I found that simply using Python's built-in textwrap.wrap was sufficient for my case.
Next, we create the dumper by subclassing the existing SafeDumper but inserting our new classes into the MRO:
from yaml import SafeDumper

class CommentedDumper(CommentedEmitter, CommentedSerializer,
                      CommentedRepresenter, SafeDumper):
    """
    Extension of `yaml.SafeDumper` that supports writing `CommentedMapping`s
    with all comments output as YAML comments.
    """
Here's an example usage:
>>> import yaml
>>> d = CommentedMapping({
... 'a': 1,
... 'b': 2,
... 'c': {'d': 3},
... }, comment='my commented dict', comments={
... 'a': 'a comment',
... 'b': 'b comment',
... 'c': 'long string ' * 44,
... ('c', 'd'): 'd comment'
... })
>>> print(yaml.dump(d, Dumper=CommentedDumper))
# my commented dict
# a comment
a: 1
# b comment
b: 2
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string
c:
  # d comment
  d: 3
I still haven't tested this solution very extensively, and it likely still contains bugs. I'll update it as I use it more and find corner-cases, etc.

What are the arguments of ElementTree.SubElement used for?

I have looked at the documentation here:
http://docs.python.org/dev/library/xml.etree.elementtree.html#xml.etree.ElementTree.SubElement
The parent and tag arguments seem clear enough, but what format do I put the attribute names and values in? I couldn't find any previous example. What format is the **extra argument?
I receive an error when trying to call SubElement itself, saying that it is not defined. Thank you.
SubElement is a function of the ElementTree module (not a method of Element) which allows you to create child objects for an Element.
attrib takes a dictionary containing the attributes
of the element you want to create.
**extra is used for additional keyword arguments, those will be added as attributes to the Element.
Example:
>>> import xml.etree.ElementTree as ET
>>>
>>> parent = ET.Element("parent")
>>>
>>> myattributes = {"size": "small", "gender": "unknown"}
>>> child = ET.SubElement(parent, "child", attrib=myattributes, age="10" )
>>>
>>> ET.dump(parent)
<parent><child age="10" gender="unknown" size="small" /></parent>
>>>
If you look further down on the same page you linked to, where it deals with class xml.etree.ElementTree.Element(tag, attrib={}, **extra), it tells you how the extra arguments work, e.g.:
from xml.etree import ElementTree as ET

# any extra keyword becomes an attribute on the element
# (note that 'tag' itself can't be reused as a keyword name)
a = ET.Element('root-node', label='This is an extra that sets an attribute')
b = ET.SubElement(a, 'nested-node-1')
c = ET.SubElement(a, 'nested-node-2')
d = ET.SubElement(c, 'innermost-node')
ET.dump(a)
This also shows how SubElement works: you simply tell it which element (which can itself be a subelement) you want to attach the new element to. For the future, supply some code too so it's easier to see what you're doing/want.

Hashing a dictionary?

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
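For example, a sketch of that approach with the defaults pinned down (stable_hash is just an illustrative name):
import hashlib
import json

def stable_hash(d):
    # sort_keys makes key order irrelevant; separators and ensure_ascii
    # pin serialization details that might otherwise change under us
    payload = json.dumps(d, sort_keys=True, separators=(',', ':'),
                         ensure_ascii=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

stable_hash({'a': 1, 'b': {'y': 2}}) == stable_hash({'b': {'y': 2}, 'a': 1})  # True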
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
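For instance, insertion order doesn't affect the result (within a single interpreter session):
>>> hash(frozenset({'a': 1, 'b': 2}.items())) == hash(frozenset({'b': 2, 'a': 1}.items()))
True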
UPDATE: Please see the comments below, why this approach might not produce a stable result.
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

        make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work under Python 2, I do know that
func.__code__
should be replaced with
func.func_code
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1, b=2, c=[3, 4, 5], d={6, 7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())

    if isinstance(o, list):
        return tuple([freeze(v) for v in o])

    return o

def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
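For example, two equal dicts built in different orders hash the same (within one interpreter session, since hash() is salted across runs):
make_hash({'a': [1, 2], 'b': {'c': 3}}) == make_hash({'b': {'c': 3}, 'a': [1, 2]})  # True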
MD5 HASH
The method which resulted in the most stable results for me was using md5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
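Usage; unlike hash(), the digest is stable across runs and machines, because MD5 is not salted:
>>> dict_hash({'a': 1, 'b': 2}) == dict_hash({'b': 2, 'a': 1})
True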
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor

class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python truncates the result of __hash__ down to a machine-sized value.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
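That is enough to use dicts as set members or dict keys, e.g.:
>>> s = {hashable({'a': 1}), hashable({'a': 1}), hashable({'b': 2})}
>>> len(s)
2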
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')
    ).hexdigest()
    return result
Use DeepHash from DeepDiff Module
from deepdiff import DeepHash
obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer this quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like DateTime and more that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
If you want to support more types, use functools.singledispatch (Python 3.7+):
import functools

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
One can use the (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys at the top-level dict, you can use pickle (protocol 5) and hash the resulting bytes object. If you need safety, you can use a safe serializer.
I do it like this:
hash(str(my_dict))
