Best way to represent this using Python data structures - python

<specification>
  <propertyName> string </propertyName>
  <value>
    <number>
      <value> anyNumberHere </value>
    </number>
    <text>
      <value> anyTextHere </value>
    </text>
    <URL>
      <value> anyURIHere </value>
    </URL>
  </value>
  <!-- ... 1 or more value nodes here ... -->
</specification>
<!-- ... 1 or more specification nodes here ... -->
I need to construct this request for an API, where the user will pass these values. So, what is the best way to represent this, so that it is easy for the user of the method to pass the respective values to the operation?
I am thinking:
List of List of Dicts:
[specification1, specification2, specification3]
Where:
specification1= [value1, value2]
Where:
value1 = {number:anyNumberHere, text:anyTextHere, URL:anyURIHere}
value2 = {number:anyNumberHere, text:anyTextHere, URL:anyURIHere}
But I am not able to accommodate <propertyName> here. Any suggestions?
Moreover, it sounds way too complicated. Can we have object encapsulation like we do in Java? I understand we can, but I am curious: what is the recommended way in Python?
My logic for now, suggestions welcome (incorrect because of propertyName):
# specification is a List of List of Dicts
for spec in specification:
    specification_elem = etree.SubElement(root, "specification")
    propertyName_elem = etree.SubElement(specification_elem, "propertyName")
    propertyName_elem.text = spec_propertyName
    for value in spec:
        value_elem = etree.SubElement(specification_elem, "value")
        for key in value:
            key_elem = etree.SubElement(value_elem, key)
            keyValue_elem = etree.SubElement(key_elem, "value")
            keyValue_elem.text = value[key]
Here, I will pass spec_propertyName as a different parameter. So the user will pass: specification and spec_propertyName.
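One way to accommodate propertyName without passing it as a separate parameter (a sketch with illustrative names and values, not from the original post) is to make each specification a dict that carries its own property name alongside its values:

import xml.etree.ElementTree as etree

# hypothetical input: each specification is a dict with a name and a list of value dicts
specifications = [
    {"propertyName": "Spec1",
     "values": [{"number": "42", "text": "hello", "URL": "http://example.com"},
                {"text": "world", "URL": "http://example.org"}]},
]

root = etree.Element("specifications")
for spec in specifications:
    specification_elem = etree.SubElement(root, "specification")
    propertyName_elem = etree.SubElement(specification_elem, "propertyName")
    propertyName_elem.text = spec["propertyName"]
    for value in spec["values"]:
        value_elem = etree.SubElement(specification_elem, "value")
        for key, val in value.items():
            key_elem = etree.SubElement(value_elem, key)
            keyValue_elem = etree.SubElement(key_elem, "value")
            keyValue_elem.text = val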

How about a list of Specification objects, each of which has a property_name and a list of value dictionaries? (The values could be a list of objects instead of dictionaries.)
For example:
class Value(object):
    def __init__(self,
                 number=None,
                 text=None,
                 url=None):
        self.number = number
        self.text = text
        self.url = url

class Specification(object):
    def __init__(self, propertyName):
        self.propertyName = propertyName
        self.values = []

spec1 = Specification("Spec1")
spec1.values = [Value(number=3,
                      url="http://www.google.com"),
                Value(text="Hello, World!",
                      url="http://www.jasonfruit.com")]

spec2 = Specification("Spec2")
spec2.values = [Value(number=27,
                      text="I can haz cheezburger?",
                      url="http://stackoverflow.com"),
                Value(text="Running out of ideas.",
                      url="http://news.google.com")]
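A hedged sketch of how such Specification/Value objects could then be serialized with ElementTree (element names follow the attribute names above; the original spec uses <URL> rather than <url>, so adjust the casing as needed):

import xml.etree.ElementTree as etree

def build_tree(specifications):
    # Turn a list of Specification objects into the target XML shape.
    root = etree.Element("specifications")
    for spec in specifications:
        spec_elem = etree.SubElement(root, "specification")
        name_elem = etree.SubElement(spec_elem, "propertyName")
        name_elem.text = spec.propertyName
        for value in spec.values:
            value_elem = etree.SubElement(spec_elem, "value")
            for tag in ("number", "text", "url"):
                field = getattr(value, tag)
                if field is None:
                    continue  # skip optional fields that were not supplied
                inner = etree.SubElement(etree.SubElement(value_elem, tag), "value")
                inner.text = str(field)
    return etree.ElementTree(root)

tree = build_tree([spec1, spec2])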

If propertyNames are unique, you can have a dict of list of dict. If they are not, you can have a list of list of list of dict.
If all those lists end up hard to keep track of, you can make a class to store the data:
class Specification:
    def __init__(self):
        self.propertyName = ""
        self.values = []
Then, use:
spec = Specification()
spec.propertyName = "string"
value = {"URL":"someURI", "text":"someText"}
spec.values.append(value)
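If propertyNames are unique, the dict-of-list-of-dict form mentioned above could look like this (a sketch with made-up values):

specifications = {
    "Spec1": [{"number": "3", "URL": "http://www.google.com"},
              {"text": "Hello, World!", "URL": "http://www.jasonfruit.com"}],
    "Spec2": [{"number": "27", "text": "I can haz cheezburger?"}],
}

for property_name, values in specifications.items():
    # build one <specification> element per entry, as in the loops shown earlier
    ...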

Here's a representation approach using named tuples. You can upgrade to using classes (adding code to do some validation on the caller's input, and allowing the omission of field=None for optional fields) at your leisure with no other change to the user API and little change to the ElementTree-building code.
# -*- coding: cp1252 -*-
from collections import namedtuple
import xml.etree.cElementTree as etree

Specifications = namedtuple('Specifications', 'specification_list')
Specification = namedtuple('Specification', 'propertyName value_list')
Value = namedtuple('Value', 'number text url')

def make_etree(specifications, encoding):
    """
    Convert user's `specifications` to an ElementTree.
    `encoding` is encoding of *input* `str` objects.
    """
    def ensure_unicode(v):
        if isinstance(v, str): return v.decode(encoding)
        if isinstance(v, unicode): return v
        return unicode(v)  # convert numbers etc to unicode strings
    root = etree.Element('specifications')
    for spec in specifications.specification_list:
        specification_elem = etree.SubElement(root, "specification")
        propertyName_elem = etree.SubElement(specification_elem, "propertyName")
        propertyName_elem.text = ensure_unicode(spec.propertyName)
        for value in spec.value_list:
            value_elem = etree.SubElement(specification_elem, "value")
            for key in value._fields:
                kv = getattr(value, key)
                if kv is None: continue
                key_elem = etree.SubElement(value_elem, key)
                keyValue_elem = etree.SubElement(key_elem, "value")
                keyValue_elem.text = ensure_unicode(kv)
    return etree.ElementTree(root)

# === sample caller code follows ===
specs = Specifications(
    specification_list=[
        Specification(
            propertyName='a prop',
            value_list=[
                Value(
                    number=42,
                    text='universe',
                    url='http://uww.everywhere',
                    ),
                Value(
                    number=0,
                    text=None,  # optional
                    url='file:///dev/null',
                    ),
                ],
            ),
        Specification(
            propertyName='b prop',
            value_list=[
                Value(
                    number=1,
                    text='Üñîçøðè',  # str object, encoded in cp1252
                    url=u'Üñîçøðè',  # unicode object
                    ),
                ],
            ),
        ],
    )

print repr(specs); print
import sys
tree = make_etree(specs, 'cp1252')
import cStringIO
f = cStringIO.StringIO()
tree.write(f, encoding='UTF-8', xml_declaration=True)
print repr(f.getvalue())
print
Output (folded at column 80):
Specifications(specification_list=[Specification(propertyName='a prop', value_li
st=[Value(number=42, text='universe', url='http://uww.everywhere'), Value(number
=0, text=None, url='file:///dev/null')]), Specification(propertyName='b prop', v
alue_list=[Value(number=1, text='\xdc\xf1\xee\xe7\xf8\xf0\xe8', url=u'\xdc\xf1\x
ee\xe7\xf8\xf0\xe8')])])
"<?xml version='1.0' encoding='UTF-8'?>\n<specifications><specification><propert
yName>a prop</propertyName><value><number><value>42</value></number><text><value
>universe</value></text><url><value>http://uww.everywhere</value></url></value><
value><number><value>0</value></number><url><value>file:///dev/null</value></url
></value></specification><specification><propertyName>b prop</propertyName><valu
e><number><value>1</value></number><text><value>\xc3\x9c\xc3\xb1\xc3\xae\xc3\xa7
\xc3\xb8\xc3\xb0\xc3\xa8</value></text><url><value>\xc3\x9c\xc3\xb1\xc3\xae\xc3\
xa7\xc3\xb8\xc3\xb0\xc3\xa8</value></url></value></specification></specification
s>"

Related

How to use TextBuffer.register_serialize_format of PyGTK?

I'm using serialize and deserialize right now, and when decoding the serialized textbuffer with utf-8 I get this:
GTKTEXTBUFFERCONTENTS-0001 <text_view_markup>
<tags>
<tag name="bold" priority="1">
<attr name="weight" type="gint" value="700" />
</tag>
<tag name="#efef29292929" priority="2">
<attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
</tag>
<tag name="underline" priority="0">
<attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
</tag>
</tags>
<text><apply_tag name="underline">At</apply_tag> the first <apply_tag name="bold">comes</apply_tag> rock! <apply_tag name="underline">Rock</apply_tag>, <apply_tag name="bold">paper,</apply_tag> <apply_tag name="#efef29292929">scissors!</apply_tag></text>
</text_view_markup>
I'm trying to apply the tags using some HTML tags like <u></u> and <b></b>; as I asked this before and it was closed as a duplicate, I'll ask differently here. So, how can I tell where these tags end if they all end with </apply_tag>, instead of something like </apply_tag name="nameoftag">? I tried this before:
def correctTags(text):
    tags = []
    newstring = ''
    for i in range(len(text)):
        if text[i] == '<' and i+18 <= len(text):
            if text[i+17] == '#':
                tags.append('</font color>')
            elif text[i+17] == 'b':
                tags.append('</b>')
            elif text[i+17] == 'u':
                tags.append('</u>')
    newstring = text.replace('<apply_tag name="#', '<font color="#').replace('<apply_tag name="bold">', '<b>').replace('<apply_tag name="underline">', '<u>')
    for j in tags:
        newstring = newstring.replace('</apply_tag>', j, 1)
    return '<text>' + newstring + '</text>'
But there is a problem with nested tags: they will be closed where they shouldn't be.
I think maybe the answer is gtk.TextBuffer.register_serialize_format, as I think this should serialize using the MIME type that I pass to it, like HTML, and then I should know where the tags end. But I didn't find any extensive, friendly usage example of it.
I found the solution to get tags correctly out of a serialized textbuffer at Serialising Gtk TextBuffers to HTML. It isn't register_serialize_format, but, as was said on that site, it's possible to write a serializer, though the documentation is sparse (and for that I think register_serialize_format would be the tool). Either way, the solution uses html.parser and xml.etree.ElementTree, but it's possible to use BeautifulSoup instead.
Basically, this script handles the serialized textbuffer content with an HTML parser. The hard work starts in feed, which receives byte content (the serialized textbuffer content) and returns a string (the formatted text with the HTML tags). First it finds the index of <text_view_markup>, dropping the GTKTEXTBUFFERCONTENTS-0001 header (this is what could not be decoded with decode('utf-8'), as it results in "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte"; you could use decode('utf-8', errors='ignore') or errors='replace' for that, but since the feed method drops this part, the content is decoded with a plain .decode()).
Then tags and text are handled separately. The tags are handled first, and here I used xml.etree.ElementTree, but it's possible to use BeautifulSoup as in the original script. After the tags are handled, feed is called and the text is passed in; this feed is the method of HTMLParser.
Also, for the tags it's possible to handle more than italics, bold, and color; you just need to update the tag2html dictionary.
Besides not using BeautifulSoup, I made some other changes. As for the tag names, all the tags have names, so they are not using ids; my color tag also already has hex values, so I didn't need to use the pango_to_html_hex method. And here is how it looks right now:
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple
from xml.etree.ElementTree import fromstring

from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango


class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """

    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and the buffer's iterators) to an html string.

        Unlike an HTMLParser, the whole string must be passed at once; chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as a valid char.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal
        root = fromstring(tags)
        tags_name = list(root.iter('tag'))
        tags_attributes = list(root.iter('attr'))
        tags = [[tag_name, tag_attribute]
                for tag_name, tag_attribute in zip(tags_name, tags_attributes)]

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            tag_name = tag[0].attrib['name']
            vtype = tag[1].attrib['type']
            value = tag[1].attrib['value']
            name = tag[1].attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    hex_color = f'{value.replace(":", "")}'  # hex color already handled by gtk.gdk.color.to_string() method
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags  # closing tags are FILO

            tags_list[tag_name] = opening_tags, closing_tags

            if opening_tags:
                tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed it all.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('name')
            tags = self.tags.get(tag_name)
            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
Also a big thanks to Cyril Danilevski, who wrote this; all credit to him. And, as he explained, there are also markers for the beginning and end of a TextBuffer's content. So if you follow along the example from the site, its handle_endtag has self.markup_text += self.current_closing_tags.pop(), and that will try to pop an empty list. I therefore recommend anyone who wants to handle tags to also see pango_html.py, which guards this by checking that the list is not empty (the code in this answer does the same in handle_endtag); there's also a test file, test_pango_html.py.
Example of usage:
import PangoToHtml
start_iter = text_buffer.get_start_iter()
end_iter = text_buffer.get_end_iter()
format = text_buffer.register_serialize_tagset()
exported = text_buffer.serialize(text_buffer,
                                 format,
                                 start_iter,
                                 end_iter)
p = PangoToHtml()
p.feed(exported)

convert string representation of a object using python

I have the string PyObject [methodCall=1001, attributeName=javax.servlet.jsp.jstl.fmt.request.charset, attributeValue=UTF-8]
and in Python I have the class:
class PyObject:
    def __init__(self, methodCall, attributeName, attributeValue):
        self.methodCall = methodCall
        self.attributeName = attributeName
        self.attributeValue = attributeValue
I am receiving the above string from a socket, and I want to convert it into a PyObject instance so that I can access attributes like pyobj.methodCall.
I have tried json.loads(), but it gives me a dict.
Since you added the requirement for this to be dynamic in the comments (you should add it to the question as well) consider this solution.
It is not the most beautiful or elegant, but it gets the job done.
This solution abuses the ability to dynamically create classes and instances with type.
string = 'PyObject [methodCall=1001, attributeName=javax.servlet.jsp.jstl.fmt.request.charset, attributeValue=UTF-8]'

class_name, string = string.split(' ', maxsplit=1)
data = [substring.strip() for substring in string.replace('[', '')
                                                 .replace(']', '').split(',') if substring]
print(data)
# ['methodCall=1001', 'attributeName=javax.servlet.jsp.jstl.fmt.request.charset',
#  'attributeValue=UTF-8']

data_dict = {}
for d in data:
    key, value = d.split('=')
    data_dict[key] = value
print(data_dict)
# {'attributeValue': 'UTF-8',
#  'attributeName': 'javax.servlet.jsp.jstl.fmt.request.charset', 'methodCall': '1001'}

def init(self, **kwargs):
    for k, v in kwargs.items():
        setattr(self, k, v)

obj = type(class_name, (), {'__init__': init})(**data_dict)
print(obj)
# <__main__.PyObject object at 0x00520C70>
print(obj.methodCall)
# 1001
print(obj.attributeName)
# javax.servlet.jsp.jstl.fmt.request.charset
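If the target class is known ahead of time, as PyObject is here, a simpler variant (a sketch along the same lines, not part of the original answer) is to parse the key/value pairs and feed them straight to the existing constructor:

import re

string = 'PyObject [methodCall=1001, attributeName=javax.servlet.jsp.jstl.fmt.request.charset, attributeValue=UTF-8]'
body = re.search(r'\[(.*)\]', string).group(1)
kwargs = dict(pair.strip().split('=', 1) for pair in body.split(','))
obj = PyObject(**kwargs)
print(obj.methodCall)      # 1001
print(obj.attributeValue)  # UTF-8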

Write 'xsi:' in front of attribute with lxml for python 3

I'm adding elements to an xml file.
The document's root is as follows
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
And elements to add look like
<Element xsi:type="some type">
<Sub1>Some text</Sub1>
<Sub2>More text</Sub2>
...
</Element>
I'm trying to find a way for lxml to write 'xsi:' in front of my Element's attribute. This xml file is used by a program whose source code I do not have access to. I read in a few other questions how to do it by declaring the nsmap of the xml's root, and then again in the child's attribute, which I tried, but it didn't work. So far I have (this is what didn't work; the output file did not contain the xsi prefix):
element = SubElement(_parent=parent,
                     _tag='some tag',
                     attrib={'{%s}type' % XSI: 'some type'},
                     nsmap={'xsi': XSI})  # Where XSI = namespace address
The namespace is declared properly in the xml file I parse, so I don't know why this isn't working.
The output I get is the element as shown above without the 'xsi:' prefix and all on one line:
<Element type="some type"><Sub1>Some text</Sub1><Sub2>More text</Sub2>...</Element>
If anyone can also point out why in this line
self.tree.write(self.filename, pretty_print=True, encoding='utf-8')
the 'pretty_print' option doesn't work (all printed out in one line), it would be greatly appreciated.
Here is a code example of my script:
from math import floor
from lxml import etree
from lxml.etree import SubElement
def Element(root, sub1: str):
    if not isinstance(sub1, str):
        raise TypeError
    else:
        element = SubElement(root, 'Element')
        element_sub1 = SubElement(element, 'Sub1')
        element_sub1.text = sub1
        # ...
        # Omitted additional SubElements
        # ...
        return element

def Sub(root, sub5_sub: str):
    XSI = "http://www.w3.org/2001/XMLSchema-instance"
    if not isinstance(sub5_sub, str):
        raise TypeError
    else:
        sub = SubElement(root, 'Sub5_Sub', {'{%s}type' % XSI: 'SomeType'}, nsmap={'xsi': XSI})
        # ...
        # Omitted additional SubElements
        # ...
        return sub

class Generator:
    def __init__(self) -> None:
        self.filename = None
        self.csv_filename = None
        self.csv_content = []
        self.tree = None
        self.root = None
        self.panel = None
        self.panels = None

    def mainloop(self) -> None:
        """App's mainloop"""
        while True:
            # Getting files from user
            xml_filename = input('Enter path to xml file : ')
            # Parsing files
            csv_content = [{'field1': 'ElementSub1', 'field2': 'something'},
                           {'field1': 'ElementSub1', 'field2': 'something'},
                           {'field1': 'ElementSub2', 'field2': 'something'}]  # Replaces csv file that I use
            tree = etree.parse(xml_filename)
            root = tree.getroot()
            elements = root.find('Elements')
            for element in elements:
                if element.find('Sub1').text in ['ElementSub1', 'ElementSub2']:
                    for line in csv_content:
                        if element.find('Sub5') is not None:
                            Sub(root=element.find('Sub5'),
                                sub5_sub=line['field2'])
            tree.write(xml_filename, pretty_print=True, encoding='utf-8')
            if input('Continue? (Y) Quit (n)').upper().startswith('Y'):
                elements.clear()
                continue
            else:
                break

    @staticmethod
    def get_x(x: int) -> str:
        if not isinstance(x, int):
            x = int(x)
        return str(int(floor(9999 / 9 * x)))

    @staticmethod
    def get_y(y: int) -> str:
        if not isinstance(y, int):
            y = int(y)
        return str(int(floor(999 / 9 * y)))

    def quit(self) -> None:
        quit()

if __name__ == "__main__":
    app = Generator()
    app.mainloop()
    app.quit()
Here is what it outputs:
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Elements>
<Element>
<Sub1>ElementSub1</Sub1>
<Sub5>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/></Sub5>
</Element>
<Element>
<Sub1>ElementSub1</Sub1>
<Sub5>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/></Sub5>
</Element>
<Element>
<Sub1>ElementSub1</Sub1>
</Element>
</Elements>
</Root>
For some reason, this piece of code does what I want but my real code doesn't. I've come to realize that it does put a prefix on some sub-elements with the type attribute, but not all, and where it does put a prefix, it isn't always just 'xsi:'. I found a quick and dirty fix which is less than ideal (a find-and-replace through the file, turning the attribute name lxml's API accepted into xsi:type). What still isn't working, though, is that everything is printed on one line despite the pretty_print parameter being true.
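Regarding the pretty_print problem: lxml only auto-indents elements it created itself; when a file is parsed, the whitespace (tail text) already attached to the existing elements is kept, which blocks re-indentation of anything added next to them. A common workaround, sketched below, is to discard that whitespace at parse time so pretty_print can rebuild it on write:

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(xml_filename, parser)
# ... add the Sub5_Sub elements as before ...
tree.write(xml_filename, pretty_print=True, encoding='utf-8')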
I just recently encountered this scenario and was able to successfully create an attribute with the xsi: prefix:
qname = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "type")
element = etree.Element('Element', {qname: "some type"})
root.append(element)
this outputs something like
<Element xsi:type="some type">
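A fuller, self-contained sketch of the QName approach (illustrative names; the prefix appears because the namespace is declared in the root's nsmap):

from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
root = etree.Element('Root', nsmap={'xsi': XSI})
qname = etree.QName(XSI, 'type')
element = etree.SubElement(root, 'Element', {qname: 'some type'})
print(etree.tostring(root, pretty_print=True).decode())
# <Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
#   <Element xsi:type="some type"/>
# </Root>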

Parse HTML style text annotations to a list of dictionaries

Currently I have the following problem:
Given a string
"<a>annotated <b>piece</b></a> of <c>text</c>"
construct a list such that the result is
[{"annotated": ["a"]}, {"piece": ["a", "b"]}, {"of": []}, {"text": ["c"]}].
My previous attempt looked something like
open_tag = '<[a-z0-9_]+>'
close_tag = '<\/[a-z0-9_]+>'
tag_def = "(" + open_tag + "|" + close_tag + ")"

def tokenize(str):
    """
    Takes a string and converts it to a list of words or tokens
    For example "<a>foo</a>, of" -> ['<a>', 'foo', '</a>', ',', 'of']
    """
    tokens_by_tag = re.split(tag_def, str)
    def tokenize(token):
        if not re.match(tag_def, token):
            return word_tokenize(token)
        else:
            return [token]
    return list(chain.from_iterable([tokenize(token) for token in tokens_by_tag]))
def annotations(tokens):
    """
    Process tokens into a list with {word : [tokens]} items
    """
    mapping = []
    curr = []
    for token in tokens:
        if re.match(open_tag, token):
            curr.append(re.match('<([a-z0-9_]+)>', token).group(1))
        elif re.match(close_tag, token):
            tag = re.match('<\/([a-z0-9_]+)>', token).group(1)
            try:
                curr.remove(tag)
            except ValueError:
                pass
        else:
            mapping.append({token: list(curr)})
    return mapping
Unfortunately this has a flaw: (n=54) resolves to {"n=54" : []}, but (n=<n>52</n>) resolves to [{"n=": []}, {52: ["n"]}], so the lengths of the two lists differ, making it impossible to merge two different ones later on.
Is there a good strategy for parsing HTML/SGML-style annotations in a way that two differently annotated (but otherwise identical) strings yield lists of equal size?
Note that I'm well aware that regexps are not suitable for this type of parsing, but that is not the problem in this case.
EDIT Fixed a mistake in the example
Your xml data (or html) is not well formed.
Assuming an input file with following xml data well formed:
<root><a>annotated <b>piece</b></a> of <c>text</c></root>
you can use a sax parser to append and pop tags at start and end element events:
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import sys

class Xml2PseudoJson(ContentHandler):
    def __init__(self):
        self.tags = []
        self.chars = []
        self.json = []

    def startElement(self, tag, attrs):
        d = {''.join(self.chars): self.tags[:]}
        self.json.append(d)
        self.tags.append(tag)
        self.chars = []

    def endElement(self, tag):
        d = {''.join(self.chars): self.tags[:]}
        self.chars = []
        self.tags.pop()
        self.json.append(d)

    def characters(self, content):
        self.chars.append(content)

    def endDocument(self):
        print(list(filter(lambda x: '' not in x, self.json)))

parser = make_parser()
handler = Xml2PseudoJson()
parser.setContentHandler(handler)
parser.parse(open(sys.argv[1]))
Run it like:
python3 script.py xmlfile
That yields:
[
{'annotated ': ['root', 'a']},
{'piece': ['root', 'a', 'b']},
{' of ': ['root']},
{'text': ['root', 'c']}
]
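Since the original fragment has no single root element, one option (a sketch reusing the handler above) is to wrap it in a dummy root before parsing and then drop that tag from the results:

from xml.sax import parseString

fragment = "<a>annotated <b>piece</b></a> of <c>text</c>"
handler = Xml2PseudoJson()
parseString(("<root>%s</root>" % fragment).encode("utf-8"), handler)
# keep non-empty text entries and strip the artificial 'root' tag
result = [{text: [t for t in tags if t != "root"] for text, tags in d.items()}
          for d in handler.json if "" not in d]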

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
now I want to write to file, but I don't see how to get to
ElementTree.ElementTree.write()
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?
This'll get you a dict minus attributes. I don't know if this is useful to anyone; I was looking for an xml-to-dict solution myself when I came up with this.
import xml.etree.ElementTree as etree

tree = etree.parse('test.xml')
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = el.getchildren()
    if children:
        d[el.tag] = map(xml_to_dict, children)
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
{'from': 'Jani'},
{'heading': 'Reminder'},
{'body': "Don't forget me this weekend!"}]}
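For the write-back half of the question, here is a minimal sketch of the reverse direction (nested dicts/lists back to Elements; illustrative only, handling just the shapes produced by xml_to_dict above):

import xml.etree.ElementTree as ET

def dict_to_xml(tag, body):
    # body is a text value, a dict of child tags, or a list of single-entry dicts
    elem = ET.Element(tag)
    if isinstance(body, dict):
        for child_tag, child_body in body.items():
            elem.append(dict_to_xml(child_tag, child_body))
    elif isinstance(body, list):
        for child in body:
            for child_tag, child_body in child.items():
                elem.append(dict_to_xml(child_tag, child_body))
    else:
        elem.text = str(body)
    return elem

d = {'note': [{'to': 'Tove'}, {'from': 'Jani'}]}
root = dict_to_xml('note', d['note'])
ET.ElementTree(root).write('config-new.xml')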
I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET

doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup('config-template.xml') # get the parser for the xml file
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level: print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>
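Note that BeautifulStoneSoup belongs to the old BeautifulSoup 3 API; with the current bs4 package the rough equivalent (a sketch; bs4's XML mode needs lxml installed) is:

from bs4 import BeautifulSoup

with open('config-template.xml') as f:
    soup = BeautifulSoup(f, 'xml')
print(soup.root.name)  # 'root'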
My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary = {}
    tag = element.tag[l:]
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    children = element.getchildren()
    if children:
        subdictionary = {}
        for child in children:
            for k, v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    if (isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k, v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags; here I've cleared it, as there is one namespace for the entire document.
NB that I adjusted the raw xml too, so that 'empty' tags would produce at most a ' ' text property in the ElementTree representation:
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
would give for instance
{'note': {'to': 'Tove',
'from': 'Jani',
'heading': 'Reminder',
'body': "Don't forget me this weekend!"}}
It's designed for specific xml that is basically equivalent to json; it should handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
too
there's the possibility of merging the attribute dictionary / subtag dictionary similarly to how repeat subtags are merged, although nesting the lists seems kind of appropriate :-)
Adding this line
d.update(('#' + k, v) for k, v in el.attrib.iteritems())
to user247686's code, you can get node attributes too.
Found it in this post https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib import urlopen

xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = el.getchildren()
    if children:
        d[el.tag] = map(xml_to_dict, children)
    d.update(('#' + k, v) for k, v in el.attrib.iteritems())
    return d
Call as
xml_to_dict(root)
Have you tried this?
print xml.etree.ElementTree.tostring( conf_new )
The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    for part in data.getchildren():
        xdic[part.tag] = part.text
XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project to handle round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways is XML Support in Pickling Tools. Version 1.3 and newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
