minidom.parse reading short XML?

Here are two similar XML files.
Long XML:
<mynode>
    <text>Blah</text>
    <position>322,13</position>
</mynode>
Short XML:
<mynode text="Blah" position="322,13" />
It seems that Python's minidom.parse doesn't like the short XML.
Is this short, attribute-based style supported by minidom?
Is it possible to write a single piece of code that reads both the short and the long XML?

from xml.dom import minidom

def getChild(n, v):
    # Yield the child elements of n whose tag name is v.
    for child in n.childNodes:
        if child.localName == v:
            yield child

def getValue(nodes, val):
    res = None
    # Long form: look for a <val> child element and take its text.
    for n in nodes:
        for v in getChild(n, val):
            res = v.childNodes[0].nodeValue
    # Short form: fall back to an attribute named val.
    if not res:
        for n in nodes:
            attr = n.getAttributeNode(val)
            if attr:
                res = attr.nodeValue.strip()
    return res

xmldoc = minidom.parse('file.xml')
mynode = xmldoc.getElementsByTagName('mynode')
print(getValue(mynode, 'text'))
print(getValue(mynode, 'position'))
Output:
Blah
322,13

You need a root node:
>>> from xml.dom.minidom import parseString
>>> doc = parseString('<root><mynode text="Blah" position="322,13" /></root>')
>>> print(doc.firstChild.firstChild.getAttribute('text'))
Blah
>>> print(doc.firstChild.firstChild.getAttribute('position'))
322,13
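To address the second part of the question, here is a minimal sketch of one reader that handles both layouts. It assumes each value is either an attribute of <mynode> or the text of a direct child element with the same name; the helper names get_value and read are made up for this illustration, and both samples are wrapped in a <root> element as in the answer above.

```python
from xml.dom.minidom import parseString

def get_value(node, name):
    """Return `name` from an attribute if present, else from a child element's text."""
    if node.hasAttribute(name):
        return node.getAttribute(name).strip()
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == name:
            # Join the text nodes directly under the matching child element.
            return "".join(t.data for t in child.childNodes
                           if t.nodeType == t.TEXT_NODE).strip()
    return None

SHORT = '<root><mynode text="Blah" position="322,13" /></root>'
LONG = '<root><mynode><text>Blah</text><position>322,13</position></mynode></root>'

def read(xml_text):
    # Take the first <mynode> and read both values from it.
    node = parseString(xml_text).getElementsByTagName("mynode")[0]
    return get_value(node, "text"), get_value(node, "position")

print(read(SHORT))
print(read(LONG))
```

Checking the attribute first keeps the short form cheap; the child-element lookup only runs when the attribute is absent.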

Related

How to parse xml file with <pair key = "..."> </pair> format

I hope to parse a '.xml' file using python. The format of the file is as follows:
<root><dm_log_packet>
<pair key ="type_id">LTE_PHY_Serv_Cell_Measurement</pair>
</dm_log_packet>
</root>
I tried to parse it using ElementTree but failed.
Here is my code:
from xml.etree import ElementTree

class Log:
    def __init__(self, type_id=None):
        self.type_id = type_id
    def __str__(self):
        return self.type_id

roota = ElementTree.parse("file.xml")
log_file = roota.findall("dm_log_packet")
lo = []
for aa in log_file:
    log = Log()
    log.type_id = aa.find("type_id").text
    lo.append(log)
I expect to parse each pair, but this only works when the file contains a <type_id>...</type_id> element, not a <pair key="type_id"> element.

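You can use BeautifulSoup:
# Assumes the bs4 package (Beautiful Soup 4) is installed.
from bs4 import BeautifulSoup
xml = """
<root>
<dm_log_packet>
<pair key ="type_id">LTE_PHY_Serv_Cell_Measurement</pair>
</dm_log_packet>
</root>
"""
soup_obj = BeautifulSoup(xml, "html.parser")
soup_obj.find('pair', {'key': 'type_id'}).text
The output will be
'LTE_PHY_Serv_Cell_Measurement'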
More Descriptive Answer

.find() and .findall() take a path expression in ElementTree's XPath subset. A plain tag name such as "type_id" only matches child elements with that tag, so it never matches <pair key="type_id">; use an attribute predicate instead.
from xml.etree import ElementTree

class Log:
    def __init__(self, type_id=None):
        self.type_id = type_id
    def __str__(self):
        return self.type_id

tree = ElementTree.parse("file.xml")
lo = []
for dm_log_packet in tree.findall(".//dm_log_packet"):
    pair = dm_log_packet.find("./pair[@key='type_id']")
    if pair is not None:
        lo.append(Log(pair.text))
Note that dm_log_packet.find("./pair[@key='type_id']") will return None when there is no <pair key="type_id">, hence the extra check.
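The attribute predicate can be tried without a file on disk. The following is a small sketch that parses an inline string; the second <pair> and the second <dm_log_packet> are invented here only to show the None case.

```python
import xml.etree.ElementTree as ET

# Sample data: the "timestamp" pair and the second packet are hypothetical.
xml_text = """
<root>
  <dm_log_packet>
    <pair key="type_id">LTE_PHY_Serv_Cell_Measurement</pair>
    <pair key="timestamp">unknown</pair>
  </dm_log_packet>
  <dm_log_packet>
    <pair key="other">no type id here</pair>
  </dm_log_packet>
</root>
"""

root = ET.fromstring(xml_text)
type_ids = []
for packet in root.findall("dm_log_packet"):
    # [@key='type_id'] keeps only <pair> elements whose key attribute matches.
    pair = packet.find("pair[@key='type_id']")
    if pair is not None:
        type_ids.append(pair.text)

print(type_ids)
```

The second packet contributes nothing because find() returns None for it.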

Print lxml.objectify.ObjectifiedElement?

Printing an lxml.objectify.ObjectifiedElement just prints a blank line, so I have to access it via its tags, and when I don't know the tags of the response, I'm just guessing.
How do I print the entire object, showing children names and values?
As requested, here is the code I have. Not sure what purpose this holds, but:
from amazonproduct import API
api = API('xxxxx', 'xxxxx', 'us', 'xxxx')
result = api.item_lookup('B00H8U93JO', ResponseGroup='OfferSummary')
print result

Using lxml.etree.tostring() seems to work, although the output is not prettified:
>>> from lxml import etree
>>> from lxml import objectify
>>> raw = '''<root>
...     <foo>foo</foo>
...     <bar>bar</bar>
... </root>'''
>>> root = objectify.fromstring(raw)
>>> print type(root)
<type 'lxml.objectify.ObjectifiedElement'>
>>> print etree.tostring(root)
<root><foo>foo</foo><bar>bar</bar></root>

In response to har07: you can use minidom to prettify the output.
from lxml import objectify, etree
from xml.dom import minidom

def pretty_print(elem):
    xml = etree.tostring(elem)
    pretty = minidom.parseString(xml).toprettyxml(indent=' ')
    print(pretty)
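The same serialize-then-reparse idea can be tried with the standard library alone. This sketch swaps lxml for xml.etree.ElementTree purely so it runs without third-party packages; the function name prettify and the two-space indent are choices made for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def prettify(elem, indent="  "):
    """Serialize an Element, then let minidom re-indent it."""
    raw = ET.tostring(elem)  # bytes; minidom.parseString accepts bytes or str
    return minidom.parseString(raw).toprettyxml(indent=indent)

root = ET.fromstring("<root><foo>foo</foo><bar>bar</bar></root>")
pretty = prettify(root)
print(pretty)
```

Note that toprettyxml() also prepends an XML declaration line, which the compact tostring() output does not have.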

xmlns namespace breaking lxml

I am trying to open an XML file and get values from certain tags. I have done this a lot, but this particular XML is giving me some issues. Here is a section of the XML file:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
    <provider>filmgroup</provider>
    <language>en-GB</language>
    <actor name="John Smith" display="Doe John"></actor>
</package>
And here is a sample of my Python code:
metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
    providerValue = tree.find('//provider')
    providerValue = providerValue.text
    print(providerValue)
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print=True, xml_declaration=True, encoding='UTF-8')
When I run this, it can't find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer", everything works as expected.
My question is: how can I remove this namespace, since I'm not at all interested in it, so that I can get the tag values I need using lxml?

The provider tag is in the http://apple.com/itunes/importer namespace, so you either need to use the fully qualified name
{http://apple.com/itunes/importer}provider
or use one of the lxml methods that has the namespaces parameter, such as root.xpath. Then you can specify it with a namespace prefix (e.g. ns:provider):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
namespaces = {'ns': 'http://apple.com/itunes/importer'}
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                        namespaces=namespaces))
for provider, actor in zip(*[items]*2):
    print(provider, actor)
yields
('filmgroup', 'John Smith')
Note that the XPath used above assumes that <provider> and <actor> elements always appear in alternation. If that is not true, there are of course ways to handle it, but the code becomes a bit more verbose:
for package in root.xpath('//ns:package', namespaces=namespaces):
    for provider in package.xpath('ns:provider', namespaces=namespaces):
        providerValue = provider.text
        print(providerValue)
    for actor in package.xpath('ns:actor', namespaces=namespaces):
        print(actor.attrib['name'])
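Both spellings of a namespaced tag can be compared side by side. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so it runs without extra packages); its find() accepts the same kind of prefix-to-URI mapping. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""

NS_URI = "http://apple.com/itunes/importer"
root = ET.fromstring(xml_text)

# 1. Fully qualified name in {uri}tag form.
qualified = root.find("{%s}provider" % NS_URI).text

# 2. Prefix form, resolved through a prefix-to-URI mapping.
namespaces = {"ns": NS_URI}
prefixed = root.find("ns:provider", namespaces).text

# Unprefixed attributes are not in the default namespace, so plain "name" works.
actor_name = root.find("ns:actor", namespaces).get("name")

# The bare tag name matches nothing, which is the symptom from the question.
bare = root.find("provider")

print(qualified, prefixed, actor_name, bare)
```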

My suggestion is not to ignore the namespace but to take it into account. I wrote some related functions (copied with slight modification) for my work on the django-quickbooks library. With these functions, you should be able to do this:
providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')
Here are those functions:
class TagNotFound(Exception):
    """Raised by getel when no matching tag exists (minimal stand-in definition)."""

def get_tag_with_ns(tag_name, ns):
    return '{%s}%s' % (ns, tag_name)

def getel(elt, tag_name, ns=None):
    """ Gets the first tag that matches the specified tag_name taking into
    account the QB namespace.
    :param ns: The namespace to use if not using the default one for
    django-quickbooks.
    :type ns: string
    """
    res = elt.find(get_tag_with_ns(tag_name, ns=ns))
    if res is None:
        raise TagNotFound('Could not find tag by name "%s"' % tag_name)
    return res

def getels(elt, *path, **kwargs):
    """ Gets the first set of elements found at the specified path.
    Example:
    >>> xml = (
    "<root>" +
    "<item>" +
    "<id>1</id>" +
    "</item>" +
    "<item>" +
    "<id>2</id>" +
    "</item>" +
    "</root>")
    >>> el = etree.fromstring(xml)
    >>> getels(el, 'root', 'item', ns='correct/namespace')
    [<Element item>, <Element item>]
    """
    ns = kwargs['ns']
    i = -1
    # Walk down every path component except the last one.
    for i in range(len(path)-1):
        elt = getel(elt, path[i], ns=ns)
    # The last component may match several elements.
    tag_name = path[i+1]
    return elt.findall(get_tag_with_ns(tag_name, ns=ns))

How to detect starting tag of xml and then parse and objectify

I'm using lxml to parse and objectify XML files in a directory. I have many models and XSDs, and each object model maps to a defined class: for example, if the XML starts with a model tag it is a dataModel, and if it starts with a page tag it is a viewModel.
My question is how to efficiently detect which tag an XML file starts with, then validate it against the appropriate XSD file, and then objectify it.
files = glob(os.path.join('resources/xml', '*.xml'))
for f in files:
    xmlinput = open(f)
    xmlContent = xmlinput.read()
    if xsdPath:
        xsdFile = open(xsdPath)
        # xsdFile should retrieve according to xml content
        schema = etree.XMLSchema(file=xsdFile)
        xmlinput.seek(0)
        myxml = etree.parse(xmlinput)
        try:
            schema.assertValid(myxml)
        except etree.DocumentInvalid as x:
            print("In file %s error %s has occurred." % (f, x.message))
        finally:
            xsdFile.close()
    xmlinput.close()

I deliberately leave file reading and processing aside to concentrate on your problem:
>>> from lxml.etree import fromstring
>>> # We have XMLs with different root tags
>>> tree1 = fromstring("<model><foo/><bar/></model>")
>>> tree2 = fromstring("<page><baz/><blah/></page>")
>>>
>>> # We have different treatments
>>> def modelTreatement(etree):
...     return etree.xpath('//bar')
...
>>> def pageTreatment(etree):
...     return etree.xpath('//blah')
...
>>> # Here is a recipe to read the root tag
>>> tree1.getroottree().getroot().tag
'model'
>>> tree2.getroottree().getroot().tag
'page'
>>>
>>> # So, by building an appropriate dict:
>>> tag_to_treatment_map = {'model': modelTreatement, 'page': pageTreatment}
>>> # You can run the right method on the right tree
>>> for tree in [tree1, tree2]:
...     tag_to_treatment_map[tree.getroottree().getroot().tag](tree)
...
[<Element bar at 0x24979b0>]
[<Element blah at 0x2497a00>]
Hope this will be useful to someone, even though I did not see this earlier.
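If only the root tag is needed to pick a schema, the file does not have to be fully parsed first. The following is a rough sketch using the standard library's iterparse, which reports the root as its first "start" event; the schema file names in SCHEMA_FOR_TAG and the helper names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from root tag to schema file name.
SCHEMA_FOR_TAG = {"model": "dataModel.xsd", "page": "viewModel.xsd"}

def root_tag(source):
    """Return the tag of the first element that starts, i.e. the document root."""
    for _event, element in ET.iterparse(source, events=("start",)):
        return element.tag

def schema_for(source):
    # Unknown root tags map to None so the caller can decide what to do.
    return SCHEMA_FOR_TAG.get(root_tag(source))

model_doc = io.BytesIO(b"<model><foo/><bar/></model>")
page_doc = io.BytesIO(b"<page><baz/><blah/></page>")
print(schema_for(model_doc))
print(schema_for(page_doc))
```

Since iterparse consumes part of the stream, a file object needs seek(0) before it is parsed again for validation. If the root element is in a namespace, the returned tag has the {uri}tag form and the mapping keys would need to match that.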

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
Now I want to write to a file, but I don't see how to get to ElementTree.ElementTree.write():
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest a different way of doing this?

This'll get you a dict minus attributes. I don't know if this is useful to anyone; I was looking for an XML-to-dict solution myself when I came up with this.
import xml.etree.ElementTree as etree

tree = etree.parse('test.xml')
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    # list(el) replaces the deprecated el.getchildren()
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
          {'from': 'Jani'},
          {'heading': 'Reminder'},
          {'body': "Don't forget me this weekend!"}]}

I'm not sure that converting the infoset to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
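The removal step can be checked on an inline template before touching real files. This sketch assumes a template shaped like the names used in the question (root-name, level1-name, leaf1, leaf2); leaf3 is added here only to show that something remains.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the template file; leaf3 is a hypothetical extra element.
template = (
    "<root-name><level1-name>"
    "<leaf1>a</leaf1><leaf2>b</leaf2><leaf3>c</leaf3>"
    "</level1-name></root-name>"
)

root = ET.fromstring(template)
lvl1 = root.find("level1-name")
for name in ("leaf1", "leaf2"):
    leaf = lvl1.find(name)
    # find() returns None for a missing element, and remove(None) would fail.
    if leaf is not None:
        lvl1.remove(leaf)

result = ET.tostring(root, encoding="unicode")
print(result)
```

To write the result to disk instead of a string, ET.ElementTree(root).write("config-new.xml") does the same job as doc.write() above.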

For easy manipulation of XML in Python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
    <level1>leaf1</level1>
    <level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
# pass the file object, not the file name, so that its contents are parsed
soup = BeautifulStoneSoup(open('config-template.xml'))
soup.contents[0].name
# u'root'
You can use the node names as attributes:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level:
    print(tag.name)
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print(soup.prettify())
# <root>
#  <level1>
#   leaf1
#  </level1>
#  <level2>
#   leaf2
#  </level2>
#  <level3>
#   leaf3
#  </level3>
# </root>

My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary = {}
    tag = element.tag[l:]
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    # list(element) replaces the deprecated element.getchildren()
    children = list(element)
    if children:
        subdictionary = {}
        for child in children:
            for k, v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    # repeated subtags are collected into a list
                    if (isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k, v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
Here namespace is the xmlns string, including braces, that ElementTree prepends to all tags; I strip it because there is a single namespace for the entire document.
Note that I adjusted the raw XML too, so that 'empty' tags produce at most a ' ' text property in the ElementTree representation:
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
This would give, for instance:
{'note': {'to': 'Tove',
          'from': 'Jani',
          'heading': 'Reminder',
          'body': "Don't forget me this weekend!"}}
It is designed for specific XML that is basically equivalent to JSON, but it should also handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
There is also the possibility of merging the attribute dictionary and the subtag dictionary, similarly to how repeated subtags are merged, although nesting the lists seems kind of appropriate :-)

Adding this line
d.update(('@' + k, v) for k, v in el.attrib.items())
to user247686's code gives you node attributes too.
Found it in this post: https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
# Python 2 import; on Python 3 use: from urllib.request import urlopen
from urllib import urlopen

xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    # list(el) replaces the deprecated el.getchildren()
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    # attributes are stored next to the tag key, prefixed with '@'
    d.update(('@' + k, v) for k, v in el.attrib.items())
    return d
Call as
xml_to_dict(root)
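To see what shape this produces, here is a self-contained sketch that runs the same idea on an inline string instead of a URL. The "@" prefix for attribute keys is just a marker chosen for this illustration, and the id attribute on <note> is invented for the example.

```python
import xml.etree.ElementTree as ET

def xml_to_dict(el):
    """Map an element to {tag: text-or-children}, plus '@'-prefixed attribute keys."""
    children = list(el)
    if children:
        # Children win over any text, as in the code above.
        value = [xml_to_dict(child) for child in children]
    else:
        value = el.text or {}
    d = {el.tag: value}
    d.update(("@" + k, v) for k, v in el.attrib.items())
    return d

# The id attribute is hypothetical; it is not part of the w3schools sample.
note = ET.fromstring('<note id="1"><to>Tove</to><from>Jani</from></note>')
as_dict = xml_to_dict(note)
print(as_dict)
```

Because attribute keys live in the same dict as the tag key, an attribute prefix is needed to avoid clashing with a tag of the same name.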

Have you tried this?
print xml.etree.ElementTree.tostring( conf_new )

The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    # iterate over the direct children of the root element
    for part in data:
        xdic[part.tag] = part.text

XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project that handles round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways, is XML Support in Pickling Tools. Version 1.3 or newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
