I need to load an XML file and convert the contents into an object-oriented Python structure. I want to take this:
<main>
<object1 attr="name">content</object1>
</main>
And turn it into something like this:
main
main.object1 = "content"
main.object1.attr = "name"
The XML data will have a more complicated structure than that and I can't hard code the element names. The attribute names need to be collected when parsing and used as the object properties.
How can I convert XML data into a Python object?
It's worth looking at lxml.objectify.
xml = """<main>
<object1 attr="name">content</object1>
<object1 attr="foo">contenbar</object1>
<test>me</test>
</main>"""
from lxml import objectify
main = objectify.fromstring(xml)
main.object1[0] # content
main.object1[1] # contenbar
main.object1[0].get("attr") # name
main.test # me
Or the other way around to build xml structures:
item = objectify.Element("item")
item.title = "Best of python"
item.price = 17.98
item.price.set("currency", "EUR")
order = objectify.Element("order")
order.append(item)
order.item.quantity = 3
order.price = sum(item.price * item.quantity for item in order.item)
import lxml.etree
print(lxml.etree.tostring(order, pretty_print=True))
Output:
<order>
<item>
<title>Best of python</title>
<price currency="EUR">17.98</price>
<quantity>3</quantity>
</item>
<price>53.94</price>
</order>
I've been recommending this more than once today, but try Beautiful Soup (easy_install BeautifulSoup).
from BeautifulSoup import BeautifulSoup
xml = """
<main>
<object attr="name">content</object>
</main>
"""
soup = BeautifulSoup(xml)
# look in the main node for objects with attr=name, optionally look up attrs with regex
my_objects = soup.main.findAll("object", attrs={'attr':'name'})
for my_object in my_objects:
    # this will print a list of the contents of the tag
    print my_object.contents
    # if only text is inside the tag you can use this
    # print my_object.string
David Mertz's gnosis.xml.objectify would seem to do this for you. Documentation's a bit hard to come by, but there are a few IBM articles on it, including this one (text only version).
from gnosis.xml import objectify
xml = "<root><nodes><node>node 1</node><node>node 2</node></nodes></root>"
root = objectify.make_instance(xml)
print root.nodes.node[0].PCDATA # node 1
print root.nodes.node[1].PCDATA # node 2
Creating xml from objects in this way is a different matter, though.
How about this
http://evanjones.ca/software/simplexmlparse.html
##Stephen:
#"can't hardcode the element names, so I need to collect them
#at parse and use them somehow as the object names."
#I don't think that's possible. Instead you can do this.
#This will help you get any object with a required name.
import BeautifulSoup
class Coll(object):
    """A class which can hold your Foo class objects
    and retrieve them easily when you want,
    abstracting the storage and retrieval logic
    """
    def __init__(self):
        self.foos={}
    def add(self, fooobj):
        self.foos[fooobj.name]=fooobj
    def get(self, name):
        return self.foos[name]
class Foo(object):
    """The required class
    """
    def __init__(self, name, attr1=None, attr2=None):
        self.name=name
        self.attr1=attr1
        self.attr2=attr2
s="""<main>
<object name="somename">
<attr name="attr1">value1</attr>
<attr name="attr2">value2</attr>
</object>
<object name="someothername">
<attr name="attr1">value3</attr>
<attr name="attr2">value4</attr>
</object>
</main>
"""
#
soup=BeautifulSoup.BeautifulSoup(s)
bars=Coll()
for each in soup.findAll('object'):
    bar=Foo(each['name'])
    attrs=each.findAll('attr')
    for attr in attrs:
        setattr(bar, attr['name'], attr.renderContents())
    bars.add(bar)
#retrieve objects by name
print bars.get('somename').__dict__
print '\n\n', bars.get('someothername').__dict__
output
{'attr2': 'value2', 'name': u'somename', 'attr1': 'value1'}
{'attr2': 'value4', 'name': u'someothername', 'attr1': 'value3'}
There are three common XML parsers for Python: xml.dom.minidom, ElementTree, and BeautifulSoup.
IMO, BeautifulSoup is by far the best.
http://www.crummy.com/software/BeautifulSoup/
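For the original question, the standard library's ElementTree may already be enough for a rough form of attribute-style access. The following is only a minimal sketch: Node is a hypothetical wrapper class written for this example (it is not part of any library), and it returns a single object when a tag occurs once and a list when it repeats.

```python
import xml.etree.ElementTree as ET

class Node:
    """Hypothetical wrapper that exposes child elements as attributes."""

    def __init__(self, elem):
        self.text = (elem.text or "").strip()
        self.attrs = dict(elem.attrib)
        self.children = {}
        for child in elem:
            # Group children by tag name, collected while parsing.
            self.children.setdefault(child.tag, []).append(Node(child))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        found = self.__dict__.get("children", {}).get(name)
        if found is None:
            raise AttributeError(name)
        return found[0] if len(found) == 1 else found

xml = """<main>
<object1 attr="name">content</object1>
<object1 attr="foo">contenbar</object1>
<test>me</test>
</main>"""

main = Node(ET.fromstring(xml))
print(main.object1[0].text)           # content
print(main.object1[1].attrs["attr"])  # foo
print(main.test.text)                 # me
```

A child element that happens to be named text, attrs, or children is shadowed by the wrapper's own attributes in this sketch; it stays reachable through the children dictionary.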
If googling around for a code-generator doesn't work, you could write your own that uses XML as input and outputs objects in your language of choice.
It's not terribly difficult; however, the three-step process of Parse XML, Generate Code, Compile/Execute Script does make debugging a bit harder.
Related
I'm a newbie with Python and I'd like to remove the element openingHours and the child elements from the XML.
I have this input
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>05:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<station id= "2">
<name>foo</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>06:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<stations/>
<Root/>
I'd like this output
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<station/>
<station id= "2">
<name>foo</name>
<station/>
<stations/>
<Root/>
So far I've tried this, taken from another thread ("How to remove elements from XML using Python"):
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//*[attribute::openingHour]'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
However, it doesn't seem to be working.
Thanks
I took your code for a spin but at first Python couldn't agree with the way you composed your XML, wanting the / in the closing tag to be at the beginning (like </...>) instead of at the end (<.../>).
That aside, the reason your code isn't working is because the xpath expression is looking for the attribute openingHour while in reality you want to look for elements called openingHours. I got it to work by changing the expression to //openingHours. Making the entire code:
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
You want to remove the tags <openingHours> and not some attribute with name openingHour:
from lxml import etree
doc = etree.parse('stations.xml')
for elem in doc.findall('.//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
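If lxml is not installed, roughly the same thing can be done with the standard library. This is a sketch under one assumption: the input has been corrected to well-formed XML (closing tags written as </...>), shown here as a shortened inline string. ElementTree elements have no getparent(), so the loop visits every element and removes matching children from it directly.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the stations document (assumed for the example).
xml = """<Root>
  <stations>
    <station id="1">
      <name>whatever</name>
      <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
    </station>
    <station id="2">
      <name>foo</name>
      <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
    </station>
  </stations>
</Root>"""

root = ET.fromstring(xml)

# Snapshot the elements first so the tree is not modified while being walked.
for parent in list(root.iter()):
    for child in parent.findall("openingHours"):
        parent.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```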
I am trying to add an attribute to all child elements in all XML files in the current directory. This attribute should be equal to the length of each string. For example, the XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<RootElement>
<String Id="PythonLove">I love Python.</String>
</RootElement>
So, if this worked the way it should, it would leave the child opening tag looking like this:
<String Id="PythonLove" length="14">
I have read many forums and all point to either .set or .attrib to add attributes into an existing XML. Neither of these have any effect on the files though. My script currently looks like this:
for child in root:
    length_limit = len(child.text)
    child.set('length', length_limit)
I've also tried child.attrib['length'] = length_limit. This also doesn't work. What am I doing wrong?
Thanks
You need to convert the value to a string before passing it to set().
>>> xml = '''<?xml version="1.0" encoding="utf-8"?>
... <RootElement>
... <String Id="PythonLove">I love Python.</String>
... </RootElement>
... '''
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml)
>>> for child in root:
...     child.set('length', str(len(child.text))) # <---
...
>>> print(ET.tostring(root).decode())
<RootElement>
<String Id="PythonLove" length="14">I love Python.</String>
</RootElement>
Got it! Pretty elated because that was a couple weeks of struggles. I ended up just writing to 'infile' (used for iterating through the files in the cwd) and it worked to overwrite the existing XML (had to register the namespace first which was another little hump I ran into). Full code:
import glob
import os
import xml.etree.ElementTree as ET
# register the default namespace once so it is written back without a prefix
ET.register_namespace('', "http://schemas.microsoft.com/wix/2006/XML")
path = os.getcwd()
for infile in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(infile)
    root = tree.getroot() # sets variable 'root' to the root element
    for child in root:
        string_length = str(len(child.text))
        child.set('length', string_length)
    tree.write(infile) # overwrite the existing file
I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script is working fine collecting a list of xml files etc. However the below function is to print the relevant values. I'm trying to achieve this as suggested in this post. However I'm clearly doing something incorrectly as I'm getting errors suggesting that elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print elm.text
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text property.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print el.attrib['userName']
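The same distinction between elements and attributes can be checked with the standard library's ElementTree, whose limited XPath support includes the [@attribute] predicate. A small sketch follows; the inline document is a cut-down version of the one in the question, and the value "jdoe" is made up for illustration because the original attribute is empty.

```python
import xml.etree.ElementTree as ET

# Cut-down document; userName="jdoe" is a hypothetical value.
xml = """<Drives disabled="1">
  <Drive name="S:" status="S:">
    <Properties action="U" userName="jdoe" label="SCRATCH" letter="S"/>
  </Drive>
</Drives>"""

root = ET.fromstring(xml)

# find() searches for child elements, so an attribute name matches nothing.
print(root.find("userName"))  # None

# Select every element that carries a userName attribute, then read it.
user_names = [el.get("userName") for el in root.iterfind(".//*[@userName]")]
print(user_names)
```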
You can get the element by its tag name and then read its attribute (userName is an attribute of Properties). Note that getElementsByTagName belongs to xml.dom.minidom, not lxml:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print elm.value
I have a database in XML like the one below, and I'm trying to parse it with Python 2.7:
<team>
<generator>
<team_name>TeamMaster</team_name>
<team_year>2000</team_year>
<team_city>NewYork</team_city>
</generator>
<players>
<definition name="John V." number="4" age="25">
<criteria position="fow" side="right">
<criterion website="www.johnV.com" version="1" result="true"/>
</criteria>
<object debut="2003" version="3" flag="complete">
<history item_ref="team34"/>
<history item_ref="mainteam"/>
</object>
</definition>
<definition name="Emma" number="2" age="19">
<criteria position="mid" side="left">
<criterion website="www.emma.net" version="7" result="true"/>
</criteria>
<object debut="2008" version="1" flag="complete">
<history item_ref="newteam"/>
<history item_ref="youngteam"/>
<history item_ref="oldteam"/>
</object>
</definition>
</players>
</team>
With this small script I can easily parse the first part of my XML, "generator", where I know all the elements it contains:
from xml.dom.minidom import parseString
mydb = {
    "team_name": None,
    "team_year": None,
    "team_data": None
}
file = open('mydb.xml','r')
data = file.read()
file.close()
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('team_name')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<team_name>','').replace('</team_name>','')
mydb["team_name"] = xmlData # TeamMaster
But my real problem came when I tried to parse the "players" elements, where attributes appear in "definition" and there is an unknown number of "history" elements.
Maybe there is another module that would suit this better than minidom?
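Better to use xml.etree.ElementTree, which has a more Pythonic syntax. Get the text of team_name with root.findtext('generator/team_name') (team_name is nested inside generator, so a bare 'team_name' would not match from the root), or iterate over all definitions with root.iter('definition').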
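A short sketch of how that could look for the "players" part, using only the standard library. The inline XML is a shortened copy of the question's data; it assumes each <object> element is closed with </object> so that the document is well-formed, and the layout of the resulting dictionaries is just one possible choice.

```python
import xml.etree.ElementTree as ET

xml = """<team>
  <generator>
    <team_name>TeamMaster</team_name>
    <team_year>2000</team_year>
    <team_city>NewYork</team_city>
  </generator>
  <players>
    <definition name="John V." number="4" age="25">
      <criteria position="fow" side="right">
        <criterion website="www.johnV.com" version="1" result="true"/>
      </criteria>
      <object debut="2003" version="3" flag="complete">
        <history item_ref="team34"/>
        <history item_ref="mainteam"/>
      </object>
    </definition>
  </players>
</team>"""

root = ET.fromstring(xml)

# Known elements: collect every child of <generator> by tag name.
mydb = {child.tag: child.text for child in root.find("generator")}

# Unknown number of definitions and histories: iterate instead of indexing.
players = []
for definition in root.iter("definition"):
    player = dict(definition.attrib)  # name, number, age
    player["history"] = [h.get("item_ref") for h in definition.iter("history")]
    players.append(player)

print(mydb)
print(players)
```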
You can use either the ElementTree XML parser or the BeautifulSoup XML parser.
I have created a repo showing usage of the XML parsers here: XML Parsers Collection
Snippet code below:
#Get the data from XML parser.
users = xml_parser(users_file,'user')
#Iterate through root element.
for user in users:
    print(user.find('country').text)
    print(user.find('city').text)
I am trying to get the names of the child tags with Python's sax library. I am using ContentHandler as the handler. Does anyone have an idea how to obtain the tag names?
Let's assume our xml document looks like:
<root>
<parent>
<child1>X</child1>
<child2>Y</child2>
</parent>
</root>
And let's assume we use the template for the handler:
from xml.sax import handler
class parserSAXHandler(handler.ContentHandler):
    def __init__(self):
        pass
    def startElement(self, name, attrs):
        pass
    def endElement(self, name):
        pass
    def characters(self, content):
        pass
How can I obtain the strings "child1" and "child2" assuming that I only know the name of the parent?
SAX-style parsers require that you keep track of all the state you need, such as which tags you've seen. At a minimum, what you need to do is write a startElement() handler that sets a flag when it sees a <parent> tag and an endElement() that clears that flag when it sees the closing tag. The startElement() handler also needs to accumulate tags it's seen in a list when this flag is set.
class parserSAXHandler(handler.ContentHandler):
    def __init__(self):
        self.parentflag = False
        self.childlist = []
    def startElement(self, name, attrs):
        if name == "parent":
            self.parentflag = True
        elif self.parentflag:
            self.childlist.append(name)
    def endElement(self, name):
        if name == "parent":
            self.parentflag = False
After parsing, the instance's childlist attribute will have the list you want.
You may need more sophisticated logic if it's possible for additional tags to be nested inside <child> tags and you don't want these tag names. As it is, any tag nested inside a <parent> container at any level is included. The easiest way to keep track of nesting is probably with a stack: push each opening tag, pop each closing tag, and then you can just check to see if parent is at the top of the stack.
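from xml.sax import handler, SAXException
class parserSAXHandler(handler.ContentHandler):
    def __init__(self):
        self.tagstack = []
        self.childlist = []
    def startElement(self, name, attrs):
        # the stack is empty when the root element opens, so check it first
        if self.tagstack and self.tagstack[-1] == "parent":
            self.childlist.append(name)
        self.tagstack.append(name)
    def endElement(self, name):
        if self.tagstack and name == self.tagstack[-1]:
            self.tagstack.pop()
        else:
            # SAXException takes just a message; SAXParseException also needs a locator
            raise SAXException("tag closed without being open")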
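To see the stack approach end to end, here is a self-contained sketch that feeds a document to the handler with xml.sax.parseString. The class name ChildNameHandler and its parent_name parameter are made up for this example, and a <grandchild> element has been added to the sample document to show that deeper tags are skipped.

```python
import xml.sax
from xml.sax import handler

class ChildNameHandler(handler.ContentHandler):
    """Collects the names of the direct children of one parent tag."""

    def __init__(self, parent_name):
        super().__init__()
        self.parent_name = parent_name
        self.tagstack = []   # currently open tags, innermost last
        self.childlist = []

    def startElement(self, name, attrs):
        # A direct child is any tag opened while the parent is on top of the stack.
        if self.tagstack and self.tagstack[-1] == self.parent_name:
            self.childlist.append(name)
        self.tagstack.append(name)

    def endElement(self, name):
        # The parser already enforces well-formedness, so a plain pop is enough here.
        self.tagstack.pop()

xml_doc = b"""<root>
  <parent>
    <child1>X</child1>
    <child2>Y<grandchild>Z</grandchild></child2>
  </parent>
</root>"""

sax_handler = ChildNameHandler("parent")
xml.sax.parseString(xml_doc, sax_handler)
print(sax_handler.childlist)  # ['child1', 'child2']
```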
A DOM-style parser, such as xml.dom.minidom or lxml, is a lot easier to work with for these kinds of tasks because it keeps track of the relationships between elements for you. Such a parser may be a better choice for your needs:
from xml.dom.minidom import parseString
xml = """
<root>
<parent>
<child1>X</child1>
<child2>Y</child2>
</parent>
</root>
"""
dom = parseString(xml)
children = [c.localName for p in dom.getElementsByTagName("parent")
            for c in p.childNodes if c.nodeType == c.ELEMENT_NODE]
You'll notice that once the minidom module has parsed our XML, your query is a single Python statement (which contains two loops, of course, but it's a single statement nonetheless). You can't really achieve that level of conciseness with a SAX-style parser.
Now, SAX-style parsers are faster and use less memory than DOM parsers, which was important ten years ago, but the gap is much smaller on modern processors, especially on smallish documents. Programmer time is much more valuable.