Read XML with Python tree.getroot

I am new to Python, and I have this XML and this code. The XML is an invoice in which "SalesOrderRet" is the header and each "SalesOrderLineRet" is one line of the invoice. My problem is that I don't know how to read the SalesOrderLineRet elements separately for each header. The code below collects every "SalesOrderLineRet" in the entire XML file instead of only those that belong to the current header.
import xml.etree.ElementTree as ET

def read_xml():
    tree = ET.parse('LastResponse.xml')
    root = tree.getroot()
    form_data = {}
    collection = db["tracking"]
    for item in root.iter('SalesOrderRet'):
        WO = item.find('RefNumber').text
        TimeCreatedQB = item.find('TimeCreated').text
        Client = item.find('CustomerRef/FullName').text
        # Problem: this searches from root, so it visits every line in the file.
        for items in root.iter('SalesOrderLineRet'):
            descrip = getattr(items.find('Desc'), 'text', None)

For an XML file like this,
<?xml version="1.0"?>
<data>
    <SalesOrderRet>
        <SalesOrderLineRet>
            <RefNumber>1</RefNumber>
            <TimeCreated>0:00</TimeCreated>
            <CustomerRef>
                <FullName>John Doe</FullName>
            </CustomerRef>
        </SalesOrderLineRet>
        <SalesOrderLineRet>
            <RefNumber>2</RefNumber>
            <TimeCreated>0:00</TimeCreated>
            <CustomerRef>
                <FullName>Jack Doe</FullName>
            </CustomerRef>
        </SalesOrderLineRet>
    </SalesOrderRet>
    <SalesOrderRet>
        <SalesOrderLineRet>
            <RefNumber>3</RefNumber>
            <TimeCreated>0:00</TimeCreated>
            <CustomerRef>
                <FullName>Mary Doe</FullName>
            </CustomerRef>
        </SalesOrderLineRet>
        <SalesOrderLineRet>
            <RefNumber>4</RefNumber>
            <TimeCreated>0:00</TimeCreated>
            <CustomerRef>
                <FullName>Susan Doe</FullName>
            </CustomerRef>
        </SalesOrderLineRet>
    </SalesOrderRet>
</data>
This function reads the tags and attributes of each order separately, because it looks for <SalesOrderLineRet> only inside the current <SalesOrderRet>. If you have not done so already, index each <SalesOrderRet> and store its values under that index.
import xml.etree.ElementTree as ET

def get_xml(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    for SalesOrderRet in root:
        print(SalesOrderRet.tag, SalesOrderRet.attrib)
        # Search only inside the current header, not the whole document.
        for SalesOrderLineRet in SalesOrderRet.iter('SalesOrderLineRet'):
            print(' ', SalesOrderLineRet.tag, SalesOrderLineRet.attrib)
            WO = SalesOrderLineRet.find('RefNumber').text
            TimeCreatedQB = SalesOrderLineRet.find('TimeCreated').text
            Client = SalesOrderLineRet.find('CustomerRef/FullName').text
            print(' ', WO, TimeCreatedQB, Client)
This code is based on the docs.
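If the header fields sit directly under each SalesOrderRet, as the question describes, the same idea can be sketched as follows. This is only an illustration: the sample XML, the field values, and the function name read_orders are made up here, and the real QuickBooks response may be nested differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the question's description:
# header fields on SalesOrderRet, one Desc per SalesOrderLineRet.
SAMPLE = """
<data>
  <SalesOrderRet>
    <RefNumber>1001</RefNumber>
    <TimeCreated>2020-01-01T10:00:00</TimeCreated>
    <CustomerRef><FullName>John Doe</FullName></CustomerRef>
    <SalesOrderLineRet><Desc>Widget</Desc></SalesOrderLineRet>
    <SalesOrderLineRet><Desc>Gadget</Desc></SalesOrderLineRet>
  </SalesOrderRet>
  <SalesOrderRet>
    <RefNumber>1002</RefNumber>
    <TimeCreated>2020-01-02T11:00:00</TimeCreated>
    <CustomerRef><FullName>Mary Doe</FullName></CustomerRef>
    <SalesOrderLineRet></SalesOrderLineRet>
  </SalesOrderRet>
</data>
"""

def read_orders(root):
    """Return one dict per header, each with only its own lines."""
    orders = []
    for order in root.iter('SalesOrderRet'):
        # Iterate from `order`, not from `root`, to stay inside this header.
        # findtext() returns None when the child element is missing.
        lines = [line.findtext('Desc')
                 for line in order.iter('SalesOrderLineRet')]
        orders.append({
            'WO': order.findtext('RefNumber'),
            'TimeCreatedQB': order.findtext('TimeCreated'),
            'Client': order.findtext('CustomerRef/FullName'),
            'lines': lines,
        })
    return orders

orders = read_orders(ET.fromstring(SAMPLE))
for o in orders:
    print(o['WO'], o['Client'], o['lines'])
```

Using findtext() instead of find(...).text avoids an AttributeError when an optional child such as Desc is absent.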

Related

Python XML findall is returning the wrong thing

I want to read data from an XML file, but it's not returning the right thing.
I get only the first of the child nodes instead of all of them.
The XML looks something like this:
<?xml version="1.0" encoding="UTF-8" ?>
<medicalData>
    <pacijent> #patient1
        <lbo>12345678901</lbo>
        <ime>bob</ime>
        <prezime>smith</prezime>
        <datumRodj>13.10.1954.</datumRodj>
        <pregledi>nema</pregledi>
    </pacijent>
    <pacijent> #patient2
        <lbo>22345678901</lbo>
        <ime>bobert</ime>
        <prezime>smith</prezime>
        <datumRodj>30.03.2003</datumRodj>
        <pregledi>nema</pregledi>
    </pacijent>
    <lekar>
        <id>111</id>
        <ime>john</ime>
        <prezime>doe</prezime>
        <spacijalizacija>aaa</spacijalizacija>
    </lekar>
</medicalData>
Here, if I search for the patients like this:
d = etree.parse("pacijent.xml")
listaPodataka = d.getroot()
pacijenti = {}
p = []
for podatak in listaPodataka.findall('pacijent'):
    p.append(podatak)
for pacijent in p:
    lbo = pacijent[0].text
    ime = pacijent[1].text
    prezime = pacijent[2].text
    datumRodjenja = pacijent[3].text
    pregledi = pacijent[4].text
    pacijenti[lbo] = Pacijent(lbo, ime, prezime, datumRodjenja, pregledi)
return pacijenti
it returns patient1 but not patient2.
Any ideas what I am doing wrong? I have tried different solutions, but nothing I have tried seems to work.

Here (56605102.xml is the XML taken from your post):
import xml.etree.ElementTree as ET

root = ET.parse("56605102.xml")
for pacijent in root.findall('pacijent'):
    print(pacijent)
    for child in pacijent:
        print('\t' + child.tag + ':' + child.text)
Output:
<Element 'pacijent' at 0x108d70d68>
    lbo:12345678901
    ime:bob
    prezime:smith
    datumRodj:13.10.1954.
    pregledi:nema
<Element 'pacijent' at 0x108f50868>
    lbo:22345678901
    ime:bobert
    prezime:smith
    datumRodj:30.03.2003
    pregledi:nema
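A common cause of this symptom is a return statement that ends up inside the for loop, so the function exits after the first patient. Whether that is what happened in the question cannot be told from the post. The sketch below keeps the return after the loop and looks children up by tag name instead of by position. Pacijent is a stand-in namedtuple and load_patients is a made-up name, since the real class and function are not shown.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Stand-in for the asker's Pacijent class (an assumption for this sketch).
Pacijent = namedtuple('Pacijent', 'lbo ime prezime datumRodjenja pregledi')

SAMPLE = """
<medicalData>
  <pacijent>
    <lbo>12345678901</lbo><ime>bob</ime><prezime>smith</prezime>
    <datumRodj>13.10.1954.</datumRodj><pregledi>nema</pregledi>
  </pacijent>
  <pacijent>
    <lbo>22345678901</lbo><ime>bobert</ime><prezime>smith</prezime>
    <datumRodj>30.03.2003</datumRodj><pregledi>nema</pregledi>
  </pacijent>
  <lekar><id>111</id><ime>john</ime><prezime>doe</prezime></lekar>
</medicalData>
"""

def load_patients(root):
    """Build a dict of patients keyed by their lbo value."""
    pacijenti = {}
    for el in root.findall('pacijent'):
        lbo = el.findtext('lbo')
        pacijenti[lbo] = Pacijent(
            lbo,
            el.findtext('ime'),
            el.findtext('prezime'),
            el.findtext('datumRodj'),
            el.findtext('pregledi'),
        )
    # The return sits after the loop, so every <pacijent> is processed.
    return pacijenti

patients = load_patients(ET.fromstring(SAMPLE))
print(sorted(patients))
```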

Reshape xml using python?

I have an XML file like this
<data>
<B>Head1</B>
<I>Inter1</I>
<I>Inter2</I>
<I>Inter3</I>
<I>Inter4</I>
<I>Inter5</I>
<O>,</O>
<B>Head2</B>
<I>Inter6</I>
<I>Inter7</I>
<I>Inter8</I>
<I>Inter9</I>
<O>,</O>
<O> </O>
</data>
and I want the XML to look like
<data>
<combined>Head1 Inter1 Inter2 Inter3 Inter4 Inter5</combined>,
<combined>Head2 Inter6 Inter7 Inter8 Inter9</combined>
</data>
I tried to get all the values of "B" and "I" (using iter, since getiterator was removed in Python 3.9):
for value in mod.iter(tag='B'):
    print(value.text)
Head1
Head2
for value in mod.iter(tag='I'):
    print(value.text)
Inter1
Inter2
Inter3
Inter4
Inter5
Inter6
Inter7
Inter8
Inter9
Now, how should I save the values from the first group in one tag and those from the second group in a different tag? That is, how do I start at a "B" tag, collect all the "I" tags that follow it, and then start again and save them in a new tag whenever I find another "B" tag?
The "O" tag will always be present at the end.

You can use the ElementTree module from xml.etree:
from xml.etree import ElementTree

struct = """
<data>
{}
</data>
"""

def reformat(tree):
    root = tree.getroot()
    seen = []
    # root is the <data> element itself, so iterate over its children directly
    # (Element.getchildren() was removed in Python 3.9).
    for child in root:
        tx = child.text
        if tx == ',':
            # A comma closes the current group.
            yield "<combined>{}</combined>".format(' '.join(seen))
            seen = []
        else:
            seen.append(tx)

with open('test.xml') as f:
    tree = ElementTree.parse(f)
print(struct.format(',\n'.join(reformat(tree))))
Result:
<data>
<combined>Head1 Inter1 Inter2 Inter3 Inter4 Inter5</combined>,
<combined>Head2 Inter6 Inter7 Inter8 Inter9</combined>
</data>
Note that if you're not sure that all the blocks are separated by a comma, you can change the condition if tx == ',': to suit your file format. Alternatively, check whether tx starts with 'Head': if it does and seen is not empty, yield the contents of seen and clear it; otherwise append tx and continue.
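As an alternative to formatting strings, the grouping can also be keyed on the tag names, building real elements so that the serializer handles escaping. This is a minimal sketch under the assumption that every group starts with a B tag and that O tags are only separators; the shortened sample and the function name regroup are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's input.
SAMPLE = (
    "<data>"
    "<B>Head1</B><I>Inter1</I><I>Inter2</I><O>,</O>"
    "<B>Head2</B><I>Inter3</I><O>,</O><O> </O>"
    "</data>"
)

def regroup(root):
    """Start a new <combined> at each <B> and append following <I> texts."""
    new_root = ET.Element('data')
    current = None
    for child in root:
        if child.tag == 'B':
            current = ET.SubElement(new_root, 'combined')
            current.text = child.text or ''
        elif child.tag == 'I' and current is not None:
            current.text += ' ' + (child.text or '')
        # <O> separators are skipped; the tag change marks the boundary.
    return new_root

new_root = regroup(ET.fromstring(SAMPLE))
combined = [el.text for el in new_root]
print(ET.tostring(new_root, encoding='unicode'))
```

If the commas between the combined elements are needed in the output, they could be stored in each element's tail attribute before serializing.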

xmlns namespace breaking lxml

I am trying to open an xml file, and get values from certain tags. I have done this a lot but this particular xml is giving me some issues. Here is a section of the xml file:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
<provider>filmgroup</provider>
<language>en-GB</language>
<actor name="John Smith" display="Doe John"></actor>
</package>
And here is a sample of my Python code:
metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
    providerValue = tree.find('//provider')
    providerValue = providerValue.text
    print(providerValue)
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
When I run this, it can't find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer", then everything works as expected.
My question is: how can I remove this namespace, since I'm not at all interested in it, so that I can get the tag values I need using lxml?

The provider tag is in the http://apple.com/itunes/importer namespace, so you either need to use the fully qualified name
{http://apple.com/itunes/importer}provider
or use one of the lxml methods that has the namespaces parameter, such as root.xpath. Then you can specify it with a namespace prefix (e.g. ns:provider):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
namespaces = {'ns':'http://apple.com/itunes/importer'}
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                        namespaces=namespaces))
for provider, actor in zip(*[items]*2):
    print(provider, actor)
yields
filmgroup John Smith
Note that the XPath used above assumes that <provider> and <actor> elements always appear in alternation. If that is not true, then there are of course ways to handle it, but the code becomes a bit more verbose:
for package in root.xpath('//ns:package', namespaces=namespaces):
    for provider in package.xpath('ns:provider', namespaces=namespaces):
        providerValue = provider.text
        print(providerValue)
    for actor in package.xpath('ns:actor', namespaces=namespaces):
        print(actor.attrib['name'])
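The same two approaches (a prefix map, or the fully qualified name) also work with the standard library's xml.etree.ElementTree, whose find, findall, and findtext accept a namespaces mapping. The sketch below is only an illustration: it uses an inline copy of the question's XML with the actor tag repaired, and the prefix ns is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's document (XML declaration omitted).
SAMPLE = """
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>
"""

# The prefix 'ns' is arbitrary; only the URI has to match the document.
NS = {'ns': 'http://apple.com/itunes/importer'}

root = ET.fromstring(SAMPLE)

# Option 1: prefixed path plus a namespaces mapping.
provider = root.findtext('ns:provider', namespaces=NS)
actors = [a.get('name') for a in root.findall('ns:actor', NS)]

# Option 2: fully qualified name in Clark notation.
qualified = root.find('{http://apple.com/itunes/importer}provider').text

print(provider, actors, qualified)
```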

My suggestion is to not ignore the namespace but, instead, to take it into account. I wrote some related functions (copied with slight modification) for my work on the django-quickbooks library. With these functions, you should be able to do this:
providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')
Here are those functions:
class TagNotFound(Exception):
    """Raised when a tag is missing (defined here so the snippet is self-contained)."""

def get_tag_with_ns(tag_name, ns):
    return '{%s}%s' % (ns, tag_name)

def getel(elt, tag_name, ns=None):
    """ Gets the first tag that matches the specified tag_name taking into
    account the QB namespace.
    :param ns: The namespace to use if not using the default one for
    django-quickbooks.
    :type ns: string
    """
    res = elt.find(get_tag_with_ns(tag_name, ns=ns))
    if res is None:
        raise TagNotFound('Could not find tag by name "%s"' % tag_name)
    return res

def getels(elt, *path, **kwargs):
    """ Gets the first set of elements found at the specified path.
    Example:
    >>> xml = (
    "<root>" +
    "<item>" +
    "<id>1</id>" +
    "</item>" +
    "<item>" +
    "<id>2</id>" +
    "</item>" +
    "</root>")
    >>> el = etree.fromstring(xml)
    >>> getels(el, 'root', 'item', ns='correct/namespace')
    [<Element item>, <Element item>]
    """
    ns = kwargs['ns']
    i = -1
    # Walk down every path component except the last one.
    for i in range(len(path)-1):
        elt = getel(elt, path[i], ns=ns)
    # The last component is the tag whose matches are returned.
    tag_name = path[i+1]
    return elt.findall(get_tag_with_ns(tag_name, ns=ns))

XML Parsing in Python using document builder factory

I am working with STAF and STAX, where Python is used for coding. I am new to Python.
Basically, my task is to parse an XML file in Python using the DocumentBuilderFactory parser.
The XML file I am trying to parse is:
<?xml version="1.0" encoding="utf-8"?>
<operating_system>
    <unix_80sp1>
        <tests type="quick_sanity_test">
            <prerequisitescript>preparequicksanityscript</prerequisitescript>
            <acbuildpath>acbuildpath</acbuildpath>
            <testsuitscript>test quick sanity script</testsuitscript>
            <testdir>quick sanity dir</testdir>
        </tests>
        <machine_name>u80sp1_L004</machine_name>
        <machine_name>u80sp1_L005</machine_name>
        <machine_name>xyz.pxy.dxe.cde</machine_name>
        <vmware id="155.35.3.55">144.35.3.90</vmware>
        <vmware id="155.35.3.56">144.35.3.91</vmware>
    </unix_80sp1>
</operating_system>
I need to read all the tags.
I need to read the machine_name tags into a list, so that all the machine names end up in a list called machname.
So machname should be [u80sp1_L004, u80sp1_L005, xyz.pxy.dxe.cde] after reading the tags.
I also need all the vmware tags:
all the attributes should be in vmware_attr = [155.35.3.55, 155.35.3.56]
all the vmware values should be in vmware_value = [144.35.3.90, 144.35.3.91]
I am able to read all the tags properly except the vmware and machine_name tags.
I am using the following code (I am new to XML and VMware). Help is required.
The code below needs to be modified.
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = None
vmware_attr = None
machname = None
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value = thisChild.getNodeValue()
                # vmware_attr = ???  (which method should I use here?)
# Get the text value for the element with tag name "machine_name"
nodeList = document.getElementsByTagName("machine_name")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                machname = thisChild.getNodeValue()
Also, how do I check whether a tag exists at all? I need to code the parsing properly.

You need to instantiate vmware_value, vmware_attr and machname as lists, not as single values, so instead of this:
vmware_value = None
vmware_attr = None
machname = None
do this:
vmware_value = []
vmware_attr = []
machname = []
Then, to add items to the lists, use their append method. E.g.:
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = []
vmware_attr = []
machname = []
# Get the id attribute and text value for each element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        # getAttribute exists on DOM elements in both xml.dom and the Java DOM API.
        vmware_attr.append(node.getAttribute("id"))
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value.append(thisChild.getNodeValue())
I've also edited the code into something I think should append the correct values to vmware_attr and vmware_value. The machine_name loop needs the same change: append to machname instead of assigning to it.
I had to assume that STAX exposes xml.dom-style syntax, so if that isn't the case, you will have to adapt my suggestion accordingly.
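For comparison, here is how the same collection looks in plain CPython with the standard library's xml.dom.minidom. This is only a sketch of the list-building idea and is not STAX or Jython code; the trimmed sample document and the helper name node_text are invented for illustration. It also shows one way to check whether a tag exists at all: getElementsByTagName returns an empty list when nothing matches.

```python
from xml.dom import minidom

# Trimmed copy of the question's document (the <tests> block is left out).
SAMPLE = """
<operating_system>
  <unix_80sp1>
    <machine_name>u80sp1_L004</machine_name>
    <machine_name>u80sp1_L005</machine_name>
    <machine_name>xyz.pxy.dxe.cde</machine_name>
    <vmware id="155.35.3.55">144.35.3.90</vmware>
    <vmware id="155.35.3.56">144.35.3.91</vmware>
  </unix_80sp1>
</operating_system>
"""

def node_text(node):
    """Concatenate the text children of a DOM element (hypothetical helper)."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)

document = minidom.parseString(SAMPLE)

vmware_nodes = document.getElementsByTagName('vmware')
vmware_attr = [n.getAttribute('id') for n in vmware_nodes]
vmware_value = [node_text(n) for n in vmware_nodes]
machname = [node_text(n)
            for n in document.getElementsByTagName('machine_name')]

# An empty result means the tag does not occur in the document.
has_tests = len(document.getElementsByTagName('tests')) > 0

print(machname, vmware_attr, vmware_value, has_tests)
```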

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
Now I want to write the result to a file, but I don't see how to get to ElementTree.ElementTree.write():
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?

This will get you a dict, minus attributes. I don't know if this is useful to anyone; I was looking for an XML-to-dict solution myself when I came up with it.
import xml.etree.ElementTree as etree

tree = etree.parse('test.xml')
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    # list(el) replaces Element.getchildren(), which was removed in Python 3.9.
    children = list(el)
    if children:
        # A list comprehension keeps the result a real list on Python 3.
        d[el.tag] = [xml_to_dict(child) for child in children]
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
          {'from': 'Jani'},
          {'heading': 'Reminder'},
          {'body': "Don't forget me this weekend!"}]}

I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
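The remove-then-write workflow can be tried end to end on an in-memory template. The sketch below assumes a template shaped like the question's dictionary keys (root-name, level1-name, leaf1, leaf2); the leaf3 element and its text are made up to show a text change.

```python
import xml.etree.ElementTree as ET

# Hypothetical template mirroring the names used in the question.
TEMPLATE = (
    "<root-name><level1-name>"
    "<leaf1>a</leaf1><leaf2>b</leaf2><leaf3>c</leaf3>"
    "</level1-name></root-name>"
)

root = ET.fromstring(TEMPLATE)
lvl1 = root.find('level1-name')

# Remove the unwanted leaves, skipping any that are not present.
for name in ('leaf1', 'leaf2'):
    leaf = lvl1.find(name)
    if leaf is not None:
        lvl1.remove(leaf)

# Change the text of a remaining element.
lvl1.find('leaf3').text = 'changed'

result = ET.tostring(root, encoding='unicode')
print(result)

# To save to disk instead, wrap the root in an ElementTree:
# ET.ElementTree(root).write('config-new.xml')
```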

For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup(open('config-template.xml')) # parse the XML file (pass the markup or a file object, not a filename)
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level:
    print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>

My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary = {}
    tag = element.tag[l:]
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    # list(element) replaces getchildren(), which was removed in Python 3.9.
    children = list(element)
    if children:
        subdictionary = {}
        for child in children:
            for k, v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    # Repeated subtags are collected into a list.
                    if (isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k, v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags; I strip it here because there is a single namespace for the entire document.
Note that I adjusted the raw XML too, so that 'empty' tags produce at most a ' ' text property in the ElementTree representation:
import re
from xml.etree import ElementTree

spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
This would give, for instance:
{'note': {'to': 'Tove',
          'from': 'Jani',
          'heading': 'Reminder',
          'body': "Don't forget me this weekend!"}}
It's designed for specific XML that is basically equivalent to JSON, and it should also handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
It would also be possible to merge the attribute dictionary and the subtag dictionary, similarly to how repeated subtags are merged, although nesting the lists seems kind of appropriate :-)

By adding this line
d.update(('@' + k, v) for k, v in el.attrib.items())
to user247686's code, you get the node attributes too.
Found it in this post: https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib.request import urlopen

xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()

def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    # list(el) replaces Element.getchildren(), which was removed in Python 3.9.
    children = list(el)
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    # Store each attribute under an '@'-prefixed key.
    d.update(('@' + k, v) for k, v in el.attrib.items())
    return d
Call as
xml_to_dict(root)

Have you tried this?
print(xml.etree.ElementTree.tostring(conf_new))

The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    # Iterate over the element directly; getchildren() was removed in Python 3.9.
    for part in data:
        xdic[part.tag] = part.text

XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project to handle round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways is XML Support in Pickling Tools. Version 1.3 and newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
