I have the following test.xml:
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
What I'm trying to accomplish is the following: I want to iterate through test.xml and, for every parent, put all of its child nodes in a dictionary where the tag is the key and the text is the value. Once I reach the end of the parent, I want to add that dictionary to the database, reset it, and move on to the next parent.
So for the first parent I would want:
insert = {'ID':1,'child1':'value1','child2':'value11','subchild':'value111'}
I would use it in an SQL query, then move on to the next parent, reset the dictionary, and do the same thing.
Not every parent has the same number of children, and some children have sub-children.
I have tried with:
from xml.etree import ElementTree as ET
value = []
tag = []
# tree is the parsed document (loading it is omitted here)
for parent in tree.iter():
    for child in parent:
        value.append(child.text)
        tag.append(child.tag)
But I couldn't figure out how to get my desired results. I left out retrieving and opening the XML to keep the post as simple as possible. This is the method I was attempting to use, but I don't think it's the right one, because I haven't been able to stop the iteration at the end of each parent tag in order to do the insert.
Any help would be greatly appreciated! Thanks.
Try this using the lxml library:
from lxml import etree
source = """
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
"""
document = etree.fromstring(source)
inserts = []
id_number = 3  # only the parent with this ID is collected
for parent in document.findall('parent'):
    insert = {}
    matched = False
    for element in parent.iterdescendants():
        if element.tag == 'ID' and element.text == str(id_number):
            matched = True
        # keep only leaf elements (those without children)
        if len(element) == 0:
            insert[element.tag] = element.text
    if matched:
        inserts.append(insert)
print(inserts)
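For comparison, here is a minimal sketch of the same idea using only the standard library, building one dictionary per parent without filtering on a particular ID. The helper name parent_to_dict and the shortened sample document are assumptions made for illustration, and the database insert is left as a comment because the question does not show it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (first two parents only).
source = """<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
</root>"""


def parent_to_dict(parent):
    """Map tag -> text for every leaf element below one <parent>."""
    row = {}
    for element in parent.iter():
        # Skip the parent itself and any element that only wraps children.
        if element is parent or len(element) > 0:
            continue
        row[element.tag] = element.text
    return row


root = ET.fromstring(source)
rows = []
for parent in root.findall("parent"):
    row = parent_to_dict(parent)  # a fresh dictionary for every parent
    rows.append(row)
    # The database insert for this parent would go here.
    print(row)
```

Note that a dictionary keeps only one value per tag, so for the second parent the repeated child2 ends up holding the last value. All values, including ID, are strings.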
There is also an etree API shipped with Python (though it lacks pretty printing and some other features that lxml has): http://docs.python.org/library/xml.etree.elementtree.html
I am trying to parse an XML file in Python with the built-in xml module and ElementTree, but whatever I try according to the documentation, it does not give me what I need.
I am trying to extract all the value tags into a list:
<?xml version="1.0" encoding="UTF-8"?>
<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
<fullName>testPicklist__c</fullName>
<externalId>false</externalId>
<label>testPicklist</label>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<type>Picklist</type>
<valueSet>
<restricted>true</restricted>
<valueSetDefinition>
<sorted>false</sorted>
<value>
<fullName>a 32</fullName>
<default>false</default>
<label>a 32</label>
</value>
<value>
<fullName>23 432;:</fullName>
<default>false</default>
<label>23 432;:</label>
</value>
Here is the example code that I can't get to work. It's very basic, and the only thing I have issues with is the XPath.
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
print(root.findall(".//value"))
print(root.findall(".//*/value"))
print(root.findall("./*/value"))
Since the root element has attribute xmlns="http://soap.sforce.com/2006/04/metadata", every element in the document will belong to this namespace. So you're actually looking for {http://soap.sforce.com/2006/04/metadata}value elements.
To find all <value> elements in this document, you have to pass the namespaces argument to the findall() function:
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
# get the namespace of root
ns = root.tag.split('}')[0][1:]
# create a dictionary with the namespace
ns_d = {'my_ns': ns}
# get all the values
values = root.findall('.//my_ns:value', namespaces=ns_d)
# print the values
for value in values:
    print(value)
Outputs:
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043ba0>
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043e20>
Alternatively, you can search for the fully qualified name {http://soap.sforce.com/2006/04/metadata}value directly:
# get all the values
values = root.findall('.//{http://soap.sforce.com/2006/04/metadata}value')
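To end up with a list of values rather than Element objects, the text of a child can be read from each match. The following is a minimal sketch, assuming a shortened and properly closed version of the document from the question; the prefix md and the choice of fullName as the text to collect are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the XML in the question.
xml_text = """<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
  <fullName>testPicklist__c</fullName>
  <valueSet>
    <valueSetDefinition>
      <value>
        <fullName>a 32</fullName>
        <label>a 32</label>
      </value>
      <value>
        <fullName>23 432;:</fullName>
        <label>23 432;:</label>
      </value>
    </valueSetDefinition>
  </valueSet>
</CustomField>"""

root = ET.fromstring(xml_text)

# The prefix is arbitrary; only the URI has to match the document.
ns = {"md": "http://soap.sforce.com/2006/04/metadata"}

# Collect the fullName text of every <value> element.
names = [
    value.findtext("md:fullName", namespaces=ns)
    for value in root.findall(".//md:value", namespaces=ns)
]
print(names)
```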
I am reading an XLIFF file and want to retrieve a specific element. I tried to print all the elements using:
from lxml import etree
# raw string, so that \t and \f in the path are not treated as escape sequences
with open(r'path\to\file\.xliff', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
root = tree.getroot()
for element in root.iter():
    print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to select a specific element (with the help of many posts here), as a step towards getting the source tag:
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the value of each segment and its corresponding source.
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because xpath() does not understand the {namespace}tag notation. You need to use findall().
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
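To pair each segment with its source text, the same findall() pattern can be followed by a find() on each match. Below is a minimal sketch using the standard-library ElementTree so that it runs without lxml installed; lxml's findall() and find() accept the same {namespace}tag notation. The sample XML is the one from the question with the attributes quoted, and the names NS and pairs are illustrative.

```python
import xml.etree.ElementTree as ET

# XML from the question, with the ill-formed id attributes corrected.
xml_text = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id="1">
<source>
Hello world
</source>
</segment>
<segment id="2">
<source>
2nd statement
</source>
</segment>
</xliff>"""

NS = "{urn:oasis:names:tc:xliff:document:2.0}"

root = ET.fromstring(xml_text)
pairs = []
for segment in root.findall(".//" + NS + "segment"):
    source = segment.find(NS + "source")
    # strip() removes the newlines surrounding the text in this sample
    pairs.append((segment.get("id"), source.text.strip()))

for segment_id, text in pairs:
    print(segment_id, text)
```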
I'm trying to add a new element to an XML file using the Python ElementTree library with the following code.
from xml.etree import ElementTree as et
def UpdateXML(pre):
    xml_file = "place/file.xml"
    tree = et.parse(xml_file)
    root = tree.getroot()
    for parent in root.findall('Parent'):
        et.SubElement(parent, "NewNode", attribute=pre)
    tree.write(xml_file)
The XML I want it to produce is in the following format:
<Parent>
<Child1 Attribute="Stuff"/>
<NewNode Attribute="MoreStuff"/> <--- new
<Child3>
<Child4>
<CHild5>
<Child6>
</Parent>
However, the XML it actually produces is in this incorrect format:
<Parent>
<Child1 Attribute="Stuff"/>
<Child3>
<Child4>
<CHild5>
<Child6>
<NewNode Attribute="MoreStuff"/> <--- new
</Parent>
What do I change in my code to render the correct xml?
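You want the insert operation, which places the element at a given index instead of appending it as the last child the way SubElement does: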
node = et.Element('NewNode')
parent.insert(1,node)
Which in my testing gets me:
<Parent>
<Child1 Attribute="Stuff" />
<NewNode /><Child3 />
<Child4 />
<CHild5 />
<Child6 />
</Parent>
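The output above lacks the attribute from the question. A minimal sketch that also sets it is shown below; the attribute value "MoreStuff" and the shortened, self-closed sample document are assumptions taken from the desired output in the question.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the child elements written as self-closing tags.
parent = ET.fromstring(
    '<Parent><Child1 Attribute="Stuff"/><Child3/><Child4/></Parent>'
)

# Attributes are passed as a dict so the name keeps its capital "A".
new_node = ET.Element("NewNode", {"Attribute": "MoreStuff"})

# Index 1 places the new element directly after the first child.
parent.insert(1, new_node)

tags = [child.tag for child in parent]
print(tags)
print(ET.tostring(parent, encoding="unicode"))
```

Passing the attribute as a keyword argument (attribute=pre) would create a lowercase attribute name, which is why the dictionary form is used here.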
I have some XML:
<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>
I'm using Python and the etree module, and I'd like to select all <parent> nodes whose child starts with "foo". I know etree has limited XPath support, but I'm an XPath rookie, so I'm struggling to land on the best solution. I'd think something to this effect:
parent[(contains(child,'foo'))]
but I would want to reject parent nodes that contain foo without starting with it (i.e. <child>125456foo</child>), so I'm not sure this would work. Furthermore, I'm not sure etree supports this level of XPath...
EDIT:
Another acceptable solution would be to select parents whose children's text is in a list.
Pseudocode:
parent=>child[text = "foo1" || "bar1" || "bar2"]
Is that possible?
This will get what you want:
[elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Watch it in action:
s = """<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>"""
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
elems = [elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Checking the data:
for elem in elems:
    print(elem.find('child').text)
>>>
foo987654
foo123456
As the xml.etree documentation shows, this library does not support XPath's contains() function. My suggestion would be to select all parent elements first and then iterate over the results, discarding those whose child text does not start with foo.
With XPath, using lxml:
import lxml.html
doc = lxml.html.document_fromstring(s)
for e in doc.xpath(".//child[starts-with(text(), 'foo')]"):
    print(e.text)
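The alternative raised in the EDIT, selecting parents whose child text is in a list, can be handled with the same filter-in-Python approach. This is a minimal sketch with the standard library; the contents of wanted and the helper child_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

s = """<root>
<parent><child>foo987654</child></parent>
<parent><child>bar15245</child></parent>
<parent><child>baz87742</child></parent>
<parent><child>foo123456</child></parent>
</root>"""

root = ET.fromstring(s)

# Hypothetical list of accepted values.
wanted = {"foo987654", "baz87742"}


def child_text(parent):
    """Return the text of <child>, or '' when the element is missing."""
    return parent.findtext("child", default="")


# Parents whose child text starts with "foo".
starts_with_foo = [
    p for p in root.findall("parent") if child_text(p).startswith("foo")
]

# Parents whose child text is one of the wanted values.
in_list = [p for p in root.findall("parent") if child_text(p) in wanted]

print([child_text(p) for p in starts_with_foo])
print([child_text(p) for p in in_list])
```

Using findtext() with a default avoids the AttributeError that elem.find('child').text raises when a parent has no child element.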
How can I retrieve the value of code from the XML string below when using xml.dom.minidom?
<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>
Because multiple 'name' tags can appear I would like to do something like this:
from xml.dom.minidom import parseString
dom = parseString("<data>...</data>")
dom.getElementsByTagName("element1").getElementsByTagName("name")
But that doesn't work unfortunately.
The code below worked fine for me. I assumed you have multiple tags with the same name and want to get the name from the second one.
myxml = """\
<data>
<element>
<name>myname</name>
</element>
<element>
<code>3</code>
<name>another name</name>
</element>
</data>
"""
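import xml.dom.minidom
dom = xml.dom.minidom.parseString(myxml)
# getElementsByTagName returns a list, so pick one element before searching inside it
nodelist = dom.getElementsByTagName("element")[1].getElementsByTagName("name")
for node in nodelist:
    print(node.toxml())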
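For the original XML in the question, with element1 and element2, the value of code can be read the same way. The sketch below assumes each tag occurs exactly once; the variable names are illustrative.

```python
from xml.dom.minidom import parseString

xml_text = """<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>"""

dom = parseString(xml_text)

# getElementsByTagName returns a NodeList, so index into it first.
element2 = dom.getElementsByTagName("element2")[0]

# The text lives in a child text node of the <code> element.
code_value = element2.getElementsByTagName("code")[0].firstChild.data
name_value = element2.getElementsByTagName("name")[0].firstChild.data

print(code_value, name_value)
```

The attempt in the question fails because it calls getElementsByTagName on the returned list rather than on one of the elements in it.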