Adding a new XML element using the Python ElementTree library

I'm trying to add a new element to an XML file using the Python ElementTree library with the following code:
from xml.etree import ElementTree as et
def UpdateXML(pre):
    xml_file = 'place/file.xml'
    tree = et.parse(xml_file)
    root = tree.getroot()
    for parent in root.findall('Parent'):
        et.SubElement(parent, "NewNode", Attribute=pre)
    tree.write(xml_file)
The XML I want it to render is in the following format
<Parent>
<Child1 Attribute="Stuff"/>
<NewNode Attribute="MoreStuff"/> <--- new
<Child3>
<Child4>
<CHild5>
<Child6>
</Parent>
However the xml it actually renders is in this incorrect format
<Parent>
<Child1 Attribute="Stuff"/>
<Child3>
<Child4>
<CHild5>
<Child6>
<NewNode Attribute="MoreStuff"/> <--- new
</Parent>
What do I change in my code to render the correct xml?

You want the insert operation:
node = et.Element('NewNode')
parent.insert(1,node)
Which in my testing gets me:
<Parent>
<Child1 Attribute="Stuff" />
<NewNode /><Child3 />
<Child4 />
<CHild5 />
<Child6 />
</Parent>
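For reference, here is a minimal self-contained sketch of the insert approach, including the attribute from the question. The sample XML is a made-up, well-formed stand-in for the question's file, and copying the tail from the previous sibling is just one way to keep the new element on its own line.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the file in the question.
xml_text = """<Parent>
  <Child1 Attribute="Stuff"/>
  <Child3/>
  <Child4/>
</Parent>"""

parent = ET.fromstring(xml_text)

# Build the element first, then insert it at index 1,
# i.e. directly after the first child.
new_node = ET.Element("NewNode", {"Attribute": "MoreStuff"})
parent.insert(1, new_node)

# A fresh element has no tail, so the next tag would follow on the
# same line. Reusing the previous sibling's tail keeps the line break.
new_node.tail = parent[0].tail

child_tags = [child.tag for child in parent]
print(ET.tostring(parent, encoding="unicode"))
```

SubElement always appends at the end of the parent, which is why the original code put NewNode last; insert takes an explicit position instead.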

Related

How to modify XML element using Python Elementtree

I would like to modify one value inside an attribute (e.g. change the value of "strokeColor" inside the "style" attribute) while leaving the other values of that attribute unchanged. I'm using the ElementTree module that ships with Python.
Here is an example of what I did before:
Part of my XML example code:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
My python code:
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
target = tree.find('.//mxCell[@id="line1"]')
target.set("strokeColor","#FF0000")
tree.write('output.xml')
My output XML:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" strokeColor="#FF0000" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
As you can see, this adds a new attribute called "strokeColor" instead of changing the strokeColor value inside the "style" attribute. I want to change the strokeColor inside the "style" attribute. How can I fix this?
Another method, using the simplified_scrapy library:
from simplified_scrapy import SimplifiedDoc, utils, req
html = '''
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
'''
doc = SimplifiedDoc(html)
mxCell = doc.select('mxCell#line1')
style = doc.replaceReg(mxCell['style'],'strokeColor=.*?;','strokeColor=#FF0000;')
mxCell.setAttr('style',style)
print(doc.html)
Result:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#FF0000;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
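If you would rather stay with ElementTree, the same result can probably be had by treating the style attribute as a semicolon-separated list of key=value entries. This is only a sketch: set_style_value is a hypothetical helper, not part of any library, and the XML is a shortened version of the one in the question, wrapped in a made-up root element.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <root> wrapper is an assumption for illustration.
xml_text = (
    '<root><mxCell id="line1" '
    'style="endArrow=none;strokeWidth=5;strokeColor=#32AC2D;rounded=0;"/></root>'
)

def set_style_value(style, key, value):
    """Return the style string with `key` set to `value`; other entries untouched."""
    parts = [p for p in style.split(";") if p]
    updated = []
    found = False
    for part in parts:
        name, _sep, _old = part.partition("=")
        if name == key:
            updated.append(name + "=" + value)
            found = True
        else:
            updated.append(part)
    if not found:
        # Key was not present yet, so add it at the end.
        updated.append(key + "=" + value)
    return ";".join(updated) + ";"

root = ET.fromstring(xml_text)
target = root.find('.//mxCell[@id="line1"]')
# Rewrite the existing attribute instead of adding a new one.
target.set("style", set_style_value(target.get("style"), "strokeColor", "#FF0000"))
new_style = target.get("style")
print(new_style)
```

Splitting on ";" avoids the regular expression and leaves entries without an "=" sign as they are.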

How can I use python to remove xml node

I want to remove elements from XML files. With ElementTree I can get all the elements from a file, but I lose the XML declaration ("statement") and the comments ("annotations").
So if I use:
# get xml nodes
tree = ElementTree.parse(file_path)
# do filter things ...
# write to files
tree.write(file_path)
the declaration and the comments are lost. Is there a way to remove elements from *.xml files and keep the comments, the declaration, and everything else in the files?
For example, the source:
<?xml version="1.0" encoding="utf-8"?>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
And my target:
<?xml version="1.0" encoding="utf-8"?>
<!-- I am annotation -->
<string left="left">left things</string>
But when I use tree.write(file_path), the comment and the declaration are dropped and the file becomes:
<string left="left">left things</string>
This is possible with lxml, whose parser provides a remove_comments=False option to preserve XML comments:
from lxml import etree
parser = etree.XMLParser(remove_comments=False)
tree = etree.parse("input.xml", parser=parser)
root = tree.getroot()
for c in root.findall(".//string[@name='name']"):
    root.remove(c)
tree.write("output.xml")
"input.xml" :
<root>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
</root>
"output.xml" :
<root>
<!-- I am annotation -->
<string left="left">left things</string>
</root>
Related question :
How to prevent xml.ElementTree fromstring from dropping commentnode
Use ElementTree from the standard library (https://docs.python.org/2/library/xml.etree.elementtree.html):
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
for elem in root.findall(".//string[@name='name']"):
    root.remove(elem)
tree.write('output_data.xml')
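Note that the default ElementTree parser drops comments. On Python 3.8 or newer (an assumption about your environment), the standard library can keep them by passing a TreeBuilder created with insert_comments=True to the parser. The sketch below uses the wrapped sample from the lxml answer, since the comment has to sit inside the root element to end up in the tree.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
</root>"""

# insert_comments=True (Python 3.8+) keeps comment nodes in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(xml_text, parser=parser)

# findall returns a list, so removing while looping over it is safe.
for elem in root.findall("string[@name='name']"):
    root.remove(elem)

result = ET.tostring(root, encoding="unicode")
print(result)
```

When writing to a file, tree.write(path, encoding="utf-8", xml_declaration=True) should emit a declaration again, although it is regenerated rather than copied from the input.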

Find non-root parent node where child contains some text

I have some xml;
<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>
I'm using Python and the etree module and I'd like to select all <parent> nodes whose child starts with "foo". I know etree has limited XPath support, but I'm an XPath rookie, so I'm struggling to land on the best solution. I'd think something to this effect:
parent[(contains(child,'foo'))]
but I would want to reject parent nodes that contain foo but don't start with foo (i.e. <child>125456foo</child>), so I'm not sure this would work. Furthermore, I'm not sure etree supports this level of XPath...
EDIT:
Another acceptable solution would be to select parents whose children's text is in a list.
pseudo code
parent=>child[text = "foo1" || "bar1" || "bar2"]
Is that possible?
This will get what you want:
[elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Watch it in action:
s = """<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>"""
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
elems = [elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Checking the data:
for elem in elems:
    print(elem.find('child').text)
>>>
foo987654
foo123456
As you can see from the xml.etree documentation, this library doesn't support the XPath contains() function. My suggestion would be to select all parent elements with the path parent and then iterate over the results, discarding those whose child's text does not start with foo.
With XPath, using lxml:
import lxml.html
doc = lxml.html.document_fromstring(s)
for e in doc.xpath(".//child[starts-with(text(), 'foo')]"):
    print(e.text)
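The list variant from the EDIT can be handled the same way with plain ElementTree. A small sketch, assuming each parent has at most one child element of interest; child_text is a hypothetical helper that also guards against a missing or empty child, which would otherwise raise an AttributeError in the one-liner above. Two extra rows were added to the sample to exercise those cases.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
  <parent><child>foo987654</child></parent>
  <parent><child>bar15245</child></parent>
  <parent><child>125456foo</child></parent>
  <parent><child/></parent>
  <parent><child>foo123456</child></parent>
</root>"""
root = ET.fromstring(xml_text)

def child_text(parent):
    """Text of the first <child>, or '' when the child or its text is missing."""
    return parent.findtext("child") or ""

# Parents whose child text starts with "foo" (not merely contains it).
starts_with_foo = [p for p in root.findall("parent")
                   if child_text(p).startswith("foo")]

# Parents whose child text is one of a fixed set of values.
wanted = {"bar15245", "foo123456"}
in_list = [p for p in root.findall("parent") if child_text(p) in wanted]

foo_texts = [child_text(p) for p in starts_with_foo]
list_texts = [child_text(p) for p in in_list]
print(foo_texts)
print(list_texts)
```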

Accessing Elements with and without namespaces using lxml

Is there a way to search, in a single pass, for the same element when it occurs both with and without a namespace in a document, using lxml? As an example, I want to get all occurrences of the element identifier irrespective of whether it is associated with a specific namespace. Currently I can only access them separately, as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print(l.text)
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print(l.text)
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
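A standard-library alternative may also work if lxml is not a requirement: since Python 3.8, ElementTree paths accept {*}name, which matches the local name in any namespace or in none. The sketch below uses a trimmed-down version of the file from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of xmlfile.xml from the question.
xml_text = """<record>
  <header><identifier>identifier1</identifier></header>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
      <identifier>identifier3</identifier>
    </provenance>
  </about>
</record>"""
root = ET.fromstring(xml_text)

# "{*}identifier" selects identifier elements regardless of namespace
# (requires Python 3.8 or newer).
identifiers = [el.text for el in root.findall(".//{*}identifier")]
print(identifiers)
```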

XML to MYSQL Using Python

I have the following test.xml:
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
What I'm trying to accomplish is the following: I want to iterate through test.xml and, for every parent, put all of the child nodes in a dictionary where the tag is the key and the text is the value. Once I get to the end of the parent, I want to add that dictionary to the database, reset it, and move on to the next parent.
So for the first parent I would want
insert = {'ID':1,'child1':'value1','child2':'value11','subchild':'value111'}
to use in an SQL query, and then move on to the next parent, reset the dictionary, and do the same thing.
Not every parent has the same number of children, and some children have sub-children.
I have tried with:
from xml.etree import ElementTree as ET
value = []
tag = []
for parent in tree.iter():
    for child in parent:
        value.append(child.text)
        tag.append(child.tag)
But I couldn't figure out how to get my desired results. I left out retrieving and opening the XML to keep the post as simple as possible. This is the approach I was attempting, but I don't think it's the right one, because I haven't been able to stop the iteration at the end of each parent tag in order to insert.
Any help would be greatly appreciated! Thanks.
Try this using the lxml library:
from lxml import etree
source = """
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
"""
document = etree.fromstring(source)
inserts = []
id_number = 3  # only the parent with this ID is collected
for parent in document.findall('parent'):
    insert = {}
    cont = 0
    for element in parent.iterdescendants():
        if element.tag == 'ID':
            if element.text == str(id_number):
                cont = 1
        # keep only leaf elements (those without children)
        if len(element) == 0:
            insert[element.tag] = element.text
    if cont:
        inserts.append(insert)
print(inserts)
There is also an etree API shipped with python (it does not have pretty printing and some other features that lxml has though): http://docs.python.org/library/xml.etree.elementtree.html
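For comparison, here is a rough sketch of the same idea with the standard-library API, building one dictionary per parent without filtering on ID. parent_to_row is a hypothetical helper name, and the database call is left as a comment because the table layout is not given in the question. Values stay strings, and a repeated tag (such as the second child2 under ID 2 in the full file) would overwrite the earlier value.

```python
import xml.etree.ElementTree as ET

# Shortened version of test.xml from the question.
xml_text = """<root>
  <parent>
    <ID>1</ID>
    <child1>Value1</child1>
    <child2>value11</child2>
    <child3><subchild>value111</subchild></child3>
  </parent>
  <parent>
    <ID>2</ID>
    <child1>value2</child1>
    <child2>value22</child2>
  </parent>
</root>"""
root = ET.fromstring(xml_text)

def parent_to_row(parent):
    """Collect tag -> text for every leaf element below `parent`."""
    row = {}
    for element in parent.iter():
        # Skip the parent itself and any element that has children.
        if element is not parent and len(element) == 0:
            row[element.tag] = element.text
    return row

rows = [parent_to_row(p) for p in root.findall("parent")]
for row in rows:
    # Hand `row` to the database layer here, one insert per parent.
    print(row)
```

Because a fresh dictionary is created for each parent, there is nothing to reset between iterations.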
