How to modify an XML element using Python ElementTree

I would like to modify a single key's value inside an attribute (e.g. change the value of "strokeColor" inside the "style" attribute) while leaving the other values of that attribute unchanged. I'm using the ElementTree module included with Python.
Here is an example of what I did before:
Part of my XML example code:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
My python code:
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
target = tree.find('.//mxCell[@id="line1"]')
target.set("strokeColor","#FF0000")
tree.write('output.xml')
My output XML:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" strokeColor="#FF0000" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
As you can see, a new attribute called "strokeColor" was added, but the strokeColor value inside the "style" attribute was not changed. I want to change the strokeColor inside the "style" attribute. How can I fix this?
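One way to do this with the standard library alone is to read the "style" attribute, rewrite only the strokeColor entry, and set the whole attribute back. The following is a minimal sketch, not a definitive implementation: the helper name set_style_value and the wrapping <root> element are assumptions made for illustration (a path such as './/mxCell' only matches descendants, so the mxCell is placed under a parent here).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the mxCell from the question wrapped in a <root> element.
xml_text = (
    '<root><mxCell edge="1" id="line1" parent="1" '
    'style="endArrow=none;html=1;strokeWidth=5;strokeColor=#32AC2D;rounded=0;" '
    'value="" /></root>'
)


def set_style_value(style, key, value):
    """Return the style string with key set to value, other entries untouched."""
    entries = [e for e in style.split(";") if e]
    updated = []
    replaced = False
    for entry in entries:
        name, _, _old = entry.partition("=")
        if name == key:
            updated.append(f"{key}={value}")
            replaced = True
        else:
            updated.append(entry)
    if not replaced:
        # The key was not present, so append it as a new entry.
        updated.append(f"{key}={value}")
    return ";".join(updated) + ";"


root = ET.fromstring(xml_text)
target = root.find('.//mxCell[@id="line1"]')
target.set("style", set_style_value(target.get("style"), "strokeColor", "#FF0000"))
print(target.get("style"))
```

Comparing the entry name exactly (rather than searching for the substring "strokeColor") keeps similarly named keys such as strokeWidth from being touched.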

Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
html = '''
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
'''
doc = SimplifiedDoc(html)
mxCell = doc.select('mxCell#line1')
style = doc.replaceReg(mxCell['style'],'strokeColor=.*?;','strokeColor=#FF0000;')
mxCell.setAttr('style',style)
print(doc.html)
Result:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#FF0000;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Related

add comment in XML root element python code

I am converting data from CSV to XML using Python's xml.etree.ElementTree library.
I want to put a comment before the first node so the output will look like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Comment -->
<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<b>
<c>xxxxx</c>
</b>
</a>
The code I have tried is:
from lxml import etree
import xml.dom.minidom
import xml.etree.ElementTree as et
name_space = {
# namespace defined below
"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"
}
comment = et.Comment('Comment')
root = et.Element('')
root.tag = None
root.insert(0, comment)
root2 = et.Element('a', name_space)
root.insert(1, root2)
xml_data = et.tostring(root, encoding='iso-8859-1', method='xml')
xmlstr = xml.dom.minidom.parseString(xml_data, parser=None).toprettyxml(indent=" ", encoding='iso-8859-1')
The last statement, xml.dom.minidom.parseString(), raises the error xml.parsers.expat.ExpatError: junk after document element: line 2, column 135
If I print the content of xml_data, it is:
<?xml version=\'1.0\' encoding=\'iso-8859-1\'?>\n<!--Comment--><a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" /><b><c>xxxx</c></b>
Do you know if there is any other way to add the comment?
I believe you're making it a bit too complicated. Try something like the below. Note that this assumes (as in your question) that xxxxx has already been extracted from the CSV file.
cmt = """<?xml version="1.0" encoding="iso-8859-1"?>
<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!-- Comment -->
<b>
<c>xxxxx</c>
</b>
</a>
"""
parser = etree.XMLParser(remove_comments=False)
doc = etree.XML(cmt.encode(),parser=parser)
print(etree.tostring(doc).decode())
The output should be what you're looking for.
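If the comment has to sit before the root element, as in the desired output of the question, one possible approach is to build the tree with ElementTree, hand it to minidom, and insert a comment node in front of the document element. This is only a sketch under the assumption that the element names and the xxxxx value are as in the question; it uses the standard library only.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Build the document body; the names a, b, c come from the question.
root = ET.Element("a", {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"})
b = ET.SubElement(root, "b")
ET.SubElement(b, "c").text = "xxxxx"

# Parse the single-rooted document, then add the comment before the root.
dom = minidom.parseString(ET.tostring(root))
dom.insertBefore(dom.createComment(" Comment "), dom.documentElement)

# toprettyxml returns bytes when an encoding is given.
pretty = dom.toprettyxml(indent=" ", encoding="iso-8859-1")
print(pretty.decode("iso-8859-1"))
```

Because the document passed to minidom has exactly one root element, the "junk after document element" error from the question does not occur.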

How to parse .trs XML file for text between self-closing tags?

I have file as such
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="MSPLAB" audio_filename="Combine001" version="5" version_date="110525">
<Episode>
<Section type="report" startTime="0" endTime="2613.577">
<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>
</Section>
</Episode>
</Trans>
Is this considered a well-formed XML file? I am trying to extract TARGET_TEXT1 and TARGET_TEXT2 in Python, but I don't understand where this content belongs, as it sits between self-closing tags. I saw this other post here, but it is done in Java.
Using itertext from ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
data = [text.strip() for node in root.findall('.//Turn') for text in node.itertext() if text.strip()]
print(data)
Output:
['TARGET_TEXT1', 'TARGET_TEXT2']
Update:
If you want dictionary as output try this:
data = {float(x.attrib['time']): x.tail.strip() for node in root.findall('.//Turn') for x in node if x.tail and x.tail.strip()}
#{2.746: 'TARGET_TEXT1', 5.982: 'TARGET_TEXT2'}
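To make it clearer where the text "belongs": in ElementTree, text that follows an element's closing (or self-closing) tag is stored in that element's tail attribute. The sketch below, which uses a shortened copy of the Turn element from the question as assumed input, pairs each Sync time with the text that follows it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Turn> element from the question.
xml_text = """<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>"""

turn = ET.fromstring(xml_text)
pairs = []
for child in turn:
    # tail holds the text between this element and the next sibling (or None).
    tail = (child.tail or "").strip()
    if child.tag == "Sync" and tail:
        pairs.append((float(child.get("time")), tail))
print(pairs)
```

The fragment parses without error, which also shows that text mixed between child elements is well-formed XML.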
An alternative, using XPath via parsel:
from parsel import Selector
# the XML is stored as a string in a variable called data
selector = Selector(text=data, type="xml")
selector.xpath(".//Turn/text()").re(r"\w+")
['TARGET_TEXT1', 'TARGET_TEXT2']

Adding a new XML element using python ElementTree library

I'm trying to add a new element to an xml file using the python ElementTree library with the following code.
from xml.etree import ElementTree as et

def UpdateXML(pre):
    xml_file = 'place/file.xml'
    tree = et.parse(xml_file)
    root = tree.getroot()
    for parent in root.findall('Parent'):
        et.SubElement(parent, "NewNode", attribute=pre)
    tree.write(xml_file)
The XML I want it to render is in the following format
<Parent>
<Child1 Attribute="Stuff"/>
<NewNode Attribute="MoreStuff"/> <--- new
<Child3>
<Child4>
<CHild5>
<Child6>
</Parent>
However the xml it actually renders is in this incorrect format
<Parent>
<Child1 Attribute="Stuff"/>
<Child3>
<Child4>
<CHild5>
<Child6>
<NewNode Attribute="MoreStuff"/> <--- new
</Parent>
What do I change in my code to render the correct xml?
You want the insert operation:
node = et.Element('NewNode')
parent.insert(1,node)
Which in my testing gets me:
<Parent>
<Child1 Attribute="Stuff" />
<NewNode /><Child3 />
<Child4 />
<CHild5 />
<Child6 />
</Parent>
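Putting the pieces together, a minimal sketch of the insert approach that also sets the attribute might look like this. The <Root> wrapper and the shortened child list are assumptions made for illustration; the index 1 places the new node directly after the first child.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a Parent element under an assumed Root element.
xml_text = (
    '<Root><Parent><Child1 Attribute="Stuff"/>'
    "<Child3/><Child4/></Parent></Root>"
)

root = ET.fromstring(xml_text)
for parent in root.findall("Parent"):
    node = ET.Element("NewNode", Attribute="MoreStuff")
    # insert(index, element) places the element at that position
    # among the parent's children; SubElement always appends at the end.
    parent.insert(1, node)

tags = [child.tag for child in root.find("Parent")]
print(tags)
print(ET.tostring(root, encoding="unicode"))
```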

Accessing Elements with and without namespaces using lxml

Is there a way, using lxml, to search in a single pass for an element that occurs in a document both with and without a namespace? As an example, I would want to get all occurrences of the element identifier, irrespective of whether or not it is associated with a specific namespace. I am currently only able to access them separately, as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print(l.text)
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print(l.text)
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
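For comparison, the standard library's ElementTree accepts a {*} namespace wildcard in find/findall paths (assuming Python 3.8 or later), which matches a tag in any namespace or in none. The sketch below uses a shortened, assumed version of xmlfile.xml rather than the full file from the question.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for xmlfile.xml: one identifier without a namespace,
# one inside the provenance default namespace.
xml_text = """<record>
  <header><identifier>identifier1</identifier></header>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
      <originDescription><identifier>identifier3</identifier></originDescription>
    </provenance>
  </about>
</record>"""

root = ET.fromstring(xml_text)
# {*}identifier selects "identifier" in any namespace or in no namespace.
identifiers = [el.text for el in root.findall(".//{*}identifier")]
print(identifiers)
```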

xml.dom.minidom getting elements by tagname

How can I retrieve the value of code from the XML string below using xml.dom.minidom?
<data>
<element1>
<name>myname</name>
</element1>
<element2>
<code>3</code>
<name>another name</name>
</element2>
</data>
Because multiple 'name' tags can appear, I would like to do something like this:
from xml.dom.minidom import parseString
dom = parseString("<data>...</data>")
dom.getElementsByTagName("element1").getElementsByTagName("name")
But that doesn't work, unfortunately.
The code below worked fine for me. I think you have multiple tags and want to get the name from the second one.
import xml.dom.minidom
myxml = """\
<data>
<element>
<name>myname</name>
</element>
<element>
<code>3</code>
<name>another name</name>
</element>
</data>
"""
dom = xml.dom.minidom.parseString(myxml)
nodelist = dom.getElementsByTagName("element")[1].getElementsByTagName("name")
for node in nodelist:
    print(node.toxml())
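The chained call in the question fails because getElementsByTagName returns a NodeList, and a NodeList has no getElementsByTagName method; an individual element has to be picked from the list first. A minimal sketch for reading the value of code, assuming the element1/element2 structure from the question, could look like this:

```python
from xml.dom.minidom import parseString

# Same structure as the XML string in the question, written on one line.
dom = parseString(
    "<data><element1><name>myname</name></element1>"
    "<element2><code>3</code><name>another name</name></element2></data>"
)

# Pick the single element2 node out of the NodeList, then search below it.
element2 = dom.getElementsByTagName("element2")[0]
code_node = element2.getElementsByTagName("code")[0]
# The text of <code> is stored in its first child, a text node.
code = code_node.firstChild.data
print(code)
```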
