I am converting data from CSV to XML using Python's xml.etree.ElementTree library.
I want to put a comment before the first node so that the output looks like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Comment -->
<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<b>
<c>xxxxx</c>
</b>
</a>
The code I have tried is:
from lxml import etree
import xml.dom.minidom
import xml.etree.ElementTree as et
name_space = {
# namespace defined below
"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"
}
comment = et.Comment('Comment')
root = et.Element('')
root.tag = None
root.insert(0, comment)
root2 = et.Element('a', name_space)
root.insert(1, root2)
xml_data = et.tostring(root, encoding='iso-8859-1', method='xml')
xmlstr = xml.dom.minidom.parseString(xml_data, parser=None).toprettyxml(indent=" ", encoding='iso-8859-1')
The last statement, xml.dom.minidom.parseString(), raises the error xml.parsers.expat.ExpatError: junk after document element: line 2, column 135
If I print xml_data content it is:
<?xml version=\'1.0\' encoding=\'iso-8859-1\'?>\n<!--Comment--><a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" /><b><c>xxxx</c></b>
Do you know if there is another way to add the comment?
I believe you're making it a bit too complicated. Try something like the code below. Note that this assumes (as in your question) that xxxxx has already been extracted from the CSV file.
cmt = """<?xml version="1.0" encoding="iso-8859-1"?>
<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!-- Comment -->
<b>
<c>xxxxx</c>
</b>
</a>
"""
from lxml import etree
# remove_comments=False keeps the comment node in the parsed tree
parser = etree.XMLParser(remove_comments=False)
doc = etree.XML(cmt.encode(), parser=parser)
print(etree.tostring(doc).decode())
The output should be what you're looking for.
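If the comment has to sit before the root element rather than inside it, one option is to let minidom insert it at document level. The following is a minimal sketch using only the standard library, not a definitive implementation; the element names and the placeholder text xxxxx are taken from the question, and the value is assumed to come from the CSV file.

```python
import xml.dom.minidom
import xml.etree.ElementTree as et

# Build the tree with one real root element (no dummy wrapper element).
root = et.Element("a", {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"})
b = et.SubElement(root, "b")
c = et.SubElement(b, "c")
c.text = "xxxxx"  # placeholder value, assumed to come from the CSV file

# Re-parse with minidom, which allows a comment node next to the root element.
doc = xml.dom.minidom.parseString(et.tostring(root))
comment = doc.createComment(" Comment ")
doc.insertBefore(comment, doc.documentElement)

# With an encoding argument, toprettyxml returns bytes.
xmlstr = doc.toprettyxml(indent=" ", encoding="iso-8859-1")
print(xmlstr.decode("iso-8859-1"))
```

The ExpatError in the question most likely comes from the serialized string containing more than one top-level element (the self-closed a followed by b), which minidom cannot parse as a single document.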
My XML file looks like this:
<feed>
<doc>
<title>Main title</title>
<url>https://test.com</url>
<abstract>some text</abstract>
</doc>
<doc>
<title>Wikipedia</title>
<url>https://wikipedia.org</url>
<abstract>screenshot</abstract>
</doc>
</feed>
And this is my code:
from xml.etree import ElementTree as et
import re
source = "simple.xml"
root = et.fromstring(source)
for child in root:  # read abstract tags
    title = child.find('title').text
    result = child.find('abstract').text
    print("{}: {}".format(title, result))
I want this output:
Main title: some text
Wikipedia: screenshot
but I can't get the title tag content.
Now I can't even load the XML file content with et.fromstring(source).
f='''<root><doc><title>Main title</title>
<url>https://test.com</url>
<abstract>some text</abstract>
</doc>
<doc>
<title>Wikipedia</title>
<url>https://wikipedia.org</url>
<abstract>screenshot</abstract>
</doc></root>'''
import xml.etree.ElementTree as ET
root = ET.fromstring(f)
for child in root:
    title = child.find('title').text
    abstract = child.find('abstract').text
    print('{}: {}'.format(title, abstract))
Output:
Main title: some text
Wikipedia: screenshot
The XML given was broken, so I had to add a root element to make it complete. If you can paste the proper XML, I can modify the code.
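One more point on the question's code: fromstring expects the XML text itself, so passing it a file name fails, whereas ET.parse reads from a file. A minimal sketch follows, assuming the file is called simple.xml as in the question; it is written to a temporary directory here only so that the example is self-contained.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = """<feed>
<doc>
<title>Main title</title>
<url>https://test.com</url>
<abstract>some text</abstract>
</doc>
<doc>
<title>Wikipedia</title>
<url>https://wikipedia.org</url>
<abstract>screenshot</abstract>
</doc>
</feed>"""

# Create the sample file (stand-in for the asker's simple.xml).
tmp_dir = tempfile.mkdtemp()
source = os.path.join(tmp_dir, "simple.xml")
with open(source, "w", encoding="utf-8") as fh:
    fh.write(xml_text)

# ET.parse takes a path or file object; fromstring takes the XML text.
root = ET.parse(source).getroot()
lines = []
for doc in root.findall("doc"):
    lines.append("{}: {}".format(doc.findtext("title"), doc.findtext("abstract")))
print("\n".join(lines))
```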
I have a file like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="MSPLAB" audio_filename="Combine001" version="5" version_date="110525">
<Episode>
<Section type="report" startTime="0" endTime="2613.577">
<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>
</Section>
</Episode>
</Trans>
Is this considered a well-formed XML file? I am trying to extract TARGET_TEXT1 and TARGET_TEXT2 in Python, but I don't understand where this content belongs, as it sits between self-closing tags. I saw another post about this, but it was done in Java.
Using itertext from ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
data = [text.strip() for node in root.findall('.//Turn') for text in node.itertext() if text.strip()]
print(data)
Output:
['TARGET_TEXT1', 'TARGET_TEXT2']
Update:
If you want dictionary as output try this:
data = {float(x.attrib['time']): x.tail.strip() for node in root.findall('.//Turn') for x in node if x.tail and x.tail.strip()}
# {2.746: 'TARGET_TEXT1', 5.982: 'TARGET_TEXT2'}
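On where the target text lives: in ElementTree, text that follows an element's end tag (or a self-closing tag) is stored in that element's tail attribute. The file is well-formed; Turn simply has mixed content. A small sketch, using a trimmed copy of the sample as an inline string so that it runs on its own:

```python
import xml.etree.ElementTree as ET

sample = """<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>"""

turn = ET.fromstring(sample)
pairs = []
for child in turn:
    # tail holds the text between this child's tag and the next tag
    tail = (child.tail or "").strip()
    if tail:
        pairs.append((child.tag, child.get("time"), tail))
print(pairs)
```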
An alternative, using XPath via parsel:
from parsel import Selector
#xml is wrapped into a variable called data
selector = Selector(text=data, type="xml")
selector.xpath(".//Turn/text()").re(r"\w+")
['TARGET_TEXT1', 'TARGET_TEXT2']
I would like to modify one key's value inside an attribute (e.g. change the value of "strokeColor" inside the "style" attribute) without changing the other values in that attribute. I'm using the ElementTree module included with Python.
Here is an example of what I did before:
Part of my XML example code:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
My python code:
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
target = tree.find('.//mxCell[@id="line1"]')
target.set("strokeColor","#FF0000")
tree.write('output.xml')
My output XML:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" strokeColor="#FF0000" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
As you can see, a new attribute called "strokeColor" was added, but the strokeColor value inside the "style" attribute was not changed. I want to change the strokeColor inside the "style" attribute. How can I fix this?
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
html = '''
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#32AC2D;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
'''
doc = SimplifiedDoc(html)
mxCell = doc.select('mxCell#line1')
style = doc.replaceReg(mxCell['style'],'strokeColor=.*?;','strokeColor=#FF0000;')
mxCell.setAttr('style',style)
print(doc.html)
Result:
<?xml version="1.0"?>
<mxCell edge="1" id="line1" parent="1" source="main_wins" style="endArrow=none;html=1;entryX=0;entryY=0.25;entryDx=0;entryDy=0;strokeWidth=5;strokeColor=#FF0000;rounded=0;edgeStyle=orthogonalEdgeStyle;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="main-switch" value="">
</mxCell>
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
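The same change can also be made with ElementTree alone, since the style attribute is just a semicolon-separated string of key=value pairs. The sketch below is one possible approach under stated assumptions: set_style_key is a hypothetical helper, the style string is shortened, and the cell is wrapped in a root element so that the XPath lookup has a parent to search from.

```python
import xml.etree.ElementTree as ET

# Shortened sample; <root> is an assumed wrapper element.
xml_text = (
    '<root><mxCell id="line1" '
    'style="endArrow=none;strokeWidth=5;strokeColor=#32AC2D;rounded=0;" /></root>'
)


def set_style_key(style, key, value):
    """Return the style string with `key` set to `value`, other keys untouched."""
    parts = [p for p in style.split(";") if p]
    updated, found = [], False
    for part in parts:
        name = part.partition("=")[0]
        if name == key:
            updated.append("{}={}".format(key, value))
            found = True
        else:
            updated.append(part)
    if not found:  # key absent: append it instead of replacing
        updated.append("{}={}".format(key, value))
    return ";".join(updated) + ";"


root = ET.fromstring(xml_text)
cell = root.find('.//mxCell[@id="line1"]')
cell.set("style", set_style_key(cell.get("style"), "strokeColor", "#FF0000"))
new_style = cell.get("style")
print(new_style)
```

Splitting on the separator avoids accidentally matching a key that merely ends in the same text, which a loose regular expression might do.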
I recently realized that XML containing HTML tags in the body text of some elements seems to make parsers like WP All Import choke.
To mitigate this, I attempted to write a Python script that outputs the XML properly.
It starts with this XML file (this is just an excerpt):
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
</Row>
...
</Root>
The desired output is:
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
Unfortunately, I'm getting the following with weird escape characters like:
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]&gt; </Introduction_Body>
</Row>
...
</Root>
So I'd like to fix the following:
1) Output new XML file that preserves the text including the HTML in the newly introduced "Introduction_Body" tag as well as any other tags like "Waterfall_Name"
2) Is it possible to cleanly pretty print this (for human-readability)? How?
My Python code currently looks like this:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os
data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()
for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
        intro_tree = directory+introduction
        with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
            intro_text = f.read()
        intro_body = ET.SubElement(element,'Introduction_Body')
        intro_body.text = '<![CDATA[' + intro_text + ']]>'
#tree.write('new_' + data_file) #same result but leaves out the xml header
f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes">' + ET.tostring(root))
f.close()
Thanks,
Johnny
I would recommend you switch to lxml. It is well documented and (almost) completely compatible with Python's own xml.etree module, so you might only have to change your code minimally. lxml supports CDATA very handily:
> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)
<root><![CDATA[abcd]]></root>
That aside, you should definitely use whatever library you use not only for parsing xml, but also for writing it! lxml will do the declaration for you:
> print(etree.tostring(elmnt, encoding="utf-8", xml_declaration=True).decode())
<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>
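If installing lxml is not an option, the standard library's xml.dom.minidom can also emit CDATA sections. This is a swapped-in technique rather than the lxml approach recommended above, shown as a rough sketch: the XML is a trimmed inline copy of the question's sample, and intro_html is a hypothetical stand-in for the HTML that would be read from the file on disk.

```python
import xml.dom.minidom

source_xml = (
    "<Root><Row><Entry_No>657</Entry_No>"
    "<Introduction>introduction-detian-waterfall.html</Introduction>"
    "</Row></Root>"
)
# Hypothetical stand-in for the HTML file content.
intro_html = '<a href="http://blah.com/blah.html">link</a><br>\n德天瀑布'

doc = xml.dom.minidom.parseString(source_xml)
for row in doc.getElementsByTagName("Row"):
    body = doc.createElement("Introduction_Body")
    # A CDATA node is written verbatim, so the HTML is not escaped.
    body.appendChild(doc.createCDATASection(intro_html))
    row.appendChild(body)

output = doc.toxml()
print(output)
```

For human-readable output, doc.toprettyxml(indent="  ") could be used in place of toxml(). Note that CDATA content cannot contain the sequence ]]>.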
I need to change the value of an attribute named approved-by in an xml file from 'no' to 'yes'. Here is my xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5" xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data>
<?Pub Dtl?>
<confidentiality class="mycompany-internal" />
<doc-name>INSTRUCTIONS</doc-name>
<doc-id>
<doc-no type="registration">1/1531-CRA 119 1364/2</doc-no>
<language code="en" />
<rev>PA1</rev>
<date>
<y>2013</y>
<m>03</m>
<d>12</d>
</date>
</doc-id>
<company-id>
<business-unit></business-unit>
<company-name></company-name>
<company-symbol logotype="X"></company-symbol>
</company-id>
<title>SIM Software Installation Guide</title>
<drafted-by>
<person>
<name>Shahul Hameed</name>
<signature>epeeham</signature>
</person>
</drafted-by>
<approved-by approved="no">
<person>
<name>AB</name>
<signature>errrrrn</signature>
</approved-by>
I tried in two ways, and failed in both. My first way is
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element
root = ET.parse('Path/1_1531-CRA 119 1364_2.xml')
sh = root.find('approved-by')
sh.set('approved', 'yes')
print etree.tostring(root)
In this way, I got an error message saying AttributeError: 'NoneType' object has no attribute 'set'.
So I tried another way.
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element
root = ET.parse('C:/Path/1_1531-CRA 119 1364_2.xml')
elem = Element("approved-by")
elem.attrib["approved"] = "yes"
I didn't get any error, but the attribute wasn't set either. I am confused and can't find what's wrong with this script.
Since the xml you've provided is not valid, here's an example:
import xml.etree.ElementTree as ET
xml = """<?xml version="1.0" encoding="UTF-8"?>
<body>
<approved-by approved="no">
<name>AB</name>
<signature>errrrrn</signature>
</approved-by>
</body>
"""
tree = ET.fromstring(xml)
sh = tree.find('approved-by')
sh.set('approved', 'yes')
print(ET.tostring(tree).decode())
prints:
<body>
<approved-by approved="yes">
<name>AB</name>
<signature>errrrrn</signature>
</approved-by>
</body>
So, the first way you've tried works. Hope that helps.
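A likely reason find returned None in the question is that the original file declares a default namespace on doc, and approved-by is nested under meta-data rather than being a direct child of the root. The sketch below assumes a trimmed, well-formed version of that file; the prefix d is an arbitrary name chosen for the lookup, not something defined in the document.

```python
import xml.etree.ElementTree as ET

xml_text = """<doc version="XSEIF R5" xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
  <meta-data>
    <approved-by approved="no">
      <person><name>AB</name><signature>errrrrn</signature></person>
    </approved-by>
  </meta-data>
</doc>"""

# Map an arbitrary prefix to the document's default namespace URI.
ns = {"d": "urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*"}

root = ET.fromstring(xml_text)
# .// searches all descendants; the prefix qualifies the tag name.
node = root.find(".//d:approved-by", ns)
node.set("approved", "yes")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The second attempt in the question creates a brand-new Element that is never attached to the parsed tree, which would explain why nothing changed.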