Parsing an XML CDATA section and converting it to CSV using ElementTree in Python

I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. I want to extract only the text content inside the TEXT tags. My problem is that I don't know how to access the CDATA content. In some DOCs, TEXT has an IMAGE child; when I run my code on those, it only picks up the IMAGE tag, and I see NaN when I read the resulting CSV file with pandas. I searched for information about CDATA, but I couldn't find any tag or option that tells the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that deleted all of the TEXT content, including the CDATA section.
My XML pattern is as follows:
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>
And here is my parsing code:
def make_csv(folderpath, xmlfilename, csvwriter, csv_file):
    # Parse the XML file
    tree = ET.parse(os.path.join(folderpath, xmlfilename))
    root = tree.getroot()
    for elem in root.findall("DOC"):
        rows = []
        sentence = elem.find("TEXT")
        if sentence is not None:
            # .text holds only the text before the first child element
            sentence = re.sub('\n', '', sentence.text)
            rows.append(sentence)
        csvwriter.writerow(rows)
    csv_file.close()
I appreciate any help.

My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child
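The code below seems to work. It handles both cases: a TEXT with an IMAGE child and a TEXT without one. In ElementTree, text that follows a child element is stored in that child's tail attribute, not in the parent's text.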
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC></root>'''
root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')
Output:
1) The section I want to access to
2) more text
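To connect this back to the CSV goal in the question, here is a minimal sketch, not the asker's actual pipeline. It assumes each DOC contributes one single-column row, and the helper names text_without_children and docs_to_csv are made up for illustration. It collects the element's own text plus the tail of every child, so the content of IMAGE itself is skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET


def text_without_children(elem):
    """Return elem.text plus each child's tail, skipping the children's own content."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Collapse newlines and repeated whitespace into single spaces.
    return " ".join("".join(parts).split())


def docs_to_csv(xml_string, out):
    """Write one CSV row per DOC that has a TEXT element (hypothetical layout)."""
    writer = csv.writer(out)
    root = ET.fromstring(xml_string)
    for doc in root.findall("DOC"):
        text = doc.find("TEXT")
        if text is not None:
            writer.writerow([text_without_children(text)])


sample = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT><![CDATA[more text]]></TEXT>
</DOC>
</root>"""

buffer = io.StringIO()  # stands in for a real CSV file
docs_to_csv(sample, buffer)
csv_output = buffer.getvalue()
print(csv_output)
```

In a real script the StringIO buffer would be replaced by a file opened with newline='' as the csv module documentation recommends.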

Related

How to add a root to an XML file? [duplicate]

I have an XML file that doesn't have a single root tag. I want to add a new root tag to this XML file.
Below is the existing XML:
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
Now I want to add a Root tag 'X', so the final XML will look like:
<X>
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
</X>
I've tried using the below python code:
from xml.etree import ElementTree as ET
root = ET.parse(Input_FilePath).getroot()
newroot = ET.Element("X")
newroot.insert(0, root)
tree = ET.ElementTree(newroot)
tree.write(Output_FilePath)
But at the first line I'm getting the below error:
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 4
As pointed out in the comments by @kjhughes, the XML spec requires that a document have a single root element, so parsing the file as it stands fails:
from xml.etree import ElementTree as ET
node = ET.parse(Input_FilePath)
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 0
You'll need to read the file manually and add the tags yourself:
from xml.etree import ElementTree as ET
with open(Input_FilePath) as f:
    xml_string = '<X>' + f.read() + '</X>'
node = ET.fromstring(xml_string)
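As a self-contained illustration of the same idea, the sketch below wraps a rootless fragment held in a string and checks the result. The helper name wrap_in_root is hypothetical, and the sketch assumes the fragment does not start with an XML declaration (`<?xml ...?>`), which would no longer be at the start of the document after wrapping and would make the parser fail.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(fragment, root_tag="X"):
    """Wrap a rootless XML fragment in a single root element and parse it."""
    wrapped = f"<{root_tag}>{fragment}</{root_tag}>"
    return ET.fromstring(wrapped)


# Same content as the file in the question, kept in a string for the demo.
fragment = """<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>"""

root = wrap_in_root(fragment)
child_tags = [child.tag for child in root]
print(root.tag, child_tags)

# Serialise the new tree; ET.ElementTree(root).write(path) would save it to a file.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```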
I think you can do it without an XML parser.
If you know that the root tag is missing, you can add it this way:
with open('test.xml', 'r') as f:
    data = f.read()
with open('test.xml', 'w') as f:
    f.write("<x>\n" + data + "\n</x>")
If you don't know, you can check for it with:
import re
# data is the file content read above
if re.match(r"\s*<x>.*</x>", data, re.S) is not None:
    # do something
    pass

How to determine what the root tag name is for a XML document

I was wondering how I would go about determining the root tag of an XML document using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
def import_from_XML(self, file_name):
    file = open(file_name)
    document = file.read()
    if re.compile(r'^<\?xml').match(document):
        xml = parseString(document)
        root = '' # <-- THIS IS WHERE IM STUCK
        elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.
For the line you commented as Where I am stuck, the following should assign the value of the root tag of the XML document to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName
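A minimal end-to-end sketch of that approach, using the sample document from the question. The variable names dom, root_name and child_names are just illustrative.

```python
from xml.dom.minidom import parseString

document = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>"""

dom = parseString(document)

# documentElement is the single root element of the parsed document.
root_name = dom.documentElement.tagName
print(root_name)

# The root name can then be passed to getElementsByTagName, as in the question.
elements = dom.getElementsByTagName(root_name)

# Whitespace between tags shows up as text nodes, so keep element nodes only.
child_names = [
    node.tagName
    for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(child_names)
```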
This is what I did when I last processed XML. I didn't use the approach you used, but I hope it helps. It is written for Python 3, where urllib2 has become urllib.request.
import urllib.request
import urllib.error
from xml.etree import ElementTree as ET
# site is the URL of the XML document
req = urllib.request.Request(site)
try:
    with urllib.request.urlopen(req) as response:
        data = response.read()
except urllib.error.URLError as e:
    print(e.reason)
    raise
root = ET.fromstring(data)
# root.tag is the name of the root element
print("root", root.tag)
for child in root.findall('parent element'):
    print(child.text, child.attrib)

Find and replacing text in elementtree

I am very new to programming and Python. I am trying to find and replace text in an XML file. Here is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5"
xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data></meta-data>
<front></front>
<body>
<chl1><title xml:id="id_881i">Installation</title>
<p>To install SDK, perform the tasks mentioned in the following
table.</p>
<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input
></p>
</chl1>
</body>
</doc>
<?Pub *0000021917 0?>
I need to replace all occurrences of "VirtualBox" with "Xen". I tried ElementTree for this, but I don't know how to do the replacement and write the result back to the file. Here is my attempt:
import xml.etree.ElementTree as ET
tree=ET.parse('C:/My_location/1_1531-CRA 119 1364_2.xml')
doc=tree.getroot()
iterator=doc.getiterator()
for body in iterator:
    old_text=body.replace("Virtualbox", "Xen")
The text appears in many sub-tags under body. I found a way to remove a subelement and append a new element, but not a way to replace only the text.
Replace the string in the text and tail attributes of every element:
import lxml.etree as ET
with open('1.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    # iter() replaces the deprecated getiterator()
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace('VirtualBox', 'Xen')
        if elem.tail:
            elem.tail = elem.tail.replace('VirtualBox', 'Xen')
    # Overwrite the file in place with the modified tree
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()
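For comparison, here is a rough sketch of the same text-and-tail replacement using only the standard library's xml.etree.ElementTree, run on a small made-up snippet instead of the file from the question. The helper name replace_in_tree is hypothetical. One known difference from lxml: the standard-library parser discards comments and the DOCTYPE by default, so they would be missing if the asker's file were written back this way.

```python
import xml.etree.ElementTree as ET


def replace_in_tree(root, old, new):
    """Replace old with new in the text and tail of every element under root."""
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace(old, new)
        if elem.tail:
            elem.tail = elem.tail.replace(old, new)


# Made-up sample: one occurrence sits in a child's text, one in its tail.
sample = "<body><p>To install <b>VirtualBox</b> tools, link $home/.VirtualBox</p></body>"

root = ET.fromstring(sample)
replace_in_tree(root, "VirtualBox", "Xen")
result = ET.tostring(root, encoding="unicode")
print(result)
```

The tail handling matters here: the second occurrence follows the closing tag of the child element, so it lives in that child's tail and would be missed by a loop that only looks at text.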
Probably the simplest way is to do:
ifile = open('input_file','r')
ofile = open('output_file','w')
for line in ifile.readlines():
    ofile.write(line.replace('VirtualBox','Xen'))
ifile.close()
ofile.close()

Trying to extract xml element using python 2.7

I am trying to extract the name elements under the sequence in xml files. I have pasted in the top of a sample xml to illustrate. With this I want to get the text from 01 Interview_been successful through mentorship and write it to a file. There are multiple sequence tags in the xml and I am trying to figure out how to go through it and extract it. I have tried to figure out how to use xml.etree and xml.dom.minidom but I can't seem to wrap my brain around it. I was able to get all of the id values from the sequence tags but not the name elements. I'm pasting in my code before the xml.
from xml.etree import ElementTree
file = open("xmldump.txt", "r")
filedata = file.read()
file.close()
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('name'):
    sequenceid = node.attrib.get('name')
    print ' %s' % (sequenceid)
    newLine = sequenceid + "\n"
file = open("xmldump.txt", "w")
file.write(newLine)
file.close()
Here is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<bin>
<uuid>0F5D72FA-54E4-4DE8-81D7-CC33F5C43836</uuid>
<updatebehavior>add</updatebehavior>
<name>Logged</name>
<children>
<sequence id="01 Interview_been successful through mentorship">
<uuid>12FB944D-83EA-4527-9A54-2130A42E3A06</uuid>
<updatebehavior>add</updatebehavior>
<name>01 Interview_been successful through mentorship</name>
<duration>1195</duration>
<rate>
<ntsc>TRUE</ntsc>
<timebase>24</timebase>
</rate>
<timecode>
Well, I'm not sure whether you want the "id" attribute or the name tag. Your code is confusing: it tries to read a "name" attribute, but the "sequence" tag only has an "id" attribute, and the name is a child element. Below is code that extracts both, which should help you get started on figuring out how ElementTree works:
from xml.etree import ElementTree
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('sequence'):
    # "id" is an attribute of <sequence>; <name> is a child element
    sequenceid = node.attrib.get('id')
    name = node.findtext('name')
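To round this out with the "write it to a file" part of the question, here is a small sketch in Python 3 syntax (the question targets Python 2.7, where the print call would need adjusting). The sample is a shortened, made-up version of the XML in the question, since the original is cut off, and an in-memory buffer stands in for xmldump.txt.

```python
import io
from xml.etree import ElementTree

# Shortened, hypothetical version of the xmeml document from the question.
sample = """<xmeml version="5">
<bin>
<name>Logged</name>
<children>
<sequence id="01 Interview_been successful through mentorship">
<name>01 Interview_been successful through mentorship</name>
<duration>1195</duration>
</sequence>
</children>
</bin>
</xmeml>"""

root = ElementTree.fromstring(sample)

# findtext('name') looks only at direct children of each <sequence>,
# so the <name> that belongs to <bin> is not picked up.
names = []
for node in root.iter("sequence"):
    name = node.findtext("name")
    if name is not None:
        names.append(name)

# Collect every name first, then write once, one name per line.
out = io.StringIO()  # stands in for open("xmldump.txt", "w")
out.write("\n".join(names) + "\n")
print(out.getvalue())
```

Writing after the loop avoids reopening the output file in "w" mode on every iteration, which would keep only the last name.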
