I am trying to parse an XML file in Python. Here is a small portion of the XML code:
<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>
I want to get "DESIRED TEXT". I did the following:
import xml.etree.ElementTree as ET
tree = ET.parse(dir)
root = tree.getroot()
for el in root.findall("./body/p"):
    print(el.attrib, el.text)
el.attrib returns the correct value (XXX in this case), but el.text returns None.
What am I missing? What should I use instead of .text?
Thanks in advance.
You can use the xmltodict library:
import xmltodict
with open('file.xml', 'r') as f:
    result = xmltodict.parse(f.read())['body']['p']['#text']
Output:
DESIRED TEXT
See below (no need to install an external library). The text that follows a child element is stored in that child's tail attribute:
import xml.etree.ElementTree as ET
xml = '''<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>'''
root = ET.fromstring(xml)
print(root.findall('.//ph')[0].tail.strip())
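To make the text/tail distinction a bit more concrete, here is a minimal sketch that collects every non-empty tail inside the p element instead of hard-coding index 0. It reuses the XML from the question; the variable names p and tails are just illustrative.

```python
import xml.etree.ElementTree as ET

xml = """<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>"""

root = ET.fromstring(xml)
p = root.find("p")

# p.text only covers what comes before the first child (here just whitespace).
# Text that follows a child element is stored in that child's .tail.
tails = [child.tail.strip() for child in p if child.tail and child.tail.strip()]
print(tails)
```

This way you do not need to know in advance which ph element the text follows.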
I have an XML file that doesn't have a single root tag. I want to add a new root tag to this XML file.
Below is the existing XML:
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
Now I want to add a Root tag 'X', so the final XML will look like:
<X>
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
</X>
I've tried the Python code below:
from xml.etree import ElementTree as ET
root = ET.parse(Input_FilePath).getroot()
newroot = ET.Element("X")
newroot.insert(0, root)
tree = ET.ElementTree(newroot)
tree.write(Output_FilePath)
But on the first line I get the error below:
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 4
As pointed out in the comments by @kjhughes, the XML spec requires a document to have a single root element, so parsing the file as-is fails:
from xml.etree import ElementTree as ET
node = ET.parse(Input_FilePath)
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 0
You'll need to read the file manually and add the tags yourself:
from xml.etree import ElementTree as ET
with open(Input_FilePath) as f:
    xml_string = '<X>' + f.read() + '</X>'
node = ET.fromstring(xml_string)
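Putting the wrap-then-parse idea together with writing the result back out, a rough sketch could look like the following. The helper name wrap_in_root and the temporary file names are made up for the demo, and it assumes the input has no XML declaration or DOCTYPE at the top (those would have to be stripped before wrapping).

```python
import os
import tempfile
from xml.etree import ElementTree as ET

def wrap_in_root(input_path, output_path, root_tag="X"):
    """Wrap the contents of input_path in <root_tag> and write the result."""
    with open(input_path, encoding="utf-8") as f:
        fragment = f.read()
    # Assumption: the fragment has no XML declaration or DOCTYPE at the top.
    root = ET.fromstring("<{0}>{1}</{0}>".format(root_tag, fragment))
    ET.ElementTree(root).write(output_path, encoding="utf-8")
    return root

# Demo with a throwaway copy of the XML from the question.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "input.xml")
    dst = os.path.join(tmp_dir, "output.xml")
    with open(src, "w", encoding="utf-8") as f:
        f.write("<A>\n<Val>123</Val>\n</A>\n<B>\n<Val1>456</Val1>\n</B>\n")
    new_root = wrap_in_root(src, dst)
    reparsed = ET.parse(dst).getroot()

print(new_root.tag, [child.tag for child in new_root])
```

Because the output now has a single root, it can be read back with ET.parse without the "junk after document element" error.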
I think you can do it without an XML parser.
If you know that the root tag is missing, you can add it this way:
with open('test.xml', 'r') as f:
    data = f.read()
with open('test.xml', 'w') as f:
    f.write("<x>\n" + data + "\n</x>")
If you don't know whether it is missing, you can check with:
import re
if re.match(r"\s*<x>.*</x>", data, re.S) is not None:
    # the root tag is already present; do something
    pass
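If you would rather not rely on a regular expression for the check, one alternative (not what the answer above does) is to let the parser decide: try to parse the text as-is and only wrap it when that fails. This is a sketch; parse_with_fallback_root is a hypothetical helper name, and it treats any ParseError as "root is missing", which may be too generous for files that are broken in other ways.

```python
from xml.etree import ElementTree as ET

def parse_with_fallback_root(text, root_tag="x"):
    """Parse text; if that fails, retry with the text wrapped in <root_tag>."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # Assumption: the failure was caused by multiple top-level elements.
        return ET.fromstring("<{0}>{1}</{0}>".format(root_tag, text))

single = parse_with_fallback_root("<A><Val>123</Val></A>")
multi = parse_with_fallback_root("<A><Val>123</Val></A>\n<B><Val1>456</Val1></B>")
print(single.tag, multi.tag, [child.tag for child in multi])
```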
I have an XML file like this:
<root>
<article>
<article_taxonomy></article_taxonomy>
<article_place>Somewhere</article_place>
<article_number>1</article_number>
<article_date>2001</article_date>
<article_body>Blah blah balh</article_body>
</article>
<article>
<article_taxonomy></article_taxonomy>
<article_place>Somewhere</article_place>
<article_number>2</article_number>
<article_date>2001</article_date>
<article_body>Blah blah balh</article_body>
</article>
...
...
more nodes
</root>
What I am trying to do is extract each node (from the <article> tag to the </article> tag) and write it to a separate txt or xml file. I also want to keep the tags.
Is it possible to do it without regular expressions? Are there any suggestions?
Here is one way to do it using ElementTree:
import xml.etree.ElementTree as ElementTree
def main():
    with open('data.xml') as f:
        et = ElementTree.parse(f)
    for article in et.findall('article'):
        xml_string = ElementTree.tostring(article)
        # Now you can write xml_string to a new file
        # Take care to name the files sequentially

if __name__ == '__main__':
    main()
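The "write to a new file, named sequentially" step left as a comment above could be filled in roughly like this. It is only a sketch: split_articles, the article_N.xml naming scheme, and the shortened SAMPLE document are assumptions made for the demo, and it writes into a temporary directory rather than next to your data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the file in the question.
SAMPLE = """<root>
<article><article_number>1</article_number><article_body>Blah</article_body></article>
<article><article_number>2</article_number><article_body>Blah</article_body></article>
</root>"""

def split_articles(xml_text, out_dir):
    """Write each <article> element, tags included, to its own file."""
    root = ET.fromstring(xml_text)
    paths = []
    for index, article in enumerate(root.findall("article"), start=1):
        path = os.path.join(out_dir, "article_{}.xml".format(index))
        ET.ElementTree(article).write(path, encoding="utf-8")
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as out_dir:
    paths = split_articles(SAMPLE, out_dir)
    names = [os.path.basename(p) for p in paths]
    numbers = [ET.parse(p).getroot().findtext("article_number") for p in paths]

print(names, numbers)
```

Each output file is itself a well-formed XML document with article as its root, so no regular expressions are needed.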
Try something like this:
from xml.dom import minidom
xmlfile = minidom.parse('yourfile.xml')
#for example for 'article_body'
article_body = xmlfile.getElementsByTagName('article_body')
or
import xml.etree.ElementTree as ET
xmlfile = ET.parse('yourfile.xml')
root_tag = xmlfile.getroot()
for each_article in root_tag.findall('article'):
    article_taxonomy = each_article.find('article_taxonomy').text
    article_place = each_article.find('article_place').text
    # etc etc
I am very new to programming and Python. I am trying to find and replace text in an XML file. Here is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5"
xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data></meta-data>
<front></front>
<body>
<chl1><title xml:id="id_881i">Installation</title>
<p>To install SDK, perform the tasks mentioned in the following
table.</p>
<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input
></p>
</chl1>
</body>
</doc>
<?Pub *0000021917 0?>
I need to replace all occurrences of "VirtualBox" with "Xen". For this I tried ElementTree, but I don't know how to replace the text and write it back to the file. Here is my attempt:
import xml.etree.ElementTree as ET
tree=ET.parse('C:/My_location/1_1531-CRA 119 1364_2.xml')
doc=tree.getroot()
iterator=doc.getiterator()
for body in iterator:
    old_text=body.replace("Virtualbox", "Xen")
The text appears in many sub-tags under body. I found a way to remove a subelement and append a new element, but not a way to replace only the text.
Replace the string in both the text and tail attributes of every element:
import lxml.etree as ET
with open('1.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    # iter() walks every element in the tree (getiterator() is deprecated)
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace('VirtualBox', 'Xen')
        if elem.tail:
            elem.tail = elem.tail.replace('VirtualBox', 'Xen')
    # Overwrite the file in place with the modified tree
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()
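If lxml is not available, the same text/tail loop also works with the standard-library ElementTree. The sketch below only demonstrates the replacement step on a trimmed-down copy of the body from the question; replace_in_tree is a made-up helper name. Keep in mind that the standard-library parser does not preserve comments or the DOCTYPE by default, so for a file like the one in the question the lxml version above is probably the safer choice.

```python
import xml.etree.ElementTree as ET

def replace_in_tree(root, old, new):
    """Replace old with new in the .text and .tail of every element."""
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace(old, new)
        if elem.tail:
            elem.tail = elem.tail.replace(old, new)

# Trimmed-down sample based on the question's <input> element.
sample = (
    "<body><p><input>ln -s /sim/<var>user_id</var>"
    "/.VirtualBox $home/.VirtualBox</input></p></body>"
)
root = ET.fromstring(sample)
replace_in_tree(root, "VirtualBox", "Xen")
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that both occurrences here live in the tail of the var element, which is why checking elem.text alone would miss them.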
Probably the simplest way is to treat the file as plain text:
with open('input_file', 'r') as ifile, open('output_file', 'w') as ofile:
    for line in ifile:
        ofile.write(line.replace('VirtualBox', 'Xen'))
xml =
<company>Mcd</company>
<Author>Dr.D</Author>
I want to fetch Mcd and Dr.D.
My try
import xml.etree.ElementTree as et
e = et.parse(xml)
root = e.getroot()
for node in root.getiterator("company"):
    print(node.tag)
Hoping for some generous help.
Simply find the one tag that matches, then take the .text attribute:
company = root.find('.//company').text
author = root.find('.//Author').text
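For completeness, here is a small self-contained version of this. Two things are assumptions on my part: the snippet in the question has no single root element, so it is wrapped in a made-up root tag here, and the XML is parsed from a string with fromstring (parse expects a file name or file object rather than the XML text itself).

```python
import xml.etree.ElementTree as ET

# <root> is a hypothetical wrapper; the question only shows the two inner tags.
xml = "<root><company>Mcd</company><Author>Dr.D</Author></root>"

root = ET.fromstring(xml)

# findtext() returns the .text of the first match, or None if nothing matches.
company = root.findtext(".//company")
author = root.findtext(".//Author")
print(company, author)
```

Tag names are case-sensitive, so 'Author' and 'author' are different tags.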
Try this:
from xml.etree import ElementTree as ET
# iterparse yields (event, element) pairs; by default only "end" events
for event, elem in ET.iterparse('some_file.xml'):
    print(elem.text)