Extract each xml node in separate txt file - python

I have an XML file like this:
<root>
<article>
<article_taxonomy></article_taxonomy>
<article_place>Somewhere</article_place>
<article_number>1</article_number>
<article_date>2001</article_date>
<article_body>Blah blah balh</article_body>
</article>
<article>
<article_taxonomy></article_taxonomy>
<article_place>Somewhere</article_place>
<article_number>2</article_number>
<article_date>2001</article_date>
<article_body>Blah blah balh</article_body>
</article>
...
...
more nodes
</root>
What I am trying to do is extract each node (from the <article> tag to the </article> tag) and write it to a separate txt or xml file. I also want to keep the tags.
Is it possible to do this without regular expressions? Are there any suggestions?

Here is one way to do it using ElementTree:
import xml.etree.ElementTree as ElementTree
def main():
    with open('data.xml') as f:
        et = ElementTree.parse(f)
    for article in et.findall('article'):
        xml_string = ElementTree.tostring(article)
        # Now you can write xml_string to a new file
        # Take care to name the files sequentially
if __name__ == '__main__':
    main()
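The two comments in the loop above leave the file writing open. A minimal sketch of that step follows; the helper name `split_articles`, the `article_N.xml` naming scheme, and the output directory are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def split_articles(source, out_dir, tag="article"):
    """Write each <tag> child of the root to its own numbered XML file.

    The function name and the article_N.xml naming are hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    root = ET.parse(source).getroot()
    written = []
    for index, article in enumerate(root.findall(tag), start=1):
        target = out_dir / f"article_{index}.xml"
        # tostring() returns bytes by default, so write in binary mode
        target.write_bytes(ET.tostring(article))
        written.append(target)
    return written
```

Calling `split_articles('data.xml', 'articles')` would then return the list of paths it wrote, one per article, with the tags kept.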

Try something like this:
from xml.dom import minidom
xmlfile = minidom.parse('yourfile.xml')
# for example, for 'article_body'
article_body = xmlfile.getElementsByTagName('article_body')
or
import xml.etree.ElementTree as ET
xmlfile = ET.parse('yourfile.xml')
root_tag = xmlfile.getroot()
for each_article in root_tag.findall('article'):
    article_taxonomy = each_article.find('article_taxonomy').text
    article_place = each_article.find('article_place').text
    # etc etc
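Since the question asks to keep the tags, the minidom route can serialize whole nodes with `toxml()` instead of reading individual fields. This is only a sketch; the helper name `article_strings` is an assumption, and it takes the XML as a string rather than a file path to stay self-contained.

```python
from xml.dom import minidom

def article_strings(xml_text, tag="article"):
    """Return every <tag> element as a string with its markup intact.

    Hypothetical helper, not part of the original answer.
    """
    dom = minidom.parseString(xml_text)
    # toxml() serializes a node together with its tags and children
    return [node.toxml() for node in dom.getElementsByTagName(tag)]
```

Each returned string could then be written to its own file.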

Related

Parsing XML CDATA section and convert it to CSV using ElementTree python

I want to convert XML files into a CSV file. My XML files contain various tags, and I select only the ones that are useful for my work. I want to access only the text content between the TEXT tags. My problem is that I don't know how to access the CDATA content. Because TEXT in some DOCs has an IMAGE child, my code parses only the IMAGE tag, and I see NaN when I read the CSV file with pandas. I searched for information about CDATA but couldn't find a way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that deleted all of the TEXT content, including the CDATA section.
My XML pattern is as follows:
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>
And here is my parsing code:
import os
import re
import xml.etree.ElementTree as ET
def make_csv(folderpath, xmlfilename, csvwriter, csv_file):
    rows = []
    # Parse XML file
    tree = ET.parse(os.path.join(folderpath, xmlfilename))
    root = tree.getroot()
    for elem in root.findall("DOC"):
        rows = []
        sentence = elem.find("TEXT")
        if sentence is not None:
            sentence = re.sub('\n', '', sentence.text)
            rows.append(sentence)
        csvwriter.writerow(rows)
    csv_file.close()
I appreciate any help.
Quoting the question: "My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child"
The code below seems to work. It handles both cases: TEXT with an IMAGE child and TEXT without one.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC></root>'''
root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')
Output:
1) The section I want to access to
2) more text
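The answer above relies on the fact that ElementTree stores text following a child element in that child's `tail`. A slightly more general sketch collects the element's own `text` plus the `tail` of every child, so it does not depend on IMAGE being the first child. The helper name `own_text` and the shortened sample string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def own_text(elem):
    """Join text that sits directly inside elem, ignoring the contents of its children.

    Hypothetical helper: elem.text covers text before the first child,
    and each child's tail covers text that follows that child.
    """
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return " ".join(p.strip() for p in parts if p.strip())

# Shortened, made-up sample in the same shape as the question's XML
sample = "<DOC><TEXT><IMAGE>/a/b.jpg</IMAGE><![CDATA[first]]></TEXT><TEXT><![CDATA[second]]></TEXT></DOC>"
texts = [own_text(t) for t in ET.fromstring(sample).iter("TEXT")]
print(texts)
```

The parser hands CDATA content over as ordinary text, so no CDATA-specific call is needed here.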

How to add a root to an XML file? [duplicate]

I have an XML file that doesn't have a single root tag. I want to add a new root tag to this XML file.
Below is the existing XML:
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
Now I want to add a Root tag 'X', so the final XML will look like:
<X>
<A>
<Val>123</Val>
</A>
<B>
<Val1>456</Val1>
</B>
</X>
I've tried using the Python code below:
from xml.etree import ElementTree as ET
root = ET.parse(Input_FilePath).getroot()
newroot = ET.Element("X")
newroot.insert(0, root)
tree = ET.ElementTree(newroot)
tree.write(Output_FilePath)
But at the first line I'm getting the below error:
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 4
As pointed out in the comments by @kjhughes, the XML spec requires that a document have a single root element, so parsing the file as-is fails:
from xml.etree import ElementTree as ET
node = ET.parse(Input_FilePath)
xml.etree.ElementTree.ParseError: junk after document element: line 4, column 0
You'll need to read the file manually and add the tags yourself:
from xml.etree import ElementTree as ET
with open(Input_FilePath) as f:
    xml_string = '<X>' + f.read() + '</X>'
node = ET.fromstring(xml_string)
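One caveat with wrapping the raw file contents: if the file begins with an XML declaration (`<?xml ... ?>`), the declaration would end up inside the new root and the parse would fail again. The sketch below strips a leading declaration first; the function name `parse_fragments` and the default tag `X` are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

def parse_fragments(text, root_tag="X"):
    """Wrap rootless XML text in a new root element and parse it.

    Hypothetical helper: a leading XML declaration is removed first,
    because a declaration is only allowed at the very start of a document.
    """
    body = re.sub(r"^\s*<\?xml\s[^>]*\?>", "", text, count=1)
    return ET.fromstring(f"<{root_tag}>{body}</{root_tag}>")

root = parse_fragments('<?xml version="1.0"?>\n<A><Val>123</Val></A>\n<B><Val1>456</Val1></B>')
print(root.tag, [child.tag for child in root])
```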
I think you can do it without an XML parser.
If you know that the root tag is missing, you can add it this way:
with open('test.xml', 'r') as f:
    data = f.read()
with open('test.xml', 'w') as f:
    f.write("<x>\n" + data + "\n</x>")
If you don't know, you can check for it with:
import re
# data holds the file contents read above
if re.match(r"\s*<x>.*</x>", data, re.S) is not None:
    # do something
    pass
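As an alternative to the regex check, one could simply try to parse the text and add the wrapper only when parsing fails. This is a sketch under the assumption that a missing root is the only expected problem; the function name `parse_with_fallback_root` is hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_with_fallback_root(text, root_tag="X"):
    """Parse text as-is; if that fails, retry with a wrapping root element.

    Hypothetical helper. Any ParseError triggers the retry, so input that
    is malformed for another reason will raise again on the second attempt.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(f"<{root_tag}>{text}</{root_tag}>")
```

With this approach a file that already has a single root is returned unchanged, so the wrapper is never added twice.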

Can't get text of an XML element in python

I am trying to parse an XML file in python. Here is a small portion of the XML code:
<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>
I want to get "DESIRED TEXT". I did the following:
import xml.etree.ElementTree as ET
tree = ET.parse(dir)
root = tree.getroot()
for el in root.findall("./body/p"):
    print(el.attrib, el.text)
el.attrib returns the correct values (which is XXX in this case) but el.text returns None.
What am I missing? What should I use instead of .text?
Thanks in advance.
You can use xmltodict lib:
import xmltodict
with open('file.xml', 'r') as f:
    result = xmltodict.parse(f.read())['body']['p']['#text']
Output:
DESIRED TEXT
The approach below needs no external library:
import xml.etree.ElementTree as ET
xml = '''<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>'''
root = ET.fromstring(xml)
print(root.findall('.//ph')[0].tail.strip())
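To see why `.tail` is the right attribute here, it can help to print where ElementTree stores each string. The sketch below lists the `text` and `tail` of `p` and of each direct child for the question's sample; the variable names `p` and `layout` are just for illustration.

```python
import xml.etree.ElementTree as ET

xml = '''<body>
<p feature="XXX">
<ph>text1 </ph>
DESIRED TEXT
<ph>text2</ph>
<ph>sometext...</ph>
</p>
</body>'''
root = ET.fromstring(xml)
p = root.find('p')
# (tag, text, tail) for p itself and for each of its direct children
layout = [(el.tag, el.text, el.tail) for el in [p, *p]]
for tag, text, tail in layout:
    # repr() keeps whitespace and newlines visible
    print(tag, repr(text), repr(tail))
```

The string between the first and second `ph` shows up as the tail of the first `ph`, not as text of `p`.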

Find and replacing text in elementtree

I am very new to programming and Python. I am trying to find and replace text in an XML file. Here is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5"
xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data></meta-data>
<front></front>
<body>
<chl1><title xml:id="id_881i">Installation</title>
<p>To install SDK, perform the tasks mentioned in the following
table.</p>
<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input
></p>
</chl1>
</body>
</doc>
<?Pub *0000021917 0?>
I need to replace all occurrences of "VirtualBox" with "Xen". For this I tried ElementTree, but I don't know how to replace the text and write it back to the file. Here is my attempt:
import xml.etree.ElementTree as ET
tree=ET.parse('C:/My_location/1_1531-CRA 119 1364_2.xml')
doc=tree.getroot()
iterator=doc.getiterator()
for body in iterator:
    old_text=body.replace("Virtualbox", "Xen")
The text appears in many sub-tags under body. I found a way to remove a subelement and append a new element, but not a way to replace only the text.
Replace the string in both the text and tail attributes of each element:
import lxml.etree as ET
with open('1.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    # iter() walks every element; it replaces the deprecated getiterator()
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace('VirtualBox', 'Xen')
        if elem.tail:
            elem.tail = elem.tail.replace('VirtualBox', 'Xen')
    # Rewind and overwrite the file in place with the modified tree
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()
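The same text/tail walk can be sketched with the standard-library ElementTree, operating on an in-memory tree. The function name `replace_in_tree` and the one-line sample are assumptions for illustration. Note that the standard-library parser discards comments and processing instructions by default, so for a file like the one in the question, lxml may be the better fit.

```python
import xml.etree.ElementTree as ET

def replace_in_tree(root, old, new):
    """Replace old with new in every element's text and tail.

    Hypothetical helper; returns the number of strings that were changed.
    """
    changed = 0
    for elem in root.iter():
        for attr in ("text", "tail"):
            value = getattr(elem, attr)
            if value and old in value:
                setattr(elem, attr, value.replace(old, new))
                changed += 1
    return changed

# Made-up fragment modelled on the <input> line from the question
root = ET.fromstring("<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input></p>")
count = replace_in_tree(root, "VirtualBox", "Xen")
print(count, ET.tostring(root, encoding="unicode"))
```

Here both occurrences live in the tail of `var`, which is why replacing only `text` would miss them.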
Probably the simplest way is to treat the file as plain text and replace line by line:
with open('input_file', 'r') as ifile, open('output_file', 'w') as ofile:
    for line in ifile:
        ofile.write(line.replace('VirtualBox', 'Xen'))
