Counting the words in an XML file results in an error - Python

I am new to Python. I am trying to parse an XML document to count its total number of words. I tried the program below, but I get the following error.
After getting this error I installed "utils", but the error still occurs.
Is there an easier way to get the total number of words in an XML document in Python? Please help!
Traceback (most recent call last):
  File "C:\Python27\xmlp.py", line 1, in <module>
    from xml.dom import utils,core
ImportError: cannot import name utils
Coding
from xml.dom import utils,core
import string
reader = utils.FileReader('Greeting.xml')
doc = reader.document
Storage = ""
for n in doc.documentElement.childNodes:
    if n.nodeType == core.TEXT_NODE:
        # Accumulate contents of text nodes
        Storage = Storage + n.nodeValue
print len(string.split(Storage))

You'll find it easier to use ElementTree, eg:
from xml.etree import ElementTree as ET
xml = '<a>one two three<b>four five<c>Six Seven</c></b></a>'
tree = ET.fromstring(xml)
total = sum(len(text.split()) for text in tree.itertext())
# 7
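But use tree = ET.parse('Greeting.xml').getroot() to load your real data. ET.parse() returns an ElementTree object, and itertext() is a method of the root element, so getroot() is needed.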

IMHO you do not need utils and core; neither is a module of the standard library's xml.dom package.
Just use from xml.dom import minidom.
See a similar example here: Python XML File Open
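As a rough illustration of the minidom route, here is a minimal sketch that counts words by walking all text nodes recursively. The helper name count_words and the inline sample document are assumptions made for the example; for a real file you would call minidom.parse('Greeting.xml') instead of parseString.

```python
from xml.dom import minidom


def count_words(node):
    """Return the number of whitespace-separated words under a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.split())
    # Recurse into children; nodes without children contribute 0.
    return sum(count_words(child) for child in node.childNodes)


# Inline sample so the sketch runs without a file on disk.
doc = minidom.parseString('<a>one two three<b>four five<c>Six Seven</c></b></a>')
total_words = count_words(doc.documentElement)
print(total_words)
```

Unlike the loop in the question, this descends into nested elements, so text inside child tags is counted too.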

Related

I have a very large XML file (almost 1 GB) that I need to split into 3 smaller files, all with the same headers. I would like to do this in Python.

I'm opening the file with the code below, but it won't open because it is too big.
from xml.dom import minidom
Test_file = open("C:\\Users\\samue\\OneDrive\\Desktop\\mopar.xml","r", encoding="utf8")
xmldoc = minidom.parse(Test_file)
Test_file.close()
def printNode(node):
    print (node)
    for child in node.childNodes:
        printNode(child)
printNode(xmldoc.documentElement)
Although you did not paste an error message or traceback, I suspect your code failed at the second or third line.
Have you tried parsing your XML file with xml.etree.cElementTree? (That module was removed in Python 3.9; on newer versions import xml.etree.ElementTree instead, which uses the C implementation automatically.)
For example, the code below shows how long ET takes to parse your XML file.
import os
import time
import xml.etree.cElementTree as ET
xml_file_name = 'your_file.xml'  # placeholder: replace with the path to your XML file
def read_xml_file(xml_file, element):
    """
    Parse the XML file with xml.etree.cElementTree and count the matching elements
    """
    tree = ET.parse(xml_file)
    root = tree.getroot()
    number_of_element = len(root.findall(element))
    return '{:,.0f}'.format(number_of_element)
start_time = time.perf_counter()
counter = read_xml_file(xml_file_name, 'ProteinEntry/header') # the element here depends on your XML header tag
end_time = time.perf_counter()
total_time = round(end_time - start_time, 2)
print(f'xml.etree.cElementTree - Total time taken:[{total_time}] seconds to identify the number of elements: [{counter}]')
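If loading the whole tree is still too much for a file of that size, a streaming parse may help. The following is a minimal sketch using ElementTree's iterparse; the function name count_elements_streaming and the tiny in-memory sample are assumptions for illustration, and the tag to count depends on your own file.

```python
import io
import xml.etree.ElementTree as ET


def count_elements_streaming(source, tag):
    """Count elements named `tag` without keeping the full tree in memory."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's children and text once it has been handled.
        elem.clear()
    return count


# In-memory sample standing in for a large file; a file path works as well.
sample = io.BytesIO(
    b"<root><ProteinEntry><header/></ProteinEntry>"
    b"<ProteinEntry><header/></ProteinEntry></root>"
)
header_count = count_elements_streaming(sample, "header")
print(header_count)
```

The same loop could write each completed record to one of several output files instead of counting it, which is one possible way to approach the split. Note that the emptied elements stay attached to the root, so memory use still grows slowly on very large inputs.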

How to increase version number of a xml file after each change in the file using ETree

I'm trying to manipulate an XML file. I use a loop, and for each iteration I want the version number of the XML file to be increased. I am using ETree to manipulate the XML file. Here is what I have tried so far:
def main():
    import xml.etree.ElementTree as ET
    import os
    version = "0"
    while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
        #use parse() function to load and parse an xml file
        fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml"
        version=int(version)
        version+=1
        doc = ET.parse(fileDirect)
        .....
        #at the end after adding some data to xml file, I do the following to write the changes into the xml file:
        save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
        b_xml = ET.tostring(valeurs)
        with open(save_path_file, "wb") as f:
            f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the "f" prefix on two lines. Without the prefix, Python keeps {version} as literal text in the path, which is why the file is not found.
Changing fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml" and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml" might solve your issue.
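To see the difference the prefix makes, here is a minimal sketch; the version value 3 is only an illustrative assumption, and the path is the one from the question.

```python
version = 3  # illustrative value

# Without the f prefix the braces are kept as literal characters.
plain = "/Users/tt/sumoTracefcdfile_{version}.xml"

# With the f prefix the current value of `version` is substituted.
formatted = f"/Users/tt/sumoTracefcdfile_{version}.xml"

print(plain)
print(formatted)
```

The first path is the one that appears in the FileNotFoundError, braces included.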

Errno 36: File name too long error parsing python XML

I have an XML file that I am trying to parse in order to access one element, DonorAdvisedFundInd. That shouldn't be a problem, but when I try to parse the XML file I get an error message saying:
[Errno 36] File name too long:
Here's the code I'm currently using: I cut off most of it so it's easier to see the problem. The error is occurring on the parse line.
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.parse(xml_data)
Now the reason I'm so confused is if you open that link, the XML file really isn't all that long. It should be able to be parsed. I'm using IBM Watson Studio's online compiler if it makes any difference.
I'd appreciate any insight or feedback anyone can provide.
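Try fromstring. et.parse() expects a file name or file object, so the downloaded XML content was being interpreted as a file name; et.fromstring() parses the content itself: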
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.fromstring(xml_data)
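The distinction between the two calls can be sketched without a network request. In the example below the small XML snippet is a made-up stand-in for the downloaded content; io.BytesIO is shown as an alternative in case you prefer to keep using parse().

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the bytes returned by requests.get(...).content
xml_bytes = b"<Return><DonorAdvisedFundInd>false</DonorAdvisedFundInd></Return>"

# fromstring() takes XML content and returns the root element directly.
root_from_bytes = ET.fromstring(xml_bytes)

# parse() takes a file name or file object; wrapping the bytes works too.
root_from_file = ET.parse(io.BytesIO(xml_bytes)).getroot()

print(root_from_bytes.tag, root_from_file.findtext("DonorAdvisedFundInd"))
```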
Update (for finding the specific element):
for i in xtree.findall(".//"):
    if 'DonorAdvisedFundInd' in i.tag:
        print(i.tag, i.attrib, i.text)
Another way would be to use the xmltodict library, like this:
import xmltodict
result = xmltodict.parse(xml_data)
result['Return']['ReturnData']['IRS990']['DonorAdvisedFundInd']

parse xml file in python by xmltodict

I am using the xmltodict library in Python (https://pypi.org/project/xmltodict/) to parse an XML file like this:
import xmltodict
with open("MyXML.xml") as MyXML:
doc = xmltodict.parse(MyXML.read())
The xml file looks good but I get this error:
ExpatError: no element found: line 1, column 0
What should I do?
In my use of xmltodict, I have always parsed a string, and to get an XML string I use etree. Try this:
import xml.etree.ElementTree as ET
import xmltodict
tree = ET.parse("MyXML.xml")
root = tree.getroot()
data = xmltodict.parse(ET.tostring(root))
If your MyXML.xml file is in a different location than this script, you will need to build its path, for example with __file__ and the os module.
Good luck, hope this helps.
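One thing that may be worth checking first: the message "no element found: line 1, column 0" is what the underlying expat parser reports when it receives empty input, so the file may be empty or a different file than expected. The sketch below reproduces that with the standard library only; the helper name try_parse is an assumption for the example.

```python
from xml.parsers.expat import ExpatError, ParserCreate


def try_parse(data):
    """Return "ok" if `data` is well-formed XML, else the expat error text."""
    parser = ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except ExpatError as exc:
        return str(exc)
    return "ok"


print(try_parse(""))             # empty input, as from an empty file
print(try_parse("<a>text</a>"))  # well-formed input
```

If the first call prints the same message as in the question, printing len(MyXML.read()) before parsing would show whether the file content is actually reaching the parser.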

Parsing PubMed Central XML using Biopython Bio Entrez parse

I am trying to parse PubMed Central XML files using Biopython's Bio Entrez parse function. This is what I've tried so far:
from Bio import Entrez
for xmlfile in glob.glob ('samplepmcxml.xml'):
    print xmlfile
    fh = open (xmlfile, "r")
    read_xml (fh, outfp)
    fh.close()
def read_xml (handle, outh):
    records = Entrez.parse(handle)
    for record in records:
        print record
I am getting the following error:
Traceback (most recent call last):
  File "3parse_info_from_pmc_nxml.py", line 78, in <module>
    read_xml (fh, outfp)
  File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
    for record in records:
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
    self.parser.Parse(text, False)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces
I have already downloaded archivearticle.dtd file. Are there any other DTD files that need to be installed that would describe the schema of PMC files? Has anyone successfully used the Bio Entrez function or any other method to parse PMC articles?
Thanks for your help!
Use another parser, such as minidom:
from xml.dom import minidom
data = minidom.parse("pmc_full.xml")
Now, depending on what data you want to extract, dive into the XML and have fun:
for title in data.getElementsByTagName("article-title"):
    for node in title.childNodes:
        if node.nodeType == node.TEXT_NODE:
            print(node.data)
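Note that the loop above only prints text that sits directly inside article-title; text wrapped in nested markup such as italics is skipped. A minimal sketch of a recursive variant follows. The helper name node_text and the inline sample article are assumptions for illustration, not real PMC data.

```python
from xml.dom import minidom

# Made-up sample with a namespace declaration and nested markup in the title.
sample = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<front><article-meta><title-group>'
    '<article-title>A sample <italic>PMC</italic> title</article-title>'
    '</title-group></article-meta></front></article>'
)


def node_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(node_text(child))
    return "".join(parts)


doc = minidom.parseString(sample)
titles = [node_text(t) for t in doc.getElementsByTagName("article-title")]
print(titles)
```

minidom accepts the namespace declaration that the Bio.Entrez parser rejected, which is why swapping parsers gets past the NotImplementedError.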
