Parse an XML file in Python with xmltodict

I am using the xmltodict library in Python (https://pypi.org/project/xmltodict/) to parse an XML file like this:
import xmltodict
with open("MyXML.xml") as MyXML:
    doc = xmltodict.parse(MyXML.read())
The xml file looks good but I get this error:
ExpatError: no element found: line 1, column 0
What should I do?

In my uses of xmltodict, I have always parsed a string, and to get an XML string I use ElementTree. Try this:
import xml.etree.ElementTree as ET
import xmltodict
tree = ET.parse("MyXML.xml")
root = tree.getroot()
# ET.tostring (all lowercase) serializes the root element to bytes, which xmltodict.parse accepts
data = xmltodict.parse(ET.tostring(root))
If MyXML.xml is in a different location than this script, you will need to build the full path to it, for example with os.path.
Good luck, hope this helps.
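The "no element found: line 1, column 0" message is what the expat-based parsers typically report when they receive no data at all, so it may be worth checking whether the file is empty or whether a different file than intended was opened. A minimal sketch using only the standard library; `check_xml_text` is a hypothetical helper name, not part of xmltodict:

```python
import xml.etree.ElementTree as ET


def check_xml_text(text):
    """Return None if text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Empty input: the parser finds no root element at all.
print(check_xml_text(""))
# Well-formed input parses, so no error message is returned.
print(check_xml_text("<root><item>1</item></root>"))
```

If the first call's message matches the one in the question, printing len(MyXML.read()) before parsing is a quick way to confirm whether the file content is actually reaching the parser.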

Related

How to increase the version number of an XML file after each change using ElementTree

I'm trying to manipulate an XML file. I use a loop, and on each iteration I want the version number in the XML file name to be increased. I am using ElementTree to manipulate the file. Here is what I have tried so far:
def main():
    import xml.etree.ElementTree as ET
    import os
    version = "0"
    while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
        #use parse() function to load and parse an xml file
        fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml"
        version=int(version)
        version+=1
        doc = ET.parse(fileDirect)
        .....
        #at the end after adding some data to xml file, I do the following to write the changes into the xml file:
        save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
        b_xml = ET.tostring(valeurs)
        with open(save_path_file, "wb") as f:
            f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the "f" prefix on two lines. Without the prefix, {version} is kept as literal text, which is why the path in the error message still contains {version}.
Changing fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml" and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml" should resolve this error.
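For illustration, here is one way the versioned read-then-write cycle could look once the paths are built with f-strings. This is only a sketch under assumptions: `latest_version` and `save_next_version` are hypothetical helper names, the "revision" attribute is a placeholder for the edits elided in the question, and a temporary directory stands in for /Users/tt.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def latest_version(directory, stem="sumoTracefcdfile"):
    """Return the highest N for which <stem>_N.xml exists, or None."""
    version = 0
    if not os.path.exists(os.path.join(directory, f"{stem}_{version}.xml")):
        return None
    while os.path.exists(os.path.join(directory, f"{stem}_{version + 1}.xml")):
        version += 1
    return version


def save_next_version(directory, stem="sumoTracefcdfile"):
    """Load the latest version, modify it, and write it as version + 1."""
    current = latest_version(directory, stem)
    if current is None:
        raise FileNotFoundError(f"no {stem}_0.xml in {directory}")
    tree = ET.parse(os.path.join(directory, f"{stem}_{current}.xml"))
    root = tree.getroot()
    # Placeholder change; the real edits from the question are elided.
    root.set("revision", str(current + 1))
    new_path = os.path.join(directory, f"{stem}_{current + 1}.xml")
    tree.write(new_path)
    return new_path


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sumoTracefcdfile_0.xml"), "w") as f:
        f.write("<fcd-export/>")
    first = save_next_version(tmp)
    second = save_next_version(tmp)
    print(os.path.basename(first), os.path.basename(second))
```

Building the read path and the write path from the same integer keeps the two from drifting apart, which is easy to do when the version is converted between str and int inside the loop.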

Errno 36: File name too long error parsing python XML

I have an XML file that I am trying to parse so I can access one element, DonorAdvisedFundInd. That shouldn't be a problem, but when I try to parse the file I get an error message saying:
[Errno 36] File name too long:
Here's the code I'm currently using. I cut off most of it so it's easier to see the problem. The error occurs on the parse line.
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.parse(xml_data)
Now the reason I'm so confused is if you open that link, the XML file really isn't all that long. It should be able to be parsed. I'm using IBM Watson Studio's online compiler if it makes any difference.
I'd appreciate any insight or feedback anyone can provide.
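Try fromstring. et.parse expects a filename or a file object, so the downloaded bytes are being treated as a file path; et.fromstring parses the XML content itself: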
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.fromstring(xml_data)
Update (for finding the specific element):
for i in xtree.findall(".//"):
    if 'DonorAdvisedFundInd' in i.tag:
        print(i.tag, i.attrib, i.text)
Another way would be to use the xmltodict library:
import xmltodict
result = xmltodict.parse(xml_data)
result['Return']['ReturnData']['IRS990']['DonorAdvisedFundInd']
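The substring test on i.tag above is needed because ElementTree stores namespaced tags as "{uri}Name". As an alternative sketch, the "{*}" wildcard (available since Python 3.8) matches a tag in any namespace. The sample document, its namespace URI, and the value below are invented for illustration and are not taken from the real file:

```python
import xml.etree.ElementTree as ET

# Stand-in document; the namespace URI and values are invented for illustration.
SAMPLE = b"""<Return xmlns="http://example.org/ns">
  <ReturnData>
    <IRS990>
      <DonorAdvisedFundInd>false</DonorAdvisedFundInd>
    </IRS990>
  </ReturnData>
</Return>"""

root = ET.fromstring(SAMPLE)

# "{*}" matches the tag in any namespace (supported since Python 3.8).
element = root.find(".//{*}DonorAdvisedFundInd")
print(element.tag, element.text)
```

find returns None when nothing matches, so real code would want to check for that before reading .text.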

OSError: [Errno 36] File name too long:

I need to convert a web page to XML (using Python 3.4.3). If I write the contents of the URL to a file then I can read and parse it perfectly but if I try to read directly from the web page I get the following error in my terminal:
  File "./AnimeXML.py", line 22, in <module>
    xml = ElementTree.parse (xmlData)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 587, in parse
    source = open(source, "rb")
OSError: [Errno 36] File name too long:
My python code:
# AnimeXML.py
#! /usr/bin/Python
# Import xml parser and URL opener.
import xml.etree.ElementTree as ElementTree
from urllib.request import urlopen
# XML to parse.
sampleUrl = "http://cdn.animenewsnetwork.com/encyclopedia/api.xml?anime=16989"
# Read the xml as a file.
content = urlopen (sampleUrl)
# XML content is stored here to start working on it.
xmlData = content.readall().decode('utf-8')
# Close the file.
content.close()
# Start parsing XML.
xml = ElementTree.parse (xmlData)
# Get root of the XML file.
root = xml.getroot()
for info in root.iter("info"):
    print (info.attrib)
Is there any way I can fix my code so that I can read the web page directly into python without getting this error?
As explained in the Parsing XML section of the ElementTree docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
You're passing the whole XML contents as a giant pathname. Your XML file is probably bigger than 2K, or whatever the maximum pathname size is for your platform, hence the error. If it weren't, you'd just get a different error about there being no directory named [everything up to the first / in your XML file].
Just use fromstring instead of parse.
Or, notice that parse can take a file object, not just a filename. And the thing returned by urlopen is a file object.
Also notice the very next line in that section:
fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree.
So, you don't want that root = tree.getroot() either.
So:
# ...
content.close()
root = ElementTree.fromstring(xmlData)
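The file-object route mentioned above can be sketched like this. To keep the example runnable without network access, io.BytesIO stands in for the object returned by urlopen (an assumption for illustration; both expose read()), and the XML content is invented:

```python
import io
import xml.etree.ElementTree as ET

# io.BytesIO stands in for the response object returned by urlopen();
# both expose a read() method, which is all ET.parse needs.
# The XML content here is invented for illustration.
response = io.BytesIO(
    b'<ann><anime id="1"><info type="Genres">drama</info></anime></ann>'
)

tree = ET.parse(response)   # file object in, ElementTree out
root = tree.getroot()       # getroot() is needed here, unlike with fromstring()
infos = [info.attrib for info in root.iter("info")]
print(infos)
```

Passing the response object directly also avoids decoding the bytes by hand, since the parser reads the encoding from the XML declaration.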

Counting the words in an XML file results in an error

I am new to Python. I am trying to parse an XML document to count the total number of words. I tried the program below to count the words in the file, but I get the error shown below.
After getting this error, I installed "utils", but the error still occurs.
Is there an easier way of getting the total number of words in an XML document in Python? Please help!
Traceback (most recent call last):
  File "C:\Python27\xmlp.py", line 1, in <module>
    from xml.dom import utils,core
ImportError: cannot import name utils
Code:
from xml.dom import utils,core
import string
reader = utils.FileReader('Greeting.xml')
doc = reader.document
Storage = ""
for n in doc.documentElement.childNodes:
    if n.nodeType == core.TEXT_NODE:
        # Accumulate contents of text nodes
        Storage = Storage + n.nodeValue
print len(string.split(Storage))
You'll find it easier to use ElementTree, eg:
from xml.etree import ElementTree as ET
xml = '<a>one two three<b>four five<c>Six Seven</c></b></a>'
tree = ET.fromstring(xml)
total = sum(len(text.split()) for text in tree.itertext())
# 7
But use tree = ET.parse('Greeting.xml') to load your real data.
IMHO you do not need utils and core; just use:
from xml.dom import minidom
See a similar example here: Python XML File Open
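For the minidom route, a word count might be sketched as follows. `count_words` is a hypothetical helper; it walks every descendant, whereas the loop in the question only looks at direct children of the root element:

```python
from xml.dom import minidom


def count_words(node):
    """Recursively count whitespace-separated words in all text nodes."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.split())
    return sum(count_words(child) for child in node.childNodes)


# Same sample string as the ElementTree example; use minidom.parse(path) for a file.
doc = minidom.parseString("<a>one two three<b>four five<c>Six Seven</c></b></a>")
total = count_words(doc.documentElement)
print(total)
```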

Parsing PubMed Central XML using Biopython Bio Entrez parse

I am trying to parse PubMed Central XML files using Biopython's Bio Entrez parse function. This is what I've tried so far:
from Bio import Entrez
import glob
def read_xml (handle, outh):
    records = Entrez.parse(handle)
    for record in records:
        print record
for xmlfile in glob.glob ('samplepmcxml.xml'):
    print xmlfile
    fh = open (xmlfile, "r")
    read_xml (fh, outfp)
    fh.close()
I am getting the following error:
Traceback (most recent call last):
  File "3parse_info_from_pmc_nxml.py", line 78, in <module>
    read_xml (fh, outfp)
  File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
    for record in records:
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
    self.parser.Parse(text, False)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces
I have already downloaded archivearticle.dtd file. Are there any other DTD files that need to be installed that would describe the schema of PMC files? Has anyone successfully used the Bio Entrez function or any other method to parse PMC articles?
Thanks for your help!
Use another parser, such as minidom:
from xml.dom import minidom
data = minidom.parse("pmc_full.xml")
Now, depending on what data you want to extract, dive into the XML and have fun:
for title in data.getElementsByTagName("article-title"):
    for node in title.childNodes:
        if node.nodeType == node.TEXT_NODE:
            print(node.data)
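The loop above only prints text that sits directly under article-title. If a title contains inline markup, a recursive helper can gather the nested text as well. This is a sketch under assumptions: `text_of` is a hypothetical helper and the sample document is invented, not a real PMC record:

```python
from xml.dom import minidom

# Invented sample; real PMC files are much larger.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><title-group>
    <article-title>Effects of <italic>in vitro</italic> culture</article-title>
  </title-group></article-meta></front>
</article>"""


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


doc = minidom.parseString(SAMPLE)
titles = [text_of(t) for t in doc.getElementsByTagName("article-title")]
print(titles)
```

Note that minidom accepts the namespace declaration on the root element here, which is the feature the Bio.Entrez parser rejected in the traceback.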
