Python Minidom Parsing File Objects - python

I wrote a code using minidom which takes an xml script, opens it as a file object and then parses that file object. Not only that, but I want the script to open multiple files that are all contained in a folder, and parse each one individually.
An example of the xml script is:
<?xml version="1.0"?>
<Data>
<data1>1</data1>
<data2>2</data2>
<data3>3</data3>
<Sub_data>
<sub_data1>0.1111111111111</sub_data1>
<sub_data2>0.2222222222222</sub_data2>
... and so on.
i.e., it's pretty standard.
Now, my code looks like this:
import os
import io
from xml.dom import minidom
#folder where xml files are located
indir = '/foo/bar/docs/'
masterlist = []
for root, dirs, filenames in os.walk(indir):
for f in filenames:
row = []
fsock = io.open(indir + f, mode = 'rt', encoding = 'cp1252')
xmldoc = minidom.parse(fsock)
...
and the error I am getting is:
Traceback (most recent call last): File "kgp_2.py", line 34, in
<module> xmldoc = minidom.parse(fsock) File
"/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse return
expatbuilder.parse(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 211, in parseFile
parser.Parse("", True) xml.parsers.expat.ExpatError: no element found:
line 203, column 1381
Now, when I make the change:
fsock = io.open(indir + filenames[0], mode = 'rt', encoding = 'cp1252')
this works fine, that is, it opens the first file in the folder; but I want to parse all the files in the folder. When I do a loop like:
m = 0
... in loop:
fsock = io.open(indir + filenames[m], mode = 'rt', encoding = 'cp1252')
...
m = m+1
I get the original error.
The reason I am using the io library instead of the usual file open function is that a previous stack overflow article recommended it. Using:
fsock = open(indir + filenames[0])
like before, gets no error, but:
fsock = open(indir + f)
or
#with a loop over m, like above
fsock = open(infir + filenames[m])
get the same error as above.
A strange problem. When I print the filenames they are correct. And they are being opened, there's no error there. It's the parser that just won't parse the object files, even with filenames[m] where m = 0, surely this should be no problem?
EDIT:
Parsing document with python minidom
in this post they had a similar problem, the resolution was to use
xmldoc.seek(0)
however, for me this returns
Traceback (most recent call last):
File "kgp_2.py", line 45, in <module>
xmldoc.seek(0)
AttributeError: Document instance has no attribute 'seek'
EDIT 2: THIS HAS BEEN RESOLVED. IT WAS A CASE OF A CORRUPTED INPUT XML FILE.

Are you sure the XML data contained in all XML files is correct? Perhaps one is empty an you have to handle such Exception. Anyhow I recommend you to use xml.etree doc.

Related

How to increase version number of a xml file after each change in the file using ETree

I'm trying to manipulate a xml file. I use a loop and for each iteration I want the version number of the xml file to be increased. For manipulating the xml file I using ETree. Here is what I have tried so far:
def main():
import xml.etree.ElementTree as ET
import os
version = "0"
while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
#use parse() function to load and parse an xml file
fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml"
version=int(version)
version+=1
doc = ET.parse(fileDirect)
.....
#at the end after adding some data to xml file, I do the following to write the changes into the xml file:
save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
b_xml = ET.tostring(valeurs)
with open(save_path_file, "wb") as f:
f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the "f" in 2 lines.
Changing fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml" and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml" might solve your issues.

ElementTree errors, html files will not parse using Python/Sublime

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, but the first one is this: I can not get it to properly parse the file. Below is a brief explanation, the python code and the traceback info.
Using Python & Sublime to parse html files, I am getting several errors. What IS working: it runs fine up until if '.html' in file:. It does not execute that loop. It will iterate through print allFiles just fine. It also creates the csv file and creates the headers (though not in separate columns, but I can ask about that later).
It seems that the problem is in the if tree = ET.parse(HTML_PATH+"/"+file) piece. I've written this several different ways (without "/" and/or "file", for example)--so far I have yet to resolve this problem.
If I can provide more information or if anyone can direct me to other documenation, it would be greatly appreciated. So far I have yet to find anything that addresses this issue.
Many thanks for your thoughts.
//C
# Parses out data from crawled html files under "html files"
# and places the output in output.csv.
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
#print allFiles
if '.html' in file:
print "in html loop"
tree = ET.parse(HTML_PATH+"/"+file)
print '===================='
print 'Parsing file: '+file
print '===================='
for node in tree.iter():
print "tbody"
# The tbody attribute spells it all (or does it):
name = node.attrib.get('/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font')
# Check common header stuff
if name=='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
#print ' ------------------'
#print ' Category:'
category=node.text
print "category"
f.close()
Traceback:
File "/Users/C/data/Folder_NS/data_parse.py", line 34, in
tree = ET.parse(HTML_PATH+"/"+file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 63, column 2
You are trying to parse HTML with an XML parser, and valid HTML is not always valid XML. You would be better off using the HTML parsing library in the lxml package.
import xml.etree.ElementTree as ET
# ...
tree = ET.parse(HTML_PATH + '/' + file)
would be changed to
import lxml.html
# ...
tree = lxml.html.parse(HTML_PATH + '/' + file)

lxml not performing xslt transform

With this code:
from lxml import etree
with open( 'C:\\Python33\\projects\\xslt', 'r' ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+' ) as result, open( 'C:\\Python33\\projects\\xml', 'r' ) as xml:
s_xml = xml.read()
s_xslt = xslt.read()
transform = etree.XSLT(etree.XML(s_xslt))
out = transform(etree.XML(s_xml))
result.write(out)
I get this error:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
from projects.xslt_transform import trans
File ".\projects\xslt_transform.py", line 17, in <module>
transform = etree.XSLT(etree.XML(s_xslt))
File "xslt.pxi", line 409, in lxml.etree.XSLT.__init__ (src\lxml\lxml.etree.c:150256)
lxml.etree.XSLTParseError: Invalid expression
this couple xml/xslt files works with other tools.
Also I had to get rid of the encoding attribute in the top declarations for both files in order not to get:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
can it be related ?
EDIT:
this does not work either (i get the same error):
with open( 'C:\\Python33\\projects\\xslt', 'r',encoding="utf-8" ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+',encoding="utf-8" ) as result, open( 'C:\\Python33\\projects\\xml', 'r',encoding="utf-8" ) as xml:
s_xml = etree.parse(BytesIO(bytes(xml.read(),'UTF-8')))
s_xslt = etree.parse(BytesIO(bytes(xslt.read(),'UTF-8')))
transform = etree.XSLT(s_xslt)
out = transform(s_xml)
print(out.tostring())
reading lxml source code: this returns an exception:
xslt.xsltParseStylesheetDoc(c_doc)
so it seems an actual parse error. Can it be namespace related ?
EDIT SOLVED:
s_xml = etree.parse(xml.read())
s_xslt = etree.parse(xslt.read())
thanks tomalak
Parsing XML is more complicated than "open a text file, stuff the resulting string into etree".
XML files are serialized representations of a DOM tree. They are not to be handled as text even though they come in the shape of a text file. They come in multiple byte encodings and finding out which encoding a certain file uses is anything but trivial.
XML parsers have proper detection mechanisms built in and therefore they should be used to open XML files. The the basic open() + read() calls are not enough to correctly handle the file contents.
lxml.etree provides the parse() function that can accept a number of argument types:
an open file object (make sure to open it in binary mode)
a file-like object that has a .read(byte_count) method returning a byte string on each call
a filename string
an HTTP or FTP URL string
and then will correctly parse the associated document back into a DOM tree.
Your code should look more like this:
from lxml import etree
f_xsl = 'C:\\Python33\\projects\\xslt'
f_xml = 'C:\\Python33\\projects\\xml'
f_out = 'C:\\Python33\\projects\\result'
transform = etree.XSLT(etree.parse(f_xsl))
result = transform(etree.parse(f_xml))
result.write(f_out)

pypdf for lists of pdfs

I have got pypdf to work just fine for a single pdf file, but I can not seem to get it to work for a lits of files, or in a for loop for multiple pdfs, without failing because of the string not being callable. Any ideas I can use as a work around?
def getPDFContent(path):
content = ""
# Load PDF into pyPDF
pdf = pyPdf.PdfFileReader(file(path, "rb"))
# Iterate pages
for i in range(0, pdf.getNumPages()):
# Extract text from page and add to content
content += pdf.getPage(i).extractText() + "\n"
# Collapse whitespace
content = " ".join(content.replace(u"\xa0", " ").strip().split())
return content
#print getPDFContent(r"Z:\GIS\MasterPermits\12300983.pdf").encode("ascii", "ignore")
#find pdfs
for root, dirs, files in os.walk(folder1):
for file in files:
if file.endswith(('.pdf')):
d=os.path.join(root, file)
print getPDFContent(d).encode("ascii", "ignore")
Traceback (most recent call last):
File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 50, in <module>
print getPDFContent(d).encode("ascii", "ignore")
File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 32, in getPDFContent
pdf = pyPdf.PdfFileReader(file(path, "rb"))
TypeError: 'str' object is not callable
I was using a list, but I got the exact same error, I didnt think this would be a big deal, but as of right now it is becoming one. I know I was able to work around similar issues in arcpy, but this is nothing close
Try not to use built-in types for variable names:
Don't do this:
for file in files:
Do this instead:
for myfile in files:

XML (.xsd) feed validation against a schema

I have a XML file and I have a XML schema. I want to validate the file against that schema and check if it adheres to that. I am using python but am open to any language for that matter if there is no such useful library in python.
What would be my best options here? I would worry about the how fast I can get this up and running.
Definitely lxml.
Define an XMLParser with a predefined schema, load the the file fromstring() and catch any XML Schema errors:
from lxml import etree
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'r') as f:
etree.fromstring(f.read(), xmlparser)
return True
except etree.XMLSchemaError:
return False
schema_file = 'schema.xsd'
with open(schema_file, 'r') as f:
schema_root = etree.XML(f.read())
schema = etree.XMLSchema(schema_root)
xmlparser = etree.XMLParser(schema=schema)
filenames = ['input1.xml', 'input2.xml', 'input3.xml']
for filename in filenames:
if validate(xmlparser, filename):
print("%s validates" % filename)
else:
print("%s doesn't validate" % filename)
Note about encoding
If the schema file contains an xml tag with an encoding (e.g. <?xml version="1.0" encoding="UTF-8"?>), the code above will generate the following error:
Traceback (most recent call last):
File "<input>", line 2, in <module>
schema_root = etree.XML(f.read())
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
A solution is to open the files in byte mode: open(..., 'rb')
[...]
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'rb') as f:
[...]
with open(schema_file, 'rb') as f:
[...]
The python snippet is good, but an alternative is to use xmllint:
xmllint -schema sample.xsd --noout sample.xml
import xmlschema
def get_validation_errors(xml_file, xsd_file):
schema = xmlschema.XMLSchema(xsd_file)
validation_error_iterator = schema.iter_errors(xml_file)
errors = list()
for idx, validation_error in enumerate(validation_error_iterator, start=1):
err = validation_error.__str__()
errors.append(err)
print(err)
return errors

Categories

Resources