lxml not performing xslt transform - python

With this code:
from lxml import etree
with open( 'C:\\Python33\\projects\\xslt', 'r' ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+' ) as result, open( 'C:\\Python33\\projects\\xml', 'r' ) as xml:
    s_xml = xml.read()
    s_xslt = xslt.read()
    transform = etree.XSLT(etree.XML(s_xslt))
    out = transform(etree.XML(s_xml))
    result.write(out)
I get this error:
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    from projects.xslt_transform import trans
  File ".\projects\xslt_transform.py", line 17, in <module>
    transform = etree.XSLT(etree.XML(s_xslt))
  File "xslt.pxi", line 409, in lxml.etree.XSLT.__init__ (src\lxml\lxml.etree.c:150256)
lxml.etree.XSLTParseError: Invalid expression
This pair of XML/XSLT files works with other tools.
Also, I had to remove the encoding attribute from the XML declaration at the top of both files to avoid getting:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Could the two problems be related?
EDIT:
This does not work either (I get the same error):
with open( 'C:\\Python33\\projects\\xslt', 'r',encoding="utf-8" ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+',encoding="utf-8" ) as result, open( 'C:\\Python33\\projects\\xml', 'r',encoding="utf-8" ) as xml:
    s_xml = etree.parse(BytesIO(bytes(xml.read(),'UTF-8')))
    s_xslt = etree.parse(BytesIO(bytes(xslt.read(),'UTF-8')))
    transform = etree.XSLT(s_xslt)
    out = transform(s_xml)
    print(out.tostring())
Reading the lxml source code, this is the call that raises the exception:
xslt.xsltParseStylesheetDoc(c_doc)
so it seems to be an actual parse error. Could it be namespace-related?
EDIT SOLVED:
s_xml = etree.parse(xml.read())
s_xslt = etree.parse(xslt.read())
thanks tomalak

Parsing XML is more complicated than "open a text file, stuff the resulting string into etree".
XML files are serialized representations of a DOM tree. They are not to be handled as text even though they come in the shape of a text file. They come in multiple byte encodings and finding out which encoding a certain file uses is anything but trivial.
XML parsers have proper detection mechanisms built in, so they should be the ones to open XML files. The basic open() + read() calls are not enough to handle the file contents correctly.
lxml.etree provides the parse() function that can accept a number of argument types:
an open file object (make sure to open it in binary mode)
a file-like object that has a .read(byte_count) method returning a byte string on each call
a filename string
an HTTP or FTP URL string
and then will correctly parse the associated document back into a DOM tree.
Your code should look more like this:
from lxml import etree
f_xsl = 'C:\\Python33\\projects\\xslt'
f_xml = 'C:\\Python33\\projects\\xml'
f_out = 'C:\\Python33\\projects\\result'
transform = etree.XSLT(etree.parse(f_xsl))
result = transform(etree.parse(f_xml))
result.write(f_out)
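The point about letting the parser open the file can be shown with a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, purely so that it runs without third-party packages; the sample document and file name are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document: encoded as ISO-8859-1 and declaring that encoding.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><greeting>café</greeting>'
).encode("iso-8859-1")

path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "wb") as f:
    f.write(xml_bytes)

# Handing the file name to the parser: it reads the raw bytes, honours the
# encoding declaration, and decodes the text correctly.
parsed_text = ET.parse(path).getroot().text

# Opening the file as text with a guessed codec: the declaration is ignored,
# and the byte for "é" is not valid UTF-8, so it turns into U+FFFD.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    naive_text = f.read()

print(repr(parsed_text))
print("\ufffd" in naive_text)
```

The same reasoning applies to etree.parse(filename) in lxml: the parser, not open(), decides how to decode the bytes.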

Related

Write ElementTree directly to zip with utf-8 encoding

I want to modify a large number of XMLs that are stored in ZIP files. The source XMLs are UTF-8 encoded (at least according to the file tool on Linux) and have a correct XML declaration:
<?xml version='1.0' encoding='UTF-8'?>.
The target ZIPs and the XMLs contained therein should also have the correct XML declaration. However, the (at least to me) most obvious method (using ElementTree.tostring) fails.
Here is a self-contained example, that should work out of the box.
Short walkthrough:
imports
preparations (creating src.zip, these ZIPs are a given in my actual application)
actual work of program (modifying XMLs), starting at # read XMLs from zip
Please focus on the lower part, especially # APPROACH 1, APPROACH 2, APPROACH 3:
import os
import tempfile
import zipfile
from xml.etree.ElementTree import Element, ElementTree, parse, tostring
src_1 = os.path.join(tempfile.gettempdir(), "one.xml")
src_2 = os.path.join(tempfile.gettempdir(), "two.xml")
src_zip = os.path.join(tempfile.gettempdir(), "src.zip")
trgt_appr1_zip = os.path.join(tempfile.gettempdir(), "trgt_appr1.zip")
trgt_appr2_zip = os.path.join(tempfile.gettempdir(), "trgt_appr2.zip")
trgt_appr3_zip = os.path.join(tempfile.gettempdir(), "trgt_appr3.zip")
# file on hard disk that must be used due to ElementTree insufficiencies
tmp_xml_name = os.path.join(tempfile.gettempdir(), "curr_xml.tmp")
# prepare src.zip
tree1 = ElementTree(Element('hello', {'beer': 'good'}))
tree1.write(os.path.join(tempfile.gettempdir(), "one.xml"), encoding="UTF-8", xml_declaration=True)
tree2 = ElementTree(Element('scnd', {'äkey': 'a value'}))
tree2.write(os.path.join(tempfile.gettempdir(), "two.xml"), encoding="UTF-8", xml_declaration=True)
with zipfile.ZipFile(src_zip, 'a') as src:
    with open(src_1, 'r', encoding="utf-8") as one:
        string_representation = one.read()
    # write to zip
    src.writestr(zinfo_or_arcname="one.xml", data=string_representation.encode("utf-8"))
    with open(src_2, 'r', encoding="utf-8") as two:
        string_representation = two.read()
    # write to zip
    src.writestr(zinfo_or_arcname="two.xml", data=string_representation.encode("utf-8"))
os.remove(src_1)
os.remove(src_2)
# read XMLs from zip
with zipfile.ZipFile(src_zip, 'r') as zfile:
    updated_trees = []
    for xml_name in zfile.namelist():
        curr_file = zfile.open(xml_name, 'r')
        tree = parse(curr_file)
        # modify tree
        updated_tree = tree
        updated_tree.getroot().append(Element('new', {'newkey': 'new value'}))
        updated_trees.append((xml_name, updated_tree))
    for xml_name, updated_tree in updated_trees:
        # write to target file
        with zipfile.ZipFile(trgt_appr1_zip, 'a') as trgt1_zip, zipfile.ZipFile(trgt_appr2_zip, 'a') as trgt2_zip, zipfile.ZipFile(trgt_appr3_zip, 'a') as trgt3_zip:
            #
            # APPROACH 1 [DESIRED, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration missing
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml')
            # write XML directly to zip
            trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
            #
            # APPROACH 2 [WORKS IN THEORY, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration is faulty (is 'utf8', should be 'utf-8' or 'UTF-8')
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='utf8', method='xml')
            # write XML directly to zip
            trgt2_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
            #
            # APPROACH 3 [WORKS, BUT LACKS PERFORMANCE]: write to file, then read from file, then write to zip
            #
            # write to file
            updated_tree.write(tmp_xml_name, encoding="UTF-8", method="xml", xml_declaration=True)
            # read from file
            with open(tmp_xml_name, 'r', encoding="utf-8") as tmp:
                string_representation = tmp.read()
            # write to zip
            trgt3_zip.writestr(zinfo_or_arcname=xml_name, data=string_representation.encode("utf-8"))
            os.remove(tmp_xml_name)
APPROACH 3 works, but it is much more resource-intensive than the other two.
APPROACH 2 is the only way I could get an ElementTree object to be written with an actual XML declaration -- which then turns out to be invalid (utf8 instead of UTF-8/utf-8).
APPROACH 1 would be most desired -- but fails during reading later in the pipeline, as the XML declaration is missing.
Question: How can I get rid of writing the whole XML to disk first, only to read it afterwards, write it to the zip and delete it after being done with the zip? What am I missing?
You can use an io.BytesIO object.
This allows using ElementTree.write, while avoiding exporting the tree to disk:
import zipfile
from io import BytesIO
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
bio = BytesIO()
tree.write(bio, encoding='UTF-8', xml_declaration=True)
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    z.writestr('test.xml', bio.getvalue())
If you are using Python 3.6 or higher, there's an even shorter solution:
you can get a writable file object from the ZipFile object, which you can pass to ElementTree.write:
import zipfile
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    with z.open('test.xml', 'w') as f:
        tree.write(f, encoding='UTF-8', xml_declaration=True)
This also has the advantage that you don't store multiple copies of the tree in memory, which could be a relevant issue for large trees.
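Applied to the read-modify-write loop from the question, the second variant might look like the sketch below. It assumes Python 3.6 or later; the archives are kept in memory with io.BytesIO only to make the example self-contained, and the element and file names are borrowed from the question.

```python
import io
import zipfile
from xml.etree.ElementTree import Element, ElementTree, parse

# Build a small source archive in memory (a stand-in for the question's src.zip).
src_buf = io.BytesIO()
with zipfile.ZipFile(src_buf, "w") as src:
    for name, root in (("one.xml", Element("hello", {"beer": "good"})),
                       ("two.xml", Element("scnd", {"äkey": "a value"}))):
        with src.open(name, "w") as f:
            ElementTree(root).write(f, encoding="UTF-8", xml_declaration=True)

# Read each XML from the source archive, modify it, and stream it straight
# into the target archive without touching the disk.
dst_buf = io.BytesIO()
with zipfile.ZipFile(src_buf, "r") as zin, zipfile.ZipFile(dst_buf, "w") as zout:
    for name in zin.namelist():
        with zin.open(name) as f:
            tree = parse(f)
        tree.getroot().append(Element("new", {"newkey": "new value"}))
        with zout.open(name, "w") as f:
            tree.write(f, encoding="UTF-8", xml_declaration=True)

# Check the result: collect the first line and the full content of every member.
with zipfile.ZipFile(dst_buf) as check:
    contents = {n: check.read(n) for n in check.namelist()}
first_lines = {n: data.split(b"\n", 1)[0] for n, data in contents.items()}
print(first_lines)
```

With real files, the two BytesIO buffers would simply be replaced by the paths of the source and target ZIPs.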
The only thing really missing in approach 1 is the XML declaration. ElementTree.write(...) has an xml_declaration argument for this; unfortunately, in your Python version the tostring function of the xml.etree.ElementTree module does not have it yet.
Starting with Python 3.8, tostring does accept an xml_declaration argument, see:
https://docs.python.org/3.8/library/xml.etree.elementtree.html
Even though that implementation is unavailable to you when using Python 3.6, you can easily copy the 3.8 implementation in your own Python file:
import io
from xml.etree.ElementTree import ElementTree
def tostring(element, encoding=None, method=None, *,
             xml_declaration=None, default_namespace=None,
             short_empty_elements=True):
    """Generate string representation of XML element.
    All subelements are included. If encoding is "unicode", a string
    is returned. Otherwise a bytestring is returned.
    *element* is an Element instance, *encoding* is an optional output
    encoding defaulting to US-ASCII, *method* is an optional output which can
    be one of "xml" (default), "html", "text" or "c14n", *default_namespace*
    sets the default XML namespace (for "xmlns").
    Returns an (optionally) encoded string containing the XML data.
    """
    stream = io.StringIO() if encoding == 'unicode' else io.BytesIO()
    ElementTree(element).write(stream, encoding,
                               xml_declaration=xml_declaration,
                               default_namespace=default_namespace,
                               method=method,
                               short_empty_elements=short_empty_elements)
    return stream.getvalue()
(See https://github.com/python/cpython/blob/v3.8.0/Lib/xml/etree/ElementTree.py#L1116)
In that case you can simply use approach one:
# create byte representation of elementtree
byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml', xml_declaration=True)
# write XML directly to zip
trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
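For reference, the behaviour behind approaches 1 and 2 can be reproduced in a few lines. This is a sketch that assumes Python 3.8 or later for the last call; the element is the one from the question.

```python
from xml.etree.ElementTree import Element, tostring

root = Element("hello", {"beer": "good"})

# Approach 1: 'UTF-8' counts as a default encoding, so no declaration is written.
plain = tostring(root, encoding="UTF-8", method="xml")

# Approach 2: the spelling 'utf8' is not matched as a default encoding, so a
# declaration is written, and it repeats the spelling that was passed in.
aliased = tostring(root, encoding="utf8", method="xml")

# Python 3.8+: ask for the declaration explicitly and keep the 'UTF-8' spelling.
declared = tostring(root, encoding="UTF-8", method="xml", xml_declaration=True)

for label, data in (("plain", plain), ("aliased", aliased), ("declared", declared)):
    print(label, data)
```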

How to compress a text file?

I have a text file created and I want to compress it.
How would I accomplish this?
I did some research on the forum and found a similar question, but when I tried its code it did not work for me, because it compresses text typed into the script rather than a file. For example:
import zlib, base64
text = 'STACK OVERFLOW'
code = base64.b64encode(zlib.compress(text,9))
print code
source from: (Compressing a file in python and keep the grammar exact when opening it again)
When I tried it with a file, this error came up:
Traceback (most recent call last):
  File "C:\Users\Shahid\Desktop\Suhail\Task 3.py", line 3, in <module>
    code = base64.b64encode(zlib.compress(text,9))
TypeError: must be string or read-only buffer, not file
Here is the code that I have used:
import zlib, base64
text = open('Suitable.txt','r')
code = base64.b64encode(zlib.compress(text,9))
print code
But what I want is to compress a text file.
There is a section entitled "Example of how to GZIP compress an existing file" at the bottom of https://docs.python.org/2/library/gzip.html
You should use this code to do what you tried:
import zlib, base64
file = open('Suitable.txt','r')
text = file.read()
file.close()
code = base64.b64encode(zlib.compress(text.encode('utf-8'),9))
code = code.decode('utf-8')
print(code)
Note, however, that this will not necessarily make anything smaller: for a short input, code can end up longer than text.
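If the goal is a compressed file on disk rather than a base64 string, a minimal Python 3 sketch with the standard gzip module could look like this. The file is created in a temporary directory with repetitive dummy content so that the example is self-contained; in practice src_path would point at the existing Suitable.txt.

```python
import gzip
import os
import shutil
import tempfile

# Create a stand-in for the text file (dummy content, for illustration only).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "Suitable.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("STACK OVERFLOW\n" * 1000)

# Compress: copy the raw bytes of the file into a gzip stream.
gz_path = src_path + ".gz"
with open(src_path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Decompress again to confirm that the text survives the round trip.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    restored = f.read()

original_size = os.path.getsize(src_path)
compressed_size = os.path.getsize(gz_path)
print(original_size, compressed_size)
```

How much is saved depends on the content; highly repetitive text like this dummy file compresses very well, while a short file may barely shrink.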

Python Minidom Parsing File Objects

I wrote a code using minidom which takes an xml script, opens it as a file object and then parses that file object. Not only that, but I want the script to open multiple files that are all contained in a folder, and parse each one individually.
An example of the XML file is:
<?xml version="1.0"?>
<Data>
  <data1>1</data1>
  <data2>2</data2>
  <data3>3</data3>
  <Sub_data>
    <sub_data1>0.1111111111111</sub_data1>
    <sub_data2>0.2222222222222</sub_data2>
... and so on.
i.e., it's pretty standard.
Now, my code looks like this:
import os
import io
from xml.dom import minidom
#folder where xml files are located
indir = '/foo/bar/docs/'
masterlist = []
for root, dirs, filenames in os.walk(indir):
    for f in filenames:
        row = []
        fsock = io.open(indir + f, mode = 'rt', encoding = 'cp1252')
        xmldoc = minidom.parse(fsock)
        ...
and the error I am getting is:
Traceback (most recent call last):
  File "kgp_2.py", line 34, in <module>
    xmldoc = minidom.parse(fsock)
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 211, in parseFile
    parser.Parse("", True)
xml.parsers.expat.ExpatError: no element found: line 203, column 1381
Now, when I make the change:
fsock = io.open(indir + filenames[0], mode = 'rt', encoding = 'cp1252')
this works fine, that is, it opens the first file in the folder; but I want to parse all the files in the folder. When I do a loop like:
m = 0
... in loop:
fsock = io.open(indir + filenames[m], mode = 'rt', encoding = 'cp1252')
...
m = m+1
I get the original error.
The reason I am using the io library instead of the usual file open function is that a previous stack overflow article recommended it. Using:
fsock = open(indir + filenames[0])
like before, gets no error, but:
fsock = open(indir + f)
or
#with a loop over m, like above
fsock = open(indir + filenames[m])
get the same error as above.
A strange problem. When I print the filenames they are correct. And they are being opened, there's no error there. It's the parser that just won't parse the object files, even with filenames[m] where m = 0, surely this should be no problem?
EDIT:
Parsing document with python minidom
in this post they had a similar problem, the resolution was to use
xmldoc.seek(0)
however, for me this returns
Traceback (most recent call last):
  File "kgp_2.py", line 45, in <module>
    xmldoc.seek(0)
AttributeError: Document instance has no attribute 'seek'
EDIT 2: THIS HAS BEEN RESOLVED. IT WAS A CASE OF A CORRUPTED INPUT XML FILE.
Are you sure the XML data in all of the files is well-formed? Perhaps one of them is empty or truncated, in which case you have to handle the resulting exception. In any case, I recommend using xml.etree instead; see its documentation.
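One way to find the offending file is to catch the parser error per file and record which ones failed. The sketch below (Python 3) creates three made-up sample files in a temporary folder, one of them truncated and one empty, and passes each path to minidom.parse so that the parser opens the file itself.

```python
import os
import tempfile
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample folder: one well-formed file, one truncated, one empty.
indir = tempfile.mkdtemp()
samples = {
    "good.xml": '<?xml version="1.0"?><Data><data1>1</data1></Data>',
    "truncated.xml": '<?xml version="1.0"?><Data><data1>1</data1>',
    "empty.xml": "",
}
for name, text in samples.items():
    with open(os.path.join(indir, name), "w", encoding="utf-8") as f:
        f.write(text)

parsed = {}   # file name -> parsed Document
failed = {}   # file name -> parser error message

for root, dirs, filenames in os.walk(indir):
    for name in filenames:
        # Join with the directory reported by os.walk, so that files in
        # subfolders are found as well.
        path = os.path.join(root, name)
        try:
            parsed[name] = minidom.parse(path)
        except ExpatError as exc:
            failed[name] = str(exc)

print(sorted(parsed))
print(failed)
```

The keys of failed then name the files that need to be inspected or regenerated.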

lxml.etree: Start tag expected, '<' not found, line 1, column 1

I want to take some simple XML files and convert them all to CSV in one go (though this code handles just one at a time). It looks to me like there are no official namespaces, but I'm not sure.
I have this code (I used one header, SubmittingSystemVendor, but I really want to write all of them to CSV):
import csv
import lxml.etree
x = r'C:\Users\...\jh944.xml'
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow('SubmittingSystemVendor')
    root = lxml.etree.fromstring(x)
    writer.writerow(row)
Here is a sample of the XML file:
<?xml version="1.0" encoding="utf-8"?>
<EOYGeneralCollectionGroup SchemaVersionMajor="2014-2015" SchemaVersionMinor="1" CollectionId="157" SubmittingSystemName="MISTAR" SubmittingSystemVendor="WayneRESA" SubmittingSystemVersion="2014" xsi:noNamespaceSchemaLocation="http://cepi.state.mi.us/msdsxml/EOYGeneralCollection2014-20151.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EOYGeneralCollection>
<SubmittingEntity>
<SubmittingEntityTypeCode>D</SubmittingEntityTypeCode>
<SubmittingEntityCode>82730</SubmittingEntityCode>
</SubmittingEntity>
The error is:
lxml.etree: Start tag expected, '<' not found, line 1, column 1
You are using lxml.etree.fromstring, but giving it a file path as the argument. This means it's trying to interpret "C:\Users...\jh944.xml" as the XML data to be parsed.
Instead, you want to open the file containing this XML. You can simply replace the call to fromstring with lxml.etree.parse, which will accept a filename or open file object as the argument.
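The difference between the two calls can be seen in a short sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; the sample file is a cut-down, made-up version of the one in the question, and the CSV is written to an in-memory buffer instead of output.csv.

```python
import csv
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Cut-down, hypothetical version of the sample file, written to a temp folder.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="utf-8"?>\n'
            '<EOYGeneralCollectionGroup SubmittingSystemName="MISTAR" '
            'SubmittingSystemVendor="WayneRESA" />')

# fromstring() treats its argument as XML text, so a file path is rejected.
try:
    ET.fromstring(path)
    fromstring_failed = False
except ET.ParseError:
    fromstring_failed = True

# parse() treats its argument as a file name (or file object) and reads the file.
root = ET.parse(path).getroot()
vendor = root.attrib["SubmittingSystemVendor"]

# writerow() expects a sequence of cells, hence the lists.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["SubmittingSystemVendor"])
writer.writerow([vendor])
csv_text = buffer.getvalue()
print(csv_text)
```

Note that writerow('SubmittingSystemVendor') with a bare string would write one cell per character, which is why the header is wrapped in a list here.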

OSError: [Errno 36] File name too long:

I need to convert a web page to XML (using Python 3.4.3). If I write the contents of the URL to a file then I can read and parse it perfectly but if I try to read directly from the web page I get the following error in my terminal:
  File "./AnimeXML.py", line 22, in <module>
    xml = ElementTree.parse (xmlData)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 587, in parse
    source = open(source, "rb")
OSError: [Errno 36] File name too long:
My python code:
# AnimeXML.py
#! /usr/bin/Python
# Import xml parser.
import xml.etree.ElementTree as ElementTree
from urllib.request import urlopen
# XML to parse.
sampleUrl = "http://cdn.animenewsnetwork.com/encyclopedia/api.xml?anime=16989"
# Read the xml as a file.
content = urlopen (sampleUrl)
# XML content is stored here to start working on it.
xmlData = content.readall().decode('utf-8')
# Close the file.
content.close()
# Start parsing XML.
xml = ElementTree.parse (xmlData)
# Get root of the XML file.
root = xml.getroot()
for info in root.iter("info"):
    print (info.attrib)
Is there any way I can fix my code so that I can read the web page directly into python without getting this error?
As explained in the Parsing XML section of the ElementTree docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
You're passing the whole XML contents as a giant pathname. Your XML file is probably bigger than 2K, or whatever the maximum pathname size is for your platform, hence the error. If it weren't, you'd just get a different error about there being no directory named [everything up to the first / in your XML file].
Just use fromstring instead of parse.
Or, notice that parse can take a file object, not just a filename. And the thing returned by urlopen is a file object.
Also notice the very next line in that section:
fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree.
So, you don't want that root = tree.getroot() either.
So:
# ...
content.close()
root = ElementTree.fromstring(xmlData)
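Both options can be tried without network access in the sketch below. An io.BytesIO object stands in for the file-like response returned by urlopen, and the XML payload is a made-up miniature of the API response.

```python
import io
import xml.etree.ElementTree as ElementTree

# Hypothetical payload, standing in for what the URL would return.
payload = (b'<ann><anime id="16989">'
           b'<info type="Main title">Example</info>'
           b'</anime></ann>')

# Option 1: decode to a string and use fromstring(), which returns the root Element.
xmlData = payload.decode("utf-8")
root_from_string = ElementTree.fromstring(xmlData)

# Option 2: hand parse() a file object (here a BytesIO in place of the
# urlopen response); parse() returns an ElementTree, so getroot() is needed.
content = io.BytesIO(payload)
root_from_file = ElementTree.parse(content).getroot()

# The mistake from the question: parse() takes a string to be a file name.
try:
    ElementTree.parse(xmlData)
    parse_raised_oserror = False
except OSError:
    parse_raised_oserror = True

attribs = [info.attrib for info in root_from_string.iter("info")]
print(attribs, parse_raised_oserror)
```

With a real URL, option 2 would pass the object returned by urlopen directly, so the response never has to be decoded by hand.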
