How to parse XML without DTD validation using lxml? - Python

I have the following XML, which is invalid (the <city> tag is never closed):
<city>
<address>
<zipcode>4455</zipcode>
</address>
I'm trying to parse it with lxml like this:
from lxml import etree as ET
parser = ET.XMLParser(dtd_validation=False)
tree = ET.fromstring(xml_data, parser)
print(tree.xpath('//zipcode'))
Unfortunately, this code still raises XML errors.
Any idea how I can get a non-validating parse of the above XML?

Assuming that by 'invalid dtd' you mean that the <city> tag is not closed in the sample above: your document is then not merely invalid XML, strictly speaking it isn't XML at all, because it doesn't follow the XML well-formedness rules.
You need to fix the document somehow to be able to treat it as an XML document. For a simple unclosed-tag case like this, setting recover=True will do the job:
from lxml import etree as ET
parser = ET.XMLParser(recover=True)
tree = ET.fromstring(xml_data, parser)
print(tree.xpath('//zipcode'))
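With recover=True, libxml2 repairs what it can as it parses (here, it closes the dangling <city> element), so the xpath call should return the zipcode element instead of raising a parse error.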

Related

Python XML parsing removing empty CDATA nodes

I'm using minidom from xml.dom to parse an XML document. I make some changes to it and then re-export it back to a new XML file. The file is generated by another program as an export, and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom
filename = 'Test.xml'
dom = minidom.parse(filename)
with open(filename.replace('.xml', '_Generated.xml'), mode='w', encoding='utf8') as fh:
    fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Test_Generated.xml file contains:
<?xml version="1.0" ?><body>
</body>
A simple solution is to first open the document and change all the empty CDATA nodes to include some placeholder value before parsing, then strip that value from the new file after generation. But this seems like unnecessary work and execution time, as some of these documents run to tens of thousands of lines.
I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks. The callback that handles data from CDATA nodes is the character_data_handler_cdata method. The data supplied to this method is already missing after parsing.
Anyone know what is going on with this?
Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA sections serve no purpose other than to delimit text that hasn't been escaped: so %, &#37;, &#x25;, and <![CDATA[%]]> are all different ways of writing the same content, and whichever of them you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches significance to CDATA sections then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA sections to the application, so you may be able to find a way around this, but it's still not a good design choice.
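For example, lxml's parser can be told to keep CDATA sections in the tree via its strip_cdata option. A minimal sketch (whether an empty CDATA section survives the round trip is worth verifying for your lxml version):
from lxml import etree

# Keep CDATA sections instead of merging them into plain text.
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse('Test.xml', parser)

# Serialize; text that was read as CDATA is written back out as CDATA.
tree.write('Test_Generated.xml', encoding='utf8', xml_declaration=True)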

Python XML parsing is seeing unexpected tag

I'm trying to parse an XML file which includes the following snippet:
<?xml version="1.0" encoding="UTF-8"?>
<ops:OPS xmlns:ops="http://www.si2.org/schemas/OPS"
opsVersion="OPS_1.1">
<ops:RulesSection>
[...some stuff..]
</ops:RulesSection>
</ops:OPS>
When I parse this file and check the elements, I expect to find an element called "ops:RulesSection":
from xml.etree import ElementTree as ET

tree = ET.parse(OPSDRMFile)
root = tree.getroot()
for elem in root:
    print("!!!", elem)
However, I am getting the following result:
!!! <Element '{http://www.si2.org/schemas/OPS}RulesSection' at 0x2b557da50f98>
It seems that the "ops:" prefix has been replaced by the namespace URI it is bound to at the beginning of the file. Can someone explain if this is expected behavior? Is there a way to NOT expand the prefix?
(As of now, if I want to perform further parsing I need to transform "ops:NewTagToFind" into "{http://www.si2.org/schemas/OPS}NewTagToFind" when using the "findall" command.)
Thanks,
Chris
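For what it's worth, this is expected: ElementTree always stores tags in {namespace-URI}localname form. You don't have to spell the URI out in every query, though; findall() accepts a prefix-to-URI mapping. A minimal sketch (the "ops" key in the dict is arbitrary; any prefix works as long as it maps to the right URI):
from xml.etree import ElementTree as ET

ns = {'ops': 'http://www.si2.org/schemas/OPS'}
tree = ET.parse(OPSDRMFile)
root = tree.getroot()

# The 'ops:' prefix is resolved through the ns mapping, so this matches
# elements whose expanded tag is '{http://www.si2.org/schemas/OPS}RulesSection'.
for section in root.findall('ops:RulesSection', ns):
    print(section.tag)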

how to parse xml with multiple root elements

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following a couple of other questions on Stack Overflow, but none helped.
I know the reason it isn't getting parsed; people have usually resorted to hacks. IMO it's a very common use case to have multiple root elements in XML, and there must be something in the ET parsing library to get this done.
As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive/store data in this format (then it's not proper XML), you could consider the hack of surrounding what you have with a fake tag, e.g.:
import xml.etree.ElementTree as ET

with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()

root = ET.fromstring("<fake>" + fileContent + "</fake>")
print(root)
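Note that this trick assumes the file does not begin with an XML declaration (<?xml ... ?>); if it does, you would need to strip that line first, since a declaration is only allowed at the very start of a document and <fake> would otherwise come before it.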
Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.
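A sketch of that approach with lxml, assuming the wrapper document is saved as wrapper.xml next to fragment.xml (recent lxml/libxml2 versions restrict external entity loading by default for security reasons, so check how resolve_entities behaves in your version):
from lxml import etree

# Parse the wrapper; the &e; entity pulls in fragment.xml, giving a
# single-rooted document with <wrapper> around the original content.
parser = etree.XMLParser(resolve_entities=True)
tree = etree.parse('wrapper.xml', parser)
print(etree.tostring(tree, pretty_print=True))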

Parse Solr output in python

I am trying to parse a solr output of the form:
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
I am keen on using Beautiful Soup (a version that has BeautifulStoneSoup; I think prior to BS4) for parsing the docs.
I have used Beautiful Soup for HTML parsing, but somehow I am not able to find an efficient way to extract the contents of the tags.
I have written:
for tags in soup('doc'):
    print tags.renderContents()
I do sense that I could work my way through it forcibly to get the output (say, by 'soup'ing it again), but would appreciate an efficient way to extract the data.
The output I require is:
source:A
URL:A
2012-09-08T10:02:01Z
source:B
URL:B
2012-08-08T11:02:01Z
Thanks
Use an XML parser for the task instead; xml.etree.ElementTree is included with Python:
from xml.etree import ElementTree as ET
import urllib2

# For a string that already contains the XML, use ET.fromstring():
# tree = ET.fromstring(solrdata)
# For a filename or an open file object, such as the one returned by
# urllib2, use ET.parse():
tree = ET.parse(urllib2.urlopen(url))

for doc in tree.findall('.//doc'):
    for elem in doc:
        print elem.attrib['name'], elem.text
Do you have to use this particular output format? Solr supports a Python output format out of the box (in version 4 at least); just use wt=python in the query.
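A minimal sketch of that route (the URL is hypothetical; wt=python makes Solr return a Python dict literal that ast.literal_eval can parse):
import ast
import urllib2

# Hypothetical query URL; adjust host, core and query to your setup.
url = 'http://localhost:8983/solr/select?q=*:*&wt=python'
response = urllib2.urlopen(url).read()

# The wt=python response is a Python literal, so no XML parsing is needed.
data = ast.literal_eval(response)
for doc in data['response']['docs']:
    print(doc)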

Python xml.dom.minidom.parse() function ignores DTDs

I have the following Python code:
import xml.dom.minidom
import xml.parsers.expat
try:
    domTree = xml.dom.minidom.parse(myXMLFileName)
except xml.parsers.expat.ExpatError, e:
    return e.args[0]
which I am using to parse an XML file. Although it quite happily spots simple XML errors like mismatched tags, it completely ignores the DTD specified at the top of the XML file:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE ServerConfig SYSTEM "ServerConfig.dtd">
so it doesn't notice when mandatory elements are missing, for example. How can I switch on DTD checking?
See this question - the accepted answer is to use lxml validation.
Just by way of explanation: Python xml.dom.minidom and xml.sax use the expat parser by default, which is a non-validating parser. It may read the DTD in order to do entity replacement, but it won't validate against the DTD.
gimel and Tim recommend lxml, which is a nicely pythonic binding for the libxml2 and libxslt libraries. It supports validation against a DTD. I've been using lxml, and I like it a lot.
Just for the record, this is what my code looks like now:
from lxml import etree
try:
    parser = etree.XMLParser(dtd_validation=True)
    domTree = etree.parse(myXMLFileName, parser=parser)
except etree.XMLSyntaxError, e:
    return e.args[0]
I recommend lxml over xmlproc because the PyXML package (containing xmlproc) is not being developed any more; the latest Python version that PyXML can be used with is 2.4.
I believe you need to switch from expat to xmlproc.
See:
http://code.activestate.com/recipes/220472/
