How to parse XML with multiple root elements - Python

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following a couple of other questions here on Stack Overflow, but none helped.
I know why it is not getting parsed; people have usually resorted to hacks. IMO having multiple root elements in XML is a very common use case, and there must be something in the ET parsing library to handle it.

As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive or store data in this format, it is not proper XML. You could consider a hack of surrounding what you have with a fake tag, e.g.
import xml.etree.ElementTree as ET
with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()
# Wrap the content in a single fake root so it becomes a well-formed document
root = ET.fromstring("<fake>" + fileContent + "</fake>")
print(root)
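A minimal sketch of the same wrapping idea, reading each original top-level element back out of the fake root. The inline `fragment` string is a shortened stand-in for the contents of 0020-syslog_rules.xml, and the wrapper name `fake` is arbitrary. It assumes the file has no XML declaration or DOCTYPE, since neither may appear inside another element.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the file contents: two top-level elements, no single root.
fragment = """<var name="BAD_WORDS">core_dumped|failure|error</var>
<group name="syslog,errors,">
  <rule id="1001" level="2">
    <match>^Couldn't open /etc/securetty</match>
  </rule>
</group>"""

# Wrap the fragment in an arbitrary root so it becomes a well-formed document.
root = ET.fromstring("<fake>" + fragment + "</fake>")

# The original "roots" are now the direct children of the wrapper.
top_level = [(child.tag, child.attrib.get("name")) for child in root]
rule_ids = [rule.get("id") for rule in root.iter("rule")]
print(top_level)
print(rule_ids)
```

Iterating over `root` yields the `var` and `group` elements in document order, so the wrapper never needs to be visible to the rest of the code.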

Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.

Related

Python XML parsing removing empty CDATA nodes

I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom
filename = 'Test.xml'
dom = minidom.parse(filename)
with open(filename.replace('.xml', '_Generated.xml'), mode='w', encoding='utf8') as fh:
    fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Test_Generated.xml file is:
<?xml version="1.0" ?><body>
</body>
A simple solution is to first open the document and change all the empty CDATA nodes to include some value before parsing, then remove the value from the new file after generation, but this seems like unnecessary work and execution time, as some of these documents include tens of thousands of lines.
I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks. The callback that handles the data from the CDATA nodes is the character_data_handler_cdata method. The data that is supplied to this method is already missing after parsing.
Does anyone know what is going on here?
Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so %, &#37;, &#x25; and <![CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches significance to CDATA tags then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
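The equivalence described above can be checked with a small sketch. It uses xml.etree.ElementTree rather than minidom purely for brevity, and the element name `a` is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Four spellings of the same one-character content.
variants = [
    "<a>%</a>",
    "<a>&#37;</a>",
    "<a>&#x25;</a>",
    "<a><![CDATA[%]]></a>",
]
texts = [ET.fromstring(doc).text for doc in variants]
print(texts)

# An empty CDATA section contributes no characters to the element's text.
empty_text = ET.fromstring("<a><![CDATA[]]></a>").text
print(repr(empty_text))
```

All four variants come back as the same text, which is why a parser that reports only content has nothing left to report for an empty CDATA section.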

Python XML parsing is seeing unexpected tag

I'm trying to parse an XML file that includes the following:
<?xml version="1.0" encoding="UTF-8"?>
<ops:OPS xmlns:ops="http://www.si2.org/schemas/OPS"
opsVersion="OPS_1.1">
<ops:RulesSection>
[...some stuff..]
</ops:RulesSection>
</ops:OPS>
When I parse this file and check the elements, I expect to find an element called "ops:RulesSection":
from xml.etree import ElementTree as ET
from xml.etree.ElementTree import ElementTree, Element
tree = ET.parse(OPSDRMFile)
root = tree.getroot()
for elem in root:
    print("!!!", elem)
However I am getting following results
!!! <Element '{http://www.si2.org/schemas/OPS}RulesSection' at 0x2b557da50f98>
It seems that the "ops:" prefix has been replaced by its value from the beginning of the file. Can someone explain whether this is expected behavior? Is there a way to NOT expand the prefix?
(As of now, if I want to perform further parsing I need to transform "ops:NewTagToFind" into "{http://www.si2.org/schemas/OPS}NewTagToFind" when using the "findall" command.)
Thanks,
Chris
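ElementTree stores tag names in `{namespace-uri}localname` form, so the output shown is what the library is expected to produce. A minimal sketch, assuming a shortened stand-in for the file with the `[...some stuff..]` part left out: `findall` accepts a prefix-to-URI mapping as its second argument, so the prefixed name can still be used in the search. The dictionary name `ns` is arbitrary.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the OPS file.
xml_text = """<ops:OPS xmlns:ops="http://www.si2.org/schemas/OPS" opsVersion="OPS_1.1">
  <ops:RulesSection/>
</ops:OPS>"""

# Map the prefix used in the search path to the namespace URI.
ns = {"ops": "http://www.si2.org/schemas/OPS"}

root = ET.fromstring(xml_text)
sections = root.findall("ops:RulesSection", ns)
tags = [elem.tag for elem in sections]
print(tags)
```

The prefix in the mapping only has to match the one used in the search path; it does not have to be the prefix that appears in the document.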

Checking if XML declaration is present

I am trying to check whether an xml file contains the necessary xml declaration ("header"), let's say:
<?xml version="1.0" encoding="UTF-8"?>
...rest of xml file...
I am using xml ElementTree for reading and getting info out of the file, but it seems to load a file just fine even if it does not have the header.
What I tried so far is this:
import re
import sys
import xml.etree.ElementTree as ET
tree = ET.parse(someXmlFile)
try:
    xmlFile = ET.tostring(tree.getroot(), encoding='utf8').decode('utf8')
except Exception:
    sys.stderr.write("Wrong xml2 header\n")
    sys.exit(31)
if re.match(r"^\s*<\?xml version=\'1\.0\' encoding=\'utf8\'\?>\s+", xmlFile) is None:
    sys.stderr.write("Wrong xml1 header\n")
    sys.exit(31)
But the ET.tostring() function just "makes up" a header if it is not present in the file.
Is there any way to check for an XML header with ET? Or to somehow throw an error while loading the file with ET.parse if the file does not contain the XML header?
tl;dr
from xml.dom.minidom import parseString
def has_xml_declaration(xml):
    # version is None when the document has no XML declaration
    return parseString(xml).version
From Wikipedia's section on the XML declaration:
If an XML document lacks encoding specification, an XML parser assumes
that the encoding is UTF-8 or UTF-16, unless the encoding has already
been determined by a higher protocol.
...
The declaration may be optionally omitted because it declares as its
encoding the default encoding. However, if the document instead makes
use of XML 1.1 or another character encoding, a declaration is
necessary. Internet Explorer prior to version 7 enters quirks mode, if
it encounters an XML declaration in a document served as text/html
So even if the XML declaration is omitted in an XML document, the code-snippet:
if re.match(r"^<\?xml\s*version=\'1\.0\' encoding=\'utf8\'\s*\?>", xmlFile.decode('utf-8')) is None:
will find "the" default XML declaration in this XML document, because ET.tostring() called with encoding='utf8' writes a declaration itself. Note that I have used xmlFile.decode('utf-8') instead of xmlFile.
If you don't mind using minidom, you can use the following code snippet:
from xml.dom.minidom import parse
dom = parse('bookstore-003.xml')
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))
Here is a working fiddle
In bookstore-001.xml an XML declaration is present, in bookstore-002.xml no XML declaration is present, and in bookstore-003.xml a different XML declaration than in the first example is present. The print statement prints the version and the encoding accordingly:
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="None" encoding="None"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
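A minimal sketch of turning the minidom approach into a yes/no check. It relies on the behaviour shown in the output above, where `version` is None when no declaration was read. The two inline byte strings are made-up stand-ins for the bookstore files.

```python
from xml.dom.minidom import parseString

def has_xml_declaration(xml):
    """Return True if the parsed document carried an XML declaration."""
    # minidom leaves Document.version as None when no declaration was seen.
    return parseString(xml).version is not None

# Made-up stand-ins for files with and without a declaration.
with_decl = b'<?xml version="1.0" encoding="UTF-8"?><bookstore/>'
without_decl = b'<bookstore/>'

print(has_xml_declaration(with_decl), has_xml_declaration(without_decl))
```

Because the check goes through a real parse, input that is not well-formed raises an exception instead of returning False, which the caller may want to handle separately.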

how to parse xml without dtd validation and using lxml?

I've tried parsing the following XML, which has an invalid DTD/XML:
<city>
<address>
<zipcode>4455</zipcode>
</address>
I'm trying to parse it with lxml like this:
from lxml import etree as ET
parser = ET.XMLParser(dtd_validation=False)
tree = ET.fromstring(xml_data,parser)
print(tree.xpath('//zipcode'))
Unfortunately, this code still gives XML errors.
Any idea how I can get a non-validating parse of the above XML?
Assuming that by 'invalid dtd' you meant that the <city> tag is not closed in the above XML sample, then your document is not well-formed XML; frankly, it isn't XML at all because it doesn't follow XML rules. That is a different problem from DTD validation, which is why dtd_validation=False does not help.
You need to fix the document somehow to be able to treat it as an XML document. For this simple unclosed tag case, setting recover=True will do the job:
from lxml import etree as ET
parser = ET.XMLParser(recover=True)
tree = ET.fromstring(xml_data,parser)
print(tree.xpath('//zipcode'))

Parse Solr output in python

I am trying to parse a solr output of the form:
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
I am keen on using Beautiful Soup (versions that have BeautifulStoneSoup; I think prior to BS4) for parsing the docs.
I have used Beautiful Soup for HTML parsing, but somehow I am not able to find an efficient way to extract the contents of the tags.
I have written:
for tags in soup('doc'):
    print tags.renderContents()
I sense that I could force my way through to get the output (say, by 'soup'ing it again), but I would appreciate an efficient solution to extract the data.
My required output is:
source:A
URL:A
2012-09-08T10:02:01Z
source:B
URL:B
2012-08-08T11:02:01Z
Thanks
Use an XML parser for this task instead; xml.etree.ElementTree is included with Python:
from urllib.request import urlopen  # urllib2.urlopen on Python 2
from xml.etree import ElementTree as ET
# `ET.fromstring()` expects a string containing XML to parse.
# tree = ET.fromstring(solrdata)
# Use `ET.parse()` for a filename or open file object, such as the one returned by urlopen():
tree = ET.parse(urlopen(url))
for doc in tree.findall('.//doc'):
    for elem in doc:
        print(elem.attrib['name'], elem.text)
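A self-contained sketch of the same loop run on the sample from the question, collecting only the element text to match the requested output. The enclosing `<result>` element is an assumption added here so that the two `<doc>` elements form a single well-formed document; the variable name `solrdata` follows the commented-out line in the answer.

```python
from xml.etree import ElementTree as ET

# Sample from the question, wrapped in an assumed <result> element.
solrdata = """<result>
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
</result>"""

root = ET.fromstring(solrdata)

# Collect the text of every child of every <doc>, in document order.
values = [elem.text for doc in root.findall('.//doc') for elem in doc]
print("\n".join(values))
```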
Do you have to use this particular output format? Solr supports a Python output format out of the box (in version 4 at least); just use wt=python in the query.
