python libxml2 reader and XML_PARSE_RECOVER

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.
However, using the option with the reader API (which is essential due to the size of the documents we are parsing) does not work: it just gets stuck in a perpetual loop, with reader.Read() returning -1.
Sample code (with a small example):
import cStringIO
import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None,
                              libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)
ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()
Any ideas how to recover correctly?

I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. Parsing this document and recovering from the stray & is nice and clean in lxml:
from cStringIO import StringIO
from lxml import etree
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())
The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
Edit:
If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:
DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)
for data in StringIO(DOC).read():
    reader.feed(data)
tree = reader.close()
print etree.tostring(tree)

Isn't the xml broken in some consistent way? Isn't there some pattern you could follow to repair your xml before parsing?
For example - if the error is caused only by unescaped ampersands and you don't use CDATA or processing instructions, it can be repaired with a regexp.
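For example, a quick sketch of that repair (the helper name is made up; it assumes unescaped ampersands are the only problem and that there are no CDATA sections or processing instructions):
import re

# Hypothetical helper: escape ampersands that don't already start an entity
# reference such as &amp; or &#160;.
def escape_bare_ampersands(text):
    return re.sub(r'&(?!(?:\w+|#\d+|#[xX][0-9a-fA-F]+);)', '&amp;', text)

print escape_bare_ampersands("<a>some broken & xml</a>")
# -> <a>some broken &amp; xml</a>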
EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it can be useful in your case. (BeautifulSoup itself offers only the tree representation, not the events.)

Consider using xml.sax. When I'm presented with really malformed XML that can have a plethora of different problems, I try dividing the problem into small pieces.
You mentioned that you have a very large XML file; it probably has many records that you process serially. Each record (e.g. <item>...</item>) presumably has a start and end tag - these will be your recovery points.
In xml.sax you provide the reader, the handler, and the input source. At worst a single record will be unrecoverable with this technique. It's a little more setup, but incrementally parsing a malformed feed a record at a time while logging the bad records is probably the best you can do.
In the logs make sure to give yourself enough information to rebuild the original record so you can add additional recovery code for all the cases that you'll no doubt have to handle (e.g. create a badrecords_<today's date>.xml so you can reprocess manually).
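A minimal sketch of that setup, assuming <item> records and that printing each record's text is enough (the handler name, tag, and filename are placeholders):
import xml.sax

# Sketch only: collects the text of each <item> record; adapt to your real records.
class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_record = False
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_record = True
            self.buffer = []

    def characters(self, content):
        if self.in_record:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.in_record = False
            print "record:", "".join(self.buffer).encode("utf-8")

parser = xml.sax.make_parser()
parser.setContentHandler(RecordHandler())
try:
    parser.parse("big_feed.xml")  # placeholder filename
except xml.sax.SAXParseException, e:
    # Log enough information to find and reprocess the bad record later.
    print "parse error at line %d: %s" % (e.getLineNumber(), e.getMessage())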
Good luck.

Or, you could use BeautifulSoup. It does a nice job recovering broken XML.
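For example (a minimal sketch with BeautifulSoup 3; BeautifulStoneSoup is its XML flavor):
from BeautifulSoup import BeautifulStoneSoup

# Sketch only: BeautifulStoneSoup tolerates the stray ampersand.
soup = BeautifulStoneSoup("<a>some broken & xml</a>")
print soup.a.string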

Related

Obtain DTD information from XML using Python

I'm trying to extract the DTD information from an XML document using Python and preferably the standard library. At first glance, it seems xml.sax.handler.DTDHandler is the way to go, so I wrote the following example code to extract the DTD of a trivial DocBook v4 document:
import xml.sax
from contextlib import closing

XML_CODE = '''<!DOCTYPE example PUBLIC
"-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<example><title>Hello World in Python</title>
<programlisting>
print('Hello World!')
</programlisting>
</example>'''


class DTDPrinter(xml.sax.handler.DTDHandler):
    def notationDecl(self, name, publicId, systemId):
        print('name={}, publicId={}'.format(name, publicId))


if __name__ == '__main__':
    with closing(xml.sax.make_parser()) as parser:
        parser.setFeature(xml.sax.handler.feature_external_pes, False)
        parser.setFeature(xml.sax.handler.feature_validation, False)
        parser.setDTDHandler(DTDPrinter())
        print('---------- before feed')
        parser.feed(XML_CODE)
        print('---------- after feed')
My expectation was that when running this code with Python 3.5 the output would be something like:
---------- before feed
name=example, publicId=-//OASIS//DTD DocBook XML V4.1.2//EN
---------- after feed
Instead I get output with notation declarations seemingly related to various image formats, but not the DTD specified in the document:
---------- before feed
name=BMP, publicId=+//ISBN 0-7923-9432-1::Graphic Notation//NOTATION Microsoft Windows bitmap//EN
name=CGM-CHAR, publicId=ISO 8632/2//NOTATION Character encoding//EN
name=CGM-BINARY, publicId=ISO 8632/3//NOTATION Binary encoding//EN
...
name=WMF, publicId=+//ISBN 0-7923-9432-1::Graphic Notation//NOTATION Microsoft Windows Metafile//EN
name=WPG, publicId=None
name=linespecific, publicId=None
---------- after feed
Maybe the last entry with the name linespecific refers to the document DTD in some crippled way?
I also noticed a couple of seconds of delay after the last output despite the document being trivial. Maybe the parser attempts to connect to the internet? I tried to disable this by setting the features
parser.setFeature(xml.sax.handler.feature_external_pes, False)
parser.setFeature(xml.sax.handler.feature_validation, False)
but to no avail.
How can I convince the DTDHandler to react to the DTD occurring in the document and not connect to the internet?
As I didn't receive an answer and couldn't get this to work properly in the past 2 weeks I resorted to violence and simply used a regular expression to extract the information. This is ugly and won't be able to properly process all valid ways to express a DTD but is good enough for my purpose.
import re

DTD_REGEX = re.compile(
    r'<!DOCTYPE\s+(?P<name>[a-zA-Z][a-zA-Z-]*)\s+PUBLIC\s+"(?P<public_id>.+)"')

dtd_match = DTD_REGEX.match(XML_CODE)
if dtd_match is not None:
    public_id = dtd_match.group('public_id')
    print(public_id)

Python lxml.etree - Is it more effective to parse XML from string or directly from link?

With the lxml.etree python framework, is it more efficient to parse xml directly from a link to an online xml file, or is it better to, say, use a different library (such as urllib2) to return a string and then parse from that? Or does it make no difference at all?
Method 1 - Parse directly from link
from lxml import etree as ET
parsed = ET.parse(url_link)
Method 2 - Parse from string
from lxml import etree as ET
import urllib2
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)
Or is there a more efficient method than either of these, e.g. saving the xml to a .xml file on the desktop and then parsing from that?
I ran the two methods with a simple timing wrapper.
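The timing wrapper itself isn't shown; a minimal sketch of what such a decorator might look like (the name timing and the per-call printout are assumptions - the real one presumably averaged the 100 runs):
import time

# Hypothetical sketch of the timing wrapper referenced above; not the original code.
def timing(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print '%s took %.4f ms' % (func.__name__, (time.time() - start) * 1000)
        return result
    return wrapper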
Method 1 - Parse XML Directly From Link
from lxml import etree as ET

#timing
def parseXMLFromLink():
    parsed = ET.parse(url_link)
    print parsed.getroot()

for n in range(0, 100):
    parseXMLFromLink()
Average of 100 = 98.4035 ms
Method 2 - Parse XML From String Returned By Urllib2
from lxml import etree as ET
import urllib2

#timing
def parseXMLFromString():
    xml_string = urllib2.urlopen(url_link).read()
    parsed = ET.fromstring(xml_string)
    print parsed

for n in range(0, 100):
    parseXMLFromString()
Average of 100 = 286.9630 ms
So anecdotally it seems that using lxml to parse directly from the link is the quicker method. It's not clear whether it would be faster to download and then parse large xml documents from the hard drive, but presumably, unless the document is huge and the parsing task more intensive, the parseXMLFromLink() function would still remain quicker, as it is urllib2 that seems to slow the second function down.
I ran this a few times and the results stayed the same.
If by 'effective' you mean 'efficient', I'm relatively certain you will see no difference between the two at all (unless ET.parse(link) is horribly implemented).
The reason is that the network time is going to be the most significant part of parsing an online XML file, a lot longer than storing the file to disk or keeping it in memory, and a lot longer than actually parsing it.

How to write a python script for downloading?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page where the file is hosted, and clicking that downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
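A rough sketch of that approach (the url_regex and what you pull out of each page are assumptions):
import mechanize
from BeautifulSoup import BeautifulSoup

# Sketch only: follow each candidate link, parse the hosting page, find the file.
br = mechanize.Browser()
br.open("http://www.emuparadise.me/soundtracks/highquality/index.php")
for link in br.links(url_regex="soundtrack"):  # guessed pattern for the links you want
    response = br.follow_link(link)
    soup = BeautifulSoup(response.read())
    # ...locate the actual file URL in the soup, then download it as shown below...
    br.back()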
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
from urllib2 import urlopen

f = open(localFilePath, 'wb')  # 'wb' so binary files aren't mangled
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps
Make a URL request for the page. Once you have the source, filter it to extract the URLs.
The files you want to download are URLs that contain a specific extension, so you can do a regular expression search for all URLs that match your criteria.
After filtering, do a URL request for each matched URL's data and write it to disk.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib

# Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()

# Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE | re.MULTILINE)

# Let's get all the urls: match criteria {no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE | re.MULTILINE)

# We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)

# The rest of the unmatched urls, for which the scraping can also be repeated
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)


def fileDl(targetUrl):
    # Function to handle downloading of files.
    # Arg: url => a String
    # Output: Boolean to signify whether the file has been written to disk
    # Validation of the url is assumed, for the sake of keeping the illustration short
    urlAddInfo = urllib.urlopen(targetUrl)
    data = urlAddInfo.read()
    fileNameSearch = re.search("([^\/\s]+)$", targetUrl)  # Text right after the last slash '/'
    if not fileNameSearch:
        sys.stderr.write("Could not extract a filename from url '%s'\n" % (targetUrl))
        return False
    fileName = fileNameSearch.groups(1)[0]
    with open(fileName, "wb") as f:
        f.write(data)
        sys.stderr.write("Wrote %s to disk\n" % (fileName))
    return True


# Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n" % (len(successfulDls), sampleUrl))

# You can organize the above code into a function to repeat the process for each of the
# other urls, and in that way you can make a crawler.
The above code is written mainly for Python2.X. However, I wrote a crawler that works on any version starting from 2.X
Why yes! 5 years later and, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code examples here, because I mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the gold standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio is included in python >= 3.5 only, and it's also extra confusing. That said, if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree API is what the pip-installable lxml package is based on and makes easier to use. If you want to reinvent the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As I said before, lxml offers the same ElementTree API but makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it allows you to download them in parallel.**
* - (where n was decided on from python-version to python-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file
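Roughly, as a sketch (calling wget via subprocess and listing candidate links; the output filename is arbitrary):
import subprocess
from BeautifulSoup import BeautifulSoup

# Sketch only: fetch the index page with wget, then list the links it contains.
subprocess.check_call(["wget", "-O", "index.html",
                       "http://www.emuparadise.me/soundtracks/highquality/index.php"])
soup = BeautifulSoup(open("index.html").read())
for a in soup.findAll("a"):
    href = a.get("href")
    if href:
        print href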

Basic Python file searching and I/O

I'm trying to complete a simple task in Python and I'm new to the language (I'm C++). I hope someone might be able to point me in the right direction.
Problem:
I have an XML file (12mb) full of data and within the file there are start tags 'xmltag' and end tags '/xmltag' that represent the start and end of the data sections I would like to pull out.
I would like to navigate through this open file with a loop and for each instance locate a start tag and copy the data within the section to a new file until the end tag. I would then like to repeat this to the end of the file.
I'm happy with the file I/O but not the most efficient looping, searching and extracting of the data.
I really like the look of the language and hopefully I'm going to get more involved so I can give back to the community.
Big thanks!
Check BeautifulSoup
from BeautifulSoup import BeautifulSoup

with open('bigfile.xml', 'r') as xml:
    soup = BeautifulSoup(xml)
    for xmltag in soup('xmltag'):
        print xmltag.contents
Dive Into Python 3 has a great chapter about this:
http://diveintopython3.org/xml.html#xml-parse
It's a great free book about Python, worth reading!
The BeautifulSoup answer is good but this executes faster and doesn't require an external library:
import xml.etree.cElementTree as ET
tree = ET.parse('xmlfile.xml')
results = (elem for elem in tree.getiterator('xmltag'))
# in Python 2.7+, getiterator() is deprecated; use tree.iter('xmltag')
No need to install BeautifulSoup, Python includes the ElementTree parser in its standard library.
from xml.etree import cElementTree as ET

tree = ET.parse('myfilename')
new_tree = ET.Element('new_root_element')
for element in tree.findall('.//xmltag'):
    new_tree.append(element)
print ET.tostring(new_tree)
xml = open("xmlfile").read()
x = xml.split("</xmltag>")
for block in x:
    if "<xmltag>" in block:
        print block.split("<xmltag>")[-1]

Showing progress of python's XML parser when loading a huge file

I'm using Python's built-in XML parser to load a 1.5 gig XML file and it takes all day.
from xml.dom import minidom
xmldoc = minidom.parse('events.xml')
I need to know how to get inside that and measure its progress so I can show a progress bar.
any ideas?
minidom has another method called parseString() that returns a DOM tree, assuming the string you pass it is valid XML. If I were to split up the file myself into chunks and pass them to parseString one at a time, could I possibly merge all the DOM trees back together at the end?
Your use case requires a SAX parser instead of DOM: DOM loads everything into memory, whereas SAX parses incrementally and you write handlers for the events you need.
So it could be effective, and you would be able to write a progress indicator as well.
I also recommend trying the expat parser sometime; it is very useful:
http://docs.python.org/library/pyexpat.html
For progress using SAX: since SAX reads the file incrementally, you can wrap the file object you pass in with your own and keep track of how much has been read.
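A minimal sketch of that idea (the wrapper class, the callback, and the empty ContentHandler are made up for illustration):
import os
import xml.sax

# Sketch only: a file-like wrapper that reports how far the parser has read.
class ProgressFile(object):
    def __init__(self, path, callback):
        self._file = open(path, 'rb')
        self._size = os.path.getsize(path)
        self._callback = callback

    def read(self, size=-1):
        data = self._file.read(size)
        self._callback(self._file.tell() * 100.0 / self._size)
        return data

    def close(self):
        self._file.close()

def report(percent):
    print 'parsed %.1f%%' % percent

parser = xml.sax.make_parser()
parser.setContentHandler(xml.sax.ContentHandler())  # replace with your real handler
parser.parse(ProgressFile('events.xml', report))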
Edit:
I also don't like the idea of splitting the file yourself and joining the DOM at the end; at that point you might as well write your own XML parser. I recommend using a SAX parser instead.
I also wonder what your purpose is in reading a 1.5 gig file into a DOM tree? It looks like SAX would be better here.
Did you consider using other means of parsing XML? Building a tree of such big XML files will always be slow and memory intensive. If you don't need the whole tree in memory, stream-based parsing will be much faster. It can be a little daunting if you're used to tree-based XML manipulation, but it will pay off in the form of a huge speed increase (minutes instead of hours).
http://docs.python.org/library/xml.sax.html
I have something very similar for PyGTK, not PyQt, using the pulldom api. It gets called a little bit at a time using Gtk idle events (so the GUI doesn't lock up) and Python generators (to save the parsing state).
import os
import stat
import xml.dom.pulldom

def idle_handler(fn):
    fh = open(fn)  # file handle
    doc = xml.dom.pulldom.parse(fh)
    fsize = os.stat(fn)[stat.ST_SIZE]
    position = 0
    for event, node in doc:
        if position != fh.tell():
            position = fh.tell()
            # update status: position * 100 / fsize
        if event == ....
            yield True  # idle handler stays until False is returned
    yield False

def main():
    add_idle_handler(idle_handler, filename)
Merging the tree at the end would be pretty easy. You could just create a new DOM, and basically append the individual trees to it one by one. This would give you pretty finely tuned control over the progress of the parsing too. You could even parallelize it if you wanted by spawning different processes to parse each section. You just have to make sure you split it intelligently (not splitting in the middle of a tag, etc.).
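A rough sketch of that merging idea (the chunk strings are stand-ins; real chunks would have to be split on record boundaries so each piece is well-formed):
from xml.dom import minidom

# Sketch only: parse each well-formed chunk separately, then graft its root
# under a new combined document.
merged = minidom.parseString('<events/>')
root = merged.documentElement
for chunk in ('<record>one</record>', '<record>two</record>'):
    piece = minidom.parseString(chunk)
    root.appendChild(merged.importNode(piece.documentElement, True))
print merged.toxml()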
