Need to read XML files as a stream using BeautifulSoup in Python

Need to read XML files as a stream using BeautifulSoup in Python - python

I have a dilemma.
I need to read very large XML files from all kinds of sources, so the files are often invalid XML or malformed XML. I still must be able to read the files and extract some info from them. I do need to get tag information, so I need XML parser.
Is it possible to use Beautiful Soup to read the data as a stream instead of the whole file into memory?
I tried to use ElementTree, but I cannot because it chokes on any malformed XML.
If Python is not the best language to use for this project please add your recommendations.

Beautiful Soup has no streaming API that I know of. You have, however, alternatives.
The classic approach for parsing large XML streams is using an event-oriented parser, namely SAX. In python, xml.sax.xmlreader. It will not choke with malformed XML. You can avoid erroneous portions of the file and extract information from the rest.
SAX, however, is low-level and a bit rough around the edges. In the context of python, it feels terrible.
The xml.etree.cElementTree implementation, on the other hand, has a much nicer interface, is pretty fast, and can handle streaming through the iterparse() method.
ElementTree is superior, if you can find a way to manage the errors.

Related

Is there an existing library to parse Wikpedia tables from dump?

I need to extract the data from tables in a wiki dump in a somewhat convenient form, e.g. a list of lists. However, due to the format of the dump it looks sort of tricky. I am aware of the WikiExtractor, which is useful for getting clean text from a dump, but it drops tables altogether. Is there a parser that would get me conveniently readable tables in a same way?

I did not manage to find a good way to parse Wikipedia tables from XML dumps. However, there seem to be some ways to do so using HTML parsers, e.g. wikitables parser. This would require a lot of scraping unless you need to analyze only the tables from specific pages. However, it seems possible to do it offline as it seems HTML Wiki dumps are about to resume (dumps, phabricator task)

How to ignore mismatched tags while parsing xml in Python

I want to parse an XML file with Python. I don't need the hierarchical tag structure -- all I want is a simple SAX or Expat-based parser. However, they both fail with mismatched tag-related error messages when the XML file is not well formed.
Is there a way to tell the parser to ignore these errors? I tried to
parser.setFeature(sax.handler.feature_validation, False)
, but that didn't help either.
Is there a solution? Either SAX/Expat would do.

You should give Beautiful Soup a try. Its main purpose is to parse HTML even in the presence of malformations. You may find it parses your invalid XML without much trouble.

Would you also use lxml? It has a function called iterparse, which is event-driven parsing in a (according to the documentation) "SAX-like fashion", and has a parameter to force parsing broken input. It's fairly easy to use, also.
lxml iterparse tutorial
lxml iterparse class definition

Efficiently parsing broken XML/HTML in python

I would like to be able to efficiently parse large HTML documents in Python. I am aware of Liza Daly's fastiter and the similar concept in the Python's own cElementTree. However, neither of these handle broken XML, which HTML reads as, well. In addition, the document may contain other broken XML.
Similarly, I'm aware of answers like this, which suggest not using any form of iterparse at all, and this is, in fact, what I'm using. However, I am trying to optimize past the biggest bottleneck in my program, which is the parsing of the documents.
Furthermore, I've done a little bit of experimentation using the SAX-style target handler for lxml parsers- I'm not sure what's going on, but it outright causes Python to stop working! Not merely throwing an exception, but a "python.exe has stopped working" message popup. I've no idea what's happening here, but I'm not even sure if this method is actually better than the standard parser, because I've seen very little about it on the Internet.
As such, my question is: Is there anything similar to iterparse, allowing me to quickly and efficiently parse through a document, that doesn't throw a snit fit when the document isn't well formed XML (IE. has recovery from poorly formed XML)?

I would use this one.
https://github.com/iogf/ehp
It is faster than lxml and it handles broken html like.
from ehp import *
doc = '''<html>
<body>
<p> cool </html></body>'''
html = Html()
dom = html.feed(doc)
print dom
It builds an AST according to the most possible HTML structure possible.
Then you can work on the AST.

Map xml file with extracted information

I am trying to map one xml file to another, based on a configuration file (that too can be an xml file).
Input
<ia>
<ib>...</ib>
<ic>...</ic>
</ia>
Output
<oa>
<ob>...</ob>
<oc>...</oc>
</oa>
Config
<config>
<conf>
<input>ia</input>
<output>oa</output>
</conf>
<conf>
<input>ib</input>
<output>ob</output>
</conf>
.....
</config>
So, the intention is to parse an xml file and retrieve the information interesting to me, and write into another xml file where the mapping information is specified in the config file.
Due to the scripting nature (and extending with plugins lateron), and the support for xml processing I was considering python. I just learned the syntax and basics of language, and came to know about lxml
One way of doing this
parse the config file (where , tag can have xpath to the node I am interested in)
read the input file
write into output, using etbuilder based on the config file
Being new to python, and not seeing xpath support for etbuilder I wonder is this the best approach. Also not sure all the exceptional cases. Is there an easier way, or native support in any other libraries. If possible, I do not want to spend too much time on this task as I could focus on the core task.
thanks ins advance.

If you wish to transform an XML file into another XML file then XSLT was made for this purpose. You have to define a .xslt file that describes the transformation of XML content and what the eventual output should look like. That's one way to do it.
You can also read the XML file using lxml and generate the output XML with lxml.etree.ElementTree. I'm not familiar with etbuilder but I don't think generating the desired output is that difficult. Once you have parsed the input files, you can build the config XML and write it to a file.
XPath is primarily for reading XML content, you don't need it for constructing XML files. In fact, if you use a proper XML parser then you don't need XPath either to read the file contents, although XPath could make life a bit easier.

XML Processing in Python

I am about to build a piece of a project that will need to construct and post an XML document to a web service and I'd like to do it in Python, as a means to expand my skills in it.
Unfortunately, whilst I know the XML model fairly well in .NET, I'm uncertain what the pros and cons are of the XML models in Python.
Anyone have experience doing XML processing in Python? Where would you suggest I start? The XML files I'll be building will be fairly simple.

Personally, I've played with several of the built-in options on an XML-heavy project and have settled on pulldom as the best choice for less complex documents.
Especially for small simple stuff, I like the event-driven theory of parsing rather than setting up a whole slew of callbacks for a relatively simple structure. Here is a good quick discussion of how to use the API.
What I like: you can handle the parsing in a for loop rather than using callbacks. You also delay full parsing (the "pull" part) and only get additional detail when you call expandNode(). This satisfies my general requirement for "responsible" efficiency without sacrificing ease of use and simplicity.

ElementTree has a nice pythony API. I think it's even shipped as part of python 2.5
It's in pure python and as I say, pretty nice, but if you wind up needing more performance, then lxml exposes the same API and uses libxml2 under the hood. You can theoretically just swap it in when you discover you need it.

There are 3 major ways of dealing with XML, in general: dom, sax, and xpath. The dom model is good if you can afford to load your entire xml file into memory at once, and you don't mind dealing with data structures, and you are looking at much/most of the model. The sax model is great if you only care about a few tags, and/or you are dealing with big files and can process them sequentially. The xpath model is a little bit of each -- you can pick and choose paths to the data elements you need, but it requires more libraries to use.
If you want straightforward and packaged with Python, minidom is your answer, but it's pretty lame, and the documentation is "here's docs on dom, go figure it out". It's really annoying.
Personally, I like cElementTree, which is a faster (c-based) implementation of ElementTree, which is a dom-like model.
I've used sax systems, and in many ways they're more "pythonic" in their feel, but I usually end up creating state-based systems to handle them, and that way lies madness (and bugs).
I say go with minidom if you like research, or ElementTree if you want good code that works well.

I've used ElementTree for several projects and recommend it.
It's pythonic, comes 'in the box' with Python 2.5, including the c version cElementTree (xml.etree.cElementTree) which is 20 times faster than the pure Python version, and is very easy to use.
lxml has some perfomance advantages, but they are uneven and you should check the benchmarks first for your use case.
As I understand it, ElementTree code can easily be ported to lxml.

It depends a bit on how complicated the document needs to be.
I've used minidom a lot for writing XML, but that's usually been just reading documents, making some simple transformations, and writing them back out. That worked well enough until I needed the ability to order element attributes (to satisfy an ancient application that doesn't parse XML properly). At that point I gave up and wrote the XML myself.
If you're only working on simple documents, then doing it yourself can be quicker and simpler than learning a framework. If you can conceivably write the XML by hand, then you can probably code it by hand as well (just remember to properly escape special characters, and use str.encode(codec, errors="xmlcharrefreplace")). Apart from these snafus, XML is regular enough that you don't need a special library to write it. If the document is too complicated to write by hand, then you should probably look into one of the frameworks already mentioned. At no point should you need to write a general XML writer.

You can also try untangle to parse simple XML documents.

Since you mentioned that you'll be building "fairly simple" XML, the minidom module (part of the Python Standard Library) will likely suit your needs. If you have any experience with the DOM representation of XML, you should find the API quite straight forward.

I write a SOAP server that receives XML requests and creates XML responses. (Unfortunately, it's not my project, so it's closed source, but that's another problem).
It turned out for me that creating (SOAP) XML documents is fairly simple if you have a data structure that "fits" the schema.
I keep the envelope since the response envelope is (almost) the same as the request envelope. Then, since my data structure is a (possibly nested) dictionary, I create a string that turns this dictionary into <key>value</key> items.
This is a task that recursion makes simple, and I end up with the right structure. This is all done in python code and is currently fast enough for production use.
You can also (relatively) easily build lists as well, although depending upon your client, you may hit problems unless you give length hints.
For me, this was much simpler, since a dictionary is a much easier way of working than some custom class. For the books, generating XML is much easier than parsing!

For serious work with XML in Python use lxml
Python comes with ElementTree built-in library, but lxml extends it in terms of speed and functionality (schema validation, sax parsing, XPath, various sorts of iterators and many other features).
You have to install it, but in many places, it is already assumed to be part of standard equipment (e.g. Google AppEngine does not allow C-based Python packages, but makes an exception for lxml, pyyaml, and few others).
Building XML documents with E-factory (from lxml)
Your question is about building XML document.
With lxml there are many methods and it took me a while to find the one, which seems to be easy to use and also easy to read.
Sample code from lxml doc on using E-factory (slightly simplified):
The E-factory provides a simple and compact syntax for generating XML and HTML:
>>> from lxml.builder import E
>>> html = page = (
... E.html( # create an Element called "html"
... E.head(
... E.title("This is a sample document")
... ),
... E.body(
... E.h1("Hello!"),
... E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
... E.p("This is another paragraph, with a", "\n ",
... E.a("link", href="http://www.python.org"), "."),
... E.p("Here are some reserved characters: <spam&egg>."),
... )
... )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
<head>
<title>This is a sample document</title>
</head>
<body>
<h1>Hello!</h1>
<p>This is a paragraph with <b>bold</b> text in it!</p>
<p>This is another paragraph, with a
link.</p>
<p>Here are some reserved characters: <spam&egg>.</p>
</body>
</html>
I appreciate on E-factory it following things
Code reads almost as the resulting XML document
Readability counts.
Allows creation of any XML content
Supports stuff like:
use of namespaces
starting and ending text nodes within one element
functions formatting attribute content (see func CLASS in full lxml sample)
Allows very readable constructs with lists
e.g.:
from lxml import etree
from lxml.builder import E
lst = ["alfa", "beta", "gama"]
xml = E.root(*[E.record(itm) for itm in lst])
etree.tostring(xml, pretty_print=True)
resulting in:
<root>
<record>alfa</record>
<record>beta</record>
<record>gama</record>
</root>
Conclusions
I highly recommend reading lxml tutorial - it is very well written and will give you many more reasons to use this powerful library.
The only disadvantage of lxml is, that it must be compiled. See SO answer for more tips how to install lxml from wheel format package within a fraction of a second.

I strongly recommend SAX - Simple API for XML - implementation in the Python libraries. They are fairly easy to setup and process large XML by even driven API, as discussed by previous posters here, and have low memory footprint unlike validating DOM style XML parsers.

If you're going to be building SOAP messages, check out soaplib. It uses ElementTree under the hood, but it provides a much cleaner interface for serializing and deserializing messages.

I assume that the .NET way of processing XML builds on some version of MSXML and in that case I assume that using, for example, minidom would make you feel somewhat at home. However, if it is simple processing you are doing, any library will probably do.
I also prefer working with ElementTree when dealing with XML in Python because it is a very neat library.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Need to read XML files as a stream using BeautifulSoup in Python - python

Related

Is there an existing library to parse Wikpedia tables from dump?

How to ignore mismatched tags while parsing xml in Python

Efficiently parsing broken XML/HTML in python

Map xml file with extracted information

XML Processing in Python

Categories

Resources