Generating lex matching rules and yacc grammar rules from an XML DTD - python

Overview
Although this question implicates lex/yacc, which are written in C, it's fundamentally centered around programming in python.
I have several very similar DTDs that I'm using to parse a document. That section of the program is written in C, and there's just no need to invoke a full SAX handler (viz., libxml2) for this purpose. Since the DTDs (and therefore the XML files) have a static format, I think that this problem can best be solved with lex and yacc.
While writing a full lexical parser for any XML document is far too complex, writing one for a specific subset of XML documents is entirely manageable. The DTD could be used to generate the lexical analyzer (which tokenizes the input) as well as the parser generator in YACC.
There are two assumptions I am willing to make:
The XML document is well-formed vis-à-vis REC-xml-19980210
The XML document is valid vis-à-vis its DTD
Therefore, if an XML document fails to satisfy any of the above, the lexical analyzer/parser should simply fail for that particular file.
Questions
My ultimate goal is to write a python script that successfully: (1) parses the DTD; and (2)
generates the lex/yacc files. Before I begin, I have several questions:
Has this problem already been solved?
If so, are there any libraries that I should consider looking at?
If not, is it because there is no solution using the tools I've mentioned?
Are there better (as measured by performance) ways to extract the non-markup 'content' from XML files than using a static parser?
I realize that I can use PLY to parse the DTD, but because I'm interested in generating the lex/yacc files for inclusion in a C program, that option will not work. As such, I'm thinking that I might use xml.parsers.expat to parse the DTD. This allows me to register callbacks that track the element names, their position within the tree, whether they're required, etc. This should offer me enough information to generate lex/yacc files, but I would like to see what advice you guys have.

Use a combination of the XML Lexer, the yacc grammar, and the YAXX extension to generate the respective files.

Related

Escaping string to use in xml tags

Preliminary: This is not the question that has been answered here:
Escaping strings for use in XML
The actual question:
I have inherited an xml schema that unfortunately encodes the filename of the xml itself as an element (which is highly unusual and not a good idea, I know):
<?xml version="1.0" encoding="utf-8"?>
<this_actual_xml_filename.xml>
<content>
</content>
</this_actual_xml_filename.xml>
I know this is not very useful for a number of reasons, but since I can't change the structure with putting a lot of effort into a tool that uses this file, I am stuck with it for the moment. One of the reasons this is not a good idea is that filenames are much less restricted than xml element names so it is easy to imagine how problems can arise, e.g. generating an xml named valid_filename_(but_invalid_xml).xml would lead to this xml format:
<?xml version="1.0" encoding="utf-8"?>
<valid_filename_(but_invalid_xml).xml>
<content>
</content>
</valid_filename_(but_invalid_xml).xml>
My question is whether there is a way, in Python, to escape any characters that are not allowed in xml elements. Escaping this in a somewhat transparent manner would allow me to reconstruct the original filename in the tool that reads the xml.
I could roll my own using the standard: https://www.w3schools.com/xml/xml_elements.asp but I was wondering whether there is something ready-made for an unusual case like this.
Addendum: Let me stress that this construct is very bad style and that it would be strongly recommended to refactor the file format instead of finding a workaround. Therefore I assume there is no ready-made solution for this problem in any library, since the construct itself goes against basic xml design guidelines.
I posted this question in case that a solution does exist, so that I wouldn't have to reinvent the wheel. I will accept a simple "does not exist" as an answer if nothing else comes up.
As you anticipated, there is no existing way in Python to map from the characters allowed in a filename (for whatever OS) to characters allowed in an XML element name. To be able to do so reversibly would be additionally challenging.
As you also acknowledge, the XML design is unconventional and problematic, for reasons that only begin with the trouble you're currently having regarding allowed characters.
Recommendations, best first:
Fix the problematic design, even if this means fixing upstream and downstream dependencies.
Pre- and/or post-process to map filenames to legal XML element names.
Design and implement the sort of reversible name mapping scheme you have in mind. The level of effort here, combined with the regrettable perpetuation of previous design mistakes, makes this approach unattractive.
See also
Allowed symbols in XML element name
A convention I have seen used is to replace all "special" characters (for some definition of "special") by _HHHH_ where HHHH is the hexadecimal character code. But I don't know of any handy library that does this for you. And it would probably be much easier to write the element out as
<file name="valid_filename_(but_invalid_xml).xml">
...
</file>

Is there a reliable python library for taking a BibTex entry and outputting it into specific formats?

I'm developing using Python and Django for a website. I want to take a BibTex entry and output it in a view in 3 different formats, MLA, APA, and Chicago. Is there a library out there that already does this or am I going to have to manually do the string formatting?
There are the following projects:
BibtexParser
Pybtex
Pybliographer
BabyBib
If you need complex parsing and output, Pybtex is recommended. Example:
>>> from pybtex.database.input import bibtex
>>> parser = bibtex.Parser()
>>> bib_data = parser.parse_file('examples/foo.bib')
>>> bib_data.entries.keys()
[u'ruckenstein-diffusion', u'viktorov-metodoj', u'test-inbook', u'test-booklet']
>>> print bib_data.entries['ruckenstein-diffusion'].fields['title']
Predicting the Diffusion Coefficient in Supercritical Fluids
Good luck.
Having tried them, all of these projects are bad, for various reasons: terrible APIs, bad documentation, and a failure to parse valid BibTeX files. The implementation you want doesn't show up in most Google searches, from my own searching: it's biblib. This text from the README should sell it:
There are a lot of BibTeX parsers out there. Most of them are complete nonsense based on some imaginary grammar made up by the module's author that is almost, but not quite, entirely unlike BibTeX's actual grammar. BibTeX has a grammar. It's even pretty simple, though it's probably not what you think it is. The hardest part of BibTeX's grammar is that it's only written down in one place: the BibTeX source code.
The accepted answer of using pybtex is fraught with danger as Pybtex does not preserve the bibtex format of even simple bibtex files. (https://bitbucket.org/pybtex-devs/pybtex/issues/130/need-to-specially-represent-bibtex-markup)
Pybtex is therefore losing bibtex information when reading and re-writing a simple .bib file without making any changes. Users should be very careful following the recommendations to use pybtex.
I will try biblib as well and report back but the accepted answer should be edited to not recommend pybtex.
Edit:
I was able to import the data using Bibtex Parser, without any loss of data. However, I had to compile from https://github.com/sciunto-org/python-bibtexparser as the version installed via pip was bugged at the time. Users should verify that pip is getting the latest version.
As for exporting, once the data has been imported via BibTex Parser, it's in a dictionary, and can be exported as the user desires. BibTex Parser does not have built in functions for exporting in common formats. As I did not need this functionality, I didn't specifically test it. However, once imported into a dictionary, the string output can be converted to any citation format rather easily.
Here, pybtex and a custom style file can help. I used the style file provided by the journal and compiled in LaTeX instead, but PyBtex has python style files (but also allows ingesting .sty files). So I would recommend taking the Bibtex Parser input and transferring it to PyBtex (or similar) for outputting in a certain style.
The closest thing I know of is the pybtex package

Translate JS code to it's XML-encoded AST-tree

When I am writing scrapers, I always use excellent XPath querying language to extract data from HTML or XML.
Often I am working with dynamic HTML, and have a need to extract some variables from Javascript code, so I am compelled to write ugly regexps to do that.
I am looking for some better way to do this, without involving any heavy-weight Javascript interpretators like PhantomJS.
I know, that where is a lot of tools, which is parsing syntax into XML or JSON files, and looking for something like, that is usable for parsing JS syntax.
You are right that "ugly regexps" can't really be used to process arbitrary JS (or any other standard programming language for that matter). You need a full fledged parser.
There aren't "lots of tools" that parse (language) syntax to XML. Most real language tools have parsers which build an internal AST data structure designed for efficient access, which the tool then uses to achieve its purpose (analysis, transformation, execution). You say "translate to its tree" as if that tree were unique; it isn't. The ASTs built are a function of the parsing technology, the grammar used, and what the designer thought was important to access, so no two language tools agree on what the AST should look like. Tree shapes are thus tool-dependent.
If you get your hands on the source code for any such tool, you can throw away its post-parsing machinery, and add code to walk the AST and dump XML; this is not particularly hard (although getting all the output character escaping/encoding right is a royal PITA). The XML you get will be shaped according to the original tools AST, of course. That means any tool you build to process the XML must implicitly understand the shape of the particular tool's parser that you started with.
I happen to build general-purpose program transformation machinery (see bio), which has parsers for many languages including JavaScript. We get the "I wish I had XML" request enough so our particular tool will produce XML by a flip of a command-line switch, using exactly the means described above. Here's a link to an SO question showing the XML output for Java, and one for C++. If you want to see one for JavaScript, I could produce that and attach here with only a bit of effort.

python sax parser skipping over exception

Is there a way to "skip" a line using a SAX XML parser?
I've got a non-confirming XML document which is a concatenation of valid XML documents and thus the <?xml ...?> appears for each document. Also note I need to use a SAX parser since the input documents are huge.
I tried crafting a "custom stream" class as feeder for the parser but quickly realized that SAX uses the read method and thus reads stuff in "byte arrays" thereby exploding the complexity of this project.
thanks!
UPDATE: I know there is way around this using csplit but I am after a Python based solution if at all possible within reasonable limits.
Update2: Maybe I should have said "skipping to next document", that would have made more sense. Anyhow, that's what I need: a way of parsing multiple documents from a single input stream.
When you are concatenating the documents together, just replace the beginning <? and ?> with <!-- and -->, this will comment out the xml declarations.

XML Processing in Python

I am about to build a piece of a project that will need to construct and post an XML document to a web service and I'd like to do it in Python, as a means to expand my skills in it.
Unfortunately, whilst I know the XML model fairly well in .NET, I'm uncertain what the pros and cons are of the XML models in Python.
Anyone have experience doing XML processing in Python? Where would you suggest I start? The XML files I'll be building will be fairly simple.
Personally, I've played with several of the built-in options on an XML-heavy project and have settled on pulldom as the best choice for less complex documents.
Especially for small simple stuff, I like the event-driven theory of parsing rather than setting up a whole slew of callbacks for a relatively simple structure. Here is a good quick discussion of how to use the API.
What I like: you can handle the parsing in a for loop rather than using callbacks. You also delay full parsing (the "pull" part) and only get additional detail when you call expandNode(). This satisfies my general requirement for "responsible" efficiency without sacrificing ease of use and simplicity.
ElementTree has a nice pythony API. I think it's even shipped as part of python 2.5
It's in pure python and as I say, pretty nice, but if you wind up needing more performance, then lxml exposes the same API and uses libxml2 under the hood. You can theoretically just swap it in when you discover you need it.
There are 3 major ways of dealing with XML, in general: dom, sax, and xpath. The dom model is good if you can afford to load your entire xml file into memory at once, and you don't mind dealing with data structures, and you are looking at much/most of the model. The sax model is great if you only care about a few tags, and/or you are dealing with big files and can process them sequentially. The xpath model is a little bit of each -- you can pick and choose paths to the data elements you need, but it requires more libraries to use.
If you want straightforward and packaged with Python, minidom is your answer, but it's pretty lame, and the documentation is "here's docs on dom, go figure it out". It's really annoying.
Personally, I like cElementTree, which is a faster (c-based) implementation of ElementTree, which is a dom-like model.
I've used sax systems, and in many ways they're more "pythonic" in their feel, but I usually end up creating state-based systems to handle them, and that way lies madness (and bugs).
I say go with minidom if you like research, or ElementTree if you want good code that works well.
I've used ElementTree for several projects and recommend it.
It's pythonic, comes 'in the box' with Python 2.5, including the c version cElementTree (xml.etree.cElementTree) which is 20 times faster than the pure Python version, and is very easy to use.
lxml has some perfomance advantages, but they are uneven and you should check the benchmarks first for your use case.
As I understand it, ElementTree code can easily be ported to lxml.
It depends a bit on how complicated the document needs to be.
I've used minidom a lot for writing XML, but that's usually been just reading documents, making some simple transformations, and writing them back out. That worked well enough until I needed the ability to order element attributes (to satisfy an ancient application that doesn't parse XML properly). At that point I gave up and wrote the XML myself.
If you're only working on simple documents, then doing it yourself can be quicker and simpler than learning a framework. If you can conceivably write the XML by hand, then you can probably code it by hand as well (just remember to properly escape special characters, and use str.encode(codec, errors="xmlcharrefreplace")). Apart from these snafus, XML is regular enough that you don't need a special library to write it. If the document is too complicated to write by hand, then you should probably look into one of the frameworks already mentioned. At no point should you need to write a general XML writer.
You can also try untangle to parse simple XML documents.
Since you mentioned that you'll be building "fairly simple" XML, the minidom module (part of the Python Standard Library) will likely suit your needs. If you have any experience with the DOM representation of XML, you should find the API quite straight forward.
I write a SOAP server that receives XML requests and creates XML responses. (Unfortunately, it's not my project, so it's closed source, but that's another problem).
It turned out for me that creating (SOAP) XML documents is fairly simple if you have a data structure that "fits" the schema.
I keep the envelope since the response envelope is (almost) the same as the request envelope. Then, since my data structure is a (possibly nested) dictionary, I create a string that turns this dictionary into <key>value</key> items.
This is a task that recursion makes simple, and I end up with the right structure. This is all done in python code and is currently fast enough for production use.
You can also (relatively) easily build lists as well, although depending upon your client, you may hit problems unless you give length hints.
For me, this was much simpler, since a dictionary is a much easier way of working than some custom class. For the books, generating XML is much easier than parsing!
For serious work with XML in Python use lxml
Python comes with ElementTree built-in library, but lxml extends it in terms of speed and functionality (schema validation, sax parsing, XPath, various sorts of iterators and many other features).
You have to install it, but in many places, it is already assumed to be part of standard equipment (e.g. Google AppEngine does not allow C-based Python packages, but makes an exception for lxml, pyyaml, and few others).
Building XML documents with E-factory (from lxml)
Your question is about building XML document.
With lxml there are many methods and it took me a while to find the one, which seems to be easy to use and also easy to read.
Sample code from lxml doc on using E-factory (slightly simplified):
The E-factory provides a simple and compact syntax for generating XML and HTML:
>>> from lxml.builder import E
>>> html = page = (
... E.html( # create an Element called "html"
... E.head(
... E.title("This is a sample document")
... ),
... E.body(
... E.h1("Hello!"),
... E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
... E.p("This is another paragraph, with a", "\n ",
... E.a("link", href="http://www.python.org"), "."),
... E.p("Here are some reserved characters: <spam&egg>."),
... )
... )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
<head>
<title>This is a sample document</title>
</head>
<body>
<h1>Hello!</h1>
<p>This is a paragraph with <b>bold</b> text in it!</p>
<p>This is another paragraph, with a
link.</p>
<p>Here are some reserved characters: <spam&egg>.</p>
</body>
</html>
I appreciate on E-factory it following things
Code reads almost as the resulting XML document
Readability counts.
Allows creation of any XML content
Supports stuff like:
use of namespaces
starting and ending text nodes within one element
functions formatting attribute content (see func CLASS in full lxml sample)
Allows very readable constructs with lists
e.g.:
from lxml import etree
from lxml.builder import E
lst = ["alfa", "beta", "gama"]
xml = E.root(*[E.record(itm) for itm in lst])
etree.tostring(xml, pretty_print=True)
resulting in:
<root>
<record>alfa</record>
<record>beta</record>
<record>gama</record>
</root>
Conclusions
I highly recommend reading lxml tutorial - it is very well written and will give you many more reasons to use this powerful library.
The only disadvantage of lxml is, that it must be compiled. See SO answer for more tips how to install lxml from wheel format package within a fraction of a second.
I strongly recommend SAX - Simple API for XML - implementation in the Python libraries. They are fairly easy to setup and process large XML by even driven API, as discussed by previous posters here, and have low memory footprint unlike validating DOM style XML parsers.
If you're going to be building SOAP messages, check out soaplib. It uses ElementTree under the hood, but it provides a much cleaner interface for serializing and deserializing messages.
I assume that the .NET way of processing XML builds on some version of MSXML and in that case I assume that using, for example, minidom would make you feel somewhat at home. However, if it is simple processing you are doing, any library will probably do.
I also prefer working with ElementTree when dealing with XML in Python because it is a very neat library.

Categories

Resources