Best way to transform a custom XML-like syntax - Python

Using Python.
So basically I have an XML-like tag syntax, but the tags don't have attributes: <a> is allowed but not <a value='t'>. They close regularly with </a>.
Here is my question. I have something that looks like this:
<al>
1. test
2. test2
test with new line
3. test3
<al>
1. test 4
<al>
2. test 5
3. test 6
4. test 7
</al>
</al>
4. test 8
</al>
And I want to transform it into:
<al>
<li>test</li>
<li> test2</li>
<li> test with new line</li>
<li> test3
<al>
<li> test 4 </li>
<al>
<li> test 5</li>
<li> test 6</li>
<li> test 7</li>
</al>
</li>
</al>
</li>
<li> test 8</li>
</al>
I'm not really looking for a complete solution but rather a push in the right direction. I am just wondering how the folks here would approach the problem. Solely regex? Write a full custom parser for the attribute-less tag syntax? Hack up existing XML parsers? Etc.
Thanks in advance

I'd recommend starting with the following:
from xml.dom.minidom import parse, parseString
xml = parse(...)
l = xml.getElementsByTagName('al')
then traverse all elements in l, examining their text subnodes (as well as <al> nodes recursively).
You may start playing with this right away in the Python console.
It is easy to remove text nodes, then split text chunks with chunk.split('\n') and add <li> nodes back, as you need.
After modifying all the <al> nodes you may just call xml.toxml() to get the resulting xml as text.
Note that the element objects you get from this are linked back to the original xml document object, so do not delete the xml object in the process.
I personally consider this approach more straightforward and easier to debug than wrangling with multiline regexes.
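A minimal sketch of that approach (the <al>/<li> names come from the question; the line-splitting rule, the helper name, and the source_text variable are my own assumptions, and it does not yet re-attach a nested <al> to the preceding <li> as the desired output requires):
import re
from xml.dom.minidom import parseString

def wrap_items(doc, al):
    # Replace each numbered text line directly under <al> with a <li> element.
    for child in list(al.childNodes):
        if child.nodeType == child.TEXT_NODE:
            anchor = child.nextSibling  # remember the position before removing
            al.removeChild(child)
            for line in child.data.split('\n'):
                text = re.sub(r'^\s*\d+\.\s*', '', line).strip()
                if text:
                    li = doc.createElement('li')
                    li.appendChild(doc.createTextNode(text))
                    al.insertBefore(li, anchor)  # a None refChild appends
        elif child.nodeType == child.ELEMENT_NODE and child.tagName == 'al':
            wrap_items(doc, child)  # recurse into nested lists

doc = parseString(source_text)  # source_text holds the markup to transform
wrap_items(doc, doc.documentElement)
print(doc.toxml())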

The way you've described your syntax, it is "XML without attributes". If that's so, it's still XML, so you can use XML tools such as XSLT and XQuery.
If you allow things that aren't allowed in XML, on the other hand, my approach would be to write a parser that handles your non-XML format and delivers XML-compatible SAX events. Then you'll be able to use any XML technology just by plugging in your parser in place of the regular XML parser.
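As an illustration of that plug-in idea, a toy tokenizer along these lines could drive any SAX ContentHandler (a rough sketch under the question's constraints: no attributes, tags always written as <name>...</name>):
import re
from xml.sax.xmlreader import AttributesImpl

TAG = re.compile(r'</?(\w+)>')

def feed(text, handler):
    # Emit SAX events for the attribute-less tag syntax.
    handler.startDocument()
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            handler.characters(text[pos:m.start()])
        if m.group(0).startswith('</'):
            handler.endElement(m.group(1))
        else:
            handler.startElement(m.group(1), AttributesImpl({}))
        pos = m.end()
    if pos < len(text):
        handler.characters(text[pos:])
    handler.endDocument()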

It would depend on what you want to do with it exactly; if it is a one-off script, the following suffices:
cat in.txt | perl -pe 'if(!/<\/?al>/){s#^(\s*)([0-9]+\.)?(.*)$#$1<li>$3</li>#}'
And it works, though I wouldn't say it's very robust ;) But for a one-off it's fine.
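For reference, a rough Python equivalent of that one-liner (same caveats: it wraps every non-tag line and has no notion of nesting):
import re
import sys

for line in sys.stdin:
    if not re.search(r'</?al>', line):
        # Keep leading whitespace, drop any "N." prefix, wrap the rest in <li>.
        body = line.rstrip('\n')
        line = re.sub(r'^(\s*)(?:\d+\.\s*)?(.*)$', r'\1<li>\2</li>', body) + '\n'
    sys.stdout.write(line)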

I am just wondering how the folks here would approach the problem.
I would go for using a parser.
My reasoning is that the operation you are trying to perform isn't merely a syntactic or lexical substitution. It's much more of a grammar transformation, which implies understanding the structure of your document.
In your example, you are not simply enclosing each line between <li> and </li>; you are also recursively enclosing blocks of the document that span several lines, whenever they represent an "item".
Maybe you could put together a regex capable of capturing the interpretative logic and the recursive nature of the problem, but doing that would be like digging a trench with a teaspoon: you could do it, but using a spade (a parser) is a much more logical choice.
An additional reason to use a parser is the "real world". Regexes are true "grammar nazis": one glitch in your markup and they won't work. On the other hand, all parser libraries are "flexible" (they treat different spellings, like <a></a> versus <a/> or HTML's <br> versus XHTML's <br/>, uniformly), and some - like BeautifulSoup - are even "forgiving", meaning they will try to guess (with a surprisingly high level of accuracy) what the document's author meant to write, even if the document itself fails validation.
Also, a parser-based solution is much more maintainable than a regex-based one. A small change in your document structure might require radical changes to your regexes [which by nature tend to become obscure to their very own author after 72 hours or so].
Finally, because you are using Python and therefore readability counts, a parser-based solution could potentially result in much more pythonic code than a very complex/long/obscure regex.
HTH!

Escaping strings to use in XML tags

Preliminary: This is not the question that has been answered here:
Escaping strings for use in XML
The actual question:
I have inherited an XML schema that unfortunately encodes the filename of the XML file itself as an element (which is highly unusual and not a good idea, I know):
<?xml version="1.0" encoding="utf-8"?>
<this_actual_xml_filename.xml>
<content>
</content>
</this_actual_xml_filename.xml>
I know this is not very useful for a number of reasons, but since I can't change the structure without putting a lot of effort into a tool that uses this file, I am stuck with it for the moment. One of the reasons this is not a good idea is that filenames are much less restricted than XML element names, so it is easy to imagine how problems can arise; e.g. generating an XML file named valid_filename_(but_invalid_xml).xml would lead to this invalid document:
<?xml version="1.0" encoding="utf-8"?>
<valid_filename_(but_invalid_xml).xml>
<content>
</content>
</valid_filename_(but_invalid_xml).xml>
My question is whether there is a way, in Python, to escape any characters that are not allowed in XML element names. Escaping these in a somewhat transparent manner would allow me to reconstruct the original filename in the tool that reads the XML.
I could roll my own using the element naming rules (https://www.w3schools.com/xml/xml_elements.asp), but I was wondering whether there is something ready-made for an unusual case like this.
Addendum: Let me stress that this construct is very bad style and that it would be strongly recommended to refactor the file format instead of finding a workaround. Therefore I assume there is no ready-made solution for this problem in any library, since the construct itself goes against basic xml design guidelines.
I posted this question in case that a solution does exist, so that I wouldn't have to reinvent the wheel. I will accept a simple "does not exist" as an answer if nothing else comes up.
As you anticipated, there is no existing way in Python to map from the characters allowed in a filename (for whatever OS) to characters allowed in an XML element name. To be able to do so reversibly would be additionally challenging.
As you also acknowledge, the XML design is unconventional and problematic, for reasons that only begin with the trouble you're currently having regarding allowed characters.
Recommendations, best first:
1. Fix the problematic design, even if this means fixing upstream and downstream dependencies.
2. Pre- and/or post-process to map filenames to legal XML element names.
3. Design and implement the sort of reversible name-mapping scheme you have in mind. The level of effort here, combined with the regrettable perpetuation of previous design mistakes, makes this approach unattractive.
See also
Allowed symbols in XML element name
A convention I have seen used is to replace all "special" characters (for some definition of "special") by _HHHH_ where HHHH is the hexadecimal character code. But I don't know of any handy library that does this for you. And it would probably be much easier to write the element out as
<file name="valid_filename_(but_invalid_xml).xml">
...
</file>
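If you do go the _HHHH_ route, here is a minimal sketch of that convention (my own illustration, not a library feature; it escapes '_' as well so decoding is unambiguous, but it still does not guard against names starting with a digit or with "xml"):
import re

def encode_name(filename):
    # Replace every character outside [A-Za-z0-9.-] (including '_')
    # with _HHHH_, where HHHH is the hex code point.
    return re.sub(r'[^A-Za-z0-9.-]',
                  lambda m: '_%04X_' % ord(m.group()), filename)

def decode_name(name):
    # Reverse the mapping.
    return re.sub(r'_([0-9A-Fa-f]{4})_',
                  lambda m: chr(int(m.group(1), 16)), name)

fn = 'valid_filename_(but_invalid_xml).xml'
assert decode_name(encode_name(fn)) == fn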

Parsing HTML with str.split in Python

I'm parsing a website with the requests module and I'm trying to get specific URLs inside <a> tags (a whole table of them, since the tags are used more than once) without using BeautifulSoup. Here's part of the HTML I'm trying to parse:
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don't Spam in Any Way
</div>
I'm trying to get the URL inside the tag's href attribute:
/Forum/ShowPost.aspx?PostID=80631954
The thing is, because I'm parsing a forum site, there are multiple uses of those divider tags. I'd like to retrieve a table of post URLs using string.split using code similar to this:
htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]
There is nothing in the HTML code to indicate a post number on the page, just links.
In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished using the code you want. Here's one way, using a list comprehension:
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
I tried to model it as closely as possible on your code, but I simplified one of the split arguments to avoid whitespace issues.
In case regular expressions are fair game, here's how you could do it:
import re
# Non-greedy so the match stops at the first closing quote.
target = '<a class="post-list-subject" href="(.*?)">'
results = re.findall(target, htmltext)
Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.
It seems really dicey to try to pre-optimize without establishing that your bottleneck is going to be HTML parsing. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.
That said, this will technically do what you want, but it seriously is not more performant than using an HTML parser like lxml in the long run. Explicitly avoiding an HTML parser will also probably drastically increase your development time as you figure out obscure string manipulation snippets rather than just using the nice tree structure that you get for free with HTML.
def strcleaner(s):
    # Strip newlines, spaces, and tabs so the split patterns line up.
    return s.replace('\n', '').replace(' ', '').replace('\t', '')

S = strcleaner(htmltext)
S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]
The problem with the code you posted is that whitespace and newlines are characters too.
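For comparison, a minimal lxml version of the same extraction (assuming htmltext holds the fetched page):
from lxml import html

tree = html.fromstring(htmltext)
results = tree.xpath('//a[@class="post-list-subject"]/@href')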

Easiest way to parse specific pieces of information from HTML

I know the question title isn't amazing, but I can't think of a better way to word it. I have a bit of HTML that I need to search:
<tr bgcolor="#e2d8d4">
<td>1</td>
<td>12:00AM</td>
<td>Show Name<a name="ID#"></a></td>
<td>Winter 12</td>
<td>Channel</td>
<td>Production Company</td>
<td nowrap>1d 11h 9m (air time)</td>
<td align="center">11</td>
<td>
AniDB</td>
<td>Home</td>
</tr>
The page consists of several dozen of these HTML blocks. I need to be able to, with just Show Name, pick out the air time of a given show, as well as the bgcolor (full page here: http://www.mahou.org/Showtime/Planner/). I am assuming the best bet would be a regexp, but I am not confident in that assumption. I would prefer not to use 3rd-party modules (BeautifulSoup). I apologize in advance if the question is vague.
Thank you for doing your research - it's good that you are aware of BeautifulSoup. This would really be the best way to go about solving your problem.
That aside... here is a generic strategy you can choose to implement using regexes (if your sanity is questionable) or using BeautifulSoup (if you're sane.)
It looks like the data you want is always in a table that starts off like:
<table summary="Showtime series for Sunday in a Planner format." border="0" bgcolor="#bfa89b" cellpadding="0" cellspacing="0" width="100%">
You can isolate this by looking for the summary="Showtime series for (Monday|Tuesday|....|Sunday)" attribute of the table, which is unique in the page.
Once you have isolated that table, the format of the rows within it is well defined. I would process one <tr> at a time and assume that the second <td> always contains the airing time and the third <td> always contains the show's name.
Regexes can be good for extracting very simple things from HTML, such as "the src paths of all img tags", but once you start talking about nested tags like "find the second <td> tag of each <tr> tag of the table with attribute summary="...", it becomes much harder to do. This is because regular expressions are not designed to work with nested structures.
See the canonical answer to 'regexps and HTML' questions, and Tom Christiansen's explanation of what it takes to use regexps on arbitrary HTML. tchrist proves that you can use regexps to parse any HTML you want - if you're sufficiently determined - but that a proper parsing library like BeautifulSoup is faster, easier, and will give better results.
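A quick BeautifulSoup sketch of that row-by-row strategy (the cell positions are read off the snippet in the question, where the third <td> holds the show name and the seventh holds the air time; treat them as assumptions):
from bs4 import BeautifulSoup

def find_show(page_html, show_name):
    # Return (bgcolor, air time) for the row whose third cell names the show.
    soup = BeautifulSoup(page_html, 'html.parser')
    for row in soup.find_all('tr', bgcolor=True):
        cells = row.find_all('td')
        if len(cells) >= 7 and cells[2].get_text(strip=True) == show_name:
            return row['bgcolor'], cells[6].get_text(strip=True)
    return None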
This was supposed to be a comment, but it turned out too long.
BeautifulSoup's documentation is pretty good, as it contains quite a few examples; just be aware that there are two versions and not each of them plays nicely with every version of Python, although you probably won't have problems there (see this: "Beautiful Soup 4 works on both Python 2 (2.7+) and Python 3.").
Furthermore, HTML parsers like BeautifulSoup or lxml clean your HTML before processing it (to make it valid and so you can traverse its tree properly), so they may move certain elements regarded as invalid. Usually, you can disable that feature but then it's not certain you're going to get the results you want.
There are other approaches to the task you're asking about. However, they're much more involved to implement, so they may not be desirable under the conditions you described. But just so you know, the whole field of information extraction (IE) deals with these kinds of issues. Here (PDF) is a more or less recent survey of the field, focused mainly on IE from semi-structured (as they call it) HTML pages.

Formatting Python code for the web

Until recently, I posted Python code (whitespace matters) to blogspot.com using something like this:
<div style="overflow-x: scroll ">
<table bgcolor="#ffffb0" border="0" width="100%" padding="4">
<tbody><tr><td><pre style=" hidden;font-family:monaco;">
my code here
</pre></table></div>
About a week ago, the posts started acquiring additional newlines, so all of this is double-spaced. Using a simple <pre> tag is no good (besides losing the color) because it also results in double newlines, while a <code> tag messes with the whitespace. I guess I could just replace each indent with four &nbsp; entities, but that's frowned upon or something by the HTML style gods.
The standard answer to this (like right here on SO) is to get syntax coloring or highlighting through the use of CSS (which I don't know much about), for example as discussed in a previous SO question here. The problem I have with that is that all such solutions require loading a resource from a server on the web. But if (say, 5 years from now) that resource is gone, the HTML version of the code will not render at all. If I knew Javascript I guess I could probably fix that.
The coloring problem itself is trivial, it could be solved through use of <style> tags with various definitions. But parsing is hard; at least I've not made much progress trying to parse Python myself. Multi-line strings are a particular pain. I could just ignore the hard cases and code the simple ones.
TextMate has a command Create HTML from Document. The result is fairly wordy but could just be pasted into a post. But say if you had 3 code segments, then it's like 1000 lines or something. And of course it's a document, so you have to actually cut before you paste.
Is there a simple Python parser? A better solution?
UPDATE: I wrote my own parser for syntax highlighting. Still a little buggy perhaps, but it is quite simple and a self-contained solution. I posted it here. Pygments is also a good choice.
Why don't you use pygments?
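With inline styles the generated HTML needs no external stylesheet or server, which addresses the longevity concern. A minimal sketch (pip install Pygments first):
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = 'def hello():\n    print("hi")\n'
# noclasses=True emits inline style attributes instead of CSS classes.
print(highlight(code, PythonLexer(), HtmlFormatter(noclasses=True)))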
What worked for me was to use prettify. At the top of the HTML, add the line
<script src="https://cdn.jsdelivr.net/gh/google/code-prettify@master/loader/run_prettify.js"></script>
to auto-run prettify. Then use
<code class="prettyprint">
... enter your code here ...
</code>
in your HTML.
I actually used this code on blogspot.com; here is an example.

Filter out HTML tags and resolve entities in python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
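A one-line sketch of that approach (assumes bs4 is installed and html_source holds your markup; get_text also resolves entities):
from bs4 import BeautifulSoup

text = BeautifulSoup(html_source, 'html.parser').get_text()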
While I agree with Lucas that regular expressions are not all that scary, I still think you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box (HTMLParser in the standard library).
You should also check out the python bindings for TidyLib which can clean up broken HTML, making the success rate of any HTML parsing much higher.
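A small sketch of a tag stripper built on that standard-library parser (Python 3 module names; convert_charrefs makes it resolve entities as well):
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # resolve entities to text
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html_source):
    stripper = TagStripper()
    stripper.feed(html_source)
    return ''.join(stripper.chunks)

print(strip_tags('<div>5 &lt; 7</div>'))  # -> 5 < 7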
How about parsing the HTML and extracting what you need with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with a regex will return the string "5 ", because it treats
< 7</div>
as a single tag and strips it out.
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.
