How do I use genshi.builder to programmatically build an HTML document?

How do I use genshi.builder to programmatically build an HTML document? - python

I recently discovered the genshi.builder module. It reminds me of Divmod Nevow's Stan module. How would one use genshi.builder.tag to build an HTML document with a particular doctype? Or is this even a good thing to do? If not, what is the right way?

It's not possible to build an entire page using just genshi.builder.tag -- you would need to perform some surgery on the resulting stream to insert the doctype. Besides, the resulting code would look horrific. The recommended way to use Genshi is to use a separate template file, generate a stream from it, and then render that stream to the output type you want.
genshi.builder.tag is mostly useful for when you need to generate simple markup from within Python, such as when you're building a form or doing some sort of logic-heavy modification of the output.
See documentation for:
Creating and using templates
The XML-based template language
genshi.builder API docs
If you really want to generate a full document using only builder.tag, this (completely untested) code could be a good starting point:
from itertools import chain
from genshi.core import DOCTYPE, Stream
from genshi.output import DocType
from genshi.builder import tag as t
# Build the page using `genshi.builder.tag`
page = t.html (t.head (t.title ("Hello world!")), t.body (t.div ("Body text")))
# Convert the page element into a stream
stream = page.generate ()
# Chain the page stream with a stream containing only an HTML4 doctype declaration
stream = Stream (chain ([(DOCTYPE, DocType.get ('html4'), None)], stream))
# Convert the stream to text using the "html" renderer (could also be xml, xhtml, text, etc)
text = stream.render ('html')
The resulting page will have no whitespace in it -- it'll look normal, but you'll have a hard time reading the source code because it will be entirely on one line. Implementing appropriate filters to add whitespace is left as an exercise to the reader.

Genshi.builder is for "programmatically generating markup streams"[1]. I believe the purpose of it is as a backend for the templating language. You're probably looking for the templating language for generating a whole page.
You can, however do the following:
>>> import genshi.output
>>> genshi.output.DocType('html')
('html', '-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd')
See other Doctypes here: http://genshi.edgewall.org/wiki/ApiDocs/genshi.output#genshi.output:DocType
[1] genshi.builder.__doc__

Related

How do I find text after a given key word?

I am practicing my programming skills (in Python) and I realized that I don't know what to do when I need to find a value that is unknown but introduced by a key word. I am taking the information for this off a website where in the page source it says, '"size":"10","stockKeepingUnitId":"(random number)"'
How can I figure out what that number is.
This is what I have so far --
def stock():
global session
endpoint = '(website)'
reponse = session.get(endpoint)
soup = bs(response.text, "html.parser")
sizes = soup.find('"size":"10","stockKeepingUnitId":')

Off the top of my head there are two ways to do this. Say you have the string mystr = 'some text...content:"67588978"'. The first way is just to search for "content:" in the string and use string slicing to take everything after it:
num = mystr[mystr.index('content:"') + len('content:"'):-1]
Alternatively, as probably a better solution, you could use regular expressions
import re
nums = re.findall(r'.*?content:\"(\d+)\"')
As you haven't provided an example of the dataset you're trying to analyze, there could also be a number of other solutions. If you're trying to parse a JSON or YAML file, there are simple libraries to turn them into python dicts (json is part of the standard library, and PyYaml handles YAML files easily).

Generating XML in Python

I'm developing an API using Python which makes server calls using XML. I am debating on whether to use a library (ex. http://wiki.python.org/moin/MiniDom) or if it would be "better" (meaning less overhead and faster) to use string concatenation in order to generate the XML used for each request. Also, the XML I will be generating is going to be quite dynamic so I'm not sure if something that allows me to manage elements dynamically will be a benefit.

Since you are just using authorize.net, why not use a library specifically designed for the Authorize.net API and forget about constructing your own XML calls?
If you want or need to go your own way with XML, don't use minidom, use something with an ElementTree interface such as cElementTree (which is in the standard library). It will be much less painful and probably much faster. You will certainly need an XML library to parse the XML you produce, so you might as well use the same API for both.
It is very unlikely that the overhead of using an XML library will be a problem, and the benefit in clean code and knowing you can't generate invalid XML is very great.
If you absolutely, positively need to be as fast as possible, use one of the extremely fast templating libraries available for Python. They will probably be much faster than any naive string concatenation you do and will also be safe (i.e do proper escaping).

Another option is to use Jinja, especially if the dynamic nature of your xml is fairly simple. This is a common idiom in flask for generating html responses.
Here is an example of a jinja template that generates the XML of an aws S3 list objects response. I usually store the template in a separate file to avoid polluting my elegant python with ugly xml.
from datetime import datetime
from jinja2 import Template
list_bucket_result = """<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>{{bucket_name}}</Name>
<Prefix/>
<KeyCount>{{object_count}}</KeyCount>
<MaxKeys>{{max_keys}}</MaxKeys>
<IsTruncated>{{is_truncated}}</IsTruncated>
{%- for object in object_list %}
<Contents>
<Key>{{object.key}}</Key>
<LastModified>{{object.last_modified_date.isoformat()}}</LastModified>
<ETag></ETag>
<Size>{{object.size}}</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>{% endfor %}
</ListBucketResult>
"""
class BucketObject:
def __init__(self, key, last_modified_date, size):
self.key = key
self.last_modified_date = last_modified_date
self.size = size
object_list = [
BucketObject('/foo/bar.txt', datetime.utcnow(), 10*1024 ),
BucketObject('/foo/baz.txt', datetime.utcnow(), 29*1024 ),
]
template = Template(list_bucket_result)
result = template.render(
bucket_name='test-bucket',
object_count=len(object_list),
max_keys=1000,
is_truncated=False,
object_list=object_list
)
print result
Output:
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>test-bucket</Name>
<Prefix/>
<KeyCount>2</KeyCount>
<MaxKeys>1000</MaxKeys>
<IsTruncated>False</IsTruncated>
<Contents>
<Key>/foo/bar.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>10240</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
<Contents>
<Key>/foo/baz.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>29696</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>

I would definitely recommend that you make use of one of the Python libraries; such as MiniDom, ElementTree, lxml.etree or pyxser. There is no reason to not to, and the potential performance impact will be minimal.
Although, personally I prefer using simplejson (or simply json) instead.
my_list = ["Value1", "Value2"]
json = simplejson.dumps(my_list)
# send json

My real question is what are the biggest concerns for what you're trying to accomplish? If you're worried about speed/memory, then yes, minidom does take a hit. If you want something that's fairly reliable that you can deploy quickly, I'd say use it.
My suggestion for dealing with XML in any language(Java, Python, C#, Perl, etc) is to consider using something already existing. Everyone has written their own XML parser at least once, and then they never do so again because it's such a pain in the behind. And to be fair, these libraries will have already fixed 99.5% of any problems you would run into.

I recommend LXML. It's a Python library of bindings for the very fast C libraries libxml2 and libxslt.
LXML supports XPATH, and has an elementTree implementation. LXML also has an interface called objectify for writing XML as object hierarchies:
from lxml import etree, objectify
E = objectify.ElementMaker(annotate=False)
my_alpha = my_alpha = E.alpha(E.beta(E.gamma(firstattr='True')),
E.beta(E.delta('text here')))
etree.tostring(my_alpha)
# '<alpha><beta><gamma firstattr="True"/></beta><beta><delta>text here</delta></beta></alpha>'
etree.tostring(my_alpha.beta[0])
# '<beta><gamma firstattr="True"/></beta>'
my_alpha.beta[1].delta.text
# 'text here'

Python String parse

Im working on a data packet retrieval system which will take a packet, and process the various parts of the packet, based on a system of tags [similar to HTML tags].
[text based files only, no binary files].
Each part of the packet is contained between two identical tags, and here is a sample packet:
"<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
The entire packet is contained within the <PACKET><PACKET> tags.
All meta-data is contained within the <HEAD><HEAD> tags and the filename from which the packet is part of is contained within the, you guessed it, the <FILENAME><FILENAME> tags.
Lets say, for example, a single packet is received and stored in a temporary string variable called sTemp.
How do you efficiently retrieve, for example, only the contents of a single pair of tags, for example the contents of the <FILENAME><FILENAME> tags?
I was hoping for such functionality as saying getTagFILENAME( packetX ), which would return the textual string contents of the <FILENAME><FILENAME> tags of the packet.
Is this possible using Python?
Any suggestions or comments appreciated.

If the packet format effectively uses XML-looking syntax (i.e., if the "closing tags" actually include a slash), the xml.etree.ElementTree could be used.
This libray is part of Python Standard Library, starting in Py2.5. I find it a very convenient one to deal with this kind of data. It provides many ways to read and to modify this kind of tree structure. Thanks to the generic nature of XML languages and to the XML awareness built-in the ElementTree library, the packet syntax could evolve easily for example to support repeating elements, element attributes.
Example:
>>> import xml.etree.ElementTree
>>> myPacket = '<PACKET><HEAD><ID>123</ID><SEQ>1</SEQ><FILENAME>Test99.txt</FILE
NAME></HEAD><DATA>spam and cheese</DATA></PACKET>'
>>> xt = xml.etree.ElementTree.fromstring(myPacket)
>>> wrk_ele = xt.find('HEAD/FILENAME')
>>> wrk_ele.text
'Test99.txt'
>>>

Something like this?
import re
def getPacketContent ( code, packetName ):
match = re.search( '<' + packetName + '>(.*?)<' + packetName + '>', code )
return match.group( 1 ) if match else ''
# usage
code = "<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
print( getPacketContent( code, 'HEAD' ) )
print( getPacketContent( code, 'SEQ' ) )

As mjv points out, there's not the least sense in inventing an XML-like format if you can just use XML.
But: If you're going to use XML for your packet format, you need to really use XML for it. You should use an XML library to create your packets, not just to parse them. Otherwise you will come to grief the first time one of your field values contains an XML markup character.
You can, of course, write your own code to do the necessary escaping, filter out illegal characters, guarantee well-formedness, etc. For a format this simple, that may be all you need to do. But going down that path is a way to learn things about XML that you perhaps would rather not have to learn.
If using an XML library to create your packets is a problem, you're probably better off defining a custom format (and I'd define one that didn't look anything like XML, to keep people from getting ideas) and building a parser for it using pyparsing.

Python HTML generator

I am looking for an easily implemented HTML generator for Python. I found HTML.py, but there is no way to add CSS elements (id, class) for table.

Dominate is an HTML generation library that lets you easily create tags. In dominate, python reserved words are prefixed with an underscore, so it would look like this:
from dominate.tags import *
t = div(table(_id="the_table"), _class="tbl")
print(t)
<div class="tbl">
<table id="the_table"></table>
</div>
Disclaimer: I am the author of dominate

If you want programmatic generation rather than templating, Karrigell's HTMLTags module is one possibility; it can include e.g. the class attribute (which would be a reserved word in Python) by the trick of uppercasing its initial, i.e., quoting the doc URL I just gave:
Attributes with the same name as
Python keywords (class, type) must be
capitalized :
print DIV('bar', Class="title") ==> <DIV class="title">bar</DIV>

HTML Generation is usually done with one of the infinite amounts of HTML templating languages available for Python. Personally I like Templess, but Genshi is probably the most popular. There are infinite amounts of them, there is a list which is highly likely to be incomplete.
Otherwise you might want to use lxml, where you can generate it in a more programmatically XML-ish way. Although I have a hard time seeing the benefit.

There's the venerable HTMLGen by Robin Friedrich, which is hard to find but still available here (dated 2001, but HTML hasn't changed much since then ;-). There's also xist. Of course nowadays HTML generation, as Lennart points out, is generally better done using templating systems such as Jinja or Mako.

This is one ultra-simple HTML generator I have written. I use it build-time to generate html. If one is generating html pages run-time then there are better options available
Here is the link
http://pypi.python.org/pypi/sphc
And a quick example
>> import sphw
>> tf = sphw.TagFactory()
>>> div = tf.DIV("Some Text here.", Class='content', id='foo')
>>> print(div)
<DIV Class="content", id="foo">Some Text here.</DIV>

Actually you can add any attribute such as id and class to objects in HTML.py (http://www.decalage.info/python/html).
attribs is an optional parameter of Table, TableRow and TableCell classes. It is a dictionary of additional attributes you would like to set. For example, the following code sets id and class for a table:
import HTML
table_data = [
['Last name', 'First name', 'Age'],
['Smith', 'John', 30],
['Carpenter', 'Jack', 47],
['Johnson', 'Paul', 62],
]
htmlcode = HTML.table(table_data,
attribs={'id':'table1', 'class':'myclass'})
print htmlcode
The same parameter can be used with TableRow and TableCell objects to format rows and cells. It does not exist for columns yet, but should be easy to implement if needed.

Ok, here's another html generator, or I prefer to think of it as a compiler.
https://pypi.python.org/pypi/python-html-compiler
This is a set of base classes that can be used to define tags and attributes. Thus a tag class has attributes and children. Children are themselves Tag classes that have attributes and children etc etc. Also you can set parameters that start with your root class and work up the various branches.
This will allow you to define all the tag classes you want thus be able to create customised classes and implement any tags or attributes you want.
Just started on this, so if anybody wants to test :)

Html generation or any text generatio,jinja is a powerful template engine.

You might be interested in some of the Python HAML implementations. HAML is like HTML shorthand and only takes a few minutes to learn. There's a CSS version called SASS too.
http://haml.hamptoncatlin.com/
"Is there a HAML implementation for use with Python and Django" talks about Python and HAML a bit more.
I'm using HAML as much as possible when I'm programming in Ruby. And, as a footnote, there's also been some work getting modules for Perl which work with the nice MVC Mojolicious:
http://metacpan.org/pod/Text::Haml

Sanitising user input using Python

What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?

Here is a snippet that will remove all tags not on the white list, and all tag attributes not on the attribues whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment
def sanitizeHtml(value, base_url=None):
rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
validAttrs = 'href src width height'.split()
urlAttrs = 'href src'.split() # Attributes which should have a URL
soup = BeautifulSoup(value)
for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
# Get rid of comments
comment.extract()
for tag in soup.findAll(True):
if tag.name not in validTags:
tag.hidden = True
attrs = tag.attrs
tag.attrs = []
for attr, val in attrs:
if attr in validAttrs:
val = re_scripts.sub('', val) # Remove scripts (vbs & js)
if attr in urlAttrs:
val = urljoin(base_url, val) # Calculate the absolute url
tag.attrs.append((attr, val))
return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.

Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's now I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format at it after performing Markdown to HTML conversion using python-markdown2 and it seems to have held up ok.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.

The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into <. This is the ideal solution assuming you don't need to accept any html input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z,A-Z,0-9 for example) is going to let something through.
SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL Injection and XSS, you will be far more protected if you religiously use Parameterize Queries and HTML Entity Encoding.

Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.

I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
See also Escaping HTML

If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.

To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I use genshi.builder to programmatically build an HTML document? - python

I recently discovered the genshi.builder module. It reminds me of Divmod Nevow's Stan module. How would one use genshi.builder.tag to build an HTML document with a particular doctype? Or is this even a good thing to do? If not, what is the right way?

Related

How do I find text after a given key word?

Generating XML in Python

Python String parse

Python HTML generator

Sanitising user input using Python

Categories

Resources