I am looking for an easily implemented HTML generator for Python. I found HTML.py, but there is no way to add CSS elements (id, class) for table.
Dominate is an HTML generation library that lets you easily create tags. In dominate, python reserved words are prefixed with an underscore, so it would look like this:
from dominate.tags import *
t = div(table(_id="the_table"), _class="tbl")
print(t)
<div class="tbl">
<table id="the_table"></table>
</div>
Disclaimer: I am the author of dominate
If you want programmatic generation rather than templating, Karrigell's HTMLTags module is one possibility; it can include e.g. the class attribute (which would be a reserved word in Python) by the trick of uppercasing its initial, i.e., quoting the doc URL I just gave:
Attributes with the same name as
Python keywords (class, type) must be
capitalized :
print DIV('bar', Class="title") ==> <DIV class="title">bar</DIV>
HTML Generation is usually done with one of the infinite amounts of HTML templating languages available for Python. Personally I like Templess, but Genshi is probably the most popular. There are infinite amounts of them, there is a list which is highly likely to be incomplete.
Otherwise you might want to use lxml, where you can generate it in a more programmatically XML-ish way. Although I have a hard time seeing the benefit.
There's the venerable HTMLGen by Robin Friedrich, which is hard to find but still available here (dated 2001, but HTML hasn't changed much since then ;-). There's also xist. Of course nowadays HTML generation, as Lennart points out, is generally better done using templating systems such as Jinja or Mako.
This is one ultra-simple HTML generator I have written. I use it build-time to generate html. If one is generating html pages run-time then there are better options available
Here is the link
http://pypi.python.org/pypi/sphc
And a quick example
>> import sphw
>> tf = sphw.TagFactory()
>>> div = tf.DIV("Some Text here.", Class='content', id='foo')
>>> print(div)
<DIV Class="content", id="foo">Some Text here.</DIV>
Actually you can add any attribute such as id and class to objects in HTML.py (http://www.decalage.info/python/html).
attribs is an optional parameter of Table, TableRow and TableCell classes. It is a dictionary of additional attributes you would like to set. For example, the following code sets id and class for a table:
import HTML
table_data = [
['Last name', 'First name', 'Age'],
['Smith', 'John', 30],
['Carpenter', 'Jack', 47],
['Johnson', 'Paul', 62],
]
htmlcode = HTML.table(table_data,
attribs={'id':'table1', 'class':'myclass'})
print htmlcode
The same parameter can be used with TableRow and TableCell objects to format rows and cells. It does not exist for columns yet, but should be easy to implement if needed.
Ok, here's another html generator, or I prefer to think of it as a compiler.
https://pypi.python.org/pypi/python-html-compiler
This is a set of base classes that can be used to define tags and attributes. Thus a tag class has attributes and children. Children are themselves Tag classes that have attributes and children etc etc. Also you can set parameters that start with your root class and work up the various branches.
This will allow you to define all the tag classes you want thus be able to create customised classes and implement any tags or attributes you want.
Just started on this, so if anybody wants to test :)
Html generation or any text generatio,jinja is a powerful template engine.
You might be interested in some of the Python HAML implementations. HAML is like HTML shorthand and only takes a few minutes to learn. There's a CSS version called SASS too.
http://haml.hamptoncatlin.com/
"Is there a HAML implementation for use with Python and Django" talks about Python and HAML a bit more.
I'm using HAML as much as possible when I'm programming in Ruby. And, as a footnote, there's also been some work getting modules for Perl which work with the nice MVC Mojolicious:
http://metacpan.org/pod/Text::Haml
Related
I have what is, probably, a very stupid question, but I'm stumped by it and would appreciate any help.
I'm trying to gather xbrl data from SEC filings using Python and BeautifulSoup. One problem I'm having is that certain line items are referred to differently in the instance document and the calculation linkbase.
As a concrete example, take this recent 10-K from PHI Group Inc.:
https://www.sec.gov/Archives/edgar/data/704172/000149315221015100/0001493152-21-015100-index.htm
A line item with the xbrl tag 'WriteoffOfFinancingCosts' shows up as
<PHIL:WriteoffOfFinancingCosts ...> in the instance document (along with a value and contexts)
but shows up as 'loc_PHILWriteoffOfFinancingCosts' in the calculation linkbase.
But this relationship, 'PHIL:' = 'loc_PHIL', isn't standard across XBRL filings. How does one know what prefix will be added to a tag in the calculation linkbase so that (with the prefix removed) it can be reliably tied back to a tag in the instance document?
I can think of various workarounds, but it just seems silly; isn't there somewhere I can look in the calculation linkbase or elsewhere that will just TELL me exactly what prefix is added?
As some (possibly relevant) nuance: lots of tags in lots of filings, of course, have a prefix like 'us-gaap', indicating the us-gaap namespace, but that doesn't seem to guarantee that a tag in the calculation linkbase will therefore look like 'us-gaapAccountsPayableCurrent' and not 'loc_us-gaapAccountsPayableCurrent' or 'us-gaap:AccountsPayableCurrent' or some other variation of the basic pattern, all of which, of course, look different to BeautifulSoup.
Can anyone point me in the right direction?
PHIL:WriteoffOfFinancingCosts is the name of the XBRL concept, while loc_PHILWriteoffOfFinancingCosts is the (calculation linkbase) label of the locator pointing to the concept PHIL:WriteoffOfFinancingCosts. This mechanism is the way linkbases connect concepts together: each locator is a "proxy" to a concept.
loc_PHILWriteoffOfFinancingCosts is thus an internal detail of the calculation linkbase. The names of linkbase labels are in principle "free to choose", however there are conventions that established themselves (such as prefixing with loc_) but I would not rely on them. Rather, you can "follow the trail" by looking at the definition of the linkbase label:
<link:loc xlink:type="locator"
xlink:href="phil-20200630.xsd#PHIL_WriteoffOfFinancingCosts"
xlink:label="loc_PHILWriteoffOfFinancingCosts" />
Where you see, thanks to the xlink:href attribute, that this locator points to the concept with the ID PHIL_WriteoffOfFinancingCosts in file phil-20200630.xsd.
<element id="PHIL_WriteoffOfFinancingCosts"
name="WriteoffOfFinancingCosts" .../>
And you can see that the local name of this concept is WriteoffOfFinancingCosts. It is in the namespace commonly associated with prefix PHIL: but never appears in a concept definition as all concepts in that file are in the namespace commonly associated with PHIL:. Now, how do we know this? because at the top of the xsd file, it says targetNamespace="http://phiglobal.com/20200630" and the prefix PHIL: is also attached to this namespace in the instance file phil-20200630.xml with xmlns:PHIL="http://phiglobal.com/20200630"
It is common practice to choose concept IDs with the prefix followed by underscore followed by the local name. Some users rely on it, but following the levels of indirection, in spite of being more complex, is "safer": linkbase label loc_PHILWriteoffOfFinancingCosts -> concept ID PHIL_WriteoffOfFinancingCosts -> concept local name WriteoffOfFinancingCosts -> concept's fully qualified name PHIL:WriteoffOfFinancingCosts.
You probably notice how complex this is. In fact, this is the reason why it is worth using an XBRL processor, which will do all of this for you.
#Ghislain Fourny: Many thanks. I'm glad to know that I wasn't crazy for finding the situation complex. Knowing now that the linkbase labels are "free-to-choose", here is the specific BeautifulSoup workaround that I've come up with, in case anyone is interested:
labeldict = {}
resp = requests.get(calcurl, headers = headers)
ctext = resp.text
soup = BeautifulSoup(ctext, 'lxml')
tags = soup.find_all()
for tag in tags:
if tag.name == 'link:loc':
if tag.has_attr('xlink:href') and tag.has_attr('xlink:label'):
href = tag['xlink:href']
firstsplit = href.split('#')[1] ## gets the part of the link after the pound symbol
value = firstsplit.split('_')[1] ## gets the part after the underscore
key = tag['xlink:label']
labeldict[key] = value
Which results in a dictionary where keys are the 'loc_Phil'-type label names and the values are the plain concept names, e.g. labeldict['loc_PHILWriteoffOfFinancingCosts'] = 'WriteoffOfFinancingCosts'
This assumes that xsd links will always follow a format of '...#..._concept'. I haven't found any that don't follow that format, but that's not a guarantee.
I need to write a bunch of elements in an xml file and every couple of elements should be empty so in order to make the code a bit more compact i thought I'd make a function that writes out the empty elements. Of course I can't make a code that looks like this:
def makeOne():
table=etree.SubElement(tables,'table')
values = etree.SubElement(table,'values')
and call it later in the function that does the actual input of values because as much as I gathered the file I'm working with isn't loaded inside of that function. I might be wrong. I didn't do much Python so I have no idea if there's a more elegant way of handling this. For clarity this is what i had in mind.
def writeVals():
tree = etree.parse('singleprog')
root = tree.getroot()
tables = etree.SubElement(korjen[0], 'tables')
makeOne()
I tihnk it's clear what I want to see happen here, the thing is I can't just put the two subelements in the writeVals() function because i need to use that code some 30 times at random places.
That's not really answer the question but alternatively but you can use lxml library and it's wonderful E factory method:
from lxml import etree
from lxml.builder import E
table = E.table(E.values)
etree.dump(table)
You'll get:
<table>
<values/>
</table>
To go further:
table = E.table(
E.values("one"),
E.values("two"),
E.values("there"),
)
etree.dump(table)
You'll get:
<table>
<values>one</values>
<values>two</values>
<values>there</values>
</table>
Introduction to lxml:
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.6. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ.
I'm developing an API using Python which makes server calls using XML. I am debating on whether to use a library (ex. http://wiki.python.org/moin/MiniDom) or if it would be "better" (meaning less overhead and faster) to use string concatenation in order to generate the XML used for each request. Also, the XML I will be generating is going to be quite dynamic so I'm not sure if something that allows me to manage elements dynamically will be a benefit.
Since you are just using authorize.net, why not use a library specifically designed for the Authorize.net API and forget about constructing your own XML calls?
If you want or need to go your own way with XML, don't use minidom, use something with an ElementTree interface such as cElementTree (which is in the standard library). It will be much less painful and probably much faster. You will certainly need an XML library to parse the XML you produce, so you might as well use the same API for both.
It is very unlikely that the overhead of using an XML library will be a problem, and the benefit in clean code and knowing you can't generate invalid XML is very great.
If you absolutely, positively need to be as fast as possible, use one of the extremely fast templating libraries available for Python. They will probably be much faster than any naive string concatenation you do and will also be safe (i.e do proper escaping).
Another option is to use Jinja, especially if the dynamic nature of your xml is fairly simple. This is a common idiom in flask for generating html responses.
Here is an example of a jinja template that generates the XML of an aws S3 list objects response. I usually store the template in a separate file to avoid polluting my elegant python with ugly xml.
from datetime import datetime
from jinja2 import Template
list_bucket_result = """<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>{{bucket_name}}</Name>
<Prefix/>
<KeyCount>{{object_count}}</KeyCount>
<MaxKeys>{{max_keys}}</MaxKeys>
<IsTruncated>{{is_truncated}}</IsTruncated>
{%- for object in object_list %}
<Contents>
<Key>{{object.key}}</Key>
<LastModified>{{object.last_modified_date.isoformat()}}</LastModified>
<ETag></ETag>
<Size>{{object.size}}</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>{% endfor %}
</ListBucketResult>
"""
class BucketObject:
def __init__(self, key, last_modified_date, size):
self.key = key
self.last_modified_date = last_modified_date
self.size = size
object_list = [
BucketObject('/foo/bar.txt', datetime.utcnow(), 10*1024 ),
BucketObject('/foo/baz.txt', datetime.utcnow(), 29*1024 ),
]
template = Template(list_bucket_result)
result = template.render(
bucket_name='test-bucket',
object_count=len(object_list),
max_keys=1000,
is_truncated=False,
object_list=object_list
)
print result
Output:
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>test-bucket</Name>
<Prefix/>
<KeyCount>2</KeyCount>
<MaxKeys>1000</MaxKeys>
<IsTruncated>False</IsTruncated>
<Contents>
<Key>/foo/bar.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>10240</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
<Contents>
<Key>/foo/baz.txt</Key>
<LastModified>2017-10-31T02:28:34.551000</LastModified>
<ETag></ETag>
<Size>29696</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
I would definitely recommend that you make use of one of the Python libraries; such as MiniDom, ElementTree, lxml.etree or pyxser. There is no reason to not to, and the potential performance impact will be minimal.
Although, personally I prefer using simplejson (or simply json) instead.
my_list = ["Value1", "Value2"]
json = simplejson.dumps(my_list)
# send json
My real question is what are the biggest concerns for what you're trying to accomplish? If you're worried about speed/memory, then yes, minidom does take a hit. If you want something that's fairly reliable that you can deploy quickly, I'd say use it.
My suggestion for dealing with XML in any language(Java, Python, C#, Perl, etc) is to consider using something already existing. Everyone has written their own XML parser at least once, and then they never do so again because it's such a pain in the behind. And to be fair, these libraries will have already fixed 99.5% of any problems you would run into.
I recommend LXML. It's a Python library of bindings for the very fast C libraries libxml2 and libxslt.
LXML supports XPATH, and has an elementTree implementation. LXML also has an interface called objectify for writing XML as object hierarchies:
from lxml import etree, objectify
E = objectify.ElementMaker(annotate=False)
my_alpha = my_alpha = E.alpha(E.beta(E.gamma(firstattr='True')),
E.beta(E.delta('text here')))
etree.tostring(my_alpha)
# '<alpha><beta><gamma firstattr="True"/></beta><beta><delta>text here</delta></beta></alpha>'
etree.tostring(my_alpha.beta[0])
# '<beta><gamma firstattr="True"/></beta>'
my_alpha.beta[1].delta.text
# 'text here'
I recently discovered the genshi.builder module. It reminds me of Divmod Nevow's Stan module. How would one use genshi.builder.tag to build an HTML document with a particular doctype? Or is this even a good thing to do? If not, what is the right way?
It's not possible to build an entire page using just genshi.builder.tag -- you would need to perform some surgery on the resulting stream to insert the doctype. Besides, the resulting code would look horrific. The recommended way to use Genshi is to use a separate template file, generate a stream from it, and then render that stream to the output type you want.
genshi.builder.tag is mostly useful for when you need to generate simple markup from within Python, such as when you're building a form or doing some sort of logic-heavy modification of the output.
See documentation for:
Creating and using templates
The XML-based template language
genshi.builder API docs
If you really want to generate a full document using only builder.tag, this (completely untested) code could be a good starting point:
from itertools import chain
from genshi.core import DOCTYPE, Stream
from genshi.output import DocType
from genshi.builder import tag as t
# Build the page using `genshi.builder.tag`
page = t.html (t.head (t.title ("Hello world!")), t.body (t.div ("Body text")))
# Convert the page element into a stream
stream = page.generate ()
# Chain the page stream with a stream containing only an HTML4 doctype declaration
stream = Stream (chain ([(DOCTYPE, DocType.get ('html4'), None)], stream))
# Convert the stream to text using the "html" renderer (could also be xml, xhtml, text, etc)
text = stream.render ('html')
The resulting page will have no whitespace in it -- it'll look normal, but you'll have a hard time reading the source code because it will be entirely on one line. Implementing appropriate filters to add whitespace is left as an exercise to the reader.
Genshi.builder is for "programmatically generating markup streams"[1]. I believe the purpose of it is as a backend for the templating language. You're probably looking for the templating language for generating a whole page.
You can, however do the following:
>>> import genshi.output
>>> genshi.output.DocType('html')
('html', '-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd')
See other Doctypes here: http://genshi.edgewall.org/wiki/ApiDocs/genshi.output#genshi.output:DocType
[1] genshi.builder.__doc__
What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?
Here is a snippet that will remove all tags not on the white list, and all tag attributes not on the attribues whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment
def sanitizeHtml(value, base_url=None):
rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
validAttrs = 'href src width height'.split()
urlAttrs = 'href src'.split() # Attributes which should have a URL
soup = BeautifulSoup(value)
for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
# Get rid of comments
comment.extract()
for tag in soup.findAll(True):
if tag.name not in validTags:
tag.hidden = True
attrs = tag.attrs
tag.attrs = []
for attr, val in attrs:
if attr in validAttrs:
val = re_scripts.sub('', val) # Remove scripts (vbs & js)
if attr in urlAttrs:
val = urljoin(base_url, val) # Calculate the absolute url
tag.attrs.append((attr, val))
return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's now I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format at it after performing Markdown to HTML conversion using python-markdown2 and it seems to have held up ok.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.
The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into <. This is the ideal solution assuming you don't need to accept any html input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z,A-Z,0-9 for example) is going to let something through.
SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL Injection and XSS, you will be far more protected if you religiously use Parameterize Queries and HTML Entity Encoding.
Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.
I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
See also Escaping HTML
If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")