Sanitising user input using Python - python

What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?

Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attributes whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja&#x09;vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val) # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))
    return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.

Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's how I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up OK.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.

The best way to prevent XSS is not to try to filter everything, but rather to simply do HTML entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any HTML input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9 for example) is going to let something through.
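A minimal sketch of entity encoding with the standard library, assuming Python 3, where html.escape takes over from the older cgi.escape. The helper name render_comment is made up for illustration:

```python
import html

def render_comment(user_text):
    """Escape user text so the browser displays it literally instead of parsing it."""
    # quote=True (the default) also escapes " and ', which matters inside attributes.
    return "<p>" + html.escape(user_text, quote=True) + "</p>"

print(render_comment('<script>alert("hi")</script>'))
# prints: <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```

Escaping at output time, rather than when storing, keeps the stored data usable for non-HTML targets as well.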
SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL injection and XSS, you will be far more protected if you religiously use parameterized queries and HTML entity encoding.
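To make the parameterized-query point concrete, here is a small sketch using the standard library's sqlite3 module. The users table and the find_user helper are assumptions for the example, and other drivers use a different placeholder style (MySQLdb uses %s, for instance):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

def find_user(conn, name):
    """Look up a user by name with a bound parameter, never string concatenation."""
    # The ? placeholder sends `name` as data, so quotes in it cannot change the SQL.
    cur = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cur.fetchall()

print(find_user(conn, "alice"))         # prints: [('alice',)]
print(find_user(conn, "x' OR '1'='1"))  # prints: []
```

The classic injection string simply fails to match any row, because it is compared as a literal value.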

Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern, as long as you pass values as query parameters instead of formatting them into the SQL string yourself. Python's database libraries (MySQLdb, cx_Oracle, etc.) escape or bind the parameters you pass that way. These libraries are used by Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.

I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
See also Escaping HTML

If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
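As a rough sketch of that kind of regex validation, assuming a made-up rule for usernames (3 to 20 characters, letters, digits and underscore only):

```python
import re

# Hypothetical whitelist rule, not taken from any framework.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")

def is_valid_username(value):
    """Return True only if the whole string matches the whitelist pattern."""
    # fullmatch anchors at both ends, so trailing junk cannot slip through.
    return USERNAME_RE.fullmatch(value) is not None

print(is_valid_username("john_doe"))   # prints: True
print(is_valid_username("<script>"))   # prints: False
```

Validation like this rejects bad input early, but it does not replace escaping on output or parameterized queries.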

To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")

Related

How to place the iterated string variable in another string

Quick overview: I am writing a very simple script using Python and Selenium to view Facebook Metrics for multiple Facebook pages.
I am trying to find a clean way to loop through the pages and output their results (it's only one number that I am collecting).
Here is what I have right now but it is not working.
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get(('https://www.facebook.com/{link}/insights/?section=navVideos'))
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get('https://www.facebook.com/'+ link + '/insights/?section=navVideos')
It's just string concatenation.
Or, if you would rather use that syntax, have a look at the comment by @heather.
It didn't work for you, because you aimed to use Python 3.6's f-strings, but forgot to prepend your string with the f char - crucial for this syntax. E.g. your code should be (only the relevant part):
browser.get(f'https://www.facebook.com/{link}/insights/?section=navVideos')
Alternatively you could use string formatting (e.g. the established approach before 3.6):
browser.get('https://www.facebook.com/{}/insights/?section=navVideos'.format(link))
In general, string concatenation - 'string1' + variable + 'string2' - is discouraged in python for performance and readability reasons.
BTW, in your sample code you had an extra pair of parentheses around get()'s argument: browser.get((arg)). They are redundant but harmless, since (arg) is still a plain string; only a trailing comma, as in (arg,), would make it a tuple. Not sure whether it was a typo or on purpose; as you can see, I and the other responders have removed them.
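For comparison, here is a small sketch that builds the same URL three ways without needing a browser. The function name insights_url is made up for the example, and Python 3.6+ is assumed for the f-string:

```python
pages = ["page_example_1", "page_example_2", "page_example_3"]

def insights_url(page):
    """Build the insights URL for one page name using an f-string."""
    return f"https://www.facebook.com/{page}/insights/?section=navVideos"

for page in pages:
    with_format = "https://www.facebook.com/{}/insights/?section=navVideos".format(page)
    with_concat = "https://www.facebook.com/" + page + "/insights/?section=navVideos"
    # All three approaches produce exactly the same string.
    print(insights_url(page), insights_url(page) == with_format == with_concat)
```

In the real script, the result of insights_url(page) would be what gets passed to browser.get().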

Python: Matching & Stripping port number from socket data

I have data coming in to a python server via a socket. Within this data is the string '<port>80</port>' or which ever port is being used.
I wish to extract the port number into a variable. The data coming in is not XML, I just used the tag approach to identifying data for future XML use if needed. I do not wish to use an XML python library, but simply use something like regexp and strings.
What would you recommend is the best way to match and strip this data?
I am currently using this code with no luck:
p = re.compile('<port>\w</port>')
m = p.search(data)
print m
Thank you :)
Regex can't parse XML and shouldn't be used to parse fake XML. You should do one of the following:
Use a serialization method that is nicer to work with to start with, such as JSON or an ini file with the ConfigParser module.
Really use XML and not something that just sort of looks like XML and really parse it with something like lxml.etree.
Just store the number in a file if this is the entirety of your configuration. This solution isn't really easier than just using JSON or something, but it's better than the current one.
Implementing a bad solution now for future needs that you have no way of defining or accurately predicting is always a bad approach. You will be kept busy enough trying to write and maintain software now that there is no good reason to try to satisfy unknown future needs. I have never seen a case where "I'll put this in for later" has led to less headache later on, especially when I put it in by doing something completely wrong. YAGNI!
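As to what's wrong with your snippet, other than using an entirely wrong approach: \w matches exactly one word character, so '<port>\w</port>' cannot match a two-digit port such as 80. You would need \d+ inside a capturing group, and you would have to print m.group(1) rather than the match object itself.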
Though Mike Graham is correct, using regex for xml is not 'recommended', the following will work:
(I have defined searchType as 'd' for numerals)
import re
searchStr = 'port'
searchType = 'd'
if searchType == 'd':
    retPattern = r'(<%s>)(\d+)(</%s>)'
else:
    retPattern = r'(<%s>)(.+?)(</%s>)'
searchPattern = re.compile(retPattern % (searchStr, searchStr))
# Search the incoming socket data, not the tag name
found = searchPattern.search(data)
retVal = found.group(2)
(note the complete lack of error checking, that is left as an exercise for the user)
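If you do go the regex route, a version with the missing error checking might look like the sketch below. It assumes the socket data has already been decoded to str; the name extract_port and the range check are my own additions:

```python
import re

PORT_RE = re.compile(r"<port>(\d+)</port>")

def extract_port(data):
    """Return the port found in data as an int, or None if absent or out of range."""
    match = PORT_RE.search(data)
    if match is None:
        return None
    port = int(match.group(1))
    # Reject numbers that cannot be a TCP/UDP port.
    return port if 0 <= port <= 65535 else None

print(extract_port("hello<port>80</port>world"))  # prints: 80
print(extract_port("no port here"))               # prints: None
```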

Python HTML generator

I am looking for an easily implemented HTML generator for Python. I found HTML.py, but there is no way to add CSS elements (id, class) for table.
Dominate is an HTML generation library that lets you easily create tags. In dominate, python reserved words are prefixed with an underscore, so it would look like this:
from dominate.tags import *
t = div(table(_id="the_table"), _class="tbl")
print(t)
<div class="tbl">
<table id="the_table"></table>
</div>
Disclaimer: I am the author of dominate
If you want programmatic generation rather than templating, Karrigell's HTMLTags module is one possibility; it can include e.g. the class attribute (which would be a reserved word in Python) by the trick of uppercasing its initial, i.e., quoting the doc URL I just gave:
Attributes with the same name as Python keywords (class, type) must be capitalized:
print DIV('bar', Class="title") ==> <DIV class="title">bar</DIV>
HTML Generation is usually done with one of the infinite amounts of HTML templating languages available for Python. Personally I like Templess, but Genshi is probably the most popular. There are infinite amounts of them, there is a list which is highly likely to be incomplete.
Otherwise you might want to use lxml, where you can generate it in a more programmatically XML-ish way. Although I have a hard time seeing the benefit.
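To give an idea of what the XML-ish approach looks like, here is a sketch using the standard library's xml.etree.ElementTree, which offers a similar element-building API to lxml.etree. The build_table helper is made up for the example:

```python
import xml.etree.ElementTree as ET

def build_table(rows, table_id, css_class):
    """Build an HTML table element with id and class attributes set."""
    table = ET.Element("table", {"id": table_id, "class": css_class})
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            td = ET.SubElement(tr, "td")
            td.text = str(cell)  # text is escaped when the tree is serialized
    return table

table = build_table([["Smith", "John", 30]], "table1", "myclass")
print(ET.tostring(table, encoding="unicode", method="html"))
```

Because attributes are passed as a plain dict, reserved words such as class need no special spelling here.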
There's the venerable HTMLGen by Robin Friedrich, which is hard to find but still available here (dated 2001, but HTML hasn't changed much since then ;-). There's also xist. Of course nowadays HTML generation, as Lennart points out, is generally better done using templating systems such as Jinja or Mako.
This is an ultra-simple HTML generator I have written. I use it at build time to generate HTML. If you are generating HTML pages at run time, there are better options available.
Here is the link
http://pypi.python.org/pypi/sphc
And a quick example
>>> import sphc
>>> tf = sphc.TagFactory()
>>> div = tf.DIV("Some Text here.", Class='content', id='foo')
>>> print(div)
<DIV Class="content", id="foo">Some Text here.</DIV>
Actually you can add any attribute such as id and class to objects in HTML.py (http://www.decalage.info/python/html).
attribs is an optional parameter of Table, TableRow and TableCell classes. It is a dictionary of additional attributes you would like to set. For example, the following code sets id and class for a table:
import HTML
table_data = [
    ['Last name', 'First name', 'Age'],
    ['Smith', 'John', 30],
    ['Carpenter', 'Jack', 47],
    ['Johnson', 'Paul', 62],
]
htmlcode = HTML.table(table_data,
                      attribs={'id':'table1', 'class':'myclass'})
print htmlcode
The same parameter can be used with TableRow and TableCell objects to format rows and cells. It does not exist for columns yet, but should be easy to implement if needed.
Ok, here's another html generator, or I prefer to think of it as a compiler.
https://pypi.python.org/pypi/python-html-compiler
This is a set of base classes that can be used to define tags and attributes. Thus a tag class has attributes and children. Children are themselves Tag classes that have attributes and children etc etc. Also you can set parameters that start with your root class and work up the various branches.
This will allow you to define all the tag classes you want thus be able to create customised classes and implement any tags or attributes you want.
Just started on this, so if anybody wants to test :)
For HTML generation, or any text generation, Jinja is a powerful template engine.
You might be interested in some of the Python HAML implementations. HAML is like HTML shorthand and only takes a few minutes to learn. There's a CSS version called SASS too.
http://haml.hamptoncatlin.com/
"Is there a HAML implementation for use with Python and Django" talks about Python and HAML a bit more.
I'm using HAML as much as possible when I'm programming in Ruby. And, as a footnote, there's also been some work getting modules for Perl which work with the nice MVC Mojolicious:
http://metacpan.org/pod/Text::Haml

Parsing SQL with Python

I want to create a SQL interface on top of a non-relational data store. The store itself is non-relational, but it makes sense to access the data in a relational manner.
I am looking into using ANTLR to produce an AST that represents the SQL as a relational algebra expression. Then return data by evaluating/walking the tree.
I have never implemented a parser before, and I would therefore like some advice on how to best implement a SQL parser and evaluator.
Does the approach described above sound about right?
Are there other tools/libraries I should look into? Like PLY or Pyparsing.
Pointers to articles, books or source code that will help me are appreciated.
Update:
I implemented a simple SQL parser using pyparsing. Combined with Python code that implements the relational operations against my data store, this was fairly simple.
As I said in one of the comments, the point of the exercise was to make the data available to reporting engines. To do this, I probably will need to implement an ODBC driver. This is probably a lot of work.
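As an illustration of the "evaluate by walking the tree" idea, here is a minimal sketch of a relational-algebra tree evaluated against an in-memory store. The node classes (Scan, Select, Project) and the dict-of-lists store are assumptions for the example, not the code I actually wrote:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class Scan:
    table: str

@dataclass
class Select:
    child: object
    predicate: Callable[[Row], bool]

@dataclass
class Project:
    child: object
    columns: List[str]

def evaluate(node, store):
    """Walk the tree bottom-up and return the resulting rows as a list of dicts."""
    if isinstance(node, Scan):
        return list(store[node.table])
    if isinstance(node, Select):
        return [row for row in evaluate(node.child, store) if node.predicate(row)]
    if isinstance(node, Project):
        return [{c: row[c] for c in node.columns} for row in evaluate(node.child, store)]
    raise TypeError("unknown node type: %r" % (node,))

# Roughly equivalent to: SELECT name FROM people WHERE age > 40
store = {"people": [{"name": "John", "age": 30}, {"name": "Jack", "age": 47}]}
plan = Project(Select(Scan("people"), lambda row: row["age"] > 40), ["name"])
print(evaluate(plan, store))  # prints: [{'name': 'Jack'}]
```

A parser's job is then only to turn the SQL text into a tree like plan; the evaluator stays independent of the grammar.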
I have looked into this issue quite extensively. Python-sqlparse is a non-validating parser, which is not really what you need. The examples in ANTLR need a lot of work to convert to a nice AST in Python. The SQL standard grammars are here, but it would be a full-time job to convert them yourself, and it is likely that you would only need a subset of them, i.e. no joins. You could try looking at gadfly (a Python SQL database) as well, but I avoided it as they used their own parsing tool.
For my case, I essentially only needed a where clause. I tried booleneo (a boolean expression parser) written with pyparsing but ended up using pyparsing from scratch. The first link in the reddit post of Mark Rushakoff gives a SQL example using it. Whoosh, a full-text search engine, also uses it, but I have not looked at the source to see how.
Pyparsing is very easy to use and you can very easily customize it to not be exactly the same as SQL (most of the syntax you will not need). I did not like ply as it uses some magic using naming conventions.
In short give pyparsing a try, it will most likely be powerful enough to do what you need and the simple integration with python (with easy callbacks and error handling) will make the experience pretty painless.
This reddit post suggests python-sqlparse as an existing implementation, among a couple other links.
TwoLaid's Python SQL Parser works very well for my purposes. It's written in C and needs to be compiled. It is robust. It parses out individual elements of each clause.
https://github.com/TwoLaid/python-sqlparser
I'm using it to parse out query column names to use in report headers. Here is an example.
import sqlparser
def get_query_columns(sql):
    '''Return a list of column headers from given sqls select clause'''
    columns = []
    parser = sqlparser.Parser()
    # Parser does not like new lines
    sql2 = sql.replace('\n', ' ')
    # Check for syntax errors
    if parser.check_syntax(sql2) != 0:
        raise Exception('get_query_columns: SQL invalid.')
    stmt = parser.get_statement(0)
    root = stmt.get_root()
    qcolumns = root.__dict__['resultColumnList']
    for qcolumn in qcolumns.list:
        if qcolumn.aliasClause:
            alias = qcolumn.aliasClause.get_text()
            columns.append(alias)
        else:
            name = qcolumn.get_text()
            name = name.split('.')[-1] # remove table alias
            columns.append(name)
    return columns
sql = '''
SELECT
a.a,
replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
a.bla as sally -- some comment
FROM
table_a as a
WHERE
c > 20
'''
print get_query_columns(sql)
# output: ['a', 'jim', 'sally']
Of course, it may be best to leverage python-sqlparse on Google Code
UPDATE: Now I see that this has been suggested - I concur that this is worthwhile:
I am using python-sqlparse with great success.
In my case I am working with queries that are already validated, my AST-walking code can make some sane assumptions about the structure.
https://pypi.org/project/sqlparse/
https://sqlparse.readthedocs.io/en/latest/

How do I use genshi.builder to programmatically build an HTML document?

I recently discovered the genshi.builder module. It reminds me of Divmod Nevow's Stan module. How would one use genshi.builder.tag to build an HTML document with a particular doctype? Or is this even a good thing to do? If not, what is the right way?
It's not possible to build an entire page using just genshi.builder.tag -- you would need to perform some surgery on the resulting stream to insert the doctype. Besides, the resulting code would look horrific. The recommended way to use Genshi is to use a separate template file, generate a stream from it, and then render that stream to the output type you want.
genshi.builder.tag is mostly useful for when you need to generate simple markup from within Python, such as when you're building a form or doing some sort of logic-heavy modification of the output.
See documentation for:
Creating and using templates
The XML-based template language
genshi.builder API docs
If you really want to generate a full document using only builder.tag, this (completely untested) code could be a good starting point:
from itertools import chain
from genshi.core import DOCTYPE, Stream
from genshi.output import DocType
from genshi.builder import tag as t
# Build the page using `genshi.builder.tag`
page = t.html(t.head(t.title("Hello world!")), t.body(t.div("Body text")))
# Convert the page element into a stream
stream = page.generate()
# Chain the page stream with a stream containing only an HTML4 doctype declaration
stream = Stream(chain([(DOCTYPE, DocType.get('html4'), None)], stream))
# Convert the stream to text using the "html" renderer (could also be xml, xhtml, text, etc)
text = stream.render('html')
The resulting page will have no whitespace in it -- it'll look normal, but you'll have a hard time reading the source code because it will be entirely on one line. Implementing appropriate filters to add whitespace is left as an exercise to the reader.
Genshi.builder is for "programmatically generating markup streams"[1]. I believe the purpose of it is as a backend for the templating language. You're probably looking for the templating language for generating a whole page.
You can, however do the following:
>>> import genshi.output
>>> genshi.output.DocType('html')
('html', '-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd')
See other Doctypes here: http://genshi.edgewall.org/wiki/ApiDocs/genshi.output#genshi.output:DocType
[1] genshi.builder.__doc__
