I have asked this question before and the answer led to this error that I am unable to solve.
I have an Article model that takes in articles from users. These articles can get #hashtags in them like we have in twitter. I have these hashtags converted to links that users can click to load all articles that have the clicked hashtags in them.
If have these articles saved in Article model:
1. 'For the love of learning: why do we give #Exam?'
2. 'Articles containing #Examination should not come up when exam is clicked'
3. 'This is just an #example post'
I tried using Django's __icontains filter
def hash_tags(request, hash_tag):
hash_tag = '#' + hash_tag
articles = Articles.objects.filter(content__icontains=hash_tag)
articles = list(articles)
return HttpResponse(articles)
but if user clicks on #exam the three articles are returned instead of the first one.
I can add space to '#exam' to become '#exam ' and it will work out fine but I want to be able to do it with regex.
I have tried:
articles = Articles.objects.filter(content__iregex=r"{0}\b".format(hash_tag))
but I get empty response.
And this:
articles = Articles.objects.filter(content__iregex=r"(?i){0}\b".format(hash_tag))
returns "Error 'repetition-operator operand invalid' from regexp"
How do I do this correctly to have it work? I am using Django 1.6 and MySQL at backend.
MySQL doesn't understand Perl regexp. You must read MySQL regex and use [[:>:]]. The string '\b' in Python is a backspace.
For regex in Python must be used double backspace '\\b' or "raw" string prefixed with "r" r'\b'.
You should check safe characters in the hashtag. A bad user can otherwise construct a regex that would be analyzed forever (DOS attack).
I'm trying to match two strings in an url, but it's not working out likely due to my poor knowledge of regex. I want urls like the following:
testserver/username/
testserver/username/1234/
testserver/username/rewards/
Username would be passed in to the url as a kwarg. Here's what I have:
url(r'^(?P<username>[-\w]+)/$', Posts_Index, name="userposts"),
url(r'^(?P<username>[-\w]+)/photos/$', Photo_Index, name="userphotos"),
url(r'^(?P<username>[-\w]+)/rewards/$', Rewards_Index, name="userrewards"),
url(r'^(?P<username>[-\w]+)/following/$', Follower_Index, name="userfollowers"),
url(r'^(?P<username>[-\w]+)/followers/$', Following_Index, name="userfollowing"),
url(r'^(?P<username>[-\w]+)/(?P<pk>\d+)/$', SinglePost_Index, name="singlepost"),
However, only userposts will be found. If I try to query userphotos or anything below userposts, only the userposts url will be checked, which obviously leads to failure. How can I fix this issue?
From the django docs: Django runs through each URL pattern, in order, and stops at the first one that matches the requested URL.
Thus if you reverse the order of the urls it should work better.
Currently working in Django, and I'm trying to set things up so that a form on one page calls a specific URL, for which the appropriate view is rendered. I'm having trouble with the regular expression that parses the URL, as it won't read the value '\?' as an escaped question mark, which is what I believe it should be doing. The following RE checks out on Pythex.
When the app submits the form, it calls the URL:
http://127.0.0.1:8000/map/?street=62+torrey+pines+cove&city=san+diego&state=CA&radius=50&drg=4
In my project level urls.py file, I have the following:
url(r'^map/', include('healthcare_search.urls', namespace="healthcare_search")),
This calls my app level urls.py file, where I have:
url(r'^\?street=(?P<street>[a-z0-9+]+)&city=(?P<city>[a-z+]+)&state=(?P<state>[a-z]{2})&radius=(?P<radius>[0-9]{1,3})&drg=(?P<drg>[0-9]{1,3})', views.map_hospitals, name = "map_hospitals"),
This just results in a 404 error, saying the URL doesn't match any of the patterns. I know that it's a RE problem, because I removed everything from the app level RE, and submitted just http://127.0.0.1:8000/map/ to see if it would call the right view, which it did successfully. Things seem to break apart on the '\?'. Any ideas what I'm doing wrong?
As a note, this is the first time I've written a regular expression, so my apologies if it is unclear or poorly written.
You don't want to get access to the variables that way. A better option is to get them from the request, since they'll be available in the request's dictionary of variables. In your view, you can get the value of street via request.GET.get('street', None), which will return the value if street is in the request or return None otherwise.
I'm developing application using Bottle. How do I get full query string when I get a GET Request.
I dont want to catch using individual parameters like:
param_a = request.GET.get("a","")
as I dont want to fix number of parameters in the URL.
How to get full query string of requested url
You can use the attribute request.query_string to get the whole query string.
Use request.query or request.query.getall(key) if you have more than one value for a single key.
For eg., request.query.a will return you the param_a you wanted. request.query.b will return the parameter for b and so on.
If you only want the query string alone, you can use #halex's answer.
What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?
Here is a snippet that will remove all tags not on the white list, and all tag attributes not on the attribues whitelist (so you can't use onclick).
It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)
As you can see, it uses the (awesome) BeautifulSoup library.
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment
def sanitizeHtml(value, base_url=None):
rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
validAttrs = 'href src width height'.split()
urlAttrs = 'href src'.split() # Attributes which should have a URL
soup = BeautifulSoup(value)
for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
# Get rid of comments
comment.extract()
for tag in soup.findAll(True):
if tag.name not in validTags:
tag.hidden = True
attrs = tag.attrs
tag.attrs = []
for attr, val in attrs:
if attr in validAttrs:
val = re_scripts.sub('', val) # Remove scripts (vbs & js)
if attr in urlAttrs:
val = urljoin(base_url, val) # Calculate the absolute url
tag.attrs.append((attr, val))
return soup.renderContents().decode('utf8')
As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.
Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.
html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.
Here's now I'm using it in my Stack Overflow clone's sanitize_html utility function:
http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py
I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format at it after performing Markdown to HTML conversion using python-markdown2 and it seems to have held up ok.
The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.
The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into <. This is the ideal solution assuming you don't need to accept any html input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z,A-Z,0-9 for example) is going to let something through.
SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
This is not to say that filtering isn't still a best practice, but in terms of SQL Injection and XSS, you will be far more protected if you religiously use Parameterize Queries and HTML Entity Encoding.
Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/
However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.
SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.
I don't do web development much any longer, but when I did, I did something like so:
When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).
Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)
In short always escape what can affect the current target for the data.
When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
See also Escaping HTML
If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.
Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.
For example (if it is acceptable to remove quotes completely):
datasetName = datasetName.replace("'","").replace('"',"")