using unicode strings with white space as Django url variable - python

Is there a problem with using unicode (hebrew specificaly) strings including white space.
some of them also include characters such as "%" .
I'm experiencing some problems and since this is my first Django project I want to rule out this as a problem before going further into debugging.
And if there is a known Django problem with this kind of urls is there a way around it?
I know I can reformat the text to solve some of those problems but since I'm preparing a site that uses raw open government data sets (perfectly legal) I would like to stick to the original format as possible.
thanks for the help

Django shouldn't have any problems with unicode URLs, or whitespace in URLs for that matter (although you might want to take care to make sure whitespace is urlecoded (%20).
Either way, though, using white space in a URL is just bad form. It's not guaranteed to work unless it's urlencoded, and then that's just one more thing to worry about. Best to make any field that will eventually become part of a URL a SlugField (so spaces aren't allowed to begin with) or run the value through slugify before placing it in the URL:
In template:
http://domain.com/{{ some_string_with_spaces|slugify }}/
Or in python code:
from django.template.defaultfilters import slugify
u'http://domain.com/%s/' % slugify(some_string_with_spaces)

Take a look here for a fairly comprehensive discussion on what makes an invalid (or valid) URL.

Related

GAE unicode character gets encoded to utf-8 bytes

In my app I'm accepting text from user inputs where users often paste text from microsoft word.
A good example being the apostrophe ’, which for some reason gets converted to =E2=80=99 when posting to my handler in google app engine. I've tried a number of confused ways to prevent this and I'm quite happy to simple remove these characters, some of these methods work in plain python but not in app engine.
here's some of what I've tried:
problem_string = re.sub(r'[^\x00-\x7F]+','', problem_string)# trying to remove it
problem_string = problem_string.encode( "utf-8" )# desperation...
problem_string = "".join((c if ord(c) < 128 else '' for c in problem_string))# trying to just remove the thing
problem_string = unicode(problem_string, "utf8")# probably fails since its already unicode
... where I'm trying to capture the string including ’ and then later save it to the ndb datastore as a StringProperty(). Except for the last option, the apsotrophe example gets converted to =E2=80=99.
If I could save the apostrophe type character and display it again that would be great, but simply removing it would also serve my needs.
*Edit - the following:
experience = re.sub(r'[^\x00-\x7F]+',' ', experience)
seems to work fine on the dev server, and successfully removes the offending apostrophe.
Also what may be an issue is that the POST fields are going through the blobstore, so: blobstore_handlers.BlobstoreUploadHandler, which I think may being causing some problems.
I've really been bumping my head against this and I would really really appreciate an explanation from some clever stack-overflower...
Ok, I think I've vaguely stumbled upon a solution.
It had something to do with the blobstore upload handler, I guess it was encoding/decoding unicode appropriately to account for weird file characters. So I modified the handler so that the image file is uploaded via google cloud storage instead of the blobstore and it seems to work fine, i.e. the ’ gets to the datastore as ’ instead of =E2=80=99
I won't accept my own answer for the next few days, maybe someone can clarify things better for future confused individuals.

How to correctly indent Django template code automatically?

I am taking care of an old Django project and discovered that most of the templates are not indented correctly (random spaces in each line, sometimes tabs instead of spaces, etc). I feel comfortable with two-space indentation.
I never had this kind of problem and would like a solution (a script, library, doesn't matter) which would take care of this. Note that I do not want something that formats the HTML before sending it to the client but I want to format the local files that are used for templating.
I tried to use BeautifulSoup and its prettify function - however, it does not let you change the indentation level (although you can always modify it) and the indentation made for some template tags ({% example %}) was incorrect.

How can I write the regex for those urls

I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one - or both - match the same things.
Following a more concise approach I'd really define the edit URL as:
http://example.com/<pagename>/edit
This is more clear and guessable in my humble opinion.
Then, remember that Django loops your url patterns, in the same order you defined them, and stops on the first one matching the incoming request. So the order they are defined with is really important.
Coming with the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Please always remember the starting caret and the final dollar signs, that are saying we expect the URL to start and stop respectively right before and after our regexp, otherwise any leading or trailing symbol/character would make the regexp match as well (while you likely want to show up a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you like the user-friendly URL commonly referred to as REST urls, while RESTfullnes is a concept that has nothing to do with URL style).
Summarizing put the following in your urls:
url(r'^(?P<page_name>[\w]+)$', views.page)
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit)
You can easily force URLs not to have some particular character by changing \w with a set defined by yourself.
To learn more about Django URL Dispatching read here.
Note: Regexp's are as powerful as dangerous, especially when coming on network. Keep it simple, and be sure to really understand what are you defining, otherwise your web application may be exposed to several security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Please try the following URLs, that are simpler:
url(r'_edit/(?P<page_name>[\w-]+), views.edit)'
url(r'(?P<page_name>[\w-]+), views.page),

Best practice for allowing Markdown in Python, while preventing XSS attacks?

I need to let users enter Markdown content to my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.
I can’t be the first one with this problem, but didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.
What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)
I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:
Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.
Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
My question regards case #1, specifically.
Given that, what worked well for me is sending user input through
Markdown for Python, which optionally supports Extra syntax and then through
html5lib’s sanitizer.
I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.
This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.
To prevent XSS attacks, the preferred way to do it is exactly the same as other languages - you escape the user output when rendered back. I just took a peek at the documentation and the source code. Markdown seems to be able to do it right out of the box with some trivial config tweaks.

Django missing translation of some strings. Any idea why?

I have a medium sized Django project, (running on AppEngine if it makes any difference), and have all the strings living in .po files like they should.
I'm seeing strange behavior where certain strings just don't translate. They show up in the .po file when I run make_messages, with the correct file locations marked where my {% trans %} tags are. The translations are in place and look correct compared to other strings on either side of them. But when I display the page in question, about 1/4 of the strings simply don't translate.
Digging into the relevant generated .mo file, I don't see either the msgid or the msgstr present.
Has anybody seen anything similar to this? Any idea what might be happening?
trans tags look correct
.po files look correct
no errors during compile_messages
Ugh. Django, you're killing me.
Here's what was happening:
http://blog.e-shell.org/124
For some reason only Django knows, it decided to decorate some of my translations with the comment '# fuzzy'. It seems to have chosen which ones to mark randomly.
Anyway, #fuzzy means this: "don't translate this, even though here's the translation:"
I'll leave this here in case some other poor soul comes across it in the future.
The fuzzy marker is added to the .po file by makemessages. When you have a new string (with no translations), it looks for similar strings, and includes them as the translation, with the fuzzy marker. This means, this is a crude match, so don't display it to the user, but it could be a good start for the human translator.
It isn't a Django behavior, it comes from the gettext facility.

Categories

Resources