I need to let users enter Markdown content into my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.
I can’t be the first one with this problem, but I didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.
What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)
I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:
1. Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
2. Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.
3. Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
My question regards case #1, specifically.
Given that, what worked well for me is sending user input through Markdown for Python, which optionally supports Extra syntax, and then through html5lib’s sanitizer.
I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.
This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
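A rough, standard-library-only sketch of the behaviour described above: benign tags pass through, everything else is shown as literal text. This is not html5lib's sanitizer and is not a vetted replacement for it. The allowlist, the `AllowlistSanitizer` class, and the `sanitize` helper are assumptions made up for illustration. It is meant to run on the HTML that the Markdown step produces.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist for illustration only; tune it for your own application.
ALLOWED_TAGS = {"p", "strong", "em", "small", "code", "pre",
                "ul", "ol", "li", "blockquote", "a"}
ALLOWED_ATTRS = {"a": {"href"}}  # every attribute listed here is treated as a URL
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-emit allowed tags; turn everything else into escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Disallowed tag: show it literally instead of rendering it.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = []
        for name, value in attrs:
            # Keep only allowlisted attributes whose URL uses a safe scheme.
            if (name in ALLOWED_ATTRS.get(tag, ())
                    and value
                    and value.strip().lower().startswith(SAFE_URL_PREFIXES)):
                kept.append(f' {name}="{escape(value.strip())}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag in ALLOWED_TAGS:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        closing = f"</{tag}>"
        self.out.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        # Text content is always escaped on the way out.
        self.out.append(escape(data))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Comments and doctype declarations are silently dropped because the parser's default handlers ignore them. A maintained sanitizer covers far more edge cases than this sketch, so treat it as an explanation of the idea rather than something to deploy.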
Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.
To prevent XSS attacks, the preferred way to do it is exactly the same as in other languages: you escape user-supplied content when it is rendered back. I just took a peek at the documentation and the source code. Markdown seems to be able to do it right out of the box with some trivial config tweaks.
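As a minimal illustration of escaping on output, here is a sketch using only the standard library's `html.escape`. It does not show Python-Markdown's own configuration; `render_comment` is a hypothetical helper name.

```python
from html import escape


def render_comment(user_text):
    """Wrap untrusted text in a paragraph, escaping it at render time."""
    # escape() converts &, <, > and both quote characters to entities,
    # so any markup the user typed is displayed instead of executed.
    return f"<p>{escape(user_text)}</p>"


print(render_comment('<script>alert("x")</script>'))
```

The stored value stays exactly as the user typed it; only the rendered copy is escaped.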
I'm trying to make my own markdown extension to markdown in django. I'm calling it like
markdown.markdown(markup, [neboard_extension])
In my extension's extendMarkdown method I see some default patterns (like autolink for example) and add mine. But neither the default autolink nor my patterns work. How can I enable the patterns?
Patterns are order-dependent.
If your pattern interacts with existing patterns, for example:
by expecting a pattern that is escaped by the EscapePattern before it gets to your extension, then it may hide the pattern that you are looking for.
by changing the output to something that another Pattern or component modifies, then your output will not look as expected.
One tip is to check the ordering. You can sometimes get around the problem by inserting your extension ahead of all other patterns (for the first scenario above), or after they have all been processed (the second scenario).
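The two scenarios above can be reproduced with a toy model. This is not Python-Markdown's API, only a list of regular-expression substitutions applied in order; `run_patterns` and both patterns are hypothetical and exist only to show why ordering matters.

```python
import re


def run_patterns(text, patterns):
    """Apply (name, regex, replacement) triples in list order."""
    for _name, regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text


# Hypothetical "escape" pattern: backslash + char becomes a numeric entity.
escape_pattern = ("escape", r"\\(.)", lambda m: f"&#{ord(m.group(1))};")
# Hypothetical custom pattern: [WORD] becomes bold text.
tag_pattern = ("tag", r"\[(\w+)\]", r"<b>\1</b>")

text = r"\[ABC] and [DEF]"

# Escape runs first: the escaped bracket is hidden from the custom pattern.
print(run_patterns(text, [escape_pattern, tag_pattern]))

# Custom pattern runs first: its own output is then altered by the escape step.
print(run_patterns(text, [tag_pattern, escape_pattern]))
```

In the first ordering only `[DEF]` is converted; in the second, `[ABC]` is converted even though it was escaped, and the leading `<` of the generated tag is then rewritten by the escape step.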
There is little discussion in the documentation about how to protect against this. My experience, after trying to heavily customise python-markdown, is that this is error-prone and awkward, with little in the way of introspection for finding out what other patterns are enabled, other than reading the code.
see: http://diveintopython.net/native_data_types/lists.html#d0e5623
I have a website with code examples on it, generated through docutils, and the CSS is always not quite right.
I would like to know if there is
best practise CSS for displaying code (ie can it handle wrap arounds, long lines, any chance of getting colourisation)
best practise for the little numerical callouts (see diveintopython above)
and finally, I am wondering if there is (open) CSS that is designed to work with docutils HTML output and actually look "nice". I would be happy to contribute some CSS that makes tables look "microsoft professional grey" and so forth.
You can't do syntax highlighting with CSS alone. You need the various parts of the code to be marked up; you can do that on the server if you are using dynamic pages, or you can use JavaScript on the client. Here is a comparison of a few JavaScript syntax highlighters.
The circled numbers are images in the site you linked, but I would use Unicode instead: ❶➋➌➍➎➏➐➑➒➓
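If the callouts are generated on the server, the glyphs can be computed rather than typed. A small sketch, assuming the dingbat negative circled digits at U+2776 through U+277F (which only cover 1 to 10); `callout` is a hypothetical helper.

```python
def callout(n):
    """Return the negative circled digit for n (1 through 10)."""
    if not 1 <= n <= 10:
        raise ValueError("only callouts 1 through 10 exist in this block")
    # U+2776 is the glyph for 1; the following code points continue up to 10.
    return chr(0x2776 + n - 1)


print(" ".join(callout(i) for i in range(1, 11)))
```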
I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text of the craigslist homepage, which looks fine to me, but I haven't gone through it thoroughly. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML, and have made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
However, it isn't well supported on Python 3; there's more information about this at the end of the link.
Parsing HTML is not an easy problem; using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It is also fast, because it is built on C libraries underneath. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
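For the standard-library part of the question: on current Python 3, `html.parser.HTMLParser` is lenient and does not raise on mismatched tags. The sketch below assumes Python 3.5 or later; the `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for illustration. It only reports tags as they appear and does not repair the tree the way lxml or BeautifulSoup do.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring how tags are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Deliberately mis-nested markup: closing tags arrive in the wrong order.
messy = ('<div><p><a href="/sss/">for sale</a></div></p>'
         '<b><a href="/jjj/">jobs</b></a>')
print(extract_links(messy))
```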
I'm exploring many technologies, but I would like your input on which web framework would make this the easiest/ most possible. I'm currently looking to JSP/JSF/Primefaces, but I'm not sure if that is capable of this app.
Here's a basic description of the app:
Users log in with their username and password (maybe I can somehow incorporate OPENID)?
With a really nice UI, they will be presented a large list of questions specific to a certain category, for example, "Cooking". (I will manually compile this list and make it available.)
When they click on any of these questions, a little input box opens up below it to allow the user to put in a link/URL.
If the link they enter has the same question on that webpage the URL points to, they will be awarded one point. This question then disappears and gets added to a different page that has a list of all correctly linked questions.
On the right side of the screen, there will be a leaderboard with the usernames of the people with the top ten points.
The idea is relatively simple - to be able to compile links to external websites for specific questions by allowing many people to contribute.
I know I can build the UI easily with Primefaces. What I'm not sure is if JSP/JSF gives the ability to parse HTML at a certain URL to see if it contains words. I can do this with Python easily by using urllib, but I can't use Python for web GUI building (it is very difficult). What is the best approach?
Any help would be appreciated!!! Thanks!
The best approach is whatever is best for you. If Python isn't your strength but Java is, then use Java. If you're a Python expert and know little Java, use Python.
There are so many resources on the Internet supporting so many platforms that the decision really comes down to what works best for you.
For starters, forget about JSP/JSF. This is an old combination that had many problems. Please consider Facelets/JSF. Facelets is the default templating language in the current version of JSF, while JSP is there only for backwards compatibility.
What I'm not sure is if JSP/JSF gives the ability to parse HTML at a certain URL to see if it contains words.
Yes it does, although the actual fetching of data and parsing of its content will be done by plain Java code. This itself has nothing to do with the JSF APIs.
With JSF you create a Facelet containing your UI (input fields, buttons, etc). Then, still using JSF, you bind this to a so-called backing bean, which is primarily a normal Java class with only one or two JSF-specific annotations applied to it (e.g. @ManagedBean).
When the user enters the URL and presses some button, JSF takes care of calling some action method in your Java class (backing bean). In this action method you now have access to the URL the user entered, and from here on plain Java coding starts and JSF specifics end. You can put the code that fetches the URL and does the parsing you require in a separate helper class (separation of concerns), or at your discretion directly in the backing bean. The choice is yours.
Incidentally we had a very junior programmer at our office use JSF for something not unlike what you are requesting here and he succeeded in doing it in a short time. It thus really isn't that hard ;)
No web technology does what you want. Parsing documents found at certain urls is out of the scope of building web interfaces.
However, each of Java's web technologies will give you, without limits, access to a rich and varied (if not too rich and much too varied) set of libraries and frameworks running on JVM. You could safely say that if there is a library for doing something, there will be a Java version available. Downloading and parsing a document will not require more than what is available in the standard library (unless you insist on injecting your dependencies or crosscutting your concerns), so no problems with doing your project with JSP, or JSF/Primefaces, or whatever.
Since you claim to already know Python, and since you will have to add some HTML/CSS anyway, I suggest you try Django. It's dead simple, has a set of OpenID plugins to choose from, and gives you an admin interface for free (so you can prime the pump with the first set of links).
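Whichever framework renders the UI, the check the question describes (does the page behind a submitted URL contain the question text?) is ordinary back-end code. A rough Python sketch using only the standard library; all function and class names are hypothetical, and the matching rule (case-insensitive, whitespace-collapsed substring) is an assumption about what "contains the same question" should mean.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def normalize(text):
    # Collapse runs of whitespace and ignore case before comparing.
    return re.sub(r"\s+", " ", text).strip().lower()


def page_contains(markup, question):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return normalize(question) in normalize(" ".join(parser.chunks))


def url_contains(url, question, timeout=10):
    # The URL is user input, so refuse anything that is not plain http(s).
    if urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("only http and https URLs are accepted")
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        markup = resp.read().decode(charset, errors="replace")
    return page_contains(markup, question)
```

A view that awards the point would call `url_contains(submitted_url, question_text)` and handle the exceptions `urlopen` can raise. Because the server fetches URLs chosen by users, a real deployment would also want to restrict which hosts can be reached.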
I want to use lxml cleaner to get rid of all html, but then a regex to autolink something:
[ABC] -> ABC
what is the right way to handle this without xss and such?
Maybe using markdown with inline HTML disabled would be suitable? The python markdown module is quite mature.
Check out the "safe mode" section in the docs for more info on stripping out inline HTML.
Depending on what you want, something like py-wikimarkup may be more appropriate.
Using a custom regexp is probably not a great idea, because
you'll have to explain the rules to people who might already be familiar with markdown/WikiText
you'll have to provide a way to escape text, e.g. for people who really want to write [ABC]
you'll have to fix any bugs, including security issues
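If you do go the regex route despite these caveats, one common ordering is to neutralise all HTML first and only then insert your own markup. The sketch below swaps lxml's cleaner for the standard library's `html.escape`, so HTML is displayed as text rather than stripped. The `/tag/` URL prefix, the backslash escape rule, and the `autolink` name are all assumptions made up for illustration.

```python
import html
import re
from urllib.parse import quote

LINK_BASE = "/tag/"  # hypothetical URL prefix for the generated links

# An optional leading backslash lets users write a literal [ABC].
# The name is limited to letters, digits and underscores so it can
# never carry markup into the generated link.
TOKEN = re.compile(r"(\\?)\[([A-Za-z0-9_]+)\]")


def autolink(user_text):
    # Step 1: escape everything, so no user-supplied HTML survives.
    safe = html.escape(user_text)

    # Step 2: add links; the only markup in the result is what we emit here.
    def repl(match):
        backslash, name = match.groups()
        if backslash:
            return f"[{name}]"  # escaped by the user: keep it literal
        return f'<a href="{LINK_BASE}{quote(name)}">{name}</a>'

    return TOKEN.sub(repl, safe)


print(autolink("See [ABC], but not \\[DEF] <script>alert(1)</script>"))
```

Running the substitution after escaping matters: doing it the other way round would escape the links you just generated.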