I want to use the lxml cleaner to get rid of all HTML, but then run a regex to autolink something:
[ABC] -> <a href="...">ABC</a>
What is the right way to handle this without XSS and such?
Maybe using Markdown with inline HTML disabled would be suitable? The Python markdown module is quite mature.
Check out the "safe mode" section in the docs for more info on stripping out inline HTML.
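For instance (a minimal sketch, assuming an older Python-Markdown release; the safe_mode argument was removed in version 3.0, and user_input is a placeholder for your untrusted text):

import markdown

# 'escape' renders raw HTML as literal text instead of silently dropping it
html = markdown.markdown(user_input, safe_mode='escape')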
Depending on what you want, something like py-wikimarkup may be more appropriate.
Using a custom regexp is probably not a great idea, because:

- you'll have to explain the rules to people who might already be familiar with Markdown/WikiText
- you'll have to provide a way to escape text, e.g. for people who really want to write [ABC]
- you'll have to fix any bugs, including security issues
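That said, if you do combine an HTML stripper with an autolink regex as the question describes, escape before linking. A minimal sketch (the [ABC] syntax is from the question; the /tags/ URL is a made-up example):

import re
import html
import lxml.html

def render(comment):
    # 1. Reduce the input to plain text - every tag is dropped
    text = lxml.html.fromstring(comment).text_content()
    # 2. Escape the result so none of it is interpreted as HTML
    text = html.escape(text)
    # 3. Only then autolink; the generated markup is the only HTML left
    return re.sub(r'\[([A-Z]+)\]', r'<a href="/tags/\1">\1</a>', text)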
I'm trying to make my own extension to markdown in Django. I'm calling it like
markdown.markdown(markup, [neboard_extension])
In my extension's extendMarkdown method I see some default patterns (like autolink for example) and add mine. But neither the default autolink nor my patterns work. How can I enable the patterns?
Patterns are order-dependent.
If your pattern interacts with existing patterns, for example:

- by expecting a pattern that the EscapePattern escapes before it gets to your extension - the escaping may hide the pattern you are looking for.
- by changing the output to something that another pattern or component then modifies - your output will not look as expected.
One tip is to check the ordering. You can sometimes get around the problem by inserting your pattern ahead of all other patterns (for the first scenario above) or after they have all been processed (for the second scenario); see the sketch below.
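A minimal sketch of the ordering fix, assuming the pre-3.0 Python-Markdown API implied by the question's markdown.markdown(markup, [neboard_extension]) call (the ***text*** syntax here is hypothetical):

import markdown
from markdown.extensions import Extension
from markdown.inlinepatterns import SimpleTagPattern

# \2 refers to the delimiter group, following the same convention as the
# library's own built-in patterns (it wraps the expression internally)
NEBOARD_RE = r'(\*{3})(.+?)\2'

class NeboardExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        pattern = SimpleTagPattern(NEBOARD_RE, 'strong')
        # '_begin' runs it before every built-in pattern;
        # '<autolink' would slot it in just before autolink instead
        md.inlinePatterns.add('neboard', pattern, '_begin')

html = markdown.markdown(markup, extensions=[NeboardExtension()])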
There is little discussion in the documentation about how to protect against this. My experience, after trying to heavily customise python-markdown, is that this is error-prone and awkward, with little in the way of introspection for finding out which other patterns are enabled... other than reading the code.
I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
I however want to use it to crawl sites other than ebay, and to customize to my needs.
I am fairly new to python and have limited re experience.
I am unsure of what this line achieves.
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below - regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
As for your question, that line uses a regular expression to find matching patterns in text (in this case, HTML). re.findall() is a function that returns a list of all matches, so if we focus on just that:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines)
The r prefix indicates that the string is 'raw', meaning that characters like backslashes are interpreted literally rather than as escape characters.
href="([^"]+)
The parentheses indicate a group (the part of the match we care about), and [^"]+ means 'match one or more characters that are not a quote'. As you can probably guess, this group will return the URL of the link.
.*class="vip"
The .* matches anything (well, almost anything) 0 or more times (which here could include other tags, the closing quote of the link, whitespace, etc.). Nothing special with class="vip" - it just needs to appear.
title=\'([^\']+)
Here you see an escaped quote and then another group, as above. This time, we are capturing anything between the two apostrophes that follow the title attribute.
The end result is that you iterate through a list of all matches, where each match looks something like (my_matched_link, my_matched_title); these are unpacked into for url, title, after which further processing is done.
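For comparison, a hypothetical BeautifulSoup version of that regex (assuming the class="vip" attribute sits on the same <a> tag as the href and title):

from bs4 import BeautifulSoup

soup = BeautifulSoup(lines, 'html.parser')
for a in soup.find_all('a', class_='vip'):
    url, title = a.get('href'), a.get('title')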
I am not sure if this answers your question, but you can consider Scrapy (http://scrapy.org) for crawling various websites. It is a nice framework that provides a lot of flexibility and is easy to customize to specific needs.
Regular expressions are bad for parsing HTML
The above is the main idea I would like to communicate to you. For why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, HTML can change as text (e.g. a new attribute can be added, the order of attributes can change, or other changes may be introduced) yet still produce exactly the same page as interpreted by web browsers, while completely breaking your script.
HTML should be parsed using specialized HTML parsers or web scrapers; they know how to handle such differences when they become significant.
What to use for scraping?
There are multiple solutions, but one of the most notable is Scrapy. Try it; you may start to love it.
I keep getting mismatched tag errors all over the place. I'm not sure why exactly, it's the text on craigslist homepage which looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use or is this my best bet for html parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML and have made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser.
However, it isn't well supported on Python 3; there's more information about this at the end of the link.
Parsing HTML is not an easy problem, so using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's awesome and fast too, since it uses C libraries under the hood. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
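A minimal sketch of lxml coping with the kind of mismatched tags that trip up stricter parsers:

import lxml.html

# The parser repairs the missing </b> and </p> instead of raising an error
doc = lxml.html.fromstring('<p>unclosed <b>bold text')
print(lxml.html.tostring(doc))  # roughly: <p>unclosed <b>bold text</b></p>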
I need to let users enter Markdown content to my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.
I can’t be the first one with this problem, but didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.
What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)
I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:

1. Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
2. Treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.
3. Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3, depending on implementation. This is the approach taken here on Stack Overflow.
My question regards case #1, specifically.
Given that, what worked well for me is sending user input through:

1. Markdown for Python, which optionally supports Extra syntax, and then through
2. html5lib’s sanitizer.
I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.
This way, you are in effect going with option #1 (as desired), except that potentially dangerous or malformed HTML snippets are treated as in option #2.
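A minimal sketch of that pipeline (the function name and option values are assumptions; exact details depend on your installed versions of markdown and html5lib):

import markdown
import html5lib
from html5lib.serializer import HTMLSerializer

def render(user_text):
    # Step 1: render Markdown with the Extra extension; raw HTML passes through
    raw = markdown.markdown(user_text, extensions=['extra'])
    # Step 2: parse the result and re-serialize it with sanitization enabled
    tree = html5lib.parseFragment(raw)
    walker = html5lib.getTreeWalker('etree')
    return HTMLSerializer(sanitize=True).render(walker(tree))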
(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.
To prevent XSS attacks, the preferred way is exactly the same as in other languages - you escape the user output when it is rendered back. I just took a peek at the documentation and the source code; Markdown seems to be able to do this right out of the box with some trivial config tweaks.
Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html

t = lxml.html.fromstring("...")  # parses even sloppy HTML
t.text_content()                 # the text alone: tags stripped, entities resolved
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
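A minimal sketch of that module (the flags shown are real Cleaner options, spelled out for clarity even though the defaults already cover them):

from lxml.html.clean import Cleaner

cleaner = Cleaner(scripts=True, javascript=True, style=True)
safe = cleaner.clean_html('<p onclick="evil()">hi</p><script>bad()</script>')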
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
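For example (a minimal sketch; raw_markup stands in for your incoming text):

from bs4 import BeautifulSoup

soup = BeautifulSoup(raw_markup, 'html.parser')
text = soup.get_text()  # tags dropped, entities such as &amp; resolved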
While I agree with Lucas that regular expressions are not all that scary, I still think you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.
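For instance, with the pytidylib bindings (a minimal sketch):

from tidylib import tidy_document

# Returns the repaired page plus a report of everything that was fixed
fixed_html, errors = tidy_document('<p>unclosed paragraph <b>mismatched')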
How about parsing the HTML data and extracting the data with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
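That chapter predates Python 3, but the standard-library equivalent looks roughly like this (a minimal sketch):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # resolves entities like &lt;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<div>5 &lt; 7</div>')
print(''.join(extractor.parts))  # 5 < 7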
If you use Django you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with regex will return the string "5 " and treat
< 7</div>
as a single tag and strip it out.
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html. It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
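The corrected call would look like this - though, as noted, the naive pattern still mangles pages with bare angle brackets:

import re

# prints '5 ' - the '< 7' was swallowed, as in the 5 < 7 example above
print(re.sub(r'<[^>]+>', '', '<div>5 < 7</div>'))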
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.