I am maintaining an old Django project and discovered that most of the templates are not indented consistently (a random number of spaces on each line, sometimes tabs instead of spaces, etc.). I am comfortable with two-space indentation.
I have never had this kind of problem before and would like a solution (a script, a library, it doesn't matter) that takes care of it. Note that I do not want something that formats the HTML before sending it to the client; I want to format the local files that are used for templating.
I tried to use BeautifulSoup and its prettify function; however, it does not let you change the indentation level (although you could always modify the source), and the indentation it produced for some template tags ({% example %}) was incorrect.
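For reference, this is roughly what I tried (a sketch; the file name is made up):

from bs4 import BeautifulSoup

with open("some_template.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# prettify() hard-codes its indentation width and treats {% ... %}
# template tags as plain text, which is where it goes wrong for me.
print(soup.prettify())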
I am trying to make a bot that shows, in an embed, some files from a directory. The code works, but for files with symbols in their names it modifies the displayed text:
EX:
hello_world_-_1
BECOMES:
helloworld-1, with part of the name shown in italics (the underscores are eaten as markdown)
How can I solve this?
I don't even know what markdown this would be...
You can't explicitly disable markdown, as the rendering happens client-side. The best you can do is wrap in triple backticks.
```hello_world_-_1```
This will work unless your string contains triple backticks. And if you have a filename which contains triple backticks, then you have a different issue.
You could (and probably should) still check, of course. Just a simple "if filename contains ```", and you can report an error if it does.
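A minimal sketch of that check (the function name and the error handling here are made up for illustration):

def safe_filename(name: str) -> str:
    # A name that already contains a code fence cannot be wrapped safely.
    if "```" in name:
        raise ValueError(f"cannot display filename safely: {name!r}")
    # Triple backticks keep the client from restyling the underscores.
    return f"```{name}```"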
I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
I however want to use it to crawl sites other than eBay, and to customize it to my needs.
I am fairly new to python and have limited re experience.
I am unsure of what this line achieves.
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below - regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
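For comparison, the same extraction with BeautifulSoup might look like this sketch (it assumes the links carry class="vip" and a title attribute, which is what the regex below expects):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the page source
for link in soup.find_all("a", class_="vip"):
    print(link["href"], link["title"])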
As for your question, that line uses something called a 'regular expression' to find matching patterns in a text (in this case, HTML). re.findall() is a method that returns a list, so if we focus on just that:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines)
The r prefix indicates that the following string will be interpreted 'raw', meaning that backslashes are taken literally rather than as Python escape characters.
href="([^"]+)
The parentheses indicate a group (the part of the match we care about), and [^"]+ means 'match one or more characters that are not a double quote'. As you can probably guess, this group will return the URL of the link.
.*class="vip"
The .* matches any character except a newline, zero or more times (which here could include other tags, the closing quote of the link, whitespace, etc.). There is nothing special about class="vip" - it just needs to appear.
title=\'([^\']+)
Here you see an escaped quote and then another group, as we saw above. This time, we are capturing everything between the two apostrophes following the title attribute.
The end result is that you are iterating through a list of all matches, where each match looks something like (my_matched_link, my_matched_title) and is unpacked into url and title by the for loop, after which further processing is done.
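To see it in action on a made-up fragment (the HTML below is purely illustrative):

import re

# A single line shaped the way the pattern expects.
lines = '<a href="http://example.com/item/123" class="vip" title=\'Vintage Camera\'>'

for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
    print(url, title)
# prints: http://example.com/item/123 Vintage Camera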
I am not sure if this answers your question, but you could consider Scrapy (http://scrapy.org) for crawling various websites. It is a nice framework that provides a lot of flexibility and is easy to customize to specific needs.
Regular expressions are bad for parsing HTML
The above is the main idea I would like to communicate to you. For why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, HTML can change as text (e.g. a new attribute can be added, the order of attributes can be changed, or some other changes may be introduced) while resulting in exactly the same page as interpreted by web browsers, yet completely breaking your script.
HTML should be parsed using specialized HTML parsers or web scrapers; they know which textual differences are significant and which are not.
What to use for scraping?
There are multiple solutions, but one of the most notable ones is Scrapy. Try it, you may start to love it.
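If you want a feel for it, a minimal spider might look like this sketch (the URL and the CSS selector are placeholders you would adapt to your site):

import scrapy

class ListingSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["http://www.example.com/listings"]  # placeholder

    def parse(self, response):
        # CSS selectors replace the hand-written regex entirely.
        for link in response.css("a.vip"):
            yield {"url": link.attrib["href"], "title": link.attrib["title"]}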
I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one of them - or both - can match the same things.
Following a cleaner approach, I would define the edit URL as:
http://example.com/<pagename>/edit
This is clearer and more guessable, in my humble opinion.
Then, remember that Django loops over your URL patterns in the order you defined them and stops at the first one that matches the incoming request, so the order in which they are defined is really important.
Coming to the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Please always remember the starting caret and the final dollar sign: they say that we expect the URL to start and stop exactly where our regexp does. Otherwise, any leading or trailing symbol/character would make the regexp match as well, while you likely want to show a 404 in that case.
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you prefer the user-friendly style commonly referred to as REST URLs, though RESTfulness is a concept that has nothing to do with URL style).
Summarizing, put the following in your urls.py:
url(r'^(?P<page_name>[\w]+)$', views.page),
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
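As a self-contained sketch (the import style assumes the Django 1.x url() API used in the question):

from django.conf.urls import url
from . import views

urlpatterns = [
    # Both patterns are fully anchored, so neither can shadow the other.
    url(r'^(?P<page_name>[\w]+)$', views.page),
    url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
]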
You can easily forbid particular characters in your URLs by replacing \w with a character set of your own.
To learn more about Django URL Dispatching read here.
Note: regexps are as powerful as they are dangerous, especially when exposed to the network. Keep them simple, and be sure you really understand what you are defining, otherwise your web application may be exposed to several security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Please try the following URL patterns, which are simpler:
url(r'_edit/(?P<page_name>[\w-]+)', views.edit),
url(r'(?P<page_name>[\w-]+)', views.page),
Is there a problem with using unicode (Hebrew, specifically) strings that include whitespace in URLs?
Some of them also include characters such as "%".
I'm experiencing some problems, and since this is my first Django project I want to rule this out before going further into debugging.
And if there is a known Django problem with these kinds of URLs, is there a way around it?
I know I can reformat the text to solve some of those problems, but since I'm preparing a site that uses raw open government data sets (perfectly legal), I would like to stay as close to the original format as possible.
Thanks for the help.
Django shouldn't have any problems with unicode URLs, or whitespace in URLs for that matter (although you might want to take care to make sure whitespace is urlencoded as %20).
Either way, though, using white space in a URL is just bad form. It's not guaranteed to work unless it's urlencoded, and then that's just one more thing to worry about. Best to make any field that will eventually become part of a URL a SlugField (so spaces aren't allowed to begin with) or run the value through slugify before placing it in the URL:
In template:
http://domain.com/{{ some_string_with_spaces|slugify }}/
Or in python code:
from django.template.defaultfilters import slugify
u'http://domain.com/%s/' % slugify(some_string_with_spaces)
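For example, with the import above (the input string is invented for illustration):

slugify(u'hello world 50%')  # -> u'hello-world-50'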
Take a look here for a fairly comprehensive discussion on what makes an invalid (or valid) URL.
I need to let users enter Markdown content to my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.
I can’t be the first one with this problem, but didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.
What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)
I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:
Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.
Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
My question regards case #1, specifically.
Given that, what worked well for me is sending user input through Markdown for Python (which optionally supports Extra syntax) and then through html5lib's sanitizer.
I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.
This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
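As a sketch of that pipeline (this uses bleach, which grew out of html5lib's sanitizer, since html5lib's own sanitizer API has changed across versions; the tag whitelist is an illustrative assumption, not a recommendation):

import bleach
import markdown

def render_user_markdown(text):
    # Markdown for Python; 'extra' enables PHP Markdown Extra syntax.
    html = markdown.markdown(text, extensions=['extra'])
    # Whitelist-based sanitizing: anything not listed gets escaped.
    return bleach.clean(
        html,
        tags=['p', 'a', 'em', 'strong', 'small', 'code', 'pre',
              'blockquote', 'ul', 'ol', 'li'],
        attributes={'a': ['href', 'title']},
    )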
(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.
To prevent XSS attacks, the preferred way to do it is exactly the same as in other languages - you escape the user output when it is rendered back. I just took a peek at the documentation and the source code. Markdown seems to be able to do it right out of the box with some trivial config tweaks.
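For reference, the tweak referred to here was Python-Markdown 2.x's safe_mode argument (since deprecated, and removed in 3.0), which escapes raw HTML rather than passing it through:

import markdown

# Escapes any raw HTML in the input instead of passing it through,
# i.e. option #2 above. Only available in Python-Markdown 2.x.
html = markdown.markdown(user_text, safe_mode='escape')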