Usually code snippets wrap the code tag with a pre tag. Looks like markdown is just using a p tag, is this normal?
from markdown2 import Markdown
markdowner = Markdown()
markdowner.convert("```\nthis is code\n```")
u'<p><code>\nthis is code\n</code></p>\n'
Even this website's adding pre tags. How do I add it to markdown?
is this normal?
Yes, fenced code blocks are not standard Markdown (only indented code blocks are). However, inline code spans can be delimited with any number of backticks (as long as the opening and closing delimiters match). Therefore, the parser is correctly parsing your input as an inline code span, which consists of a code tag inside a p tag. Of course, if you had inserted any blank lines, the output would have been multiple paragraphs without any code spans (as the opening and closing delimiters would have been in separate paragraphs).
How do I add it to markdown?
As fenced code blocks are non-standard Markdown, they generally need to be enabled in parsers which support them. Each parser is different, so users should consult the documentation for their parser of choice. The other answer already covers how to enable them in the specific parser used by the OP.
Turns out markdown2 only adds pre's to what is indented by four spaces.
To enable fenced code blocks in the example above, use:
markdown2.markdown(text, extras=["fenced-code-blocks"])
I'm trying to put some Python code in an HTML document. I am using the code tag. Example:
<code>
for iris, species in zip(irises, classification):
    if species == 0:
        print(f'Flower {iris} is Iris-setosa')
</code>
The problem is that the page doesn't show new lines and indents. I can handle new lines with the br tag, but I didn't find anything to make an indent. I tried the pre tag, but then I have to remove all indents in the HTML document, and with several levels of indentation it starts to look very ugly. Probably I could use &nbsp;, but using 4, 8 or 12 of them in one line doesn't seem to be a good idea. Is there anything else I can do to format my code?
The parser collapses whitespace characters in the source code. You can use <pre> or <br/>, or fake it with CSS. The solution you proposed is also valid and works but, as you stated, it is ugly. If you go that way, you can use a tab character (&#9;), which creates a tab indent; that makes more sense than 4 x &nbsp;, but you still need to put it inside a <pre> tag so that the parser does not ignore it.
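If the HTML is generated rather than written by hand, the indentation problem can be avoided by producing the <pre> block programmatically. The following is only a sketch under that assumption; `code_to_html` is a hypothetical helper name, not something from the question.

```python
import html
import textwrap


def code_to_html(source: str) -> str:
    """Wrap source code in <pre><code> so newlines and indentation survive."""
    # Remove the indentation shared by all lines, plus surrounding blank lines,
    # so the snippet can be indented freely where it is defined.
    cleaned = textwrap.dedent(source).strip("\n")
    # Escape &, < and > so the code is displayed literally, not parsed as markup.
    return f"<pre><code>{html.escape(cleaned, quote=False)}</code></pre>"


snippet = """
    for iris, species in zip(irises, classification):
        if species == 0:
            print(f'Flower {iris} is Iris-setosa')
"""
print(code_to_html(snippet))
```

Inside <pre> the browser keeps the relative indentation, so only the common leading whitespace has to be dropped.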
I need to match different script tags, for example like this:
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
jQuery("#gift_cards").tooltip({ effect: \'slide\'});
});
</script>
<script>dasdfsfsdf</script>
I also need to capture the tags and the src content in separate groups.
I created a regex:
(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)
This does not match the last script tag.
What's wrong with it?
EDIT:
Removing the \B does match all the script tags, but then I do not get the contents of the src attribute in a separate group. What I need to do is process a group of script tags of two categories:
One with a src attribute holding the path to the actual script
Second without a src attribute, containing normal inline JavaScript
I need to remove the script opening and closing tags but keep the content inside the tag.
If it's of the first type, I still need to remove the tags but keep the path in a separate table.
Hope that clarifies it.
As iCodez' link so entertainingly shows, HTML should not be parsed by regex, as HTML is not a regular language. Instead, try using a parser such as BeautifulSoup. Also install lxml and html5lib for best performance and access to all the features.
pip install lxml html5lib beautifulsoup4
should do the trick.
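To illustrate the parser-based approach without any third-party install, here is a rough sketch that swaps in the standard library's html.parser in place of BeautifulSoup. The class name `ScriptCollector` and the shortened inline script are assumptions made for the example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect src paths and inline bodies of <script> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []    # values of src attributes
        self.inline = []     # bodies of scripts that contain inline code
        self._buffer = None  # list of text chunks while inside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)
            self._buffer = []

    def handle_data(self, data):
        # Text outside <script> arrives while _buffer is None and is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            body = "".join(self._buffer).strip()
            if body:
                self.inline.append(body)
            self._buffer = None


SAMPLE = """
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function() {});
</script>
<script>dasdfsfsdf</script>
"""

collector = ScriptCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.sources)
print(collector.inline)
```

The paths end up in one list and the inline code in another, which matches the two categories described in the question.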
Granted that I agree with all the remarks about not parsing HTML with regex, and granted that I myself indulge in such evil practice when I'm confident that the documents I will process are regular enough: try removing the \B; in my test it then matches all three scripts.
What is this "non-boundary" for, by the way? I'm not sure I understood why you inserted it. If it was necessary for some reason I do not grasp, please tell me and we'll try to find another way.
Edit:
In order to retain the src content try
(<\s*?script[\s\S]*?(?:(?:src=[\'"](.*?)[\'"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)
This works for me, check against your other samples.
Consider that your first [\s\S]*? already matches everything till > when you do not have a "src" attribute, so the second one only makes sense if "src" is there and you want to match other possible attributes.
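A quick way to check the revised pattern is to run it over the three sample tags. The sketch below assumes Python's re module (the thread does not say which regex engine is used) and shortens the inline script for brevity.

```python
import re

# The revised pattern from this answer, unchanged.
SCRIPT_RE = re.compile(
    r"""(<\s*?script[\s\S]*?(?:(?:src=[\'"](.*?)[\'"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)"""
)

SAMPLE = """
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function() {});
</script>
<script>dasdfsfsdf</script>
"""

# Group 2 is the src path (None when the tag has no src); group 3 is the body.
results = [(m.group(2), m.group(3).strip()) for m in SCRIPT_RE.finditer(SAMPLE)]
for src, body in results:
    print(repr(src), repr(body))
```

This only shows that the pattern copes with these three inputs; it says nothing about arbitrary HTML.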
For giggles, here's a super-simple way that I figured out by complete accident (as a JS string which would be fed to the RegExp constructor):
'src=(=|=")' + yourPathHere + '[^<]<\/script>'
where yourPathHere has had forward slashes escaped; so, as a pure RE, something like:
/src=(=|=")/scripts/someFolder/script.js[^<]</script>/
which I'm using in a gulp task whilst I'm trying to figure out gulp streams :[]
I'm accepting user input on a small forum I have. This is what I do with user's input:
First, call "html.strip_tags" from django.utils.html on user's cleaned_data[input].
Save it to the database (PostgreSQL).
Query the text and use a regex to replace \n with br and display spaces entered by users.
Then, I do {{text|safe}} to display the text (if I don't mark it as safe, it won't display spaces between paragraphs but br tags).
Finally I use some jquery plugins on the text: Autolinker.js to detect and "urlize" hyperlinks and trunk8 to control its length.
So, because I do {{text|safe}} I am worried about malicious input, is html.strip_tags enough?
The documentation about strip_tags says:
"Tries to remove anything that looks like an HTML tag from the string, that is anything contained within <>. Absolutely NO guarantee is provided about the resulting string being entirely HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape()."
The documentation about Python's Bleach:
"The primary goal of Bleach is to sanitize user input that is allowed to contain some HTML as markup and is to be included in the content of a larger page."
Because the user input is not allowed to contain any html, my guess is that Bleach is not needed.. but I am kind of noob so your suggestions will be appreciated.
Quoting the docs on striptags
No safety guarantee
Note that striptags doesn’t give any guarantee about its output being entirely HTML safe, particularly with non valid
HTML input. So NEVER apply the safe filter to a striptags output. If
you are looking for something more robust, you can use the bleach
Python library, notably its clean method.
I think the answer here is to use bleach to strip the tags, as easy as bleach.clean(text, tags=[]). Plus, with bleach.linkify you can take care of the URLs as well.
Regarding your general process: if the string is generated once and queried multiple times, why aren't you adding the line breaks and URLs while saving?
If the only reason you need to mark the input as "safe" is so that it will display your <br> tags that you inserted where users typed line breaks, then your best approach is to use the linebreaks filter. From the Django documentation:
linebreaks
Replaces line breaks in plain text with appropriate HTML; a single newline becomes an HTML line break (<br />) and a new line followed by a blank line becomes a paragraph break (</p>).
For example:
{{ value|linebreaks }}
If value is Joel\nis a slug, the output will be <p>Joel<br />is a slug</p>.
Instead of using a regex to replace newlines with <br>s in your database, just leave the data in there as the user entered it. Then, you can display it in a template with
{{ text|striptags|linebreaks }}
This will first remove (most) HTML tags from your user's input, then add in <br> and <p> tags for newlines. It does not mark the string as safe, though, so any tags left in the user's input will be escaped; only the tags created by linebreaks will have any effect.
(Note that if you don't want <p> tags, you can use the variant filter linebreaksbr).
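The principle behind this (escape everything the user typed, then add only the tags you generate yourself) can be sketched outside Django with the standard library. This is not Django's implementation of striptags or linebreaksbr; `render_user_text` is a hypothetical helper used only for illustration.

```python
import html


def render_user_text(text: str) -> str:
    """Escape user input, then turn newlines into <br> tags."""
    # Escape first, so any markup typed by the user is displayed, not interpreted.
    escaped = html.escape(text)
    # Normalise Windows and old Mac line endings before inserting <br>.
    normalised = escaped.replace("\r\n", "\n").replace("\r", "\n")
    # The only tags in the result are the <br> tags added here.
    return normalised.replace("\n", "<br>")


print(render_user_text("Joel\nis a slug"))
print(render_user_text("<script>alert(1)</script>"))
```

Because escaping happens before the <br> tags are inserted, leftover user markup cannot become live HTML.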
I am using QtGui.QMessageBox to display messages, warnings and errors.
It seems that QMessageBox doesn't want to work with the "\n" newline character when it is used together with HTML tags.
message = "<a href = http://www.google.com> GOOGLE</a> This a line number one.\n This a line number two. \n And this is a line number three."
is all being displayed as one long line when displayed within QMessageBox.
Thanks in advance!
The behaviour you are seeing is entirely as expected. It is part of the HTML 4 spec that, other than inside PRE tags, sequences of whitespace characters should always be collapsed to a single space. To quote the relevant part of the spec:
Note that a sequence of white spaces between words in the source
document may result in an entirely different rendered inter-word
spacing (except in the case of the PRE element). In particular, user
agents should collapse input white space sequences when producing
output inter-word space.
So, when you need to insert line-breaks, do it explicitly using the <br> tag.
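A minimal sketch of that conversion, done on the string before it is handed to the message box. `to_rich_text` is a hypothetical helper name, and the Qt call itself is left out so the snippet runs without Qt installed.

```python
def to_rich_text(message: str) -> str:
    """Replace newline characters with explicit <br> tags for rich-text display."""
    return message.replace("\n", "<br>")


message = (
    '<a href="http://www.google.com">GOOGLE</a> This is line number one.\n'
    "This is line number two.\n"
    "And this is line number three."
)
rich = to_rich_text(message)
print(rich)
```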
PS:
It's also worth noting here that Qt's text widgets only support a limited set of HTML tags, attributes and CSS properties. For full details, see the Supported HTML Subset in the Qt docs.
I have a <textarea> where the user enters his text. The text can contain special chars which I need to parse and replace with HTML tags for display purposes.
For example:
Bolded text will be entered as: *some text* and parsed to: <strong>some text</strong>.
URL will be entered as: #some text | to/url# and parsed to: some text
What's the best way to parse this text input?
Regex? (I don't have any experience with regex)
Some Python library?
Or should I write my own parser, "reading" the input and applying logic where needed?
The emphasis element of the language you describe looks like Markdown.
You should consider just using Markdown, as is. There is a Python module that parses it too.
The best way depends on exactly what your input "language" is. If it has the same sort of nested structures as HTML, you don't want to do it with regular expressions. (Obligatory link: RegEx match open tags except XHTML self-contained tags)
Are you inventing your own little markup language?
If you are: why? Why not use one of the already existing ones, such as Markdown or reST, for which parsers already exist?
If you aren't: why are you writing your own parser? Isn't there one already?
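For the flat, non-nested syntax described in the question, a regex-based sketch could look like the following. The output format for links (an <a> tag with href) is an assumption, since the question only shows the rendered result, and `render` and `TOKEN` are hypothetical names. The URL is escaped but not validated, so a real application would still need to check the scheme.

```python
import html
import re

# *bold text*  or  #label | url#  (no nesting, no line breaks inside a token)
TOKEN = re.compile(
    r"\*(?P<bold>[^*\n]+)\*|#(?P<label>[^#|\n]+)\|(?P<url>[^#\n]+)#"
)


def render(text: str) -> str:
    """Convert the custom markup to HTML, escaping everything else."""
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        # Plain text between tokens is escaped, never trusted.
        out.append(html.escape(text[pos:m.start()]))
        if m.group("bold") is not None:
            out.append(f"<strong>{html.escape(m.group('bold'))}</strong>")
        else:
            label = html.escape(m.group("label").strip())
            url = html.escape(m.group("url").strip(), quote=True)
            out.append(f'<a href="{url}">{label}</a>')
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


print(render("*some text*"))
print(render("#some text | to/url#"))
```

Matching on the raw input and escaping each piece separately avoids the markup characters colliding with entities such as &#x27; that escaping itself produces.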
You can have a look at some existing libraries for parsing wiki text:
http://remysharp.com/2008/04/01/wiki-to-html-using-javascript/
This one seems to work with the same format you've defined.
Headings: ! Heading1 text !! Heading2 text !!! Heading3 text
Bold: Bolded Text
Italic: Italicized Text
Underline: +Underlined Text+
http://randomactsofcoding.blogspot.co.uk/2009/08/parsewikijs-javascript-wiki-parsing.html
Or this one that has a really simple API and allows for checking if the given text is actually a wiki text.
UPDATED - Added python wiki parsers:
Having a look at a list of wiki parsers from here.
mediawiki-parser seems to be a good Python parser that generates HTML from wiki markup:
https://github.com/peter17/mediawiki-parser