New Line Character and PyQt


I am using QtGui.QMessageBox to display messages, warnings and errors.
It seems that QMessageBox ignores the "\n" newline character when the text also contains HTML tags. For example, this message
message = "<a href = http://www.google.com> GOOGLE</a> This a line number one.\n This a line number two. \n And this is a line number three."
is displayed as one long line in the QMessageBox.
Thanks in advance!

The behaviour you are seeing is entirely as expected. Once the text contains HTML tags, it is rendered as HTML, and it is part of the HTML 4 spec that, other than inside PRE tags, sequences of whitespace characters (newlines included) should be collapsed to a single space. To quote the relevant part of the spec:
Note that a sequence of white spaces between words in the source
document may result in an entirely different rendered inter-word
spacing (except in the case of the PRE element). In particular, user
agents should collapse input white space sequences when producing
output inter-word space.
So, when you need to insert line-breaks, do it explicitly using the <br> tag.
PS:
It's also worth noting here that Qt's text widgets only support a limited set of HTML tags, attributes and CSS properties. For full details, see the Supported HTML Subset in the Qt docs.
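A minimal sketch of that advice, assuming the message is built as a plain Python string before it is handed to the message box. The helper name to_rich_text is hypothetical and not part of PyQt:

```python
def to_rich_text(message):
    """Replace newline characters with explicit <br> tags for rich-text display."""
    # Normalise Windows line endings first so "\r\n" yields a single <br>.
    return message.replace("\r\n", "\n").replace("\n", "<br>")


message = (
    '<a href="http://www.google.com">GOOGLE</a> This is line number one.\n'
    "This is line number two.\n"
    "And this is line number three."
)
html_message = to_rich_text(message)
```

The resulting string can then be passed as the text of the message box, for example QtGui.QMessageBox.information(parent, title, html_message), where parent and title are whatever your application already uses.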

Related

markdown2 not adding <pre> to code snippets

Usually code snippets wrap the code tag with a pre tag. Looks like markdown is just using a p tag, is this normal?
from markdown2 import Markdown
markdowner = Markdown()
markdowner.convert("```\nthis is code\n```")
u'<p><code>\nthis is code\n</code></p>\n'
Even this website's adding pre tags. How do I add it to markdown?

is this normal?
Yes, fenced code blocks are not standard Markdown (only indented code blocks are). However, inline code spans can be delimited with any number of backticks (as long as the opening and closing delimiters match). Therefore, the parser is correctly parsing your input as an inline code span, which consists of a code tag inside a p tag. Of course, if you had inserted any blank lines, then the output would have been multiple paragraphs without any code spans (as the opening and closing delimiters would have been in separate paragraphs).
How do I add it to markdown?
As fenced code blocks are non-standard Markdown, they generally need to be enabled in parsers which support them. Each parser is different, so users should consult the documentation for their parser of choice. The other answer already covers how to enable them in the specific parser used by the OP.

Turns out markdown2 only adds pre's to what is indented by four spaces.
To get pre tags for the fenced example above, enable the fenced-code-blocks extra:
markdown2.markdown(text, extras=["fenced-code-blocks"])

Detecting newline character on user's input (web2py)

I have the following table:
db.define_table('comm',
Field('post','reference post', readable=False, writable=False),
Field('body','text', requires=IS_NOT_EMPTY()),
auth.signature
)
and in a python function, the following code:
form=SQLFORM(db.comm).process()
I render that form in the view returned by the Python function:
{{=form}}
The problem is when the user inputs two or more paragraphs, it doesn't detect the newline character. How can I fix that?

Use a pre tag to display the content whose newline characters you want to preserve.
<pre>
The HTML pre element (or HTML Preformatted Text) represents
preformatted text. Text within this element is typically displayed in
a non-proportional ("monospace") font exactly as it is laid out in the
file. Whitespace inside this element is displayed as typed.
{{for post in comments:}}
<pre>{{=post.body}}</pre>
{{pass}}

Assuming you are referring to the subsequent display of the user input in a view, you could use a <pre> tag: http://www.w3schools.com/tags/tag_pre.asp. However, you might need some CSS to get the font/styling you like (by default, the browser will use alternative styling with a fixed-width font).
You could also replace the newlines with <br> tags:
{{=XML(record.body.replace('\n', '<br>'), sanitize=True, permitted_tags=['br/'])}}
Because the text now contains <br> HTML tags, it is necessary to wrap it in XML() to prevent web2py from escaping the HTML -- but you also want to sanitize the text and allow only <br> tags to prevent malicious code from being executed.
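The same idea can be sketched outside of web2py with the standard library only. This is an illustration of the escape-first, then insert-<br> order, not web2py's own API; the function name newlines_to_br is hypothetical:

```python
import html


def newlines_to_br(text):
    """Escape user text, then turn newlines into <br> tags.

    Escaping first means the only real tags in the result are the <br> tags
    inserted here; anything the user typed is shown literally.
    """
    escaped = html.escape(text)
    return escaped.replace("\r\n", "\n").replace("\n", "<br>")


print(newlines_to_br("first paragraph\n<script>alert(1)</script>"))
```

Doing the replacement before escaping would turn the inserted <br> tags into visible text, so the order matters.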

Django security. dealing with user input . Is html.strip_tags enough or should I use bleach?

I'm accepting user input on a small forum I have. This is what I do with user's input:
First, call "html.strip_tags" from django.utils.html on user's cleaned_data[input].
Save it to the database (PostgreSQL).
Query the text and use a regex to replace \n with br and display spaces entered by users.
Then, I do {{text|safe}} to display the text (if I don't mark it as safe, it won't display spaces between paragraphs but br tags).
Finally I use some jquery plugins on the text: Autolinker.js to detect and "urlize" hyperlinks and trunk8 to control its length.
So, because I do {{text|safe}} I am worried about malicious input, is html.strip_tags enough?
The documentation about strip_tags writes:
"Tries to remove anything that looks like an HTML tag from the string, that is anything contained within <>. Absolutely NO guaranty is provided about the resulting string being entirely HTML safe. So NEVER mark safe the result of a strip_tag call without escaping it first, for example with escape()."
The documentation about Python's Bleach:
"The primary goal of Bleach is to sanitize user input that is allowed to contain some HTML as markup and is to be included in the content of a larger page."
Because the user input is not allowed to contain any html, my guess is that Bleach is not needed.. but I am kind of noob so your suggestions will be appreciated.

Quoting the docs on striptags
No safety guarantee
Note that striptags doesn’t give any guarantee about its output being entirely HTML safe, particularly with non valid
HTML input. So NEVER apply the safe filter to a striptags output. If
you are looking for something more robust, you can use the bleach
Python library, notably its clean method.
I think the answer here is to use bleach to strip the tags, as easy as bleach.clean(text, tags=[], strip=True). Without strip=True, bleach escapes disallowed tags instead of removing them, which is also safe to display. Plus, with bleach.linkify you can take care of the URLs as well.
Regarding your general process: if the string is generated once and queried multiple times, why aren't you adding the line breaks and URLs when saving?

If the only reason you need to mark the input as "safe" is so that it will display your <br> tags that you inserted where users typed line breaks, then your best approach is to use the linebreaks filter. From the Django documentation:
linebreaks
Replaces line breaks in plain text with appropriate HTML; a single newline becomes an HTML line break (<br />) and a new line followed by a blank line becomes a paragraph break (</p>).
For example:
{{ value|linebreaks }}
If value is Joel\nis a slug, the output will be <p>Joel<br />is a slug</p>.
Instead of using a regex to replace newlines with <br>s in your database, just leave the data in there as the user entered it. Then, you can display it in a template with
{{ text|striptags|linebreaks }}
This will first remove (most) HTML tags from your user's input, then add in <br> and <p> tags for newlines. It does not mark the string as safe, though, so any tags left in the user's input will be escaped; only the tags created by linebreaks will have any effect.
(Note that if you don't want <p> tags, you can use the variant filter linebreaksbr).
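To make the behaviour of linebreaks concrete, here is a rough standard-library sketch of what such a filter does when autoescaping is on. It is an approximation for illustration, not Django's actual implementation, and the name linebreaks_sketch is hypothetical. In a real template you would simply use the built-in filter.

```python
import html
import re


def linebreaks_sketch(value):
    """Approximate a linebreaks-style filter: escape, then add <p> and <br />."""
    # Normalise line endings so "\r\n" and "\r" behave like "\n".
    text = value.replace("\r\n", "\n").replace("\r", "\n")
    # Two or more consecutive newlines separate paragraphs.
    paragraphs = re.split(r"\n{2,}", text)
    return "\n\n".join(
        "<p>%s</p>" % html.escape(p).replace("\n", "<br />") for p in paragraphs
    )


print(linebreaks_sketch("Joel\nis a slug"))
```

Because the user's text is escaped before the tags are added, any markup typed by the user is displayed as text rather than interpreted.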

Shouldnt the tostring in a python xml element from an elementtree return a proper text?

I am using tostring in order to transform a xml element tagged "p" into a string.
result=lxml.html.tostring(child, method="text", encoding='utf8') #child is the given element
In the browser it renders properly as a single line: http://jsbin.com/AnoYePA/1/edit
However, the result string I get from this operation consists of several lines with one word each.
So the question is, shouldn't the "result" string be a single line, the same as it renders in web browsers?
The element I apply this operation is attached in the pastebin.

No, it shouldn't.
There are newlines in the text of the node. You're asking lxml to extract the text of the node, which includes that whitespace.
A web browser renders any run of whitespace as a single space, so those newlines aren't visible in the output. But that's a feature of how HTML is rendered, not of the text. The fact that lxml doesn't reproduce that rendering is no more "wrong" than the fact that the text doesn't have the same fonts, boldfacing, etc. as it does in your browser.
If you want to reproduce HTML's whitespace compression, you can do that pretty easily, e.g., re.sub(r'\s+', ' ', s). Note the +: without it, each whitespace character is replaced individually and runs are not collapsed.
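A small sketch of that whitespace collapsing, assuming the extracted text is already a str. With encoding='utf8', lxml returns bytes, so decode first or ask for encoding='unicode'. The helper name collapse_whitespace is hypothetical:

```python
import re


def collapse_whitespace(s):
    """Collapse every run of whitespace into one space, as a browser would."""
    # \s+ matches one or more spaces, tabs or newlines in a row.
    return re.sub(r"\s+", " ", s).strip()


text = "several\n   words\n\n on   separate\nlines\n"
print(collapse_whitespace(text))
```

The final strip() also removes leading and trailing whitespace, which a browser would not display at the edges of a paragraph either.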

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Currently, if I create a page object for a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need the text to have spaces between the lines because I'm doing text mining on this PDF text, and not having spaces between lines kills it....

This is a common problem with PDF parsing. You can also expect trailing dashes (words hyphenated across lines) that you will have to fix in some cases. I came up with a workaround for one of my projects, which I will briefly describe here:
I used pdfminer to extract XML from PDF and also found concatenated words in the XML. I extracted the same PDF as HTML and the HTML can be described by lines of the following regex:
<span style="position:absolute; writing-mode:lr-tb; left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>
The spans are positioned absolutely and have a top-style that you can use to determine if a line break happened. If a line break happened and the last word on the last line does not have a trailing dash you can separate the last word on the last line and the first word on the current line. It can be tricky in the details, but you might be able to fix almost all text parsing errors.
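The span-based approach above can be sketched as follows. This is a simplified illustration, assuming the HTML matches the regex exactly and that a change in the top value means a new line. Spans that share a top value are also joined with a space, which is an assumption. The names join_spans and make_span are hypothetical:

```python
import re

SPAN_RE = re.compile(
    r'<span style="position:absolute; writing-mode:lr-tb; '
    r'left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>'
)


def join_spans(html_text):
    """Rebuild running text from positioned spans, spacing out line breaks."""
    out = ""
    last_top = None
    for match in SPAN_RE.finditer(html_text):
        top = int(match.group(1))
        text = match.group(2).strip()
        if not text:
            continue
        if last_top is None:
            out = text
        elif top != last_top and out.endswith("-"):
            # Line break after a trailing dash: rejoin the hyphenated word.
            out = out[:-1] + text
        else:
            # Otherwise separate the pieces with a single space.
            out += " " + text
        last_top = top
    return out


def make_span(top, text):
    """Build a sample span in the assumed format (for demonstration only)."""
    return (
        '<span style="position:absolute; writing-mode:lr-tb; '
        'left:10px; top:%dpx; font-size:12px;">%s</span>' % (top, text)
    )


sample = make_span(100, "hello") + "\n" + make_span(120, "world")
print(join_spans(sample))
```

Real documents will need more care, for example genuine hyphens at line ends and multi-column layouts.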
Additionally, you might want to run a dictionary library like enchant over your text and find errors. If the fix suggested by the dictionary is the same as the error word but with a space somewhere, the error word is likely a parsing error and can be fixed with the dictionary's suggestion.
Parsing PDF sucks and if you find a better source, use it.
