Add HTML linebreaks with Python regex - python

I need to add HTML linebreaks (<br />) to a string at all line endings which are not followed by a blank line. This simple pattern works:
body = re.sub(r'(.)\n(.)', r'\1<br />\2', body)
But I realized it will not work for an edge case where a line contains only a single character (because the character would have to be part of two different overlapping matches). So I tried the following pattern with lookaround subpatterns:
body = re.sub(r'(?<=.)\n(?=.)', r'<br />', body)
This works as intended, except that the HTML tag is added after the linebreak (\n), and with an additional linebreak:
linebreak
<br/>
!
<br/>
linebreak
<br/>
l
<br/>
works
I would expect that the matched linebreak is substituted by the HTML tag (thereby effectively removing the linebreaks from all matching areas) – why does the tag appear on a new line instead (i.e. increasing the number of linebreaks/lines)?
The equivalent pattern in vim does remove the linebreaks:
s:\(.\)\zs\n\ze\(.\):\<br \/\>:ge

This is quite embarrassing: the pattern and my script do indeed work as intended. I was fooled by an HTML source viewer that adds linebreaks to the source code it should display unaltered. Sorry for taking your valuable time.
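The difference between the two patterns from the question can be checked with a short script. This is a minimal sketch: the sample text is made up for illustration, and the variable names consuming and lookaround are not from the original post.

```python
import re

# Hypothetical sample: contains single-character lines and one blank line.
body = "linebreak\n!\nlinebreak\nl\nworks\n\nnext paragraph"

# Capturing version: each match consumes the characters on both sides of
# the newline, so a one-character line cannot take part in two matches.
consuming = re.sub(r'(.)\n(.)', r'\1<br />\2', body)

# Lookaround version: only the newline itself is consumed, so adjacent
# matches no longer compete for the same character.
lookaround = re.sub(r'(?<=.)\n(?=.)', '<br />', body)

print(repr(consuming))
print(repr(lookaround))
```

In both cases the newlines around the blank line are left alone, because . does not match a newline by default.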

Related

Detecting newline character on user's input (web2py)

I have the following table:
db.define_table('comm',
Field('post','reference post', readable=False, writable=False),
Field('body','text', requires=IS_NOT_EMPTY()),
auth.signature
)
and in a python function, the following code:
form=SQLFORM(db.comm).process()
I call that form in the returned view by the python function
{{=form}}
The problem is when the user inputs two or more paragraphs, it doesn't detect the newline character. How can I fix that?
Use a <pre> tag to display the content whose newline characters you want preserved.
<pre>
The HTML pre element (or HTML Preformatted Text) represents
preformatted text. Text within this element is typically displayed in
a non-proportional ("monospace") font exactly as it is laid out in the
file. Whitespace inside this element is displayed as typed.
{{for post in comments:}}
<pre>{{=post.body}}</pre>
{{pass}}
Assuming you are referring to the subsequent display of the user input in a view, you could use a <pre> tag: http://www.w3schools.com/tags/tag_pre.asp. However, you might need some CSS to get the font/styling you like (by default, the browser will use alternative styling with a fixed-width font).
You could also replace the newlines with <br> tags:
{{=XML(record.body.replace('\n', '<br>'), sanitize=True, permitted_tags=['br/'])}}
Because the text now contains <br> HTML tags, it is necessary to wrap it in XML() to prevent web2py from escaping the HTML -- but you also want to sanitize the text and allow only <br> tags to prevent malicious code from being executed.

Error in HTML escaping with Jinja

I have the following regex that searches through text and prepends and appends HTML 'a' tags for the matched substring. It successfully does everything I want except when the HTML is escaped by using the 'safe' filter by Jinja. The regex is below:
re.sub('(^#\w*|(?<=\s)#\w*)',
r'\1',
'here is some #text with a #hashtag')
The above should come out here is some #text with a #hashtag
where '#text' and '#hashtag' are clickable links. However by using Jinja's 'safe' filter it comes out
"here is some "#text" with a "#hashtag
There are a few things to note:
Unmatched substrings are being wrapped in quotations
The html links should come out #hashtag<a> not <a href="{{ url_for(\'main.tag\', tagname=tag) }}">#hashtag
I'm confident it has to do with the string that is being processed by Jinja. I am not confident with how I am escaping specific characters in the string and passing it to Jinja to process.
Am I escaping the characters wrong? Thoughts? Thank you in advance.

Process whitespaces from user input

The only reason why I include Python in the question is that PHP has the nl2br function that inserts br tags, a similar function in Python could be useful, but I suspect that this problem can be solved with HTML and CSS.
So, I have a form that receives the user's input in a textarea. I save it to the database (Postgres), and when I display it, it doesn't include the line breaks the user supplied to separate paragraphs.
I tried using the white-space CSS property on the paragraph tag:
white-space: pre
or
white-space: pre-wrap
But, this is weird, the result was separated lines but the first line aligned in the middle:
including text-align:left didn't solve the problem. I'm sure there is a simple solution to this.
I would suggest replacing newline characters with <br /> either before storing the text in the database (only once) or when fetching it (see comment).
With Python:
import re
myUserInput = re.sub('(?:\r\n|\r|\n)', '<br />', myUserInput)
With JavaScript (see jsfiddle):
myUserInput = myUserInput.replace(/(?:\r\n|\r|\n)/g, '<br />');
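A Python counterpart to PHP's nl2br could be sketched along these lines. The helper name nl2br is hypothetical, and the html.escape step is an addition that the answer above does not include; it assumes the input is untrusted text that should not be able to inject markup.

```python
import html
import re

# Matches Windows (\r\n), old Mac (\r), and Unix (\n) line endings.
_NEWLINE = re.compile(r'\r\n|\r|\n')

def nl2br(text):
    """Escape text for HTML, then turn each line ending into <br />."""
    escaped = html.escape(text)  # neutralize <, >, &, and quotes first
    return _NEWLINE.sub('<br />', escaped)

print(nl2br("first line\r\nsecond <b>line</b>\nthird"))
```

Escaping before inserting the tags matters: doing it the other way round would escape the <br /> tags as well.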

New Line Character and PyQt

Using QtGui.QMessageBox to display the messages, warnings and errors.
It seems that QMessageBox doesn't want to work with "\n" new line character when used with html tags
message = "<a href = http://www.google.com> GOOGLE</a> This a line number one.\n This a line number two. \n And this is a line number three."
is all being displayed as one long line when displayed within QMessageBox.
Thanks in advance!
The behaviour you are seeing is entirely as expected. It is part of the HTML 4 spec that, other than inside PRE tags, sequences of whitespace characters should always be collapsed to a single space. To quote the relevant part of the spec:
Note that a sequence of white spaces between words in the source
document may result in an entirely different rendered inter-word
spacing (except in the case of the PRE element). In particular, user
agents should collapse input white space sequences when producing
output inter-word space.
So, when you need to insert line-breaks, do it explicitly using the <br> tag.
PS:
It's also worth noting here that Qt's text widgets only support a limited set of HTML tags, attributes and CSS properties. For full details, see the Supported HTML Subset in the Qt docs.
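As a rough illustration of inserting the breaks explicitly, the message text can be converted before it is handed to the message box. The helper name newlines_to_br is hypothetical, and Qt itself is left out so the snippet runs without it.

```python
def newlines_to_br(message):
    """Replace each newline with an explicit <br> for rich-text display."""
    return message.replace("\n", "<br>")

message = ("<a href='http://www.google.com'>GOOGLE</a> This is line number one.\n"
           "This is line number two.\n"
           "And this is line number three.")

rich_message = newlines_to_br(message)
print(rich_message)
# rich_message can then be used as the text of the QMessageBox.
```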

Regular expression for hexadecimal string in python not working

I have a regular expression to match strings like:
--D2CBA65440D
--77094A27E09
--77094A27E
--770
--77094A27E09--
Basically, it matches a hexadecimal string surrounded by one or more line breaks or whitespace characters, with the prefix -- and an optional -- suffix.
I use the following Python code, and it works fine most of the time:
hexaPattern = "\s--[0-9a-fA-F]+[--]?\s"
hex = re.search(hexaPattern, part)
if hex:
    print("found a match")
This works for all of the above, but it doesn't match --77094A27E09 in this block:
<div id="arrow2" class="headerLinksImg" style="display:block
--77094A27E09
;">
but matches the same string in:
<input type="checkbox" name="checkbox" id="checkboxKG3" class
--77094A27E09
Content-T="checkboxKG" value="KG3" />
What am I doing wrong?
import re
hexaPattern = re.compile(r'\s--([0-9a-fA-F]+)(?:--)?\s')
m = hexaPattern.search(part)
if m:
    print("found a match: " + m.group(1))
This pre-compiles the pattern for speed. It uses an r'' (raw string) so the backslashes are sure to be passed through correctly. It adds parentheses to make a capturing group so you can extract your hex string after the match; it also adds a non-capturing group around the second -- string.
Because you used square brackets around the second "--", you got a "character class". In a character class, a '-' is usually used for a range, as in [a-z], but [--] contains no valid range, so it simply matches a single '-' character. The problem is: because you have the ? after it, it matches only zero or one '-' character, and you need it to be able to match two.
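The behaviour of the character class can be checked with a few calls to re.fullmatch. This is only a small sketch; the variable names are chosen here for illustration.

```python
import re

# [--] is a character class holding only '-', so it matches one hyphen.
single = re.fullmatch(r'[--]', '-')

# With ?, the class matches at most one hyphen, so "--" does not fit.
class_two = re.fullmatch(r'[--]?', '--')

# An optional non-capturing group matches the literal two-hyphen suffix.
group_two = re.fullmatch(r'(?:--)?', '--')

print(single is not None, class_two is not None, group_two is not None)
# prints: True False True
```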
Try this:
hexaPattern = r"^--[0-9a-fA-F]+(--)?\s"
The fixes I inserted are:
r at the beginning, so that the backslashes won't be "eaten" by the quotation marks
^ at the beginning to match the start of the string
then -- in parentheses instead of square brackets (the brackets seem like a mistake)
Others have pointed out problems with your regex, namely the [--] which basically finds one single hyphen in an unconventional way ... either way, not what you want anyway.
I would also suggest that having \s at both the beginning and end of the regex will also cause problems under certain circumstances, because it matches spaces, tabs, and newlines. So you could end up with a case where your file has --77094A27E09\n--D2CBA65440D and the second --D2CBA65440D won't match because the newline was consumed by \s at the end of the previous match.
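The effect of the consumed newline can be reproduced with two adjacent markers. The sketch below uses made-up input; the lookahead variant is shown only for comparison and is an assumption of this example, not something proposed in the answer.

```python
import re

# Two markers on consecutive lines, separated by a single newline.
text = "\n--77094A27E09\n--D2CBA65440D\n"

# \s on both sides: the first match consumes the newline that the
# second marker would need as its leading \s.
consuming = re.findall(r'\s--([0-9a-fA-F]+)(?:--)?\s', text)

# A lookahead checks for trailing whitespace without consuming it.
lookahead = re.findall(r'\s--([0-9a-fA-F]+)(?:--)?(?=\s)', text)

print(consuming)  # ['77094A27E09']
print(lookahead)  # ['77094A27E09', 'D2CBA65440D']
```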
Also, you seem to be checking each line in the file individually, which you don't really need to do. You can use re.findall to get all the matches in one fell swoop.
And finally, the -- at the beginning of the string seems to be your real marker, not the \s at the beginning or end. So why not just use --([0-9a-fA-F]+)(?:--)? with a group around the hex number? findall returns only the groups, which is what you want. Then you can do this (read the whole HTML file into one string, and check for all matches):
text = """
<input type="checkbox" name="checkbox" id="checkboxKG3" class
--D2CBA65440D
<a> --77094A27E09-- </a>
hello world --77094A27E
--770--
--77094A27E09
Content-T="checkboxKG" value="KG3" />
"""
import re
hexapattern = r'--([0-9a-fA-F]+)(?:--)?'
print(re.findall(hexapattern, text))
# prints: ['D2CBA65440D', '77094A27E09', '77094A27E', '770', '77094A27E09']
Which I think is what you want.
I used the following:
pattern = re.compile(r'(\n--)([0-9A-F]+)(--)?', re.I | re.S | re.M)
and it worked fine. Thanks to all your contributions.
