what's wrong with my python re.sub


This is my code:
import re
string = '''
{% emoji 'MONEY_BAG' %}<span style="color:#7F6C41;">{{ item.name }}を入手した!</span></span>
'''
a = r'''
{%\s+mobile_url\s+['"]{1}(/inventory/view_item/\?)[^'"]*['"]{1}\s+([^%}]+)\s+%}
'''
def aa(x):
    print x.group(1)
    print x.group(2)
    return ''
string = re.sub(a, aa, string)
print string
It prints:
{% emoji 'MONEY_BAG' %}<span style="color:#7F6C41;">{{ item.name }}を入手した!</span></span>
I want to print x.group(1) and x.group(2).
What can I do?
Thanks.

It's a bad idea to use regex to extract information from HTML. It's much easier with an HTML parser: http://docs.python.org/library/htmlparser.html
Or if you want to crawl a webpage for more information, you might want to use scrapy which is a truly great web crawler framework.

Your extra newline characters in a are causing the regex to never match
a = r'''{%\s+mobile_url\s+['"]{1}(/inventory/view_item/\?)[^'"]*['"]{1}\s+([^%}]+)\s+%}'''
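To see the effect of the stray newlines, here is a minimal sketch. The sample input is hypothetical (the string in the question contains no mobile_url tag at all, so it would not match either way), and the names sample, collect and captured are made up for illustration.

```python
import re

# Hypothetical input; not taken from the question.
sample = "{% mobile_url '/inventory/view_item/?id=3' item.name %}"

# Pattern written across three lines: it starts and ends with a literal newline.
multiline = r'''
{%\s+mobile_url\s+['"](/inventory/view_item/\?)[^'"]*['"]\s+([^%}]+)\s+%}
'''
# Same pattern with the surrounding newlines removed.
single_line = multiline.strip()

captured = []

def collect(match):
    # Record both groups, then replace the whole tag with an empty string.
    captured.append((match.group(1), match.group(2)))
    return ''

unchanged = re.sub(multiline, collect, sample)   # newlines prevent any match
replaced = re.sub(single_line, collect, sample)  # matches once

print(unchanged == sample)
print(captured)
```

The callback only runs when the pattern matches, which is why nothing was printed in the question.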

Related

How to replace '\n' in string with '<br>' in Flask app?

I've been trying to modify a string before passing it to my HTML page in Flask (replacing occurrences of '\n' with '<br>'), but the typical methods I use aren't working for some reason.
finalstring = textstring.replace('\n', '<br>')
return render_template('my-form-result.html', emailresponse = finalstring)
This should work, but for some reason, nothing is replaced. How can I get this to work? Thanks!

A better way to replace \n in HTML is using CSS styles.
Your replace() is alright. Debug your code and make sure there is \n before replace().
To be able to view linebreaks in HTML you should use safe filter in the template. But beware that you become open to XSS attacks. To get round this problem you should escape the string before replacing the \n character. This is the code:
from flask import escape
...
...
safe_html = str(escape(text)).replace('\n', '<br/>')
return render_template('[HTML file].html', safe_html=safe_html)
---------
#in the template:
<span> {{ safe_html | safe }} </span>
If you don't call str() before replace(), the <br/> will be escaped too, because the return value of escape() is not a plain string.
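The escape-then-replace order can be sketched without Flask. This version uses the standard library's html.escape as a stand-in for Flask's escape (an assumption for illustration, not what the answer used); nl2br is a hypothetical helper name. The result would still need the safe filter in the template.

```python
from html import escape

def nl2br(text):
    """Escape HTML-special characters first, then turn newlines into <br/> tags."""
    # html.escape returns a plain str, so the <br/> inserted afterwards stays intact.
    return escape(text).replace('\n', '<br/>')

print(nl2br("<b>hi</b>\nthere"))
```

Escaping first means any markup in the user's text is neutralised, while the only real tags in the output are the ones added by the replace.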

Disclaimer: I never worked with Flask, I just looked it up and hope it does what you want to do.
So somewhere in your template my-form-result.html you should find a line containing:
{{ emailresponse }}
You can replace this with:
{% for line in emailresponse.split('\n') %}
{{ line }}
<br />
{% endfor %}
This adds a <br /> after every line.

Your replace() code is correct. Make sure the template does not escape the HTML again; mark the value as safe:
{{ emailresponse|safe }}
To diagnose, try this:
finalstring = textstring.replace('\n', '<br>')
print(finalstring)
return render_template('my-form-result.html', emailresponse = finalstring)
Also, show us the source code from the web page, to see what is actually rendering in the template

Two groups to generate html tag

I need help with regex.
https://regex101.com/r/r3pTh0/2
I kinda have it working with the following regex but still need help:
\<\%= image_tag image_url\(\“(.+?)\”\)|\“(.+?)\” %>
replacing it with:
<img src="\1" alt="\2">
python code:
originalData = re.sub('<%= image_tag image_url("(.+?)")+, :alt => "(.+?)\" %>', r'<img src="\1" alt="\2">', originalData, flags=re.MULTILINE)
But it does not seem to be replacing anything.
Have a string:
<%= image_tag image_url(“/blog/assets/images/2018-11-15/dribbble-developer-interview-jeffrey-chupp.png”), :alt => “Developer interview with Jeffrey Chupp, Director of Engineering at Dribbble” %>
replace it with html img tag:
<img src="/blog/assets/images/2018-11-15/dribbble-developer-interview-jeffrey-chupp.png" alt="Developer interview with Jeffrey Chupp, Director of Engineering at Dribbble">
would it also be hard to add http://somesite.com at the beginning for image link?

You are encountering this error because you selected the PHP flavor of regex on regex101.com.
Simply switch over to the Python flavor and the website fixes the regex for you. The command should be:
originalData = re.sub(r"\<\%= image_tag image_url\(\“(.+?)\”\)+\, :alt => “(.+?)\” %>", r'<img src="\1" alt="\2">', originalData, flags=re.MULTILINE)
As for adding a domain prefix to the image src, simply replace src=" with src="https://somesite.com in a second pass.
Or you could add the domain directly in your replacement string, like this: '<img src="https://somesite.com\1" alt="\2">'
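Putting both parts together, a minimal sketch might look like this. The sample tag is shortened and hypothetical, and it assumes the source text really uses the curly quotes “ ” shown in the question.

```python
import re

# Shortened, hypothetical sample; note the curly quotes, as in the question.
source = '<%= image_tag image_url(“/blog/assets/images/pic.png”), :alt => “Developer interview” %>'

# Group 1 is the image path, group 2 is the alt text.
pattern = re.compile(r'<%= image_tag image_url\(“(.+?)”\), :alt => “(.+?)” %>')

# The domain is prepended to the src only; the alt text is left untouched.
img_tag = pattern.sub(r'<img src="https://somesite.com\1" alt="\2">', source)

print(img_tag)
```

If the real files contain straight quotes instead, the pattern would need " in place of “ and ”.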

Regular expression pattern for content within HTML tags

I have coded simple Python script that connects to specific website and gets all the links
there.
import urllib2
import re
request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
match = re.findall(r'.+', content)
if match:
    for i in match:
        print i + "\n"
else:
    print 'Not Found!'
Result:
<a href="/video/3878"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3878.jpg" alt=
"avatar" /></a>
NodeZero Linux Review
<a href="/video/3877"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3877.jpg" alt=
"avatar" /></a>
Post Attack Uploading Shell in Real Time
<a href="/video/3867"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3867.jpg" alt=
"avatar" /></a>
Using SQLMAP in Real Time (SQLinjection WEB)
<a href="/video/3866"><img class="corner iradius20 ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3866.jpg" alt=
"avatar" /></a>
....
...
...
I am trying to get those links with the understandable title, such as Using SQLMAP in Real Time (SQLinjection WEB).
My pattern is: .+

If you really want to use regexes instead of a proper parser, you can match groups and access them later on.
See http://docs.python.org/library/re.html
(...)
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed
Try:
request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)
this outputs:
link /video/3822 -> SecurityTube SpeakUp: Cloud Computing
link /video/3587 ->
link /video/3587 -> Securitytube Speak Up: Antivirus Evasion attacks
link /video/3489 ->
link /video/3489 -> SecurityTube SpeakUp: ThePirateBay LOSS
link /video/3375 ->
link /video/3375 -> SecurityTube SpeakUp: .COM and .NET Domain Seizures
link /video/3130 ->
link /video/3130 -> SecurityTube Speak Up: The MS12-020 Fiasco!
...etc
You can of course filter the links so that only links with a matched title are considered.
You will want to discard links starting with #, too. As you can see, a proper parser will give you better results.
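For comparison, here is a rough sketch of the parser route using only the standard library's html.parser (Python 3). LinkCollector and extract_links are hypothetical names, and the sample markup is made up to resemble the page in the question.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, title) pairs for <a> elements that contain text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            # Skip image-only links and in-page anchors such as "#top".
            if title and not self._href.startswith("#"):
                self.links.append((self._href, title))
            self._href = None

def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

sample = ('<a href="/video/3878"><img src="x.jpg" alt="avatar" /></a>'
          '<a href="/video/3878">NodeZero Linux Review</a>'
          '<a href="#top">Top</a>')
print(extract_links(sample))
```

The filtering that the regex version leaves as an exercise (empty titles, # links) becomes two short conditions here.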

Never ever parse html with a regex. ;-)
But to help you improve your regex-fu for future non-HTML work, there are two places where your regex is failing:
.\w+.\d+ (this won't match the / in /video/3877). Try "[^"]+" instead.
.+ is going to match as many characters as possible. Try matching as few as possible: .+?
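The difference between the greedy .+ and the lazy .+? is easy to see on a small string. This is only an illustration on made-up input, not the pattern from the question.

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .+ runs to the end of the string, then backs up to the LAST </b>.
greedy = re.findall(r'<b>(.+)</b>', text)

# Lazy: .+? stops at the FIRST </b> it can, so each element is matched separately.
lazy = re.findall(r'<b>(.+?)</b>', text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```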

How can i remove <p> </p> with python sub

I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.

Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!

I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?

I think for this particular problem a parsing module would be overkill
Simply use str.replace:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '

What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>

using regexp ?
import re
result = re.sub("<p>\s*</p>"," ", mystring, flags=re.MULTILINE)
compile the regexp if you use it often.
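A compiled version might look like the sketch below. EMPTY_P and strip_empty_paragraphs are hypothetical names, and re.IGNORECASE is an addition of mine so that <P></P> is caught too; it still does not handle attributes or the other variations mentioned in the pyparsing answer.

```python
import re

# Compiled once at import time and reused for every call.
EMPTY_P = re.compile(r"<p>\s*</p>", re.IGNORECASE)

def strip_empty_paragraphs(text):
    """Replace each empty (or whitespace-only) <p></p> pair with a single space."""
    return EMPTY_P.sub(" ", text)

print(repr(strip_empty_paragraphs("This <p></p><p>is a test</p><p> </p><P></P>")))
```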

I wrote that code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print etree.tostring(document.root)

How do I perform HTML decoding/encoding using Python/Django?

I have a string that is HTML encoded:
'''&lt;img class=&quot;size-medium wp-image-113&quot;\
style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''
I want to change that to:
<img class="size-medium wp-image-113" style="margin-left: 15px;"
title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"
alt="" width="300" height="194" />
I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.
The string is stored like that because I am using a web-scraping tool called BeautifulSoup, it "scans" a web-page and gets certain content from it, then returns the string in that format.
I've found how to do this in C# but not in Python. Can someone help me out?
Related
Convert XML/HTML Entities into Unicode String in Python

With the standard library:
HTML Escape
try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x
print(escape("<"))
HTML Unescape
try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape
print(unescape("&gt;"))

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;')
    )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s
unescaped = html_decode(my_string)
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.0 to 3.8 (HTMLParser.unescape was removed in Python 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# >= Python 3.4:
from html import unescape
unescaped = unescape(my_string)
As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}

For html encoding, there's cgi.escape from the standard library:
>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
Replace special characters "&", "<" and ">" to HTML-safe sequences.
If the optional flag quote is true, the quotation mark character (")
is also translated.
For html decoding, I use the following:
import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39
def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
                  lambda m: unichr(name2codepoint[m.group(1)]), s)
For anything more complicated, I use BeautifulSoup.

Use daniel's solution if the set of encoded characters is relatively restricted.
Otherwise, use one of the numerous HTML-parsing libraries.
I like BeautifulSoup because it can handle malformed XML/HTML :
http://www.crummy.com/software/BeautifulSoup/
for your question, there's an example in their documentation
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!",
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

In Python 3.4+:
import html
html.unescape(your_string)
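A short round trip with the standard library html module (Python 3.4+), using a shortened, hypothetical version of the string from the question:

```python
import html

# Shortened stand-in for the entity-encoded string in the question.
encoded = '&lt;img class=&quot;size-medium&quot; alt=&quot;&quot; /&gt;'

decoded = html.unescape(encoded)    # entities -> real characters
re_encoded = html.escape(decoded)   # and back again; quote=True is the default

print(decoded)
print(re_encoded == encoded)
```

Note that html.escape writes a single quote as &#x27;, so the round trip is only exact for strings like this one that contain no single quotes.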

See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.

Daniel's comment as an answer:
"escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"

If anyone is looking for a simple way to do this via the django templates, you can always use filters like this:
<html>
{{ node.description|safe }}
</html>
I had some data coming from a vendor and everything I posted had html tags actually written on the rendered page as if you were looking at the source.

I found a fine function at: http://snippets.dzone.com/posts/show/4569
def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")
    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                return match.group()
    return entity_re.subn(substitute_entity, string)[0]

Even though this is a really old question, this may work.
Django 1.5.5
In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

I found this in the Cheetah source code (here)
htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s
I'm not sure why they reverse the list;
I think it has to do with the way they encode, so in your case it may not need to be reversed.
Also, if I were you I would change htmlCodes to be a list of tuples rather than a list of lists...
This is going in my library though :)
I noticed your title asked for encoding too, so here is Cheetah's encode function.
def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
    display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s
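On the question of why the list is reversed for decoding: a likely reason is that &amp; has to be decoded last, otherwise text that was encoded once gets decoded twice. The sketch below illustrates this with hypothetical names (CODES, decode); it is not the Cheetah code itself.

```python
CODES = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'), ('"', '&quot;')]
CODES_REVERSED = list(reversed(CODES))

def decode(s, codes):
    """Replace each entity with its plain character, in the order given."""
    for plain, entity in codes:
        s = s.replace(entity, plain)
    return s

# '&amp;lt;' is the encoded form of the literal text '&lt;'.
encoded = '&amp;lt;'

wrong = decode(encoded, CODES)           # '&amp;' first -> '&lt;' -> '<'
right = decode(encoded, CODES_REVERSED)  # '&amp;' last  -> '&lt;'

print(wrong)
print(right)
```

Encoding has the mirror-image requirement: & must be encoded first, which is why the same list is used in forward order by htmlEncode.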

You can also use django.utils.html.escape
from django.utils.html import escape
something_nice = escape(request.POST['something_naughty'])

Below is a python function that uses module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete and it assumes that all entities decode to one codepoint which is wrong for entities like &NotEqualTilde;:
http://www.w3.org/TR/html5/named-character-references.html
NotEqualTilde; U+02242 U+00338 ≂̸
With those caveats though, here's the code.
import re
import htmlentitydefs
def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)
def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # e.g. '&#xA;' -> unichr(10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # e.g. '&#16;' -> unichr(0x10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"

This is the easiest solution for this problem -
{% autoescape off %}
{{ body }}
{% endautoescape %}
From this page.

While searching for the simplest solution to this question in Django and Python, I found that you can use their built-in functions to escape and unescape HTML code.
Example
I saved your html code in scraped_html and clean_html:
scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)
Django
You need Django >= 1.0
unescape
To unescape your scraped html code you can use django.utils.text.unescape_entities which:
Convert all named and numeric character references to the corresponding unicode characters.
>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True
escape
To escape your clean html code you can use django.utils.html.escape which:
Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.
>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True
Python
You need Python >= 3.4
unescape
To unescape your scraped html code you can use html.unescape which:
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding unicode characters.
>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True
escape
To escape your clean html code you can use html.escape which:
Convert the characters &, < and > in string s to HTML-safe sequences.
>>> from html import escape
>>> scraped_html == escape(clean_html)
True
