I heard about the |safe filter, but if I understood correctly, it's unsafe and opens a backdoor for injections.
What are the alternatives for displaying full posts with formatted text?
I think when you don't use the |safe filter, the output is returned as plain text (the HTML markup is shown, not rendered as HTML).
But if you need to exclude some dangerous tags such as <script>location.reload()</script>, you need to handle it with a custom template tag filter.
I got a good answer from https://stackoverflow.com/a/699483/6396981, via BeautifulSoup.
from bs4 import BeautifulSoup
from django import template
from django.utils.html import escape

register = template.Library()

INVALID_TAGS = ['script']

def clean_html(value):
    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name in INVALID_TAGS:
            # tag.hidden = True  # you can also use this.
            tag.replaceWith(escape(tag))
    return soup.renderContents()

# clean_html('<h1>This is heading</h1> and this one is xss injection <script>location.reload()</script>')
# output:
# <html><body><h1>This is heading</h1> and this one is xss injection <script>location.reload()</script></body></html>

@register.filter
def safe_exclude(text):
    # e.g. {{ post.description|safe_exclude|safe }}
    return clean_html(text)
Hope it's useful.
I have a list of image urls contained in 'images'. I am trying to isolate the title from these image urls so that I can display, on the html, the image (using the whole url) and the corresponding title.
So far I have this:
titles = [image[149:199].strip() for image in images]
This gives me the stripped title in the following format (I provide two examples to show the pattern)
le_Art_Project.jpg/220px-
Rembrandt_van_Rijn_-Self-Portrait-_Google_Art_Project.jpg
and
cene_of_the_Prodigal_Son_-Google_Art_Project.jpg/220px-Rembrandt-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg
The fragments shown above are the bits I would like to remove: from the start, everything before 220px, and from the end: _-_Google_Art_Project.jpg.
A newbie to Python, I am struggling with the syntax; and since I am doing this inside a loop over the images list, the string manipulation is not straightforward and I am unsure how to approach it.
The whole code for reference is below:
webscraper.py:
@app.route('/')  # this is what we type into our browser to go to pages; we create these using routes
@app.route('/home')
def home():
    images = imagescrape()
    titles = [image[99:247].strip() for image in images]
    images_titles = zip(images, titles)
    return render_template('home.html', images=images, images_titles=images_titles)
What I've tried / am trying:
x = txt.strip("_-_Google_Art_Project.jpg")
I am looking into strip to get rid of the last part of the unwanted string.
I am unsure how to combine this with removing the leading string, and how to do so in the most elegant way given the structure/code I already have.
Visually, I am trying to remove the leading text shown highlighted above, as well as the last part of the string, which is _-_Google_Art_Project.jpg.
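For what it's worth, strip() removes a *set of characters* from both ends, not a literal suffix, so it won't do what you want here. A minimal sketch of the extraction with plain string methods (the "220px-" marker and the suffix are assumptions based on the examples above):

```python
def title_from_url(url):
    # Keep everything after the last "220px-" marker...
    title = url.rsplit("220px-", 1)[-1]
    # ...then drop the known suffix, if it is present.
    suffix = "_-_Google_Art_Project.jpg"
    if title.endswith(suffix):
        title = title[:-len(suffix)]
    return title

print(title_from_url(
    "//upload.wikimedia.org/x/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"
))
# Rembrandt_van_Rijn_-_Self-Portrait
```

This avoids the hard-coded slice indices entirely, which makes it robust to URLs of different lengths.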
UPDATE:
Based on an answer below, which is very helpful but doesn't quite solve it perfectly, I am trying this approach (without using the unquote import if possible, with pure Python string manipulation):
def titleextract(url):
    #return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
    title = url[58:]
    return title
The above returns:
Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220pxRembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg
but I want:
Rembrandt_van_Rijn_-_Self-Portrait
or for the second title/image in the list:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg
I want:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist
cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg
You have this string and want to remove parts of it. Let's say I have it stored in x:
y = x.split("px-")[1]
z = y.rsplit("_Google_Art")[0]
Splitting on "px-" makes a list with 2 elements: the stuff before "px-" in the string, and the stuff after. We're just grabbing the stuff after, since you wanted to remove the stuff before. If "px-" isn't always in the string, then we need to find something else to split on. Then we split on something towards the end, and grab the stuff before it. (Note: there is no str.lsplit in Python; plain split already works from the left.)
Edit: Addressing the comment on how to split inside that loop. I think you are referring to this: titles = [image[149:199].strip() for image in images]
List comps are great, but sometimes it's easier to just write it out. I haven't tested this, but here's the idea:
titles = []
for image in images:
    title = image[149:199].strip()
    cleaned_left = title.split("px-")[1]
    cleaned_title = cleaned_left.rsplit("_Google_Art")[0]
    titles.append(cleaned_title)
import re # regular expressions used to match strings
from bs4 import BeautifulSoup # web scraping library
from urllib.request import urlopen # open a url connection
from urllib.parse import unquote # decode special url characters
@app.route('/')
@app.route('/home')
def home():
    images = imagescrape()
    # Iterate over all sources and extract the title from the URL
    titles = (titleextract(src) for src in images)
    # zip combines two sequences into one.
    # It goes through all elements, takes one element from the first
    # and one element from the second, combines them into a tuple,
    # and adds them to a sequence / generator.
    images_titles = zip(images, titles)
    return render_template('home.html', images_titles=images_titles)
def imagescrape():
    result_images = []
    #html = urlopen('https://en.wikipedia.org/wiki/Prince_Harry,_Duke_of_Sussex')
    html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {'src': re.compile(r'\.jpg')})
    for image in images:
        result_images.append("https:" + image['src'])  # concatenation!
    return result_images
def titleextract(url):
    # Extract the part of the string between the last two "/" characters,
    # decode special URL characters, cut off the suffix,
    # and replace all "_" with spaces.
    return unquote(url[58:url.rindex("/", 58)-4]).replace('_', ' ')
{% for image, title in images_titles %}
  <div class="card" style="width: 18rem;">
    <img src="{{ image }}" class="card-img-top" alt="...">
    <div class="card-body">
      <h5 class="card-title">{{ title }}</h5>
      <p class="card-text">Some quick example text to build on the card title and make up the bulk of the card's content.</p>
      Go somewhere
    </div>
  </div>
{% endfor %}
Currently, I'm trying to scrape 10-K submission text files on sec.gov.
Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt
The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.
First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, that also didn't work out entirely: it keeps some tags, styles, and scripts.
Does anyone have a clean solution for me to accomplish my goal?
Here is my code so far:
import requests
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
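The regex fails partly because the contents of script and style elements are not themselves inside angle brackets, so they survive tag removal. One stdlib-only sketch (untested against the actual SEC filing) that skips those elements entirely while collecting the rest of the text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

stripper = TextExtractor()
stripper.feed('<style>p{color:red}</style><p>Hello <b>world</b></p><script>x=1</script>')
print(stripper.text())
# Hello world
```

The same parser could be fed response.text from the question's code; for the SGML wrapper sections of an EDGAR filing some extra filtering would likely still be needed.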
Let's set a dummy string based on the example:
original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
Now let's remove all the javascript.
from lxml.html.clean import Cleaner  # removes javascript

# Delete javascript tags (some other options are shown for the sake of example).
cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded tags
)
clean_dom = cleaner.clean_html(original_content)
(From https://stackoverflow.com/a/46371211/1204332)
And then we can either remove the HTML tags (extract the text) with the html.parser module (named HTMLParser in Python 2):
from html.parser import HTMLParser

# Strip HTML.
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

text_content = strip_tags(clean_dom)
print(text_content)
(From: https://stackoverflow.com/a/925630/1204332)
Or we could get the text with the lxml library:
from lxml.html import fromstring
print(fromstring(original_content).text_content())
I want to extract the key-value pairs of some form elements in an HTML page.
For example:
name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"
while the original line is
<form name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home">
Is there any method with which I can safely get the key-value pairs? I tried splitting on spaces and then on '=' characters, but a string inside quotes can also contain '='.
Is there a different kind of split method that can also take care of quotes?
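For this specific shape of input (whitespace-separated key="value" pairs), the stdlib shlex module already does quote-aware splitting, so a sketch like this may be all you need:

```python
import shlex

attrs = ('name="frmLogin" method="POST" '
         'onSubmit="javascript:return validateAndSubmit();"')
# shlex honours quoting: spaces and '=' inside quoted values don't split tokens.
pairs = dict(token.split('=', 1) for token in shlex.split(attrs))
print(pairs['onSubmit'])
# javascript:return validateAndSubmit();
```

This avoids writing your own quote-tracking logic, though a real HTML parser (as the answers below suggest) is still the safer choice for full pages.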
Use a parsing library such as lxml.html for parsing html.
The library will have a simple way for you to get what you need, probably not taking more than a few steps:
load the page using the parser
choose the form element to operate on
ask for the data you want
Example code:
>>> import lxml.html
>>> doc = lxml.html.parse('http://stackoverflow.com/questions/13432626/split-a-string-in-python-taking-care-of-quotes')
>>> form = doc.xpath('//form')[0]
>>> form
<Element form at 0xbb1870>
>>> form.attrib
{'action': '/search', 'autocomplete': 'off', 'id': 'search', 'method': 'get'}
You could use a regular expression like this one:
/([^=, ]+)="([^" ]+|[^," ]+)" ?/
In Python, you can do this:
#!/usr/bin/python
import re

text = 'name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"'
ftext = re.split(r'([^=, ]+)="([^" ]+|[^," ]+)" ?', text)
print(ftext)
>>> s = r'name="frmLogin" method="POST" onSubmit="javascript:return validateAndSubmit();" action="TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"'
>>> lst = s.split('" ')
>>> for item in lst:
...     print(item.split('="'))
...
['name', 'frmLogin']
['method', 'POST']
['onSubmit', 'javascript:return validateAndSubmit();']
['action', 'TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home"']
{i.split('="')[0]: i.split('="')[1] for i in s.split('" ')}
where s is your original string.
d = eval('dict(%s)' % name.replace(' ', ','))
print(d)
{'action': 'TG_cim_logon.asp?SID=^YcMunDFDQUoWV32WPUMqPxeSxD4L_slp_rhc_rNvW7Fagp7FgH3l0uJR/3_slp_rhc_dYyJ_slp_rhc_vsPW0kJl&RegType=Lite_Home', 'onSubmit': 'javascript:return,validateAndSubmit();', 'method': 'POST', 'name': 'frmLogin'}
This will solve your problem, but note that the space-to-comma replacement corrupts values that contain spaces (see the onSubmit value above), and eval should never be run on untrusted input.
You can use a library which has support for parsing HTML forms.
For example: https://mechanize.readthedocs.io/en/latest/
Stateful programmatic web browsing in Python. Browse pages programmatically with easy HTML form filling and clicking of links.
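If pulling in a browser-level library feels heavy, the stdlib html.parser can also hand you a tag's attributes already split into pairs (note that it lowercases attribute names, so onSubmit becomes onsubmit):

```python
from html.parser import HTMLParser

class FormAttrs(HTMLParser):
    """Collect the attribute dict of every <form> start tag."""
    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples, quotes already removed.
        if tag == 'form':
            self.forms.append(dict(attrs))

parser = FormAttrs()
parser.feed('<form name="frmLogin" method="POST" '
            'onSubmit="javascript:return validateAndSubmit();">')
print(parser.forms[0]['method'])
# POST
```

The quoting problem from the question disappears entirely, because the parser, not a split call, decides where each value ends.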
Suppose I have a huge paragraph.
I just want the first 15 words to be shown. After that, the person clicks "more" to see the rest of the stuff.
Just whipped this up, seems to do what you want, and there's no dependency on any external JS libs.
DISCLAIMER: I haven't tried this in IE, but chrome and firefox work fine.
from django import template
from django.utils.html import escape
from django.utils.safestring import mark_safe
import re

register = template.Library()

readmore_showscript = ''.join([
    "this.parentNode.style.display='none';",
    "this.parentNode.parentNode.getElementsByClassName('more')[0].style.display='inline';",
    "return false;",
])

@register.filter
def readmore(txt, showwords=15):
    global readmore_showscript
    words = re.split(r' ', escape(txt))
    if len(words) <= showwords:
        return txt
    # wrap the "more" part
    words.insert(showwords, '<span class="more" style="display:none;">')
    words.append('</span>')
    # insert the "read more" part
    words.insert(showwords, '<span class="readmore">... <a href="#" onclick="')
    words.insert(showwords + 1, readmore_showscript)
    words.insert(showwords + 2, '">read more</a>')
    words.insert(showwords + 3, '</span>')
    # wrap with <p>
    words.insert(0, '<p>')
    words.append('</p>')
    return mark_safe(' '.join(words))

readmore.is_safe = True
To use it, just create a templatetags folder in your app, create the __init__.py file in there, and then drop this code into readmore.py.
Then at the top of any template where you want to use it, just add: {% load readmore %}
To use the filter itself:
{{ some_long_text_var|readmore:15 }}
The :15 tells how many words you want to show before the read more link.
If you want anything fancy like ajax loading of the full content, that's quite a bit different and would require a bit more infrastructure.
Use truncatechars_html.
Refer to https://docs.djangoproject.com/en/1.8/ref/templates/builtins/#truncatechars-html:
truncatechars_html
Similar to truncatechars, except that it is aware of HTML tags. Any tags that are opened in the string and not closed before the truncation point are closed immediately after the truncation.
For example:
{{ value|truncatechars_html:9 }}
If value is "<p>Joel is a slug</p>", the output will be "<p>Joel i...</p>".
Newlines in the HTML content will be preserved.
There is a truncatewords filter, although you still need a JavaScript helper to do what you described.
from django import template
from django.utils.html import escape
from django.utils.safestring import mark_safe
from django.utils.translation import ugettext as _

register = template.Library()

@register.filter
def readmore(text, cnt=250):
    text, cnt = escape(text), int(cnt)
    if len(text) > cnt:
        first_part = text[:cnt]
        # The anchor markup of the link was lost in the original post;
        # something like this was presumably intended:
        link = u'<a href="#" class="readmore">%s</a>' % _('read more')
        second_part = u'%s<span class="hide">%s</span>' % (link, text[cnt:])
        return mark_safe('... '.join([first_part, second_part]))
    return text

readmore.is_safe = True
I rewrote an earlier answer to be cleaner and to handle string escaping properly:
from django import template
from django.template.defaultfilters import stringfilter
from django.utils.html import conditional_escape
from django.utils.safestring import mark_safe

register = template.Library()

@register.filter(needs_autoescape=True)
@stringfilter
def read_more(s, show_words, autoescape=True):
    """Split text after so many words, inserting a "more" link at the end.

    Relies on JavaScript to react to the link being clicked and on classes
    found in Bootstrap to hide elements.
    """
    show_words = int(show_words)
    if autoescape:
        esc = conditional_escape
    else:
        esc = lambda x: x
    words = esc(s).split()
    if len(words) <= show_words:
        return s
    insertion = (
        # The "see more" link...
        '<span class="read-more">…'
        ' <a href="#">'
        ' <i class="fa fa-plus-square gray" title="Show All"></i>'
        ' </a>'
        '</span>'
        # The call to hide the rest...
        '<span class="more hidden">'
    )
    # wrap the "more" part
    words.insert(show_words, insertion)
    words.append('</span>')
    return mark_safe(' '.join(words))
The HTML in there assumes you're using Bootstrap and Fontawesome, but if that's not your flavor, it's easy to adapt.
For the JavaScript, assuming you're using jQuery (if you're using Bootstrap you probably are), you'll just need to add something like this:
$(".read-more").click(function(e) {
    e.preventDefault();
    var t = $(this);
    t.parent().find('.more').removeClass('hidden');
    t.addClass('hidden');
});
I am looking for a Python module that will help me get rid of HTML tags while keeping the text values. I tried BeautifulSoup before and couldn't figure out how to do this simple task. I tried searching for Python modules that could do this, but they all seem to depend on other libraries, which does not work well on App Engine.
Below is a sample code from Ruby's sanitize library and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
If you don't want to use separate libs, then you can import the standard Django utils. For example:
from django.utils.html import strip_tags

html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
stripped = strip_tags(html)
print(stripped)
# you get: foo
Also, it's already included in Django templates, so you don't need anything else there; just use the filter, like this:
{{ unsafehtml|striptags }}
By the way, this is one of the fastest ways.
Using lxml:
from lxml.html import fromstring

htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
mySearchTree = fromstring(htmlstring)
for item in mySearchTree.cssselect('a'):
    print(item.text)
(Note: the sample string contains no <a> elements, so this loop prints nothing as-is; adjust the CSS selector to the elements you care about.)
#!/usr/bin/python
from xml.dom.minidom import parseString

def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:  # TEXT_NODE
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is a link and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print(getText(dom.documentElement))
Prints:
this is a link and some bold text followed by an image
Late, but:
You can use jinja2.Markup():
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
from jinja2 import Markup
Markup("<div>About</div>").striptags()
u'About'