Hi, how should I escape the text so that the link renders?
The way I write it now is with filters:
{{article.text|striptags|urlize|nl2br|safe}}
Can you recommend how to do it?
Related question: https://stackoverflow.com/questions/8179801/autolinebreaks-filter-in-jinja2
Thank you
Usually I'd use HTMLParser for processing (overkill, maybe?); sample code is below for Python 2.7 (in Python 3 the library is renamed to html.parser).
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Found Start Tag", attrs

s = ("noivos, convites de casamento "
     "<a href=\"http://www.olharcaricato.com.br\">http://www.olharcaricato.com.br</a> "
     "more entries here")

parser = MyHTMLParser()
parser.feed(s)
Outputs: Found Start Tag [('href', 'http://www.olharcaricato.com.br')]
Note: implement the code above as a filter and tweak the output to your needs. An example of a filter can be found at Custom jinja2 filter for iterator.
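For reference, here is a minimal sketch of wiring such a parser into a custom Jinja2 filter (the filter name extract_links and the Environment setup are my own illustrative choices, not part of the answer above; Python 3 imports shown):

from html.parser import HTMLParser  # on Python 2.7: from HTMLParser import HTMLParser
from jinja2 import Environment

class LinkCollector(HTMLParser):
    """Collect every href value seen in start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'href':
                self.links.append(value)

def extract_links(text):
    # Hypothetical filter: return the list of hrefs found in an HTML fragment.
    parser = LinkCollector()
    parser.feed(text)
    return parser.links

env = Environment()
env.filters['extract_links'] = extract_links
# In a template: {{ article.text|extract_links }}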
I'm making a Python project in which I created a test Wix website.
I want to get the data (text) from the Wix website using urllib,
so I did
url.urlopen(ADDRESS).readlines()
The problem is that it did not give me any of the text on the page, only information about the structure of the page in HTML.
How would I extract the requested text information from the website?
I think you'll need to end up parsing the html for the information you want. Check out this python library:
https://docs.python.org/3/library/html.parser.html
You could potentially do something like this:
from html.parser import HTMLParser

rel_data = []

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        rel_data.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

print(rel_data)
Output:
['Test', 'Parse me!']
I have an HTML document with an article. There is a small set of tags that I want to use for text formatting, but my text editor inserts a lot of unnecessary tags. I want to write a program in Python to filter these tags out.
What would be the major logic (structure, strategy) of such a program? I'm a beginner in Python and want to learn the language by solving a real practical task, but I need a general overview to start.
Use BeautifulSoup:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; for version 4: from bs4 import BeautifulSoup

html_string = '<html><body><div class="main">example text</div></body></html>'  # the HTML code
parsed_html = BeautifulSoup(html_string)
print parsed_html.body.find('div', attrs={'class': 'main'}).text
Here, div is just the tag; you can use any tag whose text you want to filter.
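Since the goal is to keep an allowed set of formatting tags rather than strip everything, here is a rough whitelist sketch using BeautifulSoup 4 (the allowed-tag set is an assumption you would adjust):

from bs4 import BeautifulSoup

ALLOWED_TAGS = {'p', 'a', 'b', 'i', 'ul', 'ol', 'li'}  # assumed whitelist, adjust to taste

def filter_tags(html):
    # Unwrap every tag that is not whitelisted, keeping its text and children.
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()
    return str(soup)

print(filter_tags('<div><span style="color:red"><b>bold</b> text</span></div>'))
# -> <b>bold</b> text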
Not so clear on your requirements, but you should use a ready-made parser like BeautifulSoup in Python.
You can find a tutorial here
I just don't know what might be missed, but you can use a regex:
re.sub('<[^<]+?>', '', text)
The call above replaces anything that looks like a tag with an empty string.
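A runnable version of that one-liner (the sample string is my own, not from the answer):

import re

text = '<b>foo</b><img src="http://foo.com/bar.jpg" /> bar'
print(re.sub('<[^<]+?>', '', text))  # foo bar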
Otherwise you can use HTMLParser:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
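A short usage example (the sample string is mine, not from the post):

print(html_to_text('<b>foo</b><img src="http://foo.com/bar.jpg" /> bar'))
# foo bar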
I don't want to know how to solve the problem, because I have solved it on my own. I'm just asking if it is really a bug and whether and how I should report it.
You can find the code and the output below:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        for at in attrs:
            if at[0] == 'href':
                print(at[1])
        return super().handle_starttag(tag, attrs)

    def handle_data(self, data):
        return super().handle_data(data)

    def handle_endtag(self, tag):
        return super().handle_endtag(tag)

s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'
p = MyParser()
p.feed(s)
The following is the output:
"/home?ID=123>3=7"
No, it is not a bug. You are feeding the parser invalid HTML; the correct way to include a & in a URL in an HTML attribute is to escape it to &amp;:
>>> s = '<a href="/home?ID=123&amp;gt3=7">nomeLink</a>'
>>> p = MyParser()
>>> p.feed(s)
/home?ID=123&gt3=7
The parser did its best (as required by the HTML standard) and gave you 'repaired' data to the best of its ability. In this case, it tried to repair another common broken-HTML error: spelling &gt; as &gt (forgetting the trailing ; semicolon).
Rather than build on top of the (rather low-level) html.parser library yourself, I recommend you use BeautifulSoup instead. BeautifulSoup supports multiple parsers, and some of those can handle broken HTML better than others.
For example, the html5lib parser can handle unescaped ampersands in attributes better than html.parser can:
>>> from bs4 import BeautifulSoup
>>> s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'
>>> BeautifulSoup(s, 'html.parser').find('a')['href']
'/home?ID=123>3=7'
>>> BeautifulSoup(s, 'html5lib').find('a')['href']
'/home?ID=123&gt3=7'
For completeness' sake, the third supported parser, lxml, also handles unescaped ampersands as if they were escaped:
>>> BeautifulSoup(s, 'lxml').find('a')['href']
'/home?ID=123&gt3=7'
You could use lxml and html5lib directly, but then you'd forgo the nice high-level API that BeautifulSoup offers.
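If you are generating that markup yourself, here is a small sketch of producing the properly escaped attribute with the standard library (the URL is the one from the question; html.escape exists from Python 3.2 on):

from html import escape

url = '/home?ID=123&gt3=7'
anchor = '<a href="{}">nomeLink</a>'.format(escape(url, quote=True))
print(anchor)  # <a href="/home?ID=123&amp;gt3=7">nomeLink</a>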
Python 3.3.2 (v3.3.2, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Let's feed s = '<p a="&#39;">' to MyHTMLParser:
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
This is a valid html tag, where &#39; stands for '.
In this case MyHTMLParser gives for attrs:
[('a', "'")]
The reason for this result is the use of the unescape function:
Lines in source file html/parser.py, class HTMLParser
348: if attrvalue:
349: attrvalue = self.unescape(attrvalue)
where self.unescape is an internal helper to remove special-character quoting, used for attribute values only. See lines 504-532 in parser.py.
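The same conversion can be seen with the public helper that later replaced that internal one (html.unescape, available from Python 3.4 on):

>>> from html import unescape
>>> unescape('&#39;')
"'"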
I am looking for a Python module that will help me get rid of HTML tags but keep the text values. I tried BeautifulSoup before and couldn't figure out how to do this simple task. I tried searching for Python modules that could do this, but they all seem to depend on other libraries, which does not work well on AppEngine.
Below is a sample code from Ruby's sanitize library and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
If you don't want to use separate libs then you can import standard django utils. For example:
from django.utils.html import strip_tags
html = '<b>foo</b><img src="http://foo.com/bar.jpg'
stripped = strip_tags(html)
print stripped
# you got: foo
Also, it's already included in Django templates, so you don't need anything else; just use the filter, like this:
{{ unsafehtml|striptags }}
Btw, this is one of the fastest ways.
Using lxml:
htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
from lxml.html import fromstring
mySearchTree = fromstring(htmlstring)
for item in mySearchTree.cssselect('a'):
    print item.text
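Note that the question's sample string contains no <a> elements; to get all of the text with lxml regardless of tag, text_content() is the more direct call (a small sketch using the same sample):

from lxml.html import fromstring

htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
print(fromstring(htmlstring).text_content())  # foo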
#!/usr/bin/python
from xml.dom.minidom import parseString

def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:  # TEXT_NODE
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is a link and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)
Prints:
this is a link and some bold text followed by an image
Late, but.
You can use jinja2.Markup():
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
from jinja2 import Markup
Markup("<div>About</div>").striptags()
u'About'
I need to do some HTML parsing with Python. After some research lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do; this is why I am here. I need to scrape a page for all of its viewable text, stripping out all tags and javascript, so that I'm left with only the text that is visible. Sounds simple enough. I did it with HTMLParser, but it's not handling javascript well:
import HTMLParser
import cStringIO

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        self.output.write(data)

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
Any ideas for a way to do this with lxml, or a better way to do it with HTMLParser? HTMLParser would be best because no additional libs are needed. Thanks everyone.
Scott F.
No screen-scraping library I know of "does well with Javascript" -- it's just too hard to anticipate all the ways in which JS could alter the HTML DOM dynamically, conditionally, etc.
scrape.py can do this for you.
It's as simple as:
import scrape
s = scrape.Session()
s.go('yoursite.com')
print s.doc.text
Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py:
pycon.blip.tv/file/3261277
BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.
I know of no Python HTML parsing libraries that handle running javascript in the page being parsed. It's not "simple enough" for the reasons given by Alex Martelli and more.
For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.
Two that can execute javascript and are either Python based or can interface with Python:
PAMIE
Selenium
Unfortunately I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text.
So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.
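As a rough illustration of the Selenium route (this setup, the driver choice, and the URL are my assumptions, not part of the answer above), the visible text of a rendered page can be read like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()        # assumes a local chromedriver on PATH
driver.get('http://example.com')   # placeholder URL
# .text returns only the visible text, with scripts already executed by the browser
visible_text = driver.find_element(By.TAG_NAME, 'body').text
driver.quit()
print(visible_text)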
Your code is smart and flexible to an extent, I think.
How about simply adding handle_starttag() and handle_endtag() to suppress the <script> blocks?
class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()
        self.is_in_script = False

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        if not self.is_in_script:
            self.output.write(data)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.is_in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.is_in_script = False

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
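A quick check of the behaviour (the sample HTML is mine; it assumes the Python 2 imports from the original post, i.e. import HTMLParser and import cStringIO):

print(ParseHTML('<p>hello </p><script>var x = 1;</script><p>world</p>'))
# hello world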