encode_contents vs encode("utf-8") in Python BeautifulSoup

OK, so as a beginner web scraper I feel as though I've seen both used, seemingly interchangeably, when converting the default Unicode text of an HTML document. I know .contents is a list, but other than that, what the heck is the difference?
I've noticed that .encode("utf-8") seems to work more universally.
thanks,
-confused souper.

The documentation of encode_contents:
encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
Renders the contents of this tag as a bytestring.
The documentation of the encode method:
encode(self, encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace')
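Both methods are available on any tag, including the top-level bs4.BeautifulSoup object. encode renders the tag itself, with its own opening and closing markup, as a bytestring; encode_contents renders only what is inside the tag (its children).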
>>> from bs4 import BeautifulSoup
>>> html = "<div>div content <p> a paragraph </p></div>"
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup.div.encode()
'<div>div content <p> a paragraph </p></div>'
>>> soup.div.contents
[u'div content ', <p> a paragraph </p>]
>>> soup.div.encode_contents()
'div content <p> a paragraph </p>'
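The session above shows Python 2 output. Under Python 3 the same calls return bytes rather than str. Here is a minimal sketch, assuming bs4 is installed and using the built-in html.parser (the variable names are only illustrative). decode_contents is the str-returning counterpart of encode_contents:

```python
from bs4 import BeautifulSoup

html = "<div>div content <p> a paragraph </p></div>"
soup = BeautifulSoup(html, "html.parser")

whole_tag = soup.div.encode()            # bytes, includes the <div> wrapper
inner_only = soup.div.encode_contents()  # bytes, only the children of <div>
inner_text = soup.div.decode_contents()  # str, same content without encoding

print(whole_tag)
print(inner_only)
print(inner_text)
```

If you only need text to work with inside Python, the decode variants avoid a round trip through bytes.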

The method signature for encode_contents() shows that in addition to encoding content, it can also format the output:
encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
Renders the contents of this tag as a bytestring.
For example:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p>Caf\xe9</p></body></html>')
>>> soup.encode_contents()
'<html><body><p>Caf\xc3\xa9</p></body></html>'
>>> soup.encode_contents(indent_level=1)
'<html>\n <body>\n <p>\n Caf\xc3\xa9\n </p>\n </body>\n</html>'
>>> soup.encode_contents(indent_level=1, encoding='iso-8859-1')
'<html>\n <body>\n <p>\n Caf\xe9\n </p>\n </body>\n</html>'
str.encode('utf-8') can only perform the encoding part, no formatting.
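To see the encoding step on its own, here is a small Python 3 sketch with plain str.encode. The sample string is just an illustration; errors="xmlcharrefreplace" is the same error handler that appears as the default in the encode signature quoted earlier:

```python
text = "Café"

# Same characters, different byte representations per encoding
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# Characters the target encoding cannot represent become numeric
# character references instead of raising UnicodeEncodeError
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes)    # b'Caf\xc3\xa9'
print(latin1_bytes)  # b'Caf\xe9'
print(ascii_bytes)   # b'Caf&#233;'
```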

Related

Is there any way to output a blank as &nbsp; in lxml or another Python XML library?

my code is:
>>> from lxml.html import tostring, HtmlElement
>>> a = HtmlElement()
>>> a.tag = "div"
>>> a.text = " "
>>> tostring(a)
b'<div> </div>'
Is there any way to output it like this in lxml or another Python XML library?
<div>&nbsp;</div>
I solved this problem by replacing the blank with '\xa0' (160 is the code point of the non-breaking space; it is outside the ASCII range).

Getting "TypeError: argument should be integer or bytes-like object, not 'str'" when searching for string on web page

I'm using Python 3.7 and Django. I want to search for a string in an HTML page. I tried this ...
req = urllib2.Request(article.path, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
is_present = html.find(token_str) >= 0
but this is resulting in an error
TypeError: argument should be integer or bytes-like object, not 'str'
complaining about the last line, where I do the "find." What's the right way to search for a string in HTML?
Dave!
For pulling data out of HTML files, I really recommend the library Beautiful Soup. For now, you could just search for that token within all the tags of the HTML file, but at some other time you might be looking for more complex things, such as finding a string that's only within a certain paragraph tag. To install it with pip:
pip install beautifulsoup4
For your case, here's a tested snippet that can solve your problem. It uses a simple regexp pattern for the token that you are looking for. If there's a match for that token inside an HTML tag, it returns True. Otherwise, False.
I hope this function helps you get started with Beautiful Soup. It's a really powerful library.
import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>
Here goes somet title
</title>
</head>
<body>
<p class="title">
<b>
Hello World!
</b>
</p>
<p class="class1">
Once upon a time..... there was a my_token here....
<a class="token" href="http://example.com/token" id="link1">
</p>
<p class="class2">
Nope....
</p>
</body>
</html>
"""
def search_inside_whole_html_tags_for(html_doc, str_lookup):
    """
    Looks for a str_lookup using a simple regexp pattern. Returns
    True if the str_lookup was found in the whole HTML text. Otherwise,
    returns False.
    """
    html_soup = BeautifulSoup(html_doc, "html.parser")
    # simple regexp pattern: the fixed str lookup
    rslt = html_soup.find_all(text=re.compile(str_lookup))
    return bool(rslt)

print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_tokenx"))
print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_token")) # this is the token
Output:
False
True
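The error comes from calling find on a bytes object with a str argument, not from the >= 0 comparison. In Python 3, urlopen(...).read() returns bytes, and bytes.find accepts only an integer or a bytes-like object. Either decode the response or encode the token so that both sides have the same type.
Test:
>>> html = b'<p>test</p>'
>>> html.find('test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument should be integer or bytes-like object, not 'str'
>>> html.find(b'test')
3
Recommended Solution (assuming the page is UTF-8 encoded):
is_present = html.decode('utf-8').find(token_str) >= 0
or
is_present = token_str.encode('utf-8') in html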
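For a plain substring check without any parsing library, the main thing is to make the page and the token the same type. Here is a minimal sketch: contains_token is a hypothetical helper name, and UTF-8 is an assumption about the page encoding, not something the question states:

```python
def contains_token(raw_html: bytes, token: str, encoding: str = "utf-8") -> bool:
    """Return True if token occurs anywhere in the decoded page text."""
    # errors="replace" keeps decoding from failing on stray bytes;
    # undecodable bytes become U+FFFD instead of raising an exception
    text = raw_html.decode(encoding, errors="replace")
    return token in text


# Sample bytes standing in for urlopen(...).read()
page = b"<html><body><p>Caf\xc3\xa9 my_token</p></body></html>"

print(contains_token(page, "my_token"))  # True
print(contains_token(page, "Café"))      # True
print(contains_token(page, "missing"))   # False
```

Decoding the page (rather than encoding the token) also makes non-ASCII tokens match reliably, provided the encoding guess is right.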

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use below code:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns you <span class="data">important data</span> node which doesn't contain " and some rubbish" - it's a text child node of parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
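The text/tail split can also be seen without requests-html. This sketch swaps in the standard library's xml.etree.ElementTree, which uses the same text and tail attributes as lxml's element API. It works here only because the sample markup is well-formed XML; real-world HTML usually needs lxml or another HTML parser:

```python
import xml.etree.ElementTree as ET

markup = '<span><span class="data">important data</span> and some rubbish</span>'
root = ET.fromstring(markup)

# The inner span is a child of the outer span
inner = root.find("span[@class='data']")

# text is what the element encloses; tail is the text that follows
# its closing tag, up to the next tag in the parent
print(repr(inner.text))  # 'important data'
print(repr(inner.tail))  # ' and some rubbish'
```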

Keep \n in string content and write to one line

I have the following code for parsing some HTML. I need to save the output (html result) as a single line of code with the escaped character sequences there such as \n but I'm either getting a representation I can't use from repr() because of the single quotes or the output is being written to multiple lines like so (interpreting the escape sequences):
<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>
What I require (including the escape sequences):
<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n </p>\n </section>
My Code
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for match in soup.findAll(['div']):
    match.unwrap()
for match in soup.findAll(['a']):
    match.unwrap()
html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repr(html) # my current solution works, but unusable
The above is my solution, but an object representation is no good, I need the string representation. How can I achieve this?
Why not just use repr?
a = """this is the first line
this is the second line"""
print(repr(a))
Or even (if I understand correctly that you want the exact output without the literal quotes)
print(repr(a).strip("'"))
Output:
'this is the first line\nthis is the second line'
this is the first line\nthis is the second line
import bs4
html = '''<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>'''
soup = bs4.BeautifulSoup(html, 'lxml')
str(soup)
out:
'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n</section></body></html>'
There are more complex ways to output the HTML; see the Beautiful Soup documentation.
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.example.com')
soup = BeautifulSoup(r.read(), 'html.parser')
html = str(soup)
This will give your html as one string and lines separated by \n
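If the goal is a single line containing a literal backslash followed by n, a plain str.replace is enough. Here is a minimal sketch with a shortened sample string (the variable names are illustrative). It also shows why re.sub with "\\n" as the replacement appears to do nothing: the re module processes escapes in the replacement, so that two-character sequence is turned back into a real newline:

```python
import re

html = '<section class="prog__container">\n<p>PEP 336 - Make None Callable</p>\n</section>'

# Replace each real newline with the two characters backslash + n
one_line = html.replace("\n", "\\n")
print(one_line)

# re.sub interprets "\\n" in the replacement as a newline, so nothing changes
unchanged = re.sub("\n", "\\n", html)

# Doubling the backslash in a raw string gives the literal two characters
via_regex = re.sub("\n", r"\\n", html)

# Reversible as long as the original contains no literal backslash-n pairs
restored = one_line.replace("\\n", "\n")
```

Unlike repr, this adds no surrounding quotes and leaves other characters untouched.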

How can i remove <p> </p> with python sub

I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from io import StringIO
from lxml import etree

input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    # keep paragraphs that contain child elements
    if len(p):
        continue
    t = p.text
    # remove paragraphs whose text is missing or only whitespace
    if not (t and t.strip()):
        p.getparent().remove(p)
print(etree.tostring(tree.getroot(), pretty_print=True).decode())
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp;. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think a parsing module would be overkill for this particular problem. Simply use str.replace:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This  <p>is a test</p>  '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print(null_paragraph.transformString(mystring))
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub(r"<p>\s*</p>", " ", mystring)
Compile the regexp if you use it often.
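A compiled version might look like the sketch below. EMPTY_P and blank_empty_paragraphs are hypothetical names, and the pattern is one possible choice that also tolerates uppercase tags and attributes. It still does not handle the self-closing <p/> form or nested markup, which is where a real parser is the safer option:

```python
import re

# <p ...> followed by optional whitespace and </p>, case-insensitive
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p\s*>", re.IGNORECASE)


def blank_empty_paragraphs(markup: str, replacement: str = " ") -> str:
    """Replace every empty or whitespace-only <p> element with replacement."""
    return EMPTY_P.sub(replacement, markup)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
print(repr(blank_empty_paragraphs(mystring)))
print(repr(blank_empty_paragraphs('<P align="left"> </P>')))
```

The \b after p keeps the pattern from matching other tags that merely start with "p", such as <pre>.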
I wrote this code:
from io import BytesIO
from lxml import etree

html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(BytesIO(html_tags.encode("utf-8")), html=True)
for a, e in document:
    # drop elements that have no children and no non-whitespace text
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print(etree.tostring(document.root).decode())
