I'm using robobrowser to parse some HTML content. It has BeautifulSoup inside. How can I find a comment with a specified string inside?
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
In fact I need to get TEXT_1 if I know ANY_ID
Thanks
Use the text argument and check the type to be Comment. Then, load the contents with BeautifulSoup again and find the desired element by id:
from bs4 import BeautifulSoup
from bs4 import Comment
data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment) and "ANY_ID" in text)
soup_comment = BeautifulSoup(comment, "html.parser")
text = soup_comment.find("div", id="ANY_ID").get_text()
print(text)
Prints TEXT_1.
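If the page holds several comments, or the id might be missing, the same idea can be wrapped in a small helper. This is only a sketch: the function name text_from_commented_element is made up for illustration, and it uses the string= argument, which newer bs4 releases prefer over text=.

```python
from bs4 import BeautifulSoup, Comment


def text_from_commented_element(markup, element_id):
    """Return the text of the element with `element_id` hidden in an HTML comment, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Collect every comment node in the document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment body as HTML and look for the id inside it.
        inner = BeautifulSoup(comment, "html.parser")
        match = inner.find(id=element_id)
        if match is not None:
            return match.get_text()
    return None


data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""

found = text_from_commented_element(data, "ANY_ID")
missing = text_from_commented_element(data, "NO_SUCH_ID")
print(found, missing)
```

Returning None for a missing id avoids the AttributeError you would get from calling get_text() on a failed find().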
BeautifulSoup4 does not recognize that it should break between <li> elements when extracting text:
Demo program:
#!/usr/bin/env python3
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup( HTML, 'html.parser' )
print(soup.find('body').text.strip())
Output:
First ElementSecond element
Desired output:
First Element Second element
I guess I could just globally add a space before all <li> elements. That seems like a hack?
Try using .stripped_strings to extract the text, then join the pieces with a space so that adjacent elements stay separated:
from bs4 import BeautifulSoup
HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup(HTML, 'html.parser')
print(' '.join(soup.body.stripped_strings))
Or extract the text of each <li> element separately and then join them
from bs4 import BeautifulSoup
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup( HTML, 'html.parser' )
lis = soup.find_all('li')
text = ' '.join([li.text.strip() for li in lis])
print(text)
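A third option, as far as I know, is to let get_text() do the joining: it accepts a separator and a strip flag. A minimal sketch using the same sample markup:

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""

soup = BeautifulSoup(HTML, 'html.parser')
# separator is placed between text pieces; strip=True trims each piece
# and drops the ones that are whitespace only.
text = soup.body.get_text(separator=' ', strip=True)
print(text)
```

This is roughly equivalent to joining .stripped_strings by hand, just shorter.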
If I read an html file and load it with bs4, I get an extra doctype entry. How can I prevent it?
HTML Code
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
This is how the file is processed
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen(file_name).read()
page_soup = BeautifulSoup(page, 'html.parser')
The resulting HTML
<!DOCTYPE doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
Perhaps the issue is not with BS as I am not able to reproduce the problem.
Running this
from bs4 import BeautifulSoup
import urllib.request
file_name = 'file:///C:/Users/tang/MathScripts/t.html'
page = urllib.request.urlopen(file_name).read()
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I get
<!DOCTYPE html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
It looks like the doctype string is case-insensitive in the HTML spec, but case-sensitive in the XML spec.
It is explained very well in this post: "Uppercase or lowercase doctype?".
Based on this information, I think BeautifulSoup is not handling the HTML doctype string properly.
I changed my code as below and it works fine now.
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen(file_name).read()
# case-insensitive replace to cover all case permutations; read() returns bytes, so the patterns are bytes too
page = re.sub(b'<!doctype', b'<!DOCTYPE', page, flags=re.IGNORECASE)
page_soup = BeautifulSoup(page, 'html.parser')
I am not sure whether the HTML specification has been updated.
Please post a comment if you have more information to share.
Found one more solution.
I replaced 'html.parser' with 'html5lib' (a separate package that has to be installed) and it works fine.
page = urllib.request.urlopen(file_name).read()
page_soup = BeautifulSoup(page, 'html5lib')
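To check what the parser actually made of the doctype, you can look for the Doctype node at the top of the parsed tree. A rough sketch, assuming the markup is already a str; the helper name normalize_doctype and the inline sample are only for illustration:

```python
import re
from bs4 import BeautifulSoup, Doctype


def normalize_doctype(markup):
    """Upper-case the first '<!doctype' marker, whatever its original casing."""
    return re.sub(r'<!doctype', '<!DOCTYPE', markup, count=1, flags=re.IGNORECASE)


raw = ('<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\n'
       '<html><body><p>text body</p></body></html>')

normalized = normalize_doctype(raw)
soup = BeautifulSoup(normalized, 'html.parser')

# The doctype shows up as a Doctype node among the top-level children.
doctypes = [node for node in soup.contents if isinstance(node, Doctype)]
print(doctypes)
```

If the list printed here contains a stray leading "doctype", the installed bs4 version is probably the one with the problem described above.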
I'm using Beautiful Soup for replacing text.
Here's an example of my code:
for x in soup.find('body').find_all(string=True):
    fix_str = re.sub(...)
    x.replace_with(fix_str)
How do I skip the script and comment (<!-- -->) tags?
How can I determine which element or tag x is in?
If you take the parent item for each text item you get, you can then determine whether or not it comes from within a <script> tag or from an HTML comment. If not, the text can then be used to call replace_with() using your re.sub() function:
from bs4 import BeautifulSoup, Comment
html = """<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>
Some text 1
<!-- a comment -->
<!-- a comment -->
Some text 2
<!-- a comment -->
<script>a script</script>
Some text 2
</body>
</html>"""
soup = BeautifulSoup(html, "html.parser")
for text in soup.body.find_all(string=True):
    if text.parent.name != 'script' and not isinstance(text, Comment):
        text.replace_with('new text')  # add re.sub() logic here
print(soup)
Giving you the following new HTML:
<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>new text<!-- a comment -->new text<!-- a comment -->new text<!-- a comment -->new text<script>a script</script>new text</body>
</html>
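For completeness, here is the same loop with an actual re.sub() call in place of the 'new text' placeholder. The regex (collapsing runs of whitespace) is only a stand-in for whatever substitution you need, and skipping <style> as well as <script> is my own addition:

```python
import re
from bs4 import BeautifulSoup, Comment

html = ("<body>Some  text 1<!-- a   comment -->"
        "<script>var  a = 1;</script><p>More   text</p></body>")

soup = BeautifulSoup(html, "html.parser")
SKIP_PARENTS = {"script", "style"}  # tags whose text should be left alone

for node in soup.body.find_all(string=True):
    # Comments are text nodes too, so they have to be filtered by type.
    if isinstance(node, Comment) or node.parent.name in SKIP_PARENTS:
        continue
    node.replace_with(re.sub(r"\s+", " ", node))

result = str(soup)
print(result)
```

The comment and the script body keep their original spacing, while the ordinary text nodes are rewritten.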
I have HTML code something like this
<body>
<p> String </p>
Some string
</body>
I need to wrap all unwrapped text inside the body in a paragraph.
I can do it in JavaScript with Node.nodeType, but I need a solution in Python (I am trying to use lxml).
The output I need is
<body>
<p> String </p>
<p> Some string </p>
</body>
My solution in JavaScript
$(document).ready(function() {
    $('article').contents().filter(function() {
        return this.nodeType == 3 && $.trim(this.nodeValue).length;
    }).wrap('<p></p>');
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<article>
<p>Some text</p>
Some unwrapped text
<p>Some text</p>
</article>
Here's how it can be done using lxml:
html = '''
<html>
<body>
Text
<p>String</p>
Tail
<p>String</p>
Tail
</body>
</html>
'''
from lxml import etree
import lxml.html
doc = lxml.html.fromstring(html)
for doc_child in doc:
    if doc_child.tag == 'body':
        body = doc_child
        # Text directly after <body> and before its first child element
        if body.text and body.text.strip():
            p = etree.Element('p')
            p.text = body.text.strip()
            body.text = None
            body.insert(0, p)
        # Text that follows a child element is stored as that element's tail
        for elem in body:
            if elem.tail and elem.tail.strip():
                p = etree.Element('p')
                p.text = elem.tail.strip()
                elem.tail = None
                elem.addnext(p)
print(lxml.html.tostring(doc).decode('utf8'))
Output:
<html>
<body><p>Text</p><p>String</p><p>Tail</p><p>String</p><p>Tail</p></body>
</html>
You can use the BeautifulSoup module to parse HTML pages.
There are many ways to do this, but this is one of the easiest ways to convert HTML to text.
from bs4 import BeautifulSoup # from BeautifulSoup import BeautifulSoup
html = '''<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<article>
<p>Some text</p>
Some unwrapped text
<p>Some text</p>
</article>'''
parsed_html = BeautifulSoup(html, "lxml")
print(parsed_html.text)
Output:
Some text
Some unwrapped text
Some text
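Note that this only extracts the text. If the goal is still to wrap the bare text in <p>, something along these lines should work with bs4 as well. It is a sketch that assumes the unwrapped text sits directly under <body>:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = "<body><p> String </p> Some string </body>"
soup = BeautifulSoup(html, "html.parser")

# Copy the children into a list first, because wrapping changes the tree.
for node in list(soup.body.children):
    is_text = isinstance(node, NavigableString) and not isinstance(node, Comment)
    if is_text and node.strip():
        # wrap() replaces the text node with the new tag and moves the text inside it.
        node.wrap(soup.new_tag("p"))

result = str(soup)
print(result)
```

Whitespace-only nodes are skipped so that no empty paragraphs are created.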
Python, with lxml:
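from lxml.etree import Element, fromstring
body = fromstring("""
<body>
<p> String </p>
Some string
</body>
""")
for text_node in body.xpath("//text()"):
    if not text_node.strip():
        continue
    # For tail text, getparent() returns the element the text follows
    owner = text_node.getparent()
    wrapper = Element("p")
    wrapper.text = text_node.strip()
    if text_node.is_tail:
        # Move the tail into a new <p> placed right after the element
        owner.tail = None
        owner.addnext(wrapper)
    elif owner.tag != "p":
        # Leading text directly inside a non-<p> element
        owner.text = None
        owner.insert(0, wrapper)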
I have this html page:
<html>
<head></head>
<body>
Some Text
<a href="aLink">
Other Text
</a>
<a href="aLink2.html">
Another Text
</a>
</body>
</html>
I am interested in capturing the three texts in the file. With the code below, I get the texts from the two links in the output:
from lxml import html
from lxml import etree
import requests
page = requests.get('myUrl')
tree = html.fromstring(page.text)
aLink = tree.xpath('//a')
for link in aLink:
    print(link.text)  # it works
But I am not able to get the text from the body section, since the following code doesn't work:
body = tree.xpath('//body')
print(body.text)
What am I supposed to do? Thanks for the answers
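One likely cause: xpath() returns a list of elements, so body here is a list and has no .text attribute. A minimal sketch of two ways around that, using the page from the question as an inline string instead of fetching it with requests:

```python
from lxml import html

page_source = """<html>
<head></head>
<body>
Some Text
<a href="aLink">
Other Text
</a>
<a href="aLink2.html">
Another Text
</a>
</body>
</html>"""

tree = html.fromstring(page_source)

# xpath() returns a list, so take the first (and only) <body> element.
body = tree.xpath('//body')[0]

# .text holds only the text before the first child element.
direct_text = (body.text or '').strip()

# text() nodes below <body>, including the ones inside the links.
all_texts = [t.strip() for t in body.xpath('.//text()') if t.strip()]

print(direct_text)
print(all_texts)
```

body.text gives just "Some Text", while the text() query collects all three strings.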