Parsing HTML with Python using lxml

I have this html page:
<html>
<head></head>
<body>
Some Text
<a href="aLink">
Other Text
</a>
<a href="aLink2.html">
Another Text
</a>
</body>
</html>
I am interested in capturing the three pieces of text in the file. With the following code, I get the text of the two links as output:
from lxml import html
from lxml import etree
import requests
page = requests.get('myUrl')
tree = html.fromstring(page.text)
aLink = tree.xpath('//a')
for link in aLink:
print link.text #it works
But I am not able to obtain the text from the body section, since the following code doesn't work:
body = tree.xpath('//body')
print body.text
What am I supposed to do? Thanks for the answers
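One likely fix (a sketch, not from the original thread): tree.xpath('//body') returns a list of matching elements, just like tree.xpath('//a'), so calling .text on the list fails. Either index into the list, or ask XPath for the text nodes directly:
from lxml import html
import requests

page = requests.get('myUrl')
tree = html.fromstring(page.text)

# Option 1: xpath() returns a list, so take the first <body> element
body = tree.xpath('//body')[0]
print(body.text)   # "Some Text" - the text before the first child element

# Option 2: ask XPath for every descendant text node of <body>
texts = [t.strip() for t in tree.xpath('//body//text()') if t.strip()]
print(texts)       # ['Some Text', 'Other Text', 'Another Text']
Iterating with body.itertext() and stripping each string would give the same three texts.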

Related

Make BeautifulSoup recognize word breaks caused by HTML <li> elements

BeautifulSoup4 does not recognize that it should insert a word break between <li> elements when extracting text:
Demo program:
#!/usr/bin/env python3
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup( HTML, 'html.parser' )
print(soup.find('body').text.strip())
Output:
First ElementSecond element
Desired output:
First Element Second element
I guess I could just globally add a space before all <li> elements. That seems like a hack?
Try using the .stripped_strings generator of the soup to extract the text while preserving whitespace between elements:
from bs4 import BeautifulSoup
HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup(HTML, 'html.parser')
print(' '.join(soup.body.stripped_strings))
Or extract the text of each <li> element separately and then join them
from bs4 import BeautifulSoup
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup( HTML, 'html.parser' )
lis = soup.find_all('li')
text = ' '.join([li.text.strip() for li in lis])
print(text)
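As a further alternative (not from the original answers), get_text() on the body accepts a separator and a strip flag, which keeps the <li> texts apart as well:
from bs4 import BeautifulSoup
HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
"""
soup = BeautifulSoup(HTML, 'html.parser')
# Join every text fragment with a single space and strip the surrounding whitespace
print(soup.body.get_text(separator=' ', strip=True))
This also prints First Element Second element.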

How to prevent BeautifulSoup from adding extra doctype entry

If I read an html file and load it with bs4, I get an extra doctype entry. How can I prevent it?
HTML Code
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
This is how the file is processed
from bs4 import BeautifulSoup
import urllib

page = urllib.urlopen(file_name).read()
page_soup = BeautifulSoup(page, 'html.parser')
The resulting HTML
<!DOCTYPE doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
Perhaps the issue is not with BS as I am not able to reproduce the problem.
Running this
from bs4 import BeautifulSoup
import urllib.request
file_name = 'file:///C:/Users/tang/MathScripts/t.html'
page = urllib.request.urlopen(file_name).read()
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I get
<!DOCTYPE html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body>
<p>
text body
</p>
</body>
</html>
It looks like the doctype string is case-insensitive in the HTML spec, but case-sensitive in the XML spec.
It is explained very well in this post: "Uppercase or lowercase doctype?".
Based on this information, I think BeautifulSoup is not handling the HTML doctype string properly.
I changed my code as below and it works fine now.
import re
import urllib
from bs4 import BeautifulSoup

page = urllib.urlopen(file_name).read()
# case-insensitive replace so any capitalization of "<!doctype" is normalized
page = re.sub('<!doctype', '<!DOCTYPE', page, flags=re.IGNORECASE)
page_soup = BeautifulSoup(page, 'html.parser')
I am not sure whether the HTML specification has been updated or not.
Please post a comment if you have more information to share.
Found one more solution.
I replaced 'html.parser' with 'html5lib' and it works fine.
page = urllib.urlopen(file_name).read()
page_soup = BeautifulSoup(page, 'html5lib')
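For reference (not stated in the original answers): html5lib is a separate package that has to be installed (pip install html5lib). A Python 3 version of the same approach, reusing the file path from the answer above, would look roughly like:
import urllib.request
from bs4 import BeautifulSoup  # the 'html5lib' parser must be installed separately

file_name = 'file:///C:/Users/tang/MathScripts/t.html'
page = urllib.request.urlopen(file_name).read()
# html5lib re-parses the document per the HTML5 algorithm; per the answer above,
# the duplicated doctype no longer appears in the output
soup = BeautifulSoup(page, 'html5lib')
print(soup)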

Why does BeautifulSoup remove all the formatting from my HTML?

I have an HTML file which looks more or less like
<body>
<div>
<aside class="bg">
Home
</aside>
</div>
</body>
But after I parse it with BeautifulSoup and then write it to file, all my formatting is gone. My code looks like:
from bs4 import BeautifulSoup

with open('contact.html', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

elem = soup.find("aside")
new_html = "Support"
new_soup = BeautifulSoup(new_html, "html.parser")
elem.insert(1, new_soup)

with open('contact.html', 'w') as f:
    f.write(str(soup))
The resulting html file looks like
<body>
<div>
<aside class="bg">
Support
Home
</aside>
</div>
</body>
I don't want to use prettify because I dislike the formatting of it. I just want to keep my formatting the same. Any way I can do that?
There is a discussion of this in the thread "Maintaining the indentation of an XML file when parsed with Beautifulsoup".
Hope this helps.

How can I find a comment with specified text string

I'm using RoboBrowser to parse some HTML content. It has BeautifulSoup inside. How can I find a comment containing a specified string in the following HTML?
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
In fact I need to get TEXT_1 if I know ANY_ID
Thanks
Use the text argument with a function that checks whether the node's type is Comment. Since a comment's contents are stored as a plain string and never parsed into a tree, load them with BeautifulSoup again and find the desired element by id:
from bs4 import BeautifulSoup
from bs4 import Comment
data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment) and "ANY_ID" in text)
soup_comment = BeautifulSoup(comment, "html.parser")
text = soup_comment.find("div", id="ANY_ID").get_text()
print(text)
Prints TEXT_1.

Python and HTML: parsing away tags

Let's say you've been able to store part of an HTML file in a string called 'source' that looks like this (it can even start off with a closing </...> tag):
</div> <div class = "label" xml: ... other_parameters: ... > Some plain english here </div> </div> Some more english text... <some more tags> more english... etc
How does one go about extracting just the first "Some plain english here" (and only that), from that line, with no whitespace or newline markers right before the first letter in "Some" nor right after the last letter in "here"?
[EDIT] Extraction into a new string would be fine
There are plenty of XML/HTML parsers to choose from; the most popular are:
xml.etree.ElementTree
xml.dom.minidom
BeautifulSoup
lxml
Using ElementTree:
from xml.etree import ElementTree as ET
tree = ET.fromstring("""
<body>
<div class="label"> Some plain english here </div>
</body>""")
print tree.find('.//div[@class="label"]').text.strip()
Using lxml:
from lxml import etree
tree = etree.fromstring("""
<body>
<div class="label"> Some plain english here </div>
</body>""")
print tree.find('.//div[@class="label"]').text.strip()
Using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<body>
<div class="label"> Some plain english here </div>
</body>"""
soup = BeautifulSoup(data, "html.parser")
print soup.find('div', class_="label").text.strip()
No example for minidom, sorry, don't use it.
Hope that helps.
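For completeness (this is not part of the original answer), a minidom sketch of the same extraction could look like:
from xml.dom import minidom

dom = minidom.parseString("""
<body>
<div class="label"> Some plain english here </div>
</body>""")

# Walk the <div> elements and pick the one whose class attribute is "label"
for div in dom.getElementsByTagName('div'):
    if div.getAttribute('class') == 'label':
        print(div.firstChild.data.strip())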
