I'm trying to save the content of an HTML page to a .html file, but I only want to save the content inside the "table" tags. In addition, I'd like to remove all empty tags like <b></b>.
I've already done all of this with BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()

soup = BeautifulSoup(html)
txt = ""
for text in soup.find_all("table", {'class': 'main'}):
    txt += str(text)
    text = BeautifulSoup(str(text))
    empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
    [empty_tag.extract() for empty_tag in empty_tags]
My question is: is this also possible with lxml? If so, roughly how would it look?
Thanks a lot for any help.
import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

# extract the HTML content of all tables
# (use lxml.html.tostring(t, method="text", encoding=unicode)
# to get the text content without tags)
tables_html = "\n".join(lxml.html.tostring(t) for t in tables)

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that have no child nodes)
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)

# root no longer contains those empty tags
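To finish the original task of writing just the cleaned tables to a .html file, here is a minimal sketch that continues from the tables and empty-tag removal above; the output filename is a placeholder:

# lxml.html.tostring returns bytes by default, so open the file in binary mode.
# The tables were selected from root, so the empty-tag removal above is
# already reflected in their serialized markup.
with open('tables.html', 'wb') as out:  # placeholder filename
    for t in tables:
        out.write(lxml.html.tostring(t))
        out.write(b'\n')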
Related
I have an html file in C:\temp.
I want to extract this text
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
from this block of HTML:
<td width='99%' style='word-wrap:break-word;'><div><img src='style_images/1/nav_m.gif' border='0' alt='>' width='8' height='8' /> <b>Death - Individual Thought Patterns (1993)</b>, Progressive Death Metal</div></td>
<!--HideBegin--><div class='hidetop'>Hidden text</div><div class='hidemain'><!--HideEBegin--><!--coloro:#FF0000--><span style="color:#FF0000"><!--/coloro--><b>Download:</b><!--colorc--></span><!--/colorc--><br />Download from ifolder.ru <i>*Death - Individual Thought Patterns (1993)* <b>by Dissident God</b></i><br /><!--HideEnd--></div><!--HideEEnd--><br /><!--HideBegin--><div class='hidetop'>Hidden text</div><div class='hidemain'><!--HideEBegin--><!--coloro:#ff0000--><span style="color:#ff0000"><!--/coloro--><b>Download (mp3#VBR230kbps) (67 MB):</b><!--colorc--></span><!--/colorc--><br />Download from rapidshare.com <i>*Death - Individual Thought Patterns (Remastered) (2008)* <b>by smashter</b></i><!--HideEnd--></div><!--HideEEnd-->
<td width='99%' style='word-wrap:break-word;'><div><img src='style_images/1/nav_m.gif' border='0' alt='>' width='8' height='8' /> <b>Alfadog (1994)</b>, Black Metal</div></td>
The extracted text must be saved in a file called links.txt
Despite my changes, my script only ever extracts this text:
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
But I want it to extract the text like this, in this order:
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
This is the script
import requests
from bs4 import BeautifulSoup

# Open the HTML file in read mode
with open("C:/temp/pagina.html", "r") as f:
    html = f.read()

# Create a Beautiful Soup object from HTML code
soup = BeautifulSoup(html, "html.parser")

# Initialize a list to contain the extracted text
extracted_text = []

# Find all td's with style "word-wrap:break-word"
tds = soup.find_all("td", style="word-wrap:break-word")

# For each td found, look for the div tag and the b tag inside
# and extract the text contained in these tags
for td in tds:
    div_tag = td.find("div")
    b_tag = div_tag.find("b")
    if b_tag:
        text = b_tag.text
        # Also add the text after the b tag
        text += td.text[td.text.index(b_tag.text) + len(b_tag.text):]
        extracted_text.append(text)

# Find all divs with class "hidemain"
divs = soup.find_all("div", class_="hidemain")

# For each div found, look for the a tag inside
# and extract the link contained in this tag
for div in divs:
    a_tag = div.find("a")
    if a_tag:
        link = a_tag.get("href")
        extracted_text.append(link)

# Save the extracted text to a text file
with open("links.txt", "w") as f:
    for line in extracted_text:
        f.write(line + "\n")
I can't understand why it doesn't return the text I'm asking for.
After @Barmar's code edit I get this output:
Death - Individual Thought Patterns (1993), Progressive Death Metal
Alfadog (1994), Black Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
To make the desired lines appear in the order they occur in the HTML file:
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
I modified the code to extract the links first and then the titles.
Specifically, I used two for loops, one to extract the links and the other to extract the song titles. Also, I used the extend function instead of the append function to add the extracted items to the extracted_text list.
This is the fixed code:
import requests
from bs4 import BeautifulSoup

# Open the HTML file in read mode
with open("C:/temp/pagina.html", "r") as f:
    html = f.read()

# Create a Beautiful Soup object from HTML code
soup = BeautifulSoup(html, "html.parser")

# Initialize a list to contain the extracted text
extracted_text = []

# Find all divs with class "hidemain"
divs = soup.find_all("div", class_="hidemain")

# For each div found, look for the a tag inside
# and extract the link contained in this tag
for div in divs:
    a_tag = div.find("a")
    if a_tag:
        link = a_tag.get("href")
        extracted_text.extend([link])

# Find all td's with style "word-wrap:break-word;"
tds = soup.find_all("td", style="word-wrap:break-word;")

# For each td found, look for the div tag and the b tag inside
# and extract the text contained in these tags
for td in tds:
    div_tag = td.find("div")
    b_tag = div_tag.find("b")
    if b_tag:
        text = b_tag.text
        # Also add the text after the b tag
        text += td.text[td.text.index(b_tag.text) + len(b_tag.text):]
        extracted_text.extend([text])

# Save the extracted text to a text file
with open("links.txt", "w") as f:
    for line in extracted_text:
        f.write(line + "\n")
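If the titles and their links should instead interleave in true document order (title, its links, the next title, and so on), a single pass over both kinds of nodes may be simpler, since find_all returns matches in document order. A sketch, assuming the same file layout as above:

from bs4 import BeautifulSoup

with open("C:/temp/pagina.html", "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

extracted_text = []
# find_all returns nodes in document order, so titles and links interleave
for node in soup.find_all(["td", "div"]):
    if node.name == "td" and "word-wrap:break-word" in (node.get("style") or ""):
        b_tag = node.find("b")
        if b_tag:
            # title text plus whatever follows the <b> tag inside the td
            extracted_text.append(b_tag.text + node.text.split(b_tag.text, 1)[1])
    elif node.name == "div" and "hidemain" in (node.get("class") or []):
        a_tag = node.find("a")
        if a_tag:
            extracted_text.append(a_tag.get("href"))

with open("links.txt", "w") as f:
    for line in extracted_text:
        f.write(line + "\n")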
How would I go about writing a function (with BeautifulSoup or otherwise) that replaces all instances of one HTML tag with another? For example:
text = "<p>this is some text<p><bad class='foo' data-foo='bar'> with some tags</bad><span>that I would</span><bad>like to replace</bad>"
new_text = replace_tags(text, "bad", "p")
print(new_text) # "<p>this is some text<p><p class='foo' data-foo='bar'> with some tags</p><span>that I would</span><p>like to replace</p>"
I tried this, but preserving the attributes of each tag is a challenge:
def replace_tags(string, old_tag, new_tag):
    soup = BeautifulSoup(string, "html.parser")
    nodes = soup.findAll(old_tag)
    for node in nodes:
        new_content = BeautifulSoup("<{0}>{1}</{0}>".format(
            new_tag, node.contents[0],
        ))
        node.replaceWith(new_content)
    string = soup.body.contents[0]
    return string
Any idea how I could just replace the tag element itself in the soup? Or, even better, does anyone know of a library/utility function that'll handle this more robustly than something I'd write?
Thank you!
Actually it's pretty simple: you can directly set node.name = new_tag.
from bs4 import BeautifulSoup

def replace_tags(string, old_tag, new_tag):
    soup = BeautifulSoup(string, "html.parser")
    for node in soup.findAll(old_tag):
        node.name = new_tag
    return soup  # or return str(soup) if you want a string

text = "<p>this is some text<p><bad class='foo' data-foo='bar'> with some tags</bad><span>that I would</span><bad>like to replace</bad>"
new_text = replace_tags(text, "bad", "p")
print(new_text)
Output:
<p>this is some text<p><p class="foo" data-foo="bar"> with some tags</p><span>that I would</span><p>like to replace</p></p></p>
From the documentation:
Every tag has a name, accessible as .name:
tag.name
# u'b'
If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
I'm using BeautifulSoup to parse some HTML from a text file. The text is written to a dictionary like so:
import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id: get_raw_html})
I then parse the HTML stored in html_dict to find text with the <p> tag:
scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')
This is what the HTML in html_dict looks like:
<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>
The problem is, BeautifulSoup seems to stop at the line break and ignore the second line. So when I print out scrape_selected_tags the output is...
<p>Hey, this should be scraped</p>
when I would expect the whole text.
How can I avoid this? I've tried splitting the lines in html_dict and it doesn't seem to work. Thanks in advance.
By calling splitlines when you read your HTML documents, you break the tags into a list of strings.
Instead, you should read all the HTML into a single string.
import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read()
        html_dict.update({website_id: get_raw_html})
Then remove the inner for loop, so you won't iterate over that string.
scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
BeautifulSoup can handle newlines inside tags; let me give you an example:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))
[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]
But if you split one tag across multiple BeautifulSoup objects:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))
[<p>Hey, this should be scraped</p>]
[]
I'm attempting to clean some html by parsing it through BeautifulSoup using Python 2.
BeautifulSoup parses the raw_html which is associated with a website_id in the html_dict. It also removes any attribute associated with the html tags (<a>, <b>, and <p>).
from bs4 import BeautifulSoup

html_dict = {"l0000": ["<a href='some url'>test</a>", "lol", "<a><b>test</b></a>"], "l0001": ["<p>this is html</p>", "<p>this is html</p>"]}
clean_html = {}

for website_id, raw_html in html_dict.items():
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"])
        # Remove attributes from the html tags
        for i in scrape_selected_tags:
            i.attrs = {}
        print website_id, scrape_selected_tags
This outputs:
l0001 [<p>this is html</p>]
l0001 [<p>this is html</p>]
l0000 [<a>test</a>]
l0000 []
l0000 [<a><b>test</b></a>, <b>test</b>]
I have two questions:
1) The last output printed "test" twice. I assume this is because it is surrounded by both the <a> and <b> tags? How does one deal with child tags so that only <a><b>test</b></a> is output?
2) Given a unique website_id, how would one remove duplicates, so that there's only one occurrence of <p>this is html</p> for l0001? I know that scrape_selected_tags has a type of bs4.element.ResultSet and I'm not sure how to handle this and insert the new output in the same format as html_dict but in clean_html.
Thanks
1) Set the recursive argument to False. This selects only direct descendants and does not go deeper into the soup. The catch is that child tags keep their attributes, so you'll need one more loop to clean them.
2) Use a set (or a list comprehension) to keep only unique tags.
from bs4 import BeautifulSoup

html_dict = {
    "l0000": ["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"],
    "l0001": ["<p>this is html</p>", "<p>this is html</p>"]
}
clean_html = {}

for website_id, raw_html in html_dict.items():
    clean_html[website_id] = []
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"], recursive=False)
        for i in scrape_selected_tags:
            i.attrs = {}
        for i in [c for p in scrape_selected_tags for c in p.find_all()]:
            i.attrs = {}
        clean_tags = list(set(scrape_selected_tags + clean_html[website_id]))
        clean_html[website_id] = clean_tags

print(clean_html)
{'l0001': [<p>this is html</p>], 'l0000': [<a><b>test</b></a>, <a>test</a>]}
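One caveat: set() does not preserve insertion order, so the deduplicated tags can come out in any order. If the original order matters, here is a sketch of an order-preserving alternative for the last two lines of the loop, keying on the serialized markup (OrderedDict keeps this working on Python 2):

from collections import OrderedDict

# keep the first occurrence of each distinct tag, in the order seen
seen = OrderedDict()
for tag in clean_html[website_id] + list(scrape_selected_tags):
    seen.setdefault(str(tag), tag)
clean_html[website_id] = list(seen.values())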
I'm using beautifulsoup and want to extract all text from between two words on a webpage.
Ex, imagine the following website text:
This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
I want to pull out everything on the page that starts with "text" and ends with "bunch".
In this case I'd want only:
text of the webpage. It is just a string of a bunch
However, there's a chance there could be multiple instances of this on a page.
What is the best way to do this?
This is my current setup:
#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

urls = [
    "http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html"
]

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        # If the parent of the element is any of these, ignore it
        return False
    elif re.match('<!--.*-->', str(element)):
        # If the element is an HTML comment, ignore it
        return False
    else:
        # Otherwise, return True; these are the elements we need
        return True

for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    visible_texts = filter(visible, texts)
    # filter only returns those items in the sequence, texts, that return True.
    # We use those to build our final list.
    for line in visible_texts:
        print line
Since you're just searching the text, all you need is a regex:
import re
result = re.findall("text.*?bunch", text_from_web_page)
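One caveat worth adding (an assumption about the page text, not part of the answer above): by default . does not match newlines, so if "text" and "bunch" can be separated by a line break you may want the re.DOTALL flag:

import re

page_text = "This is the text of the webpage.\nIt is just a string of a bunch of stuff."

# re.DOTALL lets '.' match newlines; the non-greedy '.*?' keeps each match
# as short as possible, so multiple occurrences are found separately
matches = re.findall("text.*?bunch", page_text, flags=re.DOTALL)
print(matches)  # ['text of the webpage.\nIt is just a string of a bunch']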