Trying to get my head around html construction with BS.
I'm trying to insert a new tag:
self.new_soup.body.insert(3, """<div id="file_history"></div>""")
when I check the result, I get:
&lt;div id="file_history"&gt;&lt;/div&gt;
So I'm inserting a string that is being sanitised into web-safe HTML.
What I expect to see is:
<div id="file_history"></div>
How do I insert a new div tag in position 3 with the id file_history?
See the documentation on how to append a tag:
soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Use the factory method to create new elements:
new_tag = self.new_soup.new_tag('div', id='file_history')
and insert it:
self.new_soup.body.insert(3, new_tag)
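To see the difference between the two kinds of insert side by side, here is a minimal sketch. The four-paragraph document is a made-up stand-in for self.new_soup, and html.parser is chosen only so that no extra tags get added:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for self.new_soup: a body with four child tags
soup = BeautifulSoup(
    "<html><body><p>one</p><p>two</p><p>three</p><p>four</p></body></html>",
    "html.parser",
)

# A Tag object is inserted as real markup
new_tag = soup.new_tag("div", id="file_history")
soup.body.insert(3, new_tag)

# A plain string is inserted as text, so its angle brackets are escaped on output
soup.body.insert(0, "<b>")

body_markup = str(soup.body)
print(body_markup)
```

The div ends up as a real element just before the fourth paragraph, while the string shows up as &lt;b&gt; in the output.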
Other answers are straight off from the documentation. Here is the shortcut:
from bs4 import BeautifulSoup
temp_soup = BeautifulSoup('<div id="file_history"></div>', 'lxml')
# With the lxml parser, BeautifulSoup automatically adds <html> and <body> tags
# There is only one 'div' tag, so it's the only member in the 'contents' list
div_tag = temp_soup.html.body.contents[0]
# Or more simply
div_tag = temp_soup.html.body.div
your_new_soup.body.insert(3, div_tag)
Related
I am quite stuck with this:
<span>Alpha<span class="class_xyz">Beta</span></span>
I am trying to scrape only the first span text "Alpha" (excluding the second nested "Beta").
How would you do that?
I am trying to write a function to find all the Span tags without a class attribute, but something is not working...
Thanks.
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt,'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
Here is another way that gets the text of every span tag without a class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# Remove every span that carries a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()
# The remaining spans now hold only their own text
out = [tag.text.strip() for tag in soup.select("span")]
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or if you want the whole span tag:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# Remove every span that carries a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()
out = soup.select("span")
print(out)
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
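Both versions above modify the tree by decomposing the nested spans. If the soup is needed afterwards, a non-destructive variant is possible. This is only a sketch: the helper name is_classless_span is made up for illustration, and it assumes that only the text sitting directly inside each class-less span is wanted:

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def is_classless_span(tag):
    # Hypothetical helper: matches <span> tags that have no class attribute
    return tag.name == "span" and not tag.has_attr("class")


own_text = []
for span in soup.find_all(is_classless_span):
    # recursive=False restricts the search to strings that are direct children
    direct_strings = span.find_all(string=True, recursive=False)
    own_text.append("".join(direct_strings).strip())

print(own_text)
```

The nested spans stay in the soup, so "Beta", "Delta" and "Zeta" are still available later if needed.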
I'm using bs4 to parse XML. The find method extracts the "w:tl" tag as I want, but find_all(name="w:tl") returns an empty list, while find_all(lambda e: e.name == "w:tl") returns what I expect, i.e. every "w:tl" tag. Strangely, find_all('w') works fine. Why is that?
from bs4 import BeautifulSoup
openxml = '''
<head>
<p id='p1'>12</p>
<w:tl class='p3'>12</w:tl>
<w:tl class='p3'>11</w:tl>
<w:tl>11</w:tl>
<w:tl>11</w:tl>
<w>11</w>
<w>11</w>
<p></p>
</head>
'''
url_soup = BeautifulSoup(openxml,'lxml')
# #
url_soup.find_all(lambda e: e.name == "w:tl")
url_soup.find_all(name="w:tl")
url_soup.find("w:tbl")
url_soup.find_all(name="w")
I'm attempting to clean some html by parsing it through BeautifulSoup using Python 2.
BeautifulSoup parses the raw_html which is associated with a website_id in the html_dict. It also removes any attribute associated with the html tags (<a>, <b>, and <p>).
from bs4 import BeautifulSoup

html_dict = {"l0000": ["<a href='some url'>test</a>", "lol", "<a><b>test</b></a>"], "l0001":["<p>this is html</p>", "<p>this is html</p>"]}
clean_html = {}
for website_id, raw_html in html_dict.items():
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"])
        # Remove attributes from html tag
        for tag in scrape_selected_tags:
            tag.attrs = {}
        print website_id, scrape_selected_tags
This outputs:
l0001 [<p>this is html</p>]
l0001 [<p>this is html</p>]
l0000 [<a>test</a>]
l0000 []
l0000 [<a><b>test</b></a>, <b>test</b>]
I have two questions:
1) The last output has outputted "test" twice. I assume this is because it is surrounded by both the <a> and <b> tags? How does one deal with child-tags to output <a><b>test</b></a> only?
2) Given a unique website_id, how would one remove duplicates, so that there's only one occurrence of <p>this is html</p> for l0001? I know that scrape_selected_tags has a type of bs4.element.ResultSet and I'm not sure how to handle this and insert the new output in the same format as html_dict but in clean_html.
Thanks
1) Set the recursive argument to False. This selects only direct children and does not go deeper into the soup. The drawback of this method is that child tags keep their attributes, so you need one more loop to clean them.
2) Use a set (or you could use list comprehensions) to select only unique tags.
from bs4 import BeautifulSoup
html_dict = {
    "l0000": ["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"],
    "l0001": ["<p>this is html</p>", "<p>this is html</p>"]
}
clean_html = {}
for website_id, raw_html in html_dict.items():
    clean_html[website_id] = []
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        # Only top-level tags, so <b> inside <a> is not reported separately
        scrape_selected_tags = soup.find_all(["a", "b", "p"], recursive=False)
        for tag in scrape_selected_tags:
            tag.attrs = {}
        # Clean the attributes of the nested tags as well
        for tag in [c for p in scrape_selected_tags for c in p.find_all()]:
            tag.attrs = {}
        # A set keeps only one copy of each identical tag
        clean_tags = list(set(scrape_selected_tags + clean_html[website_id]))
        clean_html[website_id] = clean_tags
print(clean_html)
{'l0001': [<p>this is html</p>], 'l0000': [<a><b>test</b></a>, <a>test</a>]}
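A set does not keep the original order of the fragments. If order matters, one possible variation is to deduplicate on the serialized markup instead. This is a sketch only: the function name clean_fragments and its allowed parameter are invented for illustration, and it returns strings rather than Tag objects:

```python
from bs4 import BeautifulSoup


def clean_fragments(fragments, allowed=("a", "b", "p")):
    """Strip attributes from top-level allowed tags and drop duplicates, keeping order."""
    seen = set()
    cleaned = []
    for fragment in fragments:
        soup = BeautifulSoup(fragment, "html.parser")
        # recursive=False: only top-level tags, nested ones are not listed separately
        for tag in soup.find_all(list(allowed), recursive=False):
            tag.attrs = {}
            # find_all(True) walks every nested tag so their attributes go too
            for child in tag.find_all(True):
                child.attrs = {}
            markup = str(tag)
            if markup not in seen:
                seen.add(markup)
                cleaned.append(markup)
    return cleaned


html_dict = {
    "l0000": ["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"],
    "l0001": ["<p>this is html</p>", "<p>this is html</p>"],
}
clean_html = {site: clean_fragments(frags) for site, frags in html_dict.items()}
print(clean_html)
```

Each website_id then maps to a list of unique markup strings in first-seen order.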
Is it possible to set markup as tag content (akin to setting innerHtml in JavaScript)?
For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated with a comma:
soup = BeautifulSoup(<<some document here>>)
a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings
div = soup.new_tag("div")
a_str = ",".join(a_tags)
Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with
<div> &lt;a1&gt; 1 &lt;/a&gt; ... </div>
BeautifulSoup(a_str) wraps this in <html>, and I see getting the tree out of it as an inelegant hack.
What to do?
You need to create a BeautifulSoup object out of your HTML string containing links:
from bs4 import BeautifulSoup
soup = BeautifulSoup()
div = soup.new_tag('div')
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>", "<a>4</a>", "<a>5</a>"]
a_str = ",".join(a_tags)
div.append(BeautifulSoup(a_str, 'html.parser'))
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
Alternative solution:
For each link create a Tag and append it to div. Also, append a comma after each link except last:
from bs4 import BeautifulSoup
soup = BeautifulSoup()
div = soup.new_tag('div')
last = 5
for x in range(1, last + 1):
    link = soup.new_tag('a')
    link.string = str(x)
    div.append(link)
    # do not append comma after the last element
    if x != last:
        div.append(",")
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
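For something closer to assigning innerHTML, the first approach can be wrapped in a small helper. This is a sketch under assumptions: set_inner_html is a made-up name, not part of BeautifulSoup, and html.parser is used because it does not wrap the fragment in <html> or <body>:

```python
from bs4 import BeautifulSoup


def set_inner_html(tag, markup):
    """Replace the contents of tag with the parsed markup (hypothetical helper)."""
    tag.clear()
    fragment = BeautifulSoup(markup, "html.parser")
    # Copy the list first: appending a node moves it out of the fragment
    for child in list(fragment.contents):
        tag.append(child)


soup = BeautifulSoup("<div>old content</div>", "html.parser")
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>"]
set_inner_html(soup.div, ",".join(a_tags))

result = str(soup.div)
print(result)
```

The old content is dropped and the links become real child elements of the div, with the commas kept as text between them.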