<> becomes &lt; &gt; in BeautifulSoup - python

Assume I have an item div, where div is a BeautifulSoup object (obtained by findAll). The source looks like:
<div>text1 <span>text2</span></div>
What I want to do is to replace text1 with text3. I tried:
div.string.replace_with(newstr), where newstr="text3 <span>text2</span>"
This does not work because div.string is None
div.replace_with(newstr)
This does not work because the final result shows &lt; and &gt; rather than "<" and ">" when I save the HTML code to a file.

You can find the div tag, take its next_element (the text node "text1 "), and replace it with text3:
from bs4 import BeautifulSoup
html= '''<div>text1 <span>text2</span></div>'''
soup = BeautifulSoup(html, 'lxml')
soup.find('div').next_element.replace_with('text3')
print(soup)

Just playing around with the interactive prompt... I'm sure there's a better solution but...
from bs4 import BeautifulSoup
data = '''<div>text1 <span>text2</span></div>'''
soup = BeautifulSoup(data, features="lxml")
div = soup.find('div')
a, *b = div.contents
c = a.replace('text1', 'text3')
a.replace_with(c)
print(div)
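The &lt; and &gt; in the question come from passing a plain string to replace_with: BeautifulSoup treats a str as text rather than markup and escapes it when serializing. Below is a minimal sketch of the difference, assuming the html.parser backend; the variable names are illustrative and not from the original posts.

```python
from bs4 import BeautifulSoup

new_markup = "text3 <span>text2</span>"

# Case 1: a plain str becomes a text node, so "<" and ">" are escaped on output.
escaped_soup = BeautifulSoup("<div>text1 <span>text2</span></div>", "html.parser")
escaped_soup.div.replace_with(new_markup)
print(escaped_soup)

# Case 2: parse the string first, then move the parsed nodes into the div.
parsed_soup = BeautifulSoup("<div>text1 <span>text2</span></div>", "html.parser")
div = parsed_soup.div
div.clear()
fragment = BeautifulSoup(new_markup, "html.parser")
for node in list(fragment.contents):  # copy: append() moves nodes out of fragment
    div.append(node)
print(parsed_soup)
```

If only the leading text needs to change, replacing that one text node (as in the answers above) is simpler than rebuilding the whole div.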

Related

How to get tag contents (including all text and elements)

I have a html snippet (no other parent elements):
html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
How do I extract all the tags and text (which may be variable) within the div, but not the div tag itself? I.e.:
target_str = '<p>Hello</p><p>Goodbye</p>[...]'
I have tried:
soup = BeautifulSoup(html , 'html.parser')
mydiv = soup.find(id='mydiv')
print(mydiv)
>>> '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
mydiv.unwrap()
print(mydiv)
>>> '<div id="mydiv"></div>'
How do I get just the contents of the tag?
Try:
from bs4 import BeautifulSoup
html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
soup = BeautifulSoup(html, "html.parser")
print("".join(map(str, soup.select_one("#mydiv").contents)))
Prints:
<p>Hello</p><p>Goodbye</p>[...]
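A note on why the attempt in the question printed an empty div: unwrap() removes the tag from the tree and leaves its children in place, so the mydiv variable ends up pointing at a detached, empty tag. As a hedged alternative to joining the children by hand, the sketch below uses Tag.decode_contents(), which renders the inner markup as a string; the shortened input is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p></div>'
soup = BeautifulSoup(html, "html.parser")
mydiv = soup.find(id="mydiv")

# Inner markup only, without the surrounding <div> tag.
inner_html = mydiv.decode_contents()

# Same result built by hand from the child nodes.
joined = "".join(str(child) for child in mydiv.contents)

print(inner_html)
```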

Replacing a bs4 element with a string

So I have a HTML document, where I want to add HTML anchor link tags so that I can easily go to a certain part of a webpage.
The first step is to find all divs that need to be replaced. Secondly, an anchor link tag needs to be added, based on the text that is within the div. My code looks as follows:
from bs4 import BeautifulSoup
path= "/text.html"
with open(path) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        string = div.getText().split()[1]
        div_id = f"""<a id="{string}"></a>{div}"""
        full =f"{div_id}{div}"
        html_soup = BeautifulSoup(full, 'html.parser')
        div = html_soup
A div looks as follows:
<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
After adding the anchor tag it becomes:
<a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 10.6 Inwerkingtreding</strong></p><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
But the problem is, div is not replaced by the new div. How should I correct this? Or is there another way to insert an anchor tag?
I'm not quite sure what your expected output should look like, but BeautifulSoup has methods to create new tags and attributes, and insert them into the soup object.
from bs4 import BeautifulSoup
fp = '<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong>'
soup = BeautifulSoup(fp, 'html.parser')
print('soup before: ', soup)
mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        a_string = div.getText().split()[1]
        new_tag = soup.new_tag("a")
        new_tag['id'] = f'{a_string}'
        div.insert_before(new_tag)
print('soup after: ', soup)
Output:
soup before: <p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
soup after: <a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
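The original loop had no effect because `div = html_soup` only rebinds the loop variable; nothing in the tree is touched. The sketch below wraps the insert_before approach in a helper and runs it over several headings. The function name add_article_anchors and the sample markup are hypothetical, chosen for illustration, and the extra word-count check is an added assumption so that headings without an article number are skipped.

```python
from bs4 import BeautifulSoup


def add_article_anchors(html):
    """Insert <a id="N"></a> before every 'Artikel N ...' heading paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p", class_="tussenkop"):
        words = p.get_text().split()
        # Only headings of the form "Artikel <number> ..." get an anchor.
        if len(words) >= 2 and words[0] == "Artikel":
            anchor = soup.new_tag("a", id=words[1])
            p.insert_before(anchor)  # modifies the tree in place
    return str(soup)


sample = (
    '<p class="tussenkop"><strong>Artikel 7.37 text</strong></p>'
    '<p class="tussenkop"><strong>Inleiding</strong></p>'
    '<p class="tussenkop"><strong>Artikel 10.6 Inwerkingtreding</strong></p>'
)
result = add_article_anchors(sample)
print(result)
```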

Got empty list with Beautiful Soup and Selenium

https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king
I want to get TOMATOMETER and AUDIENCE SCORE from that website,
but got an empty list.
soup = BeautifulSoup(html, 'html.parser')
notices = soup.select('#tomato_meter_link > span.mop-ratings-wrap__percentage')
You can use the :last-child selector on the span elements inside the parent class. This uses BeautifulSoup 4.7.1:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king')
soup = BeautifulSoup(res.content, 'lxml')
ratings = [item.text.strip() for item in soup.select('h1.mop-ratings-wrap__score span:last-child')]
print(ratings)
Your code works well:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get('https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king').text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> notices = soup.select('#tomato_meter_link > span.mop-ratings-wrap__percentage')
>>> notices
[<span class="mop-ratings-wrap__percentage">93%</span>]
How did you get the html variable?
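An empty list from select() means the selector matched nothing in the HTML that was actually parsed, so it can help to check the selector and the input separately. The sketch below runs the selector from the question against a small inline snippet; the snippet is hypothetical markup modeled on the output shown above, not a copy of the live page, and select_scores is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the <span> shown in the answer above.
sample_html = """
<a id="tomato_meter_link" href="#">
  <span class="mop-ratings-wrap__percentage"> 93% </span>
</a>
"""


def select_scores(html):
    """Return the text of every span matched by the selector from the question."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.select("#tomato_meter_link > span.mop-ratings-wrap__percentage")
    return [span.get_text(strip=True) for span in spans]


found = select_scores(sample_html)
missing = select_scores("<html><body></body></html>")
print(found, missing)
```

If the selector works on a known snippet but returns nothing for the real input, the html string most likely does not contain the element, for example because it was captured before the page finished rendering.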

Beautifulsoup: parsing html – get part of href

I'm trying to parse
<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>
for the 76561198134729239, and I can't figure out how to do it. What I tried:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td",
    {
        "class":"listtable_1",
        "target":"_blank"
    })
print(element.text)
There are many such entries in that HTML. To get all of them you could use the following:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")
for td in soup.findAll("td", class_="listtable_1"):
    for a in td.findAll("a", href=True, target="_blank"):
        print(a.text)
This would then return:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
"target":"_blank" is an attribute of the anchor tag a within the td tag. It's not an attribute of the td tag.
You can get it like so:
from bs4 import BeautifulSoup
html="""
<td height="16" class="listtable_1">
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
76561198134729239
</a>
</td>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)
Output:
76561198134729239
As others mentioned you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector:
soup.select_one("td.listtable_1 a[target=_blank]").get_text()
If you need to locate multiple elements this way, use select():
for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text())
"class":"listtable_1" belongs to the td tag and target="_blank" belongs to the a tag, so you should not use them together in one find().
You can use the "Steam Community" cell as an anchor and read the number in the cell after it.
Or use the URL: it contains the info you need and is easy to find, so you can locate the link and split its href by /:
import re
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)
Code:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)
out:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
You could chain together two finds in gazpacho to solve this problem:
from gazpacho import Soup
html = """<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>"""
soup = Soup(html)
soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text
This outputs:
'76561198134729239'
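Since the title asks for part of the href, here is a small offline sketch that reads the ID both from the link text and from the last path segment of the URL. The table wrapper and the "Steam Community" cell in the sample are assumptions based on the answers above, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Assumed row layout, pieced together from the snippets in the answers above.
html = """
<table><tr>
<td height="16" class="listtable_1">Steam Community</td>
<td height="16" class="listtable_1">
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
76561198134729239
</a>
</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: take the visible link text, stripped of surrounding whitespace.
ids_from_text = [
    a.get_text(strip=True)
    for a in soup.select('td.listtable_1 a[target="_blank"]')
]

# Option 2: take the last path segment of the href.
ids_from_href = [
    a["href"].rstrip("/").rsplit("/", 1)[-1]
    for a in soup.select('td.listtable_1 a[href*="steamcommunity.com/profiles/"]')
]

print(ids_from_text, ids_from_href)
```

Reading the href is the more robust of the two if the link text is ever shortened or replaced by a display name.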

Append markup string to a tag in BeautifulSoup

Is it possible to set markup as tag content (akin to setting innerHtml in JavaScript)?
For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated with a comma:
soup = BeautifulSoup(<<some document here>>)
a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings
div = soup.new_tag("div")
a_str = ",".join(a_tags)
Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with
<div> &lt;a&gt; 1 &lt;/a&gt; ... </div>
BeautifulSoup(a_str) wraps this in <html>, and I see getting the tree out of it as an inelegant hack.
What to do?
You need to create a BeautifulSoup object out of your HTML string containing links:
from bs4 import BeautifulSoup
soup = BeautifulSoup("", 'html.parser')
div = soup.new_tag('div')
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>", "<a>4</a>", "<a>5</a>"]
a_str = ",".join(a_tags)
div.append(BeautifulSoup(a_str, 'html.parser'))
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
Alternative solution:
For each link, create a Tag and append it to the div. Also append a comma after each link except the last:
from bs4 import BeautifulSoup
soup = BeautifulSoup("", 'html.parser')
div = soup.new_tag('div')
for x in range(1, 6):
    link = soup.new_tag('a')
    link.string = str(x)
    div.append(link)
    # do not append a comma after the last element (x == 5)
    if x != 5:
        div.append(",")
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
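For something closer to assigning innerHTML, the parse-then-append idea can be wrapped in a small helper. This is a sketch, not a BeautifulSoup API: set_inner_html is a hypothetical name, and it assumes the html.parser backend, which does not wrap fragments in <html> or <body> the way lxml does.

```python
from bs4 import BeautifulSoup


def set_inner_html(tag, markup):
    """Replace the children of `tag` with the nodes parsed from `markup`."""
    tag.clear()
    fragment = BeautifulSoup(markup, "html.parser")
    for node in list(fragment.contents):  # copy: append() moves nodes out of fragment
        tag.append(node)
    return tag


soup = BeautifulSoup("", "html.parser")
div = soup.new_tag("div")
soup.append(div)

set_inner_html(div, ",".join(f"<a>{i}</a>" for i in range(1, 6)))
print(soup)
```

Calling the helper again on the same tag overwrites the previous children, which mirrors how innerHTML assignment behaves.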
