I am quite stuck with this:
<span>Alpha<span class="class_xyz">Beta</span></span>
I am trying to scrape only the first span text "Alpha" (excluding the second nested "Beta").
How would you do that?
I am trying to write a function to find all the span tags without a class attribute, but something is not working.
Thanks.
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt,'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
Here is another way, which gets the text of every span tag without a class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# Remove every nested span that carries a class attribute
for tag in soup.select("span[class]"):
    tag.decompose()
# Collect the text of the remaining (class-less) spans
out = []
for tag in soup.select("span"):
    out.append(tag.text.strip())
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or if you want the whole span tag:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
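soup = BeautifulSoup(html, "html.parser")
# Remove every nested span that carries a class attribute
for tag in soup.select("span[class]"):
    tag.decompose()
# What is left are the outer spans, now without their nested children
out = soup.select("span")
print(out)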
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
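Both answers above modify the soup by calling decompose(). If the tree has to stay intact, a non-destructive variant is possible. The following is only a sketch: the helper names span_without_class and own_text are made up for illustration, and it assumes the wanted text is a direct child of the outer span.

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def span_without_class(tag):
    # Function filter for find_all(): span tags that have no class attribute
    return tag.name == "span" and not tag.has_attr("class")


def own_text(tag):
    # Join only the strings that are direct children of the tag,
    # so text inside nested tags is ignored
    return "".join(tag.find_all(string=True, recursive=False)).strip()


result = [own_text(span) for span in soup.find_all(span_without_class)]
print(result)
```

Because nothing is removed, the nested spans with "Beta" and "Delta" are still in the soup afterwards.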
Related
I'm looking for a way to extract only the tags that don't have another tag inside them.
For example:
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('p')
Gives
[<p><a href="XYZ">Text1</a></p>,
<p>Text2</p>,
<p><a href="QWERTY">Text3</a></p>,
<p>Text4</p>]
This is what I want to achieve:
[<p>Text2</p>,
<p>Text4</p>]
You can filter Tags without other tags in them as follows:
for tag in soup.find_all('p'):
    # next_element is the first node inside the tag; keep the tag if that node is plain text
    if isinstance(tag.next_element, str):
        print(tag)
Which returns
<p>Text2</p>
<p>Text4</p>
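One caveat with this check: it only looks at the first node inside each p, so a paragraph that starts with text and then contains a tag still passes. A small sketch of the difference, using a made-up mixed-content paragraph and p.find(True) as a stricter test (both lists are named here just for illustration):

```python
from bs4 import BeautifulSoup

html = (
    "<p><a href='XYZ'>Text1</a></p>"
    "<p>Text2</p>"
    "<p>Text6: <a href='QWERTY'>Text5</a></p>"
)
soup = BeautifulSoup(html, "html.parser")

# Check used above: is the first node inside the <p> a string?
first_node_is_text = [
    p for p in soup.find_all("p") if isinstance(p.next_element, str)
]

# Stricter check: the <p> contains no tag at any depth
no_child_tags = [p for p in soup.find_all("p") if p.find(True) is None]

print([p.get_text() for p in first_node_is_text])
print([p.get_text() for p in no_child_tags])
```

The first list still includes the "Text6: " paragraph even though it holds an a tag, while the second list keeps only the paragraph with plain text.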
I would simply filter afterwards on the number of child tags: if a p contains no other tags, find_all() returns an empty list and the p is kept; otherwise it is filtered out:
for x in soup.find_all('p'):
    if len(x.find_all()) == 0:
        print(x)
Returns only:
<p>Text2</p>
<p>Text4</p>
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
<p>Text6: <a href='QWERTY'>Text5</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')
def p_tag_with_only_strings_as_children(tag):
    return tag.name == "p" and all(isinstance(x, str) for x in tag.children)
result = soup.find_all(p_tag_with_only_strings_as_children)
print(result)
Result:
[<p>Text2</p>, <p>Text4</p>]
See the BeautifulSoup documentation on using function filters with .find_all().
Credit for checking the types within a list goes to https://stackoverflow.com/a/32705845/5288820.
Or use CSS-Selectors:
https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
<p>Text6: <a href='QWERTY'>Text5</a></p>
""".replace('\n',"")
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:not(:has(*))'))
#or in case you only want to filter out "a" tags:
print(soup.select('p:not(:has(a))'))
Result:
[<p>Text2</p>, <p>Text4</p>]
Here is my code:
<h1 onclick="alert('Hello')">Click</h1>
I want to do this:
<h1>Click</h1>
Solution:
from bs4 import BeautifulSoup
html = """<h1 onclick="alert('Hello')">Click</h1>"""
a = BeautifulSoup(html,'html.parser')
for tag in a.find_all():
    if 'onclick' in tag.attrs:
        tag.attrs.pop('onclick')
print(a)
# <h1>Click</h1>
I think you want something as follows:
from bs4 import BeautifulSoup
html = '<h1 onclick="alert(\'Hello\')">Click</h1>'
soup = BeautifulSoup(html, 'lxml')
print(soup)
# <html><body><h1 onclick="alert('Hello')">Click</h1></body></html>
# get h1
h1 = soup.h1
print(h1.attrs)
# {'onclick': "alert('Hello')"}
# so, we simply want to reset the attrs to an empty dict:
h1.attrs = {}
print(h1)
# <h1>Click</h1>
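Resetting attrs to an empty dict also drops attributes you may want to keep, such as id or class. If the goal is to strip only inline event handlers, a possible sketch is below. It assumes that every attribute whose name starts with "on" is an event handler, and the sample markup with onmouseover and id is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """<div onmouseover="track()"><h1 onclick="alert('Hello')" id="title">Click</h1></div>"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(True):
    # Iterate over a copy of the names, because attributes are deleted in the loop
    for name in list(tag.attrs):
        if name.lower().startswith("on"):
            del tag[name]

cleaned = str(soup)
print(cleaned)
```

The id attribute survives, while onclick and onmouseover are removed.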
I need help scraping a link to an image that is stored in the onclick= value.
I got this far, but I am stuck on how to remove everything in onclick except the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );
But only a link is needed.
You just need to change your regular expression.
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
    print(pattern.search(a['onclick']).group('href'))
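If you already have the onclick string (for example from links.get("onclick") in the question), the same pattern can be applied to it directly. A minimal sketch, where extract_url is a hypothetical helper name and the function returns None when there is no value or no quoted URL:

```python
import re

# Same idea as the pattern above: a quoted http(s) URL,
# closed by the same quote character that opened it
URL_IN_QUOTES = re.compile(r"""(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)""")


def extract_url(onclick_value):
    # Return the first quoted URL in the onclick value, or None
    if not onclick_value:
        return None
    match = URL_IN_QUOTES.search(onclick_value)
    return match.group("href") if match else None


onclick = (
    "ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/"
    "794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"
)
url = extract_url(onclick)
print(url)
```

Checking for None before calling .group() avoids an AttributeError on links whose onclick holds no URL.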
Trying to get my head around html construction with BS.
I'm trying to insert a new tag:
self.new_soup.body.insert(3, """<div id="file_history"></div>""")
when I check the result, I get:
&lt;div id="file_history"&gt;&lt;/div&gt;
So I'm inserting a string that is being sanitised into web-safe HTML.
What I expect to see is:
<div id="file_history"></div>
How do I insert a new div tag in position 3 with the id file_history?
See the documentation on how to append a tag:
soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Use the factory method to create new elements:
new_tag = self.new_soup.new_tag('div', id='file_history')
and insert it:
self.new_soup.body.insert(3, new_tag)
The other answers are straight from the documentation. Here is a shortcut:
from bs4 import BeautifulSoup
temp_soup = BeautifulSoup('<div id="file_history"></div>', 'lxml')
# With the lxml parser, BeautifulSoup automatically adds <html> and <body> tags
# There is only one 'div' tag, so it's the only member in the 'contents' list
div_tag = temp_soup.html.body.contents[0]
# Or more simply
div_tag = temp_soup.html.body.div
your_new_soup.body.insert(3, div_tag)
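To see why the original attempt came out escaped, it may help to compare the two cases side by side. This is only a sketch with an invented four-paragraph body: a plain string passed to insert() becomes a text node and is escaped on output, while a Tag created with new_tag() is inserted as markup.

```python
from bs4 import BeautifulSoup

markup = "<body><p>one</p><p>two</p><p>three</p><p>four</p></body>"

# Case 1: inserting a plain string creates a text node, which is escaped on output
soup_with_string = BeautifulSoup(markup, "html.parser")
soup_with_string.body.insert(3, '<div id="file_history"></div>')
escaped = str(soup_with_string.body)
print(escaped)

# Case 2: inserting a Tag object creates a real element
soup_with_tag = BeautifulSoup(markup, "html.parser")
new_div = soup_with_tag.new_tag("div", id="file_history")
soup_with_tag.body.insert(3, new_div)
inserted = str(soup_with_tag.body)
print(inserted)
```

Note that the position counts every child of body, including whitespace strings, so in a pretty-printed document index 3 may not be the fourth tag.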