Remove <strong> tag inside HTML file - Python

I want to remove <strong> tags from my HTML file in two cases:
First case:
<strong><strong>text1</strong> Some text.</strong>
What I want to do is remove the outer <strong> tag, so the output looks like this:
<strong>text1</strong> Some text.
Second case: if the text between the <strong> tags is longer than 100 characters, I want the tag to be removed.
Example:
<strong>Text that is over 100 characters </strong>
should become:
Text that is over 100 characters

Apparently BeautifulSoup supports the :has() CSS selector.
from bs4 import BeautifulSoup
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
data = '''
<strong><strong>text1</strong> Some text.</strong>
<strong>Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters</strong>
'''
soup = BeautifulSoup(data, 'html.parser')
selector = 'strong:has(strong)' # works in bs4.__version__ == 4.9.3
for e in soup.select(selector):
    e.unwrap()
for e in soup.select('strong'):
    if e.text and len(e.text) > 100:
        e.unwrap()
print(soup)
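If your installed bs4 predates CSS :has() support, the first pass can be done without selectors at all. A minimal sketch of the same nesting check in plain BeautifulSoup calls:

```python
from bs4 import BeautifulSoup

data = '<strong><strong>text1</strong> Some text.</strong>'
soup = BeautifulSoup(data, 'html.parser')
# unwrap any <strong> that contains another <strong>; the inner
# tag survives because it no longer has a <strong> descendant
for outer in soup.find_all('strong'):
    if outer.find('strong') is not None:
        outer.unwrap()
print(soup)  # <strong>text1</strong> Some text.
```

Elements already collected by find_all stay valid after unwrap(), so iterating the list while unwrapping is safe here.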

Regex is not a good choice when it comes to HTML, but maybe this code snippet helps you (adapted from here):
from bs4 import BeautifulSoup
data = """
<html>
<body>
<strong>This is a test</strong>
<strong>This is a very very very very very very very very very very very very very very very very very very very very very very very very very very long test</strong>
</body>
</html>
"""
soup = BeautifulSoup(data, 'html.parser')
for strong in soup.find_all('strong'):
    if len(strong.text) > 100:
        # replace the <strong> tag with its contents
        strong.unwrap()
print(soup)


Replacing a bs4 element with a string

So I have an HTML document to which I want to add anchor link tags, so that I can easily jump to a certain part of the page.
The first step is to find all divs that need to be replaced. Secondly, an anchor link tag needs to be added, based on the text within the div. My code looks as follows:
from bs4 import BeautifulSoup
path = "/text.html"
with open(path) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        string = div.getText().split()[1]
        div_id = f"""<a id="{string}"></a>{div}"""
        full = f"{div_id}{div}"
        html_soup = BeautifulSoup(full, 'html.parser')
        div = html_soup
A div looks as follows:
<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
After adding the anchor tag it becomes:
<a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 10.6 Inwerkingtreding</strong></p><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
But the problem is, div is not replaced by the new div. How should I correct this? Or is there another way to insert an anchor tag?
I'm not quite sure what you expect the output to look like, but BeautifulSoup has methods to create new tags and attributes and insert them into the soup object.
from bs4 import BeautifulSoup
fp = '<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong>'
soup = BeautifulSoup(fp, 'html.parser')
print('soup before: ', soup)
mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        a_string = div.getText().split()[1]
        new_tag = soup.new_tag("a")
        new_tag['id'] = f'{a_string}'
        div.insert_before(new_tag)
print('soup after: ', soup)
Output:
soup before: <p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
soup after: <a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
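For completeness: in the question's loop, `div = html_soup` only rebinds the local variable and never touches the tree. To get new markup into the document you have to mutate the soup, for example by parsing the anchor markup separately and moving the resulting tag in. A sketch on the same sample input:

```python
from bs4 import BeautifulSoup

html = '<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>'
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p', class_='tussenkop'):
    if 'Artikel' in p.get_text():
        number = p.get_text().split()[1]
        # parse the anchor on its own, then move the <a> tag into the main tree
        fragment = BeautifulSoup(f'<a id="{number}"></a>', 'html.parser')
        p.insert_before(fragment.a)
print(soup)  # <a id="7.37"></a><p class="tussenkop">...</p>
```

insert_before detaches `fragment.a` from its throwaway soup and attaches it to the main tree, which is exactly the mutation the rebinding version was missing.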

BeautifulSoup: finding nested tag

I am quite stuck with this:
<span>Alpha<span class="class_xyz">Beta</span></span>
I am trying to scrape only the first span text "Alpha" (excluding the second nested "Beta").
How would you do that?
I am trying to write a function to find all the span tags without a class attribute, but something is not working...
Thanks.
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt, 'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
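If you would rather not destroy part of the tree, the outer span's direct text child can be read without touching "Beta" at all. A small non-destructive sketch:

```python
from bs4 import BeautifulSoup

html = '<span>Alpha<span class="class_xyz">Beta</span></span>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('span')
# recursive=False limits the search to direct children, so the
# nested <span class="class_xyz"> is never entered
direct_text = outer.find(string=True, recursive=False)
print(direct_text)  # Alpha
```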
Here is another way that gets the text of every span tag without a class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select("span[class]"):
    tag.decompose()
out = []
for span in soup.select("span"):
    out.append(span.text.strip())
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or if you want the whole span tag:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select("span[class]"):
    tag.decompose()
out = soup.select("span")
print(out)
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]

BeautifulSoup Extract Text from a Paragraph and Split Text by <br/>

I am very new to BeautifulSoup.
How would I be able to extract the text of a paragraph from HTML source code, split the text wherever there is a <br/>, and store it in an array such that each element is a chunk of the paragraph text (as split by the <br/> tags)?
For example, for the following paragraph:
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like it to be stored into the following array:
['Pancakes', 'A delicious type of food']
What I have tried is:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>", 'html.parser')
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)
but this outputs an array with only one element:
['Pancakes A delicious type of food']
What is a way to code it so that I can get an array that contains the paragraph text split by any <br/> in the paragraph?
Try this:
from bs4 import BeautifulSoup, NavigableString
html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]
print(result)
Update, for deep recursion:
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
Update again, for text split only by <br>:
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            text += '\n'
# strip each segment only after splitting, so spaces inside a segment survive
result = [s.strip() for s in text.split('\n') if s.strip()]
print(result)
I stumbled across this whilst having a similar issue. This was my solution...
A simple way is to replace the line
p[0] = p[0].getText()
with
p[0].getText('#').split('#')
Result is:
['Pancakes', ' A delicious type of food']
Obviously, choose a character (or characters) that won't appear in the text.
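The sentinel can be avoided entirely: get_text() takes the separator as its first argument, so you can split on something unambiguous like a newline and drop empty pieces. A sketch on the asker's flat sample (note that with nested tags the separator is inserted at every tag boundary, not only at <br/>):

```python
from bs4 import BeautifulSoup

html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')
# '\n' is inserted between the text pieces; strip and drop empty chunks
parts = [s.strip() for s in p.get_text('\n').split('\n') if s.strip()]
print(parts)  # ['Pancakes', 'A delicious type of food']
```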

Beautifulsoup remove pages numbers at bottom

I'm trying to remove the page numbers from this html. It seems to follow the pattern '\n','number','\n' if you look at the list texts. Would I be able to do it with BeautifulSoup? If not, how do I remove that pattern from the list?
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True
url='https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
texts = soup.findAll(text=True)
### could remove ['\n','number','\n']
visible_texts = filter(tag_visible, texts)
You can try to extract tags containing page numbers from soup before getting text.
soup = BeautifulSoup(html.text, 'html.parser')
for hr in soup.select('hr'):
    hr.find_previous('p').extract()
texts = soup.findAll(text=True)
This extracts the page-number tags, which look like this:
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">57</p>
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">58</p>
... etc.
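If you prefer to clean the extracted list rather than the tree, the standalone page numbers can also be filtered out after the fact. A sketch, assuming the page markers are digit-only strings surrounded by newlines as described in the question:

```python
# sample of the extracted pattern described above: '\n', 'number', '\n'
texts = ['Revenue grew this quarter.', '\n', '57', '\n', 'Risk factors follow.', '\n', '58', '\n']
# keep strings that contain something other than whitespace or a bare number
visible = [t for t in texts if t.strip() and not t.strip().isdigit()]
print(visible)  # ['Revenue grew this quarter.', 'Risk factors follow.']
```

This would also drop legitimate digit-only strings (e.g. a table cell containing "57"), so scope it to the trailing page-marker region if that matters for your document.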

Append markup string to a tag in BeautifulSoup

Is it possible to set markup as tag content (akin to setting innerHtml in JavaScript)?
For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated with a comma:
soup = BeautifulSoup(<<some document here>>)
a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings
div = soup.new_tag("div")
a_str = ",".join(a_tags)
Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with
<div>&lt;a&gt;1&lt;/a&gt; ... </div>
BeautifulSoup(a_str) wraps this in <html>, and I see getting the tree out of it as an inelegant hack.
What to do?
You need to create a BeautifulSoup object out of your HTML string containing links:
from bs4 import BeautifulSoup
soup = BeautifulSoup('', 'html.parser')
div = soup.new_tag('div')
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>", "<a>4</a>", "<a>5</a>"]
a_str = ",".join(a_tags)
div.append(BeautifulSoup(a_str, 'html.parser'))
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
Alternative solution:
For each link create a Tag and append it to div. Also, append a comma after each link except last:
from bs4 import BeautifulSoup
soup = BeautifulSoup('', 'html.parser')
div = soup.new_tag('div')
for x in range(1, 6):
    link = soup.new_tag('a')
    link.string = str(x)
    div.append(link)
    # do not append a comma after the last element (x runs from 1 to 5)
    if x != 5:
        div.append(",")
soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
