I am very new to BeautifulSoup.
How would I extract the text of a paragraph from HTML source code, split it wherever there is a <br/>, and store it in an array so that each element is one chunk of the paragraph text (as split by the <br/> tags)?
For example, for the following paragraph:
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like it to be stored into the following array:
['Pancakes', 'A delicious type of food']
What I have tried is:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)
but this outputs an array with only one element:
['Pancakes A delicious type of food']
What is a way to code this so that I get an array containing the paragraph text split at every <br/> in the paragraph?
Try this:
from bs4 import BeautifulSoup, NavigableString
html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]
Update for deep recursion:
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
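(For the markup above this yields every text node, i.e. ['Pancakes', ' A ', 'delicious', ' type of food'], which is still not split on the <br/> tags; hence the next update.)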
Another update, for text split only by <br>:
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')

# Walk the direct children of the <p>: collect text, and turn each <br> into a newline.
text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            text += '\n'

# Split on the newlines inserted for the <br> tags and drop empty chunks.
result = [chunk.strip() for chunk in text.split('\n') if chunk.strip()]
print(result)
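With the markup above, result comes out as ['Pancakes', 'A delicious type of food'].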
I stumbled across this whilst having a similar issue. This was my solution...
A simple way is to replace the line
p[0] = p[0].getText()
with
p[0].getText('#').split('#')
Result is:
['Pancakes', ' A delicious type of food']
Obviously, choose a separator character (or characters) that won't appear in the text.
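A minimal sketch of the same idea, using the simpler markup from the question's own attempt; get_text() also takes strip=True, which trims whitespace around each chunk before joining:
from bs4 import BeautifulSoup

html = "<p>Pancakes<br/> A delicious type of food<br/></p>"
p = BeautifulSoup(html, "html.parser").find("p")

# join the stripped text nodes with '#', then split on '#' to recover the chunks
chunks = p.get_text("#", strip=True).split("#")
print(chunks)  # ['Pancakes', 'A delicious type of food']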
I have an HTML snippet (no other parent elements):
html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
How do I extract all the tags and text (which may be variable) within the div, but not the div tag itself? I.e.
target_str = '<p>Hello</p><p>Goodbye</p>[...]'
I have tried:
soup = BeautifulSoup(html , 'html.parser')
mydiv = soup.find(id='mydiv')
print(mydiv)
>>> '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
mydiv.unwrap()
print(mydiv)
>>> '<div id="mydiv"></div>'
How do I get just the contents of the tag?
Try:
from bs4 import BeautifulSoup
html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p>[...]</div>'
soup = BeautifulSoup(html, "html.parser")
print("".join(map(str, soup.select_one("#mydiv").contents)))
Prints:
<p>Hello</p><p>Goodbye</p>[...]
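An alternative sketch, assuming bs4's Tag.decode_contents(), which serializes only a tag's children (roughly the innerHTML):
from bs4 import BeautifulSoup

html = '<div id="mydiv"><p>Hello</p><p>Goodbye</p></div>'
soup = BeautifulSoup(html, "html.parser")

# decode_contents() renders the children of the tag, not the tag itself
print(soup.find(id="mydiv").decode_contents())  # <p>Hello</p><p>Goodbye</p>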
So I have an HTML document to which I want to add HTML anchor tags so that I can easily jump to a certain part of the webpage.
The first step is to find all divs that need to be replaced. Secondly, an anchor tag needs to be added, based on the text within the div. My code looks as follows:
from bs4 import BeautifulSoup

path = "/text.html"
with open(path) as fp:
    soup = BeautifulSoup(fp, 'html.parser')

mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        string = div.getText().split()[1]
        div_id = f"""<a id="{string}"></a>{div}"""
        full = f"{div_id}{div}"
        html_soup = BeautifulSoup(full, 'html.parser')
        div = html_soup
A div looks as follows:
<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
After adding the anchor tag it becomes:
<a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 10.6 Inwerkingtreding</strong></p><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
But the problem is that div is not replaced by the new div. How should I correct this? Or is there another way to insert an anchor tag?
I'm not quite sure what your expected output to look like, but BeautifulSoup has methods to create new tags and attributes, and insert them into the soup object.
from bs4 import BeautifulSoup

fp = '<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong>'
soup = BeautifulSoup(fp, 'html.parser')
print('soup before: ', soup)

mydivs = soup.find_all("p", {"class": "tussenkop"})
for div in mydivs:
    if "Artikel" in div.getText():
        a_string = div.getText().split()[1]
        new_tag = soup.new_tag("a")
        new_tag['id'] = f'{a_string}'
        div.insert_before(new_tag)

print('soup after: ', soup)
Output:
soup before: <p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
soup after: <a id="7.37"></a><p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>
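A small variant (my own sketch, not part of the original answer): new_tag() also accepts attributes as keyword arguments, so the id can be set when the tag is created:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="tussenkop"><strong class="tussenkop_vet">Artikel 7.37 text text text</strong></p>', 'html.parser')

for p in soup.find_all("p", {"class": "tussenkop"}):
    if "Artikel" in p.get_text():
        # the second word of the text ("7.37") becomes the anchor id
        anchor = soup.new_tag("a", id=p.get_text().split()[1])
        p.insert_before(anchor)

print(soup)  # <a id="7.37"></a><p class="tussenkop">...</p>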
I want to remove <strong> tags from my HTML file in two cases:
First case:
<strong><strong>text1</strong> Some text.</strong>
What I want to do is remove the first (outer) strong tag, so the output will look like this:
<strong>text1</strong> Some text.
Second case: if the text between the strong tags is longer than 100 characters, I want the tag to be deleted.
Example:
<strong>Text that is over 100 characters </strong>
to become:
Text that is over 100 characters
Apparently BeautifulSoup supports the :has() CSS selector.
from bs4 import BeautifulSoup
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
data = '''
<strong><strong>text1</strong> Some text.</strong>
<strong>Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters Text that is over 100 characters</strong>
'''
soup = BeautifulSoup(data, 'html.parser')
selector = 'strong:has(strong)' # works in bs4.__version__ == 4.9.3
for e in soup.select(selector):
    e.unwrap()

for e in soup.select('strong'):
    if e.text and len(e.text) > 100:
        e.unwrap()
print(soup)
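With the sample data above, this prints <strong>text1</strong> Some text. followed by the long text with its tag removed, so both requested cases are handled.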
Regex is not a good choice when it comes to HTML. But maybe this code snippet helps you (adapted from here):
from bs4 import BeautifulSoup
data = """
<html>
<body>
<strong>This is a test</strong>
<strong>This is a very very very very very very very very very very very very very very very very very very very very very very very very very very long test</strong>
</body>
</html>
"""
soup = BeautifulSoup(data, 'html.parser')
for strong in soup.findAll('strong'):
    if len(strong.text) > 100:
        # replace the <strong> tag with its contents
        strong.unwrap()
print(soup)
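Here only the second <strong> is longer than 100 characters, so the printed output keeps <strong>This is a test</strong> but shows the long sentence without its tag.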
I am trying to scrape lists from Wikipedia pages (like this one for example: https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Sk%C3%A1lholt) in a particular format. I am encountering issues getting 'li' and 'a href' to match up.
For example, from the above page, the ninth bullet has text:
1238–1268: Sigvarður Þéttmarsson (Norweger)
with HTML:
<li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson">Sigvarður Þéttmarsson</a> (Norweger)</li>
I want to pull it together as a dictionary:
'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'
[entire text of the 'li', including its 'a' child]: [href of the 'a' child]
I know I can use lxml/etree to do this, but I'm not entirely sure how. Some recombination of the below?
from lxml import etree
tree = etree.HTML(html)
bishops = tree.cssselect('li')
text = [li.text for li in bishops]
links = tree.cssselect('li a')
hrefs = [bishop.get('href') for bishop in links]
Update: I have figured this out using BeautifulSoup as follows:
from bs4 import BeautifulSoup

# html obtained from a browser driver elsewhere (e.g. Selenium's page_source)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

bishops_with_links = {}
bishops = soup.select('li')

for bishop in bishops:
    if bishop.findChildren('a'):
        bishops_with_links[bishop.text] = 'https://de.wikipedia.org' + bishop.a.get('href')
    else:
        bishops_with_links[bishop.text] = ''

print(bishops_with_links)
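A self-contained sketch of the same idea on just the example <li> shown above (no browser driver needed for the illustration):
from bs4 import BeautifulSoup

html = ('<li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson">'
        'Sigvarður Þéttmarsson</a> (Norweger)</li>')
soup = BeautifulSoup(html, 'html.parser')

bishops_with_links = {}
for li in soup.select('li'):
    a = li.find('a')
    # key: full text of the li (link text included); value: absolute URL, or '' if there is no link
    bishops_with_links[li.get_text()] = 'https://de.wikipedia.org' + a.get('href') if a else ''

print(bishops_with_links)
# {'1238–1268: Sigvarður Þéttmarsson (Norweger)': 'https://de.wikipedia.org/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'}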
Is it possible to set markup as tag content (akin to setting innerHtml in JavaScript)?
For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated with a comma:
soup = BeautifulSoup(<<some document here>>)
a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings
div = soup.new_tag("div")
a_str = ",".join(a_tags)
Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with
<div>&lt;a&gt;1&lt;/a&gt;,&lt;a&gt;2&lt;/a&gt;, ... </div>
BeautifulSoup(a_str) wraps this in <html>, and I see getting the tree out of it as an inelegant hack.
What to do?
You need to create a BeautifulSoup object out of your HTML string containing links:
from bs4 import BeautifulSoup

soup = BeautifulSoup('', 'html.parser')
div = soup.new_tag('div')

a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>", "<a>4</a>", "<a>5</a>"]
a_str = ",".join(a_tags)

# parse the joined string and append the resulting fragment into the new <div>
div.append(BeautifulSoup(a_str, 'html.parser'))
soup.append(div)

print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
Alternative solution:
For each link create a Tag and append it to div. Also, append a comma after each link except the last:
from bs4 import BeautifulSoup

soup = BeautifulSoup('', 'html.parser')
div = soup.new_tag('div')

for x in range(1, 6):
    link = soup.new_tag('a')
    link.string = str(x)
    div.append(link)

    # do not append a comma after the last element
    if x != 5:
        div.append(",")

soup.append(div)
print(soup)
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>