Remove lines getting empty after BeautifulSoup decompose - python

I am trying to strip certain HTML tags and their content from a file with BeautifulSoup. How can I remove lines that get empty after applying decompose()? In this example, I want the line between a and 3 to be gone, as this is where the <span>...</span> block was, but not the line in the end.
from bs4 import BeautifulSoup
Rmd_data = 'a\n<span class="answer">\n2\n</span>\n3\n'
print(Rmd_data)
#OUTPUT
# a
# <span class="answer">
# 2
# </span>
# 3
#
# END OUTPUT
soup = BeautifulSoup(Rmd_data, "html.parser")
answers = soup.find_all("span", "answer")
for a in answers:
a.decompose()
Rmd_data = str(soup)
print(Rmd_data)
# OUTPUT
# a
#
# 3
#
# END OUTPUT

I'm surprised that BeatifulSoup does not offer a prettify() option. Instead of manipulating the html manually you could re-parse your html:
str(BeautifulSoup(str(soup), 'html.parser'))
As always, enjoy.

For removing empty lines most easy will be via re
import re
re.sub(r'[\n\s]+', r'\n', text, re.MULTLINE)

Related

I want to get p tag's text and other tag's text inside the p tag in order by bs4

I selected a content from soup, there are three types.
<p>first</p>
<p><span>first</span></p>
<p>first<span>second</span>third</p>
And I want to get text from the last in order like "first", "second", "third".
First, I use '.text' and the last one returned "firstsecondthird". But I want to get text one by one.
Is there a way?
I edited question so you can get more detail information.
contents_list = soup.select('blabla')
# contents_list =
# ['<p>first</p>',
# '<p><span>first</span></p>',
# '<p>first<span>second</span>third</p>']
for content in contents_list:
print(content.text)
# I want to get
# first
# first
# first, second, third
To get the tags separated with a space, you can use the get_text() method with adding a space as the separator argument. .get_text(separator=" ").
from bs4 import BeautifulSoup
html = """
<p>first</p>
<p><span>first</span></p>
<p>first<span>second</span>third</p>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("p"):
print(tag.get_text(separator=" "))
Output:
first
first
first second third

How can I split the text elements out from this HTML String? Python

Good Morning,
I'm doing some HTML parsing in Python and I've run across the following which is a time & name pairing in a single table cell. I'm trying to extract each piece of info separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = 13:30
text2 = "SecondWord"
I'm currently using a loop through all the rows in the table, where I'm taking the text and splitting it by a new line. I noticed the HTML has a line break character in-between so it renders separately on the web, I was trying to replace this with a new line and run my split on that - however my string.replace() and re.sub() approaches don't seem to be working.
I'd love to know what I'm doing wrong.
Latest Approach:
resub_pat = r'<br/>'
rows=list()
for row in table.findAll("tr"):
a = re.sub(resub_pat,"\n",row.text).split("\n")
This is a bit hashed together, but I hope I've captured my problem! I wasn't able to find any similar issues.
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
Using recursive=False to get only direct text and strong.text to get the other one.
Ex:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord

Extracting Embedded <span> in Python using BeautifulSoup

I am trying to extract a value in a span however the span is embedded into another. I was wondering how I get the value of only 1 span rather than both.
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value from the soup.contents attribute:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]

Removing new line '\n' from the output of python BeautifulSoup

I am using python Beautiful soup to get the contents of:
<div class="path">
abc
def
ghi
</div>
My code is as follows:
html_doc="""<div class="path">
abc
def
ghi
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follow,
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']
How can I only get the result in this form: abc,def,ghi as a single string?
Also I want to know about the output so obtained.
You could do this:
breadcrum = [item.strip() for item in breadcrum if str(item)]
The if str(item) will take care of getting rid of the empty list items after stripping the new line characters.
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
EDIT
Although the above gives you what you want, as pointed out by others in the thread, the way you are using BS to extract anchor texts is not correct. Once you have the div of your interest, you should be using it to get it's children and then get the anchor text. As:
path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
data.append(ele.text)
And then do a ','.join(data)
If you just strip items in breadcrum you would end up with empty item in your list. You can either do as shaktimaan suggested and then use
breadcrum = filter(None, breadcrum)
Or you can strip them all before hand (in html_doc):
mystring = mystring.replace('\n', ' ').replace('\r', '')
Either way to get your string output, do something like this:
','.join(breadcrum)
Unless I'm missing something, just combine strip and list comprehension.
Code:
from bs4 import BeautifulSoup as bsoup
ofile = open("test.html", "r")
soup = bsoup(ofile)
res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print res
Result:
abc,def,ghi
[Finished in 0.2s]

Remove <br> tags from a parsed Beautiful Soup list?

I'm currently getting into a for loop with all the rows I want:
page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
At this point, I have my information, but the
<br />
tags are ruining my output.
What's the cleanest way to remove these?
for e in soup.findAll('br'):
e.extract()
If you want to translate the <br />'s to newlines, do something like this:
def text_with_newlines(elem):
text = ''
for e in elem.recursiveChildGenerator():
if isinstance(e, basestring):
text += e.strip()
elif e.name == 'br':
text += '\n'
return text
replace tags at the start with a space
Beautiful soup also accepts the .read() on the urlopen object so this should work - - -
page = urllib2.urlopen(pageurl)
page_text=page.read()
new_text=re.sub('</br>',' ',page_text)
soup = BeautifulSoup(new_text)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
.....
the re.sub replaces the br tag with a whitespace
Maybe some_string.replace('<br />','\n') to replace the breaks with newlines.
>>> print 'Some data<br />More data<br />'.replace('<br />','\n')
Some data
More data
You might want to check out html5lib and lxml, which are both pretty great at parsing html. lxml is really fast and html5lib is designed to be extremely robust.

Categories

Resources