Strip out html elements using beautiful soup

I have the following html:
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
I've just been shown that the way to get the text is
soup.select('.leftColumn div')[0].text.split()
This works, but so much junk is left over from the two nested divs that it is very difficult to pick out the text I need reliably. Is there a way to remove the two divs (classes static and summary), which would make the remainder much easier to process?

Here is an example based on your snippet:
from bs4 import BeautifulSoup
text = """
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
</div>
"""
soup = BeautifulSoup(text, "html.parser")
# Find divs with class "static" or "summary" and remove them using `extract`
div_nodes = soup.find_all('div', {'class': ['static', 'summary']})
for div in div_nodes:
    div.extract()
print(soup.text.split())
If you run the code, you will see that the static and summary divs are removed, and you get:
['text1', 'text2', '(222)', '123', '-', '4567']
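
As a variation on the approach above, here is a minimal sketch that removes the two divs with decompose() and then reads the remaining text through stripped_strings, so the phone number stays in one piece instead of being split on whitespace. The placeholder text inside the static and summary divs is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="leftColumn">
  <div>
    <div class="static">junk</div>
    text1
    <br>
    text2
    <br>
    (222) 123 - 4567
    <br>
    <div class="summary">more junk</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# decompose() removes the tag from the tree and destroys it;
# extract() would remove it and hand it back instead.
for node in soup.select("div.static, div.summary"):
    node.decompose()

# stripped_strings yields every non-empty text node with the
# surrounding whitespace removed, one item per text node.
lines = list(soup.stripped_strings)
print(lines)  # ['text1', 'text2', '(222) 123 - 4567']
```

Whether to prefer split() or stripped_strings depends on whether you want individual tokens or whole lines.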

Related

Scrape values inside span class webpage with beautifulsoup python

Hello everyone. I have a webpage I'm trying to scrape. The page has tons of spans, most of which hold useless information. Below is the section of span data that I need, but I can't simply call find_all on span because there are hundreds of others that I don't need.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas data frame, but I can't seem to get the specific spans printed with their values.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()
    print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This will print out everything I need BUT it also prints out 100's of other values I don't want/need.

You can pass a function to soup.find_all to select only the wanted elements, then use .find_next_sibling() to get the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
    # Keep only <span> tags whose text is one of the wanted titles
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }
for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(string=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
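
Since the goal is a pandas data frame, it may help to collect the pairs into a dict rather than print them. The sketch below is one possible way to do that, assuming the same span layout; the extra Status block and the names wanted and record are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="col-md-4"><p><span class="text-muted">File Number</span><br>A-21-897274</p></div>
<div class="col-md-4"><p><span class="text-muted">Location</span><br>Ohio</p></div>
<div class="col-md-4"><p><span class="text-muted">Status</span><br>Open</p></div>
<div class="col-md-4"><p><span class="text-muted">Date</span><br>07/01/2022</p></div>
"""

wanted = ("File Number", "Location", "Date")  # titles to keep
soup = BeautifulSoup(html_doc, "html.parser")

record = {}
for span in soup.find_all("span", class_="text-muted"):
    title = span.get_text(strip=True)
    if title not in wanted:
        continue  # skip the spans that are not needed
    # The value is the first text node that follows the span (after the <br>).
    value = span.find_next_sibling(string=True)
    if value is not None:
        record[title] = value.strip()

print(record)
```

A list of such dicts, one per scraped page, can then be handed to pandas.DataFrame to build the table.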

Parsing HTML with BeautifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593

I think your HTML is corrupted. You can do the entire filtering with CSS selectors using :has, :not, and :contains (spelled :-soup-contains in the latest soupsieve), along with attribute = value selectors. The ^ is a starts-with operator: [attr^=value] matches when the attribute value starts with the string after the =. The ~ is a general sibling combinator and the > is a child combinator. So the selector below looks for a .title element that has a descendant whose title attribute starts with "Blue -", then a sibling with class tocart, then a child of that sibling whose id starts with AddToCartSimple- and whose text does not contain SOLD. This is less specific than testing for text != "SOLD", since it excludes on a partial string match; which is preferable depends on the variation observed in the actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should check there was a match before accessing with ['id'] of course. You could also go for all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]
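
If the heading carries the product name only as text, with no title attribute anywhere, a similar selector can test the heading text instead. The following is a sketch under that assumption; it uses a repaired copy of the markup, and :-soup-contains needs a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-details">
  <h4 class="title">Blue - Standard</h4>
  <div class="tocart"><a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
  <h4 class="title">Blue - Wide</h4>
  <div class="tocart"><a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .title whose text contains "Blue", then a sibling .tocart, then a child
# whose id starts with "AddToCartSimple-" and whose text lacks "SOLD".
selector = (
    '.title:-soup-contains("Blue") ~ .tocart > '
    '[id^="AddToCartSimple-"]:not(:-soup-contains("SOLD"))'
)

ids = [a["id"] for a in soup.select(selector)]
print(ids)  # ['AddToCartSimple-3593']
```

Because select returns a list, an empty result simply gives an empty list, which avoids the None check that select_one requires.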

To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use the CSS selector .product-details > h4 a[title*='Blue'], which selects every a whose title attribute contains "Blue" inside an h4 that is a direct child of an element with class product-details.
Find the next div using the find_next() method, and check that its text is not "SOLD".
Print that div's id.
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
    if tag.find_next("div").text != "SOLD":
        print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following block in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?

Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
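If several tags match <h4 class="cat-title" id="55">, find() returns the first one, so the search stops there. Note that find_all_previous() returns the matches in reverse document order, closest to the stop tag first.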
Reference: Beautiful Soup Documentation
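
Since the question says the Title text N2. line is what makes the block unique, the stop tag could also be located through that text. The sketch below is one way to do it on a cut-down copy of the page; the sample markup is simplified and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

page = """
<div class="list">
  <span class="myclass">Text893</span>
  <span class="myclass">Text96</span>
</div>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list">
  <span class="myclass">Text34</span>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

# Find the <small> by its text, then climb to the enclosing <h4>.
small = soup.find("small", string=lambda s: s and "Title text N2." in s)
stop_at = small.find_parent("h4", class_="cat-title")

# find_all_previous() walks backwards, so reverse to get document order.
before = list(reversed(stop_at.find_all_previous("span", class_="myclass")))
texts = [span.get_text(strip=True) for span in before]
print(texts)  # ['Text893', 'Text96']
```

On a real page, soup.find may return None if the marker text is missing, so that case should be checked before calling find_parent.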

How about this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
# Split the page source at the marker text and parse only the part before it
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>

Python v3, BeautifulSoup - multiple div tags with same name

soup = BeautifulSoup(html, "html.parser") # BeautifulSoup(markup, "lxml")
items = soup.find_all("div","_3u1 _gli _uvb", recursive=True)
for item in items:
    abouts = item.find_all("div", {"class":"_glo"}, recursive = True)[0].text
    print (abouts)
HTML page:
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
Good afternoon. I am trying to scrape a webpage using BeautifulSoup in Python. I need all the "text" strings in separate variables. When I print abouts I get "text text text", but I want the strings separated.
Kind regards

Try this:
items = soup.find_all('div', attrs={'class':'_ajw'})
texts = {}
for i, item in enumerate(items, start=1):
    # Key each inner text as text1, text2, ...
    texts['text' + str(i)] = item.find('div', attrs={'class':'_52eh'}).text
print(texts)
This will give you something like this:
{'text1': text, 'text2': text, 'text3': text}

I'd use soup.select to apply a class selector to the HTML. It is a fast way to get a list of the matching elements by class:
from bs4 import BeautifulSoup as bs
html = '''
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('._52eh')]
print(items)

Selecting nested element with beautiful soup

I have the following html:
<div class="leftColumn">
<div>
<div class="static">
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
How can I select just the text lines using Beautiful Soup?
I've tried a variety of things like:
soup.select('.leftColumn div').text
but so far no dice

Mauro's answer is probably more what you wanted, but this is another way to do it and how I thought about getting the inner div text:
from bs4 import BeautifulSoup
html = '''<div class="leftColumn">
<div>
<div class="static">
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
'''
bs = BeautifulSoup(html, "html.parser")
for div in bs.find_all('div', attrs={'class': 'leftColumn'}):
    print(div.find_next('div').find_next('div').text)

BeautifulSoup's select returns a list, so you must specify an index:
soup.select('.leftColumn div')[0].text.split()
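
If only the loose text lines are wanted and the nested divs should be ignored, another option is to keep just the direct text-node children of the inner div. This is a sketch that assumes the text lines sit directly inside the unnamed div, next to the static and summary divs; the content of those two divs is a placeholder.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="leftColumn">
  <div>
    <div class="static">ignored</div>
    text1
    <br>
    text2
    <br>
    (222) 123 - 4567
    <br>
    <div class="summary">ignored</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
inner = soup.select_one(".leftColumn > div")

# Keep only direct children that are text nodes and are not blank;
# child tags such as <br> and the nested divs are skipped entirely.
lines = [
    child.strip()
    for child in inner.children
    if isinstance(child, NavigableString) and child.strip()
]
print(lines)  # ['text1', 'text2', '(222) 123 - 4567']
```

This leaves the tree untouched, which is useful when the static and summary divs are still needed later.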
