Using page text to select `html` element using`Beautiful Soup`

Using page text to select `html` element using`Beautiful Soup` - python

I have a page which contains several repetitions of: <div...><h4>...<p>... For example:
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I write print soup.select('div[class^="proletariat"] > h4 ~ p'), I get:
[<p>Ignore this text</p>, <p>This is the text we want</p>]
How do I specify that I only want the text of p when it is preceded by <h4>hammer</h4>?
Thanks

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print(soup.find("h4", text=re.compile('hammer')).next_sibling.next.text)
This is the text we want

:contains() could help here, but it is not supported.
Taking this into account, you can use select() in conjunction with the find_next_sibling():
print next(h4.find_next_sibling('p').text
for h4 in soup.select('div[class^="proletariat"] > h4')
if h4.text == "hammer")

Related

Python: Deleting all divs without class

I want to delete all divs without classes (but not the content that is in the div).
My input
<h1>Test</h1>
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
The output I want
<h1>Test</h1>
<div class="test">
<p>abc</p>
</div>
My try 1
Based on "Deleting a div with a particular class":
from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>', 'html.parser')
for div in soup.find_all("div", {'class':''}):
div.decompose()
print(soup)
# <h1>Test</h1>
My try 2
from htmllaundry import sanitize
myinput = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
myoutput = sanitize(myinput)
print myoutput
# <p>Test</p><p>abc</p> instead of <h1>Test</h1><div class="test"><p>abc</p></div>
My try 3
Based on "Clean up HTML in python"
from lxml.html.clean import Cleaner
def sanitize(dirty_html):
cleaner = Cleaner(remove_tags=('font', 'div'))
return cleaner.clean_html(dirty_html)
myhtml = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
print(sanitize(myhtml))
# <div><h1>Test</h1><p>abc</p></div>
My try 4
from html_sanitizer import Sanitizer
sanitizer = Sanitizer() # default configuration
output = sanitizer.sanitize('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>')
print(output)
# <h1>Test</h1><p>abc</p>
Problem: A div element is used to wrap the HTML fragment for the parser, therefore div tags are not allowed. (Source: Manual)

If you want to exclude div without class, preserving its content:
from bs4 import BeautifulSoup
markup = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
soup = BeautifulSoup(markup,"html.parser")
for tag in soup.find_all():
empty = tag.name == 'div' and not(tag.has_attr('class'))
if not(empty):
print(tag)
Output:
<h1>Test</h1>
<div class="test"><p>abc</p></div>
<p>abc</p>

Please checkout this.
from bs4 import BeautifulSoup
data="""
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(data, features="html5lib")
for div in soup.find_all("div", class_=True):
print(div)

I want to get the value of multiple ids inside an a tag that resides in a div

Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id tag and the data-size-original.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-us')
The problem is that it gets the values of other ids that have the same class "selectSize".

I think this is what you want. You won't have ids and size from data in div class='size inactive active'
for sizeBlock in soup.select('div.size a.selectSize'):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-us')

Already answered here How to Beautiful Soup (bs4) match just one, and only one, css class
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" " data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
#for sizeBlock in soup.find_all('a', class_= "selectSize"): # this would include the anchor
for sizeBlock in soup.select("a[class='selectSize']"):
aid = sizeBlock.get('id')
size = sizeBlock.get('data-size-original')
print aid, size

I want to find a <span> tag that resides inside a <h1> tag that contains multiple <span> tags and get the text inside it

What I want to do is select the second span and grab its text to print it.
Below is the HTML code and BeautifulSoup code
#HTML code
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
productTitle = h1.find('span').text
print(productTitle)

Hopefully, not always, id should be unique meaning find_all is likely not required.
With bs4 4.7.1+ you can use :not to exclude the child span with an id
from bs4 import BeautifulSoup as bs
html = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = bs(html, 'lxml')
print(soup.select_one('#productTitle span:not([id])').text)
You could also nth-child
print(soup.select_one('#productTitle span:nth-child(2)').text)
or
print(soup.select_one('#productTitle span:nth-child(even)').text)
or even an immediate sibling combinator to get span after child a
print(soup.select_one('#productTitle a + span').text)
or chained next_sibling
print(soup.select_one('#productTitle a').next_sibling.next_sibling.text)

This gets all the fields you need within an h1 tag :
Python Code :
from bs4 import BeautifulSoup
text = '''
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(text,features='html.parser')
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
spans = h1.find_all('span')
print('productBrand == > {}'.format(spans[0].text))
print('productTitle == > {}'.format(spans[1].text))
Get All Spans withing the h1 :
for h1 in soup.find_all('h1', id="productTitle"):
for i,span in enumerate(h1.find_all('span')):
print('span {} == > {}'.format(i,span.text))
Demo :
Here

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and i only want those before the following text shows in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?

Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation

How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
if i.find("small") and i.find("small").text.strip()== "Title text N2.":
break
elif i.name == 'span'and i.has_attr('class') and i['class'][0] == 'myclass':
print (i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>

Selecting second child using BeautifulSoup

Let's say I have the following HTML:
<div>
<p>this is some text</p>
<p>...and this is some other text</p>
</div>
How can I retrieve the text from the second paragraph using BeautifulSoup?

You can use a CSS selector to do this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""<div>
.... <p>this is some text</p>
.... <p>...and this is some other text</p>
.... </div>""", "html.parser")
>>> soup.select('div > p')[1].get_text(strip=True)
'...and this is some other text'

You can use nth-of-type:
h = """<div>
<p>this is some text</p>
<p>...and this is some other text</p>
</div>"""
soup = BeautifulSoup(h)
print(soup.select_one("div p:nth-of-type(2)").text)

secondp = [div.find('p') for div in soup.find('div')]
In : secondp[1].text
Out : Your text
Or you can use the findChildren directly -
div_ = soup.find('div').findChildren()
for i, child in enumerate(div_):
if i == 1:
print child.text

You could solve this with gazpacho:
from gazpacho import Soup
html = """\
<div>
<p>this is some text</p>
<p>...and this is some other text</p>
</div>
"""
soup = Soup(html)
soup.find('p')[1].text
Which would output:
'...and this is some other text'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using page text to select `html` element using`Beautiful Soup` - python

:contains() could help here, but it is not supported. Taking this into account, you can use select() in conjunction with the find_next_sibling(): print next(h4.find_next_sibling('p').text for h4 in soup.select('div[class^="proletariat"] > h4') if h4.text == "hammer")

Related

Python: Deleting all divs without class

I want to get the value of multiple ids inside an a tag that resides in a div

I want to find a <span> tag that resides inside a <h1> tag that contains multiple <span> tags and get the text inside it

BeautifulSoup: find all tags before stopping condition is met

Selecting second child using BeautifulSoup

Categories

Resources