Exclude data from tag - python

I want to exclude specific text inside an HTML span tag. In the example below, I want to fetch only the spans with class a-list-item whose text is test2.
My HTML:
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
My code: tag = tag.find_all("span", {"class" : "a-list-item"})
How can I get only the test2 spans? Thanks for your response.

It looks like you are using Beautiful Soup. In Beautiful Soup 4.7+, this is easy to do just by using select instead of find_all. You can use :contains() wrapped in :not() to exclude spans that contain specific text.
from bs4 import BeautifulSoup
markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(markup, "html.parser")
print(soup.select("span.a-list-item:not(:contains(test1))"))
Output
[<span class="a-list-item">test2</span>, <span class="a-list-item">test2</span>]

You could apply an XPath expression that excludes spans containing test1:
//span[@class='a-list-item' and not(contains(text(), 'test1'))]
E.g.
from lxml.html import fromstring
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")]
print(items)
Or test the text of each node that matches the CSS selector (tag and class):
from bs4 import BeautifulSoup as bs
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
soup = bs(h, 'lxml')
items = [item.text for item in soup.select('span.a-list-item') if 'test1' not in item.text]
print(items)

Use a regular expression (the re module) to find specific text.
from bs4 import BeautifulSoup
import re
html = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html,'html.parser')
items = soup.find_all('span', string=re.compile("test2"))
for item in items:
    print(item.text)
Output:
test2
test2
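
Note that a compiled regular expression is applied as a substring search, so re.compile("test2") would also match a span whose text is, say, test20. If an exact match is wanted, a plain string can be passed instead. A minimal sketch, assuming Beautiful Soup 4.4+ (for the string argument) and using hypothetical markup with one extra test20 span added for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the example from the question plus one "test20" span
markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test20</span>
'''
soup = BeautifulSoup(markup, "html.parser")

# A plain string is compared for equality against each span's text,
# so "test20" is not picked up.
exact = soup.find_all("span", class_="a-list-item", string="test2")
texts = [span.get_text() for span in exact]
print(texts)
```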

Related

Extracting nested span / p / div structure using beautiful soup

I am trying to extract this part from a page:
Using the browser inspector I see that:
Does the structure shown in the inspector always match what bs4 returns?
I am using:
import json
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = soup.find_all('span',"c2")
But it returns:
[<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Gyventojai, kuriems yra daugiau nei 65 metai (1.14 prioritetas)</span>,
<span class="c2"></span>,
<span class="c2">——————————————————————</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Švietimo sistemos darbuotojai bei abiturientai (1.15 prioritetas)</span>,
<span class="c2">Diplomatai (1.16)</span>,
<span class="c2">Sergantieji lėtinėmis ligomis (1.17)</span>,
<span class="c2">Socialinių paslaugų teikėjai (1.18)</span>,
<span class="c2">1.20 prioritetas: gyvybiškai svarbias valstybės funkcijas atliekantys asmenys, kontaktuojantys su kitais asmenimis (pareigūnai, prekybos įmonių salės darbuotojai ir kt.), išskyrus bendrųjų funkcijų darbuotojus. Šiuo metu šio prioriteto sąrašai nuolat keliami.</span>,
<span class="c2">Gyventojų grupė 55-64 m.</span>,
<span class="c2"></span>,
<span class="c2">.</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>]
Which does not include <p class="c6"><span class="c2">ŠIUO METU - TIK SENJORAI:</span></p>
And I am unsure why because it clearly states class c2 in both inspect view and the data returned by bs4.
Should I always follow the nested structure with multiple find statements or what is the best practice to get the data I desire?
The thing is, the CSS class name changes on every reload: sometimes it is c7, after a reload it is c1, and so on.
This example searches the page's CSS for the class name whose color is red (#cc0000, the color of your desired text) and then uses that class name to find the text:
import re
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
html_doc = requests.get(url).text
# find CSS class name that is red:
class_name = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc).group(1)
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(class_=class_name).text)
Prints:
ŠIUO METU - TIK SENJORAI:
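
Because the live page can change, the same idea can be tried offline. The following is only a sketch on a small hypothetical document (the class names c1 and c7, the style rules, and the text are made up for illustration); it also guards against the case where no rule with that color is found, in which case re.search returns None:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html_doc = (
    "<html><head><style>"
    ".c1{color:#000000;}.c7{color:#cc0000;font-weight:700}"
    "</style></head><body>"
    '<p class="c6"><span class="c7">RED HEADING</span></p>'
    '<p class="c6"><span class="c1">plain text</span></p>'
    "</body></html>"
)

# Look up which generated class carries the red color
match = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc)
class_name = match.group(1) if match else None

soup = BeautifulSoup(html_doc, "html.parser")
red_text = None
if class_name is not None:
    tag = soup.find(class_=class_name)
    red_text = tag.get_text() if tag is not None else None
print(class_name, red_text)
```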

Python: Deleting all divs without class

I want to delete all divs without classes (but not the content that is in the div).
My input
<h1>Test</h1>
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
The output I want
<h1>Test</h1>
<div class="test">
<p>abc</p>
</div>
My try 1
Based on "Deleting a div with a particular class":
from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>', 'html.parser')
for div in soup.find_all("div", {'class':''}):
    div.decompose()
print(soup)
# <h1>Test</h1>
My try 2
from htmllaundry import sanitize
myinput = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
myoutput = sanitize(myinput)
print(myoutput)
# <p>Test</p><p>abc</p> instead of <h1>Test</h1><div class="test"><p>abc</p></div>
My try 3
Based on "Clean up HTML in python"
from lxml.html.clean import Cleaner
def sanitize(dirty_html):
    cleaner = Cleaner(remove_tags=('font', 'div'))
    return cleaner.clean_html(dirty_html)
myhtml = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
print(sanitize(myhtml))
# <div><h1>Test</h1><p>abc</p></div>
My try 4
from html_sanitizer import Sanitizer
sanitizer = Sanitizer() # default configuration
output = sanitizer.sanitize('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>')
print(output)
# <h1>Test</h1><p>abc</p>
Problem: A div element is used to wrap the HTML fragment for the parser, therefore div tags are not allowed. (Source: Manual)
If you want to exclude divs without a class, preserving their content:
from bs4 import BeautifulSoup
markup = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
soup = BeautifulSoup(markup,"html.parser")
for tag in soup.find_all():
    empty = tag.name == 'div' and not tag.has_attr('class')
    if not empty:
        print(tag)
Output:
<h1>Test</h1>
<div class="test"><p>abc</p></div>
<p>abc</p>
Please check this out:
from bs4 import BeautifulSoup
data="""
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(data, features="html5lib")
for div in soup.find_all("div", class_=True):
    print(div)
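
The snippets above print the tags that carry a class but leave the tree itself unchanged. To actually drop the class-less div wrappers while keeping their children, one option is Tag.unwrap(), which replaces a tag with its contents. A minimal sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

markup = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
soup = BeautifulSoup(markup, "html.parser")

# find_all returns a list up front, so modifying the tree while looping is safe here
for div in soup.find_all("div"):
    # Treat a missing or empty class attribute as "no class"
    if not div.get("class"):
        div.unwrap()  # remove the tag itself, keep its children in place

result = str(soup)
print(result)
```

With markup that contains newlines and indentation, the whitespace text nodes inside the removed wrappers stay in the tree, so the output may need additional tidying.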

I want to find a <span> tag that resides inside a <h1> tag that contains multiple <span> tags and get the text inside it

What I want to do is select the second span and grab its text to print it.
Below is the HTML code and BeautifulSoup code
#HTML code
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
    productTitle = h1.find('span').text
    print(productTitle)
An id should be unique (although in practice it is not always), so find_all is likely not required.
With bs4 4.7.1+ you can use :not to exclude the child span with an id
from bs4 import BeautifulSoup as bs
html = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = bs(html, 'lxml')
print(soup.select_one('#productTitle span:not([id])').text)
You could also use nth-child
print(soup.select_one('#productTitle span:nth-child(2)').text)
or
print(soup.select_one('#productTitle span:nth-child(even)').text)
or even an immediate sibling combinator to get span after child a
print(soup.select_one('#productTitle a + span').text)
or chained next_sibling
print(soup.select_one('#productTitle a').next_sibling.next_sibling.text)
This gets all the fields you need within an h1 tag:
Python code:
from bs4 import BeautifulSoup
text = '''
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(text,features='html.parser')
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
    spans = h1.find_all('span')
    print('productBrand == > {}'.format(spans[0].text))
    print('productTitle == > {}'.format(spans[1].text))
Get all spans within the h1:
for h1 in soup.find_all('h1', id="productTitle"):
    for i, span in enumerate(h1.find_all('span')):
        print('span {} == > {}'.format(i, span.text))
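
Another option, since the title span is a direct child of the h1 while the brand span is nested inside the a tag, is to search only the h1's immediate children with recursive=False. A small sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

html = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1", id="productTitle")
# recursive=False limits the search to direct children of the h1,
# which skips the brand span nested inside the link
title_span = h1.find("span", recursive=False)
product_title = title_span.get_text(strip=True)
print(product_title)
```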

Python: extract all the information(src, href, title) inside the class

I found that I can extract all the information I want from this HTML. I need to extract the title, href and src from it.
HTML:
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.find_all('div', {"id":"home"}):
    for b in a.select(".main"):
        print("http://www.cad.com"+b.get('href'))
        print(b.get('title'))
I can successfully get the href this way, but title and src are attributes of the nested img tag rather than the a tag, so I don't know how to extract them. After this, I want to save the results in Excel, so maybe I need to finish one step first and then do the second.
Expected output:
/slim?p=3090
apple
/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop
/slim?p=3091
banana
/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop
My own solution:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.find_all('div', {"id":"home"}):
    thumbs = a.find_all('div', {"class": "home-hot-thumb"})
    for div in thumbs:
        # title and src live on the nested <img>, href on the <a>
        title = div.img.get('title')
        print(title)
        href = 'http://www.cad.com/' + div.a.get('href')
        print(href)
        src = 'http://www.cad.com/' + div.img.get('src')
        print(src.replace('?w=70&h=70&mode=crop', ''))
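
For the follow-up step of saving to Excel, one possibility is to collect the three values per item and write them as CSV, which Excel can open. The sketch below is an assumption about what is wanted, not part of the original solution: it runs on the HTML fragment from the question instead of the live site, uses urljoin to avoid doubled slashes when building absolute URLs, and writes to an in-memory buffer (a real script would open a file instead).

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.cad.com/"  # base URL taken from the question

html = '''
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.select("div.home-hot-thumb"):
    src_without_query = div.img.get("src").split("?", 1)[0]  # drop the resize parameters
    rows.append({
        "title": div.img.get("title"),
        "href": urljoin(BASE_URL, div.a.get("href")),
        "src": urljoin(BASE_URL, src_without_query),
    })

# Write the rows as CSV; swap the buffer for open("items.csv", "w", newline="") to save a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "href", "src"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```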

Using page text to select `html` element using `Beautiful Soup`

I have a page which contains several repetitions of: <div...><h4>...<p>... For example:
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
If I write print(soup.select('div[class^="proletariat"] > h4 ~ p')), I get:
[<p>Ignore this text</p>, <p>This is the text we want</p>]
How do I specify that I only want the text of p when it is preceded by <h4>hammer</h4>?
Thanks
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# find the <h4> whose text matches, then take the next <p> sibling
print(soup.find("h4", string=re.compile('hammer')).find_next_sibling("p").text)
Prints:
This is the text we want
The :contains() pseudo-class would help here, but select() did not support it in older Beautiful Soup releases (before 4.7).
Taking this into account, you can use select() in conjunction with find_next_sibling():
print(next(h4.find_next_sibling('p').text
           for h4 in soup.select('div[class^="proletariat"] > h4')
           if h4.text == "hammer"))
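
On newer installations the whole lookup can be written as a single selector. This is a sketch under the assumption that Beautiful Soup 4.7+ is installed together with Soup Sieve 2.1 or later, where the text-matching pseudo-class is spelled :-soup-contains() (older Soup Sieve versions used :contains()). Note that it matches substrings, so an h4 reading "sledgehammer" would also qualify.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# "+" selects the <p> that immediately follows the matching <h4>
# (whitespace text nodes between the elements are ignored by CSS combinators)
p = soup.select_one('div.proletariat > h4:-soup-contains("hammer") + p')
text = p.get_text() if p is not None else None
print(text)
```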
