Extracting nested span / p / div structure using beautiful soup


I am trying to extract this part from a page:
Using the inspect I see that:
Does the structure shown in the inspect view always match what bs4 returns?
I am using:
import json
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = soup.find_all('span',"c2")
But it returns:
[<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Gyventojai, kuriems yra daugiau nei 65 metai (1.14 prioritetas)</span>,
<span class="c2"></span>,
<span class="c2">——————————————————————</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Švietimo sistemos darbuotojai bei abiturientai (1.15 prioritetas)</span>,
<span class="c2">Diplomatai (1.16)</span>,
<span class="c2">Sergantieji lėtinėmis ligomis (1.17)</span>,
<span class="c2">Socialinių paslaugų teikėjai (1.18)</span>,
<span class="c2">1.20 prioritetas: gyvybiškai svarbias valstybės funkcijas atliekantys asmenys, kontaktuojantys su kitais asmenimis (pareigūnai, prekybos įmonių salės darbuotojai ir kt.), išskyrus bendrųjų funkcijų darbuotojus. Šiuo metu šio prioriteto sąrašai nuolat keliami.</span>,
<span class="c2">Gyventojų grupė 55-64 m.</span>,
<span class="c2"></span>,
<span class="c2">.</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>]
This does not include <p class="c6"><span class="c2">ŠIUO METU - TIK SENJORAI:</span></p>
I am unsure why, because the class is clearly c2 in both the inspect view and the data returned by bs4.
Should I always follow the nested structure with multiple find calls, or what is the best practice for getting the data I want?

The thing is, the CSS class name changes on every reload, so sometimes it is c7, on the next reload it is c1, and so on.
This example searches the page's CSS for the class name that has the red color (the color of your desired text) and then uses that class name to find your text:
import re
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
html_doc = requests.get(url).text
# find CSS class name that is red:
class_name = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc).group(1)
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(class_=class_name).text)
Prints:
ŠIUO METU - TIK SENJORAI:
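Because the generated class names are not stable, another option is to match on the visible text rather than on the class. The following is only a minimal sketch: `sample_html` is a made-up, simplified stand-in for the published document's markup (the class names in it are assumptions), so that the example runs without a network request.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the published Google Doc markup.
sample_html = """
<style>.c7{color:#cc0000;font-weight:700}.c2{color:#000000}</style>
<p class="c6"><span class="c7">ŠIUO METU - TIK SENJORAI:</span></p>
<p class="c6"><span class="c2">Diplomatai (1.16)</span></p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Option 1: look the class name up in the stylesheet, as in the answer above.
match = re.search(r"\.(c\d+)\{color:#cc0000", sample_html)
red_class = match.group(1) if match else None
by_color = soup.find(class_=red_class).get_text() if red_class else None

# Option 2: ignore class names entirely and match on the text itself.
by_text = soup.find("span", string=re.compile("SENJORAI")).get_text()

print(by_color)
print(by_text)
```

Matching on text only helps when part of the wording is known in advance; the stylesheet lookup is the better fit when only the color is known.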

Related

How do I scrape data from a tag belonging to the same label and the same class? BeautifulSoup

I have several tags with the same tag name and the same name attribute.
Here is my code
first_movie.find('p',{'class' : 'sort-num_votes-visible'})
Here is my output
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
I can reach the first span tag with this code:
first_movie.find('span', {'name':'nv',"data-value": True})
978272 --> output
But I want to reach the other value with name nv ($858.37M).
My code only gets the first value (978,272) because both tags have the same name attribute (name = nv).

You're close.
Try using find_all and then grab the last element.
For example:
from bs4 import BeautifulSoup
html_sample = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
'''
spans = (
    BeautifulSoup(html_sample, "lxml")
    .find_all("span", {"name": "nv", "data-value": True})
)
print(spans[-1].getText())
Output:
$858.37M
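If the order of the spans might change, a slightly more defensive variant is to pair each label with the value that follows it instead of relying on position. This is only a sketch built on the markup from the question; the `stats` dictionary is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

html_sample = """
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
"""

soup = BeautifulSoup(html_sample, "html.parser")

# Map each label ("Votes:", "Gross:") to the name="nv" span that follows it.
stats = {}
for label in soup.select("p.sort-num_votes-visible span.text-muted"):
    # "name" clashes with find's own first parameter, so pass it via attrs.
    value = label.find_next_sibling("span", attrs={"name": "nv"})
    if value is not None:
        key = label.get_text(strip=True).rstrip(":")
        stats[key] = value.get_text(strip=True)

print(stats)
```

With this, `stats["Gross"]` gives the second value regardless of how many spans come before it.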

If you collect all the spans in the p tag, you can treat them as a list and use an index to reach the last span.
movies = soup.find('p', {'class': 'sort-num_votes-visible'})
my_movie = movies.find_all('span')
# the p tag holds five spans; index 3 is the "Gross:" label, the value is the last one
my_span = my_movie[-1].text

GitHub get commits number using python and beautiful soup

I'm trying to get the number of commits of GitHub repos using Python and Beautiful Soup.
HTML code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
My code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    number = span.select_one('strong')
This sometimes works and sometimes does not, because there is more than one span tag with class d-none d-sm-inline.
How can I solve this?

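Here's an approach using the "list commits" endpoint of GitHub's REST API. Note that this endpoint is paginated (30 commits per page by default), so the length of a single response is the full count only for repositories whose commits fit on one page.
import requests
user = ...  # username or organisation
repo = ...  # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
    # counts only the commits on the first page of results
    ncommits = len(response.json())
else:
    raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)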
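Since the commits endpoint is paginated, one common workaround is to request one commit per page and read the number of the last page from the `Link` response header; with `per_page=1` that page number equals the commit count. The sketch below only shows the header parsing. The helper name `last_page_from_link_header` and the sample header are made up for illustration, and the header format is assumed to follow the usual `<url>; rel="..."` layout.

```python
import re
from urllib.parse import urlparse, parse_qs


def last_page_from_link_header(link_header):
    """Return the page number of the rel="last" link, or None if absent."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="last"', part)
        if match:
            query = parse_qs(urlparse(match.group(1)).query)
            return int(query["page"][0])
    return None


# Made-up header in the format used for paginated responses.
sample_header = (
    '<https://api.github.com/repositories/1/commits?per_page=1&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/commits?per_page=1&page=26>; rel="last"'
)
print(last_page_from_link_header(sample_header))  # 26
```

In practice the header would come from something like `requests.get(url, params={"per_page": 1}).headers.get("Link", "")`. When the helper returns None there is no further page, so the length of the JSON body is already the full count.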

from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combined both examples into one string. The code below looks for the strong tag among each span's direct children (via .contents) and prints its text:
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
    for tag in span.contents:
        if tag.name == "strong":
            print(tag.get_text())
Using a list comprehension:
for span in spans:
    data = [tag for tag in span.contents if tag.name == "strong"]
    print(data[0].get_text())
Output for both cases:
26
23

You can use the find_next() method to look for the first <strong> that follows each span with class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
# html is the markup string from the question
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
    print(tag.find_next("strong").text)
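When several spans share the class d-none d-sm-inline, it can be more robust to anchor on something specific to the commit counter. The sketch below assumes the aria-label from the question ("Commits on master") is present and starts with "Commits on"; the first span in the sample is a made-up decoy added to show that it is skipped.

```python
from bs4 import BeautifulSoup

html = """
<span class="d-none d-sm-inline"><strong>7</strong> <span>branches</span></span>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor on the label span, then take the <strong> sibling just before it.
label = soup.select_one('span[aria-label^="Commits on"]')
commits = label.find_previous_sibling("strong").get_text(strip=True)
print(commits)  # 26
```

If the page layout changes and the label is missing, `select_one` returns None, so real code should check for that before calling `find_previous_sibling`.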

how to find child elements or tags within a tag in beautiful soup?

Hi, I have been scraping a lot of different sites, but I am stuck here. Can anyone please help?
This is how the HTML is structured, with large amounts of whitespace in the class attribute.
I am using Beautiful Soup.
<div class="nw-priceblock-container">
<del class="
nw-priceblock-amt
nw-priceblock-mrp
is-having-discount
">Rs. 699 </del>
<span class="
nw-priceblock-amt
nw-priceblock-sellingprice
is-having-discount
">Rs. 489 </span>
<span class="nw-priceblock-discount is-having-discount"> (30% Off)</span>
</div>
I want to get the Rs. 489 text/value.

Try this
from bs4 import BeautifulSoup
html ="""<div class="nw-priceblock-container">
<del class="
nw-priceblock-amt
nw-priceblock-mrp
is-having-discount
">Rs. 699 </del>
<span class="
nw-priceblock-amt
nw-priceblock-sellingprice
is-having-discount
">Rs. 489 </span>
<span class="nw-priceblock-discount is-having-discount"> (30% Off)</span>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
# find the parent container
parent = soup.find('div', {'class': 'nw-priceblock-container'})
# locate the discount span, then step back to the span just before it (the selling price)
print(parent.find('span', {'class': 'nw-priceblock-discount is-having-discount'}).find_previous('span').text)

I solved it by defining a function with a try/except block:
def Price(p):
    try:
        return p.find("span", class_='nw-priceblock-sellingprice').text.replace("Rs.", "").replace(",", "")
    except AttributeError:
        # find() returned None: this element has no selling-price span
        return 0
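A variant of the same idea that returns a number instead of a string is sketched below. It assumes prices look like "Rs. 489" or "Rs. 1,299.50"; the function name `selling_price` is made up for illustration, and the markup is copied from the question.

```python
import re
from bs4 import BeautifulSoup

html = """<div class="nw-priceblock-container">
<del class="
    nw-priceblock-amt
    nw-priceblock-mrp
    is-having-discount
    ">Rs. 699 </del>
<span class="
    nw-priceblock-amt
    nw-priceblock-sellingprice
    is-having-discount
    ">Rs. 489 </span>
<span class="nw-priceblock-discount is-having-discount"> (30% Off)</span>
</div>"""


def selling_price(node):
    """Return the selling price as a float, or None if it is missing."""
    # Beautiful Soup splits the class attribute on any whitespace,
    # so the newlines inside it do not affect the class selector.
    tag = node.select_one("span.nw-priceblock-sellingprice")
    if tag is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", tag.get_text())
    return float(match.group().replace(",", "")) if match else None


soup = BeautifulSoup(html, "html.parser")
price = selling_price(soup)
print(price)  # 489.0
```

Returning None rather than 0 keeps a missing price distinguishable from a genuine price of zero.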

Exclude data from tag

I want to exclude specific text inside an HTML span tag. In the example below, I want to fetch only the test2 text from the spans with class a-list-item.
My HTML:
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
My code: tag = tag.find_all("span", {"class" : "a-list-item"})
How can I get only the test2 items? Thanks for your response.

It looks like you are using Beautiful Soup. In Beautiful Soup 4.7+, this is easy to do by using select instead of find_all. You can use :-soup-contains() wrapped in :not() to exclude spans that contain specific text (before Soup Sieve 2.1 the pseudo-class was spelled :contains(), which is now deprecated).
from bs4 import BeautifulSoup
markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(markup, "html.parser")
print(soup.select("span.a-list-item:not(:-soup-contains(test1))"))
Output
[<span class="a-list-item">test2</span>, <span class="a-list-item">test2</span>]

You could apply an XPath expression that excludes spans containing test1:
//span[@class='a-list-item' and not(contains(text(), 'test1'))]
E.g.
from lxml.html import fromstring
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")]
print(items)
Or test the text of each node that matches the CSS selector (tag and class):
from bs4 import BeautifulSoup as bs
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
soup = bs(h, 'lxml')
items = [item.text for item in soup.select('span.a-list-item') if 'test1' not in item.text]
print(items)

Use a regular expression (the re module) to find specific text.
from bs4 import BeautifulSoup
import re
html = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html,'html.parser')
# string= is the current name of the argument formerly called text=
items = soup.find_all('span', string=re.compile("test2"))
for item in items:
    print(item.text)
Output:
test2
test2
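One caveat with an unanchored pattern is that it also matches longer strings that merely contain "test2". The sketch below contrasts it with an exact string match; the third span ("test22") is made up to show the difference.

```python
import re
from bs4 import BeautifulSoup

html = """
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test22</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches any span whose text contains "test2".
loose = [s.get_text() for s in soup.find_all("span", string=re.compile("test2"))]

# Exact string: matches only spans whose text is exactly "test2".
exact = [
    s.get_text()
    for s in soup.find_all("span", class_="a-list-item", string="test2")
]

print(loose)  # ['test2', 'test22']
print(exact)  # ['test2']
```

An anchored pattern such as `re.compile(r"^test2$")` would behave like the exact match while still allowing regex features.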

Get number from span which is inside span with beautifulsoup

So I have this piece of HTML:
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
And I want to get that '5 300' out of it.
My code to get that:
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span'))
but it only prints out this:
<span></span>
I hope somebody can help.
Edit: I already tried adding .text to the end, but it returns nothing (' ').

You almost got it; you just need to add .text to the result of the last find call.
from bs4 import BeautifulSoup
html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
item = BeautifulSoup(html, "lxml")
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span').text)
Outputs:
5 300

You can try this:
from bs4 import BeautifulSoup as soup
import re
s = """
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
"""
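# strip leading whitespace, then the trailing letters and whitespace (" Ft ")
final_result = re.sub(r'^\s+|[a-zA-Z\s]+$', '', soup(s, 'lxml').find('span', {'class': 'p'}).text)
print(repr(final_result))
Output:
'5 300'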

Here's one with select, which doesn't give you as many options but is quite readable:
import bs4
s = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
# the markup is HTML, so an HTML parser is used here
soup = bs4.BeautifulSoup(s, 'html.parser')
print(soup.select_one("#_productX_label > span > span").text)
Output: 5 300
As for your other issue of .text returning nothing: perhaps the data is filled in by a JavaScript function, or stored in an attribute?
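To check the attribute theory, it helps to look at the HTML exactly as it is downloaded, before any JavaScript runs. The sketch below uses made-up markup: the empty inner span and the `data-price` attribute are assumptions for illustration, not something taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup as a server might send it before JavaScript runs:
# the inner span is empty and the value sits in a made-up data-price attribute.
raw_html = """<label for="productX" id="_productX_label">
<span class="p" data-price="5 300"> <span></span> Ft </span>
</label>"""

soup = BeautifulSoup(raw_html, "html.parser")
outer = soup.select_one("#_productX_label span.p")
inner = outer.find("span")

# Prefer the visible text; fall back to the attribute when the span is empty.
price = inner.get_text(strip=True) or outer.get("data-price")
print(price)  # 5 300
```

If neither the text nor any attribute holds the value, it is probably loaded by a separate request, which Beautiful Soup alone cannot see.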
