Get number from span which is inside span with beautifulsoup - python

I have this piece of HTML:
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
And I want to get that '5 300' out of it.
My code to get that:
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span'))
but it only prints out this:
<span></span>
I hope somebody can help
Edit: I already tried appending .text to the end, but it gives nothing (' ').

You almost got it; you just need to add .text after the last find() call.
from bs4 import BeautifulSoup
html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
item = BeautifulSoup(html, "lxml")
print(item.find('label',{'for':'productX'}).find('span', attrs={'class': 'p'}).find('span').text)
Outputs:
5 300
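If the goal is an actual number rather than the string '5 300', the whitespace used as a thousands separator has to go first. A minimal sketch, assuming the separator is an ordinary or non-breaking space and the price is a whole number (the helper name `price_to_int` is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""


def price_to_int(text):
    """Hypothetical helper: drop every whitespace character, then convert."""
    # str.split() with no arguments splits on any whitespace, including
    # the non-breaking space (\xa0) that price strings often contain.
    return int("".join(text.split()))


soup = BeautifulSoup(html, "html.parser")
inner = soup.find("label", {"for": "productX"}).find("span", class_="p").find("span")
price = price_to_int(inner.text)
print(price)
```

This raises ValueError if the span holds anything other than digits and whitespace, which is usually a useful signal that the page layout changed.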

You can try this:
from bs4 import BeautifulSoup as soup
import re
s = """
<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>
"""
# Strip leading whitespace and the trailing currency letters/whitespace
final_result = re.sub(r'^\s+|[a-zA-Z\s]+$', '', soup(s, 'lxml').find('span', {'class':'p'}).text)
Output:
'5 300'

Here's one with select, which doesn't give you as many options but is quite readable
import bs4
s = """<label for="productX" id="_productX_label">
<span class="t">XS</span>
<span class="s">10 x 10 cm</span>
<span class="p"> <span>5 300</span> Ft </span>
</label>"""
soup = bs4.BeautifulSoup(s, 'xml')
soup.select_one("#_productX_label > span > span").text
Output: '5 300'
As for your other issue of not being able to use the text property, perhaps the data is filled in by a JavaScript function, or stored in an attribute?
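One way to check the second possibility is to look at what the parsed tag actually contains. The sketch below assumes a hypothetical page where the visible price is injected by JavaScript and the raw HTML only carries it in a made-up `data-price` attribute; the real page may store it elsewhere or not at all.

```python
from bs4 import BeautifulSoup

# Hypothetical raw HTML as a server might send it before any script runs:
# the inner span is empty and the value sits in an invented attribute.
raw_html = """<label for="productX" id="_productX_label">
<span class="p"> <span data-price="5 300"></span> Ft </span>
</label>"""

soup = BeautifulSoup(raw_html, "html.parser")
inner = soup.select_one("#_productX_label > span.p > span")

text = inner.get_text(strip=True)   # empty string when the span has no text
attrs = dict(inner.attrs)           # every attribute present on the tag

print(repr(text))
print(attrs)

# Fall back to the (assumed) attribute only if the text is empty.
value = text or inner.get("data-price")
print(value)
```

If both the text and the attributes come back empty, the value is probably not in the downloaded HTML at all, and comparing the response body with the browser's inspect view is the next thing to try.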

Related

How do I scrape data from a tag belonging to the same label and the same class? BeautifulSoup

I have several tags with the same tag name and the same name attribute.
Here is my code
first_movie.find('p',{'class' : 'sort-num_votes-visible'})
Here is my output
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
I can reach the first span tag with this code:
first_movie.find('span', {'name':'nv',"data-value": True})
978272 --> output
But I want to reach the other value with the name nv ($858.37M).
My code only gets the first value (978,272) because both tags have the same name attribute (name = nv).
You're close.
Try using find_all and then grab the last element.
For example:
from bs4 import BeautifulSoup
html_sample = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
'''
soup = (
BeautifulSoup(html_sample, "lxml")
.find_all("span", {'name':'nv',"data-value": True})
)
print(soup[-1].getText())
Output:
$858.37M
If you collect all the spans in the p tag, you can treat them as a list and use an index to reach the last span.
movies = soup.find('p', {'class': 'sort-num_votes-visible'})
my_movie = movies.find_all('span')
# the p tag holds five spans, so the last one (index -1) is the gross value
my_span = my_movie[-1].text
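Positional indexes break if the page adds or drops a span. A sketch of an alternative that pairs each label span with the value span that follows it, assuming every `text-muted` label is followed by a sibling span with name="nv" as in the sample above:

```python
from bs4 import BeautifulSoup

html_sample = """
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
"""

soup = BeautifulSoup(html_sample, "html.parser")
block = soup.find("p", class_="sort-num_votes-visible")

stats = {}
for label in block.find_all("span", class_="text-muted"):
    # The value is the next sibling span carrying name="nv".
    value = label.find_next_sibling("span", attrs={"name": "nv"})
    if value is not None:
        stats[label.get_text(strip=True).rstrip(":")] = value.get_text(strip=True)

print(stats)
```

With the labels as keys, `stats["Gross"]` reads the value by meaning rather than by position.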

Extracting nested span / p / div structure using beautiful soup

I am trying to extract this part from a page:
Using the inspect I see that:
Does the structure shown in the inspect view always match what bs4 returns?
I am using:
import json
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = soup.find_all('span',"c2")
But it returns:
[<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Gyventojai, kuriems yra daugiau nei 65 metai (1.14 prioritetas)</span>,
<span class="c2"></span>,
<span class="c2">——————————————————————</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2">Švietimo sistemos darbuotojai bei abiturientai (1.15 prioritetas)</span>,
<span class="c2">Diplomatai (1.16)</span>,
<span class="c2">Sergantieji lėtinėmis ligomis (1.17)</span>,
<span class="c2">Socialinių paslaugų teikėjai (1.18)</span>,
<span class="c2">1.20 prioritetas: gyvybiškai svarbias valstybės funkcijas atliekantys asmenys, kontaktuojantys su kitais asmenimis (pareigūnai, prekybos įmonių salės darbuotojai ir kt.), išskyrus bendrųjų funkcijų darbuotojus. Šiuo metu šio prioriteto sąrašai nuolat keliami.</span>,
<span class="c2">Gyventojų grupė 55-64 m.</span>,
<span class="c2"></span>,
<span class="c2">.</span>,
<span class="c2"></span>,
<span class="c2"></span>,
<span class="c2"></span>]
Which does not include <p class="c6"><span class="c2">ŠIUO METU - TIK SENJORAI:</span></p>
And I am unsure why because it clearly states class c2 in both inspect view and the data returned by bs4.
Should I always follow the nested structure with multiple find statements or what is the best practice to get the data I desire?
The thing is, the CSS class name changes on every reload: sometimes it is c7, after a reload it is c1, and so on.
This example searches the page source for the CSS class whose rule sets the red color (the color of your desired text) and then uses that class name to find the text:
import re
import requests
from bs4 import BeautifulSoup
url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
html_doc = requests.get(url).text
# find CSS class name that is red:
class_name = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc).group(1)
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(class_=class_name).text)
Prints:
ŠIUO METU - TIK SENJORAI:
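The same lookup can be tried offline on a small stand-in document. The sketch below assumes the stylesheet is inlined in a `<style>` element and that the red rule starts with `color:#cc0000;`, as in the regex above; the sample markup and the `find_red_text` name are invented for illustration, and the function returns None instead of raising when no such rule exists.

```python
import re
from bs4 import BeautifulSoup

# Invented stand-in for the published document; class names are arbitrary.
html_doc = """<html><head><style>
.c2{color:#000000;font-size:11pt}.c7{color:#cc0000;font-weight:700}
</style></head><body>
<p class="c6"><span class="c7">ŠIUO METU - TIK SENJORAI:</span></p>
<p class="c6"><span class="c2">Diplomatai (1.16)</span></p>
</body></html>"""


def find_red_text(html):
    """Return the text of the first element styled with the red class, or None."""
    match = re.search(r"\.(c\d+)\{color:#cc0000;", html)
    if match is None:
        return None  # no red rule found, e.g. the styling changed
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(class_=match.group(1))
    return tag.get_text(strip=True) if tag else None


print(find_red_text(html_doc))
print(find_red_text("<p>no styles here</p>"))
```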

python loop for web-scraping

I'm trying to write a loop that collects all the values inside the li tags so I can build a DataFrame. So far I can only isolate the block using: new = soup.find("div", class_="PlayerList"). If I use a standard for loop, it shows only one value instead of all of them.
The output I would like to show is:
Messi,
Shooting 9,
Passing 9,
Tackle 4,
import requests
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
main_url = 'https://examplelistpython.000webhostapp.com/messi.html'
result = requests.get(main_url)
result.text
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())
new = soup.find("div", class_="PlayerList")
new
<ul class="List">
<li>
<div class="PlayerList">
<div class="HeaderList">
<span class="player">Messi</span>
</div>
<div class="PlayerStat">
<span class="stat">Shooting <span class="allStatContainer statShooting" data-stat="Shooting">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Passing <span class="allStatContainer statPassing" data-stat="Passing">
9
</span>
</span>
</div>
<div class="PlayerStat">
<span class="stat">Tackle <span class="allStatContainer statTackle" data-stat="Tackle">
4
</span>
</span>
</div>
</li>
</ul>
player = [i.text.strip() for i in soup.find_all("span", class_="player")]
shooting = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statShooting")]
passing = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statPassing")]
tackle = [i.text.strip() for i in soup.find_all("span", class_="allStatContainer statTackle")]
df = pd.DataFrame({'Player': player, 'Shooting': shooting, 'Passing': passing, 'Tackle': tackle})
Result:
  Player Shooting Passing Tackle
0  Messi        9       9      4
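The list comprehensions above collect each column separately, which only lines up if every player has every stat. A sketch of a per-player loop instead, assuming each player sits in its own `div.PlayerList` and each value span carries a `data-stat` attribute as in the markup shown; the second player is invented to show the loop handling more than one entry.

```python
from bs4 import BeautifulSoup

html = """<ul class="List">
<li><div class="PlayerList">
  <div class="HeaderList"><span class="player">Messi</span></div>
  <div class="PlayerStat"><span class="stat">Shooting <span class="allStatContainer statShooting" data-stat="Shooting">
  9
  </span></span></div>
  <div class="PlayerStat"><span class="stat">Tackle <span class="allStatContainer statTackle" data-stat="Tackle">
  4
  </span></span></div>
</div></li>
<li><div class="PlayerList">
  <div class="HeaderList"><span class="player">Example Player</span></div>
  <div class="PlayerStat"><span class="stat">Shooting <span class="allStatContainer statShooting" data-stat="Shooting">
  7
  </span></span></div>
</div></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", class_="PlayerList"):
    row = {"Player": block.find("span", class_="player").get_text(strip=True)}
    # One column per stat, named after the data-stat attribute.
    for stat in block.find_all("span", attrs={"data-stat": True}):
        row[stat["data-stat"]] = stat.get_text(strip=True)
    rows.append(row)

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per player, with missing stats left as NaN.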

How to scrape the text next to end tag in BeautifulSoup4?

How can I scrape this HTML?
<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><span class="api-badge oauth-scope">flair</span>
</span>
</h3>
Is there any method to get the text below the span tag? I know that I can get the next text using next or next_sibling, but is there any other workaround for this, something like h3.span?
This way you can catch your text:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><span class="api-badge oauth-scope">flair</span>
</span>
</h3>""", "html.parser")
api_badges = soup.find_all('span', {'class': 'api-badge oauth-scope'})
api_badges_txt = [api_badge.text for api_badge in api_badges]
The output is
['flair']
If you use
add_space = soup.find('em').next_sibling.replace('\n', '').strip()
soup.find('h3').get_text(strip=True).replace(add_space, add_space + ' ')
you get 'GET[/r/subreddit]/api/user_flair flair'
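To get only the text that sits directly inside the h3, outside any child tag, one option is to walk the tag's direct children and keep the bare strings. A sketch on the same markup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("""<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><span class="api-badge oauth-scope">flair</span>
</span>
</h3>""", "html.parser")

h3 = soup.find("h3")

# Direct children are either tags or NavigableString objects; keep the
# strings and drop the ones that are only whitespace.
loose_text = [
    child.strip()
    for child in h3.children
    if isinstance(child, NavigableString) and child.strip()
]
print(loose_text)

# The text right after the first span is the first loose string here.
after_method = h3.find("span", class_="method").next_sibling.strip()
print(after_method)
```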

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following text shows up in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
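Two details are worth checking with this approach: find_all_previous() returns matches starting from the one closest to the stop tag, so the list comes back in reverse document order, and the stop tag can be located by its text rather than its id. A sketch on a small invented page, assuming the text of the small element is what makes the heading unique:

```python
import re
from bs4 import BeautifulSoup

page = """<html><body>
<h4 class="cat-title" id="54">Earlier title <small>Other text.</small></h4>
<div class="myc"><span class="myclass">Text893</span></div>
<div class="myc"><span class="myclass">Text96</span></div>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="myc"><span class="myclass">Text34</span></div>
</body></html>"""

soup = BeautifulSoup(page, "html.parser")

# Locate the heading through the text of its <small> child.
small = soup.find("small", string=re.compile(r"Title text N2\."))
stop_at = small.find_parent("h4")

# Matches come back nearest-first, so reverse them for document order.
before = stop_at.find_all_previous("span", class_="myclass")
texts = [span.get_text(strip=True) for span in reversed(before)]
print(texts)
```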
How about this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
# Split the page text at the unwanted string, then parse only the part before it
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
