How do I extract the code in-between using BeautifulSoup? - python

I would like to extract the text 'THIS IS THE TEXT I WANT TO EXTRACT' from the snippet below. Does anyone have any suggestions? Thanks!
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>

from bs4 import BeautifulSoup
html = """<span class="cw-type__h2 Ingredients-title">Ingredients</span><p>THIS IS THE TEXT I WANT TO EXTRACT</p>"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.text)

Assuming there is likely more HTML on the page, I would use the class of the preceding span together with the adjacent sibling combinator (+) and a p type selector to target the appropriate p tag:
from bs4 import BeautifulSoup as bs
html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = bs(html, 'lxml')
print(soup.select_one('.Ingredients-title + p').text.strip())
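To illustrate why the sibling selector is safer than soup.p once the page has more markup, here is a minimal sketch. The extra "Intro" span and paragraph are hypothetical, added only to show the difference, and html.parser is used so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: an unrelated heading/paragraph pair precedes the one we want.
html = """
<span class="cw-type__h2 Intro-title">Intro</span>
<p>Some unrelated paragraph</p>
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.p returns the first <p> in the document, which is the wrong one here.
first_p = soup.p.get_text(strip=True)

# The adjacent sibling combinator (+) matches the <p> that directly follows the span.
match = soup.select_one(".Ingredients-title + p")
wanted = match.get_text(strip=True) if match is not None else None

print(first_p)
print(wanted)
```

select_one returns None when nothing matches, so checking for None before reading the text avoids an AttributeError on pages where the heading is missing.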

Related

Extract text from class 'bs4.element.Tag' beautifulsoup

I have the following text in a class 'bs4.element.Tag' object:
<span id="my_rate">264.46013</span>
How do I strip the value of 264.46013 and get rid of the junk before and after the value?
I have seen this and this but I am unable to use the text.split() methods etc.
Cheers
I'm not sure I follow, however, if you are using BeautifulSoup:
from bs4 import BeautifulSoup as bs
html = '<span id="my_rate">264.46013</span>'
soup = bs(html, 'html.parser')
value = soup.select_one('span[id="my_rate"]').get_text()
print(value)
Result:
264.46013
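If string methods such as split() appear not to work, one likely cause is that they are being called on the Tag object rather than on its text. A small sketch, assuming the goal is to end up with a number (the float conversion is an assumption, not something the question asks for):

```python
from bs4 import BeautifulSoup

html = '<span id="my_rate">264.46013</span>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(id="my_rate")      # a bs4.element.Tag, not a string
text = tag.get_text(strip=True)    # a plain str with surrounding whitespace removed
rate = float(text)                 # assumed follow-up: convert to a number

print(type(tag).__name__, text, rate)
```

Once get_text() has returned a str, the usual string methods apply to it.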

Get value from first span tag in beautifulsoup

I'd like to know how to get $39,465,077,974.88 from this beautifulsoup code
<td><span>$39,465,077,974.88</span><div><span class="sc-15yy2pl-0 kAXKAX" style="font-size:12px;font-weight:600"><span class="icon-Caret-up"></span>4.59<!-- -->%</span></div></td>
I'm new to web scrapers 😁, hopefully you guys have a clear explanation.
Try using a CSS selector. Here html is the markup from the question:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("td > span").text)
Output:
$39,465,077,974.88
You can get the value like this, since the value you need is inside a <span> that is inside a <td>.
First select the <td> tag using find()
td = soup.find('td')
Next select the <span> tag present inside td.
sp = td.find('span')
Print the text of sp
print(sp.text.strip())
Here is the complete code
from bs4 import BeautifulSoup
s = """<td><span>$39,465,077,974.88</span><div><span class="sc-15yy2pl-0 kAXKAX" style="font-size:12px;font-weight:600"><span class="icon-Caret-up"></span>4.59<!-- -->%</span></div></td>"""
soup = BeautifulSoup(s, 'lxml')
td = soup.find('td')
sp = td.find('span')
print(sp.text.strip())
$39,465,077,974.88
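Both answers rely on the wanted span being the first one inside the td. The sketch below makes that explicit by restricting the search to direct children with recursive=False, and then turns the text into a Decimal. The Decimal conversion is an assumption about what you might do next, not part of the original answers.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

html = """<td><span>$39,465,077,974.88</span><div><span class="sc-15yy2pl-0 kAXKAX" style="font-size:12px;font-weight:600"><span class="icon-Caret-up"></span>4.59<!-- -->%</span></div></td>"""
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td")

# All spans anywhere below the <td>, including the nested ones in the <div>.
all_spans = td.find_all("span")

# Only spans that are direct children of the <td> (same idea as "td > span").
direct_spans = td.find_all("span", recursive=False)

amount_text = direct_spans[0].get_text(strip=True)

# Assumed follow-up: strip the currency symbol and thousands separators.
amount = Decimal(amount_text.replace("$", "").replace(",", ""))

print(len(all_spans), len(direct_spans), amount_text, amount)
```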

How to extract or Scrape data from HTML page but from the element itself

Currently I use lxml to parse the HTML document and get the data from the HTML elements,
but there is a new challenge: one piece of data, the rating, is stored inside the HTML element itself, as an attribute.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
It's easy to extract text between tags, but I have no idea how to extract a value from within a tag.
What do you suggest?
The challenge: I want to extract "3".
URL:https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.
If I understand your question and comments correctly, the following should extract all the ratings on that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example:
targets[0]
output
3
Try the script below:
import re
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("div", {"class": "ratings"}):
    # get all children of the tag
    for h in tag.children:
        # convert the child node to a string
        s = str(h)
        # find data-rating= and capture the rest of that line
        m = re.search('(?<=data-rating=)(.*)', s)
        # check if not None
        if m:
            # print the text after data-rating and remove the last char
            print(m.group()[:-1])
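BeautifulSoup can also read the attribute directly, without a regex: a Tag supports dictionary-style access to its attributes. The sketch below uses a static snippet instead of the live page so it runs offline. The second ratings div is hypothetical, added only to show that more than one value is collected.

```python
from bs4 import BeautifulSoup

# Static stand-in for the page; the second <div> is a made-up extra entry.
html = """
<div class="ratings">
  <p data-rating="3">
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
<div class="ratings">
  <p data-rating="4"></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The attribute selector p[data-rating] matches every <p> that has the attribute;
# tag["data-rating"] then reads its value as a string.
ratings = [int(p["data-rating"]) for p in soup.select("p[data-rating]")]

print(ratings)
```

Use p.get("data-rating") instead of p["data-rating"] if the attribute may be absent, since get returns None rather than raising KeyError.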

getting content inside of all <span> using findall, only get the content that not has \n

I'm trying to extract the content that is inside the span tag under the structure:
<span style="font-weight:bold">xxx</span>
I get a large HTML document from a web service, and from it I extract the span tags with this structure.
The problem is that if the content of a span contains a \n, it is not extracted.
for example:
print(re.findall(pattern, '<span style="font-weight:bold">AAA\n</span><span style="font-weight:bold">ooo</span>'))
>>[ooo]
#output desired should be [AAA,ooo]
How can I fix this so that the content of the span is extracted whether or not it contains a \n?
Use BeautifulSoup to handle elements in HTML instead of a regex; it extracts the content whether or not it contains a \n:
from bs4 import BeautifulSoup
h = """<span style="font-weight:bold">xxx</span>"""
soup = BeautifulSoup(h, "html.parser")
spans = soup.find_all("span")
for span in spans:
    print(span.text)
OUTPUT
xxx
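If you want to stay with re.findall, the likely cause of the missing match is that "." does not match a newline by default. The question does not show its pattern, so the pattern below is a hypothetical reconstruction; the re.DOTALL flag is what makes "." match \n as well.

```python
import re

html = '<span style="font-weight:bold">AAA\n</span><span style="font-weight:bold">ooo</span>'

# Hypothetical pattern (the original one is not shown in the question).
# re.DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first </span>.
pattern = re.compile(r'<span style="font-weight:bold">(.*?)</span>', re.DOTALL)

# strip() removes the trailing newline captured inside the first span.
contents = [m.strip() for m in pattern.findall(html)]

print(contents)
```

A regex like this depends on the exact attribute text, so an HTML parser remains the more robust choice when the markup can vary.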

Python BeautifulSoup - Text Between Div

I am working on a webscraper project and can't get BeautifulSoup to give me the text between the Div. Below is my code. Any suggestions on how to get python to print just the "5x5" without the "Div to /Div" and without the whitespace?
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print (unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use text to retrieve the text, then strip to remove whitespace
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text)
Use a faster css class selector
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
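A small variation on the answers above, as a sketch: select_one returns the first match directly (or None), and get_text(strip=True) combines the text extraction and the whitespace stripping. The class name no-such-class is made up to show the None case.

```python
from bs4 import BeautifulSoup

source = """
<div class="unit-size">
5x5 </div>
"""

soup = BeautifulSoup(source, "html.parser")

# First element with the class, or None if the page has no such element.
unit = soup.select_one(".unit-size")
size = unit.get_text(strip=True) if unit is not None else None

# A selector that matches nothing yields None instead of raising.
missing = soup.select_one(".no-such-class")

print(size, missing)
```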
