extracting text from id in beautifulsoup - python

have html code like this. using BeautifulSoup, i want to extract the text that is 2,441
have a span element and a id which is equals to lastPrice.
<span id="lastPrice">2,441.00</span>
I have tried to look up on the net and solve, but i am still unable to do it. I am a beginner.
i have tried this:
tag = soup.span
price = soup.find(id="lastPrice")
print(price.text)

The data you see on the page is rendered via JavaScript, so BeautifulSoup doesn't see the price. The price is embedded within the page in JavaScript form. You can extract it for example:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HDFC"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.find(id="responseDiv").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print(data["data"][0]["lastPrice"])
Prints:
2,407.90

Try this:
price = soup.select("#lastPrice")[0]
print(price.text)

Not with bs4 but with regex is as follows:
import re
line = '<span id="lastPrice">2,441.00</span>'
print(p.sub("", data))

Related

soup.find_all returns empty list

I was trying to do some data scraping from booking.com for prices. But it just keeps on returning an empty list.
If anyone can explain me what is happening i would be really thankful to them.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)
After checking different possible problems I found two problems.
price is in <span> but you search in <div>
server sends different HTML for different browsers or devices and code needs full header User-Agent from real browser. It can't be short Mozilla/5.0. And requests as default use something like Python/3.8 Requests/2.27
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}
url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
print(item.text)

Trying to access hidden <div> tags when web scraping in python

so I'm trying to extract some data from a website by webscraping using python but some of the div tags are not expanding to show the data that I want.
This is my code.
import requests
from bs4 import BeautifulSoup as soup
uq_url = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
headers = {
'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
web_r = requests.get(uq_url, headers=headers)
web_soup = soup(web_r.text, 'html.parser')
print(web_soup.prettify())
This is what the code will scrape but it won't extract any of the data in the div with id="app". It's supposed to have a lot of data in there like the second picture. Any help would be appreciated.
All the that content is present within a script tag, as shown in your image. You can regex out the appropriate javascript object then handle the unquoted keys with json, in order to convert to hjson. Then extract whatever you want:
import requests, re, hjson
from bs4 import BeautifulSoup as bs #there is some data as embedded html you may wish to parse later from json
r = requests.get('https://my.uq.edu.au/programs-courses/requirements/program/2451/2021', headers = {'User-Agent':'Mozilla/5.0'})
data = hjson.loads(re.search(r'window\.AppData = ([\s\S]+?);\n' , r.text).group(1))
# hjson.dumpsJSON(data['programRequirements'])
core_courses = data['programRequirements']['payload']['components'][1]['payload']['body'][0]['body']
for course in core_courses:
if 'curriculumReference' in course:
print(course['curriculumReference'])

Extracting specific part of html

I am working on a webscraper using html requests and beautiful soup (New to this). For 1 webpage (https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1) I am trying to scrape a part, which I will replicate for other products. The html looks like:
<div class="plp-listing-load-status c-list-header__counter initialized" data-page-number="1" data-total-pages-count="57" data-products-count="60" data-total-products-count="3361" data-status-format="{available}/{total} results">60/3361 results</div>
I want the scrape the "57" from the data-total-pages-count="57". I have tried using:
soup = BeautifulSoup(page.content, "html.parser")
nopagesstr = soup.find(class_="plp-listing-load-status c-list-header__counter initialized").get('data-total-pages-count')
and
nopagesstr = r.html.find('[data-total-pages-count]',first=True)
But both return None. I am not sure how to select the 57 specifically. Any help would be appreicated
To get total pages count, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
print(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])
Prints:
56

Python parse string in span class

I've already tried the other solutions but did not work.
Here is my span tag:
<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
and here is my full code
import requests
from bs4 import BeautifulSoup
#<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
url = "https://www.google.com/search?q={}+kaç+tl".format(input())
r = requests.get(url)
source = BeautifulSoup(r.content,"html")
print(source.find_all("span",string="DFlfde SwHCTb"))
It returns a empty list, i need the value "7.05", how can i reach it? Thanks
There are 2 things to do to get your data:
specify User-Agent header (google needs this to return correct data)
in .find_all specify class_= parameter, not string
Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?q=100+kaç+tl"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
r = requests.get(url, headers=headers)
source = BeautifulSoup(r.content, "html.parser")
print(source.find_all("span", class_="DFlfde SwHCTb"))
# or
print(source.select_one('span.DFlfde.SwHCTb').text)
Prints:
[<span class="DFlfde SwHCTb" data-precision="2" data-value="13.0095">13,01</span>]
13,01

BeautifulSoup returns None

i am trying to get the title of the listing in this URL, but this code returns None.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# Update Header
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0)
Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div
class="ListingCell-KeyInfo-title" ..>
listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)
Any idea why is that?
Thanks
The url you request treat you as a bot.
Request response:
h1>Pardon Our Interruption...</h1>
<p>
As you were browsing <strong>www.lamudi.com.ph</strong> something about your
browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
Before you parse anything from the response.
Print the content first to make sure you have access the url in right way.
You have to add User-Agent or somthing else to make you like a real user
Try add this into your request headers :
USER_AGENT_FIREFOX= 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
I tried with selenium and with specific wait but do not works.
If you print the soup you can get the error. In fact the page return this: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen:... "
The website recognise that you are not human.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup) #--> this print get the error
listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)

Categories

Resources