extract text from h1 and id with python beautiful soup

extract text from h1 and id with python beautiful soup - python

I'm trying to extract the text from HTML id="itemSummaryPrice" but I couldn't figure it out.
html = """
<div id="itemSummaryContainer" class="content">
<div id="itemSummaryMainWrapper">
<div id="itemSummaryImage">
<img src="https://img.rl.insider.gg/itemPics/large/endo.fgreen.891c.jpg" alt="Forest Green Endo">
</div>
<h2 id="itemSummaryTitle">Item Report</h2>
<h2 id="itemSummaryDivider"> | </h2>
<h2 id="itemSummaryDate">Friday, January 15, 2021, 8:38 AM EST</h2>
<div id="itemSummaryBlankSpace"></div>
<h1 id="itemSummaryName">
<span id="itemNameSpan" style="color: rgb(88, 181, 73);"><span>Forest Green</span> <span>Endo</span></span>
</h1>
**<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>**
</div>
</div>
"""
my code:
price_checker_site = requests.get(price_checker_url + match2)
price_checker_site_soup = BeautifulSoup(price_checker_site, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item)
returns with:
<h1 id="itemSummaryPrice"></h1>
What I'm trying to extract:
<h1 id="itemSummaryPrice">200 - 300</h1>
OR
<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>
OR
200 - 300

Because I can't place comments yet an answer then. Shouldn't you call .text behind the price_check_item?
So the python code looks like this.
price_checker_site = requests.get(price_checker_url + match2)
price_checker_site_soup = BeautifulSoup(price_checker_site, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item.text) #Also possible to do print(price_check_item.text.strip())
I think this is the correct answer. Unfortunately not able to test now. Will check my code for you tonight.

As discussed in the comments, the content you seek is loaded dynamically using JavaScript. Therefore, you must either use a library like Selenium to dynamically run the JS, or find out where/how the data is loaded and replicate that.
Method 1: Use Selenium
from selenium import webdriver
url = 'https://rl.insider.gg/en/psn/octane/grey'
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
price = driver.find_element_by_id('itemSummaryPrice')
print(price.text)
In this case its easy, you just make the request and use find_element_by_id to get the data you want.
Method 2: Trace & Replicate
If you look at your browser's debugger, you can find where/how the itemSummaryPrice it set.
In particular, we find that its set using $('#itemSummaryPrice').text(itemData.currentPriceRange) in https://rl.insider.gg/js/itemDetails.js.
The next step is to find out where itemData comes from. It turns out, this is not from some other file or API call. Instead, it appears to be hard-coded in the HTML source itself (presumably loaded server-side).
If you inspect the source, you'll find the itemData is just a JSON object defined on one line within a script tag on the page itself.
There are two different approaches you can use here.
Use Selenium's execute_script to extract the data. This gives you the JSON object in a ready-to-use format. You can then just index it to get the currentPriceRange.
from selenium import webdriver
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
itemData = driver.execute_script('return itemData')
print(itemData['currentPriceRange'])
Method 2.1: Alternative to Selenium
Alternatively, you can extract this in Python using traditional methods. Then, convert that to a usable Python object using json.loads and then, index the object to extract the currentPriceRange -- this gives you the desired output.
import re
import requests
import json
# Download & convert the response content to a list
url = 'https://rl.insider.gg/en/psn/octane/grey'
site = str(requests.get(url).content).split('\\n')
# Extract the line containing 'var itemData'
itemData = [s for s in site if re.match(r'^\s*var itemData', s)][0].strip()
# Remove 'var itemData' and ';' from that line
# This leaves valid JSON which can be converted from a string using json.loads
itemData = json.loads(re.sub(r'var itemData = |;', '', itemData))
# Index the data to extract the 'currentPriceRange'
print(itemData['currentPriceRange'])
This approach doesn't require Selenium to run the JavaScript and also doesn't require BeautifulSoup to parse the HTML. It does rely on the itemData being initialized in a certain way. Should the developers of that site decide to change the way this is done, you'll have to adapt it slightly in response.
Which method should I use?
If all you really want is the price range and nothing else, then use the first method. If you're interested in other data as well, you'd be better off extracting the full itemData JSON from the source and using that.
One could argue Selenium is more reliable than manual parsing of the HTML, but in this case you're probably fine. In both cases, you assume there is some itemData defined somewhere. If the format does change slightly, then the parsing may break. The other disadvantage is if part of the data relied on JS function calls -- which Selenium would execute, whereas manual parsing couldn't account for. (This isn't the case here, but it could change).

Related

Extracting information from website with BeautifulSoup and Python

I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I thought I would succeed to get all of the text on the page but it didn't work:
from bs4 import BeautifulSoup
import requests
def get_text_from_maagarim_page(url: str):
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "html.parser")
res = soup.find_all(class_ = "tooltippedWord")
text = [el.getText() for el in res]
return text
url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url)) # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is in the following structure location:
<form name="aspnetForm" ...>
...
<div id="wrapper">
...
<div class="content">
...
<div class="mainContentArea">
...
<div id="mainSearchPannel" class="mainSearchContent">
...
<div class="searchPanes">
...
<div class="wordsSearchPane" style="display: block;">
...
<div id="searchResultsAreaWord"
class="searchResultsContainer">
...
<div id="srPanes">
...
<div id="srPane-2" class="resRefPane"
style>
...
<div style="height:600px;overflow:auto">
...
<ul class="esResultList">
...
# HERE IS THE TARGET ITEMS
The relevant items look likes this:
And the relevant data is in <td id ... >

The content you want is not present in the web page that beautiful soup loads. It is fetched in separate HTTP requests done when a "web browser" runs the javascript code present in the said web page. Beautiful Soup does not run javascript.
You may try to figure out what HTTP request has responded with the required data using the "Network" tab in your browser developer tools. If that turns out to be a predictable HTTP request then you can recreate that request in python directly and then use beautiful soup to pick out useful parts. #Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote controlling a web browser with python. It lets a web browser load the page and then you can access the DOM in Python to get what you want from the rendered page. Other answers like Scraping javascript-generated data using Python and scrape html generated by javascript with python can point you in that direction.

Exactly what tag-class are you trying to scrape from the webpage? When I copied and ran your code I included this line to check for the class name in the pages html, but did not find any.
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
less confusion overall when typing it out. As far as a few possible approaches would be to look at the page in chrome (or another browser) using the dev tools to search for some non-random class tags or id tags like esResultListItem.
From there if you know what tag you are looking for //etc you can include it in the search like so.
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know what tag you are looking for as well as if there are any class names or ids included in the tag
<span id="somespecialname" class="verySpecialName"></span>
if you're still looking or help, I can check by tomorrow, it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples Pictures/Tags/etc so we could know how to best explain the process to you.
*

It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach:how to extract data near to what you are looking for:
import requests
from bs4 import BeautifulSoup
post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")
for num, table in enumerate(soup.find_all('table'), start=1):
print(f"Entry {num}")
tr_row_second = table.find('tr', class_='srRowSecond')
td = tr_row_second.find_all('td')[1]
print(" ", td.strong.text)
tr_row_third = table.find('tr', class_='srRowThird')
td = tr_row_third.find_all('td')[1]
print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4>  |  המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה)  |  המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6>  |  המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.

Extract property id from attribute using xpath

I have been trying to extract property id from the following website: https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any
But whichever combination I try to use I can't seem to retrieve it.
Property id is located here:
<div class="corner-ribbon">
<span class="ribbon-green">NEW!</span>
</div>
<a href="Details?id=182519" title="view this property">
<img class="img-responsive img-prop" src="https://kwsadocuments.blob.core.windows.net/devblob/24c21aa4-ae17-41d1-8719-5abf8f24c766.jpg" alt="Living close to Nature">
</a>
And here is what I have tried so far:
response.xpath('//a[#title="view this property"]/#href').getall(),
response.xpath('//*[#id="divListingResults"]/div/div/a/#href').getall(),
response.xpath('//*[#class="corner-ribbon"]/a/#href').getall()
Any suggestion on what I might be doing wrong?
Thank you in advance!

First you need to understand how this page works. It loads properties using Javascript (check page source in your browser using Ctrl+U) and (as you know) Scrapy can't process Javascript.
But if you check page source you'll find that all information your need is "hidden" inside <input id="propertyJson" name="ListingResults.JsonResult" > tag. So all you need to get that value and process it using json module:
import scrapy
import json
class PropertySpider(scrapy.Spider):
name = 'property_spider'
start_urls = ['https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any']
def parse(self, response):
property_json = response.xpath('//input[#id="propertyJson"]/#value').get()
# with open('Samples/Properties.json', 'w', encoding='utf-8') as f:
# f.write(property_json)
property_data = json.loads(property_json)
for property in property_data:
property_id = property['Id']
property_title = property['Title']
print(property_id)
print(property_data)

why do i get empty lists returned from web scraping?

i am trying to get the weather from a website and collect this data. but some requests return empty lists or different information then expected. why does this happen and what is the correct format and method to getting the right xpath and information from a website.
i have tried using multiple websites but cannot consistantly get results.
import requests
from lxml import html
site1data = requests.get('http://m.bom.gov.au/vic/melbourne/', verify =
False)
tree = html.fromstring(site1data.content)
humidity = tree.xpath('//div[#class="humidity"]/text()')
print(humidity)
the expected result was something like:
67%
but i got:
['\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t']

Because the text data you are looking for is presented inside a <p> tag, not inside the <div> itself:
<div class="humidity">
<h3>Humidity</h3>
<img class="humidity" src="/assets/images/ui/humidity.svg" />
<p>65%</p>
</div>
This xpath should solve your immediate problem:
humidity = tree.xpath('//div[#class="humidity"]/p/text()')

If you look at the site they offer a beta site which is API fed so you can get all the info from that endpoint as json
import requests
r = requests.get('https://api.weather.bom.gov.au/v1/locations/r1r0fs/observations').json()
print(r)

Requests won't get the text from web page?

I am trying to get the value of VIX from a webpage.
The code I am using:
raw_page = requests.get("https://www.nseindia.com/live_market/dynaContent/live_watch/vix_home_page.htm").text
soup = BeautifulSoup(raw_page, "lxml")
vix = soup.find("span",{"id":"vixIdxData"})
print(vix.text)
This gives me:
' '
If I see vix,
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">/span>
On the site the element has text,
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">15.785/span>
The 15.785 value is what I want to get by using requests.

The data you're looking for, is not available in the page source. And requests.get(...) gets you only the page source without the elements that are dynamically added through JavaScript. But, you can still get it using requests module.
In the Network tab, inside the developer tools, you can see a file named VixDetails.json. A request is being sent to https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json, which returns the data in the form of JSON.
You can access it using the built-in .json() function of the requests module.
r = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json')
data = r.json()
vix_price = data['currentVixSnapShot'][0]['CURRENT_PRICE']
print(vix_price)
# 15.7000

When you open the page in a web browser, the text (e.g., 15.785) is inserted into the span element by the getIndiaVixData.js script.
When you get the page using requests in Python, only the HTML code is retrieved and no JavaScript processing is done. So, the span element stays empty.
It is impossible to get that data by solely parsing the HTML code of the page using requests.

Requests gets dashes, while the same webpage gives me the page with all details? [duplicate]

I am trying to get the value of VIX from a webpage.
The code I am using:
raw_page = requests.get("https://www.nseindia.com/live_market/dynaContent/live_watch/vix_home_page.htm").text
soup = BeautifulSoup(raw_page, "lxml")
vix = soup.find("span",{"id":"vixIdxData"})
print(vix.text)
This gives me:
' '
If I see vix,
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">/span>
On the site the element has text,
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">15.785/span>
The 15.785 value is what I want to get by using requests.

The data you're looking for, is not available in the page source. And requests.get(...) gets you only the page source without the elements that are dynamically added through JavaScript. But, you can still get it using requests module.
In the Network tab, inside the developer tools, you can see a file named VixDetails.json. A request is being sent to https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json, which returns the data in the form of JSON.
You can access it using the built-in .json() function of the requests module.
r = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json')
data = r.json()
vix_price = data['currentVixSnapShot'][0]['CURRENT_PRICE']
print(vix_price)
# 15.7000

When you open the page in a web browser, the text (e.g., 15.785) is inserted into the span element by the getIndiaVixData.js script.
When you get the page using requests in Python, only the HTML code is retrieved and no JavaScript processing is done. So, the span element stays empty.
It is impossible to get that data by solely parsing the HTML code of the page using requests.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

extract text from h1 and id with python beautiful soup - python

Related

Extracting information from website with BeautifulSoup and Python

Extract property id from attribute using xpath

why do i get empty lists returned from web scraping?

Requests won't get the text from web page?

Requests gets dashes, while the same webpage gives me the page with all details? [duplicate]

Categories

Resources