I'm trying to retrieve a value from this HTML using bs4. I'm really new to data scraping and I have tried to figure out some ways to get this value but to no avail. The closest solution I saw is this one.
Extracting a value from html table using BeautifulSoup
Here is the HTML I am looking at:
<div class="dataItem_hld clearfix">
<div class="smalltxt">ROE</div>
<div name="tixStockRoe" class="value">121.362</div>
</div>
I've tried this so far:
from bs4 import BeautifulSoup as BS
import requests
url = "https://www.bursamarketplace.com/mkt/themarket/stock/SUPM"
html_content = requests.get(url).text
soup = BS(html_content, 'lxml')
val = soup.find_all('div', {'name': "tixStockRoe", 'class':"value"})
Before I even get to use strip() on the value, my val variable is already empty:
In [96]: val
Out[96]: []
I've been searching through posts for a few hours, but I haven't managed to write the correct code to get the value yet.
Also, please let me know if there are any good resources for learning about extracting data. Thanks
Update
I have edited the code thanks to the responses to this post. Now I've hit another problem: the number 121.362 does not appear in the variable. Any ideas?
val = soup.find_all(attrs={'name': "tixStockRoe"})
and the output is this:
Out[14]: [<div class="value" name="tixStockRoe"><div class="loader loaderSmall"><div class="loader_hld"><img alt="" src="/img/loading.gif"/></div></div></div>]
The data on that page is loaded by JavaScript, and that is the reason you aren't finding the value you are looking for (121.362) with BeautifulSoup.
BeautifulSoup only parses the static HTML the server returns; it does not execute JavaScript.
You need to use Selenium to render the page and then extract the data. You can read more about web scraping with Selenium here.
Here is how you scrape using Selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=options)
url = 'https://www.bursamarketplace.com/mkt/themarket/stock/SUPM'
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
d = soup.find('div', attrs= {'name': 'tixStockRoe'})
print(d.text.strip())
121.362
Check the docs:
You can't use the name keyword argument to search for HTML's "name" attribute, because BeautifulSoup uses name for the tag name itself. Use the attrs dictionary instead:
soup.find_all(attrs={"name": "tixStockRoe"})
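The difference is easy to see offline on the question's own markup: name= filters on tag names, while attrs= filters on attributes. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<div name="tixStockRoe" class="value">121.362</div>'
soup = BeautifulSoup(html, 'html.parser')

# name= matches TAG names, so this looks for a <tixStockRoe> element:
print(soup.find_all(name="tixStockRoe"))  # []
# attrs= targets the name="..." attribute:
print(soup.find_all(attrs={"name": "tixStockRoe"})[0].text)  # 121.362
```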
You can try this:
# I don't use lxml; in my case html.parser works
soup = BS(page, 'html.parser')
val = soup.find_all('div', attrs={'name': "tixStockRoe", 'class':"value"})
Related
I need to get the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML retrieved with Python, I cannot find it, even with find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id='biblio-publication-number-content')
but it gives me [], while in the 'inspect' code of the online page, there is this tag.
What am I doing wrong that makes the 'inspect' code different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page load. requests only shows the initial page content, before the DOM is modified by JavaScript.
To handle this, install selenium and download the Selenium web driver for your specific browser. Put the driver in some directory that is on your path, and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
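Once the span has been rendered, pulling the number and date out of it is plain string work; a minimal sketch against the markup printed above:

```python
from bs4 import BeautifulSoup

html = ('<span id="biblio-publication-number-content">'
        '<span class="search">CN105030410</span>A·2015-11-11</span>')
soup = BeautifulSoup(html, 'html.parser')

# get_text() flattens the nested spans into one string
text = soup.find('span', id='biblio-publication-number-content').get_text()
number, date = text.split('·')
print(number)  # CN105030410A
print(date)    # 2015-11-11
```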
Umberto, if you are looking for span HTML elements, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
If you are looking for an element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case you are fetching all span elements.
In the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you look into HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I am trying to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps due to the cookie-acceptance banner.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4 and hope someone can point me in the right direction.
Thank you in advance!
To get correct tags, remove "focus-within" class (it's added later by JavaScript):
import requests
from bs4 import BeautifulSoup
url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link you can use for example CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
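The :-soup-contains pseudo-class (provided by the soupsieve package that ships with bs4) can be tried offline on a reduced version of the anchor; the href below is a placeholder, not the real document URL:

```python
from bs4 import BeautifulSoup

html = ('<a class="btn btn--naked btn--icon-left btn--block" '
        'href="https://example.com/docs?editionId=123">'
        'Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>')
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first element whose text contains the given string
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])  # https://example.com/docs?editionId=123
```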
I am trying to scrape NBA.com's play-by-play table, so I want to get the text of each box shown in the example picture,
for example (https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play).
Checking the HTML, I figured that each line is in an article tag that contains a div tag, which in turn contains two p tags with the information I want. However, with the following code I get back 0 articles and only 9 p tags (there should be many more), and even the tags I do get contain different text, not the box contents. Since I only get 9 tags, I am doing something terribly wrong and I am not sure what it is.
this is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def contains_word(t):
    return t and 'keyword' in t
url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles=soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
Use Selenium, since the page is rendered by JavaScript, and pass the page source to BeautifulSoup. Also, pip install selenium and get chromedriver.exe.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
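If a full browser feels heavy, it is also worth checking whether the page embeds its data as JSON in a script tag; many React/Next.js sites (and nba.com at the time, as far as I can tell) expose a __NEXT_DATA__ blob. The page structure here is an assumption, but the parsing pattern itself is simple:

```python
import json
from bs4 import BeautifulSoup

# A reduced, hypothetical stand-in for the real page source:
html = ('<script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"playByPlay": [{"clock": "12:00", "description": "Jump ball"}]}}'
        '</script>')
soup = BeautifulSoup(html, 'html.parser')

# The script tag's content is plain JSON; parse it with the json module
data = json.loads(soup.find('script', id='__NEXT_DATA__').string)
print(data["props"]["playByPlay"][0]["description"])  # Jump ball
```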
How can I get the value of data-d1-value when I am using requests library of python?
requests.get(URL) itself is not returning the data-* attributes of the div, even though they are present in the original webpage.
The web page is as follows:
<div id="test1" class="class1" data-d1-value="150">
180
</div>
The code I am using is:
import requests
from bs4 import BeautifulSoup

req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
d1_value = soup.find('div', {'class': "class1"})
print(d1_value)
The result I get is:
<div id="test1" class="class1">
180
</div>
When I debugged this, I found that requests.get(URL) is not returning the full div, only the id and class, without the data-* attributes.
How should I modify to get the full value?
For better example:
For my case the URL is:
https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG
And the variable information:
The div class is class="inprice1 nsecp", and the value of data-numberanimate-value is what I am trying to fetch.
Thanks in advance :)
EDIT
The website's response differs depending on how you request it. In your case, using requests, the value you are looking for is served this way:
<div class="inprice1 nsecp" id="nsecp" rel="92.75">92.75</div>
So you can get it from the rel or from the text:
soup.find('div', {'class':"inprice1"})['rel']
soup.find('div', {'class':"inprice1"}).get_text()
Example
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG')
soup = BeautifulSoup(req.text, 'lxml')
print('rel: '+soup.find('div', {'class':"inprice1"})['rel'])
print('text: '+soup.find('div', {'class':"inprice1"}).get_text())
Output
rel: 92.75
text: 92.75
To get a response that matches the source as you inspect it in the browser, you have to try Selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG"
driver.get(url)
sleep(2)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find('div', class_='inprice1 nsecp')['data-numberanimate-value'])
driver.close()
To get the attribute value just add ['data-d1-value'] to your find()
Example
from bs4 import BeautifulSoup
html='''
<div id="test1" class="class1" data-d1-value="150">
180
</div>
'''
soup = BeautifulSoup(html, 'lxml')
d1_value = soup.find('div', {'class':"class1"})['data-d1-value']
print(d1_value)
You are seeing this issue because you didn't retrieve the other attributes that were defined on the div.
The code below retrieves all of the attributes defined on the div, including the custom ones:
from bs4 import BeautifulSoup
s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div',{'class':"class1"}).attrs
print(attributes_dictionary)
You can get the data from the HTML, or you can scrape the API directly.
This is an example:
The website is Money Control.
If you go to the developer tools in your browser and select Network, you can see the requests the website makes:
See image
You can see in the headers the URL of the API: priceapi.moneycontrol.com.
This is a strange case, because the API is open... and usually it isn't.
You can access the price:
Imagine that you save the parsed JSON data into a variable called json_data; you can access the price with:
json_data['data']['pricecurrent']
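In Python that nested access is plain dictionary indexing once the response is parsed; a minimal sketch using a made-up payload in place of the real API response (the exact endpoint and field layout are assumptions):

```python
import json

# Stand-in for response.json() from the price API:
payload = json.loads('{"data": {"pricecurrent": "92.75"}}')

# Nested JSON objects become nested dicts:
print(payload["data"]["pricecurrent"])  # 92.75
```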
So I'm trying web scraping for the first time using BeautifulSoup and Python. The page that I am trying to scrape is at: http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

client = urlopen('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
page_html = client.read()
client.close()
page_soup = soup(page_html, 'html.parser')
identification = page_soup.find('div', {'data-bind': 'text: name'})
print(identification.text)
When I do this I simply get an empty string. If I print the identification variable itself I get:
<div class="col-xs-7" data-bind="text: name"></div>
This is the line of HTML whose value I am trying to get; in the browser there is a value, A LEBLANC, inside the tag.
You can try this code:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
# note: @id, not #id, in the XPath
find = driver.find_element_by_xpath('//*[@id="identificationCollapse"]/div/div/div/div[1]/div[1]/div[2]')
print(find.text)
output:
A LEBLANC
There are several ways you can achieve the same goal. However, I've used a CSS selector in my script, which is easy to understand and less likely to break unless the HTML structure of the website is heavily changed. Try this out as well.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.select("[data-bind$='name']")[0].text
print(item_name)
Result:
A LEBLANC
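The $= in that selector matches attribute values that end with the given string, so it latches onto data-bind="text: name". A small offline demo with the value filled in as the browser would render it:

```python
from bs4 import BeautifulSoup

html = '<div class="col-xs-7" data-bind="text: name">A LEBLANC</div>'
soup = BeautifulSoup(html, 'html.parser')

# [attr$='suffix'] selects elements whose attribute value ends with 'suffix'
item_name = soup.select("[data-bind$='name']")[0].text
print(item_name)  # A LEBLANC
```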
Btw, the way you started will also work:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.find('div', {'data-bind':'text: name'}).text
print(item_name)