How to access a value within the body of html using soup

How to access a value within the body of html using soup - python

I am trying to access the live price of a crypto currency coin from an exchange webpage through python.
The XPath of the value I want is in /html/body/luno-exchange-app/luno-navigation/div/luno-market/div/section[2]/luno-orderbook/div/luno-spread/div/luno-orderbook-entry[2]/div/span/span and I have been doing the following
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
exchage_url = 'https://www.luno.com/trade/markets/LTCXBT'
uClient = uReq(exchange_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
print(page_soup.html.body.luno-exchange-app.luno-navigation.div.luno-market.div.section[2].luno-orderbook.div.luno-spread.div.luno-orderbook-entry[2].div.span.span)
However the code doesn't work, due to the '-' in print(page_soup.html.body.luno-exchange-app...)
Is there a way to get that value I want?
How I got the XPath:
I pressed F12, right click the red box, and clicked "Inspect" which highlights a section in the html which I then right click, copy -> XPath. Here is a picture to visually show the value I want The value I am interested in

It is dynamically added from an ajax request
import requests
r = requests.get('https://ajax.luno.com/ajax/1/ticker?pair=LTCXBT')
print(r.json()['bid'])

Related

Python BeautifulSoup missing information inside tag how to include information inside the tag

I am trying to scrape some data from a website.
But when I want to print it I just get the tags back, but with out the information in it.
This is the code:
#Imports
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#URL
my_url = 'https://website.com'
#Opening connection grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
#closing the page
uClient.close()
#Parse html
page_soup = soup(page_html,"html.parser")
price = page_soup.findAll("div",{"id":"lastTrade"})
print(price)
This ist what I get back:
[<div id="lastTrade"> </div>]
So does anyone can tell me what i have to change or add so I receive the actual infortmation from inside this tag?

Maybe loop through your list like this :
for res in price:
print(res.text)

Unable to scrape containers from webpages

I am trying to practice web-scraping from a e-commerce webpage. I have identified the class name of the container (cell which contains each product) to be 'c3e8SH'. I then used the following code to scrape for all containers in that webpage. After which, I used len(containers) to check the number of containers in the webpage.
However, it returned a 0. Can someone point out what I am doing incorrectly? Thank you very much!
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
#grabs each product
containers = page_soup.find_all('div', class_='c3e8SH')
len(containers)

(1) Firstly, param cookies is needed.
You will get the validation page as below if you only request the link without cookies
https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1
(2) secondly, The page you want to scrape is dynamicly loaded
That's why what you see through web browser is different from what you get by codes
for convenience , i'd prefer to use requests module.
import requests
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lvt_7cd4710f721b473263eed1f0840391b4":"1548133175,1548135160,1548135844",
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4":"1548135844",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
# then you can get a json-style shop-items in ret.text
shop-items like as:
item_json =
{
"#context":"https://schema.org",
"#type":"ItemList",
"itemListElement":[
{
"offers":{
"priceCurrency":"SGD",
"#type":"Offer",
"price":"72.90",
"availability":"https://schema.org/InStock"
},
"image":"https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
"#type":"Product",
"name":"Nintendo Switch New Super Mario Bros U Deluxe", # item name
"url":"https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
},
...
]
}
as json data showed, you can get any item's name, url-link, price, and so on.

Try using a different parser.
I recommend lxml.
So your line where you create the page_soup would be:
page_soup = soup(page_html, 'lxml')

I tried to find c3e8SH in your suggested document with regex, but I coudn't find such class name. Please, check your document again.

Can't parse a 2nd page using beautiful soup

I am trying to navigate a website using beautifulsoup. I open the first page and find the links I want to follow, but when I ask beautiful soup to open the next page, none of the HTML is parsed and it just returns this
<function scraper at 0x000001E3684D0E18>
I have tried opening the second page in its own script and it works just fine so the problem has to do with parsing a page from another page.
I have ~2000 links I need to go through so I created a function that goes through them. Here's my script so far
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import lxml
# The first webpage I'm parsing
my_url = 'https://mars.nasa.gov/msl/multimedia/raw/'
#calls the urlopen function from the request module of the urllib module
# AKA opens up the connection and grabs the page
uClient = uReq(my_url)
#imports the entire webpage from html format into python.
# If webpage has lots of data this can take a long time and take up a lot of
space or crash
page_html = uClient.read()
#closes the client
uClient.close()
#parses the HTML using bs4
page_soup = soup(page_html, "lxml")
#finds the categories for the types of images on the site, category 1 is
RHAZ
containers = page_soup.findAll("div", {"class": "image_list"})
RHAZ = containers[1]
# prints the links in RHAZ
links = []
for link in RHAZ.find_all('a'):
#removes unwanted characters from the link making it usable.
formatted_link = my_url+str(link).replace('\n','').split('>')
[0].replace('%5F\"','_').replace('amp;','').replace('<a href=\"./','')
links.append(formatted_link)
print (links[1])
# I know i should be defining a function here.. so ill give it a go.
def scraper():
pic_page = uReq('links[1]') #calls the first link in the list
page_open = uClient.read() #reads the page in a python accessible format
uClient.close() #closes the page after it's been stored to memory
soup_open = soup(page_open, "lxml")
print (soup_open)
print (scraper)
Do I need to clear the previously loaded HTML in beautifulsoup so I can open the next page? If so, how would I do this? Thanks for any help

You need to make requests from the urls scraped from first page...check this code.
from bs4 import BeautifulSoup
import requests
url = 'https://mars.nasa.gov/msl/multimedia/raw'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')
img_list = soup.find_all('div', attrs={'class': 'image_list'})
for i in img_list:
image = i.find_all('a')
for x in image:
href = x['href'].replace('.', '')
link = (str(url)+str(href))
req2 = requests.get(link)
soup2 = BeautifulSoup(req2.content, 'lxml')
img_list2 = soup2.find_all('div', attrs={
'class': 'RawImageUTC'})
for l in img_list2:
image2 = l.find_all('a')
for y in image2:
href2 = y['href']
print(href2)
Output:
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FRB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134317EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134106EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134065EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134052EDR_S0722464FHAZ00222M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590133948EDR_S0722464FHAZ00222M_.JPG

How do I fetch content from switchable tab in BeautifulSoup?

There are 4 switchable tabs on the website, I managed to extract from the first tab but couldn't figure out how to extract from other three tabs because the tab needs to be clicked on (i think).
Product Details
Feedback
Shipping & Payment
Seller Guarantees
my code :
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myurl = 'https://www.aliexpress.com/item/Vfemage-Womens-Elegant-Ruched-Bow-Contrast-Patchwork-3-4-Sleeve-Vintage-Pinup-Work-Office-Party-Fitted/32831085887.html?spm=2114.search0103.3.12.iQlXqu&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10345_10342_10325_10343_51102_10546_10340_10548_10341_10609_10541_10084_10083_10307_10610_10539_10312_10313_10059_10314_10534_100031_10604_10603_10103_10605_10594_10142_10107,searchweb201603_25,ppcSwitch_5&algo_expid=a3e03a67-d922-4c90-aba7-d3cc80101a75-1&algo_pvid=a3e03a67-d922-4c90-aba7-d3cc80101a75&rmStoreLevelAB=0'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
productdetails = page_soup.select("ul.product-property-list.util-clearfix li")
How do you extract contents from the other 3 tabs ?

I used Selenium to click each tabs & extract contents from all tabs with it.

Python Beautiful Soup - Span class text not extracted

I'm using beautiful soup for the first time and the text from the span class is not being extracted. I'm not familiarized with HTML so I'm unsure as to why this happens, so it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.Close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-
item="rate" data-section="PHL" data-subsection="VR"></span>
However I'd expect something like this when I inspect via Chrome, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-
subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>

Data you are trying to extract does not exists. It is loaded using JS after the page is loaded. Website uses a JSON api to load information on the page. So Beautiful soup can not find the data. Data can be viewed at following link that hits JSON API on the site and provides JSON data.
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the json and get the data. Also for HTTP requests I would recommend requests package.

As others said, the content is JavaScript generated, you can use selenium together ChromeDriver to find the data you want with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for items in items]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to access a value within the body of html using soup - python

It is dynamically added from an ajax request import requests r = requests.get('https://ajax.luno.com/ajax/1/ticker?pair=LTCXBT') print(r.json()['bid'])

Related

Python BeautifulSoup missing information inside tag how to include information inside the tag

Unable to scrape containers from webpages

Can't parse a 2nd page using beautiful soup

How do I fetch content from switchable tab in BeautifulSoup?

Python Beautiful Soup - Span class text not extracted

Categories

Resources