Python Beautiful Soup - span class text not extracted

I'm using Beautiful Soup for the first time and the text from the span class is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens; it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code, the element at index 0 comes back as:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-
item="rate" data-section="PHL" data-subsection="VR"></span>
However, when I inspect via Chrome I'd expect something like this, which contains the text I want (the interest rate):
<span class="productdata" data-cc="AU" data-section="PHL" data-
subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>

The data you are trying to extract does not exist in the page source. It is loaded with JavaScript after the page loads: the website uses a JSON API to populate the page, so Beautiful Soup cannot find the data. The data can be viewed at the following link, which hits the site's JSON API and returns JSON:
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse that JSON and get the data. For HTTP requests I would also recommend the requests package.
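A minimal sketch of that approach with requests; the structure of the JSON payload isn't documented here, so the code only fetches it and prints its top-level shape before you dig further:
import requests
url = 'https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL'
resp = requests.get(url)
resp.raise_for_status()          # fail loudly on HTTP errors
data = resp.json()               # parse the JSON body
# Inspect the top-level keys (or type) to see how the payload is organised
print(list(data.keys()) if isinstance(data, dict) else type(data))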

As others have said, the content is generated by JavaScript; you can use Selenium together with ChromeDriver to find the data you want, with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]

Related

How can I get information from a web site using BeautifulSoup in python?

I need to extract the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML shown by 'inspect' on the web page, I find the publication date quickly, but when I search the HTML I get with Python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], even though the tag is there in the 'inspect' view of the live page.
What am I doing wrong that makes the 'inspect' code different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
I believe the problem is that the content you are looking for is loaded by JavaScript after the initial page load. requests will only show what the initial page content looked like before the DOM was modified by JavaScript.
For this you might install selenium and then download a Selenium web driver for your specific browser. Put the driver in some directory that is on your path, and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
Umberto, if you are looking for an HTML span element, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
If you are looking for an element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case you are fetching all span HTML elements.
In the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you look into HTML tags and elements for a deeper understanding of how they work and what the semantics behind them are.

How to scrape when data tables do not show in page source

I would like to scrape all running times (not just the first 10 results) from the data tables on https://www.ijsselsteinloop.nl/uitslagen-2019. However, the data shown on the webpage does not appear in the page source. Under every data table there is a hyperlink ("hier"). These link to the full data table pages, but those links are not in the page source either.
Any suggestions or code snippets for how to scrape this data (with Python and BeautifulSoup or Scrapy)?
Use the same endpoint the page uses for that content. You can find this in the network tab of the browser.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.ijsselsteinloop.nl/uitslag/2019/index.html')
soup = bs(r.content, 'lxml')
links = ['https://www.ijsselsteinloop.nl/uitslag/2019/' + item['href'] for item in soup.select('[href^=uitslag]')]
for link in links:
    table = pd.read_html(link)[0]
    print(table)
You could use BeautifulSoup. First:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

uClient = uReq(my_url)  # my_url: one of the full result pages linked above
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
Then use find_all() to get every tr, and inside a for loop call find_all('td') again to get every row's cells, as sketched below.
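A rough sketch of that loop, continuing from the page_soup above (it assumes the result pages are plain HTML tables with tr/td rows, which you should confirm against the page):
rows = page_soup.find_all("tr")
for row in rows:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # skip rows without td cells (e.g. header rows)
        print(cells)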

Unable to scrape containers from webpages

I am trying to practice web scraping on an e-commerce webpage. I have identified the class name of the container (the cell which contains each product) as 'c3e8SH'. I then used the following code to scrape all the containers on that webpage, after which I used len(containers) to check the number of containers on the page.
However, it returned 0. Can someone point out what I am doing incorrectly? Thank you very much!
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
#grabs each product
containers = page_soup.find_all('div', class_='c3e8SH')
len(containers)
(1) Firstly, the cookies param is needed.
You will get a validation page instead if you request the link without cookies:
https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1
(2) Secondly, the page you want to scrape is dynamically loaded.
That's why what you see in the web browser is different from what you get from your code.
For convenience, I'd prefer to use the requests module.
import requests
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
    "Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548133175,1548135160,1548135844",
    "Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548135844",
    "x5sec": "7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text)  # True
# then you can get json-style shop items from ret.text
The shop items look something like this:
item_json = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "offers": {
                "priceCurrency": "SGD",
                "@type": "Offer",
                "price": "72.90",
                "availability": "https://schema.org/InStock"
            },
            "image": "https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
            "@type": "Product",
            "name": "Nintendo Switch New Super Mario Bros U Deluxe",  # item name
            "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
        },
        ...
    ]
}
As the JSON data shows, you can get any item's name, URL, price, and so on.
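For example, a rough sketch of pulling that item list out of ret.text; it assumes the JSON-LD sits in a <script type="application/ld+json"> tag, which is a common pattern but should be confirmed against the actual page:
import json
from bs4 import BeautifulSoup

page = BeautifulSoup(ret.text, "html.parser")
for script in page.find_all("script", {"type": "application/ld+json"}):
    if not script.string:
        continue
    data = json.loads(script.string)
    if isinstance(data, dict) and data.get("@type") == "ItemList":
        for element in data.get("itemListElement", []):
            # keys as shown in the JSON above
            print(element.get("name"), element.get("url"), element["offers"]["price"])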
Try using a different parser.
I recommend lxml.
So your line where you create the page_soup would be:
page_soup = soup(page_html, 'lxml')
I tried to find c3e8SH in your suggested document with a regex, but I couldn't find such a class name. Please check your document again.

Beautiful Soup and the findAll() process

I am attempting to scrape data from a site using the following code. The site required the decode method, and I followed a solution by #royatirek. My problem is that container_a ends up containing nothing. I use a similar method on a few other sites and it works, but on this and a couple of other sites my container_a variable remains an empty list. Cheers.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
my_url = 'http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824'
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
page_soup = soup(web_byte, "html.parser")
container_a = page_soup.findAll("div",{"class":"fyre-comment-wrapper"})
The content you want to parse is being dynamically loaded by JavaScript and therefore requests won't do the job for you. You could use selenium and ChromeDriver or any other driver for that:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824")
You can then proceed with the use of bs4 as you wish by accessing the page source using .page_source:
page_soup = BeautifulSoup(driver.page_source, "html.parser")
container_a = page_soup.findAll("div",{"class":"fyre-comment-wrapper"})

How to python Scrape text in span class

So I'm making a practice bitcoin checker and I'm having trouble scraping data, because the data I want is in a span class and I don't know how to retrieve it.
Here is the line that I got from inspect:
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
I want to scrape the "11,511.31" number. How do I do this?
I tried so many different things and I honestly have no clue what to do anymore.
Here is the URL: https://www.gdax.com/trade/BTC-USD
I'm scraping the current USD price (right next to "BTC/USD").
EDIT: A lot of the examples you gave me are ones where I input the data myself. That's not useful, because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data, and print it.
EDIT: current code. I need the program to fetch the "html" part by itself:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
url = 'https://www.gdax.com/trade/BTC-USD'
#program need to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
You just have to search for the right tag and class -
from bs4 import BeautifulSoup
html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""
html = BeautifulSoup(html_text, "lxml")
spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
This searches for all <span> tags and then filters them by the class attribute, which in your case has the value MarketInfo_market-num_1lAXs. Once the filter is done, just loop through the spans; using the .text attribute you can retrieve the text, then replace the 'USD'.
UPDATE
import requests
import json
url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url)
json_res = json.loads(res.text)
print(json_res[0]['price'])
No need to understand the HTML. The data in that HTML tag is getting populated from an API call which has a JSON response. You can call that API directly. This will keep your data current.
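Since you want to refresh every 30 seconds anyway, you can simply poll that endpoint in a loop; a minimal sketch using the same URL and response shape as above:
import time
import requests

url = 'https://api.gdax.com/products/BTC-USD/trades'
while True:
    res = requests.get(url)
    trades = res.json()
    print(trades[0]['price'])  # most recent trade price
    time.sleep(30)             # wait 30 seconds before the next poll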
You can use BeautifulSoup or lxml.
For BeautifulSoup, the code is as follows:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")
print(soup.string)
lxml is faster:
from lxml import etree
span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")
for i in span.xpath("//span/text()"):
    print(i)
Try a real browser with Selenium and Firefox. I tried to use Selenium with PhantomJS, but I failed...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = 'https://www.gdax.com/trade/BTC-USD'
driver = webdriver.Firefox(executable_path='./geckodriver')
driver.get(url)
sleep(10) # Sleep 10 seconds while waiting for the page to load...
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
driver.close()
Output:
11,493.00
+
3.06 %
13,432 BTC
[Finished in 15.0s]
