How to scrape when data tables do not show in page source

I would like to scrape all running times (not just the first 10 results) from the data tables on https://www.ijsselsteinloop.nl/uitslagen-2019. However, the data shown on the webpage does not appear in the page source. Under every data table there is a hyperlink ("hier") that links to the full data table page, but those links are not in the page source either.
Any suggestions or code snippets for scraping this data with Python and BeautifulSoup or Scrapy?

Use the same endpoint the page uses for that content. You can find this in the network tab of the browser.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.ijsselsteinloop.nl/uitslag/2019/index.html')
soup = bs(r.content, 'lxml')
links = ['https://www.ijsselsteinloop.nl/uitslag/2019/' + item['href'] for item in soup.select('[href^=uitslag]')]
for link in links:
    table = pd.read_html(link)[0]
    print(table)
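
The link-building step above can also be written with urljoin, which resolves relative hrefs against the index URL instead of relying on string concatenation. This is a minimal sketch, not the answerer's code: the helper name result_links and the file names in the sample HTML are hypothetical, and the snippet parses an inline string so it runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def result_links(index_html, base_url):
    """Return absolute URLs for every <a> whose href starts with 'uitslag'."""
    soup = BeautifulSoup(index_html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select('a[href^="uitslag"]')]


# Hypothetical stand-in for the index page returned by the endpoint.
sample_index = """
<ul>
  <li><a href="uitslag_10km.html">10 km</a></li>
  <li><a href="uitslag_5km.html">5 km</a></li>
  <li><a href="contact.html">Contact</a></li>
</ul>
"""

links = result_links(
    sample_index, "https://www.ijsselsteinloop.nl/uitslag/2019/index.html"
)
print(links)
```

Each URL in links could then be passed to pd.read_html as in the answer.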

You could use BeautifulSoup. First:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# my_url is the address of the page to scrape
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
Then call find_all('tr') to get every row, loop over the rows, and call find_all('td') on each row to get its cells.
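
The row-and-cell loop described above might look like the sketch below. It only helps when the rows are actually present in the HTML that was downloaded. The helper name table_rows and the sample table are made up for illustration.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return the text of every <td>, grouped by table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th>, so they yield no cells
            rows.append(cells)
    return rows


# Hypothetical sample data standing in for a results table.
sample = """
<table>
  <tr><th>Pos</th><th>Name</th><th>Time</th></tr>
  <tr><td>1</td><td>Runner A</td><td>0:33:10</td></tr>
  <tr><td>2</td><td>Runner B</td><td>0:34:02</td></tr>
</table>
"""

rows = table_rows(sample)
print(rows)
```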

Related

Why does parsing a table with BeautifulSoup in Python not work on this website as intended?

I am stuck on this website. Over the past week I have written some small scripts to learn BeautifulSoup, researched how to use it, and read the official documentation. I have also gone through tutorials and videos on how to parse a table from a website. I have parsed data from tables using the methods soup.find() and soup.select() on several websites, such as:
Games engine website
MLB stats website
Wikipedia
For example, for the MLB stats website I used the following code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as bs
def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return soup
soup = connection('https://baseballsavant.mlb.com/team/146')
table = soup.findAll("div", {"class": "table-savant"}) #<--using method soup.find()
#table = soup.select("div.table-savant") #<-----------------using method soup.select()
for n in range(len(table)):
    if (n==9): break
    content = table[n]
    columns = content.find("thead").find_all("th")
    column_names = [str(c.string).strip() for c in columns]
    table_rows = soup.findAll("tbody")[n].find_all("tr")
    l = []
    for tr in table_rows:
        td = tr.find_all("td")
        row = [str(cell.text).strip() for cell in td]
        l.append(row)
    print(l)
Then I convert them into a data frame. But there is one particular page whose tables I cannot retrieve. I tried just printing the content with find():
def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return soup
soup = connection('https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4')
table = soup.findAll("div", {"class": "table-savant"}) #<--using method soup.find()
print(table)
result: []
With select():
table = soup.select("div.table-savant")
print(table)
result: []
With select() using CSS path from this post:
table = soup.select('#preview > div:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(3) > table:nth-of-type(1) > tbody:nth-of-type(2) > tr:nth-of-type(2) > td:nth-of-type(3)')
print(table)
result: []
I want to retrieve the stats from the players, but I'm lost. Any suggestion will be highly appreciated. Thank you.

Problem: The page uses JavaScript to fetch and display the content, so requests or similar libraries are not enough on their own: they download the HTML but never execute the JavaScript.
Solution: Use Selenium to load the page, then parse the rendered content with BeautifulSoup.
Sample code:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4'
d = webdriver.Chrome()
d.get(url)
soup = BeautifulSoup(d.page_source, 'html.parser')
d.quit()
To use webdriver.Chrome you will also have to download chromedriver and put the executable in your project folder or somewhere on PATH.
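
Before reaching for Selenium, it can help to confirm that the element really is missing from the HTML as downloaded. The sketch below is one way to do that check. The helper name appears_in_static_html and both sample strings are hypothetical, and in real use the html argument would be the body returned by urlopen or requests.

```python
from bs4 import BeautifulSoup


def appears_in_static_html(html, css_selector):
    """Return True if the selector matches something in the raw HTML."""
    return bool(BeautifulSoup(html, "html.parser").select(css_selector))


# A page that ships its table in the HTML itself.
static_page = '<div class="table-savant"><table><tr><td>1</td></tr></table></div>'
# A page that ships an empty shell and fills it in with JavaScript later.
js_page = '<div id="preview"></div><script src="app.js"></script>'

print(appears_in_static_html(static_page, "div.table-savant"))  # True
print(appears_in_static_html(js_page, "div.table-savant"))      # False
```

A False result for a selector that matches in the browser's inspector suggests the content is added after page load, which is the case the answer describes.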

Unable to scrape containers from webpages

I am trying to practice web scraping on an e-commerce webpage. I have identified the class name of the container (the cell that contains each product) as 'c3e8SH'. I then used the following code to scrape all containers on that webpage, and used len(containers) to check how many were found.
However, it returned 0. Can someone point out what I am doing incorrectly? Thank you very much!
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
#grabs each product
containers = page_soup.find_all('div', class_='c3e8SH')
len(containers)

(1) First, cookies are required.
If you request the link without cookies, you get a validation page instead of the search results:
https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1
(2) Second, the page you want to scrape is loaded dynamically.
That is why what you see in the browser differs from what your code receives.
For convenience, I prefer to use the requests module.
import requests
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lvt_7cd4710f721b473263eed1f0840391b4":"1548133175,1548135160,1548135844",
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4":"1548135844",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
# the shop items are then available, in JSON form, in ret.text
The shop items look like this:
item_json =
{
    "#context":"https://schema.org",
    "#type":"ItemList",
    "itemListElement":[
        {
            "offers":{
                "priceCurrency":"SGD",
                "#type":"Offer",
                "price":"72.90",
                "availability":"https://schema.org/InStock"
            },
            "image":"https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
            "#type":"Product",
            "name":"Nintendo Switch New Super Mario Bros U Deluxe", # item name
            "url":"https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
        },
        ...
    ]
}
As the JSON data shows, you can get any item's name, URL, price, and so on.
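
Pulling the item list out of the response text could be sketched as follows. The answer does not say where in ret.text the JSON sits. This sketch assumes it is embedded in a script tag of type application/ld+json, which is a common convention for schema.org data but is not confirmed for this page. The helper name items_from_html and the sample HTML are hypothetical, and the item values are copied from the answer above.

```python
import json

from bs4 import BeautifulSoup


def items_from_html(html):
    """Collect name, price and url from embedded ItemList JSON blocks."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # Assumption: the JSON lives in <script type="application/ld+json">.
    for script in soup.find_all("script", type="application/ld+json"):
        if not script.string:
            continue
        data = json.loads(script.string)
        if not isinstance(data, dict):
            continue
        for entry in data.get("itemListElement", []):
            items.append({
                "name": entry.get("name"),
                "price": entry.get("offers", {}).get("price"),
                "url": entry.get("url"),
            })
    return items


sample_html = """
<html><body>
<script type="application/ld+json">
{"itemListElement": [
  {"name": "Nintendo Switch New Super Mario Bros U Deluxe",
   "offers": {"priceCurrency": "SGD", "price": "72.90"},
   "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"}
]}
</script>
</body></html>
"""

items = items_from_html(sample_html)
print(items[0]["name"], items[0]["price"])
```

In real use the argument would be ret.text from the request shown above.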

Try using a different parser.
I recommend lxml.
So your line where you create the page_soup would be:
page_soup = soup(page_html, 'lxml')

I searched the document you linked for c3e8SH with a regex, but I couldn't find any such class name. Please check your document again.

Can't parse a 2nd page using beautiful soup

I am trying to navigate a website using beautifulsoup. I open the first page and find the links I want to follow, but when I ask beautiful soup to open the next page, none of the HTML is parsed and it just returns this
<function scraper at 0x000001E3684D0E18>
I have tried opening the second page in its own script and it works just fine so the problem has to do with parsing a page from another page.
I have ~2000 links I need to go through so I created a function that goes through them. Here's my script so far
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import lxml
# The first webpage I'm parsing
my_url = 'https://mars.nasa.gov/msl/multimedia/raw/'
#calls the urlopen function from the request module of the urllib module
# AKA opens up the connection and grabs the page
uClient = uReq(my_url)
#imports the entire webpage from html format into python.
# If webpage has lots of data this can take a long time and take up a lot of space or crash
page_html = uClient.read()
#closes the client
uClient.close()
#parses the HTML using bs4
page_soup = soup(page_html, "lxml")
#finds the categories for the types of images on the site, category 1 is RHAZ
containers = page_soup.findAll("div", {"class": "image_list"})
RHAZ = containers[1]
# prints the links in RHAZ
links = []
for link in RHAZ.find_all('a'):
    #removes unwanted characters from the link making it usable.
    formatted_link = my_url+str(link).replace('\n','').split('>')[0].replace('%5F\"','_').replace('amp;','').replace('<a href=\"./','')
    links.append(formatted_link)
print (links[1])
# I know i should be defining a function here.. so ill give it a go.
def scraper():
    pic_page = uReq('links[1]') #calls the first link in the list
    page_open = uClient.read() #reads the page in a python accessible format
    uClient.close() #closes the page after it's been stored to memory
    soup_open = soup(page_open, "lxml")
    print (soup_open)
print (scraper)
Do I need to clear the previously loaded HTML in beautifulsoup so I can open the next page? If so, how would I do this? Thanks for any help

You need to request the URLs scraped from the first page. Check this code:
from bs4 import BeautifulSoup
import requests
url = 'https://mars.nasa.gov/msl/multimedia/raw'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')
img_list = soup.find_all('div', attrs={'class': 'image_list'})
for i in img_list:
    image = i.find_all('a')
    for x in image:
        href = x['href'].replace('.', '')
        link = (str(url)+str(href))
        req2 = requests.get(link)
        soup2 = BeautifulSoup(req2.content, 'lxml')
        img_list2 = soup2.find_all('div', attrs={'class': 'RawImageUTC'})
        for l in img_list2:
            image2 = l.find_all('a')
            for y in image2:
                href2 = y['href']
                print(href2)
Output:
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FRB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134317EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134106EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134065EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134052EDR_S0722464FHAZ00222M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590133948EDR_S0722464FHAZ00222M_.JPG
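
The same two-level pattern (fetch the index, follow each link, collect links from the second page) can be wrapped in functions so it is easy to try without hitting the site. This is a sketch under assumptions: the function names, the fake_site dictionary, and the example.org URLs are hypothetical, while the image_list and RawImageUTC class names come from the answer above. The fetch argument is any callable that returns HTML for a URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_in(html, base_url, css_selector):
    """Return absolute URLs for every link matched by the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select(css_selector) if a.get("href")]


def crawl_two_levels(start_url, fetch):
    """Follow each link on the index page and collect image links from it."""
    image_urls = []
    for page_url in links_in(fetch(start_url), start_url, "div.image_list a"):
        image_urls.extend(links_in(fetch(page_url), page_url, "div.RawImageUTC a"))
    return image_urls


# Hypothetical two-page site used in place of real HTTP requests.
fake_site = {
    "https://example.org/raw/":
        '<div class="image_list"><a href="./sol1.html">Sol 1</a></div>',
    "https://example.org/raw/sol1.html":
        '<div class="RawImageUTC"><a href="https://example.org/img/a.jpg">a</a></div>',
}

found = crawl_two_levels("https://example.org/raw/", fake_site.__getitem__)
print(found)
```

Against the live site, fetch could be something like lambda u: requests.get(u).text. Calling the function with parentheses, as in crawl_two_levels(...), is what runs it. Printing the bare function name only shows the function object, which is the output the question reports.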

Trying to grab specific HTML for web scrape

I am trying to scrape data from the following url: https://www.pro-football-reference.com/boxscores/201809060phi.htm
Specifically, I want info from the "Passing, Rushing, & Receiving" table. I have the following code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# assigning url
my_url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
# opening up connection, grabbing the page
raw_page = uReq(my_url)
page_html = raw_page.read()
raw_page.close()
# html parsing
page_soup = soup(page_html,"html.parser")
# assign variable to stat table
stat_table = page_soup.find ("div",{"id":"all_player_offense"})
inner_table = stat_table.findAll("tr")
print(len(inner_table))
It should be printing the number of player rows in that table. The output I get from this is 0 instead of what I expected, 17.

You're getting the parent div to the table instead of the table itself. Double check the HTML markup of the page and you'll find the id of the table.
Also note that the table uses a tbody instead of immediately listing the rows, so you'll have to account for that as well.
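
The suggestion above, selecting the table by its own id and then going through its tbody, might look like this sketch. The id player_offense and the sample markup are assumptions made for illustration. Check the real page for the actual id.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a wrapper div containing the table of interest.
sample = """
<div id="all_player_offense">
  <table id="player_offense">
    <thead><tr><th>Player</th><th>Yds</th></tr></thead>
    <tbody>
      <tr><td>Player A</td><td>117</td></tr>
      <tr><td>Player B</td><td>64</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", id="player_offense")
# Restricting the search to <tbody> skips the header row in <thead>.
player_rows = table.find("tbody").find_all("tr") if table else []
print(len(player_rows))  # 2
```

If find returns None on the real page, the table is not part of the HTML that was downloaded, and counting its rows will always give 0.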

Python Beautiful Soup - Span class text not extracted

I'm using Beautiful Soup for the first time, and the text inside the span is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens and would like to understand it.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-item="rate" data-section="PHL" data-subsection="VR"></span>
However I'd expect something like this when I inspect via Chrome, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>

The data you are trying to extract is not in the HTML that your code downloads. It is loaded by JavaScript after the page loads: the website calls a JSON API to fill in the information, so Beautiful Soup cannot find it. You can view the data at the following link, which hits the site's JSON API and returns JSON:
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the JSON to get the data. For HTTP requests I would also recommend the requests package.
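
Once the JSON has been fetched (for example with requests.get(url).json()), the values still have to be located inside it. The structure of the real response is not shown in the answer, so the sketch below walks an arbitrary parsed JSON document and collects every value stored under a given key. The helper name collect_values and the payload are entirely hypothetical.

```python
import json


def collect_values(node, key):
    """Recursively gather every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up payload; the real API response may be shaped differently.
payload = (
    '{"products": [{"code": "VRI", "rate": "5.20"},'
    ' {"code": "SVR", "details": {"rate": "5.30"}}]}'
)

rates = collect_values(json.loads(payload), "rate")
print(rates)  # ['5.20', '5.30']
```

A generic walk like this is useful for exploring an unfamiliar response. Once the structure is known, direct indexing is clearer.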

As others said, the content is generated by JavaScript. You can use Selenium together with ChromeDriver to find the data you want, with something like:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
# find_elements_by_css_selector was removed in recent Selenium 4 releases; use find_elements with a By locator
items = driver.find_elements(By.CSS_SELECTOR, "span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]
