I am trying to scrape data from the following url: https://www.pro-football-reference.com/boxscores/201809060phi.htm
Specifically, I want info from the "Passing, Rushing, & Receiving" table. I have the following code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# assigning url
my_url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
# opening up connection, grabbing the page
raw_page = uReq(my_url)
page_html = raw_page.read()
raw_page.close()
# html parsing
page_soup = soup(page_html,"html.parser")
# assign variable to stat table
stat_table = page_soup.find("div", {"id": "all_player_offense"})
inner_table = stat_table.findAll("tr")
print(len(inner_table))
This should print the number of player rows in that table, but the output I get is 0 instead of the 17 I expected.
You're selecting the parent div of the table instead of the table itself. Double-check the HTML markup of the page and you'll find the id of the table itself.
Also note that the table wraps its rows in a tbody rather than listing them directly, so you'll have to account for that as well.
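For example, a minimal sketch; the table id "player_offense" here is taken from inspecting the page, so double-check it against the markup:
stat_table = page_soup.find("table", {"id": "player_offense"})  # id assumed from the page markup
if stat_table is not None:
    body = stat_table.find("tbody")
    print(len(body.findAll("tr")))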
I am trying to parse some HTML, but the section that I want simply does not show up in the soup. Both the section before it and the section after it are there, but not the one I want. Am I doing something wrong?
URL: https://coronavirus-portugal-esriportugal.hub.arcgis.com/
My code (with the URL):
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
url = 'https://coronavirus-portugal-esriportugal.hub.arcgis.com/'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')  # renamed to avoid rebinding the imported soup
body = page_soup.body
print(body.prettify())
I am looking for the first four numbers (those corresponding to "Casos Confirmados", "Casos Suspeitos", "Recuperados", and "Óbitos", i.e. confirmed cases, suspected cases, recoveries, and deaths).
The data is retrieved dynamically from a backend database. If you inspect the network traffic while the page updates (and know a little SQL), you can work out the query to send to retrieve the Portugal-specific data. Here, OBJECTID 215 corresponds to Portugal.
import requests
r = requests.get('https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/ncov_cases/FeatureServer/1/query?f=json&where=OBJECTID=215&outFields=*')
print(r.json())
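The numbers you want are nested inside the response's features; a minimal sketch, with the caveat that the field names are assumptions, so print the attributes dict to see what the service actually returns:
import requests

r = requests.get('https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/ncov_cases/FeatureServer/1/query?f=json&where=OBJECTID=215&outFields=*')
attrs = r.json()['features'][0]['attributes']
# 'Confirmed', 'Recovered' and 'Deaths' are assumed field names; inspect attrs to verify
print(attrs.get('Confirmed'), attrs.get('Recovered'), attrs.get('Deaths'))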
All data (use a different query):
https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/ncov_cases/FeatureServer/1/query?f=json&where=1=1&outFields=*
You can also dynamically pick up the other identifier used in the query string:
import requests, re

country_id = 215

with requests.Session() as s:
    r = s.get('https://coronavirus-portugal-esriportugal.hub.arcgis.com/')
    # extract the service identifier from the page source
    p = re.compile(r'https://services1.arcgis.com/(.*?)/arcgis')
    code = p.findall(r.text)[0]
    r = s.get(f'https://services1.arcgis.com/{code}/arcgis/rest/services/ncov_cases/FeatureServer/1/query?f=json&where=OBJECTID={country_id}&outFields=*')
    print(r.json())
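Because the identifier is scraped from the page on each run, this keeps working even if the service identifier embedded in the page changes.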
I would like to scrape all running times (not just the first 10 results) from the data tables on https://www.ijsselsteinloop.nl/uitslagen-2019. However, the data shown on the webpage does not appear in the page source. Under every data table there is a hyperlink ("hier"). These link to the full data table pages, but those links are also not in the page source.
Any suggestions or code snippets on how to scrape this data (with Python and BeautifulSoup or Scrapy)?
Use the same endpoint the page uses for that content. You can find this in the network tab of the browser.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.ijsselsteinloop.nl/uitslag/2019/index.html')
soup = bs(r.content, 'lxml')
links = ['https://www.ijsselsteinloop.nl/uitslag/2019/' + item['href'] for item in soup.select('[href^=uitslag]')]
for link in links:
    table = pd.read_html(link)[0]
    print(table)
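If you want one combined result set rather than separate printouts, something along these lines should work, reusing the links list from above:
# optional: stack every result table into a single DataFrame
all_results = pd.concat((pd.read_html(link)[0] for link in links), ignore_index=True)
print(all_results)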
You could use BeautifulSoup. First:
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
Then use find_all('tr') to get every row, loop over those rows, and call find_all('td') on each one to get its cells.
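A minimal sketch of that loop, reusing page_soup from above:
rows = page_soup.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    print([cell.get_text(strip=True) for cell in cells])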
I am trying to practice web scraping on an e-commerce webpage. I have identified the class name of the container (the cell which contains each product) to be 'c3e8SH'. I then used the following code to scrape all containers on that webpage, after which I used len(containers) to check the number of containers found.
However, it returned 0. Can someone point out what I am doing incorrectly? Thank you very much!
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
#grabs each product
containers = page_soup.find_all('div', class_='c3e8SH')
print(len(containers))
(1) Firstly, the cookies param is needed.
You will get the validation page below if you request the link without cookies:
https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1
(2) Secondly, the page you want to scrape is dynamically loaded.
That's why what you see through a web browser is different from what you get through code.
For convenience, I'd prefer to use the requests module.
import requests
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lvt_7cd4710f721b473263eed1f0840391b4":"1548133175,1548135160,1548135844",
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4":"1548135844",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
# ret.text now contains the shop items as JSON-style data
The shop items look like this:
item_json = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "offers": {
                "priceCurrency": "SGD",
                "@type": "Offer",
                "price": "72.90",
                "availability": "https://schema.org/InStock"
            },
            "image": "https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
            "@type": "Product",
            "name": "Nintendo Switch New Super Mario Bros U Deluxe",  # item name
            "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
        },
        ...
    ]
}
As the JSON data shows, you can get each item's name, URL, price, and so on.
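To pull that JSON out programmatically, a sketch like the following should work; the assumption is that the block sits in a script tag of type application/ld+json, so inspect ret.text to confirm:
import json
from bs4 import BeautifulSoup

page = BeautifulSoup(ret.text, 'html.parser')
ld_script = page.find('script', type='application/ld+json')  # assumed location of the JSON-LD block
item_json = json.loads(ld_script.string)
for element in item_json['itemListElement']:
    print(element['name'], element['offers']['price'])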
Try using a different parser.
I recommend lxml.
So your line where you create the page_soup would be:
page_soup = soup(page_html, 'lxml')
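Note that lxml is a third-party parser, so you may need to install it first with pip install lxml.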
I tried to find c3e8SH in your suggested document with a regex, but I couldn't find such a class name. Please check your document again.
How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3 and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libs which make scraping easier, like BeautifulSoup, selenium, requests, lxml etc. I don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all the tables in the webpage, which I could then filter further to get the required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another code. Same problem.
Edit 2:
I tried Selenium, but I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table = driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source and pass that source to BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. You can then parse this BeautifulSoup object to get the different values you want.
The soup object contains the full page source including the elements that were dynamically added. Now, you can parse this to get all the things you mentioned.
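From there, one quick option is to hand the table fragment to pandas; a sketch, assuming the div wraps an ordinary <table> element:
import pandas as pd

# read_html parses every <table> in the fragment; take the first one
df = pd.read_html(str(table))[0]
df.to_excel('smartlink.xlsx')  # writing Excel files requires openpyxl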
I'm using Beautiful Soup for the first time and the text from the span class is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens; it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-
item="rate" data-section="PHL" data-subsection="VR"></span>
However, when I inspect via Chrome I'd expect something like this, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-
subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>
The data you are trying to extract does not exist in the initial HTML; it is loaded via JavaScript after the page loads. The website uses a JSON API to load information onto the page, so Beautiful Soup cannot find the data. The data can be viewed at the following link, which hits the site's JSON API and returns the JSON data:
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the JSON and get the data. Also, for HTTP requests I would recommend the requests package.
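A minimal sketch with requests; the shape of the returned JSON is not documented here, so inspect it before picking out the rate fields:
import requests

r = requests.get('https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL')
data = r.json()
print(data)  # inspect the structure to find the rate fields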
As others have said, the content is JavaScript-generated. You can use Selenium together with ChromeDriver to find the data you want, with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]