Get Web data with images for HTML table - python

I am trying to extract the article body, including its images, from this link so that I can build an HTML table from the extracted content. I have tried using BeautifulSoup:
import re
import requests
from bs4 import BeautifulSoup
t_link = 'https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html'
page = requests.get(t_link)
soup_page = BeautifulSoup(page.content, 'html.parser')
# match the article body container, whose class name carries a generated suffix
html_article = soup_page.find_all("div", {"class": re.compile('ArticleBody-articleBody.?')})
for article_body in html_article:
    print(article_body)
Unfortunately, the printed article_body doesn't contain any images, because the <div class="InlineImage-wrapper"> elements aren't scraped this way.
So, how can I get article data with article images, so that I can make a HTML table?

I didn't quite understand your goal, so this may not be the answer you want.
In the HTML source of that page, everything you need is inside a script tag at the bottom.
That script holds the content of the page in JSON format.
If you simply use grep and jq (a great JSON CLI utility), you can run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq '[.content[]|select(.tagName|contains("image"))]'
to get all the information about the images:
[
{
"tagName": "image",
"attributes": {
"id": "106967852",
"type": "image",
"creatorOverwrite": "PM Images",
"headline": "Retirement Savings",
"url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
"datePublished": "2021-10-29T16:30:26+0000",
"copyrightHolder": "PM Images",
"width": "2233",
"height": "1343"
},
"data": {
"__typename": "image"
},
"children": [],
"__typename": "bodyContent"
},
{
"tagName": "image",
"attributes": {
"id": "106323101",
"type": "image",
"creatorOverwrite": "JGI/Jamie Grill",
"headline": "GP: 401k money jar on desk of businesswoman",
"url": "https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437",
"datePublished": "2020-01-06T20:58:19+0000",
"copyrightHolder": "JGI/Jamie Grill",
"width": "5120",
"height": "3418"
},
"data": {
"__typename": "image"
},
"children": [],
"__typename": "bodyContent"
}
]
If you need only the URLs, run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq -r '[.content[]|select(.tagName|contains("image"))]|.[].attributes.url'
to get
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
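The same filter can be written in Python once the JSON has been decoded. This is only a minimal sketch, assuming `content` is the list found under `body.content` in the page JSON, in the shape of the jq output above. The helper name `image_urls` and the trimmed `sample_content` are illustrative and not part of the page or of any library.

```python
def image_urls(content):
    """Return the URLs of all entries whose tagName contains "image"."""
    return [
        item["attributes"]["url"]
        for item in content
        if "image" in item.get("tagName", "")
        and "url" in item.get("attributes", {})
    ]


# Trimmed sample in the shape shown by the jq output (assumed structure).
sample_content = [
    {"tagName": "p", "attributes": {}, "children": []},
    {
        "tagName": "image",
        "attributes": {
            "id": "106967852",
            "url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
        },
        "children": [],
    },
]

for url in image_urls(sample_content):
    print(url)
```

Like the jq `contains` call, the `in` check is a substring match on `tagName`.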

Everything you want is in the source HTML, but you need to jump through a couple of hoops to get that data.
I'm providing the following:
article body
two (2) images that go with the article body and a url to header video (1)
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}
with requests.Session() as s:
    s.headers.update(headers)
    url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"
    # pick the script tag that holds the page data
    script = [
        tag.text for tag in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in tag.text
    ][0]
    # cut the JSON object out of the JavaScript assignment
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])
    for item in article_data["body"]["content"]:
        if "url" in item["attributes"].keys():
            print(item["attributes"]["url"])
This should print:
The entire article body (Redacted for brevity)
The new year offers opportunities for many Americans in their careers and financial lives. The "Great Reshuffle" is expected to continue as employees leave jobs and take new ones at a rapid clip. At the same time, many workers have made a vow to save more this year, yet many admit they don't know how they'll stick to that goal. One piece of advice: Keep it simple.
[...]
The above mentioned urls to assets:
https://www.cnbc.com/video/2022/01/03/how-to-choose-the-best-retirement-strategy-for-2022.html
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
EDIT:
If you want to download the images, use this:
import json
import os
import re
from pathlib import Path
from shutil import copyfileobj
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}
url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"
def download_images(image_source: str, directory: str) -> None:
    """Download an image from a given source and save it to a given directory."""
    os.makedirs(directory, exist_ok=True)
    save_dir = Path(directory)
    # only .jpg / .jpeg URLs are downloaded; anything else (e.g. the video page) is skipped
    if re.match(r".*\.jpe?g", image_source):
        file_name = save_dir / image_source.split("/")[-1].split("?")[0]
        # uses the module-level session `s` created below
        with s.get(image_source, stream=True) as img, open(file_name, "wb") as output:
            copyfileobj(img.raw, output)
with requests.Session() as s:
    s.headers.update(headers)
    script = [
        tag.text for tag in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in tag.text
    ][0]
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])
    for item in article_data["body"]["content"]:
        if "url" in item["attributes"].keys():
            url = item["attributes"]["url"]
            print(url)
            download_images(url, "images")
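Since the question asks for an HTML table, here is a rough sketch of how the extracted text and image URLs could be turned into one. The two-row layout (text on top, images below) and the helper name `build_html_table` are assumptions for illustration; the question does not say what the table should look like.

```python
from html import escape


def build_html_table(body_text, image_urls):
    """Build a two-row HTML table: article text on top, images below."""
    images = "".join(
        f'<img src="{escape(url, quote=True)}" alt="">' for url in image_urls
    )
    return (
        "<table>\n"
        f"  <tr><td>{escape(body_text)}</td></tr>\n"
        f"  <tr><td>{images}</td></tr>\n"
        "</table>"
    )


table = build_html_table(
    "The new year offers opportunities for many Americans...",
    ["https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026"],
)
print(table)
```

In the script above you would pass `article_data["articleBodyText"]` and the collected image URLs instead of the literals. `html.escape` keeps characters such as `<` and `&` in the article text from breaking the markup.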

Related

Incorporating pagination scraping into my script

import requests
from bs4 import BeautifulSoup
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=electronics"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
names = soup.find_all("div", class_="s-item__title")
prices = soup.find_all("span", class_="s-item__price")
shippings = soup.find_all("span", class_="s-item__shipping s-item__logisticsCost")
for name, price, shipping in zip(names, prices, shippings):
    print(name.text, price.text, shipping.text)
Right now, this script works perfectly. It prints everything that needs to be printed.
But... I want to be able to go to the next page and scrape everything off of there as well.
The class for the next page is "pagination__next icon-link"
I'm not sure how I would go about it.
Just iterate over the pages by changing the pagination query value (_pgn) in the URL:
base_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=electronics&_pgn='
for i in range(1, pages_count + 1):  # pages_count: how many pages you want to fetch
    url = f'{base_url}{i}'
    response = requests.get(url)
    # your code...
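One way to build the per-page URLs without string concatenation is `urllib.parse.urlencode`. This is a minimal sketch: `page_urls` is a hypothetical helper, and numbering pages from 1 is an assumption about how `_pgn` behaves.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebay.com/sch/i.html"


def page_urls(query, pages_count):
    """Yield one search URL per page, numbering pages from 1 (assumed)."""
    for page in range(1, pages_count + 1):
        query_string = urlencode({"_nkw": query, "_pgn": page})
        yield f"{BASE_URL}?{query_string}"


for url in page_urls("electronics", 3):
    print(url)
```

Each yielded URL can then be passed to `requests.get` in place of the hand-built string.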
To parse a category correctly, given how the site displays its pages, I advise reading the pagination element on each request, taking the last page number it shows, and substituting that number into the next request.
Take the last available page number on the current page:
ol = soup.find("ol", class_="pagination__items")
lis = ol.find_all("li")
print(f"Last available page number on the current page: {lis[-1].text}")
To collect the information from all pages, you can use a while loop that paginates through them dynamically.
The loop runs until a stop condition is met. In our case the loop ends when there is no next page, which is detected with the CSS selector ".pagination__next".
There is also a URL parameter responsible for pagination, _pgn. Increasing it by 1 selects the next page:
if soup.select_one(".pagination__next"):
    params['_pgn'] += 1
else:
    break
See the full code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
params = {
    "_nkw": "electronics",  # search query
    "_pgn": 1               # page number
}
data = []
while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')
    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)
    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]
        data.append({
            "title" : title,
            "price" : price,
            "link" : link
        })
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Nintendo DSi XL Japan Import Console & USB Charger - Pick Your Color TESTED",
"price": "$69.99",
"link": "https://www.ebay.com/itm/165773301243?hash=item2698dbd5fb:g:HFcAAOSwTdNhnqy~&amdata=enc%3AAQAHAAAA4MXRmWPDY6vBlTlYLy%2BEQPsi1HJM%2BFzt2TWJ%2BjCbK6Q2mreLV7ZpKmZOvU%2FMGqxY2oQZ91aPaHW%2FS%2BRCUW3zUKWDIDoN2ITF3ooZptkWCkd8x%2FIOIaR7t2rSYDHwQEFUD7N6wdnY%2Bh6SpljeSkCPkoKi%2FDCpU0YLOO3mpuLVjgO8GQYKhrlXG59BDDw8IyaayjRVdWyjh534fuIRToSqFrki97dJMVXE0LNE%2BtPmJN96WbYIlqmo4%2B278nkNigJHI8djvwHMmqYUBQhQLN2ScD%2FLnApPlMJXirqegMet0DZQ%7Ctkp%3ABk9SR7K0tsSSYQ"
},
{
"title": "Anbernic RG351P White, Samsung 64 GB SD Card AmberElec & Case",
"price": "$89.99",
"link": "https://www.ebay.com/itm/144690079314?hash=item21b0336652:g:8qwAAOSw93ZjO6n~&amdata=enc%3AAQAHAAAAoNGQWvtymUdp2cEYaKyfTAzWm0oZvBODZsm2oeHl3s%2F6jF9k3nAIpsQkpiZBFI657Cg53X9zAgExAxQAfmev0Bgh7%2FjEtC5FU8O5%2FfoQ3tp8XYtyKdoRy%2FwdebmsGKD%2FIKvW1lWzCNN%2FpSAUDLrPgPN9%2Fs8igeU7jqAT4NFn3FU7W4%2BoFV%2B2gNOj8nhxYlm3HZ6vm21T4P3IAA4KXJZhW2E%3D%7Ctkp%3ABk9SR7K0tsSSYQ"
},
{
"title": "New ListingWhite wii console ONLY Tested Working",
"price": "$24.99",
"link": "https://www.ebay.com/itm/385243730852?hash=item59b250d3a4:g:t3YAAOSwZBBjctqi&amdata=enc%3AAQAHAAAAoH9I%2BSQlJpKebgObGE7Idppe2cewzEiV0SdZ6pEu0sVpIJK5%2F3q15ygTFAdPRElY232LwDKIMXjkIwag1FUN76geBg2vCnPfd3x8BAHzXn%2B1u5zF9cBITLCuawKTYnfUeCYMavO4cBmpnsrvUOSokvnTacfB078MF95%2FH1sUQH%2BfIjDtPzFoFTJrTtKLINRlXZ9edD%2BVW%2FB2TLYZ%2FHMAHkE%3D%7Ctkp%3ABk9SR7K0tsSSYQ"
},
# ...
]
As an alternative, you can use Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code that paginates through all pages:
from serpapi import EbaySearch
from urllib.parse import (parse_qsl, urlsplit)
import os, json
params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": "electronics",            # search query
}
search = EbaySearch(params)           # where data extraction happens
page_num = 0
data = []
while True:
    results = search.get_dict()       # JSON -> Python dict
    if "error" in results:
        print(results["error"])
        break
    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")
        data.append({
            "price" : price,
            "link" : link
        })
    page_num += 1
    print(page_num)
    next_page_query_dict = dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query))
    current_page = results["serpapi_pagination"]["current"]  # 1,2,3...
    # looks for the next page data (_pgn):
    if "next" in results.get("pagination", {}):
        # if current_page = 20 and next_page_query_dict["_pgn"] = 20: break
        if int(current_page) == int(next_page_query_dict["_pgn"]):
            break
        # update next page data
        search.params_dict.update(next_page_query_dict)
    else:
        break
print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "$169.00",
"extracted": 169.0
},
"link": "https://www.ebay.com/itm/113356737439?hash=item1a64968b9f:g:4qoAAOSwQypdKgT6&amdata=enc%3AAQAHAAAA4N8GJRRCbG8WIU7%2BzjrvsRMMmKaTEnA0l7Nz9nOWUUSin3gZ5Ho41Fc4A2%2FFLtlLzbb5UuTtU5s3Qo7Ky%2FWB%2FTEuDKBhFldxMZUzVoixZXII6T1CTtgG5YFJWs0Zj8QldjdM9PwBFuiLNJbsRzG38k7v1rJdg4QGzVUOauPxH0kiANtefqiBhnYHWZ0RfMqwh4S%2BbQ59JYQWSZjAefL61WYyNwkfSdrfcq%2BW2B7b%2BR8QEfynka5CE6g7YPpoWWp4Bk3IOvd4CZxAzTpgvOPoMMKPy0VCW1gPJDG4R2CsfDEv%7Ctkp%3ABk9SR56IpsWSYQ"
},
{
"price": {
"raw": "$239.00",
"extracted": 239.0
},
"link": "https://www.ebay.com/itm/115600879000?hash=item1aea596d98:g:F3YAAOSwsXxjbuYn&amdata=enc%3AAQAHAAAA4LuAhrdA4ahkT85Gf15%2FtEH9GBe%2B0qlDZfEt4p9O0YPmJZVPyq%2Fkuz%2FV86SF3%2B7SYY%2BlK04XQtCyS3NGyNi03GurFWx2dYwoKFUj2G7YsLw%2BalUKmdiv5bC3jJaRTnXuBOJGPXQxw2IwTHcvZ%2Fu8T7tEnYF5ih3HGMg69vCVZdVHqRa%2FYehvk14wVwj3OwBTVrNM8dq7keGeoLKUdYDHCMAH6Y4je4mTR6PX4pWFS6S7lJ8Zrk5YhyHQInwWYXwkclgaWadC4%2BLwOzUjcKepXl5mDnxUXe6pPcccYL3u8g4O%7Ctkp%3ABk9SR56IpsWSYQ"
},
# ...
]

How to scrape data from sciencedirect

I want to scrape all data from ScienceDirect by keyword.
I know that ScienceDirect loads its results via AJAX, so the data can't be extracted directly from the URL of the search result page.
The page I want to scrape
I found the JSON data among the requests in the Network tab, so I assumed I could get it by requesting that URL myself. Instead I get an error message and garbled output. Here is my code.
The request that contains the JSON
import requests as res
import json
from bs4 import BeautifulSoup
keyword="digital game"
url = 'https://www.sciencedirect.com/search/api?'
payload = {
'tak': keyword,
't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
'hostname': 'www.sciencedirect.com'
}
r = res.get(url, params = payload)
print(r.content) # get garbled
r = r.json()
print(r) # get error msg
Garbled output (not the JSON data I expect)
Error message (from .json())
Try setting the HTTP headers in the request such as user-agent to mimic a standard web browser. This will return query search results in JSON format.
import requests
keyword = "digital game"
url = 'https://www.sciencedirect.com/search/api?'
headers = {
'User-Agent': 'Mozilla/5.0',
'Accept': 'application/json'
}
payload = {
'tak': keyword,
't': 'ZNS1ixW4GGlMjTKbRHccgZ2dHuMVHqLqNBwYzIZayNb8FZvZFnVnLBYUCU%2FfHTxZMgwoaQmcp%2Foemth5%2FnqtM%2BGQW3NGOv%2FI0ng6yDADzynQO66j9EPEGT0aClusSwPFvKdDbfVcomCzYflUlyb3MA%3D%3D',
'hostname': 'www.sciencedirect.com'
}
r = requests.get(url, headers=headers, params=payload)
# check whether the response is JSON before parsing it
if "json" in r.headers.get("Content-Type", ""):
    data = r.json()
else:
    print(r.status_code)
    data = r.text
print(data)
Output:
{'searchResults': [{'abstTypes': ['author', 'author-highlights'], 'authors': [{'order': 1, 'name': 'Juliana Tay'},
..., 'resultsCount': 961}}
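The Content-Type check in the code above can be pulled into a small helper so the same guard is reusable for other endpoints. This is only a sketch: `decode_body` is a hypothetical name, and it takes plain values rather than a response object so it can be tried without a network call.

```python
import json


def decode_body(content_type, text):
    """Return parsed JSON if the Content-Type says JSON, else the raw text."""
    if "json" in (content_type or "").lower():
        return json.loads(text)
    return text


# A JSON response is decoded into Python objects.
print(decode_body("application/json; charset=utf-8", '{"resultsCount": 961}'))
# Anything else (for example an HTML block page) is returned unchanged.
print(decode_body("text/html", "<html>blocked</html>"))
```

With requests this would be called as `decode_body(r.headers.get("Content-Type"), r.text)`. A missing header is treated as "not JSON" instead of raising an error.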
I've got the same problem. The point is that sciencedirect.com uses Cloudflare, which blocks access for scraping bots. I tried different approaches such as cloudscraper and cfscrape, without success. I then wrote a small parser based on Selenium that lets me take the metadata from publications and put it into my own JSON file with the following schema:
schema = {
"doi_number": {
"metadata": {
"pub_type": "Review article" | "Research article" | "Short communication" | "Conference abstract" | "Case report",
"open_access": True | False,
"title": "title_name",
"journal": "journal_name",
"date": "publishing_date",
"volume": str,
"issue": str,
"pages": str,
"authors": [
"author1",
"author2",
"author3"
]
}
}
}
If you have any questions or ideas, feel free to contact me.

Not all containers loading using beautiful soup

I am trying to dump a website (the link is given in the code below), but not all containers are loading. In my case, the price container is missing from the dump. See the screenshots for more details. How can I solve this?
In this case, the containers inside class "I6yQz" are not loading.
My code:
url = "https://gomechanic.in/gurgaon/car-battery-replacement/maruti-suzuki-versa/petrol"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I need the following content shown in the screenshot.
Something like this:
data = {'CityName': 'Gurgaon', 'CarName': 'Versa-Petrol', 'serviceName': 'Excide (55 Months Warranty)', 'Price': '4299', 'ServicesOffered': ['Free pickup & drop', 'Free Installation', 'Old Battery Price Included', 'Available at Doorstep']}
I have also found the API that has all the information: https://gomechanic.app/api/v2/oauth/customer/get-services-details-by-category?car_id=249&city_id=1&category_id=-4&user_car_id=null (it is visible under the name 'get-services-details-by-category' in inspect element). The only problem is that I have to pass car_id and city_id instead of the car name and city name, and I don't know which car_id maps to which car name.
As a comment pointed out, this website loads some objects, such as prices, dynamically via JavaScript.
When you connect to the page, you can see a request being made in the background.
What you have to do is figure out how to replicate this request in your Python code:
import requests
headers = {
# this website uses authorization for all requests
'Authorization': 'Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJiNGJjM2NhZjVkMWVhOTlkYzk2YjQzM2NjYzQzMDI0ZTAyM2I0MGM2YjQ5ZjExN2JjMDk5OGY2MWU3ZDI1ZjM2MTU1YWU5ZDIxNjE2ZTc5NSIsInNjb3BlcyI6W10sInN1YiI6IjE2MzM5MzQwNjY5NCIsImV4cCI6MTYzNjUyNjA2Ny4wLCJhdWQiOiIzIiwibmJmIjoxNjMzOTM0MDY3LjAsImlhdCI6MTYzMzkzNDA2Ny4wfQ.QQI_iFpNgONAIp4bfoUbGDtnnYiiViEVsPQEK3ouYLjeyhMkEKyRclazuJ9i-ExQyqukFuqiAn4dw7drGUhRykJY6U67iSnbni0aXzzF9ZTEZrvMmqItHXjrdrxzYCqoKJAf2CYY-4hkO-NXIrTHZEnk-N_jhv30LHuK9A5I1qK8pajt4XIkC7grAn3gaMe3c6rX6Ko-AMZ801TVdACD4qIHb4o73a3vodEMvh4wjIcxRGUBGq4HBgAKxKLCcWaNz-z7XjvYrWhNJNB_iRjZ1YBN97Xk4CWxC0B4sSgA2dVsBWaKGW4ck8wvrHQyFRfFpPHux-6sCMqCC-e4okOhku3AasqPKwvUuJK4oov9tav4YsjfFevKkdsCZ1KmTehtvadoUXAHQcij0UqgMtzNPO-wKYoXwLc8yZGi_mfamAIX0izFOlFiuL26X8XUMP5HkuypUqDa3MLg91f-8oTMWfUjVYYsnjw7lwxKSl7KRKWWhuHwL6iDUjfB23qjEuq2h9JBVkoG71XpA9SrJbunWARYpQ48mc0LlYCXCbGkYIh9pOZba7JGMh7E15YyRla8qhU9pEkgWVYjzgYJaNkhrSNBaIdY56i_qlnTBpC00sqOnHRNVpYMb4gF3PPKalUMMJjbSqzEE2BNTFO5dGxGcz2cKP0smoVi_SK3XcKgPXc',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36',
}
url = 'https://gomechanic.in/api/v1/priceList?city=gurgaon&brand=maruti-suzuki&service=car-battery-replacement'
response = requests.get(url, headers=headers)
print(response.json())
Which will result in:
{
"success": true,
"data": [
{
"id": 1,
"name": "800 Petrol",
"price": 3400,
"savings": "25%"
},
{
"id": 2,
"name": "800 CNG",
"price": 3400,
"savings": "25%"
},
{
"id": 3,
"name": "Alto Petrol",
"price": 3400,
"savings": "25%"
},
{
"id": 4,
"name": "Alto CNG",
"price": 3400,
"savings": "25%"
},
{
"id": 5,
"name": "Alto 800 Petrol",
"price": 3400,
"savings": "25%"
},
{
"id": 6,
"name": "Alto 800 CNG",
"price": 3400,
"savings": "25%"
}
]
}
This whole process is called reverse engineering and for a more in-depth introduction you can see my tutorial blog here: https://scrapecrow.com/reverse-engineering-intro.html
As for the parameters used in these backend API requests, they are most likely in the initial-state JSON object embedded in the initial HTML document. If you view the page source and Ctrl+F for a parameter name like city_id, you can see it's hidden deep in some JSON. You can either extract this whole JSON and parse it, or use a regular expression like re.findall(r'"city_id":(\d+)', html)[0] to get just this one value.
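The regular-expression route can be tried on a small string first. This is a sketch under assumptions: the `html` fragment and the `window.__STATE__` name are made up for illustration (only the `city_id=1` and `car_id=249` values come from the API URL in the question), and `find_int` is a hypothetical helper.

```python
import re

# Hypothetical fragment of an embedded initial-state JSON object.
html = (
    '<script>window.__STATE__={"city":{"city_id":1,"name":"Gurgaon"},'
    '"car":{"car_id":249}}</script>'
)


def find_int(name, source):
    """Return the first integer value stored under `name`, or None."""
    match = re.search(rf'"{re.escape(name)}"\s*:\s*(\d+)', source)
    return int(match.group(1)) if match else None


print(find_int("city_id", html))
print(find_int("car_id", html))
```

Returning `None` when the key is missing avoids the `IndexError` that `re.findall(...)[0]` raises when nothing matches.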

Beautiful Soup returns an empty string when website has text

Considering this website here: https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/
I'm looking to scrape the content under the headings on the right. Here is my sample code which should return the list of contents but is returning empty strings:
import requests as req
from bs4 import BeautifulSoup as bs
r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r, 'html.parser')
par = soup.find('h3', text='Facilities')
for sib in par.next_siblings:
    print(sib)
This returns:
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
The website doesn't show any div element with that class. Also, the list items are not being captured.
Facilities, and other info in that frame, are loaded dynamically by JavaScript, so bs4 doesn't see them in the source HTML because they're simply not there.
However, you can query the endpoint and get all the info you need.
Here's how:
import json
import re
import time
import requests
headers = {
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/90.0.4430.93 Safari/537.36",
"referer": "https://dlnr.hawaii.gov/",
}
endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))
print("\n".join(f for f in data["park info"]["facilities"]))
Output:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
Here's the entire JSON:
{
"park info": {
"name": "Ahupua\u02bba \u02bbO Kahana State Park",
"id": 57853,
"island": "Oahu",
"activities": [
"Beachgoing",
"Camping",
"Dogs on Leash",
"Fishing",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"Boat Ramp",
"Campsites",
"Picnic table",
"Restroom",
"Showers",
"Trash Cans",
"Water Fountain"
],
"prohibited": [
"No Motorized Vehicles/ATV's",
"No Alcoholic Beverages",
"No Open Fires",
"No Smoking",
"No Commercial Activities"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.556086,
"longitude": -157.875579
},
"hiking": [
{
"name": "Nakoa Trail",
"id": 17,
"activities": [
"Dogs on Leash",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"No Drinking Water"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [
"Flash Flood"
],
"photos": [],
"location": {
"latitude": 21.551087,
"longitude": -157.881228
},
"has_google_street": false
},
{
"name": "Kapa\u2018ele\u2018ele Trail",
"id": 18,
"activities": [
"Dogs on Leash",
"Hiking",
"Sightseeing"
],
"facilities": [
"No Drinking Water",
"Restroom",
"Trash Cans"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.554744,
"longitude": -157.876601
},
"has_google_street": false
}
]
}
}
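The endpoint's response is JSON wrapped in a `callback(...);` call (JSONP), which is why the code above cuts it out with a regular expression before decoding. A slightly more general sketch of that unwrapping step follows. `unwrap_jsonp` is a hypothetical helper and the `sample` string is a shortened stand-in for the real response.

```python
import json
import re


def unwrap_jsonp(text):
    """Strip a `name(...)` JSONP wrapper and return the decoded payload."""
    match = re.search(r"^\s*[\w.$]+\((.*)\);?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response is not JSONP-wrapped")
    return json.loads(match.group(1))


# Shortened stand-in for the endpoint's response.
sample = 'callback({"park info": {"id": 57853, "facilities": ["Boat Ramp", "Campsites"]}});'
data = unwrap_jsonp(sample)
print("\n".join(data["park info"]["facilities"]))
```

Raising a `ValueError` for an unwrapped response makes a changed endpoint fail loudly, where the original would raise an `AttributeError` on `None`.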
You've already been given the necessary answer and I thought I would provide insight into another way you could have divined what was going on (other than looking in network traffic).
Let's start with your observation:
the list items are not being captured.
Examining each of the li elements we see that the html is of the form
class="parkicon facilities icon01" - where 01 is a variable number representing the particular icon visible on the page.
A quick search through the associated source files will show you that these numbers, and their corresponding facility reference are listed in
https://dlnr.hawaii.gov/dsp/wp-content/themes/hic_state_template_StateParks/js/icon.js:
var w_fac_icons={"ADA Accessible":"01","Boat Ramp":"02","Campsites":"03","Food Concession":"04","Lodging":"05","No Drinking Water":"06","Picnic Pavilion":"07","Picnic table":"08","Pier Fishing":"09","Restroom":"10","Showers":"11","Trash Cans":"12","Walking Path":"13","Water Fountain":"14","Gift Shop":"15","Scenic Viewpoint":"16"}
If you then search the source html for w_fac_icons you will come across (lines 560-582):
// Icon Facilities
var i_facilities =[];
for(var i=0, l=parkfac.length; i < l ; ++i) {
    var icon_fac = '<li class="parkicon facilities icon' + w_fac_icons[parkfac[i]] + '"><span>' + parkfac[i] + '</span></li>';
    i_facilities.push(icon_fac);
};
if (l > 0){
    jQuery('#i_facilities ul').html(i_facilities.join(''));
} else {
    jQuery('#i_facilities').hide();
}
This shows you how the li element html is constructed through javascript running on the page with parkfac[i] returning the text description in the span, and w_fac_icons[parkfac[i]] returning the numeric value associated with the icon in the class value.
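To make the construction concrete, the same mapping can be replayed in Python. This is a sketch, not the site's code: `W_FAC_ICONS` copies only the entries needed for this park from the `w_fac_icons` object quoted above, and `facility_items` is a hypothetical name.

```python
# Subset of the w_fac_icons mapping from icon.js quoted above.
W_FAC_ICONS = {
    "Boat Ramp": "02",
    "Campsites": "03",
    "Picnic table": "08",
    "Restroom": "10",
    "Showers": "11",
    "Trash Cans": "12",
    "Water Fountain": "14",
}


def facility_items(facilities):
    """Build the <li> markup the page's JavaScript generates for each facility."""
    # Unknown names get no icon number here (an assumption for this sketch;
    # the page's JavaScript would concatenate "undefined" instead).
    return [
        f'<li class="parkicon facilities icon{W_FAC_ICONS.get(name, "")}">'
        f"<span>{name}</span></li>"
        for name in facilities
    ]


for li in facility_items(["Boat Ramp", "Showers"]):
    print(li)
```

This is the markup that never appears in the HTML that requests downloads, because the browser only creates it after the JSON arrives.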
If you track back parkfac you will arrive at line 472
var parkfac = parkinfo.facilities;
If you then track back function parkinfo you will arrive at line 446 onwards, where you will find the ajax request which dynamically grabs the json data used to update the webpage:
function parkinfo() {
var campID = 57853;
jQuery.ajax( {
type:'GET',
url: 'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
data:"parkId=" + campID,
data can be passed within a querystring as params using a GET.
This is therefore the request you are looking for in the network tab.
While the above answers technically answer the question, if you're scraping data from multiple pages it's not feasible to look into the endpoints each time.
The simpler approach, when you know you're handling a JavaScript-driven page, is to load it with scrapy-splash or Selenium. The rendered elements can then be parsed with BeautifulSoup.

Glassdoor API Not Printing Custom Response

I have the following problem when I try to print something from this api. I'm trying to set it up so I can access different headers, then print specific items from it. But instead when I try to print soup it gives me the entire api response in json format.
import requests, json, urlparse, urllib2
from BeautifulSoup import BeautifulSoup
url = "apiofsomesort"
#Create Dict based on JSON response; request the URL and parse the JSON
#response = requests.get(url)
#response.raise_for_status() # raise exception if invalid response
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response)
print soup
When it prints it looks like the below:
{
"success": true,
"status": "OK",
"jsessionid": "0541E6136E5A2D5B2A1DF1F0BFF66D03",
"response": {
"attributionURL": "http://www.glassdoor.com/Reviews/airbnb-reviews-SRCH_KE0,6.htm",
"currentPageNumber": 1,
"totalNumberOfPages": 1,
"totalRecordCount": 1,
"employers": [{
"id": 391850,
"name": "Airbnb",
"website": "www.airbnb.com",
"isEEP": true,
"exactMatch": true,
"industry": "Hotels, Motels, & Resorts",
"numberOfRatings": 416,
"squareLogo": "https://media.glassdoor.com/sqll/391850/airbnb-squarelogo-1459271200583.png",
"overallRating": 4.3,
"ratingDescription": "Very Satisfied",
"cultureAndValuesRating": "4.4",
"seniorLeadershipRating": "4.0",
"compensationAndBenefitsRating": "4.3",
"careerOpportunitiesRating": "4.1",
"workLifeBalanceRating": "3.9",
"recommendToFriendRating": "0.9",
"sectorId": 10025,
"sectorName": "Travel & Tourism",
"industryId": 200140,
"industryName": "Hotels, Motels, & Resorts",
"featuredReview": {
"attributionURL": "http://www.glassdoor.com/Reviews/Employee-Review-Airbnb-RVW12111314.htm",
"id": 12111314,
"currentJob": false,
"reviewDateTime": "2016-09-28 16:44:00.083",
"jobTitle": "Employee",
"location": "",
"headline": "An amazing place to work!",
"pros": "Wonderful people and great culture. Airbnb really strives to make you feel at home as an employee, and everyone is genuinely excited about the company mission.",
"cons": "The limitations of Rails 3 and the company infrastructure make developing difficult sometimes.",
"overall": 5,
"overallNumeric": 5
},
"ceo": {
"name": "Brian Chesky",
"title": "CEO & Co-Founder",
"numberOfRatings": 306,
"pctApprove": 95,
"pctDisapprove": 5,
"image": {
"src": "https://media.glassdoor.com/people/sqll/391850/airbnb-brian-chesky.png",
"height": 200,
"width": 200
}
}
}]
}
}
I want to print specific items, such as the name and industry under "employers".
You can load the JSON response into a dict then look for the values you want like you would in any other dict.
I took your data and saved it in an external JSON file to do a test since I don't have access to the API. This worked for me.
import json
# Load JSON from external file
with open(r'C:\Temp\json\data.json') as json_file:
    data = json.load(json_file)
# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']
Since you're getting your data from an API something like this should work.
import json
import urllib2
url = "apiofsomesort"
# Load JSON from API
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.load(response)  # json.load reads from the file-like response object
# Print the values
print 'Name:', data['response']['employers'][0]['name']
print 'Industry:', data['response']['employers'][0]['industry']
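The snippets above are Python 2. On Python 3 the dictionary lookups are the same, `print` becomes a function, and `urllib2` is replaced by `urllib.request`. A minimal sketch of the lookup part follows, using a trimmed copy of the response from the question in place of a live API call. `employer_fields` is a hypothetical helper.

```python
import json

# Trimmed copy of the response shown in the question.
raw = """
{"success": true,
 "response": {"employers": [{"id": 391850,
                             "name": "Airbnb",
                             "industry": "Hotels, Motels, & Resorts",
                             "numberOfRatings": 416,
                             "overallRating": 4.3}]}}
"""


def employer_fields(payload, *fields):
    """Return one dict per employer containing only the requested fields."""
    employers = payload.get("response", {}).get("employers", [])
    return [{field: employer.get(field) for field in fields} for employer in employers]


data = json.loads(raw)
for row in employer_fields(data, "name", "industry"):
    print("Name:", row["name"])
    print("Industry:", row["industry"])
```

Iterating over `employers` means responses with more than one match are covered, unlike indexing `[0]`.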
import json, urllib2
url = "http..."
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
response = urllib2.urlopen(req)
data = json.loads(response.read())
# Print the values
print 'numberOfRatings:', data['response']['employers'][0]['numberOfRatings']
