I am trying to dump a website (the link is in the code below), but not all of its containers are loading. In my case, the price container is missing from the dump. How do I solve this?
Specifically, the containers inside the class "I6yQz" are not loading.
My code:
url = "https://gomechanic.in/gurgaon/car-battery-replacement/maruti-suzuki-versa/petrol"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I need the price content from the page, something like this:
data = {'CityName': 'Gurgaon', 'CarName': 'Versa-Petrol', 'serviceName': 'Excide (55 Months Warranty)', 'Price': '4299', 'ServicesOffered': ['Free pickup & drop', 'Free Installation', 'Old Battery Price Included', 'Available at Doorstep']}
I have also found the API that has all the information: https://gomechanic.app/api/v2/oauth/customer/get-services-details-by-category?car_id=249&city_id=1&category_id=-4&user_car_id=null (it shows up as 'get-services-details-by-category' in the Network tab of the browser dev tools). The only problem is that it takes car_id and city_id instead of carName and cityName, and I don't know which car_id maps to which car name.
As a comment pointed out, this website loads some objects, such as prices, dynamically via JavaScript.
When you connect to the page, you can see a request to a price-list API being made in the background.
What you have to do is figure out how to replicate this request in your Python code:
import requests
headers = {
    # this website uses authorization for all requests; the Bearer token is a
    # short-lived JWT, so copy a fresh one from your own browser's network tab
    'Authorization': 'Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJiNGJjM2NhZjVkMWVhOTlkYzk2YjQzM2NjYzQzMDI0ZTAyM2I0MGM2YjQ5ZjExN2JjMDk5OGY2MWU3ZDI1ZjM2MTU1YWU5ZDIxNjE2ZTc5NSIsInNjb3BlcyI6W10sInN1YiI6IjE2MzM5MzQwNjY5NCIsImV4cCI6MTYzNjUyNjA2Ny4wLCJhdWQiOiIzIiwibmJmIjoxNjMzOTM0MDY3LjAsImlhdCI6MTYzMzkzNDA2Ny4wfQ.QQI_iFpNgONAIp4bfoUbGDtnnYiiViEVsPQEK3ouYLjeyhMkEKyRclazuJ9i-ExQyqukFuqiAn4dw7drGUhRykJY6U67iSnbni0aXzzF9ZTEZrvMmqItHXjrdrxzYCqoKJAf2CYY-4hkO-NXIrTHZEnk-N_jhv30LHuK9A5I1qK8pajt4XIkC7grAn3gaMe3c6rX6Ko-AMZ801TVdACD4qIHb4o73a3vodEMvh4wjIcxRGUBGq4HBgAKxKLCcWaNz-z7XjvYrWhNJNB_iRjZ1YBN97Xk4CWxC0B4sSgA2dVsBWaKGW4ck8wvrHQyFRfFpPHux-6sCMqCC-e4okOhku3AasqPKwvUuJK4oov9tav4YsjfFevKkdsCZ1KmTehtvadoUXAHQcij0UqgMtzNPO-wKYoXwLc8yZGi_mfamAIX0izFOlFiuL26X8XUMP5HkuypUqDa3MLg91f-8oTMWfUjVYYsnjw7lwxKSl7KRKWWhuHwL6iDUjfB23qjEuq2h9JBVkoG71XpA9SrJbunWARYpQ48mc0LlYCXCbGkYIh9pOZba7JGMh7E15YyRla8qhU9pEkgWVYjzgYJaNkhrSNBaIdY56i_qlnTBpC00sqOnHRNVpYMb4gF3PPKalUMMJjbSqzEE2BNTFO5dGxGcz2cKP0smoVi_SK3XcKgPXc',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36',
}
url = 'https://gomechanic.in/api/v1/priceList?city=gurgaon&brand=maruti-suzuki&service=car-battery-replacement'
response = requests.get(url, headers=headers)
print(response.json())
Which will result in:
{
  "success": true,
  "data": [
    {
      "id": 1,
      "name": "800 Petrol",
      "price": 3400,
      "savings": "25%"
    },
    {
      "id": 2,
      "name": "800 CNG",
      "price": 3400,
      "savings": "25%"
    },
    {
      "id": 3,
      "name": "Alto Petrol",
      "price": 3400,
      "savings": "25%"
    },
    {
      "id": 4,
      "name": "Alto CNG",
      "price": 3400,
      "savings": "25%"
    },
    {
      "id": 5,
      "name": "Alto 800 Petrol",
      "price": 3400,
      "savings": "25%"
    },
    {
      "id": 6,
      "name": "Alto 800 CNG",
      "price": 3400,
      "savings": "25%"
    }
  ]
}
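From there it is a short step to the dictionary format asked for in the question. A minimal sketch, assuming the response shape shown above; serviceName and ServicesOffered live in the category endpoint mentioned in the question, so only the fields present in this price list are filled in:

# reshape the priceList response into the format from the question;
# city and brand here just echo the values sent in the query string
city, brand = 'gurgaon', 'maruti-suzuki'
records = [
    {
        'CityName': city.title(),
        'CarName': car['name'],   # e.g. "Alto Petrol"
        'Price': car['price'],
    }
    for car in response.json()['data']
]
print(records[:2])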
This whole process is called reverse engineering and for a more in-depth introduction you can see my tutorial blog here: https://scrapecrow.com/reverse-engineering-intro.html
As for the parameters used in these backend API requests: they are most likely in the initial-state JSON object embedded in the first HTML document. If you view the page source and Ctrl+F a parameter name such as city_id, you will see it hidden deep inside some JSON. You can either extract that whole JSON and parse it, or use a regular expression like re.findall(r'"city_id":(\d+)', html)[0] to grab just that one value.
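A minimal sketch of that regex approach, assuming the page really does embed "city_id" and "car_id" as plain numeric values in its inline JSON (both field names are taken from the API URL in the question):

import re

import requests

url = "https://gomechanic.in/gurgaon/car-battery-replacement/maruti-suzuki-versa/petrol"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# pull the numeric ids out of the embedded initial-state JSON;
# findall() returns an empty list if the field isn't there, so guard for that
city_ids = re.findall(r'"city_id":\s*(\d+)', html)
car_ids = re.findall(r'"car_id":\s*(\d+)', html)
print(city_ids[0] if city_ids else None, car_ids[0] if car_ids else None)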
I am trying to extract the article body, with its images, from the link below, so that I can build an HTML table from it. I have tried using BeautifulSoup.
import re

import requests
from bs4 import BeautifulSoup

t_link = 'https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html'
page = requests.get(t_link)
soup_page = BeautifulSoup(page.content, 'html.parser')
html_article = soup_page.find_all("div", {"class": re.compile('ArticleBody-articleBody.?')})
for article_body in html_article:
    print(article_body)
But unfortunately article_body doesn't contain any images, because <div class="InlineImage-wrapper"> isn't captured this way.
So how can I get the article data together with the article images, so that I can build an HTML table?
I didn't quite understand your goal, so this is probably not the answer you want.
In the HTML source of that page, everything is inside a script tag near the bottom.
It contains the content of the page in JSON format.
If you simply use grep and jq (a great JSON CLI utility), you can run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq '[.content[]|select(.tagName|contains("image"))]'
to get all the info about the images:
[
  {
    "tagName": "image",
    "attributes": {
      "id": "106967852",
      "type": "image",
      "creatorOverwrite": "PM Images",
      "headline": "Retirement Savings",
      "url": "https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026",
      "datePublished": "2021-10-29T16:30:26+0000",
      "copyrightHolder": "PM Images",
      "width": "2233",
      "height": "1343"
    },
    "data": {
      "__typename": "image"
    },
    "children": [],
    "__typename": "bodyContent"
  },
  {
    "tagName": "image",
    "attributes": {
      "id": "106323101",
      "type": "image",
      "creatorOverwrite": "JGI/Jamie Grill",
      "headline": "GP: 401k money jar on desk of businesswoman",
      "url": "https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437",
      "datePublished": "2020-01-06T20:58:19+0000",
      "copyrightHolder": "JGI/Jamie Grill",
      "width": "5120",
      "height": "3418"
    },
    "data": {
      "__typename": "image"
    },
    "children": [],
    "__typename": "bodyContent"
  }
]
If you need only the URLs, run
curl -kL "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html" | \
grep -Po '"body":.+"body".' | \
grep -Po '{"content":\[.+"body".' | \
jq -r '[.content[]|select(.tagName|contains("image"))]|.[].attributes.url'
to get
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
Everything you want is in the source HTML, but you need to jump through a couple of hoops to get that data.
I'm providing the following:
the article body
the two images that go with the article body, plus the URL of the header video
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}

with requests.Session() as s:
    s.headers.update(headers)
    url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"
    script = [
        s.text for s in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in s.text
    ][0]
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])

    for item in article_data["body"]["content"]:
        if "url" in item["attributes"].keys():
            print(item["attributes"]["url"])
This should print:
The entire article body (Redacted for brevity)
The new year offers opportunities for many Americans in their careers and financial lives. The "Great Reshuffle" is expected to continue as employees leave jobs and take new ones at a rapid clip. At the same time, many workers have made a vow to save more this year, yet many admit they don't know how they'll stick to that goal. One piece of advice: Keep it simple.
[...]
The above-mentioned URLs to the assets:
https://www.cnbc.com/video/2022/01/03/how-to-choose-the-best-retirement-strategy-for-2022.html
https://image.cnbcfm.com/api/v1/image/106967852-1635524865061-GettyImages-1072593728.jpg?v=1635525026
https://image.cnbcfm.com/api/v1/image/106323101-1578344280328gettyimages-672157227.jpeg?v=1641216437
EDIT:
If you want to download the images, use this:
import json
import os
import re
from pathlib import Path
from shutil import copyfileobj
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}

url = "https://www.cnbc.com/2022/01/03/5-ways-to-reset-your-retirement-savings-and-save-more-money-in-2022.html"


def download_images(image_source: str, directory: str) -> None:
    """Download images from a given source and save them to a given directory."""
    os.makedirs(directory, exist_ok=True)
    save_dir = Path(directory)
    # only image URLs (.jpg / .jpeg) match this pattern, so the header
    # video URL (which ends in .html) is skipped
    if re.match(r".*\.jp[e-g]", image_source):
        file_name = save_dir / image_source.split("/")[-1].split("?")[0]
        with s.get(image_source, stream=True) as img, open(file_name, "wb") as output:
            copyfileobj(img.raw, output)


with requests.Session() as s:
    s.headers.update(headers)
    script = [
        s.text for s in
        BeautifulSoup(s.get(url).text, "lxml").find_all("script")
        if "window.__s_data" in s.text
    ][0]
    payload = json.loads(
        re.match(r"window\.__s_data=(.*);\swindow\.__c_data=", script).group(1)
    )
    article_data = (
        payload
        ["page"]
        ["page"]
        ["layout"][3]
        ["columns"][0]
        ["modules"][2]
        ["data"]
    )
    print(article_data["articleBodyText"])

    for item in article_data["body"]["content"]:
        if "url" in item["attributes"].keys():
            url = item["attributes"]["url"]
            print(url)
            download_images(url, "images")
As an example I have code like this:
import requests
from bs4 import BeautifulSoup
def get_data(url):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    word = soup.find(class_='mdl-cell mdl-cell--11-col')
    print(word)
get_data('http://savodxon.uz/izoh?sher')
I don't know why, but when I print word, there is nothing in it.
Like this:
<h2 class="mdl-cell mdl-cell--11-col" id="definition_l_title"></h2>
But it should be like this:
<h2 id="definition_l_title" class="mdl-cell mdl-cell--11-col">acha</h2>
You have a common problem with modern pages: this page uses JavaScript to add and update elements, but BeautifulSoup/lxml and requests/urllib can't run JavaScript.
You may need Selenium to control a real web browser, which can run JS. Or use DevTools in Firefox/Chrome manually (Network tab) to see whether JavaScript reads data from some URL, and then try to use that URL with requests. JS usually gets JSON, which can easily be converted to a Python dictionary (without BS). You can also check whether the page has a (free) API for programmers.
Using DevTools I found that it reads data from two other URLs (using POST):
http://savodxon.uz/api/search
http://savodxon.uz/api/get_definition
They return the results as JSON data, so BeautifulSoup isn't needed:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest',
}

# --- suggestions ---

url = 'http://savodxon.uz/api/search'

payload = {
    'keyword': 'sher',
    'names': '[object HTMLInputElement]',
}

response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data)

print('--- suggestions ---')

for word in data['suggestions']:
    print('-', word)

# --- definitions ---

url = 'http://savodxon.uz/api/get_definition'

payload = {
    'word': 'sher',
}

response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data.keys())

print('--- definitions ---')

for item in data['definition']:
    for meaning in item['meanings']:
        print(meaning['text'])
        for example in meaning['examples']:
            print('-', example['text'], f"({example['takenFrom']})")
Result:
--- suggestions ---
- sher
- sherboz
- sherdil
- sherik
- sherikchilik
- sheriklashmoq
- sheriklik
- sherlanmoq
- sherobodlik
- sherolgʻin
- sheroz
- sheroza
- sherqadamlik
- shershikorlik
- sherst
--- definitions ---
Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.
- Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi. (Maqol)
- Oʻzingni er bilsang, oʻzgani sher bil. (Maqol)
- Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi. (Ertaklar)
Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).
- Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi? (I. Rahim, Ixlos)
- — Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol. (A. Qodiriy, Oʻtgan kunlar)
- Yoppa yov-lik otga mining, sherlarim. (Yusuf va Ahmad)
- Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar (Bahrom va Gulandom)
BTW:
You may also run it without the headers.
Here is an example video (without sound) showing how to use DevTools:
How to use DevTools in Firefox to find JSON data in EpicGames.com - YouTube
The data you see on the page is loaded via JavaScript from an external URL, so BeautifulSoup cannot see it. To load the data you can use the requests module:
import requests
api_url = "https://savodxon.uz/api/get_definition"
data = requests.post(api_url, data={"word": "sher"}).json()
print(data)
Prints:
{
    "core": "",
    "definition": [
        {
            "meanings": [
                {
                    "examples": [
                        {
                            "takenFrom": "Maqol",
                            "text": "Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi.",
                        },
                        {
                            "takenFrom": "Maqol",
                            "text": "Oʻzingni er bilsang, oʻzgani sher bil.",
                        },
                        {
                            "takenFrom": "Ertaklar",
                            "text": "Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi.",
                        },
                    ],
                    "reference": "",
                    "tags": "",
                    "text": "Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.",
                },
                {
                    "examples": [
                        {
                            "takenFrom": "I. Rahim, Ixlos",
                            "text": "Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi?",
                        },
                        {
                            "takenFrom": "A. Qodiriy, Oʻtgan kunlar",
                            "text": "— Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol.",
                        },
                        {
                            "takenFrom": "Yusuf va Ahmad",
                            "text": "Yoppa yov-lik otga mining, sherlarim.",
                        },
                        {
                            "takenFrom": "Bahrom va Gulandom",
                            "text": "Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar",
                        },
                    ],
                    "reference": "",
                    "tags": "koʻchma",
                    "text": "Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).",
                },
            ],
            "phrases": [
                {
                    "meanings": [
                        {
                            "examples": [
                                {
                                    "takenFrom": "Gazetadan",
                                    "text": "Ichkilikning zoʻridan sher boʻlib ketgan Yazturdi endi koʻcha harakati qoidasini unutib qoʻygan edi.",
                                },
                                {
                                    "takenFrom": "H. Tursunqulov, Hayotim qissasi",
                                    "text": "Balli, azamat, bugun jang vaqtida sher boʻlib ketding.",
                                },
                            ],
                            "reference": "",
                            "tags": "ayn.",
                            "text": "Sherlanmoq.",
                        }
                    ],
                    "tags": "",
                    "text": "Sher boʻlmoq",
                }
            ],
            "tags": "",
        }
    ],
    "isDerivative": False,
    "tailStructure": "",
    "type": "ot",
    "wordExists": True,
}
EDIT: To get words:
import requests
api_url = "https://savodxon.uz/api/search"
d = {"keyword": "sher", "names": "[object HTMLInputElement]"}
data = requests.post(api_url, data=d).json()
print(data)
Prints:
{
    "success": True,
    "matchFound": True,
    "suggestions": [
        "sher",
        "sherboz",
        "sherdil",
        "sherik",
        "sherikchilik",
        "sheriklashmoq",
        "sheriklik",
        "sherlanmoq",
        "sherobodlik",
        "sherolgʻin",
        "sheroz",
        "sheroza",
        "sherqadamlik",
        "shershikorlik",
        "sherst",
    ],
}
I am trying to web-scrape this webpage, but I always end up getting the "main" page (same URL but without "#face-a-face" at the end). It's the same problem this person encountered, see this forum thread. He got an answer, but I am not able to generalize it and apply it to the website I want to scrape.
import requests
from bs4 import BeautifulSoup
url_main = "https://www.lequipe.fr/Football/match-direct/ligue-1/2020-2021/ol-dijon-live/477168"
url_target = url_main + "#face-a-face"
soup_main = BeautifulSoup(requests.get(url_main, verify=False).content, "html.parser")
soup_target = BeautifulSoup(requests.get(url_target, verify=False).content, "html.parser")
print(soup_main == soup_target)
returns True. I would like to get different content, but that is not the case here.
For example, I would like to extract all the "confrontations depuis 2011" (head-to-head meetings since 2011) from the target webpage. How can I get the final content of this webpage with a GET request (or in another way)? Thanks!
All the data comes from a highly nested JSON file.
You can get that file and extract the information you need.
Here's how:
import json
import requests
endpoint = "https://iphdata.lequipe.fr/iPhoneDatas/EFR/STD/ALL/V2/Football/Prelive/68/477168.json"
team_data = requests.get(endpoint).json()
specifics = team_data["items"][1]["objet"]["matches"][0]["specifics"]
print(json.dumps(specifics, indent=2))
This should get you a dictionary:
{
  "__type": "specifics_sport_collectif",
  "vainqueur": "domicile",
  "score": {
    "__type": "score",
    "exterieur": "1",
    "domicile": "4"
  },
  "exterieur": {
    "__type": "effectif_sport_collectif",
    "equipe": {
      "__type": "equipe",
      "id": "202",
      "url_image": "https://medias.lequipe.fr/logo-football/202/{width}{declinaison}",
      "nom": "Dijon",
      "url_fiche": "https://www.lequipe.fr/Football/FootballFicheClub202.html"
    }
  },
  "domicile": {
    "__type": "effectif_sport_collectif",
    "equipe": {
      "__type": "equipe",
      "id": "22",
      "url_image": "https://medias.lequipe.fr/logo-football/22/{width}{declinaison}",
      "nom": "Lyon",
      "url_fiche": "https://www.lequipe.fr/Football/FootballFicheClub22.html"
    }
  },
  "is_final": false,
  "prolongation": false,
  "vainqueur_final": "domicile",
  "is_qualifier": false
}
And if you, for example, just want the score, add these lines:
just_the_score = specifics["score"]
print(just_the_score)
To get this:
{'__type': 'score', 'exterieur': '1', 'domicile': '4'}
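The "confrontations depuis 2011" block the question actually asks about lives somewhere else in that same highly nested payload, and I don't know its exact path, so here is a hedged, generic helper that walks the JSON and prints the path of every key whose name contains a given substring; searching for something like "face" or "confrontation" is a quick way to locate the right branch:

from typing import Any, Iterator


def find_paths(node: Any, needle: str, path: str = "") -> Iterator[str]:
    """Yield dotted paths to every key whose name contains `needle` (case-insensitive)."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if needle.lower() in str(key).lower():
                yield here
            yield from find_paths(value, needle, here)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_paths(item, needle, f"{path}[{i}]")


# team_data is the dictionary fetched in the answer above;
# "face" is only a guess based on the #face-a-face fragment in the URL
for p in find_paths(team_data, "face"):
    print(p)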
Considering this website here: https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/
I'm looking to scrape the content under the headings on the right. Here is my sample code which should return the list of contents but is returning empty strings:
import requests as req
from bs4 import BeautifulSoup as bs
r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r)
par = soup.find('h3', text= 'Facilities')
for sib in par.next_siblings:
    print(sib)
This returns:
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
The website doesn't show any div element with that class. Also, the list items are not being captured.
Facilities, and other info in that frame, are loaded dynamically by JavaScript, so bs4 doesn't see them in the source HTML because they're simply not there.
However, you can query the endpoint and get all the info you need.
Here's how:
import json
import re
import time
import requests
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://dlnr.hawaii.gov/",
}
endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))
print("\n".join(f for f in data["park info"]["facilities"]))
Output:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
Here's the entire JSON:
{
  "park info": {
    "name": "Ahupua\u02bba \u02bbO Kahana State Park",
    "id": 57853,
    "island": "Oahu",
    "activities": [
      "Beachgoing",
      "Camping",
      "Dogs on Leash",
      "Fishing",
      "Hiking",
      "Hunting",
      "Sightseeing"
    ],
    "facilities": [
      "Boat Ramp",
      "Campsites",
      "Picnic table",
      "Restroom",
      "Showers",
      "Trash Cans",
      "Water Fountain"
    ],
    "prohibited": [
      "No Motorized Vehicles/ATV's",
      "No Alcoholic Beverages",
      "No Open Fires",
      "No Smoking",
      "No Commercial Activities"
    ],
    "hazards": [],
    "photos": [],
    "location": {
      "latitude": 21.556086,
      "longitude": -157.875579
    },
    "hiking": [
      {
        "name": "Nakoa Trail",
        "id": 17,
        "activities": [
          "Dogs on Leash",
          "Hiking",
          "Hunting",
          "Sightseeing"
        ],
        "facilities": [
          "No Drinking Water"
        ],
        "prohibited": [
          "No Bicycles",
          "No Open Fires",
          "No Littering/Dumping",
          "No Camping",
          "No Smoking"
        ],
        "hazards": [
          "Flash Flood"
        ],
        "photos": [],
        "location": {
          "latitude": 21.551087,
          "longitude": -157.881228
        },
        "has_google_street": false
      },
      {
        "name": "Kapa\u2018ele\u2018ele Trail",
        "id": 18,
        "activities": [
          "Dogs on Leash",
          "Hiking",
          "Sightseeing"
        ],
        "facilities": [
          "No Drinking Water",
          "Restroom",
          "Trash Cans"
        ],
        "prohibited": [
          "No Bicycles",
          "No Open Fires",
          "No Littering/Dumping",
          "No Camping",
          "No Smoking"
        ],
        "hazards": [],
        "photos": [],
        "location": {
          "latitude": 21.554744,
          "longitude": -157.876601
        },
        "has_google_street": false
      }
    ]
  }
}
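As a quick follow-up on the same payload, the hiking trails and their coordinates sit under "hiking" in the structure above, so with the data dictionary from the code earlier you can pull them out directly:

# list each trail with its coordinates, using the JSON structure shown above
for trail in data["park info"]["hiking"]:
    loc = trail["location"]
    print(trail["name"], loc["latitude"], loc["longitude"])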
You've already been given the necessary answer and I thought I would provide insight into another way you could have divined what was going on (other than looking in network traffic).
Let's start with your observation:
the list items are not being captured.
Examining each of the li elements we see that the html is of the form
class="parkicon facilities icon01" - where 01 is a variable number representing the particular icon visible on the page.
A quick search through the associated source files will show you that these numbers, and their corresponding facility references, are listed in
https://dlnr.hawaii.gov/dsp/wp-content/themes/hic_state_template_StateParks/js/icon.js:
var w_fac_icons={"ADA Accessible":"01","Boat Ramp":"02","Campsites":"03","Food Concession":"04","Lodging":"05","No Drinking Water":"06","Picnic Pavilion":"07","Picnic table":"08","Pier Fishing":"09","Restroom":"10","Showers":"11","Trash Cans":"12","Walking Path":"13","Water Fountain":"14","Gift Shop":"15","Scenic Viewpoint":"16"}
If you then search the source html for w_fac_icons you will come across (lines 560-582):
// Icon Facilities
var i_facilities =[];
for(var i=0, l=parkfac.length; i < l ; ++i) {
    var icon_fac = '<li class="parkicon facilities icon' + w_fac_icons[parkfac[i]] + '"><span>' + parkfac[i] + '</span></li>';
    i_facilities.push(icon_fac);
};
if (l > 0){
    jQuery('#i_facilities ul').html(i_facilities.join(''));
} else {
    jQuery('#i_facilities').hide();
}
This shows you how the li element HTML is constructed by JavaScript running on the page, with parkfac[i] supplying the text description in the span and w_fac_icons[parkfac[i]] supplying the numeric value used in the class attribute.
If you track back parkfac you will arrive at line 472
var parkfac = parkinfo.facilities;
If you then track back the function parkinfo you will arrive at line 446 onwards, where you will find the ajax request that dynamically grabs the JSON data used to update the webpage:
function parkinfo() {
    var campID = 57853;
    jQuery.ajax( {
        type:'GET',
        url: 'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
        data:"parkId=" + campID,
The data can be passed within a query string as params using a GET request.
This is therefore the request you are looking for in the network tab.
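If you want to replicate that ajax call yourself, here is a minimal sketch with requests, passing parkId through params just as the page's own JavaScript does (the headers are the same assumption as in the first answer, and the response is JSONP wrapped in callback(...), as shown there):

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://dlnr.hawaii.gov/",
}

# same endpoint and parkId as in the page's jQuery.ajax() call above
response = requests.get(
    "https://stateparksadmin.ehawaii.gov/camping/park-site.json",
    params={"parkId": 57853},
    headers=headers,
)
print(response.text[:200])  # JSONP: strip the callback(...) wrapper before json.loads()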
While the above answers technically answer the question, if you're scraping data from multiple pages it's not feasible to dig into the endpoints each time.
The simpler approach, when you know you're dealing with a JavaScript-rendered page, is to simply load it with scrapy-splash or Selenium. The JavaScript-generated elements can then be parsed with BeautifulSoup.
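For completeness, a minimal sketch of the Selenium route for this same park page, assuming Selenium 4.x and a Chrome driver available on the machine; the browser runs the JavaScript and the rendered HTML is then handed to BeautifulSoup (the li.parkicon selector comes from the icon markup described in the other answer):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # render the page without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/")
    # by now the page's JavaScript has built the facilities list in the DOM
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for li in soup.select("li.parkicon.facilities span"):
        print(li.get_text(strip=True))
finally:
    driver.quit()

If the list comes back empty, the ajax call may not have finished yet; adding an explicit WebDriverWait for the li.parkicon elements before grabbing page_source takes care of that.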
I'm trying to hit my geocoding server's REST API:
https://locator.stanford.edu/arcgis/rest/services/geocode/USA_StreetAddress/GeocodeServer (ArcGIS Server 10.6.1)
...using the POST method (which, BTW, could use an example or two, there only seems to be this VERY brief "note" on WHEN to use POST, not HOW: https://developers.arcgis.com/rest/geocode/api-reference/geocoding-geocode-addresses.htm#ESRI_SECTION1_351DE4FD98FE44958C8194EC5A7BEF7D).
I'm trying to use requests.post(), and I think I've managed to get the token accepted, etc..., but I keep getting a 400 error.
Based upon previous experience, this means something about the formatting of the data is bad, but I've cut and pasted this test pair directly from the Esri support site.
# import the requests library
import requests
# Multiple address records
addresses = {
    "records": [
        {
            "attributes": {
                "OBJECTID": 1,
                "Street": "380 New York St.",
                "City": "Redlands",
                "Region": "CA",
                "ZIP": "92373"
            }
        },
        {
            "attributes": {
                "OBJECTID": 2,
                "Street": "1 World Way",
                "City": "Los Angeles",
                "Region": "CA",
                "ZIP": "90045"
            }
        }
    ]
}
# Parameters
# Geocoder endpoint
URL = 'https://locator.stanford.edu/arcgis/rest/services/geocode/USA_StreetAddress/GeocodeServer/geocodeAddresses?'
# token from locator.stanford.edu/arcgis/tokens
mytoken = <GeneratedToken>
# output spatial reference id
outsrid = 4326
# output format
format = 'pjson'
# params data to be sent to api
params ={'outSR':outsrid,'f':format,'token':mytoken}
# Use POST to batch geocode
r = requests.post(url=URL, data=addresses, params=params)
print(r.json())
print(r.text)
Here's what I consistently get:
{'error': {'code': 400, 'message': 'Unable to complete operation.', 'details': []}}
I had to play around with this for longer than I'd like to admit, but the trick (I guess) is to use the correct request header and convert the raw addresses to a JSON string using json.dumps().
import requests
import json
url = 'http://sampleserver6.arcgisonline.com/arcgis/rest/services/Locators/SanDiego/GeocodeServer/geocodeAddresses'
headers = { 'Content-Type': 'application/x-www-form-urlencoded' }
addresses = json.dumps({ 'records': [{ 'attributes': { 'OBJECTID': 1, 'SingleLine': '2920 Zoo Dr' }}] })
r = requests.post(url, headers = headers, data = { 'addresses': addresses, 'f':'json'})
print(r.text)
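Applying the same fix to the original locator.stanford.edu request from the question would look roughly like this; it is only a sketch reusing the addresses dict and mytoken variable defined above, with the key change being json.dumps() on the records and sending them as the addresses form field:

import json

import requests

geocode_url = (
    "https://locator.stanford.edu/arcgis/rest/services/geocode/"
    "USA_StreetAddress/GeocodeServer/geocodeAddresses"
)

payload = {
    "addresses": json.dumps(addresses),  # the records dict from the question, as a JSON string
    "outSR": 4326,
    "f": "json",
    "token": mytoken,  # token generated at locator.stanford.edu/arcgis/tokens
}

r = requests.post(
    geocode_url,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data=payload,
)
print(r.json())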