BeautifulSoup4 returns HTML without data in Python

As an example I have code like this:
import requests
from bs4 import BeautifulSoup

def get_data(url):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    word = soup.find(class_='mdl-cell mdl-cell--11-col')
    print(word)

get_data('http://savodxon.uz/izoh?sher')
I don't know why, but when I print word there is nothing in it, like this:
<h2 class="mdl-cell mdl-cell--11-col" id="definition_l_title"></h2>
But it should look like this:
<h2 id="definition_l_title" class="mdl-cell mdl-cell--11-col">acha</h2>

You have a common problem with modern pages: this page uses JavaScript to add/update elements, but BeautifulSoup/lxml and requests/urllib can't run JavaScript.
You may need Selenium to control a real web browser, which can run JS. Or use DevTools in Firefox/Chrome (Network tab) manually to see if JavaScript reads data from some URL, and try to use that URL with requests. JS usually gets JSON, which can easily be converted to a Python dictionary (without BS). You can also check if the page has a (free) API for programmers.
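For the Selenium route, a minimal sketch (assuming Selenium and a matching Chrome driver are installed; you may still need an explicit wait until the JS has filled the element):
from bs4 import BeautifulSoup
from selenium import webdriver

# a real browser runs the page's JavaScript before we grab the HTML
driver = webdriver.Chrome()
driver.get('http://savodxon.uz/izoh?sher')
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.find(class_='mdl-cell mdl-cell--11-col'))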
Using DevTools I found that it reads data from other URLs (using POST):
http://savodxon.uz/api/search
http://savodxon.uz/api/get_definition
They give results as JSON data, so it doesn't need BeautifulSoup:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest',
}

# --- suggestions ---

url = 'http://savodxon.uz/api/search'

payload = {
    'keyword': 'sher',
    'names': '[object HTMLInputElement]',
}

response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data)

print('--- suggestions ---')

for word in data['suggestions']:
    print('-', word)

# --- definitions ---

url = 'http://savodxon.uz/api/get_definition'

payload = {
    'word': 'sher',
}

response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data.keys())

print('--- definitions ---')

for item in data['definition']:
    for meaning in item['meanings']:
        print(meaning['text'])
        for example in meaning['examples']:
            print('-', example['text'], f"({example['takenFrom']})")
Result:
--- suggestions ---
- sher
- sherboz
- sherdil
- sherik
- sherikchilik
- sheriklashmoq
- sheriklik
- sherlanmoq
- sherobodlik
- sherolgʻin
- sheroz
- sheroza
- sherqadamlik
- shershikorlik
- sherst
--- definitions ---
Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.
- Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi. (Maqol)
- Oʻzingni er bilsang, oʻzgani sher bil. (Maqol)
- Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi. (Ertaklar)
Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).
- Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi? (I. Rahim, Ixlos)
- — Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol. (A. Qodiriy, Oʻtgan kunlar)
- Yoppa yov-lik otga mining, sherlarim. (Yusuf va Ahmad)
- Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar (Bahrom va Gulandom)
BTW: you may also run it without the headers.
Here is an example video (without sound) showing how to use DevTools:
How to use DevTools in Firefox to find JSON data in EpicGames.com - YouTube

The data you see on the page is loaded via JavaScript from an external URL, so BeautifulSoup cannot see it. To load the data you can use the requests module:
import requests
api_url = "https://savodxon.uz/api/get_definition"
data = requests.post(api_url, data={"word": "sher"}).json()
print(data)
Prints:
{
    "core": "",
    "definition": [
        {
            "meanings": [
                {
                    "examples": [
                        {
                            "takenFrom": "Maqol",
                            "text": "Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi.",
                        },
                        {
                            "takenFrom": "Maqol",
                            "text": "Oʻzingni er bilsang, oʻzgani sher bil.",
                        },
                        {
                            "takenFrom": "Ertaklar",
                            "text": "Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi.",
                        },
                    ],
                    "reference": "",
                    "tags": "",
                    "text": "Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.",
                },
                {
                    "examples": [
                        {
                            "takenFrom": "I. Rahim, Ixlos",
                            "text": "Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi?",
                        },
                        {
                            "takenFrom": "A. Qodiriy, Oʻtgan kunlar",
                            "text": "— Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol.",
                        },
                        {
                            "takenFrom": "Yusuf va Ahmad",
                            "text": "Yoppa yov-lik otga mining, sherlarim.",
                        },
                        {
                            "takenFrom": "Bahrom va Gulandom",
                            "text": "Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar",
                        },
                    ],
                    "reference": "",
                    "tags": "koʻchma",
                    "text": "Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).",
                },
            ],
            "phrases": [
                {
                    "meanings": [
                        {
                            "examples": [
                                {
                                    "takenFrom": "Gazetadan",
                                    "text": "Ichkilikning zoʻridan sher boʻlib ketgan Yazturdi endi koʻcha harakati qoidasini unutib qoʻygan edi.",
                                },
                                {
                                    "takenFrom": "H. Tursunqulov, Hayotim qissasi",
                                    "text": "Balli, azamat, bugun jang vaqtida sher boʻlib ketding.",
                                },
                            ],
                            "reference": "",
                            "tags": "ayn.",
                            "text": "Sherlanmoq.",
                        }
                    ],
                    "tags": "",
                    "text": "Sher boʻlmoq",
                }
            ],
            "tags": "",
        }
    ],
    "isDerivative": False,
    "tailStructure": "",
    "type": "ot",
    "wordExists": True,
}
EDIT: To get words:
import requests
api_url = "https://savodxon.uz/api/search"
d = {"keyword": "sher", "names": "[object HTMLInputElement]"}
data = requests.post(api_url, data=d).json()
print(data)
Prints:
{
    "success": True,
    "matchFound": True,
    "suggestions": [
        "sher",
        "sherboz",
        "sherdil",
        "sherik",
        "sherikchilik",
        "sheriklashmoq",
        "sheriklik",
        "sherlanmoq",
        "sherobodlik",
        "sherolgʻin",
        "sheroz",
        "sheroza",
        "sherqadamlik",
        "shershikorlik",
        "sherst",
    ],
}
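You could also chain the two endpoints and fetch a definition for every suggestion. A sketch reusing the payloads above (one extra request per word, so keep the rate modest):
import requests

api_search = "https://savodxon.uz/api/search"
api_definition = "https://savodxon.uz/api/get_definition"

words = requests.post(api_search, data={"keyword": "sher", "names": "[object HTMLInputElement]"}).json()["suggestions"]

# look up each suggested word; "type" is one of the keys shown in the response above
for word in words:
    definition = requests.post(api_definition, data={"word": word}).json()
    print(word, "->", definition.get("type"))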

Related

Not all containers loading using Beautiful Soup

I am trying to dump a website (the link is given below in the code) and not all containers are loading. In my case, the price container is not dumping. See the screenshots for more details. How can I solve this?
In this case, the container inside the class "I6yQz" is not loading.
My code:
import requests
from bs4 import BeautifulSoup

url = "https://gomechanic.in/gurgaon/car-battery-replacement/maruti-suzuki-versa/petrol"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I need the content shown in the screenshot, something like this:
data = {'CityName': 'Gurgaon', 'CarName': 'Versa-Petrol', 'serviceName': 'Excide (55 Months Warranty)', 'Price': '4299', 'ServicesOffered': ['Free pickup & drop', 'Free Installation', 'Old Battery Price Included', 'Available at Doorstep']}
I have also found the API which has all the information: https://gomechanic.app/api/v2/oauth/customer/get-services-details-by-category?car_id=249&city_id=1&category_id=-4&user_car_id=null (it is visible under the name 'get-services-details-by-category' in inspect element). The only problem is that I have to pass carId and cityId instead of carName and cityName, and I don't know which carId maps to which carName.
As a comment pointed out, this website dynamically loads some objects, like prices, via JavaScript.
When you connect to the page, you can see a request being made in the background.
What you have to do is figure out how to replicate this request in your Python code:
import requests

headers = {
    # this website uses authorization for all requests
    'Authorization': 'Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJiNGJjM2NhZjVkMWVhOTlkYzk2YjQzM2NjYzQzMDI0ZTAyM2I0MGM2YjQ5ZjExN2JjMDk5OGY2MWU3ZDI1ZjM2MTU1YWU5ZDIxNjE2ZTc5NSIsInNjb3BlcyI6W10sInN1YiI6IjE2MzM5MzQwNjY5NCIsImV4cCI6MTYzNjUyNjA2Ny4wLCJhdWQiOiIzIiwibmJmIjoxNjMzOTM0MDY3LjAsImlhdCI6MTYzMzkzNDA2Ny4wfQ.QQI_iFpNgONAIp4bfoUbGDtnnYiiViEVsPQEK3ouYLjeyhMkEKyRclazuJ9i-ExQyqukFuqiAn4dw7drGUhRykJY6U67iSnbni0aXzzF9ZTEZrvMmqItHXjrdrxzYCqoKJAf2CYY-4hkO-NXIrTHZEnk-N_jhv30LHuK9A5I1qK8pajt4XIkC7grAn3gaMe3c6rX6Ko-AMZ801TVdACD4qIHb4o73a3vodEMvh4wjIcxRGUBGq4HBgAKxKLCcWaNz-z7XjvYrWhNJNB_iRjZ1YBN97Xk4CWxC0B4sSgA2dVsBWaKGW4ck8wvrHQyFRfFpPHux-6sCMqCC-e4okOhku3AasqPKwvUuJK4oov9tav4YsjfFevKkdsCZ1KmTehtvadoUXAHQcij0UqgMtzNPO-wKYoXwLc8yZGi_mfamAIX0izFOlFiuL26X8XUMP5HkuypUqDa3MLg91f-8oTMWfUjVYYsnjw7lwxKSl7KRKWWhuHwL6iDUjfB23qjEuq2h9JBVkoG71XpA9SrJbunWARYpQ48mc0LlYCXCbGkYIh9pOZba7JGMh7E15YyRla8qhU9pEkgWVYjzgYJaNkhrSNBaIdY56i_qlnTBpC00sqOnHRNVpYMb4gF3PPKalUMMJjbSqzEE2BNTFO5dGxGcz2cKP0smoVi_SK3XcKgPXc',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36',
}

url = 'https://gomechanic.in/api/v1/priceList?city=gurgaon&brand=maruti-suzuki&service=car-battery-replacement'
response = requests.get(url, headers=headers)
print(response.json())
Which will result in:
{
    "success": true,
    "data": [
        {
            "id": 1,
            "name": "800 Petrol",
            "price": 3400,
            "savings": "25%"
        },
        {
            "id": 2,
            "name": "800 CNG",
            "price": 3400,
            "savings": "25%"
        },
        {
            "id": 3,
            "name": "Alto Petrol",
            "price": 3400,
            "savings": "25%"
        },
        {
            "id": 4,
            "name": "Alto CNG",
            "price": 3400,
            "savings": "25%"
        },
        {
            "id": 5,
            "name": "Alto 800 Petrol",
            "price": 3400,
            "savings": "25%"
        },
        {
            "id": 6,
            "name": "Alto 800 CNG",
            "price": 3400,
            "savings": "25%"
        }
    ]
}
This whole process is called reverse engineering and for a more in-depth introduction you can see my tutorial blog here: https://scrapecrow.com/reverse-engineering-intro.html
As for the parameters used in these backend API requests, they are most likely in the initial HTML document's state JSON object. If you view the page source and Ctrl+F a parameter name like city_id, you can see it's hidden deep in some JSON. You can either extract this whole JSON and parse it, or use a regular expression like re.findall(r'"city_id":(\d+)', html)[0] to get just this one value.
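A sketch of that approach (the "city_id" pattern is from above; the "car_id" key is an assumption, so check the page source for the real names):
import re
import requests

url = "https://gomechanic.in/gurgaon/car-battery-replacement/maruti-suzuki-versa/petrol"
html = requests.get(url).text

# pull the numeric ids out of the state JSON embedded in the HTML
city_id = re.findall(r'"city_id":(\d+)', html)[0]
car_id = re.findall(r'"car_id":(\d+)', html)[0]  # assumed key name
print(city_id, car_id)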

Is there a requests function that allows you to extract only a portion of a JSON response from an API?

I have this giant response and I just need the IDs in "SingleItemOffers" at the end of the response (I had to cut down a lot of the JSON response due to Stack Overflow):
{
    "FeaturedBundle": {
        "Bundle": {
            "ID": "2b18d53c-6173-460e-bb72-63bbb114b182",
            "DataAssetID": "441117e1-40be-42e2-3aeb-49957e5c03fd",
            "CurrencyID": "85ad13f7-3d1b-5128-9eb2-7cd8ee0b5741",
            "Items": [
                {
                    "Item": {
                        "ItemTypeID": "e7c63390-eda7-46e0-bb7a-a6abdacd2433",
                        "ItemID": "291cb44a-410d-b035-4d0b-608a92c2cd91",
                        "Amount": 1
                    },
                    "BasePrice": 1775,
                    "CurrencyID": "85ad13f7-3d1b-5128-9eb2-7cd8ee0b5741",
                    "DiscountPercent": 0.33,
                    "DiscountedPrice": 1189,
                    "IsPromoItem": false
                }
            ]
        },
        "BundleRemainingDurationInSeconds": 804392
    },
    "SkinsPanelLayout": {
        "SingleItemOffers": [
            "5a0cd3b5-4249-bf6f-d009-17a81532660e",
            "7e44fc1b-44fa-cdda-8491-f8a5bca1cfa3",
            "daa73753-4b56-9d21-d73e-f3b3f4c9b1a6",
            "f7425a39-43ca-e1fe-5b2b-56a51ed479c5"
        ],
        "SingleItemOffersRemainingDurationInSeconds": 37592
    }
}
This is my code at the moment, and when I print the response it prints the entire thing:
import requests
import json

url = "https://pd.na.a.pvp.net/store/v2/storefront/XXX"

payload = {}
headers = {
    'X-Riot-Entitlements-JWT': 'XXX',
    'Authorization': 'XXX'
}

response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Maybe you can try this:
response.json()["FeaturedBundle"]["Bundle"]["ID"]
The parsed response is a nested dictionary, so you can index straight into the keys you need, or use a for loop over the sub-keys to pull out several values. I hope it helps you, greetings.
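For the IDs the question actually asks about, index into "SkinsPanelLayout" instead. A sketch based on the response structure shown above:
data = response.json()

# the offer IDs sit in a plain list under "SkinsPanelLayout"
for offer_id in data["SkinsPanelLayout"]["SingleItemOffers"]:
    print(offer_id)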

No alternative languages in Microsoft Translator v. 3.0 Detect JSON Response

According to the Microsoft Translator 3.0 documentation the JSON Response body for the Detect endpoint should contain the following property:
alternatives: An array of other possible languages. Each element of the array is another object with the same properties listed above: language, score, isTranslationSupported and isTransliterationSupported.
Here is an example of a Request body from the Translator Quickstart web page:
[
    { "Text": "Ich würde wirklich gern Ihr Auto um den Block fahren ein paar Mal." }
]
And here is an expected Response body:
[
    {
        "alternatives": [
            {
                "isTranslationSupported": true,
                "isTransliterationSupported": false,
                "language": "nl",
                "score": 0.92
            },
            {
                "isTranslationSupported": true,
                "isTransliterationSupported": false,
                "language": "sk",
                "score": 0.77
            }
        ],
        "isTranslationSupported": true,
        "isTransliterationSupported": false,
        "language": "de",
        "score": 1.0
    }
]
However, when I use the same Request body with my language detection endpoint, I only get one language with a score of 1.0:
import requests, uuid, json

# Add your subscription key and endpoint
subscription_key = "XXXXXXXXXXXXXXXXXX"
endpoint = "https://api.cognitive.microsofttranslator.com"

# Add your location, also known as region. The default is global.
# This is required if using a Cognitive Services resource.
location = "global"

path = '/detect'
constructed_url = endpoint + path

params = {
    'api-version': '3.0'
}

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# You can pass more than one object in body.
body = [{
    'text': 'Ich würde wirklich gern Ihr Auto um den Block fahren ein paar Mal.'
}]

request = requests.post(constructed_url, params=params, headers=headers, json=body)
response = request.json()
print(json.dumps(response, sort_keys=True, ensure_ascii=False, indent=4, separators=(',', ': ')))
[
    {
        "isTranslationSupported": true,
        "isTransliterationSupported": false,
        "language": "de",
        "score": 1.0
    }
]
Does anyone have an idea what I am missing here?
After testing, I tried Node.js, the official REST API, and C#, and got the same result. Debugging shows that alternatives is always null.
So I am sure the official documentation is not up to date.
The response you got is correct. You can submit feedback on this page.

Beautiful Soup returns an empty string when website has text

Considering this website here: https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/
I'm looking to scrape the content under the headings on the right. Here is my sample code which should return the list of contents but is returning empty strings:
import requests as req
from bs4 import BeautifulSoup as bs

r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r, 'html.parser')
par = soup.find('h3', text='Facilities')

for sib in par.next_siblings:
    print(sib)
This returns:
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
The website doesn't show any div element with that class. Also, the list items are not being captured.
Facilities, and other info in that frame, are loaded dynamically by JavaScript, so bs4 doesn't see them in the source HTML because they're simply not there.
However, you can query the endpoint and get all the info you need.
Here's how:
import json
import re
import time

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://dlnr.hawaii.gov/",
}

endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"

response = requests.get(endpoint, headers=headers).text

# the endpoint returns JSONP (callback({...});), so strip the wrapper before parsing
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))

print("\n".join(f for f in data["park info"]["facilities"]))
Output:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
Here's the entire JSON:
{
    "park info": {
        "name": "Ahupua\u02bba \u02bbO Kahana State Park",
        "id": 57853,
        "island": "Oahu",
        "activities": [
            "Beachgoing",
            "Camping",
            "Dogs on Leash",
            "Fishing",
            "Hiking",
            "Hunting",
            "Sightseeing"
        ],
        "facilities": [
            "Boat Ramp",
            "Campsites",
            "Picnic table",
            "Restroom",
            "Showers",
            "Trash Cans",
            "Water Fountain"
        ],
        "prohibited": [
            "No Motorized Vehicles/ATV's",
            "No Alcoholic Beverages",
            "No Open Fires",
            "No Smoking",
            "No Commercial Activities"
        ],
        "hazards": [],
        "photos": [],
        "location": {
            "latitude": 21.556086,
            "longitude": -157.875579
        },
        "hiking": [
            {
                "name": "Nakoa Trail",
                "id": 17,
                "activities": [
                    "Dogs on Leash",
                    "Hiking",
                    "Hunting",
                    "Sightseeing"
                ],
                "facilities": [
                    "No Drinking Water"
                ],
                "prohibited": [
                    "No Bicycles",
                    "No Open Fires",
                    "No Littering/Dumping",
                    "No Camping",
                    "No Smoking"
                ],
                "hazards": [
                    "Flash Flood"
                ],
                "photos": [],
                "location": {
                    "latitude": 21.551087,
                    "longitude": -157.881228
                },
                "has_google_street": false
            },
            {
                "name": "Kapa\u2018ele\u2018ele Trail",
                "id": 18,
                "activities": [
                    "Dogs on Leash",
                    "Hiking",
                    "Sightseeing"
                ],
                "facilities": [
                    "No Drinking Water",
                    "Restroom",
                    "Trash Cans"
                ],
                "prohibited": [
                    "No Bicycles",
                    "No Open Fires",
                    "No Littering/Dumping",
                    "No Camping",
                    "No Smoking"
                ],
                "hazards": [],
                "photos": [],
                "location": {
                    "latitude": 21.554744,
                    "longitude": -157.876601
                },
                "has_google_street": false
            }
        ]
    }
}
You've already been given the necessary answer and I thought I would provide insight into another way you could have divined what was going on (other than looking in network traffic).
Let's start with your observation:
the list items are not being captured.
Examining each of the li elements, we see that the HTML is of the form class="parkicon facilities icon01", where 01 is a variable number representing the particular icon visible on the page.
A quick search through the associated source files will show you that these numbers, and their corresponding facility reference are listed in
https://dlnr.hawaii.gov/dsp/wp-content/themes/hic_state_template_StateParks/js/icon.js:
var w_fac_icons={"ADA Accessible":"01","Boat Ramp":"02","Campsites":"03","Food Concession":"04","Lodging":"05","No Drinking Water":"06","Picnic Pavilion":"07","Picnic table":"08","Pier Fishing":"09","Restroom":"10","Showers":"11","Trash Cans":"12","Walking Path":"13","Water Fountain":"14","Gift Shop":"15","Scenic Viewpoint":"16"}
If you then search the source html for w_fac_icons you will come across (lines 560-582):
// Icon Facilities
var i_facilities = [];
for (var i = 0, l = parkfac.length; i < l; ++i) {
    var icon_fac = '<li class="parkicon facilities icon' + w_fac_icons[parkfac[i]] + '"><span>' + parkfac[i] + '</span></li>';
    i_facilities.push(icon_fac);
};
if (l > 0) {
    jQuery('#i_facilities ul').html(i_facilities.join(''));
} else {
    jQuery('#i_facilities').hide();
}
This shows you how the li element HTML is constructed through JavaScript running on the page, with parkfac[i] returning the text description in the span and w_fac_icons[parkfac[i]] returning the numeric value associated with the icon in the class value.
If you track back parkfac you will arrive at line 472:
var parkfac = parkinfo.facilities;
If you then track back function parkinfo you will arrive at line 446 onwards, where you will find the ajax request which dynamically grabs the json data used to update the webpage:
function parkinfo() {
    var campID = 57853;
    jQuery.ajax({
        type: 'GET',
        url: 'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
        data: "parkId=" + campID,
data can be passed within a querystring as params using a GET.
This is therefore the request you are looking for in the network tab.
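In requests you can let the library build that querystring for you. A minimal sketch of the same call (note the response is JSONP-wrapped, so it still needs the callback(...) stripped, as in the answer above):
import requests

response = requests.get(
    'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
    params={'parkId': 57853},  # sent as ?parkId=57853, matching the ajax call
)
print(response.text[:80])  # JSONP wrapper, not bare JSON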
While the above answers technically answer the question, if you're scraping data from multiple pages it's not feasible to look into the endpoints each time.
The simpler approach, when you know you're handling a JavaScript page, is to load it with scrapy-splash or Selenium. Then the JavaScript-rendered elements can be parsed with BeautifulSoup, as in the sketch below.
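For example, a sketch with Selenium (assuming Chrome and its driver are installed; the URL and the parkicon class come from this question):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/')

# wait until the JS has inserted the facility icons before grabbing the HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'li.parkicon'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for span in soup.select('li.parkicon span'):
    print(span.text)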

How can I scrape the content of this specific website (cineatlas)?

I am trying to scrape the content of this particular website: https://www.cineatlas.com/
I tried scraping the date part as shown in the screenshot:
I used this basic BeautifulSoup code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.cineatlas.com/')
soup = BeautifulSoup(response.text, 'html.parser')
type(soup)
time = soup.find('ul', class_='slidee')
This is what I get instead of the list of elements
<ul class="slidee">
<!-- adding dates -->
</ul>
The site creates HTML elements dynamically from the JavaScript content. You can get the JS content by using re, for example:
import re
import json
import requests
from ast import literal_eval
url = 'https://www.cineatlas.com/'
html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData) # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies
print(json.dumps(movieData, indent=4)) # print data to the screen
Prints:
{
    "2019-08-06": [
        {
            "url": "fast--furious--hobbs--shaw",
            "image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
            "image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
            "title": "FAST & FURIOUS : HOBBS & SHAW",
            "releaseDate": "2019-08-07",
            "endpoint": "ST00000392",
            "duration": "120 mins",
            "rating": "Classification TOUT",
            "director": "",
            "actors": "",
            "times": [
                {
                    "time": "7:00pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
                    "attributes": [
                        {
                            "_id": "5d468c20f67cc430833a5a2b",
                            "shortName": "VF",
                            "description": "Version Fran\u00e7aise"
                        },
                        {
                            "_id": "5d468c20f67cc430833a5a2a",
                            "shortName": "3D",
                            "description": "3D"
                        }
                    ]
                },
                {
                    "time": "9:50pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
... and so on.
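Once movieData is a Python dictionary you can walk it like any other. A small sketch (based on the structure above) that lists each date's titles and showtimes:
# iterate over the release dates and the movies scheduled on each
for date, movies in movieData.items():
    for movie in movies:
        showtimes = ', '.join(t['time'] for t in movie['times'])
        print(date, movie['title'], '->', showtimes)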
lis = time.findChildren()
This returns a list of the element's child nodes.
