import requests
from bs4 import BeautifulSoup
import json
data = {
    0: {
        0: "title",
        1: "dates",
        2: "city/state",
        3: "country"
    },
    1: {
        0: "event",
        1: "reps",
        2: "prize"
    },
    2: {
        0: "results"
    }
}
url = "https://mms.kcbs.us/members/evr_search.php?org_id=KCBA"
response = requests.get(url).text
soup = BeautifulSoup(response, features='lxml')
all_data = []
for element in soup.find_all('div', class_="row"):
    event = {}
    for i, col in enumerate(element.find_all('div', class_='col-md-4')):
        for j, item in enumerate(col.strings):
            event[data[i][j]] = item
    all_data.append(event)
print(json.dumps(all_data, indent=4))
Here's a link to the website: https://mms.kcbs.us/members/evr_search.php?org_id=KCBA
I'm unsure why nothing gets added to the list and dictionaries.
The data you see is loaded from an external URL via JavaScript. To simulate the Ajax request you can use the following example:
import json
import requests
from bs4 import BeautifulSoup
api_url = "https://mms.kcbs.us/members/evr_search_ol_json.php"
params = {
    "otype": "TEXT",
    "evr_map_type": "2",
    "org_id": "KCBA",
    "evr_begin": "6/16/2022",
    "evr_end": "7/16/2022",
    "evr_address": "",
    "evr_radius": "50",
    "evr_type": "269",
    "evr_openings": "0",
    "evr_region": "",
    "evr_region_type": "1",
    "evr_judge": "0",
    "evr_keyword": "",
    "evr_rep_name": "",
}
soup = BeautifulSoup(
    requests.get(api_url, params=params).content, "html.parser"
)
data = {
    0: {0: "title", 1: "dates", 2: "city/state", 3: "country"},
    1: {0: "event", 1: "reps", 2: "prize"},
    2: {0: "results"},
}
all_data = []
for element in soup.find_all("div", class_="row"):
    event = {}
    for i, col in enumerate(element.find_all("div", class_="col-md-4")):
        for j, item in enumerate(col.strings):
            event[data[i][j]] = item
    all_data.append(event)
print(json.dumps(all_data, indent=4))
Prints:
[
    {
        "title": "Frisco BBQ Challenge",
        "dates": "6/16/2022 - 6/18/2022",
        "city/state": "Frisco, CO 80443",
        "country": "UNITED STATES",
        "event": "STATE CHAMPIONSHIP",
        "reps": "Reps: BUNNY TUTTLE, RICH TUTTLE, MICHAEL WINTER",
        "prize": "Prize Money: $13,050.00",
        "results": "Results Not In"
    },
    {
        "title": "York County BBQ Festival",
        "dates": "6/17/2022 - 6/18/2022",
        "city/state": "Delta, PA 17314",
        "country": "UNITED STATES",
        "event": "STATE CHAMPIONSHIP",
        "reps": "Reps: ANGELA MCKEE, ROBERT MCKEE, LOUISE WEIDNER",
        "prize": "Prize Money: $5,500.00",
        "results": "Results Not In"
    },
...and so on.
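If you don't want to hard-code the search window, here is a small sketch (my addition; it assumes the API accepts any unpadded M/D/YYYY range) that derives evr_begin/evr_end from today's date:
from datetime import date, timedelta

today = date.today()
end = today + timedelta(days=30)  # a 30-day window, matching the original range
# build unpadded M/D/YYYY strings, the same format the original params use
params["evr_begin"] = f"{today.month}/{today.day}/{today.year}"
params["evr_end"] = f"{end.month}/{end.day}/{end.year}"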
As an example I have code like this:
import requests
from bs4 import BeautifulSoup
def get_data(url):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    word = soup.find(class_='mdl-cell mdl-cell--11-col')
    print(word)
get_data('http://savodxon.uz/izoh?sher')
I don't know why, but when I print word, the element comes back empty.
Like this:
<h2 class="mdl-cell mdl-cell--11-col" id="definition_l_title"></h2>
But it should be like this:
<h2 id="definition_l_title" class="mdl-cell mdl-cell--11-col">acha</h2>
You have a common problem with modern pages: this page uses JavaScript to add/update elements, but BeautifulSoup/lxml and requests/urllib can't run JavaScript.
You may need Selenium to control a real web browser, which can run JS. Or use DevTools in Firefox/Chrome (the Network tab) manually to see if JavaScript reads data from some URL, and then try to use that URL with requests. JS usually gets JSON, which can easily be converted to a Python dictionary (without BS). You can also check whether the page has a (free) API for programmers.
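For the Selenium route, here is a minimal sketch (it assumes Firefox plus geckodriver are installed; your original selector is reused):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # a real browser, so the page's JavaScript runs
driver.get('http://savodxon.uz/izoh?sher')
# an explicit wait may be needed here if the JS fills the element slowly
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(class_='mdl-cell mdl-cell--11-col'))
driver.quit()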
Using DevTools I found that it reads data from other URLs (using POST):
http://savodxon.uz/api/search
http://savodxon.uz/api/get_definition
They give results as JSON data, so it doesn't need BeautifulSoup.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest',
}
# ---- suggestions ---
url = 'http://savodxon.uz/api/search'
payload = {
    'keyword': 'sher',
    'names': '[object HTMLInputElement]',
}
response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data)
# ---
print('--- suggestions ---')
for word in data['suggestions']:
    print('-', word)
# --- definitions ---
url = 'http://savodxon.uz/api/get_definition'
payload = {
    'word': 'sher',
}
response = requests.post(url, data=payload, headers=headers)
data = response.json()
#print(data.keys())
print('--- definitions ---')
for item in data['definition']:
    for meaning in item['meanings']:
        print(meaning['text'])
        for example in meaning['examples']:
            print('-', example['text'], f"({example['takenFrom']})")
Result:
--- suggestions ---
- sher
- sherboz
- sherdil
- sherik
- sherikchilik
- sheriklashmoq
- sheriklik
- sherlanmoq
- sherobodlik
- sherolgʻin
- sheroz
- sheroza
- sherqadamlik
- shershikorlik
- sherst
--- definitions ---
Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.
- Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi. (Maqol)
- Oʻzingni er bilsang, oʻzgani sher bil. (Maqol)
- Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi. (Ertaklar)
Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).
- Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi? (I. Rahim, Ixlos)
- — Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol. (A. Qodiriy, Oʻtgan kunlar)
- Yoppa yov-lik otga mining, sherlarim. (Yusuf va Ahmad)
- Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar (Bahrom va Gulandom)
BTW:
You may also run it without headers.
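For example, this bare-bones version (an untested sketch with the same payload as above) should return the same JSON:
import requests

payload = {'keyword': 'sher', 'names': '[object HTMLInputElement]'}
response = requests.post('http://savodxon.uz/api/search', data=payload)
print(response.json())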
Here is an example video (without sound) showing how to use DevTools:
How to use DevTools in Firefox to find JSON data in EpicGames.com - YouTube
The data you see on the page is loaded via JavaScript from an external URL, so beautifulsoup cannot see it. To load the data you can use the requests module:
import requests
api_url = "https://savodxon.uz/api/get_definition"
data = requests.post(api_url, data={"word": "sher"}).json()
print(data)
Prints:
{
    "core": "",
    "definition": [
        {
            "meanings": [
                {
                    "examples": [
                        {
                            "takenFrom": "Maqol",
                            "text": "Ovchining zoʻri sher otadi, Dehqonning zoʻri yer ochadi.",
                        },
                        {
                            "takenFrom": "Maqol",
                            "text": "Oʻzingni er bilsang, oʻzgani sher bil.",
                        },
                        {
                            "takenFrom": "Ertaklar",
                            "text": "Bular [uch ogʻayni botirlar] tushgan toʻqayning narigi tomonida bir sherning makoni bor edi.",
                        },
                    ],
                    "reference": "",
                    "tags": "",
                    "text": "Mushuksimonlar oilasiga mansub, kalta va sargʻish yungli (erkaklari esa qalin yolli) yirik sutemizuvchi yirtqich hayvon; arslon.",
                },
                {
                    "examples": [
                        {
                            "takenFrom": "I. Rahim, Ixlos",
                            "text": "Bu hujjatni butun rayonga tarqatmoqchimiz, sher, obroʻying oshib, choʻqqiga koʻtarilayotganingni bilasanmi?",
                        },
                        {
                            "takenFrom": "A. Qodiriy, Oʻtgan kunlar",
                            "text": "— Balli, sher, xatni qoʻlingizdan kim oldi? — Bir chol.",
                        },
                        {
                            "takenFrom": "Yusuf va Ahmad",
                            "text": "Yoppa yov-lik otga mining, sherlarim.",
                        },
                        {
                            "takenFrom": "Bahrom va Gulandom",
                            "text": "Figʻon qilgan bunda sherlar, Yoʻlbars, qoplon, bunda erlar",
                        },
                    ],
                    "reference": "",
                    "tags": "koʻchma",
                    "text": "Shaxsni sherga nisbatlab ataydi (“azamat“, “botir“ polvon maʼnosida).",
                },
            ],
            "phrases": [
                {
                    "meanings": [
                        {
                            "examples": [
                                {
                                    "takenFrom": "Gazetadan",
                                    "text": "Ichkilikning zoʻridan sher boʻlib ketgan Yazturdi endi koʻcha harakati qoidasini unutib qoʻygan edi.",
                                },
                                {
                                    "takenFrom": "H. Tursunqulov, Hayotim qissasi",
                                    "text": "Balli, azamat, bugun jang vaqtida sher boʻlib ketding.",
                                },
                            ],
                            "reference": "",
                            "tags": "ayn.",
                            "text": "Sherlanmoq.",
                        }
                    ],
                    "tags": "",
                    "text": "Sher boʻlmoq",
                }
            ],
            "tags": "",
        }
    ],
    "isDerivative": False,
    "tailStructure": "",
    "type": "ot",
    "wordExists": True,
}
EDIT: To get words:
import requests
api_url = "https://savodxon.uz/api/search"
d = {"keyword": "sher", "names": "[object HTMLInputElement]"}
data = requests.post(api_url, data=d).json()
print(data)
Prints:
{
    "success": True,
    "matchFound": True,
    "suggestions": [
        "sher",
        "sherboz",
        "sherdil",
        "sherik",
        "sherikchilik",
        "sheriklashmoq",
        "sheriklik",
        "sherlanmoq",
        "sherobodlik",
        "sherolgʻin",
        "sheroz",
        "sheroza",
        "sherqadamlik",
        "shershikorlik",
        "sherst",
    ],
}
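If you want to chain the two endpoints, here is a short sketch (my own loop; it assumes every suggestion has a definition entry with a "type" key, as in the output above):
import requests

api = "https://savodxon.uz/api"
payload = {"keyword": "sher", "names": "[object HTMLInputElement]"}
words = requests.post(f"{api}/search", data=payload).json()["suggestions"]
for w in words:
    # fetch the definition for each suggested word
    d = requests.post(f"{api}/get_definition", data={"word": w}).json()
    print(w, "->", d.get("type"))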
I need to get the link of the first photo from the link "https://www.balticshipping.com/vessel/imo/9127382" using Python.
I am testing with the BeautifulSoup library, but there is no way to get it. From what I can see, the image is not in JPG or PNG format, so my code does not detect it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('https://www.balticshipping.com/vessel/imo/9127382')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile('.png')})
for image in images:
    print(image['src'] + '\n')
Does anyone have any ideas how to do it?
Full loop code ("s" contains data for many ships: IMO, date, shipname, ...):
def create_geojson_features(s):
    features = []
    for _, row in s.iterrows():
        vessel_id = row['IMO']
        data = {
            "templates[]": [
                "modal_validation_errors:0",
                "modal_email_verificate:0",
                "r_vessel_types_multi:0",
                "r_positions_single:0",
                "vessel_profile:0",
            ],
            "request[0][module]": "ships",
            "request[0][action]": "list",
            "request[0][id]": "0",
            "request[0][data][0][name]": "imo",
            "request[0][data][0][value]": vessel_id,
            "request[0][sort]": "",
            "request[0][limit]": "1",
            "request[0][stamp]": "0",
            "request[1][module]": "top_stat",
            "request[1][action]": "list",
            "request[1][id]": "0",
            "request[1][data]": "",
            "request[1][sort]": "",
            "request[1][limit]": "",
            "request[1][stamp]": "0",
            "dictionary[]": ["countrys:0", "vessel_types:0", "positions:0"],
        }
        data = requests.post("https://www.balticshipping.com/", data=data).json()
        image = data["data"]["request"][0]["ships"][0]["data"]["gallery"][0]["file"]
        print(image)
        feature = {
            'type': 'Feature',
            'geometry': {
                'type': 'Point',
                'coordinates': [row['lon'], row['lat']]
            },
            'properties': {
                'time': pd.to_datetime(row['date']).__str__(),
                'popup': "<img src=" + image.__str__() + " width = '250' height='200'/>" + '<br>' + '<br>' + 'Shipname: ' + row['shipname'].__str__() + '<br>' + 'MMSI: ' + row['mmsi'].__str__() + '<br>' + 'Group: ' + row['group'].__str__() + '<br>' 'Speed: ' + row['speed'].__str__() + ' knots',
                'style': {'color': ''},
                'icon': 'circle',
                'iconstyle': {
                    'fillColor': row['fillColor'],
                    'fillOpacity': 0.8,
                    'radius': 5
                }
            }
        }
        features.append(feature)
    return features
The data you see is loaded via Ajax from an external source. You can use this example to get the picture URLs:
import json
import requests
url = "https://www.balticshipping.com/vessel/imo/9127382"
vessel_id = url.split("/")[-1]
data = {
    "templates[]": [
        "modal_validation_errors:0",
        "modal_email_verificate:0",
        "r_vessel_types_multi:0",
        "r_positions_single:0",
        "vessel_profile:0",
    ],
    "request[0][module]": "ships",
    "request[0][action]": "list",
    "request[0][id]": "0",
    "request[0][data][0][name]": "imo",
    "request[0][data][0][value]": vessel_id,
    "request[0][sort]": "",
    "request[0][limit]": "1",
    "request[0][stamp]": "0",
    "request[1][module]": "top_stat",
    "request[1][action]": "list",
    "request[1][id]": "0",
    "request[1][data]": "",
    "request[1][sort]": "",
    "request[1][limit]": "",
    "request[1][stamp]": "0",
    "dictionary[]": ["countrys:0", "vessel_types:0", "positions:0"],
}
data = requests.post("https://www.balticshipping.com/", data=data).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for g in data["data"]["request"][0]["ships"][0]["data"]["gallery"]:
    print(g["file"])
Prints:
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2948097
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2864147
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2830344
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2674783
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2521379
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2083722
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=2083721
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1599301
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1464102
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1464099
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1464093
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1464089
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=1110349
https://photos.marinetraffic.com/ais/showphoto.aspx?photoid=433106
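To save just the first photo, here is a minimal follow-up sketch (the local filename is my own choice; it assumes the showphoto.aspx URL serves the image bytes directly):
first = data["data"]["request"][0]["ships"][0]["data"]["gallery"][0]["file"]
img = requests.get(first)
with open("ship.jpg", "wb") as f:  # arbitrary local filename
    f.write(img.content)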
As a follow-up to this question, how can I locate the XHR request used to retrieve the data from the back-end API on CNBC News, in order to scrape this CNBC search query?
The end goal is to have a doc with: headline, date, full article and url.
I have found this: https://api.sail-personalize.com/v1/personalize/initialize?pageviews=1&isMobile=0&query=coronavirus&qsearchterm=coronavirus
Which tells me I don't have access. Is there a way to access information anyway?
Actually, my previous answer for you was already addressing your question regarding the XHR request.
But here we go again, with the request reconstructed in code:
import requests
params = {
    "queryly_key": "31a35d40a9a64ab3",
    "query": "coronavirus",
    "endindex": "0",
    "batchsize": "100",
    "callback": "",
    "showfaceted": "true",
    "timezoneoffset": "-120",
    "facetedfields": "formats",
    "facetedkey": "formats|",
    "facetedvalue": "!Press Release|",
    "needtoptickers": "1",
    "additionalindexes": "4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"
}
goal = ["cn:title", "_pubDate", "cn:liveURL", "description"]
def main(url):
    with requests.Session() as req:
        for page, item in enumerate(range(0, 1100, 100)):
            print(f"Extracting Page# {page + 1}")
            params["endindex"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                print([loop[x] for x in goal])
main("https://api.queryly.com/cnbc/json.aspx")
Pandas DataFrame version:
import requests
import pandas as pd
params = {
    "queryly_key": "31a35d40a9a64ab3",
    "query": "coronavirus",
    "endindex": "0",
    "batchsize": "100",
    "callback": "",
    "showfaceted": "true",
    "timezoneoffset": "-120",
    "facetedfields": "formats",
    "facetedkey": "formats|",
    "facetedvalue": "!Press Release|",
    "needtoptickers": "1",
    "additionalindexes": "4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"
}
goal = ["cn:title", "_pubDate", "cn:liveURL", "description"]
def main(url):
    with requests.Session() as req:
        allin = []
        for page, item in enumerate(range(0, 1100, 100)):
            print(f"Extracting Page# {page + 1}")
            params["endindex"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                allin.append([loop[x] for x in goal])
        new = pd.DataFrame(
            allin, columns=["Title", "Date", "Url", "Description"])
        new.to_csv("data.csv", index=False)
main("https://api.queryly.com/cnbc/json.aspx")
Output: the results are written to data.csv.
If you do View Page Source on the link below:
https://www.zoopla.co.uk/for-sale/details/53818653?search_identifier=7e57533214fc2402ba53dd6c14b624f8
line 89 of the source has a <script> tag with information under it, up to line 164. I am trying to extract this with Beautiful Soup but am unable to.
I can successfully extract other tags like "h2"/"div" etc. using the below:
From line 1,028 of the page source.
for item_name in soup.findAll('h2', {'class': 'ui-property-summary__address'}):
    ad = item_name.get_text(strip=True)
Can you please advise how I can extract the Script tag from line 89?
Thanks
This example will locate the <script> tag and parse some data from it:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.zoopla.co.uk/for-sale/details/53818653?search_identifier=7e57533214fc2402ba53dd6c14b624f8'
# locate the tag
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
script = soup.select_one('script:contains("ZPG.trackData.taxonomy")')
# parse some data from script
data1 = re.findall(r'ZPG\.trackData\.ecommerce = ({.*?});', script.text, flags=re.S)[0]
data1 = json.loads(re.sub(r'([^"\s]+):\s', r'"\1": ', data1))
data2 = re.findall(r'ZPG\.trackData\.taxonomy = ({.*?});', script.text, flags=re.S)[0]
data2 = json.loads(re.sub(r'([^"\s]+):\s', r'"\1": ', data2))
# print the data
print(json.dumps(data1, indent=4))
print(json.dumps(data2, indent=4))
Prints:
{
    "detail": {
        "products": [
            {
                "brand": "Walton and Allen Estate Agents Ltd",
                "category": "for-sale/resi/agent/pre-owned/gb",
                "id": 53818653,
                "name": "FS_Contact",
                "price": 1,
                "quantity": 1,
                "variant": "standard"
            }
        ]
    }
}
{
    "signed_in_status": "signed out",
    "acorn": 44,
    "acorn_type": 44,
    "area_name": "Aspley, Nottingham",
    "beds_max": 3,
    "beds_min": 3,
    "branch_id": "43168",
    "branch_logo_url": "https://st.zoocdn.com/zoopla_static_agent_logo_(586192).png",
    "branch_name": "Walton & Allen Estate Agents",
    "brand_name": "Walton and Allen Estate Agents Ltd",
    "chain_free": false,
    "company_id": "21619",
    "country_code": "gb",
    "county_area_name": "Nottingham",
    "currency_code": "GBP",
    "display_address": "Melbourne Road, Aspley, Nottingham NG8",
    "furnished_state": "",
    "group_id": "",
    "has_epc": false,
    "has_floorplan": true,
    "incode": "5HN",
    "is_retirement_home": false,
    "is_shared_ownership": false,
    "listing_condition": "pre-owned",
    "listing_id": 53818653,
    "listing_status": "for_sale",
    "listings_category": "residential",
    "location": "Aspley",
    "member_type": "agent",
    "num_baths": 1,
    "num_beds": 3,
    "num_images": 15,
    "num_recepts": 1,
    "outcode": "NG8",
    "post_town_name": "Nottingham",
    "postal_area": "NG",
    "price": 150000,
    "price_actual": 150000,
    "price_max": 150000,
    "price_min": 150000,
    "price_qualifier": "guide_price",
    "property_highlight": "",
    "property_type": "semi_detached",
    "region_name": "East Midlands",
    "section": "for-sale",
    "size_sq_feet": "",
    "tenure": "",
    "zindex": "129806"
}
Find all the <script> tags, then search them for the one that contains ZPG.trackData.ecommerce.
ecommerce = None
for item in soup.findAll('script'):
    # item.string is None for external scripts (those with a src attribute), so guard against it
    if item.string and 'ZPG.trackData.ecommerce' in item.string:
        ecommerce = item.string
        break
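Once ecommerce holds the script text, you can pull the object out the same way as in the code above (a sketch reusing the same regex trick; it assumes the keys are unquoted, as on the live page):
import re
import json

raw = re.findall(r'ZPG\.trackData\.ecommerce = ({.*?});', ecommerce, flags=re.S)[0]
parsed = json.loads(re.sub(r'([^"\s]+):\s', r'"\1": ', raw))
print(parsed['detail']['products'][0]['brand'])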
I am trying to scrape the content of this particular website: https://www.cineatlas.com/
I tried scraping the dates section at the top of the page.
I used this basic BeautifulSoup code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text,'html.parser')
type(soup)
time = soup.find('ul',class_='slidee')
This is what I get instead of the list of elements
<ul class="slidee">
    <!-- adding dates -->
</ul>
The site creates HTML elements dynamically from the JavaScript content. You can get the JS content by using re, for example:
import re
import json
import requests
from ast import literal_eval
url = 'https://www.cineatlas.com/'
html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData) # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies
print(json.dumps(movieData, indent=4)) # print data to the screen
Prints:
{
    "2019-08-06": [
        {
            "url": "fast--furious--hobbs--shaw",
            "image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
            "image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
            "title": "FAST & FURIOUS : HOBBS & SHAW",
            "releaseDate": "2019-08-07",
            "endpoint": "ST00000392",
            "duration": "120 mins",
            "rating": "Classification TOUT",
            "director": "",
            "actors": "",
            "times": [
                {
                    "time": "7:00pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
                    "attributes": [
                        {
                            "_id": "5d468c20f67cc430833a5a2b",
                            "shortName": "VF",
                            "description": "Version Fran\u00e7aise"
                        },
                        {
                            "_id": "5d468c20f67cc430833a5a2a",
                            "shortName": "3D",
                            "description": "3D"
                        }
                    ]
                },
                {
                    "time": "9:50pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
... and so on.
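To work with the parsed data instead of only printing it, here is a small sketch (my addition, based only on the keys visible in the output above) that lists each movie with its showtimes:
for day, movies in movieData.items():
    print(day)
    for movie in movies:
        times = ', '.join(t['time'] for t in movie.get('times', []))
        print(' ', movie['title'] + ':', times)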
lis = time.findChildren()
This returns a list of the element's child nodes, but for this page it will be empty, because the dates are only added later by JavaScript.