scrape data from interactive map - python

I'm trying to get the data from each pop-up on the map. I've used BeautifulSoup in the past, but this is my first time getting data from an interactive map.
Any push in the right direction is helpful. So far I'm returning blanks.
Here's what I have; it isn't substantial...
from bs4 import BeautifulSoup
import requests

url = 'https://www.oaklandconduit.com/development_map'
r = requests.get(url).text
soup = BeautifulSoup(r, "html.parser")
# returns [] -- the marker pane is created by JavaScript after the page loads
address = soup.find_all("div", {"class": "leaflet-pane leaflet-marker-pane"})
Updated
On recommendations, I went with parsing the JavaScript content with re using the script below. But loading the result into JSON returns an error:
import requests, re, json

url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default/mapscript.js'
r = requests.get(url).text  # .text, not .content -- a str pattern can't be matched against bytes
content = re.findall(r'var.*?=\s*(.*?);', r, re.DOTALL | re.MULTILINE)[2]
json_content = json.loads(content)  # raises an error -- the captured text is JavaScript, not valid JSON

The interactive map is loaded and driven by JavaScript; therefore, using the requests library is not going to be sufficient to get the data you want, because it only gets you the initial response (in this case, the HTML source code).
If you view the source for the page (on Chrome: view-source:https://www.oaklandconduit.com/development_map) you'll see that there is an empty div like so:
<div id='map'></div>
This is the placeholder div for the map.
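You can verify this offline without a live request; parsing a hand-made copy of the placeholder markup shows why searches for Leaflet marker panes come back empty:

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for the static HTML the server actually returns
html = "<html><body><div id='map'></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The map div exists, but it is empty...
print(soup.find("div", {"id": "map"}))  # <div id="map"></div>
# ...so searching for the JS-generated marker pane finds nothing
print(soup.find_all("div", {"class": "leaflet-marker-pane"}))  # []
```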
You'll want to use a method that allows the map to load and for you to programmatically interact with it. Selenium can do this for you but will be significantly slower than requests because it has to allow for this interactivity by launching a programmatically driven browser.

Continued with regex to parse the map contents into JSON. Here's my approach, with comments if helpful to others.
import re, requests, json

url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default' \
      '/mapscript.js'
r = requests.get(url).text  # .text so the str regex patterns below can match

# use regex to get the geoJSON and replace single quotes with double
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0].replace("'", '"')

# add quotes to key: "type" and remove trailing tab from value: "description"
content = re.sub(r"(type):", r'"type":', content).replace('\t', '')

# remove the trailing ";" left over from the JavaScript statement
content = content[:-5]
json_content = json.loads(content)
Also open to other Pythonic approaches.
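For anyone trying the same cleanup steps, they can be exercised offline; this sketch uses a made-up one-line stand-in for mapscript.js rather than the real file:

```python
import re, json

# Made-up stand-in for the relevant part of mapscript.js
js = "var geoJson = {type: 'FeatureCollection', 'features': []};// Add custom popups"

# capture the object literal, swap single quotes for double
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', js, re.DOTALL)[0].replace("'", '"')
content = re.sub(r"(type):", r'"type":', content)  # quote the bare key
content = content.rstrip(';')                      # drop the trailing JS semicolon
data = json.loads(content)
print(data["type"])  # FeatureCollection
```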

Related

scraping inside script tag with beautifulsoup

I'm scraping data from an e-commerce site and I need the model number of each laptop. But in the div tags, there are no model numbers. I found the model number inside a script tag as "productCode". For this example it's:
"productCode":"MGND3TU/A"
How can I gather the "productCode" data? I couldn't understand from other posts.
Edit: I found the 'productCode' inside the script tag, but I don't know how to get it. You can check from the page source.
Since the JSON is embedded in the <head>, it can be parsed, but with some custom logic.
Unfortunately the script tag assigns the JSON to a window variable, so we'll need to remove that prefix before we can parse it:
1. Get the URL
2. Get all <script> tags
3. Check if PRODUCT_DETAIL_APP_INITIAL_STAT exists in the string (valid JSON)
4. Remove the prefix (hardcoded)
5. Find the index of the next key (hardcoded)
6. Remove everything after the suffix
7. Try to parse it as JSON
8. Print json['product']['productCode'] if it exists
import json
import requests
from bs4 import BeautifulSoup

reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')

for sc in soup.findAll('script'):
    if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
        # strip the hardcoded "window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=" prefix
        withoutBegin = sc.contents[0][44:]
        # cut everything from the next assignment onwards
        endIndex = withoutBegin.find('window.TYPageName=') - 1
        withoutEnd = withoutBegin[:endIndex]
        try:
            j = json.loads(withoutEnd)
            if j['product']['productCode']:
                print(j['product']['productCode'])
        except Exception:
            print("Unable to parse JSON")
            continue
Output:
MGND3TU/A
In this case BeautifulSoup is not needed, because the response can be searched directly with a regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
print(json_data['product']['productCode'])
Output
MGND3TU/A
That's because those tags are generated using JavaScript. When you send a request to that URL, you get back a response containing the information a JS script uses to build the DOM for you (technically, JSON information).
To see what your returned response actually is, either print the value of r.text (where r is returned from requests.get()) or use "view page source" in the browser (not the inspect-element section).
Now to solve it, you can either use something that can render JS, just like your browser, for example Selenium. The requests module is not capable of rendering JS; it is just for sending and receiving requests.
Or manually extract that JSON text from the returned text (using regex or similar), then create a Python dictionary from it.
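The manual-extraction route might look like this; the script text here is a made-up stand-in for the real response:

```python
import re, json

# Made-up stand-in for the script content in the real response
text = 'window.__PRODUCT_DETAIL_APP_INITIAL_STATE__={"product":{"productCode":"MGND3TU/A"}};window.TYPageName="product";'

# capture the object literal between the assignment and the trailing semicolon
m = re.search(r'__PRODUCT_DETAIL_APP_INITIAL_STATE__=(\{.*?\});', text)
data = json.loads(m.group(1))
print(data['product']['productCode'])  # MGND3TU/A
```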

Accessing text data in web-hosted GIS Map (ESRI) via python

I would like to interact with a web-hosted GIS map application here to scrape data contained therein. The data is behind a toggle button.
Normally, creating a soup of the website's text via BeautifulSoup and requests.get() suffices to make the text data parse-able; however, this method returns some sort of ESRI script, and none of the desired HTML or text data.
Snapshot of the website with desired element inspected:
Snapshot of the button toggled, showing the text data I'd like to scrape:
The code's first mis(steps):
import requests
from bs4 import BeautifulSoup
site = 'https://dwrapps.utah.gov/fishing/fStart'
soup = BeautifulSoup(requests.get(site).text.lower(), 'html.parser')
The return of said soup is too lengthy to post here, but there is no way to access the html data behind the toggle shown above.
I assume use of selenium would do the trick, but was curious if there was an easier method of interacting directly with the application.
The site gets its JSON from https://dwrapps.utah.gov/fishing/GetRegionReports (see the JS function getForecastData), so you can call that endpoint directly with requests:
from json import dump
from typing import List

import requests

url = "https://dwrapps.utah.gov/fishing/GetRegionReports"
json_data: List[dict] = requests.get(url).json()

# export the full JSON to the file gis-output.json
with open("gis-output.json", "w") as io:
    dump(json_data, io, ensure_ascii=False, indent=4)

for dt in json_data:
    reportData = dt.get("reportData", '')    # the full text
    displayName = dt.get("displayName", '')
    # do something with the data.
    """
    you can also access these fields:
    regionAdm = dt.get("regionAdm", '')
    updateDate = dt.get("updateDate", '')
    dwrRating = dt.get("dwrRating", '')
    ageDays = dt.get("ageDays", '')
    publicCount = dt.get("publicCount", '')
    finalRating = dt.get("finalRating", '')
    lat = dt.get("lat", '')
    lng = dt.get("lng", '')
    """

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the below code, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
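To see the selector in action without hitting the site, here is a self-contained sketch with a hand-made stand-in for the relevant fragment of the page:

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for the redshift fragment of the page
html = '<div class="field-redshift"><div class="value"><b>0.06</b></div></div>'
soup = BeautifulSoup(html, 'html.parser')

# same CSS selector as above: class, child combinator, then the <b> element
print(soup.select_one('div.field-redshift > div.value>b').text)  # 0.06
```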
If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')

scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        # strip the CDATA markers and the jQuery.extend(...) wrapper
        scriptcontent = (script.text
                         .replace('<!--//--><![CDATA[//>', '')
                         .replace('<!--', '')
                         .replace('//--><!]]>', '')
                         .replace('jQuery.extend(Drupal.settings,', '')
                         .replace(');', ''))
        break

jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

Why I can't scrape Indiegogo although I am using the right xpath?

Here the code I am using to scrape an Indiegogo project but I get nothing:
import requests
from lxml import html

url = 'https://www.indiegogo.com/projects/red-dot-watch'
page = requests.get(url=url)
tree = html.fromstring(page.content)
pledged = tree.xpath('//*[@id="main"]/div/div[2]/div/div/div[16]/div[1]/span[1]/span/span[2]/span[1]/text()')
if len(pledged) > 0:
    print(pledged[0])
else:
    print("MISSING")
As @Ron said, Indiegogo renders its contents mostly via JavaScript, and simply requesting the page with requests does not do that.
Happily, though, the structure of the Indiegogo pages may make it even easier for you to scrape things; there's a gon.campaign={...} JavaScript statement that seems to contain the data you're looking for. You should be able to use a regexp in the vein of gon.campaign=(\{.+\});gon to extract the data, then parse it as JSON.
EDIT: Here's an example - should work until Indiegogo decides to change their layout.
import re
import requests
import json

url = 'https://www.indiegogo.com/projects/red-dot-watch'
resp = requests.get(url)
resp.raise_for_status()

m = re.search(r'gon\.campaign=(\{.+?\});gon', resp.text)
if m:
    data = json.loads(m.group(1))
else:
    data = {}

print(data.get('balance'), '/', data.get('target_goal'))
Because your script does not parse JS, you are not seeing the same webpage that gets generated in your browser.

Searching Large String for file path. Return filepath + filename

I've got a little project where I'm trying to download a series of wallpapers from a web page. I'm new to Python.
I'm using the urllib library, which is returning a long string of web page data which includes
<a href="http://website.com/wallpaper/filename.jpg">
I know that every filename I need to download has
'http://website.com/wallpaper/'
How can I search the page source for this portion of text, and return the rest of the image link, ending with the ".jpg" extension?
r'http://website.com/wallpaper/ xxxxxx .jpg'
I'm thinking I could format a regular expression with the xxxx portion not being evaluated? Just check for the path and the .jpg extension, then return the whole string once a match is found.
Am I on the right track?
BeautifulSoup is pretty convenient for this sort of thing.
import re
import urllib3
from bs4 import BeautifulSoup

jpg_regex = re.compile(r'\.jpg$')
site_regex = re.compile(r'website\.com/wallpaper/')

pool = urllib3.PoolManager()
request = pool.request('GET', 'http://your_website.com/')
soup = BeautifulSoup(request.data, 'html.parser')  # parse the response body, not the response object

jpg_list = soup.find_all(name='a', attrs={'href': jpg_regex})
site_list = soup.find_all(name='a', attrs={'href': site_regex})

# keep the hrefs of anchors that matched both patterns
result_list = [a.get('href') for a in jpg_list if a in site_list]
I think a very basic regex will do.
Like:
(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)
and if you use $1, this will return the whole string.
And if you use
(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)
then $1 will give the whole string and $2 will give the file name only.
Note: escaping (\/) is language-dependent, so use what is supported by Python (in Python's re, the forward slash needs no escaping).
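In Python's re, that second pattern might look like this (the page source here is a made-up sample):

```python
import re

# Made-up sample of page source
page = '<a href="http://website.com/wallpaper/sunset_01.jpg">x</a>'

# group 1 is the whole URL, group 2 is the bare file name
pattern = r'(http://website\.com/wallpaper/([\w\d_-]*?)\.jpg)'
m = re.search(pattern, page)
print(m.group(1))  # http://website.com/wallpaper/sunset_01.jpg
print(m.group(2))  # sunset_01
```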
Don't use a regular expression against HTML.
Instead, use a HTML parsing library.
BeautifulSoup is a library for parsing HTML, and urllib2 is a built-in module (in Python 2) for fetching URLs.
import urllib2  # Python 3: from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

content = urllib2.urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content, 'html.parser')

links = []  # an empty list
for link in html.find_all('a'):
    href = link.get('href')
    if href and '/wallpaper/' in href:  # guard against anchors with no href
        links.append(href)
Search for the "http://website.com/wallpaper/" substring in the URL and then check for ".jpg" in the URL, as shown below:
domain = "http://website.com/wallpaper/"
url = "your URL"
format = ".jpg"
if domain in url and format in url:
    # do something
