I would like to interact with a web-hosted GIS map application here to scrape data contained therein. The data is behind a toggle button.
Normally, making a soup of a website's text via BeautifulSoup and requests.get() is enough to get the text data into a parseable state, but here that method returns some sort of Esri script and none of the desired HTML or text data.
Snapshot of the website with desired element inspected:
Snapshot of the button toggled, showing the text data I'd like to scrape:
The code's first mis(steps):
import requests
from bs4 import BeautifulSoup
site = 'https://dwrapps.utah.gov/fishing/fStart'
soup = BeautifulSoup(requests.get(site).text.lower(), 'html.parser')
The soup it returns is too lengthy to post here, but there is no way to access the HTML data behind the toggle shown above.
I assume Selenium would do the trick, but I was curious whether there is an easier way to interact with the application directly.
The site gets its JSON from https://dwrapps.utah.gov/fishing/GetRegionReports (see the JS function getForecastData), so you can fetch it directly with requests:
from json import dump
from typing import List

import requests

url = "https://dwrapps.utah.gov/fishing/GetRegionReports"
reports: List[dict] = requests.get(url).json()  # named "reports" to avoid shadowing the json module

# export the full JSON to the file gis-output.json
with open("gis-output.json", "w") as io:
    dump(reports, io, ensure_ascii=False, indent=4)

for dt in reports:
    reportData = dt.get("reportData", '')  # the full report text
    displayName = dt.get("displayName", '')
    # do something with the data.
    """
    You can also access these fields:
    regionAdm = dt.get("regionAdm", '')
    updateDate = dt.get("updateDate", '')
    dwrRating = dt.get("dwrRating", '')
    ageDays = dt.get("ageDays", '')
    publicCount = dt.get("publicCount", '')
    finalRating = dt.get("finalRating", '')
    lat = dt.get("lat", '')
    lng = dt.get("lng", '')
    """
Related
I created a bs4 web-scraping app with Python. My program returns an empty list for review, although printing the soup itself works normally.
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
usernames = []
titles = []
comments = []
result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')
soup = BeautifulSoup(result.text, 'html.parser')
review = soup.findAll('div', class_="single-review")
print(review)
for i in review:
    header = i.find('div', class_="single-review__header")
    footer = i.find('div', class_="comment-holder")
    username = header.find('a', class_="single-review__username").text
    title = header.find('div', class_="single-review__related-to").text
    comment = footer.find('div', class_="single-review__comment").text
    usernames.append(username)
    titles.append(title)
    comments.append(comment)

data.append(usernames)
data.append(titles)
data.append(comments)
print(data)
It isn't a problem with the class names.
It looks like the reason this doesn't work is that the website requires a login to access that page. If you were to visit https://www.kupujemprodajem.com/review.php?action=list in a private browser tab, it would just take you to a login page.
There are two paths I can think of here:
Reverse engineer how the login process works and use the requests library to make a login request, then reuse (most likely) the session cookie it returns so you can request pages that require sign-in.
(much simpler) Use Selenium instead. Selenium is a library that lets you control a full browser instance, so you could easily input credentials using this method. BeautifulSoup, on the other hand, simply parses HTML, so things like authenticating often take much more work in BeautifulSoup than they do in Selenium. I'd definitely suggest looking into it if you haven't already; a rough sketch follows.
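A minimal sketch of the second path with Selenium. The field and button selectors here are assumptions (I haven't inspected the login form), so adjust them to match the actual page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.kupujemprodajem.com/review.php?action=list')

# hypothetical selectors -- inspect the real login form and adjust
driver.find_element(By.NAME, 'email').send_keys('you@example.com')
driver.find_element(By.NAME, 'password').send_keys('your-password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

# once logged in, hand the rendered page to BeautifulSoup as before
soup = BeautifulSoup(driver.page_source, 'html.parser')
review = soup.find_all('div', class_="single-review")
driver.quit()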
Instead of a Selenium session, I'd like to fetch data from the regular Chrome session, because the data is already present there and recreating the same scenario in Selenium takes too long to be practical.
Is there any way of seeing the HTML of a currently opened tab?
I'd recommend using urllib.request for this:
from urllib.request import urlopen

link = "https://stackoverflow.com/questions/68120200/python-get-html-from-website-opened-in-users-browser"
openedpage = urlopen(link)
content = openedpage.read()     # raw bytes of the response
code = content.decode("utf-8")  # decode the bytes to a string
print(code)
This would, for example, print out the code for this question's page. Hope this is what you wanted to achieve. In case you want to extract actual data rather than code, you can do that with the same library:
from urllib.request import urlopen

link = "https://stackoverflow.com/questions/68120200/python-get-html-from-website-opened-in-users-browser"
openedpage = urlopen(link)
content = openedpage.read()
code = content.decode("utf-8")

# locate the <title> tag and slice out the text between the opening and closing tags
title = code.find("<title>")
title_start = title + len("<title>")
title_end = code.find("</title>")
full_title = code[title_start:title_end]
print(full_title)
Basically, to get any part of the code, you collect the start and end index of the tag and then slice the string between them, as in the example.
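For anything beyond the title, manual index arithmetic gets fragile. A sketch of the same extraction with BeautifulSoup (already used elsewhere in this thread) might look like:

from urllib.request import urlopen
from bs4 import BeautifulSoup

link = "https://stackoverflow.com/questions/68120200/python-get-html-from-website-opened-in-users-browser"
soup = BeautifulSoup(urlopen(link).read(), 'html.parser')

# the parser handles nesting and attributes, so no index arithmetic is needed
print(soup.title.string)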
I am trying to scrape the SEC EDGAR S&P 500 annual reports with Python 3 and am seeing very slow loading times for some links. My current code works well for most of the report links, but returns only half of the page content for others (e.g., the link below).
Is there a way around this? I am happy if the result is a text file without any weird HTML characters that contains all the text an end user would see.
# import libraries
from simplified_scrapy import SimplifiedDoc,req,utils
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/718877/000104746919000788/0001047469-19-000788.txt"
html = req.get(new_html_text)
doc = SimplifiedDoc(html)
textfile = doc.body.text
textfile = doc.body.unescape()  # converting HTML entities (this overwrites the .text above)
utils.saveFile("test.txt", textfile)
I found that your data contains multiple bodies. I'm sorry I didn't notice that before. See if the following code works.
from simplified_scrapy import SimplifiedDoc,req,utils
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/718877/000104746919000788/0001047469-19-000788.txt"
html = req.get(new_html_text,timeout=300) # Add timeout
doc = SimplifiedDoc(html)
texts = []
bodys = doc.selects('body|BODY') # Get all
for body in bodys:
    texts.append(body.unescape())  # converting HTML entities

utils.saveFile("test.txt", "\n".join(texts))
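If you'd rather not depend on simplified_scrapy, a similar multi-body extraction is possible with BeautifulSoup (a sketch; it assumes html.parser keeps the multiple <body> tags separate, which stricter parsers may normalize away):

import requests
from bs4 import BeautifulSoup

new_html_text = "https://www.sec.gov/Archives/edgar/data/718877/000104746919000788/0001047469-19-000788.txt"
# SEC.gov may reject requests without a descriptive User-Agent
html = requests.get(new_html_text, timeout=300,
                    headers={"User-Agent": "your-name you@example.com"}).text

soup = BeautifulSoup(html, 'html.parser')
# the filing concatenates several documents, hence multiple <body> tags
texts = [body.get_text() for body in soup.find_all('body')]

with open("test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))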
Trying to get a table from the SGX website.
The page is saved to local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()  # note: prettify() returns a string; it does not modify soup in place
list_0 = soup.find_all('table')[0]
print(list_0)
What it returned is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
The data is fetched after the page loads via an AJAX request. By inspecting the page you can find the API URL (the URL below), and then you can do something like this:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirements are complex and your crawling is done on a regular basis, I recommend using Scrapy; a minimal sketch follows.
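A small Scrapy sketch against the same API endpoint (a starting point that assumes the endpoint and response shape stay as above, not a full project):

import json
import scrapy

class SgxSpider(scrapy.Spider):
    name = "sgx"
    start_urls = [
        'https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency'
    ]

    def parse(self, response):
        # the endpoint returns JSON, so parse it directly instead of using selectors
        for row in json.loads(response.text)["data"]["prices"]:
            yield row

Run it with scrapy runspider sgx_spider.py -o prices.json to get the same rows as the pandas version, plus Scrapy's scheduling, retries, and throttling for recurring crawls.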
I'm trying to get the data from each pop-up on the map. I've used BeautifulSoup in the past, but this is my first time getting data from an interactive map.
Any push in the right direction is helpful. So far I'm getting back blanks.
Here's what I have; it isn't substantial...
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.oaklandconduit.com/development_map'
r = requests.get(url).text
soup = bs4(r, "html.parser")
address = soup.find_all("div", {"class": "leaflet-pane leaflet-marker-pane"})
Updated
On recommendations, I went with parsing the JavaScript content with re using the script below, but loading it into JSON returns an error:
import requests, re, json

url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default/mapscript.js'
r = requests.get(url).text  # .text so the regex gets a str, not bytes
content = re.findall(r'var.*?=\s*(.*?);', r, re.DOTALL | re.MULTILINE)[2]
json_content = json.loads(content)  # fails here: the match is JavaScript, not valid JSON
The interactive map is loaded and driven by JavaScript; therefore, the requests library alone is not sufficient to get the data you want, because it only fetches the initial response (in this case, the HTML source code).
If you view the source for the page (on Chrome: view-source:https://www.oaklandconduit.com/development_map) you'll see that there is an empty div like so:
<div id='map'></div>
This is the placeholder div for the map.
You'll want a method that allows the map to load and lets you interact with it programmatically. Selenium can do this for you, but it will be significantly slower than requests because it provides this interactivity by launching a programmatically driven browser; a sketch follows.
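A minimal Selenium sketch for this map. The marker and popup selectors are assumptions based on standard Leaflet class names, so verify them against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.oaklandconduit.com/development_map')
time.sleep(5)  # crude wait for the map tiles and markers to load

# standard Leaflet class names -- an assumption, inspect the page to confirm
markers = driver.find_elements(By.CSS_SELECTOR, '.leaflet-marker-pane img')
for marker in markers:
    marker.click()
    popup = driver.find_element(By.CSS_SELECTOR, '.leaflet-popup-content')
    print(popup.text)

driver.quit()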
Continued with regex to parse the map contents into JSON. Here's my approach, with comments, in case it's helpful to others.
import re, requests, json

url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default' \
      '/mapscript.js'
r = requests.get(url).text  # .text so the regex operates on str, not bytes

# use regex to get the geoJSON and replace single quotes with double quotes
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0].replace("'", '"')

# add quotes to the key "type" and remove trailing tabs from the "description" values
content = re.sub(r"(type):", r'"type":', content).replace('\t', '')

# remove the trailing ";" left over from the JavaScript statement
content = content[:-5]

json_content = json.loads(content)
Also open to other Pythonic approaches; one is sketched below.
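One alternative that avoids hand-patching the quoting: JavaScript object literals are close to JSON5, so the third-party json5 package (an extra dependency; the extraction below still assumes the same mapscript.js layout) can often parse the snippet directly:

import re, requests
import json5  # pip install json5

url = ('https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000'
       '/0/attachments/14822603711537993218/default/mapscript.js')
r = requests.get(url).text

# same extraction as above, but no quote fixing needed:
# json5 accepts single quotes and unquoted keys
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0]
geojson = json5.loads(content.strip().rstrip(';'))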