(Python) Get HTML from website opened in user's browser

Instead of a Selenium session, I'd like to fetch data from the regular Chrome session, because the data is already present there and recreating the same scenario in Selenium takes too long to be practical.
Is there any way of seeing the HTML of a currently opened tab?

I'd recommend using urllib.request for this:
from urllib.request import urlopen
link = "https://stackoverflow.com/questions/68120200/python-get-html-from-website-opened-in-users-browser"
openedpage = urlopen(link)
content = openedpage.read()
code = content.decode("utf-8")
print(code)
This would, for example, print out the HTML for this question's page. Hope this is what you wanted to achieve. In case you wanted to extract actual data rather than raw code, you can do that with the same library:
from urllib.request import urlopen
link = "https://stackoverflow.com/questions/68120200/python-get-html-from-website-opened-in-users-browser"
openedpage = urlopen(link)
content = openedpage.read()
code = content.decode("utf-8")
title_tag = code.find("<title>")          # index where the opening tag starts
title_start = title_tag + len("<title>")  # index just past the opening tag
title_end = code.find("</title>")         # index where the closing tag starts
full_title = code[title_start:title_end]
print(full_title)
Basically, to get any part of the code, you find the start and end index of the tag's contents and then slice them out, as in the example.
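As a reusable sketch of that idea (the helper name extract_tag is my own, and it only handles the first occurrence of a plain tag without attributes):
def extract_tag(code, tag):
    # index just past the opening tag
    start = code.find("<" + tag + ">")
    if start == -1:
        return None  # opening tag not found
    start += len("<" + tag + ">")
    # index of the closing tag, searched from there
    end = code.find("</" + tag + ">", start)
    if end == -1:
        return None  # closing tag not found
    return code[start:end]
print(extract_tag(code, "title"))  # same result as the snippet above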

Related

Accessing text data in web-hosted GIS Map (ESRI) via python

I would like to interact with a web-hosted GIS map application here to scrape data contained therein. The data is behind a toggle button.
Normally, creating a soup of the website's text via BeautifulSoup and requests.get() suffices to get the text data into parse-able form; however, this method returns some sort of esri script, and none of the desired HTML or text data.
Snapshot of the website with desired element inspected:
Snapshot of the button toggled, showing the text data I'd like to scrape:
The code's first (mis)steps:
import requests
from bs4 import BeautifulSoup
site = 'https://dwrapps.utah.gov/fishing/fStart'
soup = BeautifulSoup(requests.get(site).text.lower(), 'html.parser')
The return of said soup is too lengthy to post here, but there is no way to access the html data behind the toggle shown above.
I assume use of selenium would do the trick, but was curious if there was an easier method of interacting directly with the application.
The site gets its data as JSON from https://dwrapps.utah.gov/fishing/GetRegionReports (see the JS function getForecastData), so you can fetch that endpoint directly with requests:
from json import dump
from typing import List
import requests

url = "https://dwrapps.utah.gov/fishing/GetRegionReports"
reports: List[dict] = requests.get(url).json()

# export the full json to the file gis-output.json
with open("gis-output.json", "w") as io:
    dump(reports, io, ensure_ascii=False, indent=4)

for dt in reports:
    reportData = dt.get("reportData", '')    # the full text
    displayName = dt.get("displayName", '')
    # do something with the data.
    """
    You can also access these fields:
    regionAdm = dt.get("regionAdm", '')
    updateDate = dt.get("updateDate", '')
    dwrRating = dt.get("dwrRating", '')
    ageDays = dt.get("ageDays", '')
    publicCount = dt.get("publicCount", '')
    finalRating = dt.get("finalRating", '')
    lat = dt.get("lat", '')
    lng = dt.get("lng", '')
    """

scrape data from interactive map

I'm trying to get the data from each pop-up on the map. I've used beautifulsoup in the past but this is a first getting data from an interactive map.
Any push in the right direction is helpful. So far I'm returning blanks.
Here's what I have; it isn't substantial...
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.oaklandconduit.com/development_map'
r = requests.get(url).text
soup = bs4(r, "html.parser")
address = soup.find_all("div", {"class": "leaflet-pane leaflet-marker-pane"})
Updated
On recommendations, I went with parsing the JavaScript content with re, using the script below. But loading it into JSON returns an error:
import requests, re, json
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default/mapscript.js'
r = requests.get(url).text
content = re.findall(r'var.*?=\s*(.*?);', r, re.DOTALL | re.MULTILINE)[2]
json_content = json.loads(content)
The interactive map is loaded and driven by JavaScript; therefore, the requests library alone is not sufficient to get the data you want, because it only gets you the initial response (in this case, the HTML source code).
If you view the source for the page (on Chrome: view-source:https://www.oaklandconduit.com/development_map) you'll see that there is an empty div like so:
<div id='map'></div>
This is the placeholder div for the map.
You'll want to use a method that allows the map to load and lets you programmatically interact with it. Selenium can do this for you, but it will be significantly slower than requests, because it provides this interactivity by launching a programmatically driven browser.
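A minimal Selenium sketch of that approach (the CSS class names are assumptions based on how Leaflet usually renders markers and pop-ups; inspect the rendered page to confirm them):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.oaklandconduit.com/development_map')
wait = WebDriverWait(driver, 30)

# wait until the map has populated its marker pane, then click each marker
markers = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '.leaflet-marker-pane > *')))
for marker in markers:
    marker.click()
    popup = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.leaflet-popup-content')))
    print(popup.text)  # the pop-up text for this marker

driver.quit()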
Continued with regex to parse the map contents into JSON. Here's my approach, with comments in case it's helpful to others:
import re, requests, json
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default' \
      '/mapscript.js'
r = requests.get(url).text
# use regex to get the geoJSON and replace single quotes with double
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0].replace("'", '"')
# add quotes to key: "type" and remove trailing tab from value: "description"
content = re.sub(r"(type):", r'"type":', content).replace('\t', '')
# strip the trailing ";" and stray characters so the string parses as JSON
content = content[:-5]
json_content = json.loads(content)
Also open to other Pythonic approaches.

Empty result while using bs4 to parse a site

I want to parse the price information from Bitmex using bs4.
(The site URL is 'https://www.bitmex.com/app/trade/XBTUSD'.)
So I wrote the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')
price = soup.find_all("span", {"class": "price"})
print(price)
And the result is like this:
connected...
[]
Why did '[]' pop up? And what should I do to get the price text, like '6065.5'?
The text I want to parse is
<span class="price">6065.5</span>
and the selector is
content > div > div.tickerBar.overflown > div > span.instruments.tickerBarSection > span:nth-child(1) > span.price
I've just started studying Python, so this question may seem odd to a pro... sorry.
You were pretty close. Give the following a try and see if it's more what you wanted. Perhaps the format you're seeing or retrieving is not quite what you expect. Hope this is helpful.
from bs4 import BeautifulSoup
import requests
import sys
import json
url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
    sys.exit(1)
bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')
# extract the json text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()
# parse json text
d = json.loads(price.text)
# pull out the order book and then each price listed in the order book
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)
Example output:
connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
Your problem is that the page doesn't contain those span elements in the first place. If you check the response tab in your browser's developer tools (press F12 in Firefox) you can see that the page is composed of script tags with JavaScript code that creates the elements dynamically when executed.
Since BeautifulSoup can't execute JavaScript, you can't extract the elements directly with it. You have two alternatives:
Use something like Selenium, which allows you to drive a browser from Python. That means the JavaScript will be executed, because you're using a real browser; however, the performance suffers.
Read the JavaScript code, understand it, and write Python code to simulate it. This is usually harder, but luckily for you it seems very simple for the page you want:
import json
import requests
import lxml.html
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)
data = json.loads(doc.xpath("//script[@id='initialData']/text()")[0])
As you can see, the data is in JSON format inside the page. After loading the data variable you can use it to access the information you want:
for row in data['orderBook']:
    print(row['symbol'], row['price'], row['side'])
Will print:
('XBTUSD', 6051.5, 'Sell')
('XBTUSD', 6051, 'Sell')
('XBTUSD', 6050.5, 'Sell')
('XBTUSD', 6050, 'Sell')

Beautiful Soup - Blank screen for a long time without any output

I am quite new to Python and am working on a scraping-based project, where I am supposed to extract all the contents from links containing a particular search term and place them in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime

def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0
    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks

print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something: there is no URL-based pagination.) So you endlessly loop over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the prefix http://www.marketing-interactive.com to links that already start with http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arif's solution is the proper way to go. But if you really want to do it this way, you'd need to do something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first value is "bookmark", because r['href'] simply does not contain the rel. That's not how BeautifulSoup structures things.
To scrape this specific site you can do two things:
You could do something with Selenium or something else that supports JavaScript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the URL that feeds the list. It has pagination, so you can easily loop over all results, as in the sketch below.
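A small sketch of that loop (my assumption: the endpoint returns an HTML fragment per page and yields no more bookmark links once the results run out):
import requests
from bs4 import BeautifulSoup

base = ('http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/'
        'loop_handler.php?pageNumber={}&postType=search&searchValue=digital+marketing')

links = []
page = 1
while True:
    fragment = requests.get(base.format(page)).text
    soup = BeautifulSoup(fragment, 'lxml')
    anchors = soup.find_all('a', rel='bookmark')
    if not anchors:  # an empty page means we've run out of results
        break
    links.extend(a['href'] for a in anchors)
    page += 1

print(links)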
The following script extracts all the links from the web page based on the given search key. It does not explore beyond the first page, but it can easily be modified to get all results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.

Get the code from inspect element using Python

In the Safari browser, I can right-click and select "Inspect Element", and a lot of code appears. Is it possible to get this code using Python? The best solution would be to get a file with the code in it.
More specifically, I am trying to find the links to the images on this page: http://500px.com/popular. I can see the links from "Inspect Element" and I would like to retrieve them with Python.
One way to get at the source code of a web page is to use the Beautiful Soup library; a tutorial is shown here. The code from the page is shown below; the comments are mine. This particular code no longer works as-is, because the contents of the example site have changed, but the concept should help you do what you want to do. Hope it helps.
from bs4 import BeautifulSoup
# If Python 2:
# from urllib2 import urlopen
# If Python 3 (urllib2 has been split into urllib.request and urllib.error):
from urllib.request import urlopen

BASE_URL = "http://www.chicagoreader.com"

def get_category_links(section_url):
    # Put the stuff you see when using Inspect Element in a variable called html.
    html = urlopen(section_url).read()
    # Parse the stuff.
    soup = BeautifulSoup(html, "lxml")
    # The next two lines will change depending on what you're looking for. This
    # line is looking for <dl class="boccat">.
    boccat = soup.find("dl", "boccat")
    # This line organizes what is found in the above line into a list of
    # hrefs (i.e. links).
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links
EDIT 1: The solution above provides a general way to web-scrape, but I agree with the comments to the question: the API is definitely the way to go for this site. Thanks to yuvi for providing it. The API is available at https://github.com/500px/PxMagic.
EDIT 2: Here is an example addressing your question about getting links to popular photos. The Python code from the example is pasted below. You will need to install the API library.
import fhp.api.five_hundred_px as f
import fhp.helpers.authentication as authentication
from pprint import pprint

key = authentication.get_consumer_key()
secret = authentication.get_consumer_secret()
client = f.FiveHundredPx(key, secret)
results = client.get_photos(feature='popular')

i = 0
PHOTOS_NEEDED = 2
for photo in results:
    pprint(photo)
    i += 1
    if i == PHOTOS_NEEDED:
        break
