scraping dynamic updates of temperature sensor data from a website - python

I wrote the following Python code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
freq=soup.find('div', attrs={'id':'frequenz'})
print freq
The result is:
<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>
When I look at this site with a web browser, the page shows dynamic content, not the string 'tempsensor'. The temperature value is automatically refreshed every second, so something in the web page is replacing the string 'tempsensor' with a numerical value.
My problem now is: how can I get Python to see the updated numerical value? How can I obtain the automatically updated value of tempsensor with BeautifulSoup?

Sorry, no. This is not possible with BeautifulSoup alone.
The problem is that BS4 is not a complete web browser; it is only an HTML parser. It parses neither CSS nor JavaScript.
A complete web browser does at least four things:
1. Connects to web servers and fetches data
2. Parses HTML content and CSS formatting and presents a web page
3. Parses JavaScript content and runs it
4. Provides for user interaction: browser navigation, HTML forms, and an events API for the JavaScript program
Still not sure? Look at your code: BS4 does not even cover step 1, fetching the web page; for that you had to use urllib2.
Dynamic sites usually include JavaScript that runs in the browser and periodically updates the content. BS4 provides no way to run it, so you won't see those updates, and you never will using BS4 alone. Why? Because item 3 above, downloading and executing the JavaScript program, is not happening. It would be happening in IE, Firefox, or Chrome, and that is why those browsers show the dynamic content while BS4-only scraping does not.
PhantomJS and CasperJS provide a more mechanized browser that can often run the JavaScript code behind dynamic websites. But CasperJS and PhantomJS are programmed in JavaScript, not Python.
Apparently, some people are using a browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending it to BS4 for parsing. That could allow for a Python solution.
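That PyQt4 route typically looks something like the sketch below: a minimal, untested outline that renders the page in a headless QtWebKit page and hands the resulting HTML to BS4. The JSRenderer class name is mine, and note that the DOM is captured once when loading finishes, so a value that updates every second may need re-fetching.
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
from bs4 import BeautifulSoup

class JSRenderer(QWebPage):
    """Load a URL in a headless WebKit page and keep the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._on_load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _on_load_finished(self, ok):
        # The frame's HTML now reflects whatever JavaScript has run so far.
        self.html = self.mainFrame().toHtml()
        self.app.quit()

renderer = JSRenderer('http://www.example.com')
soup = BeautifulSoup(unicode(renderer.html), 'html.parser')
print soup.find('div', attrs={'id': 'frequenz'})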
In comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it could be fetched and parsed with urllib2/BS4. This can be determined by careful examination of the JavaScript running at the site; in particular, look for setTimeout and setInterval, which schedule updates, or for ajax or jQuery's .load function, which fetch data from the back end. JavaScript that updates dynamic content will usually fetch data only from back-end URLs on the same web site. If the site uses jQuery, $('#frequenz') refers to the div, and by searching for this in the JS you may find the code that updates it. Without jQuery, the JS update would probably use document.getElementById('frequenz').
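If you do find such a back-end URL, the idea looks roughly like this hypothetical sketch; the /gettemp path is invented for illustration, and the real URL must come from reading the site's JavaScript or from the browser's Network tab:
import urllib2

# Hypothetical: replace with the URL the page's JavaScript actually polls.
backend_url = 'http://www.example.com/gettemp'
response = urllib2.urlopen(backend_url)
print response.read()  # often plain text or JSON; may not need BS4 at all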

You're missing a tiny bit of code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
freq = soup.find('div', attrs={'id':'frequenz'})
print freq.string # Added .string

This should do it:
freq.text.strip()
As in
>>> html = '<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.text.strip()
u'tempsensor'

Related

Extracting information from website with BeautifulSoup and Python

I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I thought I would succeed in getting all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests

def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url))  # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is located in the following part of the structure:
<form name="aspnetForm" ...>
...
<div id="wrapper">
...
<div class="content">
...
<div class="mainContentArea">
...
<div id="mainSearchPannel" class="mainSearchContent">
...
<div class="searchPanes">
...
<div class="wordsSearchPane" style="display: block;">
...
<div id="searchResultsAreaWord"
class="searchResultsContainer">
...
<div id="srPanes">
...
<div id="srPane-2" class="resRefPane"
style>
...
<div style="height:600px;overflow:auto">
...
<ul class="esResultList">
...
# HERE IS THE TARGET ITEMS
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests made when a web browser runs the JavaScript code present in said web page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, then you can recreate that request in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote-controlling a web browser with Python. That lets a web browser load the page, after which you can access the DOM in Python to get what you want from the rendered page. Other answers, like Scraping javascript-generated data using Python and scrape html generated by javascript with python, can point you in that direction.
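For the browser-automation route, a minimal sketch with Selenium might look like the following (assuming Selenium and a Chrome driver are installed; untested against this site, and a wait for the results to render may be needed):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1")
# The page's JavaScript now runs in a real browser; ideally wait here
# (e.g. with WebDriverWait) until the result list has rendered.
soup = BeautifulSoup(driver.page_source, "html.parser")
words = [el.get_text() for el in soup.find_all(class_="tooltippedWord")]
driver.quit()
print(words)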
Exactly what tag/class are you trying to scrape from the webpage? When I copied and ran your code, I included this line to check for the class name in the page's HTML, but did not find it:
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
There is less confusion overall when typing it out. As for possible approaches: look at the page in Chrome (or another browser) using the dev tools, and search for some non-random class or id tags, like esResultListItem.
From there, if you know which tag you are looking for, you can include it in the search, like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know which tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check back tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help if you can provide more examples (pictures/tags/etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach to extracting data close to what you are looking for:
import requests
from bs4 import BeautifulSoup

post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}

req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")

for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4>  |  המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה)  |  המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6>  |  המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.

Requests won't get the text from web page?

I am trying to get the value of VIX from a webpage.
The code I am using:
import requests
from bs4 import BeautifulSoup

raw_page = requests.get("https://www.nseindia.com/live_market/dynaContent/live_watch/vix_home_page.htm").text
soup = BeautifulSoup(raw_page, "lxml")
vix = soup.find("span", {"id": "vixIdxData"})
print(vix.text)
This gives me:
' '
If I print vix, I see:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;"></span>
On the site, the element has text:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">15.785</span>
The 15.785 value is what I want to get by using requests.
The data you're looking for is not available in the page source; requests.get(...) gets you only the page source, without the elements that are dynamically added through JavaScript. But you can still get the data using the requests module.
In the Network tab, inside the developer tools, you can see a file named VixDetails.json. A request is being sent to https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json, which returns the data in the form of JSON.
You can access it using the built-in .json() method of the response object.
import requests

r = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json')
data = r.json()
vix_price = data['currentVixSnapShot'][0]['CURRENT_PRICE']
print(vix_price)
# 15.7000
When you open the page in a web browser, the text (e.g., 15.785) is inserted into the span element by the getIndiaVixData.js script.
When you get the page using requests in Python, only the HTML code is retrieved and no JavaScript processing is done. So, the span element stays empty.
It is impossible to get that data by solely parsing the HTML code of the page using requests.

Unable to fetch information as 3rd party browser plugin is blocking JS to work

I wanted to extract data from https://www.similarweb.com/, but when I run my code it shows the following (HTML output converted to text):
Pardon Our Interruption http://cdn.distilnetworks.com/css/distil.css" media="all" /> http://cdn.distilnetworks.com/images/anomaly-detected.png" alt="0" />
Pardon Our Interruption...
As you were browsing www.similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article .
After completing the CAPTCHA below, you will immediately regain access to www.similarweb.com.
if (!RecaptchaOptions){ var RecaptchaOptions = { theme : 'blackglass' }; }
You reached this page when attempting to access https://www.similarweb.com/ from 14.139.82.6 on 2017-05-22 12:02:37 UTC.
Trace: 9d8ae335-8bf6-4218-968d-eadddd0276d6 via 536302e7-b583-4c1f-b4f6-9d7c4c20aed2
I have written the following piece of code:
import urllib
from BeautifulSoup import *

url = "https://www.similarweb.com/"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print (soup.prettify())

# tags = soup('a')
# for tag in tags:
#     print 'TAG:', tag
#     print tag.get('href', None)
#     print 'Contents:', tag.contents[0]
#     print 'Attrs:', tag.attrs
Can anyone help me as to how I can extract the information?
I tried with requests; it failed. Selenium seems to work:
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.similarweb.com/')
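From there, the rendered page can be handed to BeautifulSoup as usual. A hedged continuation of the same session (untested against the live site) might look like:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(driver.page_source, 'html.parser')
>>> for tag in soup('a')[:5]:  # first few links, as in the commented-out code above
...     print(tag.get('href'))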

Mimicking HTML5 Video support on PhantomJS used through Selenium in Python

I am trying to extract the source link of an HTML5 video found in the video tag. Using the Firefox webdriver, I am able to get the desired result, i.e.:
[<video class="video-stream html5-main-video" src='myvideoURL..'></video>]
but if I use PhantomJS:
<video class="video-stream html5-main-video" style="width: 854px; height: 480px; left: 0px; top: 0px; -webkit-transform: none;" tabindex="-1"></video>
I suspect this is because of PhantomJS's lack of HTML5 video support. Is there any way I can trick the webpage into thinking that HTML5 video is supported, so that it generates the URL? Or can I do something else?
I tried this:
try:
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, "//video")))
finally:
    k = browser.page_source
    browser.quit()

soup = BeautifulSoup(k, 'html.parser')
print(soup.find_all('video'))
The ways the Firefox and PhantomJS webdrivers communicate with Selenium are quite different.
When using Firefox, it signals back that the page has finished loading only after it has run some of the JavaScript.
PhantomJS, by contrast, signals Selenium that the page has finished loading as soon as it can get the page source, meaning it may not have run any JavaScript yet.
What you need to do is wait for the element to be present before extracting it; in this case that would be:
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//video")))
EDIT:
YouTube first checks whether the browser supports the video content before deciding whether to provide the source; there's a workaround, though, described here.
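Putting the pieces together, here is a hedged end-to-end sketch (the video-page URL is a placeholder, and Firefox is used since it reportedly works where PhantomJS does not):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://www.example.com/some-video-page')  # placeholder URL
try:
    # Wait up to 10 s for the <video> element to appear in the DOM.
    video = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//video")))
    print(video.get_attribute('src'))  # populated in Firefox; empty under PhantomJS
finally:
    driver.quit()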
