When I fetch the HTML of the page, e.g.
response = urllib2.urlopen('http://www.wunderground.com/us/fl/miami/precipitation')
html = response.read()
I get HTML with collapsed (empty) containers, e.g.
<h2>6-Hour Precipitation Forecast</h2>
<div id="precip-statement"></div>
<div id="precip-graph">
while the real HTML, as rendered in the browser, has these containers filled in.
Clearly, I need to extract the 6-hour forecast, which I cannot do while it is collapsed into <div id="precip-statement"></div>.
I will be very thankful if you can help me with this issue. Thank you
The content is loaded dynamically using Ajax. You can inspect the request in Chrome: press F12 -> Network -> XHR and look at the requests. One of them (wwir.json) returns JSON that you can parse using:
import json
# response is the urlopen() result for the wwir.json request, not for the HTML page
weather = json.loads(response.read())
It looks like they use an API key from api.weather.com, which probably means you should get your own.
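For reference, here is a minimal Python 3 sketch of the same idea using only the standard library (urllib.request takes the place of urllib2 there). The function names and the sample payload are illustrative assumptions; the real wwir.json URL, its API key, and its field names have to be copied from the Network tab.

```python
import json
from urllib.request import urlopen


def parse_json_body(body):
    """Decode a raw HTTP response body (bytes or str) into Python objects."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)


def fetch_json(url, timeout=10):
    """Download `url` and parse the body as JSON (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_json_body(response.read())


# Offline demonstration with a made-up payload; the real field names may differ.
sample_body = b'{"forecast": {"phrase": "No precipitation expected"}}'
weather = parse_json_body(sample_body)
print(weather["forecast"]["phrase"])
```

Calling fetch_json with the URL copied from the XHR list should then give you a dict you can index the same way.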
I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I expected to get all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests

def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url))  # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is in the following structure location:
<form name="aspnetForm" ...>
...
<div id="wrapper">
...
<div class="content">
...
<div class="mainContentArea">
...
<div id="mainSearchPannel" class="mainSearchContent">
...
<div class="searchPanes">
...
<div class="wordsSearchPane" style="display: block;">
...
<div id="searchResultsAreaWord"
class="searchResultsContainer">
...
<div id="srPanes">
...
<div id="srPane-2" class="resRefPane"
style>
...
<div style="height:600px;overflow:auto">
...
<ul class="esResultList">
...
# HERE IS THE TARGET ITEMS
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests that are made when a web browser runs the JavaScript code present in that page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request returned the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, you can recreate it in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that involve remote-controlling a web browser from Python. This lets a web browser load the page, after which you can access the DOM in Python to get what you want from the rendered page. Other answers like "Scraping javascript-generated data using Python" and "scrape html generated by javascript with python" can point you in that direction.
Exactly which tag and class are you trying to scrape from the webpage? When I copied and ran your code, I added this line to check for the class name in the page's HTML, but did not find it:
print("tooltippedWord" in requests.get(url).text) #False
I find it generally easier to use the attrs keyword argument with find_all (or its older alias findAll):
res = soup.findAll(attrs={"class":"tooltippedWord"})
There is less confusion overall when typing it out. One possible approach is to look at the page in Chrome (or another browser) and use the dev tools to search for stable class names or ids, such as esResultListItem.
From there, if you know which tag you are looking for, you can include it in the search like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know which tag you are looking for, as well as any class names or ids it carries, for example:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check back tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's a lot easier to help if you can provide more examples (pictures, tags, etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach for extracting data close to what you are looking for:
import requests
from bs4 import BeautifulSoup
post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")
for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4> | המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה) | המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6> | המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.
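To make the shape of that response concrete, here is a small offline sketch using only the standard library. The envelope below is a made-up stand-in for the real reply (only the d key and the srRowSecond class name come from the code above), and CellCollector is a hypothetical helper, not part of any library.

```python
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <strong> still belongs to the cell.
        if self._in_cell:
            self.cells[-1] += data


# Hypothetical envelope: the HTML travels as a string under the "d" key.
envelope = json.dumps(
    {"d": "<table><tr class='srRowSecond'><td>1</td>"
          "<td><strong>Title</strong></td></tr></table>"}
)

fragment = json.loads(envelope)["d"]   # step 1: unwrap the JSON
collector = CellCollector()
collector.feed(fragment)               # step 2: parse the embedded HTML
print(collector.cells)
```

The two steps are the same as in the answer: decode the JSON first, then hand the string under d to an HTML parser.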
I am trying to get the weather from a website and collect this data, but some requests return empty lists or different information than expected. Why does this happen, and what is the correct way to find the right XPath and extract information from a website?
I have tried multiple websites but cannot consistently get results.
import requests
from lxml import html
site1data = requests.get('http://m.bom.gov.au/vic/melbourne/', verify=False)
tree = html.fromstring(site1data.content)
humidity = tree.xpath('//div[@class="humidity"]/text()')
print(humidity)
the expected result was something like:
67%
but i got:
['\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t']
Because the text data you are looking for is presented inside a <p> tag, not inside the <div> itself:
<div class="humidity">
<h3>Humidity</h3>
<img class="humidity" src="/assets/images/ui/humidity.svg" />
<p>65%</p>
</div>
This XPath should solve your immediate problem:
humidity = tree.xpath('//div[@class="humidity"]/p/text()')
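The difference between the two queries can be reproduced offline on the snippet above. This sketch uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so it runs without extra packages; it works here only because the snippet happens to be well-formed XML).

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="humidity">
    <h3>Humidity</h3>
    <img class="humidity" src="/assets/images/ui/humidity.svg" />
    <p>65%</p>
</div>
"""

div = ET.fromstring(snippet.strip())

# Text sitting directly under the <div> (before its first child)
# is only indentation whitespace, which is what the original query returned.
print(repr(div.text))

# The value lives one level down, in the <p> child.
humidity = [p.text for p in div.findall("./p")]
print(humidity)
```

With lxml, the text() step in the original query collects exactly those whitespace-only text nodes, which explains the list of '\n\t\t...' strings.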
The site also offers a beta version that is fed by an API, so you can get all of this information from that endpoint as JSON:
import requests
r = requests.get('https://api.weather.bom.gov.au/v1/locations/r1r0fs/observations').json()
print(r)
I am trying to get the value of VIX from a webpage.
The code I am using:
raw_page = requests.get("https://www.nseindia.com/live_market/dynaContent/live_watch/vix_home_page.htm").text
soup = BeautifulSoup(raw_page, "lxml")
vix = soup.find("span",{"id":"vixIdxData"})
print(vix.text)
This gives me:
' '
If I print vix, I get:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;"></span>
On the site, the element has text:
<span id="vixIdxData" style=" font-size: 1.8em;font-weight: bold;line-height: 20px;">15.785</span>
The 15.785 value is what I want to get by using requests.
The data you're looking for is not available in the page source, and requests.get(...) gets you only the page source, without the elements that are added dynamically through JavaScript. You can still get it using the requests module, though.
In the Network tab of the developer tools, you can see a file named VixDetails.json. A request is sent to https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json, which returns the data as JSON.
You can access it using the .json() method of the response object that requests returns.
r = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/VixDetails.json')
data = r.json()
vix_price = data['currentVixSnapShot'][0]['CURRENT_PRICE']
print(vix_price)
# 15.7000
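If the endpoint ever changes shape, the chained indexing above raises an exception. A slightly more defensive sketch follows; extract_vix is a hypothetical helper, and the sample dict only mirrors the keys used above, assuming the price arrives as a string.

```python
def extract_vix(data):
    """Return the current VIX value as a float, or None if the shape differs."""
    try:
        return float(data["currentVixSnapShot"][0]["CURRENT_PRICE"])
    except (KeyError, IndexError, TypeError, ValueError):
        # Missing key, empty list, wrong type, or a non-numeric price string.
        return None


# Made-up payload that mirrors only the keys used in the answer.
sample = {"currentVixSnapShot": [{"CURRENT_PRICE": "15.7000"}]}

print(extract_vix(sample))  # 15.7
print(extract_vix({}))      # None
```

In real use you would pass r.json() instead of sample.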
When you open the page in a web browser, the text (e.g., 15.785) is inserted into the span element by the getIndiaVixData.js script.
When you get the page using requests in Python, only the HTML code is retrieved and no JavaScript processing is done. So, the span element stays empty.
It is impossible to get that data by solely parsing the HTML code of the page using requests.
I'm working in web2py and I'm trying to print out html code from the controller, which is written in python. The issue is even when I write the html in a string in python, the page is rendering this string as it would normal html. This seems like there would be a simple fix, but I have not been able to find an answer. Here is the specific code.
return ("Here is the html I'm trying to show: <img src= {0}>".format(x))
The resulting page shows "Here is the html I'm trying to show: " and then the rest is blank. If I inspect the page the rest of the code is still there, which means it is being read, just not displayed. So I just need a way to keep the html that is in the string from being interpreted as html. Any ideas?
If you want to send HTML markup but have the browser treat it and display it as plain text, then simply set the HTTP Content-Type header appropriately. For example, in the web2py controller:
def myfunc():
    ...
    response.headers['Content-Type'] = 'text/plain'
    return ("Here is the html I'm trying to show: <img src={0}>".format(x))
On the other hand, if you want the browser to treat and render the response as HTML and you care only about how it is displayed in the browser (but not about the actual text characters in the returned content), you can simply escape the HTML markup. web2py provides the xmlescape function for this purpose:
def myfunc():
    x = '/static/myimage.png'
    html = xmlescape("<img src={0}>".format(x))
    return ("Here is the html I'm trying to show: {0}".format(html))
The above will return the following to the browser:
Here is the html I'm trying to show: &lt;img src=/static/myimage.png&gt;
which the browser will display as:
Here is the html I'm trying to show: <img src=/static/myimage.png>
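The escaping step can be tried outside web2py as well. This sketch uses html.escape from the Python 3 standard library as a stand-in for xmlescape (a substitution for illustration only; it is not web2py's implementation).

```python
import html

x = "/static/myimage.png"
markup = "<img src={0}>".format(x)

# Replace the characters that would start or end a tag with HTML entities.
escaped = html.escape(markup)
print(escaped)

# A browser reverses this when rendering, so the user sees the original text.
print(html.unescape(escaped) == markup)
```

The browser receives the entity-encoded string and renders it as literal text rather than as an image tag.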
Note, if you instead use a web2py template to generate the response, any HTML markup inserted will automatically be escaped. For example, you could have a myfunc.html template like the following:
{{=markup}}
And in the controller:
def myfunc():
    ...
    return dict(markup="Here is the html I'm trying to show: <img src={0}>".format(x))
In that case, web2py will automatically escape the content inserted via {{=markup}} (so no need to explicitly call xmlescape).
I take it you are trying to view this string in a web browser.
To take the raw html and not have the browser render it, you can wrap it in <xmp> tags:
return ("Here is the html I'm trying to show: <xmp><img src= {0}></xmp>".format(x))