I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I thought I would be able to get all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests

def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url))  # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is in the following structure location:
<form name="aspnetForm" ...>
  ...
  <div id="wrapper">
    ...
    <div class="content">
      ...
      <div class="mainContentArea">
        ...
        <div id="mainSearchPannel" class="mainSearchContent">
          ...
          <div class="searchPanes">
            ...
            <div class="wordsSearchPane" style="display: block;">
              ...
              <div id="searchResultsAreaWord" class="searchResultsContainer">
                ...
                <div id="srPanes">
                  ...
                  <div id="srPane-2" class="resRefPane" style>
                    ...
                    <div style="height:600px;overflow:auto">
                      ...
                      <ul class="esResultList">
                        ...
                        <!-- HERE ARE THE TARGET ITEMS -->
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests made when a web browser runs the JavaScript code present in that page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, then you can recreate that request in Python directly and use Beautiful Soup to pick out the useful parts. Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote-controlling a web browser with Python. That lets a web browser load the page, after which you can access the DOM in Python to get what you want from the rendered page. Other answers like Scraping javascript-generated data using Python and scrape html generated by javascript with python can point you in that direction.
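For the browser-automation route, a minimal Selenium sketch might look like this (this assumes Chrome with chromedriver installed, and that the rendered page really does contain the tooltippedWord elements; adjust the wait condition to whatever the page actually renders):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"

driver = webdriver.Chrome()
driver.get(url)
# Wait for the page's JavaScript to render at least one of the target elements
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "tooltippedWord"))
)
# Hand the rendered DOM to Beautiful Soup
soup = BeautifulSoup(driver.page_source, "html.parser")
print([el.get_text() for el in soup.find_all(class_="tooltippedWord")])
driver.quit()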
Exactly what tag and class are you trying to scrape from the webpage? When I copied and ran your code, I included this line to check for the class name in the page's HTML, but did not find it:
print("tooltippedWord" in requests.get(url).text)  # False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll:
res = soup.findAll(attrs={"class": "tooltippedWord"})
There's less confusion overall when typing it out. As for possible approaches, one is to look at the page in Chrome (or another browser) using the dev tools to search for some non-random class or id tags like esResultListItem.
From there, if you know which tag you are looking for, you can include it in the search like so:
res = soup.findAll("div", attrs={"class": "tooltippedWord"})
It's definitely easier if you know what tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
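For a tag like that, any of the following would find it (a small sketch using the made-up names from the line above):
from bs4 import BeautifulSoup

html = '<span id="somespecialname" class="verySpecialName"></span>'
soup = BeautifulSoup(html, "html.parser")

soup.find("span", id="somespecialname")                  # by id
soup.find("span", attrs={"class": "verySpecialName"})    # by class
soup.select_one("span#somespecialname.verySpecialName")  # CSS selector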
If you're still looking for help, I can check by tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples (pictures, tags, etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach for extracting data close to what you are looking for:
import requests
from bs4 import BeautifulSoup

post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}

req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")

for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4> | המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה) | המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6> | המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.
I am trying to scrape some sports game data, and I have run into some issues with my code. Eventually I will move this data into a dataframe and then into a database.
In the code, I have found the class element of one of the headers I would like to parse. There are multiple h1's in the HTML I am parsing.
<div class="type-game">
  <div class="type">NHL Regular Season</div>
  <h1>Blackhawks vs. Ducks</h1>
</div>
With this HTML structure, how can I get the h1 text back as a string I can use to populate a dataframe?
The code I have tried so far is:
import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)  # + str(page) + '/')
soup = bs(req.text, 'html.parser')
stype = soup.find('h1', class_='type-game')
print(stype)
This code returns "None". I have checked other articles on here and nothing has worked so far.
For the next level of my question: is there a way to create a for loop or similar to go through all of the pages (the website numbers its event pages sequentially) and keep only games whose title contains a given string?
For example, what if I wanted to save only games that have the Chicago Blackhawks in the h1 inside the div element with class="type-game"?
Pseudocode would be something like this:
For webpages 1 to 10000:
    if the h1 inside the div with class="type-game" contains "Blackhawks":
        proceed with parsing the page
    if not, skip it and go to the next webpage
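In Python, I imagine that would look roughly like this (the URL pattern below is only a placeholder for whatever the real site uses):
import requests
from bs4 import BeautifulSoup as bs

for page in range(1, 10001):
    req = requests.get('https://example.com/events/' + str(page) + '/')  # hypothetical URL pattern
    soup = bs(req.text, 'html.parser')
    h1 = soup.select_one('div.type-game h1')
    if h1 is None or 'Blackhawks' not in h1.text:
        continue  # no Blackhawks game on this page, move on
    # proceed with parsing the rest of the page here
    print(page, h1.text)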
I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.
Select your elements more specifically, for example with CSS selectors:
soup.select('h1:-soup-contains("Blackhawks")')
or
soup.select('div.type-game h1:-soup-contains("Blackhawks")')
To get the text from a tag, just use .text or get_text():
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Example
from bs4 import BeautifulSoup

html = '''
<div class="type-game">
  <div class="type">NHL Regular Season</div>
  <h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
  <div class="type">NHL Regular Season</div>
  <h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
  <div class="type">NHL Regular Season</div>
  <h1>Ducks vs. Blackhawks</h1>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Output
Blackhawks vs. Ducks
Ducks vs. Blackhawks
EDIT
for e in soup.select('div.type-game h1'):
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever there is to do
I want to get all the social links of a company from this. When doing
summary_div.find("div", {'class': "cp-summary__social-links"})
I am getting this
<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":
{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},
{"url":"http://www.linkedin.com/company/snapdeal?utm_source=craft.co","icon":"linkedin","label":"LinkedIn"},
{"url":"https://instagram.com/snapdeal/?utm_source=craft.co","icon":"instagram","label":"Instagram"},
{"url":"https://www.facebook.com/Snapdeal?utm_source=craft.co","icon":"facebook","label":"Facebook"},
{"url":"https://www.crunchbase.com/organization/snapdeal?utm_source=craft.co","icon":"cb","label":"CrunchBase"},
{"url":"https://www.youtube.com/user/snapdeal?utm_source=craft.co","icon":"youtube","label":"YouTube"},
{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],
"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>
I also tried getting the children of cp-summary__social-links (which is what I actually want) and then finding all a tags to get the links. That does not work either.
Any idea how to do this?
Update: As Sraw suggested, I managed to get all the URLs by doing it like this.
import json

urls = []
social_link = summary_div.find("div", {'class': "cp-summary__social-links"}).find("div", {"data-integration-name": "react-component"})
json_text = json.loads(social_link["data-payload"])
for link in json_text['props']['links']:
    urls.append(link['url'])
Thanks in advance.
I am trying to create a Python script which finds a specific text inside a span which comes from a class. Unfortunately, I keep getting an empty response or "None".
It comes from a very specific page, so I'll paste the small bit of it that I'm trying to find:
<tbody>
  <tr class="zone-dedicated-availability" data-actions="refUnavailable" data-dc="" data-ref="160sk5" data-availability="3600-">
    <td class="show-on-ref-unavailable elapsed-time-since-last-delivery" colspan="5">
      <span qtlid="47402">
        Last server delivered: today at 01:59.
      </span><br><a style="font-size:14px;" href=".." qtlid="50602">Go for a VPS-CLOUD<br><span style="font-size:0.9em;" qtlid="50615">(from £5.99 excl. VAT)</span></a>
    </td>
I am trying to get the "Last server delivered" text with my script. I am still learning, so I would appreciate the help:
import requests
from bs4 import BeautifulSoup

page = requests.get('...')
tree = page.content
soup = BeautifulSoup(tree)
table = soup.find('tbody', {'class': 'zone-dedicated-availability'})
print(table)
I am probably missing something in the find statement, as this is where I'm stuck now. I have tried a few different things, but I'm not sure how to get the valid output I need.
The class attribute is on the tr, so you need to use this:
table = soup.find('tbody').find('tr', {'class': 'zone-dedicated-availability'})
or even better:
table = soup.find('tr', {'class': 'zone-dedicated-availability'})
You can also use a CSS selector and the select method:
soup.select('tbody tr.zone-dedicated-availability')
The data you want is in the first span with qtlid="47402", thus:
In [19]: soup.find('tr', class_='zone-dedicated-availability').find('span', qtlid='47402').get_text(strip=True)
Out[19]: 'Last server delivered: today at 01:59.'
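The same extraction could also be written with a CSS selector (a sketch against the same HTML, using an attribute selector for qtlid):
soup.select_one('tr.zone-dedicated-availability span[qtlid="47402"]').get_text(strip=True)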
Have you tried looking for a table row with the class of "zone-dedicated-availability"? It seems that you are currently searching for a table body with that class, and that it is unable to find it.
I am trying to get the text from a div that is nested. Here is the code that I currently have:
sites = hxs.select('/html/body/div[@class="content"]/div[@class="container listing-page"]/div[@class="listing"]/div[@class="listing-heading"]/div[@class="price-container"]/div[@class="price"]')
But it is not returning a value. Is my syntax wrong? Essentially I just want the text out of <div class="price">
Any ideas?
The URL is here.
The price is inside an iframe so you should scrape https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978
Once you request this URL, you can extract the price with:
hxs.select('//div[@class="price"]/text()').extract()[0]
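If you would rather use requests and BeautifulSoup than Scrapy selectors, a roughly equivalent sketch against the same iframe URL would be:
import requests
from bs4 import BeautifulSoup

url = 'https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
price = soup.find('div', class_='price')
if price is not None:  # the listing may have expired
    print(price.get_text(strip=True))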