XPath: Select All Text of Descendants where first Div attribute matches condition - python

Please consider the following code:
from lxml import html
import requests
page = requests.get('https://advisorless.substack.com/?no_cover=true')
tree = html.fromstring(page.content)
Within the HTML, the relevant sections are something like:
<div class="body markup">
    <p>123</p>
    <a href=''>456</a>
</div>
<div class="body markup">
    <p>ABC</p>
    <p>DEF</p>
</div>
Attempt 1
tree.xpath('//div[@class="body markup"]/descendant::*/text()')
Produces the following result: ['123', '456', 'ABC', 'DEF']
Attempt 2
tree.xpath('//div[@class="body markup"]/descendant::*/text()')[0]
Produces the following result: '123'
What I Want to Get: ['123', '456']
I'm not sure if this can be done with a sibling selector instead of descendants.
For Specific URL:
The following XPath, copied from Inspect Element, gives the result I'm looking for, although my code needs something more dynamic; div[3] is the div with class="body markup":
//*[@id="main"]/div[2]/div[2]/div[1]/div/article/div[3]/descendant::*/text()
For more specificity, this also works:
//div[@class="post-list"]/div[1]/div/article[@class="post"]/div[@class="body markup"]/descendant::*/text()
It's that one static div that I don't know how to modify. I'm sure there's a simple piece I'm not putting together.
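If the goal is just the text nodes of the first matching div (['123', '456']), one dynamic option is a positional predicate over the whole node-set instead of the static div[3]. A minimal sketch against the HTML from the question (the inline snippet is an assumption standing in for the live page):
from lxml import html
snippet = """
<div class="body markup"><p>123</p><a href=''>456</a></div>
<div class="body markup"><p>ABC</p><p>DEF</p></div>
"""
tree = html.fromstring(snippet)
# The parentheses make [1] index the whole node-set in document order
print(tree.xpath('(//div[@class="body markup"])[1]/descendant::*/text()'))
# ['123', '456']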

I'm still not entirely sure what you are after, but let's start with this and let me know how to modify the outcome, if necessary:
import requests
from lxml import html
url = "https://advisorless.substack.com/?no_cover=true"
resp = requests.get(url)
root = html.fromstring(resp.text)
targets = root.xpath("//div[@class='body markup'][./p][./a]")
for target in targets:
    print(target.text_content())
    for link in target.xpath('a'):
        print(link.attrib['href'])
    print('=====')
The output is too long to reproduce here, but see if it fits your desired output.
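If the desired shape is the per-div list of text nodes (['123', '456']) rather than one concatenated string, a relative .//text() query inside the loop is another option; a sketch continuing from the targets list above (the whitespace filtering is an editorial choice):
for target in targets:
    # .//text() yields each text node under this div separately
    texts = [t.strip() for t in target.xpath('.//text()') if t.strip()]
    print(texts)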

Related

Extracting information from website with BeautifulSoup and Python

I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I thought I would succeed in getting all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests
def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text
url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url)) # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is at the following location in the hierarchy:
<form name="aspnetForm" ...>
  ...
  <div id="wrapper">
    ...
    <div class="content">
      ...
      <div class="mainContentArea">
        ...
        <div id="mainSearchPannel" class="mainSearchContent">
          ...
          <div class="searchPanes">
            ...
            <div class="wordsSearchPane" style="display: block;">
              ...
              <div id="searchResultsAreaWord" class="searchResultsContainer">
                ...
                <div id="srPanes">
                  ...
                  <div id="srPane-2" class="resRefPane" style>
                    ...
                    <div style="height:600px;overflow:auto">
                      ...
                      <ul class="esResultList">
                        ...
                        <!-- HERE ARE THE TARGET ITEMS -->
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched by separate HTTP requests made when a web browser runs the JavaScript code in that page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, you can recreate it in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote-controlling a web browser with Python: let a web browser load the page, then access the DOM from Python to get what you want from the rendered page. Other answers, like "Scraping javascript-generated data using Python" and "scrape html generated by javascript with python", can point you in that direction.
Exactly what tag/class are you trying to scrape from the webpage? When I copied and ran your code, I included this line to check for the class name in the page's HTML, but did not find it:
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
There's less confusion overall when typing it out. As for possible approaches, you could look at the page in Chrome (or another browser) using the dev tools to search for some non-random class or id tags, like esResultListItem.
From there, if you know which tag you are looking for, you can include it in the search like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know which tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check by tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples (pictures, tags, etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach for extracting the data you are looking for:
import requests
from bs4 import BeautifulSoup
post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")
for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4>  |  המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה)  |  המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6>  |  המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.
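Since the question's original code searched for the class tooltippedWord, it may also be worth checking this JSON-delivered fragment for it directly; a sketch reusing the soup object from the answer above (whether that class actually appears in this fragment is an assumption to verify):
words = [el.get_text() for el in soup.find_all(class_="tooltippedWord")]
print(words[:10])  # empty if the class is not used in this fragment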

extract text from h1 and id with python beautiful soup

I'm trying to extract the text from the HTML element with id="itemSummaryPrice", but I couldn't figure it out.
html = """
<div id="itemSummaryContainer" class="content">
<div id="itemSummaryMainWrapper">
<div id="itemSummaryImage">
<img src="https://img.rl.insider.gg/itemPics/large/endo.fgreen.891c.jpg" alt="Forest Green Endo">
</div>
<h2 id="itemSummaryTitle">Item Report</h2>
<h2 id="itemSummaryDivider"> | </h2>
<h2 id="itemSummaryDate">Friday, January 15, 2021, 8:38 AM EST</h2>
<div id="itemSummaryBlankSpace"></div>
<h1 id="itemSummaryName">
<span id="itemNameSpan" style="color: rgb(88, 181, 73);"><span>Forest Green</span> <span>Endo</span></span>
</h1>
**<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>**
</div>
</div>
"""
My code:
price_checker_site = requests.get(price_checker_url + match2)
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item)
returns with:
<h1 id="itemSummaryPrice"></h1>
What I'm trying to extract:
<h1 id="itemSummaryPrice">200 - 300</h1>
OR
<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>
OR
200 - 300
Because I can't post comments yet, an answer then: shouldn't you call .text on price_check_item?
So the Python code looks like this:
price_checker_site = requests.get(price_checker_url + match2)
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item.text)  # also possible: print(price_check_item.text.strip())
I think this is the correct answer. Unfortunately I'm not able to test right now; I will check my code for you tonight.
As discussed in the comments, the content you seek is loaded dynamically using JavaScript. Therefore, you must either use a library like Selenium to dynamically run the JS, or find out where/how the data is loaded and replicate that.
Method 1: Use Selenium
from selenium import webdriver
url = 'https://rl.insider.gg/en/psn/octane/grey'
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
price = driver.find_element_by_id('itemSummaryPrice')
print(price.text)
In this case it's easy: you just load the page and use find_element_by_id to get the data you want.
Method 2: Trace & Replicate
If you look at your browser's debugger, you can find where/how the itemSummaryPrice is set.
In particular, we find that it's set using $('#itemSummaryPrice').text(itemData.currentPriceRange) in https://rl.insider.gg/js/itemDetails.js.
The next step is to find out where itemData comes from. It turns out, this is not from some other file or API call. Instead, it appears to be hard-coded in the HTML source itself (presumably loaded server-side).
If you inspect the source, you'll find the itemData is just a JSON object defined on one line within a script tag on the page itself.
There are two different approaches you can use here.
Use Selenium's execute_script to extract the data. This gives you the JSON object in a ready-to-use format. You can then just index it to get the currentPriceRange.
from selenium import webdriver
url = 'https://rl.insider.gg/en/psn/octane/grey'
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
itemData = driver.execute_script('return itemData')
print(itemData['currentPriceRange'])
Method 2.1: Alternative to Selenium
Alternatively, you can extract this in Python using traditional methods: convert the matching line to a usable Python object using json.loads, then index the object to extract the currentPriceRange. This gives you the desired output.
import re
import requests
import json
# Download & convert the response content to a list
url = 'https://rl.insider.gg/en/psn/octane/grey'
site = str(requests.get(url).content).split('\\n')
# Extract the line containing 'var itemData'
itemData = [s for s in site if re.match(r'^\s*var itemData', s)][0].strip()
# Remove 'var itemData' and ';' from that line
# This leaves valid JSON which can be converted from a string using json.loads
itemData = json.loads(re.sub(r'var itemData = |;', '', itemData))
# Index the data to extract the 'currentPriceRange'
print(itemData['currentPriceRange'])
This approach doesn't require Selenium to run the JavaScript and also doesn't require BeautifulSoup to parse the HTML. It does rely on the itemData being initialized in a certain way. Should the developers of that site decide to change the way this is done, you'll have to adapt it slightly in response.
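One hardening worth noting: the re.sub above removes every ';' on the line, which would corrupt the JSON if any string value happened to contain a semicolon. Capturing the object with a single anchored pattern avoids that; a sketch against the same assumed line format (the sample line here is hypothetical):
import re
import json
# Hypothetical line in the shape the answer assumes the page uses
line = 'var itemData = {"currentPriceRange": "200 - 300"};'
match = re.search(r'var itemData = (\{.*\});', line)
if match:
    itemData = json.loads(match.group(1))
    print(itemData['currentPriceRange'])  # 200 - 300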
Which method should I use?
If all you really want is the price range and nothing else, then use the first method. If you're interested in other data as well, you'd be better off extracting the full itemData JSON from the source and using that.
One could argue Selenium is more reliable than manually parsing the HTML, but in this case you're probably fine. In both cases, you assume there is some itemData defined somewhere. If the format changes slightly, the parsing may break. The other disadvantage is if part of the data relied on JS function calls, which Selenium would execute but manual parsing couldn't account for. (This isn't the case here, but it could change.)

Unable to get text using Xpath although using /text() already

I'm trying to scrape data from here using XPath, and although I'm using Inspect to copy the path and adding /text() to the end, an empty list is returned instead of ["Class 5"] for the text between the last span tags.
import requests
from lxml import html
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')
print(r1class)
The element that I'm targeting is the Class for race 1 (Class 5), and the structure matches the XPath that I'm using.
The code below should do the job, i.e. it works when used on other sites with a matching XPath expression. The racenet site doesn't deliver valid HTML, which is very probably the reason your code fails. This can be verified with the W3C online validator: https://validator.w3.org
import lxml.html
html = lxml.html.parse('https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16')
r1class = html.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')[0]
print(r1class)
This should get you started.
import requests
from lxml.etree import HTML
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
races = tree.xpath('//table[@class="tblLatestHorseResults"]')
for race in races:
    rows = race.xpath('.//tr')
    for row in rows:
        row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td') if i is not None]
        print(row_text_as_list)  # one list of cell texts per row
Your XPath expression doesn't match anything because the HTML page you are trying to scrape is seriously broken. Firefox (or any other web browser) fixes the page on the fly before displaying it. This results in HTML tags being added that are not present in the original document.
The following code contains an XPath expression, which will most likely point you in the right direction.
import requests
from lxml import html, etree
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
for node in nodes:
    print(etree.tostring(node))
When executed, this prints the following:
$ python test.py
<span class="bold">Class 5</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 3</span> Track:
<span class="bold">Class 2</span> Track:
<span class="bold">Class 3</span> Track:
Tip: whenever you are trying to scrape a web page and things just don't work as expected, download and save the HTML to a file. In this case, e.g.:
with open("test.xml", 'wb') as f:  # 'wb' because .content is bytes
    f.write(sample_page.content)
Then have a look at the saved HTML. This gives you an idea of what the DOM will look like.
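Since lxml repairs broken HTML with its own recovery rules (not necessarily the same fixes Firefox applies), it can also help to dump the tree lxml actually built, because that is what your XPath runs against; a small sketch reusing the tree parsed above:
# Serialize lxml's repaired tree, not the raw bytes from the server
with open("parsed.html", "wb") as f:
    f.write(etree.tostring(tree, pretty_print=True))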

Getting parent tag id with lxml

I am trying to scrape a dummy site and get the parent tag of the one I am searching for. Here's the structure of the code I am searching for:
<div id='veg1'>
    <div class='veg-icon icon'></div>
</div>
<div id='veg2'>
</div>
Here's my Python script:
from lxml import html
import requests
req = requests.get('https://mysite.com')
vegTree = html.fromstring(req.text)
veg = vegTree.xpath('//div[div[@class="veg-icon vegIco"]]/id')
When veg is printed I get an empty list, but I am hoping to get veg1. As I am not getting an error, I am not sure what has gone wrong. I saw this syntax in a previous question and followed it; see lxml: get element with a particular child element?.
A few things are wrong in your XPath:
you are checking for the classes veg-icon vegIco, while in the HTML the child div has veg-icon icon
attributes are prefixed with @: @id instead of id
The fixed version:
//div[div[@class="veg-icon icon"]]/@id
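Putting both fixes together in a runnable sketch (the inline snippet stands in for the dummy site from the question):
from lxml import html
snippet = """
<div id='veg1'>
    <div class='veg-icon icon'></div>
</div>
<div id='veg2'></div>
"""
vegTree = html.fromstring(snippet)
# @id reads the attribute; the predicate matches the child div's actual class
print(vegTree.xpath('//div[div[@class="veg-icon icon"]]/@id'))
# ['veg1']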

lxml: get element with a particular child element?

Working in lxml, I want to get the href attribute of all links with an img child that has title="Go to next page".
So in the following snippet:
<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>
I'd like to get StdResults.aspx back.
I've got this far:
next_link = doc.xpath("//a/img[@title='Go to next page']")
print(next_link[0].attrib['href'])
But next_link is the img, not the a tag - how can I get the a tag?
Thanks.
Just change a/img... to a[img...]: (the brackets sort of mean "such that")
import lxml.html as lh
content='''<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>'''
doc=lh.fromstring(content)
for elt in doc.xpath("//a[img[@title='Go to next page']]"):
    print(elt.attrib['href'])
# StdResults.aspx
Or, you could go even farther and use
"//a[img[#title='Go to next page']]/#href"
to retrieve the values of the href attributes.
You can also select the parent node or arbitrary ancestors by using //a/img[@title='Go to next page']/parent::a or //a/img[@title='Go to next page']/ancestor::a, respectively, as XPath expressions.
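For completeness, a sketch of the parent:: variant against the same snippet:
import lxml.html as lh
content = '''<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>'''
doc = lh.fromstring(content)
# parent::a climbs from the matched img back up to its enclosing <a>
for elt in doc.xpath("//a/img[@title='Go to next page']/parent::a"):
    print(elt.attrib['href'])
# StdResults.aspx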
