Getting parent tag id with lxml - python

I am trying to scrape a dummy site and get the parent tag of the one that I am searching for. Here's the structure of the HTML I am searching for:
<div id='veg1'>
  <div class='veg-icon icon'></div>
</div>
<div id='veg2'>
</div>
Here's my Python script:
from lxml import html
import requests
req = requests.get('https://mysite.com')
vegTree = html.fromstring(req.text)
veg = vegTree.xpath('//div[div[@class="veg-icon vegIco"]]/id')
When veg is printed I get an empty list, but I am hoping to get veg1. As I am not getting an error, I am not sure what has gone wrong. I saw this syntax in a previous question and followed it; see lxml: get element with a particular child element?.

A few things are wrong in your XPath:
you are checking for the classes veg-icon vegIco, while in the HTML the child div has veg-icon icon
attributes are prepended with @: @id instead of id
The fixed version:
//div[div[@class="veg-icon icon"]]/@id
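For completeness, here is a minimal sketch of the corrected script; the URL is the placeholder from the question and the markup is assumed to be as shown above:
from lxml import html
import requests

req = requests.get('https://mysite.com')  # placeholder URL from the question
vegTree = html.fromstring(req.text)
# Select the id attribute of any div whose child div has the class "veg-icon icon"
veg = vegTree.xpath('//div[div[@class="veg-icon icon"]]/@id')
print(veg)  # expected output for the markup above: ['veg1']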

Related

Extracting information from website with BeautifulSoup and Python

I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles) no matter how hard I try.
Using the following function, I thought I would be able to get all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests
def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text
url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url)) # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is located as follows:
<form name="aspnetForm" ...>
  ...
  <div id="wrapper">
    ...
    <div class="content">
      ...
      <div class="mainContentArea">
        ...
        <div id="mainSearchPannel" class="mainSearchContent">
          ...
          <div class="searchPanes">
            ...
            <div class="wordsSearchPane" style="display: block;">
              ...
              <div id="searchResultsAreaWord" class="searchResultsContainer">
                ...
                <div id="srPanes">
                  ...
                  <div id="srPane-2" class="resRefPane" style>
                    ...
                    <div style="height:600px;overflow:auto">
                      ...
                      <ul class="esResultList">
                        ...
                        <!-- HERE ARE THE TARGET ITEMS -->
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests made when a web browser runs the JavaScript code in the page; Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, you can recreate that request in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote-controlling a web browser with Python. It lets a web browser load the page, and then you can access the DOM in Python to get what you want from the rendered page. Other answers like Scraping javascript-generated data using Python and scrape html generated by javascript with python can point you in that direction.
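If you go the browser-automation route, a rough sketch might look like the following (this assumes Selenium with a locally installed Chrome driver; the tooltippedWord class comes from the question, and the fixed sleep is a crude stand-in for a proper wait):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)  # give the JavaScript time to populate the results
soup = BeautifulSoup(driver.page_source, "html.parser")
words = [el.get_text() for el in soup.find_all(class_="tooltippedWord")]
driver.quit()
print(words)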
Exactly what tag/class are you trying to scrape from the webpage? When I copied and ran your code I included this line to check for the class name in the page's HTML, but did not find any:
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
There's less confusion overall when typing it out. As for possible approaches, one is to look at the page in Chrome (or another browser) using the dev tools and search for some non-random class or id tags like esResultListItem.
From there, if you know what tag you are looking for (div, span, etc.), you can include it in the search like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know what tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check by tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples (pictures/tags/etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach to extract data close to what you are looking for:
import requests
from bs4 import BeautifulSoup
post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")
for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you output starting with:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4>  |  המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה)  |  המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6>  |  המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report using Python, Selenium, BeautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome(r'C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
Edit: here is a closer look at the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use the CSS selector method select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the attribute value data-date, use the following code:
lastdatadate=[]
for item in soup.find_all('span', attrs={"id": "LargeHeader_dateText", "data-date": True}):
    processeddate=item['data-date']
    lastdatadate.append(processeddate)
Or css selector.
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both will give the same output; however, the latter executes faster.
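Putting this together with the loop from the question, a possible sketch (it assumes soup has already been built from driver.page_source, as in the question) would be:
lastdatadate = []
lastprocesseddate = []
for a in soup.findAll('div', attrs={'class': 'large-header-welcome'}):
    datetext = a.find('span', attrs={'id': 'LargeHeader_dateText'})
    lastdatadate.append(datetext['data-date'])   # attribute value, e.g. "2/4/2020"
    lastprocesseddate.append(datetext.text)      # visible text, e.g. ", Last Processed 2/5/2020"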

XPath: Select All Text of Descendants where first Div attribute matches condition

Please consider the following code:
from lxml import html
import requests
page = requests.get('https://advisorless.substack.com/?no_cover=true')
tree = html.fromstring(page.content)
Within the HTML, the relevant sections are something like:
<div class="body markup">
<p>123</p>
<a href=''>456</a>
</div>
<div class="body markup">
<p>ABC</p>
<p>DEF</p>
</div>
Attempt 1
tree.xpath('//div[@class="body markup"]/descendant::*/text()')
Produces the following result: ['123', '456', 'ABC', 'DEF']
Attempt 2
tree.xpath('//div[@class="body markup"]/descendant::*/text()')[0]
Produces the following result: ['123']
What I Want to Get: ['123', '456']
I'm not sure if this can be done with a sibling selector instead of descendants.
For Specific URL:
The following XPath, copied from Inspect Element, gives the result I'm looking for, although my code needs something more dynamic; div[3] is the div with class="body markup":
//*[@id="main"]/div[2]/div[2]/div[1]/div/article/div[3]/descendant::*/text()
For more specificity, this also works:
//div[#class="post-list"]/div[1]/div/article[#class="post"]/div[#class="body markup"]/descendant::*/text()
It's that one static div that I don't know how to modify. I'm sure there's a simple piece I'm not putting together.
I'm still not entirely sure what you are after, but let's start with this and let me know how to modify the outcome, if necessary:
import requests
from lxml import html
url = "https://advisorless.substack.com/?no_cover=true"
resp = requests.get(url)
root = html.fromstring(resp.text)
targets = root.xpath("//div[@class='body markup'][./p][./a]")
for target in targets:
    print(target.text_content())
    for link in target.xpath('a'):
        print(link.attrib['href'])
    print('=====')
The output is too long to reproduce here, but see if it fits your desired output.
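If the aim is one list of text nodes per div (so that the first div yields ['123', '456']), a minimal sketch against the sample markup from the question, rather than the live URL, would be:
from lxml import html

snippet = """
<div class="body markup">
  <p>123</p>
  <a href=''>456</a>
</div>
<div class="body markup">
  <p>ABC</p>
  <p>DEF</p>
</div>
"""
tree = html.fromstring(snippet)
# One sub-list of descendant text nodes per matching div, instead of one flat list
per_div = [div.xpath('descendant::*/text()') for div in tree.xpath('//div[@class="body markup"]')]
print(per_div)  # [['123', '456'], ['ABC', 'DEF']]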

Unable to get text using Xpath although using /text() already

I'm trying to scrape data from here using XPath, and although I'm using inspect to copy the path and adding /text() to the end, an empty list is returned instead of ["Class 5"] for the text between the last span tags.
import requests
from lxml import html
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')
print(r1class)
The element that I'm targeting is the Class for race 1 (Class 5), and the structure matches the XPath that I'm using.
The code below should do the job, i.e. it works when used with other sites and a matching XPath expression. The racenet site doesn't deliver valid HTML, which is very probably the reason your code fails. This can be verified by using the W3C online validator: https://validator.w3.org
import lxml.html
html = lxml.html.parse('https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16')
r1class = html.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')[0]
print(r1class)
This should get you started.
import requests
from lxml.etree import HTML
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
races = tree.xpath('//table[@class="tblLatestHorseResults"]')
for race in races:
    rows = race.xpath('.//tr')
    for row in rows:
        # Collect the cleaned text of every cell in the row
        row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td') if i is not None]
Your XPath expression doesn't match anything, because the HTML page you are trying to scrape is seriously broken. FF (or any other web browser) fixes the page on the go, before displaying it. This results in HTML tags being added, which are not present in the original document.
The following code contains an XPath expression, which will most likely point you in the right direction.
import requests
from lxml import html, etree
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
for node in nodes:
    print(etree.tostring(node))
When executed, this prints the following:
$ python test.py
<span class="bold">Class 5</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 3</span> Track:
<span class="bold">Class 2</span> Track:
<span class="bold">Class 3</span> Track:
Tip: whenever you are trying to scrape a web page, and things just don't work as expected, download and save the HTML to a file. In this case, e.g.:
f = open("test.xml", 'wb')
f.write(sample_page.content)
Then have a look at the saved HTML. This gives you an idea of what the DOM will look like.

How to check if website is listed in DMOZ from Python?

I need to check if a website is listed in DMOZ using a Python script. How can I do it? I'm trying to do it like this:
import urllib2
search = "http://www.dmoz.org/search?q="
domain = "example.com"
r = urllib2.urlopen(search+domain).read()
It returns HTML code. I don't understand what I should search for in that HTML code to check if the website is listed in DMOZ. Please help me :)
If you look inside the returned HTML you will see a <!---------- SITES RESULTS ----------> comment followed by a <section class="results sites"> section. Inside this section you will find <div class="site-item">. Several <div>s deeper you can see what you are looking for:
<div class="site-url">
...
</div>
The site itself and its sub-domains are listed there.
If your site is not in the catalog, there will be no <div class="site-item">. Search for that in your Python script.
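A minimal sketch of that check, built on the urllib2 code from the question (Python 2, as in the question; note that DMOZ has since shut down, so the URL may no longer respond):
import urllib2

search = "http://www.dmoz.org/search?q="
domain = "example.com"
html_code = urllib2.urlopen(search + domain).read()

# If the domain is listed, the results page contains at least one site-item div
if 'class="site-item"' in html_code:
    print "%s is listed in DMOZ" % domain
else:
    print "%s is not listed in DMOZ" % domain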
