I'm trying to scrape data from this page using XPath. Although I'm copying the path from the browser's inspector and adding /text() to the end, an empty list is returned instead of ["Class 5"] (the text between the last pair of span tags).
import requests
from lxml import html
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')
print(r1class)
The element that I'm targeting is the Class for race 1 (Class 5), and the structure matches the XPath that I'm using.
The code below should do the job, i.e. it works on other sites with a matching XPath expression. The racenet site doesn't deliver valid HTML, which is very probably the reason your code fails. You can verify this with the W3C online validator: https://validator.w3.org
import lxml.html
html = lxml.html.parse('https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16')
r1class = html.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')[0]
print(r1class)
This should get you started.
import requests
from lxml.etree import HTML
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
races = tree.xpath('//table[@class="tblLatestHorseResults"]')
for race in races:
    rows = race.xpath('.//tr')
    for row in rows:
        row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td') if i is not None]
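As an offline illustration of the string()-plus-replace idea above (the table markup here is a simplified, invented stand-in for racenet's):

```python
from lxml.etree import HTML

# Simplified, hypothetical stand-in for one racenet results table
snippet = """
<div>
  <table class="tblLatestHorseResults">
    <tr><td><span class="bold">Class 5</span>&#160;Track:</td><td>Good</td></tr>
  </table>
</div>
"""

tree = HTML(snippet)
for race in tree.xpath('//table[@class="tblLatestHorseResults"]'):
    for row in race.xpath('.//tr'):
        # string() flattens each cell's text; strip the non-breaking spaces
        cells = [td.xpath('string()').replace(u'\xa0', u'') for td in row.xpath('.//td')]
        print(cells)  # ['Class 5Track:', 'Good']
```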
Your XPath expression doesn't match anything, because the HTML page you are trying to scrape is seriously broken. Firefox (or any other web browser) fixes the page on the fly before displaying it. This results in HTML tags being added that are not present in the original document.
The following code contains an XPath expression, which will most likely point you in the right direction.
import requests
from lxml import html, etree
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
for node in nodes:
    print(etree.tostring(node))
When executed, this prints the following:
$ python test.py
<span class="bold">Class 5</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 3</span> Track:
<span class="bold">Class 2</span> Track:
<span class="bold">Class 3</span> Track:
Tip: whenever you are trying to scrape a web page, and things just don't work as expected, download and save the HTML to a file. In this case, e.g.:
with open("test.xml", "wb") as f:  # "wb" because .content is bytes
    f.write(sample_page.content)
Then have a look at the saved HTML. This gives you an idea of what the DOM will look like.
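A minimal sketch of that save-and-inspect workflow, using inline bytes in place of a real response so it runs offline:

```python
import os
import tempfile
from lxml import html

# Stand-in for sample_page.content (requests returns bytes here)
content = b"<html><body><div id='resultsListContainer'>ok</div></body></html>"

path = os.path.join(tempfile.gettempdir(), "test.xml")
with open(path, "wb") as f:  # binary mode, since .content is bytes
    f.write(content)

# Re-parse the saved copy to see the DOM that lxml actually builds
tree = html.parse(path)
print(html.tostring(tree, pretty_print=True).decode())
```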
I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the HTML I am trying to navigate; I am attempting to retrieve the 11.1K:
<ul class="list-unstyled m-meta-list m-meta-list--default">
  <li class="m-meta-list-item">
    <button class="text-stat disp-inline text-left a-button a-button--inline" data-element-id="btn_donors" type="button" data-analytic-event-listener="true">
      <span class="text-stat-value text-underline">11.1K</span>
      <span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_='m-meta-list-item')
for donor in donors:
    print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar HTML, and that value is retrieved dynamically. I would suggest using Selenium and a CSS class selector:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
Get the id from gofundme.com/f/{THEID} and call the API:
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
then process the data:
for people in apiResponse['references']['donations']:
    print(people['name'])
Use the browser console to find the API host.
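A hedged sketch of that recipe: the /web-gateway/v1/feed/... path and the ['references']['donations'] shape are taken from the answer above, but the host name and helper functions here are assumptions.

```python
def donations_url(campaign_id, limit=20, offset=0):
    # Hypothetical helper: the path follows the answer; the host is assumed
    return ("https://gateway.gofundme.com/web-gateway/v1/feed/"
            f"{campaign_id}/donations?sort=recent&limit={limit}&offset={offset}")

def donor_names(campaign_id):
    import requests  # network call; response shape assumed from the answer
    resp = requests.get(donations_url(campaign_id))
    resp.raise_for_status()
    return [d["name"] for d in resp.json()["references"]["donations"]]

print(donations_url("treatmentforsiyona"))
```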
I have been trying to extract property id from the following website: https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any
But whichever combination I try to use I can't seem to retrieve it.
Property id is located here:
<div class="corner-ribbon">
<span class="ribbon-green">NEW!</span>
</div>
<a href="Details?id=182519" title="view this property">
<img class="img-responsive img-prop" src="https://kwsadocuments.blob.core.windows.net/devblob/24c21aa4-ae17-41d1-8719-5abf8f24c766.jpg" alt="Living close to Nature">
</a>
And here is what I have tried so far:
response.xpath('//a[@title="view this property"]/@href').getall(),
response.xpath('//*[@id="divListingResults"]/div/div/a/@href').getall(),
response.xpath('//*[@class="corner-ribbon"]/a/@href').getall()
Any suggestion on what I might be doing wrong?
Thank you in advance!
First you need to understand how this page works. It loads properties using Javascript (check the page source in your browser with Ctrl+U), and (as you know) Scrapy can't process Javascript.
But if you check the page source you'll find that all the information you need is "hidden" inside the <input id="propertyJson" name="ListingResults.JsonResult" > tag. So all you need to do is get that value and process it with the json module:
import scrapy
import json

class PropertySpider(scrapy.Spider):
    name = 'property_spider'
    start_urls = ['https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any']

    def parse(self, response):
        property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
        # with open('Samples/Properties.json', 'w', encoding='utf-8') as f:
        #     f.write(property_json)
        property_data = json.loads(property_json)
        for property in property_data:
            property_id = property['Id']
            property_title = property['Title']
            print(property_id)
            print(property_title)
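An offline sketch of the same extract-and-decode step: the input tag's id and name follow the answer, but the JSON payload here is invented for illustration.

```python
import json
from lxml import html

# Hypothetical stand-in for the page source: the listing data sits
# in a hidden input, as described in the answer
page = '''<input id="propertyJson" name="ListingResults.JsonResult"
                 value='[{"Id": 182519, "Title": "Living close to Nature"}]'>'''

tree = html.fromstring(page)
property_json = tree.xpath('//input[@id="propertyJson"]/@value')[0]
property_data = json.loads(property_json)
for prop in property_data:
    print(prop["Id"], prop["Title"])  # 182519 Living close to Nature
```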
I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome(r'C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate = a.find(?????)
    processeddate = a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
edit: here is a closer look of the HTML:
<div class="large-header-welcome">
  <div class="row">
    <div class="col-sm-6">
      <h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
      <p class="">
        <b>Site:</b> <span id="LargeHeader_Name">redacted</span>
        <br />
        <span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
      </p>
    </div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id': 'LargeHeader_dateText'}):
    processeddate = item.text
Or you can use a CSS selector with select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate = item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate = []
for item in soup.find_all('span', attrs={"id": "LargeHeader_dateText", "data-date": True}):
    processeddate = item['data-date']
    lastdatadate.append(processeddate)
Or with a CSS selector:
lastdatadate = []
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate = item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both will give the same output; however, the latter executes faster.
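A quick offline check of the attribute-extraction approach, using the span from the question (trimmed to the relevant tag):

```python
from bs4 import BeautifulSoup

# The relevant span from the question's HTML
snippet = ('<span id="LargeHeader_dateText" data-date="2/4/2020" '
           'data-delay="1" data-step="3" data-error="False">'
           ', Last Processed 2/5/2020</span>')

soup = BeautifulSoup(snippet, 'html.parser')
lastdatadate = []
# Match only spans that actually carry a data-date attribute
for item in soup.find_all('span', attrs={'id': 'LargeHeader_dateText', 'data-date': True}):
    lastdatadate.append(item['data-date'])
print(lastdatadate)  # ['2/4/2020']
```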
Please consider the following code:
from lxml import html
import requests
page = requests.get('https://advisorless.substack.com/?no_cover=true')
tree = html.fromstring(page.content)
Within the HTML, the relevant sections are something like:
<div class="body markup">
<p>123</p>
<a href=''>456</a>
</div>
<div class="body markup">
<p>ABC</p>
<p>DEF</p>
</div>
Attempt 1
tree.xpath('//div[@class="body markup"]/descendant::*/text()')
Produces the following result: ['123', '456', 'ABC', 'DEF']
Attempt 2
tree.xpath('//div[@class="body markup"]/descendant::*/text()')[0]
Produces the following result: '123' (indexing with [0] returns the first string, not a list)
What I want to get: ['123', '456']
I'm not sure if this can be done with a sibling selector instead of descendants.
For Specific URL:
The following code from Inspect Element is the result I'm looking for; although my code needs something more dynamic. Where div[3] is the div with class="body markup":
//*[@id="main"]/div[2]/div[2]/div[1]/div/article/div[3]/descendant::*/text()
For more specificity, this also works:
//div[@class="post-list"]/div[1]/div/article[@class="post"]/div[@class="body markup"]/descendant::*/text()
It's that one static div that I don't know how to modify. I'm sure there's a simple piece I'm not putting together.
I'm still not entirely sure what you are after, but let's start with this and let me know how to modify the outcome, if necessary:
import requests
from lxml import html
url = "https://advisorless.substack.com/?no_cover=true"
resp = requests.get(url)
root = html.fromstring(resp.text)
targets = root.xpath("//div[@class='body markup'][./p][./a]")
for target in targets:
    print(target.text_content())
    for link in target.xpath('a'):
        print(link.attrib['href'])
    print('=====')
The output is too long to reproduce here, but see if it fits your desired output.
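If the goal is the text nodes grouped per div (e.g. ['123', '456'] for the first one), here is a sketch using the sample markup from the question:

```python
from lxml import html

# The sample markup from the question
snippet = '''
<div class="body markup">
  <p>123</p>
  <a href="">456</a>
</div>
<div class="body markup">
  <p>ABC</p>
  <p>DEF</p>
</div>
'''

tree = html.fromstring(snippet)
divs = tree.xpath('//div[@class="body markup"]')
# Collect text nodes per div instead of flattening across all divs
per_div = [d.xpath('descendant::*/text()') for d in divs]
print(per_div[0])  # ['123', '456']
```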
I am trying to scrape a dummy site and get the parent tag of the one I am searching for. Here's the structure of the code I am searching for:
<div id='veg1'>
<div class='veg-icon icon'></div>
</div>
<div id='veg2'>
</div>
Here's my Python script:
from lxml import html
import requests
req = requests.get('https://mysite.com')
vegTree = html.fromstring(req.text)
veg = vegTree.xpath('//div[div[@class="veg-icon vegIco"]]/id')
When veg is printed I get an empty list, but I am hoping to get veg1. As I am not getting an error, I am not sure what has gone wrong. I followed the syntax from a previous question: lxml: get element with a particular child element?.
A few things are wrong in your XPath:
you are checking for the class veg-icon vegIco, while in the HTML the child div has veg-icon icon
attributes are prefixed with @: /@id instead of /id
The fixed version:
//div[div[@class="veg-icon icon"]]/@id
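A quick check of the corrected expression against the markup from the question:

```python
from lxml import html

# The structure from the question
snippet = '''
<div id="veg1">
  <div class="veg-icon icon"></div>
</div>
<div id="veg2"></div>
'''

tree = html.fromstring(snippet)
# Select divs that have a child div with the right class, then take their id
ids = tree.xpath('//div[div[@class="veg-icon icon"]]/@id')
print(ids)  # ['veg1']
```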