I have been trying to extract property id from the following website: https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any
But whichever combination I try to use I can't seem to retrieve it.
Property id is located here:
<div class="corner-ribbon">
<span class="ribbon-green">NEW!</span>
</div>
<a href="Details?id=182519" title="view this property">
<img class="img-responsive img-prop" src="https://kwsadocuments.blob.core.windows.net/devblob/24c21aa4-ae17-41d1-8719-5abf8f24c766.jpg" alt="Living close to Nature">
</a>
And here is what I have tried so far:
response.xpath('//a[@title="view this property"]/@href').getall(),
response.xpath('//*[@id="divListingResults"]/div/div/a/@href').getall(),
response.xpath('//*[@class="corner-ribbon"]/a/@href').getall()
Any suggestion on what I might be doing wrong?
Thank you in advance!
First you need to understand how this page works. It loads its properties using JavaScript (check the page source in your browser with Ctrl+U), and (as you know) Scrapy can't process JavaScript.
But if you check the page source you'll find that all the information you need is "hidden" inside the <input id="propertyJson" name="ListingResults.JsonResult"> tag. So all you need to do is get that value and process it with the json module:
import scrapy
import json

class PropertySpider(scrapy.Spider):
    name = 'property_spider'
    start_urls = ['https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any']

    def parse(self, response):
        # The listing data is embedded as JSON in a hidden <input> tag
        property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
        # with open('Samples/Properties.json', 'w', encoding='utf-8') as f:
        #     f.write(property_json)
        property_data = json.loads(property_json)
        for property in property_data:
            property_id = property['Id']
            property_title = property['Title']
            print(property_id)
            print(property_title)
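In a real spider you would normally yield items instead of printing them, so Scrapy can export them (e.g. with -o items.json). A minimal sketch of the same parse method, assuming the same Id and Title keys:

def parse(self, response):
    property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
    # Yield one dict per listing so Scrapy's feed exports can collect them
    for prop in json.loads(property_json):
        yield {'id': prop['Id'], 'title': prop['Title']}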
I'm attempting to extract information from this website. No matter how hard I try, I can't get the text in the three fields marked in the image (the green, blue, and red rectangles).
Using the following function, I thought I would manage to get all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests
def get_text_from_maagarim_page(url: str):
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "html.parser")
res = soup.find_all(class_="tooltippedWord")
text = [el.getText() for el in res]
return text
url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url)) # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is in the following structure location:
<form name="aspnetForm" ...>
...
<div id="wrapper">
...
<div class="content">
...
<div class="mainContentArea">
...
<div id="mainSearchPannel" class="mainSearchContent">
...
<div class="searchPanes">
...
<div class="wordsSearchPane" style="display: block;">
...
<div id="searchResultsAreaWord"
class="searchResultsContainer">
...
<div id="srPanes">
...
<div id="srPane-2" class="resRefPane"
style>
...
<div style="height:600px;overflow:auto">
...
<ul class="esResultList">
...
# HERE ARE THE TARGET ITEMS
The relevant items look like this:
And the relevant data is in <td id ... >
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests made when a web browser runs the JavaScript code present in said page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, then you can recreate that request in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that actually involve remote controlling a web browser with python. It lets a web browser load the page and then you can access the DOM in Python to get what you want from the rendered page. Other answers like Scraping javascript-generated data using Python and scrape html generated by javascript with python can point you in that direction.
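For the browser-automation route, here is a minimal sketch assuming Selenium with a Chrome driver is available; the wait simply gives the page's JavaScript time to render the target elements:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1")
# Wait until the JavaScript has rendered at least one target element
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "tooltippedWord"))
)
# Hand the rendered DOM to Beautiful Soup and search it as before
soup = BeautifulSoup(driver.page_source, "html.parser")
print([el.getText() for el in soup.find_all(class_="tooltippedWord")])
driver.quit()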
Exactly which tag/class are you trying to scrape from the webpage? When I copied and ran your code I included this line to check for the class name in the page's HTML, but did not find any:
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
There is less confusion overall when typing it out. As for possible approaches, you could look at the page in Chrome (or another browser) using the dev tools to search for non-random class or id tags like esResultListItem.
From there, if you know which tag you are looking for (div, span, etc.), you can include it in the search like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know which tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check by tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples (pictures/tags/etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach for extracting data close to what you are looking for:
import requests
from bs4 import BeautifulSoup

post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")

for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4> | המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה) | המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6> | המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.
I am trying to scrape this website for academic purposes in Scrapy, using CSS/XPath selectors.
I need to select the details in the td elements in the table with id DataTables_Table_0. However, I am unable to even select the div element which contains the table, let alone the table data.
HTML block that I want to parse is
# please ignore wrong indentation
<div id="fund-selector-data">
<div class=" ">
<div id="DataTables_Table_0_wrapper" class="dataTables_wrapper no-footer">
<div class="dataTables_scroll">
<div class="dataTables_scrollHead"
</div>
<div class="dataTables_scrollBody" style="position: relative; overflow: auto; width: 100%;">
<table class="row-border dataTable table-snapshot no-footer" data-order="[]" cellspacing="0" width="100%"
id="DataTables_Table_0" role="grid" style="width: 100%;">
<thead>
</thead>
<tbody>
<tr role="row" class="odd">
<td>PDF</td>
<td class=" text-left"><a href="/funds/38821/aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">ABSL Bal
Bhavishya Yojna Dir</a> | <a class="invest-online-blink invest-online " target="_blank"
href="/funds/invest-online-tracking/420/" data-amc="aditya-birla-sun-life-mutual-fund"
data-fund="aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">Invest Online</a></td>
<td data-order="" class=" text-left">
<div class="raterater-layer text-left test-fund-rating-star "><small>Unrated</small></div>
</td>
<td class=" text-left"><a
href="/premium/?utm_medium=vro&utm_campaign=premium-unlock&utm_source=fund-selector">
<div class="unlock-premium"></div>
</a></td>
</tbody>
The Scrapy CSS selectors are as follows:
# Selecting Table (selector)
response.css("#DataTables_Table_0") # returns blank list
# Selecting div class (selector)
response.css(".dataTables_scrollBody") # returns blank list
# Selecting td element
response.css("#DataTables_Table_0 tbody tr td a::text").getall() # returns blank list
I have also tried XPath to select the elements but have gotten the same result. I have found that I am unable to select any element below the div with the empty class. I am unable to comprehend why it will not work in this case. Am I missing anything? Any help will be appreciated.
The problem
It looks as though the elements you're trying to select are loaded via JavaScript in a separate API call. If you visit the page, the table shows the message:
Please wait while we are fetching data...
The Scrapy docs have a section about this, with their recommendation being to find the source of the dynamically loaded content and simulate these requests from your crawling code.
The solution
The data source can be found by looking at the XHR network tab in Chrome dev tools.
In this case, it looks as though the data source for the table you're trying to parse is
https://www.valueresearchonline.com/funds/selector-data/primary-category/1/equity/?plan-type=direct&tab=snapshot&output=html-data
This seems to be a replica of the original URL, but with selector replaced with selector-data and a output=html-data query parameter on the end.
This returns a JSON object with the following format:
{
title: ...,
tracking_url: ...,
tools_title: ...,
html_data: ...,
recordsTotal: ...
}
It looks as though html_data is the field you want, since that contains the dynamic table html you originally wanted. You can now simply load this html_data and parse it as before.
In order to simulate all of this in your scraping code, simply add a parse_table method to your spider to handle the above JSON response. You might also want to dynamically generate the table data source URL based on the page you're currently scraping, so it's worth adding a method that edits the original URL as detailed above.
Example code
I'm not sure how you've set up your spider, so I've written a couple of methods that can be easily ported into whatever spider setup you're currently using.
import json

import scrapy
from scrapy.http import Request
from urllib.parse import urlparse, urlencode, parse_qsl

class TableSpider(scrapy.Spider):
    name = 'tablespider'
    start_urls = ['https://www.valueresearchonline.com/funds/selector/primary-category/1/equity/?plan-type=direct&tab=snapshot']

    def _generate_table_endpoint(self, base_url):
        """Dynamically generate the table data endpoint."""
        # Parse the base url
        parsed = urlparse(base_url)
        # Add output=html-data query param
        current_params = dict(parse_qsl(parsed.query))
        new_params = {'output': 'html-data'}
        merged_params = urlencode({**current_params, **new_params})
        # Update path to get selector data
        data_path = parsed.path.replace('selector', 'selector-data')
        # Update the URL with the new path and query params
        parsed = parsed._replace(path=data_path, query=merged_params)
        return parsed.geturl()

    def parse(self, response):
        # Any pre-request logic goes here
        # ...
        # Request and parse the table data source
        yield Request(
            self._generate_table_endpoint(response.url),
            callback=self.parse_table
        )

    def parse_table(self, response):
        try:
            # Load the json response into a dict
            res = json.loads(response.text)
            # Get the html_data value (containing the dynamic table html)
            table_html = res['html_data']
        except (json.JSONDecodeError, KeyError):
            raise Exception('No table data present.')
        # Your table data extraction goes here...
        # ===========================================================
        yield {'table_data': 'your response data'}
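To fill in the extraction placeholder, one option is to wrap html_data in a Scrapy Selector and reuse the CSS selectors from the question. A sketch of a replacement parse_table body (the field name and selector are illustrative):

from scrapy.selector import Selector

def parse_table(self, response):
    res = json.loads(response.text)
    # html_data is a plain HTML fragment, so it can be parsed like any page
    selector = Selector(text=res['html_data'])
    for row in selector.css('#DataTables_Table_0 tbody tr'):
        # Adjust the field names and selectors to the columns you need
        yield {'fund': row.css('td a::text').get()}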
I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the HTML I am trying to navigate; I am attempting to retrieve 11.1K:
<ul class="list-unstyled m-meta-list m-meta-list--default">
  <li class="m-meta-list-item">
    <button class="text-stat disp-inline text-left a-button a-button--inline" data-element-id="btn_donors" type="button" data-analytic-event-listener="true">
      <span class="text-stat-value text-underline">11.1K</span>
      <span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_='m-meta-list-item')
for donor in donors:
print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar HTML, and that value is dynamically retrieved. I would suggest using Selenium and a CSS class selector:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
Get the id from the URL (gofundme.com/f/{THEID}) and call the API:
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
Then process the data:
for people in apiResponse['references']['donations']:
    print(people['name'])
Use the browser console to find the API host.
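Putting that together, a minimal sketch; the gateway host below is an assumption read from the browser's network tab, so verify the exact endpoint there first:

import requests

campaign_id = 'THEID'  # the {THEID} part of gofundme.com/f/{THEID}
# Host is an assumption; confirm it in your browser's network tab
url = f'https://gateway.gofundme.com/web-gateway/v1/feed/{campaign_id}/donations'
api_response = requests.get(url, params={'sort': 'recent', 'limit': 20, 'offset': 20}).json()

for person in api_response['references']['donations']:
    print(person['name'])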
I'm trying to extract the text from the element with HTML id="itemSummaryPrice", but I can't figure it out.
html = """
<div id="itemSummaryContainer" class="content">
<div id="itemSummaryMainWrapper">
<div id="itemSummaryImage">
<img src="https://img.rl.insider.gg/itemPics/large/endo.fgreen.891c.jpg" alt="Forest Green Endo">
</div>
<h2 id="itemSummaryTitle">Item Report</h2>
<h2 id="itemSummaryDivider"> | </h2>
<h2 id="itemSummaryDate">Friday, January 15, 2021, 8:38 AM EST</h2>
<div id="itemSummaryBlankSpace"></div>
<h1 id="itemSummaryName">
<span id="itemNameSpan" style="color: rgb(88, 181, 73);"><span>Forest Green</span> <span>Endo</span></span>
</h1>
**<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>**
</div>
</div>
"""
my code:
price_checker_site = requests.get(price_checker_url + match2)
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item)
returns with:
<h1 id="itemSummaryPrice"></h1>
What I'm trying to extract:
<h1 id="itemSummaryPrice">200 - 300</h1>
OR
<h1 id="itemSummaryPrice" style="color: rgb(88, 181, 73);">200 - 300</h1>
OR
200 - 300
Since I can't post comments yet, I'm answering instead: shouldn't you call .text on price_check_item?
The Python code would then look like this:
price_checker_site = requests.get(price_checker_url + match2)
# Pass the response body (not the Response object) to BeautifulSoup
price_checker_site_soup = BeautifulSoup(price_checker_site.text, 'html.parser')
price_check_item = price_checker_site_soup.find('h1', {'id': 'itemSummaryPrice'})
print(price_check_item.text)  # also possible: print(price_check_item.text.strip())
I think this is the correct answer, but unfortunately I'm not able to test it right now. I will check my code for you tonight.
As discussed in the comments, the content you seek is loaded dynamically using JavaScript. Therefore, you must either use a library like Selenium to dynamically run the JS, or find out where/how the data is loaded and replicate that.
Method 1: Use Selenium
from selenium import webdriver
url = 'https://rl.insider.gg/en/psn/octane/grey'
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
price = driver.find_element_by_id('itemSummaryPrice')
print(price.text)
In this case it's easy: you just make the request and use find_element_by_id to get the data you want.
Method 2: Trace & Replicate
If you look at your browser's debugger, you can find where/how the itemSummaryPrice is set.
In particular, we find that it's set using $('#itemSummaryPrice').text(itemData.currentPriceRange) in https://rl.insider.gg/js/itemDetails.js.
The next step is to find out where itemData comes from. It turns out, this is not from some other file or API call. Instead, it appears to be hard-coded in the HTML source itself (presumably loaded server-side).
If you inspect the source, you'll find the itemData is just a JSON object defined on one line within a script tag on the page itself.
There are two different approaches you can use here.
Use Selenium's execute_script to extract the data. This gives you the JSON object in a ready-to-use format. You can then just index it to get the currentPriceRange.
from selenium import webdriver
driver = webdriver.Firefox(executable_path='YOUR PATH') # or Chrome
driver.get(url)
itemData = driver.execute_script('return itemData')
print(itemData['currentPriceRange'])
Method 2.1: Alternative to Selenium
Alternatively, you can extract this in Python using traditional methods, convert it to a usable Python object with json.loads, and then index the object to extract the currentPriceRange, which gives you the desired output.
import re
import requests
import json
# Download & convert the response content to a list
url = 'https://rl.insider.gg/en/psn/octane/grey'
site = str(requests.get(url).content).split('\\n')
# Extract the line containing 'var itemData'
itemData = [s for s in site if re.match(r'^\s*var itemData', s)][0].strip()
# Remove 'var itemData' and ';' from that line
# This leaves valid JSON which can be converted from a string using json.loads
itemData = json.loads(re.sub(r'var itemData = |;', '', itemData))
# Index the data to extract the 'currentPriceRange'
print(itemData['currentPriceRange'])
This approach doesn't require Selenium to run the JavaScript and also doesn't require BeautifulSoup to parse the HTML. It does rely on the itemData being initialized in a certain way. Should the developers of that site decide to change the way this is done, you'll have to adapt it slightly in response.
Which method should I use?
If all you really want is the price range and nothing else, then use the first method. If you're interested in other data as well, you'd be better off extracting the full itemData JSON from the source and using that.
One could argue Selenium is more reliable than manually parsing the HTML, but in this case you're probably fine. In both cases, you assume there is some itemData defined somewhere. If the format changes slightly, the parsing may break. The other disadvantage is that part of the data could rely on JS function calls, which Selenium would execute but manual parsing couldn't account for. (This isn't the case here, but it could change.)
I'm trying to scrape data from here using XPath, and although I'm using inspect to copy the path and adding /text() to the end, an empty list is being returned instead of ["Class 5"] for the text between the last span tags.
import requests
from lxml import html
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')
print(r1class)
The element that I'm targeting is the Class for race 1 (Class 5), and the structure matches the XPath that I'm using.
The code below should do the job, i.e. it works when used on other sites with a matching XPath expression. The racenet site doesn't deliver valid HTML, which is very probably the reason your code fails. This can be verified with the W3C online validator: https://validator.w3.org
import lxml.html
html = lxml.html.parse('https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16')
r1class = html.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')[0]
print(r1class)
This should get you started.
import requests
from lxml.etree import HTML
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
races = tree.xpath('//table[#class="tblLatestHorseResults"]')
for race in races:
rows = race.xpath('.//tr')
for row in rows:
row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td') if i is not None]
Your XPath expression doesn't match anything because the HTML page you are trying to scrape is seriously broken. Firefox (or any other web browser) fixes the page on the fly before displaying it, which results in HTML tags being added that are not present in the original document.
The following code contains an XPath expression, which will most likely point you in the right direction.
import requests
from lxml import html, etree
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
for node in nodes:
    print(etree.tostring(node, encoding='unicode'))
When executed, this prints the following:
$ python test.py
<span class="bold">Class 5</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 3</span> Track:
<span class="bold">Class 2</span> Track:
<span class="bold">Class 3</span> Track:
Tip: whenever you are trying to scrape a web page, and things just don't work as expected, download and save the HTML to a file. In this case, e.g.:
with open("test.xml", 'wb') as f:
    f.write(sample_page.content)
Then have a look at the saved HTML. This gives you an idea of what the DOM looks like.