Unable to select the table data in scrapy - python

I am trying to scrape this website for an academic purpose using Scrapy with CSS/XPath selectors.
I need to select the details in the td elements of the table with id DataTables_Table_0. However, I am unable to select even the div element that contains the table, let alone the table data.
The HTML block I want to parse is:
<div id="fund-selector-data">
  <div class=" ">
    <div id="DataTables_Table_0_wrapper" class="dataTables_wrapper no-footer">
      <div class="dataTables_scroll">
        <div class="dataTables_scrollHead">
        </div>
        <div class="dataTables_scrollBody" style="position: relative; overflow: auto; width: 100%;">
          <table class="row-border dataTable table-snapshot no-footer" data-order="[]" cellspacing="0" width="100%"
                 id="DataTables_Table_0" role="grid" style="width: 100%;">
            <thead>
            </thead>
            <tbody>
              <tr role="row" class="odd">
                <td>PDF</td>
                <td class=" text-left"><a href="/funds/38821/aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">ABSL Bal
                    Bhavishya Yojna Dir</a> | <a class="invest-online-blink invest-online " target="_blank"
                    href="/funds/invest-online-tracking/420/" data-amc="aditya-birla-sun-life-mutual-fund"
                    data-fund="aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">Invest Online</a></td>
                <td data-order="" class=" text-left">
                  <div class="raterater-layer text-left test-fund-rating-star "><small>Unrated</small></div>
                </td>
                <td class=" text-left"><a
                    href="/premium/?utm_medium=vro&utm_campaign=premium-unlock&utm_source=fund-selector">
                    <div class="unlock-premium"></div>
                  </a></td>
              </tr>
            </tbody>
My Scrapy CSS selectors are as follows:
# Selecting the table
response.css("#DataTables_Table_0")  # returns an empty list
# Selecting the div by class
response.css(".dataTables_scrollBody")  # returns an empty list
# Selecting the td elements
response.css("#DataTables_Table_0 tbody tr td a::text").getall()  # returns an empty list
I have also tried XPath to select the elements, with the same result. I have found that I cannot select any element below the div with the empty class. I cannot comprehend why it does not work in this case. Am I missing anything? Any help will be appreciated.

The problem
It looks as though the elements you're trying to select are loaded via JavaScript in a separate API call. If you visit the page, the table shows the message:
Please wait while we are fetching data...
The Scrapy docs have a section about this, with their recommendation being to find the source of the dynamically loaded content and simulate these requests from your crawling code.
The solution
The data source can be found by looking at the XHR network tab in Chrome dev tools.
In this case, it looks as though the data source for the table you're trying to parse is
https://www.valueresearchonline.com/funds/selector-data/primary-category/1/equity/?plan-type=direct&tab=snapshot&output=html-data
This seems to be a replica of the original URL, but with selector replaced by selector-data and an output=html-data query parameter appended.
This returns a JSON object with the following format:
{
    "title": ...,
    "tracking_url": ...,
    "tools_title": ...,
    "html_data": ...,
    "recordsTotal": ...
}
It looks as though html_data is the field you want, since it contains the dynamic table HTML you were originally after. You can now simply load this html_data and parse it as before.
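For instance, a minimal sketch of that last step (assuming the fragment in html_data still contains the DataTables_Table_0 table; adjust the selector if it turns out to hold only the rows):
import json
from scrapy.selector import Selector

res = json.loads(response.text)
# Wrap the returned HTML fragment in a Selector and query it as usual
table = Selector(text=res['html_data'])
links = table.css("#DataTables_Table_0 tbody tr td a::text").getall()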
To simulate all of this in your scraping code, simply add a parse_table method to your spider to handle the above JSON response. You might also want to generate the table data source URL dynamically, based on the page you're currently scraping, so it's worth adding a method that edits the original URL as detailed above.
Example code
I'm not sure how you've set up your spider, so I've written a couple of methods that can be easily ported into whatever spider setup you're currently using.
import json

import scrapy
from scrapy.http import Request
from urllib.parse import urlparse, urlencode, parse_qsl


class TableSpider(scrapy.Spider):
    name = 'tablespider'
    start_urls = ['https://www.valueresearchonline.com/funds/selector/primary-category/1/equity/?plan-type=direct&tab=snapshot']

    def _generate_table_endpoint(self, base_url):
        """Dynamically generate the table data endpoint."""
        # Parse the base url
        parsed = urlparse(base_url)
        # Add the output=html-data query param
        current_params = dict(parse_qsl(parsed.query))
        new_params = {'output': 'html-data'}
        merged_params = urlencode({**current_params, **new_params})
        # Update the path to point at the selector data
        data_path = parsed.path.replace('selector', 'selector-data')
        # Rebuild the URL with the new path and query params
        parsed = parsed._replace(path=data_path, query=merged_params)
        return parsed.geturl()

    def parse(self, response):
        # Any pre-request logic goes here
        # ...
        # Request and parse the table data source
        yield Request(
            self._generate_table_endpoint(response.url),
            callback=self.parse_table
        )

    def parse_table(self, response):
        try:
            # Load the json response into a dict
            res = json.loads(response.text)
            # Get the html_data value (containing the dynamic table html)
            table_html = res['html_data']
        except (ValueError, KeyError):
            raise Exception('No table data present.')
        # Your table data extraction goes here...
        # ===========================================================
        yield {'table_data': 'your response data'}

Related

Extract property id from attribute using xpath

I have been trying to extract property id from the following website: https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any
But whichever combination I try to use I can't seem to retrieve it.
Property id is located here:
<div class="corner-ribbon">
<span class="ribbon-green">NEW!</span>
</div>
<a href="Details?id=182519" title="view this property">
<img class="img-responsive img-prop" src="https://kwsadocuments.blob.core.windows.net/devblob/24c21aa4-ae17-41d1-8719-5abf8f24c766.jpg" alt="Living close to Nature">
</a>
And here is what I have tried so far:
response.xpath('//a[@title="view this property"]/@href').getall()
response.xpath('//*[@id="divListingResults"]/div/div/a/@href').getall()
response.xpath('//*[@class="corner-ribbon"]/a/@href').getall()
Any suggestion on what I might be doing wrong?
Thank you in advance!
First you need to understand how this page works. It loads the properties using JavaScript (check the page source in your browser with Ctrl+U), and (as you know) Scrapy can't process JavaScript.
But if you check the page source you'll find that all the information you need is "hidden" inside the <input id="propertyJson" name="ListingResults.JsonResult"> tag. So all you need to do is get that value and process it using the json module:
import json

import scrapy


class PropertySpider(scrapy.Spider):
    name = 'property_spider'
    start_urls = ['https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any']

    def parse(self, response):
        property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
        # with open('Samples/Properties.json', 'w', encoding='utf-8') as f:
        #     f.write(property_json)
        property_data = json.loads(property_json)
        for property in property_data:
            property_id = property['Id']
            property_title = property['Title']
            print(property_id)
            print(property_title)
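As a follow-up sketch, a real spider would yield items instead of printing them, so Scrapy's feed exports can pick them up (the Id and Title keys are the same ones used above); the parse method could end with something like:
    def parse(self, response):
        property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
        for property in json.loads(property_json):
            # One item per property; add more fields as needed
            yield {
                'id': property['Id'],
                'title': property['Title'],
            }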

Replace scrapy response.body with selenium response

I am trying to crawl the following product page from an online shop with Scrapy: https://www.mediamarkt.de/de/product/_lg-65uk6470plc-2391592.html
The properties of the product are listed in a normal HTML table, and some of them are only shown once the "Alle Details einblenden" button has been clicked.
The properties are stored in a JS variable and are preloaded from the beginning. Pressing the button triggers a JS function that adds the remaining properties to the table.
Now I am trying to get the full content of the webpage and then crawl it completely.
Because I need to use Scrapy's SitemapSpider, I decided to use Selenium to get the content of the site, simulate clicking the button, and then replace Scrapy's response.body with the full content. Afterwards, when the data gets parsed, Scrapy should also parse the new properties from the table. But it doesn't work, and I really don't know why. The properties that are shown from the beginning are parsed successfully.
chromeDriver = webdriver.Chrome('C:/***/***/chromedriver.exe')  # only for testing

def parse(self, response):
    chromeDriver.get(response.url)
    moreContentButton = chromeDriver.find_element_by_xpath('//div[@class="mms-product-features__more"]/span[@class="mms-link underline"]')
    chromeDriver.execute_script('arguments[0].click();', moreContentButton)
    newHTMLBody = chromeDriver.page_source.encode('utf-8')
    response._set_body(newHTMLBody)
    scrapyProductLoader = ItemLoader(item=Product(), response=response)
    scrapyProductLoader.add_xpath('propertiesKeys', '//tr[@class="mms-feature-list__row"]/th[@class="mms-feature-list__dt"]')
    scrapyProductLoader.add_xpath('propertiesValues', '//tr[@class="mms-feature-list__row"]/td[@class="mms-feature-list__dd"]')
I tried the response.replace(body=chromeDriver.page_source) method instead of response._set_body(newHTMLBody), but that didn't work either; it changes nothing. I know that response.body contains all the properties of the product (verified by writing response.body to an HTML file), but Scrapy only picks up the properties that were present before the button was clicked (in this example, Betriebssystem: webOS 4.0 (AI ThinQ) is the last entry).
But I need all the properties.
Here is a part of the response.body before the ItemLoader is initialized:
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Betriebssystem</th>
<td class="mms-feature-list__dd">webOS 4.0 (AI ThinQ)</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Prozessor</th>
<td class="mms-feature-list__dd">Quad Core-Prozessor</td></tr><tr class="mms-feature-list__row">
<th scope="row" class="mms-feature-list__dt">Energieeffizienzklasse</th>
<td class="mms-feature-list__dd">A</td></tr>
</tbody></table></div>
<div class="mms-feature-list mms-feature-list--rich">
<h3 class="mms-headline">Bild</h3>
<table class="mms-feature-list__container">
<tbody><tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildschirmauflösung</th>
<td class="mms-feature-list__dd">3.840 x 2.160 Pixel</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildwiederholungsfrequenz</th>
<td class="mms-feature-list__dd">True Motion 100</td></tr>
Thanks for your attention and your help.
You don't need selenium or anything else to get the desired data from the mentioned page.
import json
text_data = response.css('script').re('window.__PRELOADED_STATE__ = (.+);')[0]
# This dict will contain everything you need.
data = json.loads(text_data)
Selenium is a testing tool. Avoid using it for scraping.
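To illustrate, a minimal spider sketch around that snippet (the structure of the preloaded state isn't documented here, so this just logs the top-level keys so you can drill down from there; the spider name is arbitrary):
import json

import scrapy


class ProductSpider(scrapy.Spider):
    name = 'mediamarkt_product'  # hypothetical name
    start_urls = ['https://www.mediamarkt.de/de/product/_lg-65uk6470plc-2391592.html']

    def parse(self, response):
        # Extract the preloaded JSON state from the inline <script> tag
        text_data = response.css('script').re('window.__PRELOADED_STATE__ = (.+);')[0]
        data = json.loads(text_data)
        # Log the top-level keys to find where the product features live
        self.logger.info('top-level keys: %s', list(data))
        yield {'preloaded_state_keys': list(data)}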
You could probably do this:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Any URL HERE", body=BODY_STRING_HERE, encoding='utf-8')
>>> response.xpath('xpath_here').extract()
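Applied to the question's setup, a sketch (assuming the chromeDriver instance and response from the question's code): build a fresh HtmlResponse from the post-click Selenium page source instead of mutating the original response:
from scrapy.http import HtmlResponse

# Wrap the post-click page source in a brand-new response object
new_response = HtmlResponse(
    url=response.url,
    body=chromeDriver.page_source,
    encoding='utf-8',
)
keys = new_response.xpath('//tr[@class="mms-feature-list__row"]/th[@class="mms-feature-list__dt"]/text()').getall()
The ItemLoader can then be built against new_response rather than the patched original.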

Python Splinter Get Value From Table (Following-Sibling)

Given this code ("sleep" instances used to help display what's going on):
from splinter import Browser
import time

with Browser() as browser:
    # Visit URL
    url = "https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx"
    browser.visit(url)
    browser.fill('ctl00$ContentPlaceHolder1$txtCredentialNumber', 'IF0000000262422')
    # Find and click the 'search' button
    button = browser.find_by_name('ctl00$ContentPlaceHolder1$btnSearch')
    # Interact with elements
    button.first.click()
    time.sleep(5)
    # Only click the link next to "Professional Teaching Certificate Renewal"
    certificate_link = browser.find_by_xpath("//td[. = 'Professional Teaching Certificate Renewal']/following-sibling::td/a")
    certificate_link.first.click()
    time.sleep(10)
I am now trying to get the values from the table that appears after this code runs. I am not well versed in XPath, but based on the response to this question, I have tried these, to no avail:
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/a")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[1]")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[2]")
I tried [2] because I notice a colon (:) sibling cell between "Name" and the cell containing the name. I just want the string value of the name itself (and all the other values in the table).
I do notice a different structure in this case (a span inside the td instead of just a td); I also tried td span[. ='Name']..., but no dice:
Updated to show more detail
<tr>
<td>
<span class="MOECSBold">Name</span>
</td>
<td>:</td>
<td>
<span id="ContentPlaceHolder1_lblName" class="MOECSNormal">MICHAEL WILLIAM LANCE </span>
</td>
</tr>
This ended up working:
browser.find_by_xpath("//td[span='Name']/following-sibling::td")[1].value
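Building on that, a small sketch to read any labelled value from the table (assuming every row keeps the label / ":" / value three-cell layout shown above; the helper name is hypothetical):
def field_value(label):
    # following-sibling::td matches the ":" cell first, then the value cell
    cells = browser.find_by_xpath("//td[span='%s']/following-sibling::td" % label)
    return cells[1].value

print(field_value('Name'))  # MICHAEL WILLIAM LANCE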

Selecting specific table cells in Selenium web driver (Python)

I am trying to extract the information from a link on a page that is structured as such:
...
<td align="left" bgcolor="#FFFFFF">$725,000</td>
<td align="left" bgcolor="#FFFFFF"> Available</td>
*<td align="left" bgcolor="#FFFFFF">
<a href="/washington">
Washington Street Studios
<br>1410 Washington Street SW<br>Albany, Oregon, 97321
</a>
</td>*
<td align="center" bgcolor="#FFFFFF">15</td>
<td align="center" bgcolor="#FFFFFF">8.49%</td>
<td align="center" bgcolor="#FFFFFF">$48,333</td>
</tr>
I tried targeting elements with the attribute align="left" and iterating over them, but that didn't work out. If anybody could help me locate elements like <a href="/washington"> (there are multiple tags like these on the same page) with Selenium, I would appreciate it.
I would use lxml instead, if it is just to process HTML...
It would be helpful if you were more specific, but you can try this if you are traversing links in a webpage:
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
list_of_links = [i[2] for i in doc.iterlinks()]
list_of_links will look like ['/en/images/logo_com.gif', 'http://www.brand.com/', '/en/images/logo.gif'].
doc.iterlinks() looks for all links (form, img, and a tags) and yields tuples of (element, attribute, url, position): the Element object, the attribute the link lives in, the URL itself, and a number. The line list_of_links = [i[2] for i in doc.iterlinks()] simply grabs the URL from each tuple and returns the URLs as a separate list.
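For example, a quick sketch that unpacks those tuples directly (the URL is a placeholder):
from lxml.html import parse

doc = parse('http://www.example.com/').getroot()  # placeholder URL
for element, attribute, link, pos in doc.iterlinks():
    # e.g. prints: a href /en/images/logo.gif
    print(element.tag, attribute, link)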
Note that the retrieved URLs are relative, i.e. you will see URLs like
'/en/images/logo_com.gif'
instead of
'http://somedomain.com/en/images/logo_com.gif'
If you want the latter kind of URL, add this line:
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
doc.make_links_absolute() # add this line
list_of_links = [i[2] for i in doc.iterlinks()]
If you are processing the URLs one by one, then simply modify the code to something like:
for i in doc.iterlinks():
    url = i[2]
    # some processing here with url...
Finally, if for some reason you need Selenium to come in to get the webpage content, then simply add the following to the beginning:
from selenium import webdriver
from StringIO import StringIO
browser = webdriver.Firefox()
browser.get(url)
doc = parse(StringIO(browser.page_source)).getroot()
From what you have provided so far, there is a table and the desired links are in a specific column. There are no "data-oriented" attributes to rely on, but using the column index to locate the links looks good enough:
for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")
    print(cells[2].text)  # put the correct index here
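Since the question actually wants the link rather than the cell text, a small follow-up sketch (reusing the placeholder table#myid selector; note that get_attribute("href") returns the href resolved to an absolute URL):
for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")
    link = cells[2].find_element_by_tag_name("a")  # adjust the index as needed
    print(link.get_attribute("href"))  # e.g. the /washington link, made absolute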

Why does this xpath fail using lxml in python?

Here is an example web page I am trying to get data from.
http://www.makospearguns.com/product-p/mcffgb.htm
The XPath was taken from Chrome developer tools, and FirePath in Firefox is also able to find it, but with lxml it just returns an empty list for 'text'.
from lxml import html
import requests

site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

page = requests.get(site_url)
tree = html.fromstring(page.text)
text = tree.xpath(xpath)
Printing out the tree text with
print(tree.text_content().encode('utf-8'))
shows that the data is there, but it seems the XPath isn't finding it. Is there something I am missing? Most other sites I have tried work fine using lxml and an XPath taken from Chrome dev tools, but a few I have found give empty lists.
1. Browsers frequently change the HTML
Browsers quite frequently change the HTML served to it to make it "valid". For example, if you serve a browser this invalid HTML:
<table>
<p>bad paragraph</p>
<tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>
To render it, the browser is helpful and tries to make it valid HTML and may convert this to:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
The above is changed because <p>aragraphs cannot be inside <table>s and <tbody>s are recommended. What changes are applied to the source can vary wildly by browser. Some will put invalid elements before tables, some after, some inside cells, etc...
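You can see this divergence with lxml itself; a quick sketch showing that lxml, unlike the browser, does not insert a tbody when parsing a bare table:
from lxml import html

doc = html.fromstring("<table><tr><td>cell</td></tr></table>")
print(doc.xpath("//tbody"))          # [] -- lxml adds no tbody
print(doc.xpath("//tr/td/text()"))   # ['cell']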
2. Xpaths aren't fixed, they are flexible in pointing to elements.
Using this 'fixed' HTML:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
If we try to target the text of <td> cell, all of the following will give you approximately the right information:
//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()
And the list goes on...
However, in general the browser will give you the most precise (and least flexible) XPath, listing every element from the DOM. In this case:
/table[1]/tbody[1]/tr[1]/td[1]/text()
3. Conclusion: Browser-given XPaths are usually unhelpful
This is why the XPaths produced by developer tools will frequently be wrong when used against the raw HTML.
The solution: always refer to the raw HTML and use a flexible, but precise, XPath.
Examine the actual HTML that holds the price:
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<font class="pricecolor colors_productprice">
<div class="product_productprice">
<b>
<font class="text colors_text">Price:</font>
<span itemprop="price">$149.95</span>
</b>
</div>
</font>
<br/>
<input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
</td>
</tr>
</table>
If you want the price, there is actually only one place to look!
//span[@itemprop="price"]/text()
And this will return:
$149.95
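Dropping that into the requests/lxml code from the question, a minimal sketch:
from lxml import html
import requests

page = requests.get('http://www.makospearguns.com/product-p/mcffgb.htm')
tree = html.fromstring(page.text)
print(tree.xpath('//span[@itemprop="price"]/text()'))  # ['$149.95']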
The xpath is simply wrong
Here is a snippet from the page:
<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
<img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
<tr>
<td colspan="2" class="vCSS_breadcrumb_td"><b>
Home >
You can see that the element with id "v65-product-parent" is of type table and has a subelement tr.
There can be only one element with such an id (otherwise it would be broken markup).
The XPath expects a tbody as a child of the given element (table), and there is none in the whole page.
This can be tested by
>>> "tbody" in page.text
False
How did Chrome come up with that XPath?
If you simply download this page by
$ wget http://www.makospearguns.com/product-p/mcffgb.htm
and review its content, it does not contain a single element named tbody.
But if you use Chrome Developer Tools, you find some.
How does it get there?
This often happens when JavaScript comes into play and generates some page content in the browser. But as LegoStormtroopr noted, this is not our case; this time it is the browser that modifies the document to make it correct.
How to get the content of a page dynamically modified in the browser?
You have to give some sort of browser a chance. E.g. if you use Selenium, you will get it.
byselenium.py
from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source
print "test tbody", "tbody" in html_source

tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print text
which prints
$ python byselenium.py
test tbody True
['$149.95']
Conclusions
Selenium is great when it comes to changes made within the browser. However, it is a bit of a heavyweight tool, and if you can do it a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution working on the plainly fetched web page.
I had a similar issue (Chrome inserting tbody elements when you do Copy as XPath). As others answered, you have to look at the actual page source, though the browser-given XPath is a good place to start. I've found that often, removing tbody tags fixes it, and so I wrote a small Python utility script for testing XPaths:
#!/usr/bin/env python
import sys, requests
from lxml import html

if (len(sys.argv) < 3):
    print 'Usage: ' + sys.argv[0] + ' url xpath'
    sys.exit(1)
else:
    url = sys.argv[1]
    xp = sys.argv[2]
    page = requests.get(url)
    tree = html.fromstring(page.text)
    nodes = tree.xpath(xp)
    if (len(nodes) == 0):
        print 'XPath did not match any nodes'
    else:
        # tree.xpath(xp) produces a list, so always just take the first item
        print (nodes[0]).text_content().encode('ascii', 'ignore')
(that's Python 2.7, in case the non-function "print" didn't give it away)
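For example, saved as xpath_test.py (the name is arbitrary), it can check the flexible price XPath from the earlier answer. Note that it calls text_content() on the first match, so pass an element-returning XPath (e.g. //span[@itemprop="price"]) rather than one ending in text():
$ python xpath_test.py http://www.makospearguns.com/product-p/mcffgb.htm '//span[@itemprop="price"]'
$149.95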
