Within the implementation of a Python Scrapy crawler I would like to add a robust mechanism for monitoring/detecting potential layout changes within a website.
These changes do not necessarily affect existing spider selectors - for example, a site adds a new HTML element to represent the number of visitors an item has received - an element I might now be interested in parsing.
Having said that, detecting selector issues (XPath/CSS) would also be beneficial in cases where elements are removed or relocated.
Please note this is not about changes to a selector's content or a website refresh (If-Modified-Since or Last-Modified), but rather about a modification in the structure/nodes/layout of a site.
How, then, would one implement logic to monitor for such changes?
This is actually an active research topic, as you can see in this paper, but there are of course some implemented tools that you can check out:
https://github.com/matiskay/html-similarity
https://github.com/matiskay/html-cluster
https://github.com/TeamHG-Memex/page-compare
The basis for comparison in the approaches above is the tree edit distance of the HTML layout.
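As a rough illustration of the idea, here is a minimal sketch (assuming the third-party zss and lxml packages are installed) that computes a tree edit distance over the tag structure only, ignoring text and attributes:

from lxml import html
from zss import Node, simple_distance

def build_tree(element):
    # Convert an lxml element into a zss Node tree of tag names only.
    node = Node(element.tag)
    for child in element:
        if isinstance(child.tag, str):  # skip comments and processing instructions
            node.addkid(build_tree(child))
    return node

def layout_distance(old_html, new_html):
    # Tree edit distance between the layouts of two HTML snapshots.
    old_tree = build_tree(html.fromstring(old_html))
    new_tree = build_tree(html.fromstring(new_html))
    return simple_distance(old_tree, new_tree)

old = "<html><body><div><p>price</p></div></body></html>"
new = "<html><body><div><p>price</p><span>visitors</span></div></body></html>"
print(layout_distance(old, new))  # 1: the newly added <span> node

Storing one such snapshot per page type and alerting when the distance crosses a threshold is one way to surface layout changes even when existing selectors still work.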
Related
I'm working on some automated actions for Instagram using Python and Selenium and sometimes my code crashes because of a NoSuchElementException.
For example, when I first wrote a function for unfollowing a user, I used something like:
following_xpath = "//*[#id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
After running a few times, my code crashed because it couldn't find the element so upon inspecting the page I found out that the XPath now is:
following_xpath = "//*[#id="react-root"]/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
There's a small difference: div[1]/div[2]/div changed to div[2]/div/div/div[2]. So I have two questions:
Why does this happen?
Is there a bulletproof method that guarantees I will always be getting the right XPath (or element)?
It's high time we bust the myth that XPath changes.
Locator strategies, e.g. XPath and CSS selectors, are derived by the user, and the more canonically the locators are constructed, the more durable they are.
XML Path Language (XPath)
XPath 3.1 is an expression language that allows the processing of values conforming to the data model defined in XQuery and XPath Data Model (XDM) 3.1. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic values, function items, and sequences. This version of XPath supports JSON as well as XML, adding maps and arrays to the data model and supporting them with new expressions in the language and new functions in XQuery and XPath Functions and Operators 3.1.
Selectors
CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS uses Selectors for binding style properties to elements in the document. These expressions can also be used, for instance, to select a set of elements, or a single element from a set of elements, by evaluating the expression across all the elements in a subtree.
This use case
As per your code trials:
following_xpath = "//*[#id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
and
following_xpath = "//*[#id="react-root"]/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
Here are a couple of takeaways:
The DOM Tree contains React elements. So it is quite clear that the app uses ReactJS. React is a declarative, efficient, and flexible JavaScript library for building user interfaces. It lets you compose complex UIs from small and isolated pieces of code called components.
The XPaths are absolute XPaths.
The XPaths contain indexes.
So, the application is dynamic in nature and elements are liable to be added and moved within the HTML DOM on firing of any DOM events.
Solution
In such cases, when the application is based on any of the following:
JavaScript
Angular
ReactJS
jQuery
AJAX
Vue.js
Ember.js
GWT
The canonical approach is to construct relative and/or dynamic locators inducing WebDriverWait. Some examples:
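The snippets below assume the usual Selenium imports and an already created WebDriver instance (a minimal setup sketch; the choice of Chrome is arbitrary):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The examples below use `browser` or `driver` interchangeably for the WebDriver handle.
browser = driver = webdriver.Chrome()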
To interact with the username field on the Instagram login page:
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))).send_keys("anon")
You can find a detailed discussion in Filling in login forms in Instagram using selenium and webdriver (chrome) python OSX
To locate the first line of the address just below the text FIND US on Facebook:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]")))
You can find a detailed discussion in Decoding Class names on facebook through Selenium
Interacting with GWT elements:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@title='Viewers']//preceding::span[1]//label"))).click()
You can find a detailed discussion in How to click on GWT enabled elements using Selenium and Python
The answer to (1) is simple: the page content has changed.
Firstly, the notion that there is "an XPath" for every element in a document is wrong: there are many (indeed infinitely many) XPath expressions that will select a given element. You've probably generated these XPaths using a tool that tries to give you what it considers the most useful XPath expression, but it's not the only one possible.
The best XPath expression to use is one that isn't going to change when the content of the page changes: but it's very hard for any tool to give you that, because it has no idea what's likely to change in the page content.
Using an @id attribute value (which these paths do) is more likely to be stable than using numeric indexing (which these paths also do), but that's based on guesses about what's likely to change, and those guesses can always be wrong. The only way of writing an XPath expression that continues to do "the right thing" when the page changes is to correctly guess which aspects of the page structure are going to vary and which parts are going to remain stable. So the only "bulletproof" answer to (2) is to understand not just the current page structure, but its invariants over time.
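For example, a locator anchored on what the element is rather than where it sits in the tree tends to survive layout reshuffles. Purely as an illustration (the button text 'Following' is an assumption about the UI, not taken from the actual page):

following_xpath = "//header//button[normalize-space()='Following']"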
I know it's a broad question, but I'm looking for ideas to go about doing this. Not looking for the exact coded answer, but a rough gameplan of how to go about this!
I'm trying to scrape a blog site to check for new blog posts, and if so, to return the URL of that particular blog post.
There are 2 parts to this question, namely
Finding out if the website has been updated
Finding what is the difference (new content)
I'm wondering what approaches I could take to do this. I have been using Selenium for quite a bit, and am aware that with the Selenium driver I could check (1) with driver.page_source.
Is there a better way to do both 1 and 2 together, and if possible even across various different blog sites (I'm wondering whether it is possible to write more general code that applies to various blogs at once, rather than a custom script for each one)?
Bonus: Is there a way to do a "diff" on the before and after of the code to see the difference, and extract necessary information from there?
Thanks so much in advance!
If you're looking for a way to know if pages have been added or deleted, you can either look at the site's sitemap.xml directly, or build yourself a copy of it. If they do not have a sitemap.xml, you can crawl the menu and navigation of the site and build up your own from that. Sitemap files have a 'last modified' entry. If you know the interval you are scraping on, you can calculate rather quickly whether the change occurred within that interval. This is good for site-wide changes.
Alternatively, you can also check the page's response headers to determine the last modified time for the page. Apply the same interval check as with the sitemap and go from there.
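A minimal sketch of the sitemap approach (assuming the requests package and a standard /sitemap.xml with <lastmod> entries; the URL and date below are placeholders):

import datetime
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def recently_changed_urls(sitemap_url, since):
    # Return URLs whose <lastmod> is newer than `since` (a datetime.date).
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc and lastmod and datetime.date.fromisoformat(lastmod[:10]) > since:
            changed.append(loc)
    return changed

print(recently_changed_urls("https://example.com/sitemap.xml", datetime.date(2024, 1, 1)))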
You can always check the Last-Modified value in the website's response headers:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
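A small sketch of that check (assuming the requests package; not every server sends the header, and the URL and date are placeholders):

import requests

response = requests.head("https://example.com/blog/", timeout=10)
print(response.headers.get("Last-Modified"))  # e.g. 'Wed, 01 May 2024 12:00:00 GMT'

# Or make a conditional request: a 304 status means nothing changed since that date.
response = requests.get("https://example.com/blog/",
                        headers={"If-Modified-Since": "Wed, 01 May 2024 12:00:00 GMT"},
                        timeout=10)
print(response.status_code)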
I am using python3 in combination with beautifulsoup.
I want to check if a website is responsive or not. First I thought of checking the meta tags of a website to see if there is something like this in them:
content="width=device-width, initial-scale=1.0"
Accuracy is not that good with this method, but I have not found anything better.
Has anybody an idea?
Basically I want to do the same as Google does here: https://search.google.com/test/mobile-friendly, reduced to the output of whether the website is responsive or not (Y/N).
(Just a suggestion)
I am not an expert on this but my first thought is that you need to render the website and see if it "responds" to different screen sizes. I would normally use something like phantomjs to do this.
Apparently, you can do this in python with selenium (more info at https://stackoverflow.com/a/15699761/3727050). A more comprehensive list of technologies that can be used for this task can be found here. Note that these resources seem a bit old/outdated and some solutions fallback to python subprocess calling phantomjs.
The linked Google test seems to:
Load the page in a small browser and check:
The font size is readable
The distance between clickable elements keeps the page usable
I would however do the following:
Load the page in desktop mode, record each div's style.
Gradually reduce the size of the screen and see what percentage of these change style.
In most cases, going from a large screen down to phone size, you should see 1-3 distinct layouts, which should be identifiable from the percentage of elements changing style.
The above does not guarantee that the page is "mobile-friendly" (i.e. usable on a mobile device), but it shows whether the CSS is responsive.
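A rough sketch of that resize-and-compare idea (assuming Selenium with a local Chrome/chromedriver; the choice of divs, breakpoints, and the example URL are illustrative only):

from selenium import webdriver

def layout_signature(driver):
    # Collect each div's rendered width as a crude layout fingerprint.
    return driver.execute_script(
        "return Array.from(document.querySelectorAll('div'))"
        ".map(function (d) { return d.getBoundingClientRect().width; });"
    )

driver = webdriver.Chrome()
driver.get("https://example.com")

signatures = {}
for width in (1920, 1024, 768, 375):  # desktop -> tablet -> phone
    driver.set_window_size(width, 900)
    signatures[width] = layout_signature(driver)
driver.quit()

# If hardly any divs change width between the extremes, the CSS is likely not responsive.
changed = sum(a != b for a, b in zip(signatures[1920], signatures[375]))
print("%d of %d divs changed width" % (changed, len(signatures[1920])))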
I'm trying to scrape a hidden field on a webpage, with python's scrapy framework:
<input class="currentTime" value="4888599" />
The strange thing is, on about 40% of all pages it cannot find the value of the input field. I tried loading the failing pages in my browser with JavaScript disabled (I thought maybe that was the problem), but the value is still filled in on the pages that are failing. So the value is not added with JavaScript.
Anyone who had this problem before or might have a solution for this? I don't know why it cannot find the value. I'm using the following syntax to scrape:
sel.css('.currentTime::attr(value)').extract()
The class appears only once on a page and I'm searching from the body tag, so it cannot be the path that is wrong, in my opinion. It's only this element that cannot be found most of the time; all other elements are not a problem.
Instead of the CSS attribute selector, you should prefer XPath - it's much more powerful and allows you to do things like traverse the tree backwards (to parents, which you can then descend from again).
Not that you'd need to do this in the given example, but XPath is much more reliable in general.
A fairly generic XPath query to do what you want would be something like this (in case the node has more than one class name):
//input[contains(concat(' ', normalize-space(@class), ' '), ' currentTime ')]/@value
A more targeted example would be:
//input[@class="currentTime"]/@value
I need to implement a few scrapers to crawl some web pages (because the site doesn't have open API), extracting information and save to database. I am currently using beautiful soup to write code like this:
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])
I guess code like this can very easily become invalid when the web page is changed, even slightly.
How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?
In particular, is there any existing 'smart scraper' that can make 'best effort guess' even when the original xpath/css selector is no longer valid?
Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and even if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning etcetera. It's hard to make a scraper that has both trustworthiness and automated flexibility.
Maintainability is somewhat of an art-form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page toward a desired element. If the first stage fails, then it throws some kind of "page structure changed" error.
(retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.
This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
I frequently have used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
When answering these questions, the more accurate you can be the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be. When both finding and retrieving data on a page, how you craft them makes a big difference; and ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false positive element arises, we may desire to avoid extracting from this new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
So why not just use a selector like .content > .deal .price?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
I shouldn't say that it's a "best" practice, but it has worked well.
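A minimal sketch of that two-stage pattern using the parsel selector library (the library underlying Scrapy's selectors); the class names follow the example above and the exception name is purely illustrative:

from parsel import Selector

class PageStructureChanged(Exception):
    pass

def extract_price(page_html):
    sel = Selector(text=page_html)

    # Stage 1 (find): strict structural check; fail loudly if the layout breaks.
    deal = sel.css(".content > .deal")
    if not deal:
        raise PageStructureChanged("'.content > .deal' no longer matches")

    # Stage 2 (retrieve): flexible query relative to the stage-1 result.
    return deal.css(".price::text").get()

page_html = ('<div class="content"><div class="deal">'
             '<span class="tag"><span class="price">19.99</span></span></div></div>')
print(extract_price(page_html))  # 19.99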
Completely unrelated to Python and not auto-flexible, but I think the templates of my Xidel scraper have the best maintainability.
You would write it like:
<div id="detail-main">
<del class="originPrice">
{extract(., "[0-9.]+")}
</del>
</div>
Each element of the template is matched against the elements on the webpage, and if they are the same, the expressions inside {} are evaluated.
Additional elements on the page are ignored, so if you find the right balance of included elements and removed elements, the template will be unaffected by all minor changes.
Major changes, on the other hand, will trigger a matching failure, which is much better than XPath/CSS simply returning an empty set. Then you can change just the affected elements in the template; in the ideal case you could directly apply the diff between the old and changed page to the template. In any case, you do not need to search for which selector is affected or update multiple selectors for a single change, since the template can contain all queries for a single page together.
EDIT: Oops, I now see you're already using CSS selectors. I think they provide the best answer to your question. So no, I don't think there is a better way.
However, sometimes you may find that it's easier to identify the data without the structure. For example, if you want to scrape prices, you can do a regular expression search matching the price (\$\s+[0-9.]+), instead of relying on the structure.
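For instance, a structure-free sketch (the pattern here is loosened slightly to also allow no space after the dollar sign; the sample text is made up):

import re

page_text = "Was $24.99, now $ 19.99 for members"
prices = re.findall(r"\$\s*[0-9]+(?:\.[0-9]{2})?", page_text)
print(prices)  # ['$24.99', '$ 19.99']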
Personally, the out-of-the-box web scraping libraries that I've tried all kind of leave something to be desired (mechanize, Scrapy, and others).
I usually roll my own, using:
urllib2 (standard library),
lxml and
cssselect
cssselect allows you to use CSS selectors (just like jQuery) to find specific divs, tables, etcetera. This proves to be really invaluable.
Example code to fetch the first question from SO homepage:
import urllib2
import urlparse
import cookielib

from lxml import etree
from lxml.cssselect import CSSSelector

post_data = None
url = 'http://www.stackoverflow.com'

# Build an opener that keeps cookies and lets us send browser-like headers.
cookie_jar = cookielib.CookieJar()
http_opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cookie_jar),
    urllib2.HTTPSHandler(debuglevel=0),
)
http_opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]

# Fetch the page and parse it with lxml's HTML parser.
fp = http_opener.open(url, post_data)
parser = etree.HTMLParser()
doc = etree.parse(fp, parser)

# Select the first question's title link with a CSS selector.
elem = CSSSelector('#question-mini-list > div:first-child > div.summary h3 a')(doc)
print elem[0].text
Of course you don't need the cookie jar, nor the user agent to emulate Firefox; however, I find that I regularly need these when scraping sites.