I have been using Python Selenium for web automation testing. The key part of automation is finding the right element for a user-visible object in an HTML page. The following APIs work most of the time, but not all the time:
find_element_by_xxx, where xxx can be id, name, xpath, tag_name, etc.
When the HTML page is too complicated, I would like to search the DOM tree myself. I wonder if it's possible to ask the Selenium server to serialize the entire DOM (along with the element ids that can be used to perform actions through the WebDriver server). The client side (the Python script) could then run its own search algorithm to find the right element.
Note that Python Selenium can get the entire HTML page via
drv.page_source
However, parsing this doesn't give the internal element ids as the Selenium server sees them, so it isn't useful here.
EDIT1:
To paraphrase and make it clearer (thanks @alecxe): what's needed here is a serialized representation of all the DOM elements (with their DOM structure preserved) from the Selenium server; this serialized representation can then be sent to the client side (a Python Selenium test app), which can do its own search.
The Problem
Ok, so there may be cases where you need to perform some substantial processing of a page on the client (Python) side rather than on the server (browser) side. For instance, if you have some sort of machine learning system already written in Python and it needs to analyze the whole page before performing actions on it, then although it is possible to do that with a bunch of find_element calls, each call is a round-trip between the client and the server, which gets very expensive. And rewriting the system to work in the browser may be too costly.
Why Selenium's Identifiers won't do it
However, I do not see an efficient way to get a serialization of the DOM together with Selenium's own identifiers. Selenium creates these identifiers on an as-needed basis, when you call find_element or when DOM nodes are returned from an execute_script call (or passed to the callback that execute_async_script gives to the script). But if you call find_element to get identifiers for each element, then you are back to square one. I could imagine decorating the DOM in the browser with the required information but there is no public API to request some sort of pre-assignment of WebElement ids. As a matter of fact, these identifiers are designed to be opaque so even if a solution managed somehow to get the required information, I'd be concerned about cross-browser viability and ongoing support.
A Solution
There is however a way to get an addressing system that works on both sides: XPath. The idea is to parse the DOM serialization into a tree on the client side, get the XPath of the nodes you are interested in, and use this to get the corresponding WebElement. So if you'd otherwise have to perform dozens of client-server round-trips to determine which single element you need to click, you can reduce this to an initial query of the page source plus a single find_element call with the XPath you need.
Here is a super simple proof of concept. It fetches the main input field of the Google front page.
from StringIO import StringIO
from selenium import webdriver
import lxml.etree
#
# Make sure that your chromedriver is in your PATH, and use the following line...
#
driver = webdriver.Chrome()
#
# ... or, you can put the path inside the call like this:
# driver = webdriver.Chrome("/path/to/chromedriver")
#
parser = lxml.etree.HTMLParser()
driver.get("http://google.com")
# We get this element only for the sake of illustration, for the tests later.
input_from_find = driver.find_element_by_id("gbqfq")
input_from_find.send_keys("foo")
html = driver.execute_script("return document.documentElement.outerHTML")
tree = lxml.etree.parse(StringIO(html), parser)
# Find our element in the tree.
field = tree.find("//*[@id='gbqfq']")
# Get the XPath that will uniquely select it.
path = tree.getpath(field)
# Use the XPath to get the element from the browser.
input_from_xpath = driver.find_element_by_xpath(path)
print "Equal?", input_from_xpath == input_from_find
# In JavaScript we would not call ``getAttribute`` but Selenium treats
# a query on the ``value`` attribute as special, so this works.
print "Value:", input_from_xpath.get_attribute("value")
driver.quit()
Notes:
The code above does not use driver.page_source because Selenium's documentation states that there is no guarantee as to the freshness of what it returns. It could be the state of the current DOM or the state of the DOM when the page was first loaded.
This solution suffers from the exact same problems that find_element suffers from regarding dynamic contents. If the DOM changes while the analysis is occurring, then you are working on a stale representation of the DOM.
If you have to generate JavaScript events while performing the analysis, and these events change the DOM, then you'd need to fetch the DOM again. (This is similar to the previous point, but a solution that uses find_element calls could conceivably avoid the problem described here by ordering the sequence of calls carefully.)
lxml's tree could possibly differ structurally from the DOM tree in such a way that the XPath obtained from lxml does not address the corresponding element in the DOM. What lxml processes is the cleaned-up, serialized view that the browser has of the HTML passed to it. Therefore, so long as the code is written to prevent the problems mentioned in points 2 and 3, I do not see this as a likely scenario, but it is not impossible.
Try:
find_elements_by_xpath("//*")
That should match all elements in the document.
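If you just want to walk everything, a minimal sketch is below; note that every property access on a WebElement is itself another client-server round-trip, so this gets slow on big pages:
elements = driver.find_elements_by_xpath("//*")
for el in elements:
    # Each attribute access below is its own client-server round-trip,
    # which is exactly the cost the other answers try to avoid.
    print(el.tag_name, el.get_attribute("id"))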
UPDATE (to match question refinements):
Use JavaScript and return the DOM as a string:
execute_script("return document.documentElement.outerHTML")
See my other answer for the issues regarding any attempts at getting Selenium's identifiers.
Again, the problem is to reduce a bunch of find_element calls so as to avoid the round-trips associated with them.
A different method from my other answer is to use execute_script to perform the search in the browser and then return all the elements needed. For instance, this code reduces what would otherwise be three round-trips to a single one:
el, parent, text = driver.execute_script("""
var el = document.querySelector(arguments[0]);
return [el, el.parentNode, el.textContent];
""", selector)
This returns an element, the element's parent and the element's textual contents on the basis of whatever CSS selector I wish to pass. In a case where the page has jQuery loaded, I could use jQuery to perform the search. And the logic can get as complicated as needed.
This method takes care of the vast majority of cases where reducing round-trips is desirable, but it does not cover a scenario like the one I illustrated in my other answer.
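As a rough sketch of the idea, here is one way to pull every candidate element plus its text in a single round-trip and then choose on the Python side; the button.action selector and the "Save" label are made-up placeholders, not anything from the question:
candidates = driver.execute_script("""
    var nodes = document.querySelectorAll(arguments[0]);
    return Array.prototype.map.call(nodes, function (n) {
        return [n, n.textContent];
    });
""", "button.action")  # hypothetical selector
# Choose on the Python side, then act on the chosen WebElement.
target = next(el for el, text in candidates if "Save" in text)
target.click()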
You can try to use the page object pattern. That sounds closer to what you are looking for in this case. You might not convert everything to it, but at least for this part you might want to consider it.
http://selenium-python.readthedocs.org/en/latest/test-design.html?highlight=page%20object
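If you go down that road, a minimal sketch of a page object might look like this; the locator is a placeholder, not taken from any real page:
class SearchPage(object):
    """Keeps locators and actions in one place so tests don't repeat raw selectors."""

    SEARCH_BOX = ("css selector", "input[name='q']")  # hypothetical locator

    def __init__(self, driver):
        self.driver = driver

    def search(self, text):
        box = self.driver.find_element(*self.SEARCH_BOX)
        box.send_keys(text)
        box.submit()

# Usage (hypothetical): SearchPage(driver).search("page object pattern")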
You can also loop through all the elements of the page and save them off one at a time, but there should be some library that can do that. I know for .NET there is the Html Agility Pack. I'm not sure about Python.
Update
I found this...perhaps it will help you.
Html Agility Pack for python
Two ways I know of are:
get_source = driver.page_source
Secondly, using JavaScript:
pageSource = driver.execute_script("return document.documentElement.outerHTML;")
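Note that the two are not guaranteed to agree: as mentioned in an earlier answer, no freshness promise is made for page_source, while the JavaScript call reflects the live DOM at the moment it runs. A quick sketch of comparing them:
src_attr = driver.page_source
src_live = driver.execute_script("return document.documentElement.outerHTML;")
# On dynamic pages these can differ: no freshness guarantee is made for
# page_source, while the JavaScript call reflects the DOM as it is right now.
print(len(src_attr), len(src_live))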
Actually you can do this quite easily. Write the output to a stream, e.g. var w = window.open... and then document.write...
Recursively iterate through the document object, returning JSON.stringify of each object. I suggest you throw in typeof as well.
function recurse(obj) {
    var s = "";
    for (var i in obj) {
        // Accumulate every property instead of returning on the first one;
        // a full version would also recurse into child objects.
        s += typeof(obj[i]) + ":" + i + ":" + JSON.stringify(obj[i]) + "\n";
    }
    return s;
}
I'd suggest adding some sort of filtering to remove properties that you don't want to see. Also, I doubt a fully recursive version would run as-is, since the document object is full of circular references that browsers (and JSON.stringify) will refuse to walk.
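If you want to experiment with that idea from Python, here is a sketch that dumps the enumerable properties of a single element through execute_script; it is deliberately limited to one element, since serializing the whole document would hit circular references, and the "body" selector is just an example:
dump = driver.execute_script("""
    var el = document.querySelector(arguments[0]);
    var s = "";
    for (var key in el) {
        try {
            s += typeof(el[key]) + ":" + key + ":" + JSON.stringify(el[key]) + "\\n";
        } catch (e) {
            // Skip properties that cannot be serialized.
        }
    }
    return s;
""", "body")
print(dump[:500])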
I found this question while looking for something similar, but I was hoping for a DataTable object (I'm using .NET) that I could bind to some sort of debugging window, something better than Chrome's. Previously I used Firebug to do this, but that is sorta dead.
So you could also get this data in real time using a debugger.
Related
I'm working on some automated actions for Instagram using Python and Selenium and sometimes my code crashes because of a NoSuchElementException.
For example, when I first wrote a function for unfollowing a user, I used something like:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
After running it a few times, my code crashed because it couldn't find the element, so upon inspecting the page I found out that the XPath is now:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
There's a small difference: div[1]/div[2]/div has become div[2]/div/div/div[2]. So I have two questions:
Why does this happen?
Is there a bulletproof method that guarantees I will always be getting the right XPath (or element)?
It's high time we bust the myth that XPath changes.
Locator strategies, e.g. XPath and CSS selectors, are derived by the user, and the more canonically the locators are constructed, the more durable they are.
XML Path Language (XPath)
XPath 3.1 is an expression language that allows the processing of values conforming to the data model defined in XQuery and XPath Data Model (XDM) 3.1. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic values, function items, and sequences. This version of XPath supports JSON as well as XML, adding maps and arrays to the data model and supporting them with new expressions in the language and new functions in XQuery and XPath Functions and Operators 3.1.
Selectors
CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS uses Selectors for binding style properties to elements in the document. These expressions can also be used, for instance, to select a set of elements, or a single element from a set of elements, by evaluating the expression across all the elements in a subtree.
This use case
As per your code trials:
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
and
following_xpath = "//*[@id='react-root']/section/main/div/header/section/div[2]/div/div/div[2]/div/span/span[1]/button"
Here are a couple of takeaways:
The DOM Tree contains React elements. So it is quite clear that the app uses ReactJS. React is a declarative, efficient, and flexible JavaScript library for building user interfaces. It lets you compose complex UIs from small and isolated pieces of code called components.
The XPaths are absolute-style paths, walking down the tree step by step.
The XPaths contain indices.
So the application is dynamic in nature, and elements are liable to be added and moved within the HTML DOM whenever DOM events fire.
Solution
In such cases when the application is based on either:
JavaScript
Angular
ReactJS
jQuery
AJAX
Vue.js
Ember.js
GWT
The canonical approach is to construct relative and/or dynamic locators combined with WebDriverWait. Some examples:
To interact with the username field on the Instagram login page:
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))).send_keys("anon")
You can find a detailed discussion in Filling in login forms in Instagram using selenium and webdriver (chrome) python OSX
To locate the first line of the address just below the text FIND US on Facebook:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]")))
You can find a detailed discussion in Decoding Class names on facebook through Selenium
Interacting with GWT elements:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@title='Viewers']//preceding::span[1]//label"))).click()
You can find a detailed discussion in How to click on GWT enabled elements using Selenium and Python
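These snippets assume the usual Selenium support imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC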
The answer to (1) is simple: the page content has changed.
Firstly, the notion that there is "an XPath" for every element in a document is wrong: there are many (indeed infinitely many) XPath expressions that will select a given element. You've probably generated these XPaths using a tool that tries to give you what it considers the most useful XPath expression, but it's not the only one possible.
The best XPath expression to use is one that isn't going to change when the content of the page changes: but it's very hard for any tool to give you that, because it has no idea what's likely to change in the page content.
Using an @id attribute value (which these paths do) is more likely to be stable than using numeric indexing (which these paths also do), but that's based on guesses about what's likely to change, and those guesses can always be wrong. The only way of writing an XPath expression that continues to do "the right thing" when the page changes is to correctly guess which aspects of the page structure are going to vary and which parts are going to remain stable. So the only "bulletproof" answer to (2) is to understand not just the current page structure, but its invariants over time.
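To make that concrete, compare a purely positional path with one anchored on something the page is less likely to change; the "Following" label here is an assumption about the button, not something guaranteed by Instagram:
# Brittle: any added or removed wrapper div breaks the positional steps.
brittle = "//*[@id='react-root']/section/main/div/header/section/div[1]/div[2]/div/span/span[1]/button"
# More resilient: anchored on the visible label the test actually cares about.
resilient = "//header//button[normalize-space()='Following']"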
I want to retrieve the city name and city code and store them in one string variable. The image shows the precise location:
Google Chrome gave me the following XPath:
//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span
So I defined the following statement in scrapy to get the desired information:
plz = response.xpath('//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span/text()').extract()
However, I was not successful; the string remains empty. What XPath expression should I use instead?
Most of the time when this occurs, it is because browsers correct invalid HTML. How do you fix this? Inspect the (raw) HTML source and write your own XPath that navigates the DOM with the shortest/simplest query.
I scrape a lot of data off of the web and I've never used an XPath as specific as the one you got from the browser. This is for a few reasons:
It will fail quickly on invalid HTML or the most basic of hierarchy changes.
It contains no identifying data for debugging an issue when the website changes.
It's way longer than it should be.
Here's an example query for grabbing that element (there are a lot of different XPath queries you could write to find this data; I'd suggest learning XPath and re-writing this query yourself so there are common themes for XPath queries throughout your project):
//div[contains(@class, "detail-address")]//h2/following-sibling::span
The other main source of this problem is sites that extensively rely on JS to modify what is shown on the screen. Conveniently, though, this is debugged the same way as above. As soon as you glance at the HTML returned on page load, you will notice that the data you are querying doesn't exist until JS executes. At that point, you need some sort of headless browsing.
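A quick way to check for that before reaching for a headless browser is to look at the raw HTML your scraper actually receives; this is only a sketch, with a placeholder URL and the detail-address class from the query above as the marker:
import urllib2  # Python 2, matching the rest of this thread

raw_html = urllib2.urlopen("http://example.com/the-page-you-scrape").read()
# If the marker is missing from the raw HTML, the content is injected by JS
# and no amount of XPath tweaking on the static source will find it.
print("detail-address" in raw_html)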
Since my answer was essentially "write your own XPath" (rather than relying on the browser), I'll leave some sources:
basic XPath introduction
list of XPath functions
XPath Chrome extension
The DOM is manipulated by JavaScript, so what Chrome shows you is the XPath after all of that has happened.
If all you want is to get the cities, you can get it this way (using scrapy):
city_text = response.css('.detail-address span::text').extract_first()
city_code, city_name = city_text.split(maxsplit=1)
Or you can manipulate the JSON in CDATA to get all the data you need:
cdata_text = response.xpath('//*[@id="tdakv"]/text()').extract_first()
json_str = cdata_text.splitlines()[2]
json_str = json_str[json_str.find('{'):]
data = json.loads(json_str) # import json
city_code = data['kvzip']
city_name = data['kvplace']
This is a problem I always have when getting a specific XPath from my browser.
Assume that I want to extract all the images from websites like Google Image Search or Pinterest. When I use Inspect element and then Copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in Google Search. When I wanted to use it in my spider, I tried Selector and HtmlXPathSelector with the following XPaths, but none of them work!
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img #i change this based on the raw html of page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe or reliable to blindly follow the browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute, starting from the parent of all parents, the html tag, which makes them more dependent on the page structure (well, Firebug can also make expressions based on id attributes).
Also, the HTML you see in the browser can be quite different from what Scrapy receives, due to the asynchronous nature of the page load and JavaScript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
Here is what I've got for the google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal; a @class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
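In a Scrapy spider that would look something like the line below; extracting the src attribute is just one illustration of what you might do with the matched images:
image_urls = response.xpath("//div[@id='rg_s']//img[@class='rg_i']/@src").extract()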
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
I need to implement a few scrapers to crawl some web pages (because the site doesn't have an open API), extract information, and save it to a database. I am currently using Beautiful Soup to write code like this:
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])
I guess code like this can very easily become invalid when the web page is changed, even slightly.
How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?
In particular, is there any existing 'smart scraper' that can make a 'best effort' guess even when the original XPath/CSS selector is no longer valid?
Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and even if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that has both trustworthiness and automated flexibility.
Maintainability is somewhat of an art-form centered around how selectors are defined and used.
In the past I have rolled my own "two stage" selectors:
(find) The first stage is highly inflexible and checks the structure of the page toward a desired element. If the first stage fails, then it throws some kind of "page structure changed" error.
(retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.
This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
A few important questions to answer are:
What do you expect to change on the page?
What do you expect to stay the same on the page?
When answering these questions, the more accurate you can be the better your selectors can become.
In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be; when both finding and retrieving data on a page, how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.
EDIT: Small example
Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false positive element arises, we may desire to avoid extracting from this new element.
Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
So why not just use a selector like .content > .deal .price?
For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
I shouldn't say that it's a "best" practice, but it has worked well.
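Here is a minimal sketch of that two-stage idea, assuming the .content > .deal / .price structure from the example above and using Beautiful Soup's CSS selector support:
from bs4 import BeautifulSoup

class PageStructureChanged(Exception):
    """Raised when the strict first-stage check fails."""

def extract_price(html):
    soup = BeautifulSoup(html, "html.parser")
    # Stage 1 (find): inflexible check of the page structure we rely on.
    deals = soup.select(".content > .deal")
    if not deals:
        raise PageStructureChanged("expected .content > .deal to be present")
    # Stage 2 (retrieve): flexible query relative to the stage-1 result.
    price = deals[0].select(".price")
    return price[0].get_text(strip=True) if price else None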
Completely unrelated to Python and not auto-flexible, but I think the templates of my Xidel scraper have the best maintainability.
You would write it like:
<div id="detail-main">
<del class="originPrice">
{extract(., "[0-9.]+")}
</del>
</div>
Each element of the template is matched against the elements on the webpage, and if they are the same, the expressions inside {} are evaluated.
Additional elements on the page are ignored, so if you find the right balance of included elements and removed elements, the template will be unaffected by all minor changes.
Major changes, on the other hand, will trigger a matching failure, which is much better than XPath/CSS simply returning an empty set. Then you can update just the changed elements in the template; in the ideal case you could apply the diff between the old and changed page directly to the template. In any case, you do not need to search for which selector is affected or update multiple selectors for a single change, since the template can contain all queries for a single page together.
EDIT: Oops, I now see you're already using CSS selectors. I think they provide the best answer to your question. So no, I don't think there is a better way.
However, sometimes you may find that it's easier to identify the data without the structure. For example, if you want to scrape prices, you can do a regular expression search matching the price (\$\s+[0-9.]+), instead of relying on the structure.
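A tiny sketch of that; html here stands for whatever page text you already fetched:
import re

# Matches prices written like "$ 12.34", per the pattern above.
prices = re.findall(r'\$\s+[0-9.]+', html)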
Personally, the out-of-the-box webscraping libraries that I've tried all kind of leave something to desire (mechanize, Scrapy, and others).
I usually roll my own, using:
urllib2 (standard library),
lxml and
cssselect
cssselect allows you to use CSS selectors (just like jQuery) to find specific divs, tables, etc. This proves to be really invaluable.
Example code to fetch the first question from SO homepage:
import urllib2
import urlparse
import cookielib
from lxml import etree
from lxml.cssselect import CSSSelector
post_data = None
url = 'http://www.stackoverflow.com'
cookie_jar = cookielib.CookieJar()
http_opener = urllib2.build_opener(
urllib2.HTTPCookieProcessor(cookie_jar),
urllib2.HTTPSHandler(debuglevel=0),
)
http_opener.addheaders = [
('User-Agent', 'Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
fp = http_opener.open(url, post_data)
parser = etree.HTMLParser()
doc = etree.parse(fp, parser)
elem = CSSSelector('#question-mini-list > div:first-child > div.summary h3 a')(doc)
print elem[0].text
Of course you don't need the cookie jar, nor the user agent to emulate Firefox; however, I find that I regularly need these when scraping sites.