I am relatively new to web scraping/crawlers and have two questions about the case where a previously parsed DOM element is no longer found in the fetched webpage:
1. Is there a clever way to detect whether the page has changed? I have read that it's possible to store and compare hashes, but I am not sure how effective that is.
2. If a parsed element is no longer found in the fetched webpage, and we assume that the same DOM element still exists somewhere in the DOM tree in a different location, is there a way to traverse the DOM tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see this in your tag list so I thought I'd mention this before anything else: a tool called BeautifulSoup, designed specifically for web-scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-size-fits-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories, where some categories have either implied or explicit timestamps (e.g. news sites) that you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. A refinement is to hash not the entire page but only the dynamic elements or the elements of interest (see the sketch after this list).
Issue a HEAD request and compare headers such as Last-Modified or ETag, if the server provides them.
Download and store the previous and current versions of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
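As a minimal sketch of the selective-hashing idea mentioned above (the URL and CSS selector below are placeholders for your own targets):

import hashlib

import requests
from bs4 import BeautifulSoup

def content_fingerprint(url, selector="div.article-body"):
    # Hash only the elements we care about, not the whole page.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    relevant = "".join(el.get_text(strip=True) for el in soup.select(selector))
    return hashlib.sha256(relevant.encode("utf-8")).hexdigest()

Store the fingerprint from the previous crawl and re-parse the page only when the new hash differs.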
As of version 4.x, BeautifulSoup can use different HTML parsers, notably lxml; for XPath you would use lxml directly on the document, since BeautifulSoup's own API does not expose XPath. Either way, this will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficiently), use CSS selectors, which BeautifulSoup supports natively. They are more flexible because they don't depend on the content being in the same place; of course this assumes the content you're interested in keeps the same CSS classes or attributes.
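For example, a rough sketch of both options (the tag and class names are placeholders for whatever stable attributes your target content keeps):

from bs4 import BeautifulSoup
import lxml.html

html = open("page.html", encoding="utf-8").read()

# CSS selectors via BeautifulSoup (here using the lxml parser for speed):
soup = BeautifulSoup(html, "lxml")
for el in soup.select("article.story > h2"):
    print(el.get_text(strip=True))

# XPath via lxml directly, instead of walking the whole tree in a loop:
tree = lxml.html.fromstring(html)
for el in tree.xpath("//article[contains(@class, 'story')]/h2"):
    print(el.text_content().strip())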
Hope this helps!
Related
Currently I am working on a project that will scrape content from various similarly designed websites which contain dynamic content. My end goal is to then aggregate all this data into one application or report of sorts. I made some progress in terms of pulling the needed data from one page but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom while collecting data from the various elements using the selector below, plus the manual scroll.
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull, on the fly, all the data points I needed except the overarching category, which happens to be a parent element. That is something I could map manually in Excel, but I would prefer to have it pulled alongside everything else.
This got me thinking that maybe I should have taken a top-down approach rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on the advice of others, I could not get it working as intended so that I could pull the category from the parent div, due to my lack of understanding.
Based on input from others I was able to make a pivot of sorts, and using the code below I was able to get the category as well as the product name, without any need to scroll the page at all, which went against every experience I have had with this project so far - I am unclear how/why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class, 'products-grid__ProductGroup')][./div[starts-with(@class, 'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class, 'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker as it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but it is currently beyond my knowledge level.
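For context, the per-card (bottom-up) extraction I am aiming for looks roughly like this; the class-name prefixes for brand and price are guesses on my part rather than values confirmed against the page:

cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
for card in cards:
    name = card.find_element_by_css_selector("div[class^='product-card__Title']").text
    # Hypothetical prefixes -- inspect the page to confirm the real class names:
    brand = card.find_element_by_css_selector("div[class^='product-card__Brand']").text
    price = card.find_element_by_css_selector("div[class^='product-card__Price']").text
    print(name, brand, price)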
Any specific or general advice would be appreciated, as I would like to scale this into something a bit more robust as my knowledge builds. Eventually I would like to have it scanning multiple different URLs at set points in the day; that is a long way off, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that. Would that be a valid approach, possibly better suited to my needs? Would it even be possible given the dynamic nature of the page?
Thank you.
For now I'm just maintaining the locators in an '.ini' file and accessing them via 'configparser'. But the problem is that, when we are working with a big application with many pages in it, it's very difficult to make any changes.
[login]
login_window=//h4[text()='Login']
username_input=//input[@name='username']
password_input=//input[@name='password']
login_button=//input[@value='Login']
Keeping XPath locators externally is a good approach, but XPath tends to be slow and brittle, and - as you found out - hard to maintain.
Instead, use CSS_SELECTOR, CLASS_NAME, or ID locators - those change less often and keep your tests in step with new UI changes. Also, use the Page Object pattern: keep each UI page mapped to a class in which every field is defined by a locator; it is much easier to track changes that way.
e.g.:
username_input = (By.CSS_SELECTOR, "input[name='username']")
password_input = (By.CSS_SELECTOR, "input[name='password']")
login_button = (By.CSS_SELECTOR, "input[value='Login']")
Here is a nice introduction to the Page Object pattern.
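A minimal sketch of a page object in Python, reusing the selectors from the example above (class and method names are illustrative, adjust to your framework):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class LoginPage:
    # Locators live in one place; tests never hard-code selectors.
    USERNAME_INPUT = (By.CSS_SELECTOR, "input[name='username']")
    PASSWORD_INPUT = (By.CSS_SELECTOR, "input[name='password']")
    LOGIN_BUTTON = (By.CSS_SELECTOR, "input[value='Login']")

    def __init__(self, driver):
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def login(self, username, password):
        self.wait.until(EC.visibility_of_element_located(self.USERNAME_INPUT)).send_keys(username)
        self.driver.find_element(*self.PASSWORD_INPUT).send_keys(password)
        self.driver.find_element(*self.LOGIN_BUTTON).click()

When the UI changes, only the locator constants need to be updated; the tests that call login() stay untouched.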
I am loading and scrolling on dynamically loading pages. An example is the Facebook "wall", which only loads the next items once you have scrolled to somewhere near the bottom.
I scroll until the page is veeeery long, then I copy the source code, save it as a text file and go on to parsing it.
I would like to extract certain parts of the webpage. I have been using the lxml module in Python, but with limited success. On their website they only show examples with pretty short XPaths.
Below is an example of the function and a path that gets me the user names included on the page.
usersID = elTree.xpath('//a[@class="account-group js-account-group js-action-profile js-user-profile-link js-nav"]')
this works fairly well, however I am getting some errors (another post of mine), such as:
TypeError: 'NoneType' object has no attribute '__getitem__'
I have also been looking at the XPaths that Firebug provides. These are of course much longer and very specific. Here is an example for a recurring element on the page:
/html/body/div[2]/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div/div/div[2]/ol[1]/li[26]/ol/li/div/div[2]/p
The part towards the end li[26] shows it is the 26th item in a list of the same element, which are found at the same level of the HTML tree.
I would like to know how I might use such Firebug XPaths with the lxml library, or if anybody knows of a better way to use XPaths in general.
Using example HTML code and tools like this for test purposes, the XPaths from Firebug don't work at all. Is that path just ridiculous, in people's experience?
Is it very specific to the source code? Are there any other tools like Firebug that produce more reliable output for use with lxml?
Firebug actually generates really poor XPaths. They are long and fragile because they're incredibly non-specific beyond hierarchy.
Pages today are incredibly dynamic.
The best way to work with xpath on dynamic pages is to locate common elements as the hook and perform xpath ops from those as your path root.
What I mean here by common elements is stable structural elements that are highly likely or guaranteed to be present. Pick the one closest to your target in terms of containment hierarchy. Shorter paths are faster and clearer.
From there you need to create paths that locate some specific unique attribute or attribute value on the target element.
Sometimes that's not possible so another strategy is to target the closest uniquely identifiable container element then get all elements similar to yours under that and iterate them looking for your goal.
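As a rough sketch of that idea with lxml (the id and class names below are placeholders for whatever stable containers the page actually has):

import lxml.html

tree = lxml.html.fromstring(open("saved_page.html", encoding="utf-8").read())

# Hook: a stable, uniquely identifiable container element.
containers = tree.xpath("//div[@id='timeline']")
if containers:
    # Do the remaining XPath work relative to the hook, not the document root.
    for item in containers[0].xpath(".//li[contains(@class, 'stream-item')]"):
        target = item.xpath(".//p[contains(@class, 'tweet-text')]")
        if target:
            print(target[0].text_content().strip())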
Highly dynamic pages require sophisticated and dynamic approaches.
Facebook changes a lot and will require script maintenance frequently.
I found two things which, together, worked very well for me.
The first thing:
The lxml package allows the use of certain functions inside an XPath expression. I used the starts-with function, as follows:
tweetID = elTree.xpath("//div[starts-with(@class, 'the-non-varying-part-of-a-very-long-class-name')]")
When exploring the HTML code (tree) using tools such as Firebug/Firepath, everything is shown nice and neatly. But when I used the highlighted path, i.e. tweet original-tweet js-original-tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable has-cards has-native-media with-media-forward media-forward cards-forward, to search my elTree with the code above, nothing was found.
Having a look at the actual HTML file I was trying to parse, I saw that it was actually spread over many lines, which explains why the lxml package was not finding it with my search.
The second thing:
I know this is not generally recommended as a workaround, but the Python philosophy that it is "easier to ask for forgiveness than permission" applied in my case: the next thing I did was to wrap the offending calls in try/except for the TypeError that I kept getting at seemingly arbitrary lines of my code.
This may well be specific to my code, but after checking the output on many cases, it seems as though it worked well for me.
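A tiny sketch of what that looks like in practice (the XPaths are placeholders; the point is catching the error per item instead of letting one bad node kill the whole run):

from lxml import html as lhtml

elTree = lhtml.parse("saved_page.html").getroot()

for item in elTree.xpath("//div[starts-with(@class, 'stream-item')]"):
    try:
        text = item.xpath(".//p[starts-with(@class, 'tweet-text')]")[0].text_content()
    except (IndexError, TypeError):
        # Skip items that lack the expected sub-element instead of crashing.
        continue
    print(text.strip())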
Is there an API or a systematic way of stripping irrelevant parts of a web page while scraping it via Python? For instance, take this very page -- the only important parts are the question and the answers, not the sidebar column, header, etc. One can guess at things like that, but is there any smart way of doing it?
There's the approach from the Readability bookmarklet, with at least two Python implementations available:
decruft
python-readability
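For example, python-readability (packaged on PyPI as readability-lxml) exposes roughly this API - the exact import and method names may differ between versions:

import requests
from readability import Document  # pip install readability-lxml

html = requests.get("https://example.com/some-article").text
doc = Document(html)
print(doc.title())    # best-guess page title
print(doc.summary())  # HTML reduced to the main article content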
In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
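A minimal sketch of that approach, assuming you already know which containers are page chrome on the site in question (the selectors below are placeholders):

from bs4 import BeautifulSoup

page_html = open("question_page.html", encoding="utf-8").read()
soup = BeautifulSoup(page_html, "html.parser")

# Drop the parts you know are irrelevant on this particular site.
for selector in ("nav", "header", "footer", "#sidebar", ".ads"):
    for el in soup.select(selector):
        el.decompose()

main_text = soup.get_text(separator="\n", strip=True)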
One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (useless) or dynamic (useful).
This field is known as wrapper induction. Unfortunately it is harder than it sounds!
This GitHub project solves your problem, but it's in Java. It may be worth a look: goose
I'm working on a project that is quite search-oriented. Basically, users will add content to the site, and this content should be immediately available in the search results. The project is still in development.
Up until now, I've been using Haystack with Xapian. One thing I'm worried about is the performance of the website once a lot of content is available. Indexing will have to occur very frequently if I want to emulate real-time search.
I was reading up on MongoDB recently. I haven't found a satisfying answer to my question, but I have the feeling that MongoDB might be of help for the real-time search indexing issue I expect to encounter. Is this correct? In other words, would the search functionality available in MongoDB be more suited for a real-time search function?
The content that will be available on the site is large unstructured text (including HTML) and related data (prices, tags, datetime info).
Thanks in advance,
Laundro
I don't know much about MongoDB, but I'm using Sphinx Search with great success - a simple, powerful and very fast tool for full-text indexing and search. It also provides a Python wrapper out of the box.
It would be easier to pick it up if Haystack provided bindings for it; unfortunately, Sphinx bindings are still on the wish list.
Nevertheless, setting Sphinx up is so quick (I did it in a few hours, for an existing in-production Django-based CRM) that maybe you can give it a try before switching to a more generic solution.
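For completeness, the bundled Python client looks roughly like this (module and method names follow the classic sphinxapi interface and may differ in newer releases):

import sphinxapi  # the sphinxapi.py client distributed with Sphinx

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)  # searchd host/port

result = client.Query('user supplied terms', 'content_index')
if not result:
    print(client.GetLastError())
else:
    for match in result['matches']:
        print(match['id'], match['weight'])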
MongoDB is not really a dedicated full-text search engine. Based on their full-text search docs, you can only create an array of tags that duplicates the string data of other columns, which with many elements (hundreds or thousands) can make inserts very expensive.
I agree with Tomasz: Sphinx Search can be used for what you need - real-time indexes if you want it to be truly real-time, or delta indexes if several seconds of delay are acceptable.