I am loading and scrolling on dynamically loading pages. An example is the Facebook "wall", which only loads the next items once you have scrolled to somewhere near the bottom.
I scroll until the page is veeeery long, then I copy the source code, save it as a text file and go on to parsing it.
I would like to extract certain parts of the webpage. I have been using the lxml module in Python, but with limited success. On their website they only show examples with pretty short XPaths.
Below is an example of the function and a path that gets me the user names included on the page.
usersID = elTree.xpath('//a[@class="account-group js-account-group js-action-profile js-user-profile-link js-nav"]')
This works fairly well; however, I am getting some errors (covered in another post of mine), such as:
TypeError: 'NoneType' object has no attribute '__getitem__'
I have also been looking at the XPaths that Firebug provides. These are of course much longer and very specific. Here is an example for a recurring element on the page:
/html/body/div[2]/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div/div/div[2]/ol[1]/li[26]/ol/li/div/div[2]/p
The part towards the end, li[26], shows it is the 26th item in a list of the same element, all found at the same level of the HTML tree.
I would like to know how I might use such Firebug XPaths with the lxml library, or if anybody knows of a better way to use XPaths in general.
Using example HTML code and tools like this for test purposes, the XPaths from Firebug don't work at all. Is that path just ridiculous in people's experience? Is it very specific to the source code? Are there any other tools like Firebug that produce more reliable output for use with lxml?
Firebug actually generates really poor XPaths. They are long and fragile because they're incredibly non-specific beyond hierarchy.
Pages today are incredibly dynamic.
The best way to work with xpath on dynamic pages is to locate common elements as the hook and perform xpath ops from those as your path root.
What I mean here by common elements is stable structural elements that are highly likely or guaranteed to be present. Pick the one closest to your target in terms of containment hierarchy. Shorter paths are faster and clearer.
From there you need to create paths that locate some specific unique attribute or attribute value on the target element.
Sometimes that's not possible, so another strategy is to target the closest uniquely identifiable container element, then get all elements similar to yours under that and iterate over them looking for your goal.
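As a minimal sketch of that hook-based approach with lxml (the file name and class names here are illustrative assumptions, not taken from a real page):

from lxml import html

# Parse the saved page source (assumed to have been saved to a local file).
with open("saved_page.html", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

# 1. Hook onto a stable container element instead of an absolute path.
for container in tree.xpath('//ol[contains(@class, "stream-items")]'):
    # 2. Query relative to the hook with short, attribute-based paths.
    for link in container.xpath('.//a[contains(@class, "account-group")]'):
        print(link.get("href"))

Relative queries from the hook survive layout changes above the hook, which is exactly what breaks absolute /html/body/... paths.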
Highly dynamic pages require sophisticated and dynamic approaches.
Facebook changes a lot and will require script maintenance frequently.
I found two things which, together, worked very well for me.
The first thing:
The lxml package allows the use of certain functions inside the XPath expression. I used the starts-with function, as follows:
tweetID = elTree.xpath("//div[starts-with(@class, 'the-non-varying-part-of-a-very-long-class-name')]")
When exploring the HTML code (tree) using tools such as Firebug/Firepath, everything is shown nice and neatly, with the full class attribute on a single line. However, when I used that highlighted path, i.e. tweet original-tweet js-original-tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable has-cards has-native-media with-media-forward media-forward cards-forward, to search my elTree within the code above, nothing was found.
Having a look at the actual HTML file I was trying to parse, I saw that the class attribute was in fact spread over many lines, which explains why the lxml package was not finding an exact match for my search.
The second thing:
I know this is not generally recommended as a workaround, but the Python philosophy that it is "easier to ask for forgiveness than permission" applied in my case. The next thing I did was to wrap the offending lines in Python's try / except, catching the TypeError that I kept getting at seemingly arbitrary points in my code. This may well be specific to my code, but after checking the output in many cases, it seems to have worked well for me.
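A minimal sketch of that workaround, assuming elTree is the parsed tree from above and that the class prefixes are placeholders:

for tweet in elTree.xpath("//div[starts-with(@class, 'tweet')]"):
    try:
        # These relative lookups can fail on malformed or partially loaded items.
        user = tweet.xpath(".//a[starts-with(@class, 'account-group')]")[0]
        text = tweet.xpath(".//p[starts-with(@class, 'tweet-text')]")[0]
        print(user.get("href"), text.text_content())
    except (TypeError, IndexError):
        # Skip items that are missing the expected sub-elements.
        continue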
Related
Currently I am working on a project that will scrape content from various similarly designed websites which contain dynamic content. My end goal is to then aggregate all this data into one application or report of sorts. I made some progress in terms of pulling the needed data from one page but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom, all the while collecting data from the various elements using the following, plus the manual scroll:
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull, on the fly, all the data points I needed, minus the overarching category, which happens to be a parent element. That is something I can map manually in Excel, but I would prefer to have it pulled alongside everything else.
This got me thinking that maybe I should have taken a top-down approach, rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on the advice of others, I could not get it working as intended, where I can pull the category from the parent div, due to my lack of understanding.
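For illustration, this is roughly the shape of the parent lookup I was attempting; the class prefixes are assumptions based on the selectors used further down and may not match the live page:

# Hedged sketch: walk up from each product card to its surrounding group,
# then read the group's title element.
for card in driver.find_elements_by_css_selector("div[class^='product-card__Content']"):
    group = card.find_element_by_xpath("./ancestor::div[starts-with(@class, 'products-grid__ProductGroup')][1]")
    title = group.find_element_by_xpath("./div[starts-with(@class, 'products-grid__ProductGroupTitle')]")
    print(title.text, card.text)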
Based on the input of others I was able to make a pivot of sorts. Using the code below, I was able to get the category as well as the product name, without any need to scroll the page at all, which went against every experience I have had with this project so far; I am unclear how/why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class,'products-grid__ProductGroup')][./div[starts-with(@class,'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker as it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but outside of my knowledge level currently.
Any specific or general advice would be appreciated, as I would like to scale this into something a bit more robust as my knowledge builds. I would like to be able to have this scanning multiple different URLs at set points in the day; that is a long way off, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that. Would that be a valid approach and possibly better suited to my needs? Would it even be possible, given the dynamic nature of the page?
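For reference, a minimal sketch of what I understand that approach to mean, i.e. handing the rendered source to a parser such as lxml once the page has loaded (the class prefix is the one assumed above):

from lxml import html

# After scrolling/loading in Selenium, parse the rendered source directly.
tree = html.fromstring(driver.page_source)
cards = tree.xpath("//div[starts-with(@class, 'product-card__Content')]")
print(len(cards))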
Thank you.
I am now writing Selenium tests for a React-based webapp and have some concerns about writing tests in Selenium.
It looks like most elements have no unique ID by which I can simply find them using Selenium methods. I have used Selenium IDE to manually create tests and exported them to Python, but they typically look like this:
browser.find_element(
By.CSS_SELECTOR, ".MuiButton-containedPrimary > .MuiButton-label"
).click()
This is, in my opinion, too complex for others to read and maintain. Even if I write the tests myself they will be too hard to read (even if they are quick to develop).
My question is: what is the best approach to achieve maintainability and simplicity of tests? Should I force developers to give unique IDs to core elements like buttons, etc.?
If unique IDs were given, a typical search could work like this:
element = browser.find_element_by_id('createProjectArea')
Yes:
Should I force developers to give unique ids to core elements
On the last project we did exactly that. Of course there is a bunch of elements and it's a mess to cover every one, but for the rest you start using CSS selectors and XPath, trying not to rely on neighboring elements where possible.
And try to use:
driver.find_element(By.CSS_SELECTOR, 'selector')
or
driver.find_element(By.XPATH, 'selector')
In the newest Selenium version (4) you will start getting deprecation warnings for the old helpers:
find_element_by_id
find_element_by_name
find_element_by_tag_name
find_element_by_xpath
etc
warnings.warn("find_element_by_* commands are deprecated. Please use find_element() instead")
Assign a known ID to any field or control that you want your Selenium tests to specifically manipulate. This can make the test more resistant to subsequent changes in the topology of the HTML – unlike other alternatives which depend on the surrounding context. That can be a real advantage.
I think that it's a balancing act. If creating IDs helps you write your tests, then it's perfectly fine to do so for that purpose, where and when it actually applies. But I would not go so far as to "force developers to ..." Don't litter the field with IDs that are not actually going to be referred to by the application or by the tests.
I am relatively new to web scraping/crawlers and was wondering about two issues that arise when a previously parsed DOM element is no longer found in the fetched webpage:
1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.
2- In case a parsed element is not found in the fetched webpage anymore, if we assume that we know that the same DOM element still exists somewhere in the DOM Tree in a different location, is there a way to somehow traverse the DOM Tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see it in your tag list, so I thought I'd mention it before anything else: there is a tool called BeautifulSoup, designed specifically for web scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-size-fits-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories, where some categories have either implied or explicit timestamps (e.g. news sites) that you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea here is to not hash the entire page but rather only the dynamic parts or elements of interest (see the sketch after this list).
Fetch HEAD, if available.
Download and store previous and current version of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
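As a minimal sketch of the element-level hashing idea, assuming requests and lxml (the URL and XPath below are placeholders):

import hashlib
import requests
from lxml import html

def content_fingerprint(url, xpath_expr):
    # Hash only the region of interest instead of the whole page.
    tree = html.fromstring(requests.get(url, timeout=30).text)
    region = tree.xpath(xpath_expr)
    text = " ".join(el.text_content() for el in region)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Compare against the fingerprint stored from the previous crawl.
new_fp = content_fingerprint("https://example.com/page", '//div[@id="main-content"]')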
As of version 4.x, BeautifulSoup can use different HTML parsers under the hood, notably lxml; if you need XPath you can also parse with lxml directly, since BeautifulSoup itself does not expose XPath. Either way, this will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficient) is using CSS selectors. These are more flexible because they don't depend on the content being in the same place; of course, this assumes the content you're interested in retains its CSS classes and attributes.
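A minimal sketch of the CSS-selector route with BeautifulSoup (the HTML and selector are placeholders):

from bs4 import BeautifulSoup

html_doc = "<div class='post'><p class='title'>Hello</p></div>"
soup = BeautifulSoup(html_doc, "lxml")  # uses the lxml parser under the hood

# select() takes a CSS selector and returns all matching elements.
for title in soup.select("div.post > p.title"):
    print(title.get_text(strip=True))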
Hope this helps!
Is there an API or systematic way of stripping irrelevant parts of a web page while scraping it via Python? For instance, take this very page: the only important parts are the question and the answers, not the sidebar column, header, etc. One can guess things like that, but is there any smart way of doing it?
There's the approach from the Readability bookmarklet, with at least two Python implementations available:
decruft
python-readability
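For instance, a minimal sketch with python-readability (the readability-lxml package); the URL is a placeholder:

import requests
from readability import Document

response = requests.get("https://example.com/some-article")
doc = Document(response.text)

print(doc.short_title())  # page title with site-name cruft stripped
print(doc.summary())      # HTML of the main content block only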
In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (useless) or dynamic (useful).
This field is known as wrapper induction. Unfortunately it is harder than it sounds!
This GitHub project solves your problem, but it's in Java. It may be worth a look: goose
I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see a collection of source code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She won't be bothered about the code format, but somehow needs to see an "identifier-documentation" mapping to get the idea of traceability.
I take the tokens from source code files and somehow split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms on the documentation. Search frameworks are best for doing this specific task of drilling down into documents to locate content using powerful information retrieval algorithms. Whoosh looked like a really great Python search framework, with a number of analyzers and filters.
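As a minimal sketch of the identifier-splitting step (the regex is one common approach, not necessarily what I will end up using):

import re

def split_identifier(identifier):
    # Turn a concatenated identifier like 'SimpleMAXAnalyzer' into lowercase search terms.
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", identifier)
    return " ".join(p.lower() for p in parts)

print(split_identifier("SimpleMAXAnalyzer"))  # -> simple max analyzer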
Though the problem is similar to search, it differs in that the user is not actually performing any search. So am I solving the problem the right way? Given that everything is static and needs to be computed only once, am I using the wrong tool (a search framework) for the job?
I'm not sure I understand your use case. The user sees the source code and has some way of jumping from a token to the appropriate part, or a listing of the possible parts, of the documentation, right?
Then a search tool seems to be the right tool for the job, although you could precompile every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
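A minimal sketch of such precomputation, using a plain dictionary rather than any particular search framework (the sample data is made up):

# Precompute an identifier -> documentation-section mapping once,
# since both the code and the docs are static.
def build_mapping(identifiers, doc_sections):
    mapping = {}
    for ident in identifiers:
        terms = ident.lower().split()
        # Keep the sections that mention every component of the identifier.
        mapping[ident] = [sec_id for sec_id, text in doc_sections.items()
                          if all(term in text.lower() for term in terms)]
    return mapping

docs = {"ch1": "The simple analyzer handles max token length.", "ch2": "Other content."}
print(build_mapping(["simple max analyzer"], docs))  # -> {'simple max analyzer': ['ch1']}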
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step to the solution. Then you can look up the pages/chapters/sections for every token of the source, where all or most of its components are mentioned.