What is the best way to maintain xpaths in selenium? - python

For now I'm just maintaining them in an '.ini' file and accessing them via 'configparser'. The problem is that when we work with big applications with many pages, it's very difficult to make any changes.
[login]
login_window=//h4[text()='Login']
username_input=//input[@name='username']
password_input=//input[@name='password']
login_button=//input[@value='Login']
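For reference, a minimal sketch of reading such a file back with configparser (the file name 'locators.ini' is an assumption; the section and keys are the ones from the example above):
import configparser

config = configparser.ConfigParser()
config.read("locators.ini")  # assumed file name for the example above

# Each locator is then looked up by section and key.
login_window = config["login"]["login_window"]
username_input = config["login"]["username_input"]
password_input = config["login"]["password_input"]
login_button = config["login"]["login_button"]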

Keeping XPath locators externally is a good approach, but XPath is time-consuming to write, performs worse, and, as you found out, is hard to maintain.
Instead, use CSS_SELECTOR, CLASS_NAME, or ID - those rarely change and keep your tests in step with UI changes. Also, use the Page Object pattern - you map each UI page to a class, with each field defined by a selector, which makes it much easier to track changes.
e.g.
username_input = (By.CSS_SELECTOR, "input[name='username']")
password_input = (By.CSS_SELECTOR, "input[name='password']")
login_button = (By.CSS_SELECTOR, "input[value='Login']")
Here is a nice introduction to the Page Object pattern.
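For illustration, a minimal Page Object sketch along those lines (the class, method and selector names are made up for the example, not taken from any real project):
from selenium.webdriver.common.by import By

class LoginPage:
    # Locators live in one place, so a UI change only touches this class.
    USERNAME_INPUT = (By.CSS_SELECTOR, "input[name='username']")
    PASSWORD_INPUT = (By.CSS_SELECTOR, "input[name='password']")
    LOGIN_BUTTON = (By.CSS_SELECTOR, "input[value='Login']")

    def __init__(self, driver):
        self.driver = driver

    def login(self, username, password):
        self.driver.find_element(*self.USERNAME_INPUT).send_keys(username)
        self.driver.find_element(*self.PASSWORD_INPUT).send_keys(password)
        self.driver.find_element(*self.LOGIN_BUTTON).click()
A test then only calls LoginPage(driver).login("user", "secret") and never touches the selectors directly.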

Related

Should I force developers to write ID of elements to write proper automated tests?

I am now writing Selenium tests for a React-based webapp and have some concerns about how to write them.
It looks like most elements have no unique ID that I could use to find them with Selenium methods. I have used Selenium IDE to create tests manually and exported them to Python, but they typically look like this:
browser.find_element(
By.CSS_SELECTOR, ".MuiButton-containedPrimary > .MuiButton-label"
).click()
This is, in my opinion, too complex for others to read and maintain. Even if I write the tests myself, they will be too hard to read (even if they are quick to develop).
My question is: what is the best approach to achieve maintainability and simplicity of tests? Should I force developers to give unique IDs to core elements like buttons, etc.?
If unique ids will be given typical search method could work like this:
element = browser.find_element_by_id('createProjectArea')
Yes:
Should I force developers to give unique ids to core elements
On our last project we did exactly that. Of course there are a bunch of elements and it's a mess to cover every one, but then you start using CSS selectors and XPath, trying not to rely on neighboring elements if possible.
And try to use:
driver.find_element(By.CSS_SELECTOR, 'selector')
or
driver.find_element(By.XPATH, 'selector')
In the newest Selenium version (4) you will start getting deprecation warnings for the old find_element_by_* helpers:
find_element_by_id
find_element_by_name
find_element_by_tag_name
find_element_by_xpath
etc
warnings.warn("find_element_by_* commands are deprecated. Please use find_element() instead")
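For clarity, a small before/after sketch of that migration (the locator value is the one from the question; the page URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/projects")  # placeholder URL

# Selenium 3 style, now deprecated:
# element = driver.find_element_by_id("createProjectArea")

# Selenium 4 style:
element = driver.find_element(By.ID, "createProjectArea")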
Assign a known ID to any field or control that you want your Selenium tests to specifically manipulate. This can make the test more resistant to subsequent changes in the topology of the HTML – unlike other alternatives which depend on the surrounding context. That can be a real advantage.
I think that it's a balancing act. If creating IDs helps you write your tests, then it's perfectly fine to do so for that purpose, where and when it actually applies. But I would not go so far as to "force developers to ..." Don't litter the field with IDs that are not actually going to be referred-to by the application or by the tests.

Web scraping: finding element after a DOM Tree change

I am relatively new to web scraping/crawlers and was wondering about 2 issues for the case where a previously parsed DOM element is no longer found in the fetched webpage:
1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.
2- In case a parsed element is not found in the fetched webpage anymore, if we assume that we know that the same DOM element still exists somewhere in the DOM Tree in a different location, is there a way to somehow traverse the DOM Tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see it in your tag list, so I'll mention this before anything else: there's a tool called BeautifulSoup designed specifically for web scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-size-fits-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories; some categories have either implied or explicit timestamps (e.g. news sites) that you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea is to hash not the entire page but only the dynamic parts or the elements of interest (see the sketch after this list).
Issue an HTTP HEAD request and compare headers such as Last-Modified or ETag, if the server provides them.
Download and store previous and current version of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
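As a rough sketch of the "hash only the elements of interest" idea (the URL and CSS selector below are placeholders, and requests/BeautifulSoup are just one possible toolset):
import hashlib

import requests
from bs4 import BeautifulSoup

def content_fingerprint(url, css_selector):
    # Hash only the fragment we care about, not the whole document.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    fragment = soup.select_one(css_selector)
    text = fragment.get_text(" ", strip=True) if fragment else ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Compare against the fingerprint stored from the previous crawl, e.g.:
# page_changed = content_fingerprint(url, selector) != previously_stored_hash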
As of version 4.x, BeautifulSoup lets you plug in different HTML parsers, notably lxml, which is much faster; and if you need XPath you can parse the page with lxml itself, which exposes an xpath() method. Either way, this will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficiently), you can use CSS selectors. These are more flexible because they don't depend on the content being in the same place; of course, this assumes the content you're interested in keeps the same CSS classes and attributes.
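A tiny sketch contrasting the two lookups (the HTML string is made up for the example):
from bs4 import BeautifulSoup
from lxml import html as lxml_html

page = "<div class='post'><span class='author'>alice</span></div>"

# CSS selector via BeautifulSoup (with lxml as the underlying parser):
soup = BeautifulSoup(page, "lxml")
author_css = soup.select_one("span.author").get_text()

# XPath via lxml directly:
tree = lxml_html.fromstring(page)
author_xpath = tree.xpath("//span[@class='author']/text()")[0]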
Hope this helps!

Scraping and parsing with Python - lxml with long Xpaths

I am loading and scrolling on dynamically loading pages. An example is the Facebook "wall", which only loads the next items once you have scrolled to somewhere near the bottom.
I scroll until the page is veeeery long, then I copy the source code, save it as a text file and go on to parsing it.
I would like to extract certain parts of the webpage. I have been using the lxml module in python, but with limited success. On there website they only show examples with pretty short Xpaths.
Below is an example of the function and a path that gets me the user names included on the page.
usersID = elTree.xpath('//a[@class="account-group js-account-group js-action-profile js-user-profile-link js-nav"]')
This works fairly well; however, I am getting some errors (see another post of mine), such as:
TypeError: 'NoneType' object has no attribute '__getitem__'
I have also been looking at the XPaths that Firebug provides. These are of course much longer and very specific. Here is an example for a recurring element on the page:
/html/body/div[2]/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div/div/div[2]/ol[1]/li[26]/ol/li/div/div[2]/p
The part towards the end li[26] shows it is the 26th item in a list of the same element, which are found at the same level of the HTML tree.
I would like to know how I might use such Firebug XPaths with the lxml library, or if anybody knows of a better way to use XPaths in general.
Using example HTML code and tools like this for test purposes, the Xpaths from Firebug don't work at all. Is that path just ridiculous in people's experience?
Is it very specific to the source code? Are there any other tools like Firebug that produce more reliable output for use with lxml?
Firebug actually generates really poor XPaths. They are long and fragile because they're incredibly non-specific beyond hierarchy.
Pages today are incredibly dynamic.
The best way to work with xpath on dynamic pages is to locate common elements as the hook and perform xpath ops from those as your path root.
What I mean here by common elements is stable structural elements that are highly likely or guaranteed to be present. Pick the one closest to your target in terms of containment hierarchy. Shorter paths are faster and clearer.
From there you need to create paths that locate some specific unique attribute or attribute value on the target element.
Sometimes that's not possible so another strategy is to target the closest uniquely identifiable container element then get all elements similar to yours under that and iterate them looking for your goal.
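As a hedged sketch of that hook-element strategy with lxml (the HTML, id and class names below are illustrative, loosely modelled on the question):
from lxml import html

# In practice this would be the saved page source; a tiny fragment for illustration:
page_source = """
<div id="stream-items">
  <ol>
    <li><a class="account-group js-user-profile-link" title="alice">alice</a></li>
    <li><a class="account-group js-user-profile-link" title="bob">bob</a></li>
  </ol>
</div>
"""
tree = html.fromstring(page_source)

# 1. Locate a stable container (the "hook") first.
container = tree.xpath("//div[@id='stream-items']")[0]

# 2. Run short, relative XPath expressions from that root.
names = container.xpath(".//a[contains(@class, 'js-user-profile-link')]/@title")
print(names)  # ['alice', 'bob']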
Highly dynamic pages require sophisticated and dynamic approaches.
Facebook changes a lot and will require script maintenance frequently.
I found two things which, together, worked very well for me.
The first thing:
The lxml package allows the use of XPath functions in your expressions. I used the starts-with function, as follows:
tweetID = elTree.xpath("//div[starts-with(@class, 'the-non-varying-part-of-a-very-long-class-name')]")
When exploring the HTML code (tree) using tools such as Firebug/Firepath, everything is shown nice and neatly - for example:
[Firebug screenshot: the element's class attribute displayed as a single tidy line]
When I used the highlighted class value, i.e. tweet original-tweet js-original-tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable has-cards has-native-media with-media-forward media-forward cards-forward, to search my elTree with the code above, nothing was found.
Having a look at the actual HTML file I was trying to parse, I saw that the class value was actually spread over many lines, which explains why the lxml package was not finding it with my search.
The second thing:
I know this is not generally recommended as a workaround, but the Python approach of "easier to ask for forgiveness than permission" applied in my case: the next thing I did was to use a Python try/except around the TypeError that I kept getting at seemingly arbitrary lines of my code.
This may well be specific to my code, but after checking the output in many cases, it seems to have worked well for me.
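For what it's worth, a minimal sketch of that pattern (the HTML fragment and selectors are illustrative, not from the original script):
from lxml import html

elTree = html.fromstring("<ol><li><p class='tweet-text'>hello</p></li><li></li></ol>")

texts = []
for item in elTree.xpath("//li"):
    try:
        texts.append(item.xpath(".//p[@class='tweet-text']/text()")[0])
    except (TypeError, IndexError):
        # Skip items that don't have the expected structure instead of aborting the run.
        continue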

Stripping irrelevant parts of a web page

Is there a API or systematic way of stripping irrelevant parts of a web page while scraping it via Python? For instance, take this very page -- the only important part is the question and the answers, not the side bar column, header, etc. One can guess things like that, but is there any smart way of doing it?
There's the approach from the Readability bookmarklet, with at least two Python implementations available:
decruft
python-readability
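A minimal sketch with python-readability (assuming the package installed as readability-lxml; the URL is a placeholder):
import requests
from readability import Document

html = requests.get("https://example.com/article", timeout=10).text
doc = Document(html)
title = doc.short_title()
main_html = doc.summary()  # HTML of the main content with the boilerplate stripped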
In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (useless) or dynamic (useful).
This field is known as wrapper induction. Unfortunately it is harder than it sounds!
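As a very rough sketch of that comparison idea (a crude heuristic with BeautifulSoup, nowhere near full wrapper induction):
from collections import defaultdict
from bs4 import BeautifulSoup

def texts_by_path(html):
    # Map each element's tag path (e.g. "html > body > div > p") to its text.
    soup = BeautifulSoup(html, "html.parser")
    mapping = defaultdict(list)
    for el in soup.find_all(True):
        ancestors = [p.name for p in reversed(list(el.parents)) if p.name != "[document]"]
        path = " > ".join(ancestors + [el.name])
        text = el.get_text(" ", strip=True)
        if text:
            mapping[path].append(text)
    return mapping

def dynamic_paths(html_a, html_b):
    # Paths whose text differs between two pages built from the same template are
    # likely to hold page-specific (useful) content; identical paths are probably
    # boilerplate such as headers, footers and sidebars.
    a, b = texts_by_path(html_a), texts_by_path(html_b)
    return {path for path in a.keys() & b.keys() if a[path] != b[path]}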
This GitHub project solves your problem, but it's in Java. It may be worth a look: goose

What web programming languages are capable of making this web app?

I'm exploring many technologies, but I would like your input on which web framework would make this the easiest or most feasible. I'm currently looking at JSP/JSF/PrimeFaces, but I'm not sure if that stack is capable of building this app.
Here's a basic description of the app:
Users log in with their username and password (maybe I can somehow incorporate OpenID?).
With a really nice UI, they will be presented a large list of questions specific to a certain category, for example, "Cooking". (I will manually compile this list and make it available.)
When they click on any of these questions, a little input box opens up below it to allow the user to put in a link/URL.
If the link they enter has the same question on that webpage the URL points to, they will be awarded one point. This question then disappears and gets added to a different page that has a list of all correctly linked questions.
On the right side of the screen, there will be a leaderboard with the usernames of the people with the top ten points.
The idea is relatively simple - to be able to compile links to external websites for specific questions by allowing many people to contribute.
I know I can build the UI easily with PrimeFaces. What I'm not sure is if JSP/JSF gives the ability to parse HTML at a certain URL to see if it contains words. I can do this with Python easily by using urllib, but I can't use Python for web GUI building (it is very difficult). What is the best approach?
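For reference, the Python check described here could look roughly like this (the URL and question text are placeholders):
from urllib.request import urlopen

def page_contains(url, question_text):
    # True if the HTML at the submitted URL contains the question text.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    return question_text.lower() in html.lower()

print(page_contains("https://example.com/recipes", "How do I poach an egg?"))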
Any help would be appreciated!!! Thanks!
The best approach is whatever is best for you. If Python isn't your strength but Java is, then use Java. If you're a Python expert and know little Java, use Python.
There are so many resources on the Internet supporting so many platforms that the decision really comes down to what works best for you.
For starters, forget about JSP/JSF. This is an old combination that had many problems. Please consider Facelets/JSF. Facelets is the default templating language in the current version of JSF, while JSP is there only for backwards compatibility.
What I'm not sure is if JSP/JSF gives the ability to parse HTML at a certain URL to see if it contains words.
Yes it does, although the actual fetching of data and parsing of its content will be done by plain Java code. This itself has nothing to do with the JSF APIs.
With JSF you create a Facelet containing your UI (input fields, buttons, etc.). Then, still using JSF, you bind this to a so-called backing bean, which is essentially a normal Java class with only one or two JSF-specific annotations applied to it (e.g. @ManagedBean).
When the user enters the URL and presses some button, JSF takes care of calling some action method in your Java class (backing bean). In this action method you now have access to the URL the user entered, and from here on plain Java coding starts and JSF specifics end. You can put the code that fetches the URL and does the parsing you require in a separate helper class (separation of concerns), or at your discretion directly in the backing bean. The choice is yours.
Incidentally we had a very junior programmer at our office use JSF for something not unlike what you are requesting here and he succeeded in doing it in a short time. It thus really isn't that hard ;)
No web technology does what you want. Parsing documents found at certain URLs is out of the scope of building web interfaces.
However, each of Java's web technologies will give you, without limits, access to a rich and varied (if not too rich and much too varied) set of libraries and frameworks running on JVM. You could safely say that if there is a library for doing something, there will be a Java version available. Downloading and parsing a document will not require more than what is available in the standard library (unless you insist on injecting your dependencies or crosscutting your concerns), so no problems with doing your project with JSP, or JSF/Primefaces, or whatever.
Since you claim to already know Python, and since you will have to add some HTML/CSS anyway, I suggest you try Django. It's dead simple, has a set of OpenID plugins to choose from, and will give you an admin interface for free (so you can prime the pump with the first set of links).
