I am validating some text on a webpage. There are two pieces of text that should be mutually exclusive, i.e. only one of them should be visible at any time.
The element xpath=//*[@id="study_info"] holds either the text "No records available" or "Showing page x of x", depending on the records, and only one of these texts shows on the page: if there are no records it shows "No records available", otherwise it shows "Showing page x of x".
When I try to validate this in Robot Framework, it finds both texts at the same time, although I can see only one. I do not know what is happening here. The code below should fail, since the texts are mutually exclusive and only one of them is visible, but it passes without any problem.
Page Should Contain    No records available
Page Should Contain    Showing page
Page Should Contain Element    xpath=//*[@id="study_info"]
The full markup for the element is:
<div class="dataTables_info" id="study_info" role="status" aria-live="polite">No records available</div>
I need to understand what is happening and how to fix it.
Depending on the visualization (JS) framework used, the text that is currently NOT visible may very well still be in the HTML - the div holding it may be hidden, kept in a stack of possible values, etc., ready to replace the visible element as needed.
The keyword Page Should Contain searches the whole current HTML - it actually uses the locator xpath=//*[contains(., %s)], where %s is the string you're after - and if any element (hidden, overlaid, or only present in the JS source) contains the text, it returns true.
To solve your particular case, I'd suggest a slightly different approach - get the text value of that element and assert it is the expected one:
${locator}=    Set Variable    xpath=//*[@id="study_info"]
${current text}=    Get Text    ${locator}
Should Contain    ${current text}    Showing page
Should Not Contain    ${current text}    No records available    # and vice-versa
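If you want the mutual exclusion itself asserted (not just one direction), the check boils down to "exactly one marker present". A minimal pure-Python sketch of that logic, with the element text passed in as a plain string (the function name is illustrative):

```python
def assert_exactly_one(text, markers=("No records available", "Showing page")):
    """Check that exactly one of the mutually exclusive markers is present."""
    hits = [m for m in markers if m in text]
    if len(hits) != 1:
        raise AssertionError(f"expected exactly one of {markers}, found {hits}")
    return hits[0]

# The visible element's text contains only one marker, so the check passes:
print(assert_exactly_one("Showing page 1 of 3"))  # Showing page
```

In Robot Framework this would be the text fetched by Get Text, fed to a keyword or Evaluate call built along these lines.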
I am working on an automation script for a third-party website which manages credentials for a program. Within the script, I can successfully log into the website, place a search variable in the [Search] field, and filter the results in the table. However, I cannot figure out how to iterate over the results that appear in the table. The data returned for the rows appears to be a dump of the entire table, not just the filtered results shown in the browser.
I'm attempting to come up with the following:
If there are no matches, exit function
If there is at least 1 match, iterate over the rows and load the [Key Name] value into a variable
I have reviewed the locators documentation from Playwright, but the Lists section does not appear to work correctly with the code I have.
def deactivate_current_license(page, email_address):
    print(f"Searching for currently assigned licenses for {email_address}")
    page.locator('text=Search: >> input[type="search"]').fill(email_address)
    page.locator('text=Search: >> input[type="search"]').press("Enter")
    # Locate elements; this locator points to a list.
    rows = page.locator('table[id="tbl_Customer_Asset__c"]')
    # Pattern 3: resolve the locator to elements on the page and map them to their text content.
    # Note: the code inside evaluate_all runs in the page, so any DOM APIs can be called there.
    texts = rows.evaluate_all("list => list.map(element => element.textContent)")
Here are images of how the table from the site appears with 3 different scenarios:
Based on your snippet, there's a clear issue on the w3schools page: elements have a style="display: none" attribute set when they're not part of the search results. This hides them from the user's view but doesn't remove their text from the DOM. Your scraping logic ignores this and simply pulls all of the text content, regardless of whether it sits in a hidden element.
In addition to using the :visible pseudo-selector, select the tr elements rather than the whole table. For example:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    url = "https://www.w3schools.com/howto/howto_js_filter_table.asp"
    page.goto(url, wait_until="domcontentloaded")
    page.locator("#myInput").fill("t")
    page.locator("#myInput").press("Enter")
    table = (
        page.locator("#myTable tr:visible")
        .evaluate_all("""els => els.slice(1).map(el =>
            [...el.querySelectorAll('td')].map(e => e.textContent.trim())
        )""")
    )

    for row in table:
        print(row)

    browser.close()
Output:
['Alfreds Futterkiste', 'Germany']
['Island Trading', 'UK']
['Magazzini Alimentari Riuniti', 'Italy']
['North/South', 'UK']
['Paris specialites', 'France']
However, this might not extrapolate cleanly to your main problem. There are at least a few other things that could be responsible for the failure on the other page.
There may be other visibility factors at play, so I would use the inspector to see whether those elements are being hidden: set to display: none (as with the w3schools example above), removed completely, or something else (like a width/height-0 hack; unlikely, but who knows) upon search.
The most likely scenario is that the search triggers an asynchronous HTTP request. You might need to wait for the table contents to change or the HTTP response to arrive before attempting to select the data from the DOM. You can verify this by looking at the network tab in the browser developer tools. Often, you can intercept the response and avoid messing with the DOM entirely. wait_for_function and wait_for_response can be useful here.
Although it's not recommended for the final script, setting a temporary, generous timeout of 5-10 seconds helps give the data a chance to arrive. Once you've observed the general behavior, you can tighten the predicate with a specific function or response and drop the timeout.
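Under the hood, wait_for_function is essentially polling a predicate until a deadline. A framework-agnostic sketch of that retry loop (the function name and the simulated row counts are made up for illustration):

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Example: wait until a (mock) row count changes after a search.
counts = iter([0, 0, 5])  # simulates rows arriving after an async request
print(wait_for(lambda: next(counts)))  # 5
```

With Playwright you'd let the library do this for you, passing a JS predicate that checks the table contents, rather than rolling your own loop.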
If this doesn't get you unstuck, I would try to create a simple page that captures the table behavior better than w3schools. You can mock API request delay with setTimeout. This might be tricky, but minimizing (isolating) the problem is usually 90% of debugging work.
I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands print empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span /span[@class='company']/text() to get the company name
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag; but we can just select any descendant text with //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
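lxml may not be at hand to try this, but the stdlib's xml.etree shows the same distinction between an element's direct text and its descendant text, on a miniature stand-in for the jobHeader markup (the snippet here is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for the jobHeader markup.
snippet = '<div><b class="jobtitle"><font>Janitor-A (Scattered Sites)</font></b></div>'
b = ET.fromstring(snippet).find('b')

# b.text is None: the title lives in the nested <font>, not directly in <b>.
print(b.text)                 # None

# Gathering all descendant text, which is what //text() or text_content() does:
print(''.join(b.itertext()))  # Janitor-A (Scattered Sites)
```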
Despite your assumption, it seems that the content on the page is loaded dynamically, and is thus not present in the initial HTML.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try looking for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.
I'm very confused by getting text using Selenium.
There are span tags with some text inside them. When I search for them using driver.find_element_by_..., everything works fine.
But the problem is that I can't get the text out of them.
The span tag is definitely found, because when I use the .get_attribute('outerHTML') command I can see this:
<span class="branding">ThrivingHealthy</span>
But if I change .get_attribute('outerHTML') to .text it returns empty text which is not correct as you can see above.
Here is the example (the outputs are pieces of a dictionary):
display_site = element.find_element_by_css_selector('span.branding').get_attribute('outerHTML')
'display_site': u'<span class="branding">ThrivingHealthy</span>'
display_site = element.find_element_by_css_selector('span.branding').text
'display_site': u''
As you can clearly see, there is text, but .text does not find it. What could be wrong?
EDIT: I've found a kind of workaround: I've just changed .text to .get_attribute('innerText').
But I'm still curious why it works this way?
The problem is that there are a LOT of tags that are fetched by span.branding. When I queried that page using find_elements (plural), it returned 20 tags. Each tag seems to be doubled... I'm not sure why, but my guess is that one set is hidden while the other is visible; from what I can tell, the first of each pair is hidden. That's probably why you aren't able to pull text from it: Selenium is designed not to interact with elements that a user can't interact with, which is likely why you can find the element but get nothing when you try to pull its text.
Your best bet is to pull the entire set with find_elements and then loop through it getting the text. You will loop through all 20 elements and only get text from 10, but you'll still end up with the full set of visible strings. It's weird, but it should work.
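The loop-and-filter idea can be sketched with stand-in objects; a real script would call find_elements_by_css_selector('span.branding') and read each element's .text, which comes back empty for the hidden copies:

```python
class FakeElement:
    """Stand-in for a Selenium WebElement: hidden elements report empty text."""
    def __init__(self, text, visible):
        self._text, self.visible = text, visible

    @property
    def text(self):
        return self._text if self.visible else ''

# Simulate the doubled tags: a hidden copy followed by a visible one.
elements = [FakeElement('ThrivingHealthy', False),
            FakeElement('ThrivingHealthy', True)]

# Keep only the non-empty strings, i.e. the text of the visible copies.
texts = [el.text for el in elements if el.text]
print(texts)  # ['ThrivingHealthy']
```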
I need help in extracting data from : http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low
Using the filter, there are about 4 pages of data (Under rice crops) in tables I need to store.
I'm not quite sure how to proceed with it; I've been reading up on all the documentation I can find. For someone who has just started Python, I'm very confused at the moment. Any help is appreciated.
Here's a code snippet I'm basing it on:
Example website : http://www.uscho.com/rankings/d-i-mens-poll/
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
I can't seem to understand any of the code above. I've only understood that the URL is being read. :(
Thank you for any help!
Just like we have CSS selectors like .window or #rankings, xpath is used to navigate through elements and attributes in XML.
So in the for loop, you're first searching for an element called section, with the condition that it has an attribute id whose value is rankings. But remember, you are not done yet: this section also contains the heading "Final USCHO.com Division I Men's Poll", the date, and other elements besides the table. As it happens there is only one such element, so this loop runs only once. Inside it, you're extracting the text (everything within the tags) of h1 (the heading) and h3 (the date).
The next part extracts the tag called table, with a condition on each row's class: it can be even or odd. Since you need all the rows in this table anyway, that filter is not doing anything useful here.
You could replace the line
for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
with
for row in section.xpath('table/tr'):
Now, inside that loop, row.xpath('td') returns each td element - each cell in that row. When you iterate over them, you'll receive multiple cell elements, e.g. one each for 1, Providence, 49, 26-13-2, 997, 15 - check the first line of the table on the webpage.
Try this for yourself. Replace the last loop block with this much easier-to-read alternative:
for row in section.xpath('table/tr'):
    print row.xpath('td//text()')
You will see that it presents all the table data as Pythonic lists, each list item containing one cell. Your code is just a fancier way of writing these list items, converted into a string with spaces between them. The xpath() method returns objects of Element type, which are representations of each XML/HTML element, while xpath('something//text()') produces the actual text content within that tag.
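To see the same row/cell shape without fetching anything, here is the idea replayed on a tiny inline table with made-up data, using the stdlib's xml.etree in place of lxml (findall and itertext stand in for the xpath calls):

```python
import xml.etree.ElementTree as ET

# A tiny inline table (made-up data) standing in for the rankings page.
html = """
<table>
  <tr><td>1</td><td>Providence</td><td>49</td></tr>
  <tr><td>2</td><td>Boston College</td><td>48</td></tr>
</table>
"""
table = ET.fromstring(html)

# The same shape as section.xpath('table/tr') plus row.xpath('td//text()'):
for row in table.findall('tr'):
    print([''.join(td.itertext()) for td in row.findall('td')])
```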
Here are a few helpful references:
Easy-to-understand tutorial: http://www.w3schools.com/xpath/xpath_examples.asp
Stack Overflow question: Extract text between tags with XPath including markup
Another tutorial: http://www.tutorialspoint.com/xpath/
I asked my previous question here:
Xpath pulling number in table but nothing after next span
This worked and I managed to see the number I wanted in a Firefox plugin called XPath Checker; the results are shown below.
So I know I can find this number with this XPath, but when I try to run a Python script to find and save the number, it says it cannot find it.
try:
    views = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']/preceding-sibling::text()")
except NoSuchElementException:
    print "NO views"
    views = 'n/a'
    pass
I know that pass is not best practice, but I am just testing this at the moment, trying to find the number. I'm wondering if I need to change something at the end of the XPath, like adding .text, as the XPath checker normally shows results a little differently, like below:
I needed to use the XPath I gave rather than the one used in the picture above, because I only want the number and not the date. You can see part of the source in my previous question.
Thanks in advance! Scratching my head here.
The XPath used in find_element_by_xpath() has to point to an element - not a text node and not an attribute. This is the critical thing here.
The easiest approach here would be to:
get the td's text (parent)
get the span's text (child)
remove child's text from parent's
Code:
span = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']")
td = span.find_element_by_xpath('..')
views = td.text.replace(span.text, '').strip()
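The subtraction step itself is plain string manipulation; with made-up texts standing in for what .text would return from the td and the span, the idea looks like this:

```python
# Hypothetical texts: the parent cell holds the view count plus the child
# span's "added" date; Selenium's .text would return strings like these.
td_text = "1,234 views Added on 12/04/2014"
span_text = "Added on 12/04/2014"

# Remove the child's text from the parent's, leaving only the number part.
views = td_text.replace(span_text, '').strip()
print(views)  # 1,234 views
```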