How do I ensure there are table rows, and iterate with Playwright - python

I am working on an automation script for a third-party website which manages credentials for a program. Within the script, I can successfully log into the website, place a search value in the [Search] field, and filter the results in the table on the website. However, I cannot figure out how to iterate over the results that appear in the table. The data returned from the rows appears to be a dump of the entire table, not just the filtered results shown in the browser.
I'm attempting to come up with the following:
If there are no matches, exit function
If there is at least 1 match, iterate over the rows and load the [Key Name] value into a variable
I have reviewed the locators documentation from Playwright, but the Lists section does not appear to work correctly with the code I have.
def deactivate_current_license(page, email_address):
    print(f"Searching for currently assigned licenses for {email_address}")
    page.locator('text=Search: >> input[type="search"]').fill(email_address)
    page.locator('text=Search: >> input[type="search"]').press("Enter")
    # Locate elements; this locator points to a list.
    rows = page.locator('table[id="tbl_Customer_Asset__c"]')
    # Pattern 3: resolve the locator to elements on the page and map them to their text content.
    # Note: the code inside evaluate_all runs in the page, so you can call any DOM APIs there.
    texts = rows.evaluate_all("list => list.map(element => element.textContent)")
Here are images of how the table from the site appears with 3 different scenarios:

Based on your snippet, there's a clear issue visible on the w3schools page: rows that aren't part of the search results get a style="display: none" attribute. This hides them from the user's view but doesn't remove their text from the DOM. Your scraping logic ignores the distinction and simply pulls all of the text contents, whether an element is hidden or not.
In addition to using the :visible pseudo-class, select the tr elements rather than the whole table. For example:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    url = "https://www.w3schools.com/howto/howto_js_filter_table.asp"
    page.goto(url, wait_until="domcontentloaded")
    page.locator("#myInput").fill("t")
    page.locator("#myInput").press("Enter")
    table = (
        page.locator("#myTable tr:visible")
        .evaluate_all("""els => els.slice(1).map(el =>
            [...el.querySelectorAll('td')].map(e => e.textContent.trim())
        )""")
    )
    for row in table:
        print(row)
    browser.close()
Output:
['Alfreds Futterkiste', 'Germany']
['Island Trading', 'UK']
['Magazzini Alimentari Riuniti', 'Italy']
['North/South', 'UK']
['Paris specialites', 'France']
However, this might not extrapolate cleanly to your main problem. There are at least a few other things that could be responsible for the failure on the other page.
There may be other visibility factors at play, so I would use the inspector to see whether those elements are being hidden, set to display: none (as with the w3schools example above), removed completely, or something else (like a zero width/height hack: unlikely, but who knows) upon search.
The most likely scenario is that the search triggers an asynchronous HTTP request. You might need to wait for the table contents to change or the HTTP response to arrive before attempting to select the data from the DOM. You can verify this by looking at the network tab in the browser developer tools. Often, you can intercept the response and avoid messing with the DOM entirely. wait_for_function and wait_for_response can be useful here.
Although it's not recommended for the final script, setting a temporary timeout of a generous 5-10 seconds helps give the data a chance to arrive. Once you've observed the general behavior, you can tighten the predicate to a specific function or response and drop the timeout.
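For instance, the predicate passed to wait_for_response/expect_response is plain Python, so it can be sketched and sanity-checked on its own. The /search path and the query-in-URL check below are assumptions; replace them with whatever the network tab actually shows:

```python
def is_search_response(url: str, status: int, query: str) -> bool:
    """Return True for the (hypothetical) search endpoint carrying our query."""
    return "/search" in url and status == 200 and query in url

# Usage inside the Playwright script (sketch):
# with page.expect_response(
#     lambda r: is_search_response(r.url, r.status, email_address),
#     timeout=10_000,  # generous while exploring; tighten later
# ) as response_info:
#     page.locator('text=Search: >> input[type="search"]').press("Enter")
# response = response_info.value
```

Pressing Enter inside the expect_response block ensures the response can't arrive before the wait is registered.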
If this doesn't get you unstuck, I would try to create a simple page that captures the table behavior better than w3schools. You can mock API request delay with setTimeout. This might be tricky, but minimizing (isolating) the problem is usually 90% of debugging work.

Related

How to write xpath expression for data-extracting with visibility_of_all_elements_located?

There is a table whose XPath is .//table[@id='target'] in the target webpage, and I want to get all the data in the table (all the text in the td elements in the table).
Should I write the wait.until statement as
wait.until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@id='target']")))
or
wait.until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@id='target']//td")))
?
Both commands will NOT give you what you are looking for.
visibility_of_all_elements_located will NOT really wait for the visibility of ALL the elements that will eventually match the passed locator.
It checks only the elements present in the DOM at the moment it polls: once every element found so far is visible, it returns, even if more rows are still being added.
So, to make sure all the elements are visible, you will have to add some sleep after that command.
Also, I think waiting for the visibility of the table's internal elements is better than waiting for the visibility of the table element itself.
So, I would use something like this:
wait.until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@id='target']//td")))
time.sleep(1)
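A fixed sleep works, but a more robust variant polls until the element count stops changing. This generic helper is a sketch, not part of the original answer; the Selenium call in the usage comment is only an example:

```python
import time

def wait_until_stable(get_count, interval=0.5, rounds=3, timeout=10.0):
    """Poll get_count() until it returns the same value `rounds` times in a row."""
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while time.monotonic() < deadline:
        current = get_count()
        streak = streak + 1 if current == last else 1
        if current is not None and streak >= rounds:
            return current
        last = current
        time.sleep(interval)
    raise TimeoutError("element count never stabilized")

# Usage with Selenium (sketch):
# count = wait_until_stable(
#     lambda: len(driver.find_elements(By.XPATH, ".//table[@id='target']//td")))
```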
The td elements are basically not direct children of the table but descendants, so:
.//table[@id='target']/descendant::td
should be the right XPath.
all_table_data = wait.until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@id='target']/descendant::td")))
all_table_data is a list containing all the matched web elements. Print them like below, and that should give you all the data available in the Selenium viewport:
for data in all_table_data:
    print(data.text)

Get a page with Selenium but wait for unknown element value to not be empty

Context
This is a repost of Get a page with Selenium but wait for element value to not be empty, which was closed without any valid justification so far as I can tell.
The linked answers in the closure reasoning both rely on knowing what the expected text value will be. In each answer, it explicitly shows the expected text hardcoded into the WebDriverWait call. Furthermore, neither of the linked answers even remotely touch upon the final part of my question:
[whether the expected conditions] come before or after the page Get
"Duplicate" Questions
How to extract data from the following html?
Assert if text within an element contains specific partial text
Original Question
I'm grabbing a web page using Selenium, but I need to wait for a certain value to load. I don't know what the value will be, only what element it will be present in.
It seems that using the expected condition text_to_be_present_in_element_value or text_to_be_present_in_element is the most likely way forward, but I'm having difficulty finding any actual documentation on how to use these and I don't know if they come before or after the page Get:
webdriver.get(url)
Rephrase
How do I get a page using Selenium but wait for an unknown text value to populate an element's text or value before continuing?
I'm sure my answer is not the best one, but here is part of my own code, which helped me with a problem similar to yours.
In my case the trouble was the loading time of the DOM: sometimes it took 5 seconds, sometimes 1, and so on.
url = 'www.somesite.com'
browser.get(url)
Because browser.implicitly_wait(7) was not enough in my case, I made a simple for loop to check whether the content had loaded.
# some code...
for try_html in range(7):
    # Make 7 tries to check if the element is loaded.
    browser.implicitly_wait(7)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.find_all('script', type='application/ld+json')
    # If 'sku' is not found in the html page, we skip
    # to another loop iteration; otherwise we break out of
    # the tries and scrape the page.
    if 'sku' not in html:
        continue
    else:
        scrape(raw_data)
        break
It's not perfect, but you can try it.
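Since WebDriverWait.until accepts any callable that takes the driver, the "unknown text" condition from the question can also be written directly instead of polling page_source by hand. This is a sketch, not part of the original answer; "price" is a placeholder locator:

```python
def text_is_nonempty(locator):
    """Custom expected condition: truthy once the element's text or value is non-empty."""
    def _predicate(driver):
        el = driver.find_element(*locator)
        text = (el.text or el.get_attribute("value") or "").strip()
        return el if text else False
    return _predicate

# Usage (the Get always comes first, then the wait):
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# driver.get(url)
# element = WebDriverWait(driver, 10).until(text_is_nonempty((By.ID, "price")))
```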

Unable to get all children (dynamic loading) selenium python

This question has already been answered, and one of the easiest ways is to get the tag name, if already known, within the element:
child_elements = element.find_elements_by_tag_name("<tag name>")
However, for the element pasted below, only 9 out of 25 instances of the tag name are returned. I am a novice in JavaScript, and thus I am not able to zero in on the reason. In this example, I am trying to get the dt tags within the ol element. The code snippet I am using for that is:
par_element = browser.find_element_by_class_name('search-results__result-list')
child_elements = par_element.find_elements_by_tag_name("dt")
The element skeleton/structure from the page source is shown in the image below
(the structure is the same for all the div tags; one is expanded as an example).
I have also tried getting the class name result-lockup__name directly, and it still returns only 9 out of the 25 instances. What could be the reason?
EDIT
Initially, all the elements were not loaded, and thus I had to scroll through the page with
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
When the problem occurred once again and I was not able to figure it out, I raised this question. Apparently, even the scroll is not helping, as certain elements stay hidden.
After manually scrolling through them again, keeping the code in pause, I was able to "enable" them.
Is this a type of mask to save sites from being scraped? I feel now that I would probably have to scroll up in increments to reveal them all, but is there a smarter way?
The elements are loading dynamically, so you need to scroll the page slowly to get all the child elements. Try the code below; hopefully it will work. This is just a workaround.
element_list = []
while True:
    browser.find_element_by_tag_name("body").send_keys(Keys.DOWN)
    time.sleep(2)
    listlen_before = len(element_list)
    par_element = browser.find_element_by_class_name('search-results__result-list')
    child_elements = par_element.find_elements_by_tag_name("dt")
    for ele in child_elements:
        if ele.text in element_list:
            continue
        else:
            element_list.append(ele.text)
    listlen_after = len(element_list)
    if listlen_before == listlen_after:
        break
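As a minor refinement of the loop above (not a behavior change), the membership test ele.text in element_list rescans the whole list for every element; a set makes the same check O(1) and gives a natural stopping condition. A sketch:

```python
def collect_new_texts(seen, texts):
    """Add unseen, non-empty texts to `seen`; return True if anything new appeared."""
    before = len(seen)
    seen.update(t for t in texts if t)
    return len(seen) > before

# Inside the scroll loop, stop scrolling once a pass adds nothing new:
# if not collect_new_texts(seen_texts, (el.text for el in child_elements)):
#     break
```

Note that a set does not preserve the order elements appeared on the page; keep the list as well if order matters.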

Extracting HTML tag content with xpath from a specific website

I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands output empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span /span[@class='company']/text() to get the company name
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can just select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
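The recursive behavior of text_content() can be illustrated with the standard library's ElementTree as a stand-in for lxml (the HTML snippet is simplified from the page):

```python
import xml.etree.ElementTree as ET

snippet = "<div><b class='jobtitle'><font>Janitor-A (Scattered Sites)</font></b></div>"
root = ET.fromstring(snippet)
b = root.find(".//b[@class='jobtitle']")

# The <b> element has no direct text; it all sits inside the nested <font>,
# which is why a plain .text access (or /text() on <b>) comes back empty.
assert b.text is None

# Joining every descendant text node recovers the title, as text_content() does.
title = "".join(b.itertext())
```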
Despite your assumption, it seems that the content on the page is loaded dynamically, and is thus not present during loading time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, look for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.

Robot Framework finds text that does not exist

I am validating some text on a webpage. There are two pieces of text that should be mutually exclusive, i.e. only one of them should be visible at any time.
The element xpath=//*[@id="study_info"] retrieves the text "No records available" or "Showing page x of x" depending on the records, and only one of these texts shows on the page: if there is no record, it shows "No records available"; otherwise it shows "Showing page x of x".
When I try to validate these in Robot Framework, it actually finds both texts at the same time, although I can see only one. I do not know what is happening here. The code below should fail, as the texts are mutually exclusive and only one of them is visible, but it passes without any problem.
Page Should Contain    No records available
Page Should Contain    Showing page
Page Should Contain Element    xpath=//*[@id="study_info"]
The full HTML for the element is:
<div class="dataTables_info" id="study_info" role="status" aria-live="polite">No records available</div>
I need to understand what is happening and how to fix it.
Depending on the visualization (JS) framework used, the text that is currently NOT visible may very well still be in the HTML: the div holding it can be hidden, kept in a stack of possible values, etc., and swapped in for the visible element as needed.
The keyword Page Should Contain goes through the whole current HTML. It actually uses the locator xpath=//*[contains(., %s)], where %s is the string you're after, so if any element (hidden, overlaid, or in the JS source) has the text, it will return true.
To solve your particular case, I'd suggest a little bit different approach - get the text value of that element, and assert it's the expected one:
${locator}=    Set Variable    xpath=//*[@id="study_info"]
${current text}=    Get Text    ${locator}
Should Contain    ${current text}    Showing page
Should Not Contain    ${current text}    No records available    # and vice-versa

Categories

Resources