How do I get the visible text of a web page with Selenium WebDriver, without the HTML tags?
I need something equivalent to the function HtmlPage.asText() from HtmlUnit.
It is not enough to fetch the source with WebDriver.getPageSource() and parse it with jsoup, because the page may contain elements hidden by external CSS, and I am not interested in those.
Select the body element with By.tagName("body") (or another selector that matches the top-level element), then call getText() on that element; it returns all of the visible text.
I can help you with Selenium in C#.
With the code below you can read all the text on the current page and save it to a text file at a location of your choice.
Make sure you have these using directives:
using System;
using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
After navigating to the page, try this code.
// Read the rendered text of the whole page from the <body> element
IWebElement body = driver.FindElement(By.TagName("body"));
var result = body.Text;
// Folder location
var dir = @"C:\Textfile" + DateTime.Now.ToShortDateString();
// If the folder doesn't exist, create it
if (!Directory.Exists(dir))
    Directory.CreateDirectory(dir);
// Appends the page text to Copiedtext.txt, creating the file if needed.
File.AppendAllText(Path.Combine(dir, "Copiedtext.txt"), result);
I'm not sure what language you're using, but in C# the IWebElement object has a .Text property. It returns the text that is displayed between the element's opening and closing tags.
I would create an IWebElement using XPath to grab the entire page. In other words, you're grabbing the body element and looking at the text in it.
string pageText = driver.FindElement(By.XPath("/html/body")).Text;
If that does not work, or if you are using the Java bindings, use this:
String yourText = driver.findElement(By.tagName("body")).getText();
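For readers using the Python bindings, the same approach might look like the sketch below. The helper name visible_page_text is made up for illustration, and it assumes a driver that has already loaded the page. It passes the locator strategy as the plain string "tag name", which is the value behind By.TAG_NAME, so the snippet has no imports of its own.

```python
def visible_page_text(driver):
    """Return the text the browser renders inside <body>.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    that has already navigated to the page of interest.
    """
    # "tag name" is the locator strategy string behind By.TAG_NAME.
    body = driver.find_element("tag name", "body")
    # .text holds only the rendered text, so elements hidden by CSS are left out.
    return body.text
```

With a real browser session you would call driver.get(url) first and then visible_page_text(driver).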
This is the website I am dealing with: https://www.bseindia.com/corporates/ann.html?curpg=1&annflag=1&dt=20211021&dur=P&dtto=20211027&cat=Insider%20Trading%20/%20SAST&scrip=&anntype=A
I am able to enter the code in the "security name" field, but to submit it I need to click the dropdown entry that appears after typing the security name. How do I achieve this with Selenium? The code below does not work (it raises StaleElementReferenceException):
security_name = driver.find_element_by_id("scripsearchtxtbx")
security_name.send_keys('INE350H01032')
sec_click = driver.find_element_by_xpath('//*[@id="ulSearchQuote2"]/li')
sec_click.click()
This can also be done with the Keys class in Selenium. send_keys can send not only strings but also special keys such as Escape, Tab or, in this case, Enter. Your updated code should look as follows:
from selenium.webdriver.common.keys import Keys
security_name = driver.find_element_by_id("scripsearchtxtbx")
security_name.send_keys('INE350H01032')
security_name.send_keys(Keys.ENTER)
sec_click = driver.find_element_by_xpath('//*[@id="ulSearchQuote2"]/li')
sec_click.click()
These are called special keys. For more examples and more information, see the Selenium documentation for Keys.
You can play with the XPath and match the suggestion by its text:
security_name = driver.find_element_by_id("scripsearchtxtbx")
security_name.send_keys('INE350H01032')
sec_click = driver.find_element_by_xpath("//ul[@id='ulSearchQuote2']/li//strong[text()='INE350H01032']")
sec_click.click()
This shorter expression matches the same element: ("//ul[@id='ulSearchQuote2']//strong[text()='INE350H01032']")
See an XPath reference for more on this syntax.
I'm trying to scrape a web form for the text in specific fields. I can't use a fixed XPath, because some forms are missing fields, and a missing field is not included in the page at all when it loads (e.g. /html/blah/blah/p[3] may be the initials field on one form but the first name on another). The structure of each field is like this:
<p><strong>Initials:</strong> WT</p>
So with Python Selenium I'm doing
driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]"), which successfully finds the "Initials:" text between the strong tags, but I specifically need the text after it, in this case WT. The element has the DOM property nextSibling.data, which contains the WT value, but from my googling I don't think it is possible to read that property with Python Selenium. Does anyone know a way to get the WT text that follows the element matched by the XPath query?
The 'WT' text is in an awkward spot. It is a bare text node rather than an element, so an element lookup cannot return it directly. The only way I know to grab that text is to use p_element.get_attribute('outerHTML'), which in this instance should return the string '<p><strong>Initials:</strong> WT</p>'. I doubt this is the cleanest solution, but here's a way to parse that text out:
strong_close_tag = '</strong>'
p_close_tag = '</p>'
# "/.." steps from the matched <strong> up to its parent <p>
p_element = driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]/..")
text = p_element.get_attribute('outerHTML')
print(text[text.index(strong_close_tag) + len(strong_close_tag):text.index(p_close_tag)])
OR use p_element.get_attribute('innerHTML'), which should return just <strong>Initials:</strong> WT. Then, similarly, grab the text after the </strong> closing tag, maybe like this:
p_element = driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]/..")
print(p_element.get_attribute('innerHTML').split("</strong>", 1)[1])
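The string handling can be tried without a browser. The sketch below works on the innerHTML string directly; value_after_label is a hypothetical helper name, and it assumes the value is plain text following the first closing tag, as in the sample markup from the question.

```python
from html import unescape


def value_after_label(inner_html, closing_tag="</strong>"):
    """Return the text that follows the first `closing_tag`, or None if absent.

    Hypothetical helper: `inner_html` is the string returned by
    get_attribute('innerHTML') on the enclosing <p> element.
    """
    _, found, tail = inner_html.partition(closing_tag)
    if not found:
        # The label markup is missing, so there is nothing to extract.
        return None
    # Decode entities such as &amp; and drop the surrounding whitespace.
    return unescape(tail).strip()


print(value_after_label("<strong>Initials:</strong> WT"))
```

Returning None for a missing label lets the caller skip fields that a given form does not include.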
this problem is really driving me crazy! Here's my code:
list_divs = driver.find_elements_by_xpath("//div[@class='myclass']")
print(f'Number of divs found: {len(list_divs)}')  # Correct number displayed
for art in list_divs:
    mybtn = art.find_elements_by_xpath('//button')  # There are 2 buttons in each div
    print(f'Number of buttons found = {len(mybtn)}')  # Incorrect number (129 instead of 2)
    mybtn[2].click()  # Wrong button clicked!
The button clicked IS NOT inside the HTML of art but at the very beginning of the web page! It seems Selenium is searching the whole document instead of the web element art.
I've printed the outerHTML of art and it is correct: only the div, which contains 2 buttons. So why does find_elements_by_xpath(), applied to the web element art, search the whole HTML page rather than the div?
Totally incomprehensible to me!
Because you are using mybtn = art.find_elements_by_xpath('//button'). An XPath that starts with // is evaluated from the document root, so it ignores the search context (art). Prefix it with a dot to make it relative to art:
mybtn = art.find_elements_by_xpath('.//button')
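The scoping rule can be checked offline. In the sketch below, Python's xml.etree.ElementTree stands in for the browser DOM, and the tiny page and the buttons_per_div helper are invented for illustration. ElementTree rejects absolute paths on an element, so the document-wide search is run from the root instead.

```python
import xml.etree.ElementTree as ET

# Made-up page: one stray button plus two divs holding two buttons each.
PAGE = """
<html><body>
  <button>menu</button>
  <div class="myclass"><button>edit 1</button><button>delete 1</button></div>
  <div class="myclass"><button>edit 2</button><button>delete 2</button></div>
</body></html>
""".strip()


def buttons_per_div(root):
    """Return the button labels found inside each div with class 'myclass'."""
    result = []
    for div in root.findall(".//div[@class='myclass']"):
        # The leading dot anchors the search at `div`, not at the document.
        result.append([button.text for button in div.findall(".//button")])
    return result


root = ET.fromstring(PAGE)
all_buttons = root.findall(".//button")  # searched from the root: every button
scoped = buttons_per_div(root)
print(len(all_buttons), scoped)
```

Searching from the root counts the stray button as well, while each per-div search returns only that div's own two buttons.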
I can't post any HTML code (the page is about 1,000 lines long).
So far, the only workaround I have found is to avoid searching inside web elements and to query the entire page for each element I need:
list_divs = driver.find_elements(By.XPATH, "//div[@class='myclass']")
buttons = driver.find_elements(By.XPATH, "//div[@class='myclass']//button")
and then iterate through the lists to reach the button I need for each div. It works perfectly like this. I still don't understand how an XPath applied to a given piece of HTML can return something that is not inside that HTML...
I'll run more tests on other web pages to see whether the problem comes from Selenium.
Thanks for help!
I'm new to Selenium WebDriver and Python. I know about this article on getting the HTML source, but I don't want the entire HTML of a DOM object, just the content between the tags. I also looked at the WebDriver source code for help on finding this button DOM object,
for example:
<button id = "picview">Pic View</button>
How do I just get "Pic View"?
Also, using get_attribute("button id"), How would I get this specific button id as there are multiple buttons on the page with button id?
For example:
picbox_elem_attr = picbox_elem.get_attribute('button id')
print picbox_elem_attr
How do I ensure that picbox_elem_attr variable is set to the "picview" button and not some other button?
I don't have a
driver.find_element_by_name("xxx")
or a
driver.find_element_by_id("aaa")
to use to find this button.
To get text of an element use the text property. For example:
text_inside_button_id = driver.find_element_by_id("picview").text
Here is some additional documentation to help you with the webdriver binding library.
http://goldb.org/sst/selenium2_api_docs/html/
I forgot about XPath. Oops!
driver.find_element_by_xpath('//*[@id="picview"]')
You can also right-click the element in the browser's dev tools and copy its XPath from there.
I wanted to access the translation results of the following url
http://translate.google.com/translate?hl=en&sl=en&tl=ar&u=http%3A%2F%2Fwww.saltycrane.com%2Fblog%2F2008%2F10%2Fhow-escape-percent-encode-url-python%2F
the translation is displayed in the bottom content frame out of the two frames. I am interested in retrieving only the bottom content frame to get the translations
selenium for python allows us to fetch page contents via web automation:
browser.get('http://translate.google.com/#en/ar/'+hurl)
The required frame is an iframe :
<div id="contentframe" style="top:160px"><iframe src="/translate_p?hl=en&am... name=c frameborder="0" style="height:100%;width:100%;position:absolute;top:0px;bottom:0px;"></div></iframe>
but how to get the bottom content frame element to retrieve the translations using web automation?
Came to know that PyQuery also allows us to browse the contents using the JQuery formalism
Update:
An answer mentioned that Selenium provides a method where you can do that.
frame = browser.find_element_by_tag_name('iframe')
browser.switch_to_frame(frame)
# get page source
browser.page_source
but it does not work in the above example. It returns an empty page.
You can use driver.switchTo().frame(1); here. The number passed to frame() is the index of the frame within the page. Indexes start at 0, so to switch to the second frame you should use driver.switchTo().frame(1).
The code above is Java. In Python, you can use the line below.
driver.switch_to_frame(1)
UPDATE
driver.get("http://translate.google.com/translate?hl=en&sl=en&tl=ar&u=http://www.saltycrane.com/blog/2008/10/how-escape-percent-encode-url-python/");
driver.switchTo().frame(0);
System.out.println(driver.findElement(By.xpath("/html/body/div/div/div[3]/h1/span/a")).getText());
Output: SaltyCrane ???????
I just tried printing the title SaltyCrane, which is inside the iframe.
It worked for me except for the ? symbols after SaltyCrane. That part of the title is Arabic, and it could not be decoded on output.
The above code is in Java. The same logic should also work in Python.
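As a rough Python counterpart, the sketch below switches into a frame, reads its source, and then returns to the top-level document. The helper name frame_source is made up, and it assumes a recent Selenium release where frame switching is exposed as driver.switch_to.frame(); older releases used driver.switch_to_frame() as shown above.

```python
def frame_source(driver, frame_ref):
    """Return the page source of one frame, then switch back to the main page.

    Hypothetical helper: `frame_ref` may be a zero-based index, a frame
    name or id, or a frame element that was located beforehand.
    """
    driver.switch_to.frame(frame_ref)
    try:
        # While switched in, page_source refers to the frame's own document.
        return driver.page_source
    finally:
        # Always return to the top-level document, even if reading fails.
        driver.switch_to.default_content()
```

Switching back in the finally clause keeps later lookups from silently running inside the frame.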
Selenium provides a method where you can do that.
frame = browser.find_element_by_tag_name('iframe')
browser.switch_to_frame(frame)
# get page source
browser.page_source