Python Selenium - Find element by class and text

I'm trying to paginate through the results of this Amazon search for "Becoming", but I get a NoSuchElementException: "Unable to locate element: <insert xpath here>".
Here is the html:
<div id="pagn" class="pagnHy">
<span class="pagnLink">
2
</span>
</div>
Here are the xpaths I've tried:
driver.find_element_by_xpath('//*[@id="pagn" and @class="pagnLink" and text()="2"]')
driver.find_element_by_xpath('//div[@id="pagn" and @class="pagnLink" and text()="2"]')
driver.find_element_by_xpath("//*[@id='pagn' and @class='pagnLink' and text()[contains(.,'2')]]")
driver.find_element_by_xpath("//span[@class='pagnLink' and text()='2']")
driver.find_element_by_xpath("//div[@class='pagnLink' and text()='2']")
If I just use find_element_by_link_text(...) then sometimes the wrong link will be selected. For example, if the number of reviews is equal to the page number I'm looking for (in this case, 2), then it will select the product with 2 reviews, instead of the page number '2'.

You're trying to mix attributes and text nodes from different WebElements in the same predicate. You should try to separate them as below:
driver.find_element_by_xpath('//div[@id="pagn"]/span[@class="pagnLink"]/a[text()="2"]')

Sometimes it is better to take an intermediate step: first get the element that contains the results, then search within that element. Doing it this way simplifies your search terms.
from selenium import webdriver

url = 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&fieldkeywords=becoming&rh=i%3Aaps%2Ck%3Abecoming'
driver = webdriver.Firefox()
driver.get(url)

# First grab the container that holds the results, then search within it.
results_list_object = driver.find_element_by_id('s-results-list-atf')
results = results_list_object.find_elements_by_css_selector('li[id*="result"]')
for number, article in enumerate(results):
    print(">> article %d : %s \n" % (number, article.text))

When I look at the markup, I'm seeing the following:
<span class="pagnLink">
2
</span>
So you want to find a span with class pagnLink that has a child a element with the text 2, or:
'//*[@class="pagnLink"]/a[text()="2"]'
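As a quick sanity check outside the browser, that expression can be tried against the quoted markup with the standard library's ElementTree, which supports a small XPath subset (the nested a element is an assumption, since the quoted snippet only shows the bare span):

```python
import xml.etree.ElementTree as ET

# The markup from the question, with the <a> child the selector expects.
snippet = '<div id="pagn" class="pagnHy"><span class="pagnLink"><a>2</a></span></div>'
root = ET.fromstring(snippet)

# ElementTree's XPath subset: attribute predicate plus [.='text'] for the text match.
match = root.find(".//span[@class='pagnLink']/a[.='2']")
print(match.text)  # 2
```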


Python Selenium - How do you extract a link from an element with no href? [duplicate]

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPath for the one constant parent element across all the child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPath: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that do not have an <a> tag and instead include a <span> tag with an inline text link using text element "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page (this eventually returns an IndexError, as only some of the listings have the <a> tag, and the final count does not match the total number of listings on the page indicated by len(name)):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))

How to get text from HTML using Selenium and Python when two elements share the same class name and I need to extract both

I have HTML like:
<div class='mesage-in'> cool text here </div>
<div class='mesage-in'> bad text here </div>
and my python code like:
texto = navegador.find_element_by_class_name('message-in').text
print(texto)
Is it possible to make this get all elements with the same class name and put them in an array, or define them as different variables, like this?
Output:
print(texto1)
-> cool text here
print(texto2)
-> bad text here
#or
print(texto[0])
-> cool text here
print(texto[1])
-> bad text here
Actually, my code only gets the first one.
As per the HTML:
<div class='mesage-in'> cool text here </div>
<div class='mesage-in'> bad text here </div>
The following line of code:
texto = navegador.find_element_by_class_name('message-in').text
will always identify the first matching element, extract its text and assign it to texto. So when you try to print texto, the text of the very first element, i.e. cool text here, is printed.
Solution
You can get all the elements with the same class name, i.e. message-in, and put them in a list as follows:
from selenium.webdriver.common.by import By

texto = navegador.find_elements(By.CLASS_NAME, 'message-in')
Now you can print the desired texts with respect to their index as follows:
To print cool text here:
print(texto[0].text) # prints-> cool text here
To print bad text here:
print(texto[1].text) # prints-> bad text here
Outro
You can also create a list of the texts using a list comprehension and print them as follows:
texto = [my_elem.text for my_elem in driver.find_elements(By.CLASS_NAME, "message-in")]
print(texto[0]) # prints-> cool text here
print(texto[1]) # prints-> bad text here
You can achieve this by using the BeautifulSoup library.
Example output:
[' cool text here ', ' bad text here ']
from bs4 import BeautifulSoup

def get_class_texts(html_text: str, class_name: str):
    soup = BeautifulSoup(html_text, features="html.parser")
    return [tag.text for tag in soup.select(f".{class_name}")]

print(get_class_texts("<div class='mesage-in'> cool text here </div> <div class='mesage-in'> bad text here </div>", "mesage-in"))
To get multiple elements into one array you need to use find_elements. In your case I would use XPath, like so:
eleArray = navegador.find_elements(By.XPATH, "//div[@class='mesage-in']")
Then you can loop over the array like so:
for element in eleArray:
    print(element.text)
Here is a similar example where I get all latin-encodable span elements from Wikipedia and log them to the console:
https://mx1.maxtaf.com/cases/a320a3ad-9949-4bce-87fa-7a0980df8f1f?projectId=bugtestproject2
You can store them in a list; it will be a list of web elements.
As I see it, you are using navegador.find_element, which returns a single web element, whereas navegador.find_elements returns a list of web elements.
Also, in the latest Selenium, find_element_by_class_name has been deprecated, so I would suggest you use navegador.find_element(By.CLASS_NAME, "")
Code:
texto = navegador.find_elements(By.CLASS_NAME, 'message-in')
print(texto[0])
print(texto[1])
or
for txt in texto:
    print(txt.text)

Using BeautifulSoup to scrape specific element within a CSS class

I'm trying to use BeautifulSoup in Python to scrape the 3rd li element within a CSS class. That said, I'm pretty new to this and am not sure of the best way to go about it.
In the example below, what I'm trying to do is scrape the 170 votes from this list (in the real-world example there are hundreds of these on the page I'm looking to scrape, but they're all nested under the same CSS class within the 3rd li element):
<ul class="example-ul-class">
<li class="example-li-class">EXAMPLE NAME</li>
<li><i class="example-li-class">12 hours ago</i></li>
<li><i class="example-li-class">170 votes</i></li>
<li><i class="example-li-class">3 min read</i></li>
</ul>
I tried using something like the below, but am getting the error shown after the code:
subtext = soup.select('.example-ul-class > li[2]')
print(subtext)
Error:
in selector_iter
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 29
line 1:
.example-ul-class > li[2]
Again, the desired output would be just the string '170 votes'.
Appreciate the help!
Instead of a CSS selector, try selecting using normal BS methods:
print(soup.find('ul',class_='example-ul-class').find_all('li')[2].text.strip())
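If you'd rather stay with a CSS selector, the index belongs in an :nth-of-type() pseudo-class (1-based) rather than square brackets; a minimal sketch against the markup above:

```python
from bs4 import BeautifulSoup

html = '''<ul class="example-ul-class">
<li class="example-li-class">EXAMPLE NAME</li>
<li><i class="example-li-class">12 hours ago</i></li>
<li><i class="example-li-class">170 votes</i></li>
<li><i class="example-li-class">3 min read</i></li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

# CSS indexing is 1-based and uses :nth-of-type, not li[2]
print(soup.select_one('.example-ul-class > li:nth-of-type(3)').get_text(strip=True))  # 170 votes
```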

Python Selenium search query provides multiple results. How do I choose which result is my search query?

I made a search query with Selenium, and I get multiple results back. The problem is that only one link is right. How can I select that link from the multiple results and parse the data from it?
I have a list, and every time a search query is made the results may change; sometimes there are 10 or 15.
The following code will always select the first result, but in this case I am looking for the 4th link:
code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("url")
#time.sleep(5)
username = driver.find_element_by_name("p_name")
#time.sleep(1)
username.send_keys("xxxxx")
#username.clear()
driver.find_element_by_xpath("html/body/form/table[6]/tbody/tr/td[2]/input").click()
driver.find_element_by_xpath("html/body/form/table[3]/tbody/tr[2]/td[4]/a").click()
html = driver.page_source
soup = BeautifulSoup(html)
for tag in soup.find_all('table'):
    print(tag.text)
You know the general form of entries in the search results page: they're capitalised and shorn of special characters. Assuming you have such a page, you can use this knowledge and Selenium to search for text containing what you want, with an XPath expression.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://egov.sos.state.or.us/br/pkg_web_name_srch_inq.do_name_srch?p_name=OREGON%20BUD%20COMPANY%2C%20LLC&p_regist_nbr=&p_srch=PHASE1&p_print=FALSE&p_entity_status=ACTINA')
>>> driver.find_element_by_xpath('.//*[contains(text(),"OREGON BUD COMPANY LLC")]/../..').text
' 4 DLLC ACT 1097010-94 CUR OREGON BUD COMPANY LLC Search'
I've simply dumped the text for the entire row. You'll need to extract the text items you actually want from the parent tr element.
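Once you have the row text, pulling out individual fields is plain string handling; a rough sketch on the dumped row above (which field sits at which position is an assumption about this particular table):

```python
# Row text as dumped by Selenium for the matching <tr>.
row_text = ' 4 DLLC ACT 1097010-94 CUR OREGON BUD COMPANY LLC Search'
cells = row_text.split()

# Hypothetical: the 4th whitespace-separated field is the registry number.
registry_number = cells[3]
print(registry_number)  # 1097010-94
```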
PS: There's a good page of xpath expressions at https://gist.github.com/LeCoupa/8c305ec8c713aad07b14.
I think you can use
driver.find_element_by_partial_link_text("OREGON BUD COMPANY LLC")
instead of
driver.find_element_by_xpath('.//*[contains(text(),"OREGON BUD COMPANY LLC")]/../..').text
This will follow the exact match and will get you to the next page.
The answer is to use better selectors that only return one result. I prefer CSS Selectors but the process is largely the same for XPath if you prefer.
To get a CSS Selector in Chrome:
Right-click on the element and select 'Inspect'
Right-click on the element in the DOM explorer of DevTools
Select "Copy" > "Copy selector" (Alternatively you could get the XPath here too)
driver.find_element_by_css_selector("body > form > table:nth-child(4) > tbody > tr:nth-child(2) > td:nth-child(2) > input[type='text']").send_keys("Timothy")
driver.find_element_by_css_selector("body > form > table:nth-child(5) > tbody > tr > td:nth-child(2) > input[type='text']").send_keys("Cope")

How can I get text of an element in Selenium WebDriver, without including child element text?

Consider:
<div id="a">This is some
<div id="b">text</div>
</div>
Getting "This is some" is nontrivial. For instance, this returns "This is some text":
driver.find_element_by_id('a').text
How does one, in a general way, get the text of a specific element without including the text of its children?
Here's a general solution:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)
The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).
Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:
return driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while (child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element)
I'm actually using this code in a test suite.
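For what it's worth, the same "direct text only" walk can be reproduced outside the browser with the standard library's ElementTree, which exposes an element's leading text as .text and the text after each child as that child's .tail; a rough stand-in for the TEXT_NODE filter above:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div id="a">This is some\n<div id="b">text</div>\n</div>')

# Direct text of the parent only: its leading .text plus each child's .tail,
# mirroring the TEXT_NODE filter in the JavaScript above.
own_text = (root.text or '') + ''.join(child.tail or '' for child in root)
print(own_text.strip())  # This is some
```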
In the HTML which you have shared:
<div id="a">This is some
<div id="b">text</div>
</div>
The text This is some is within a text node. To depict the text node in a structured way:
<div id="a">
This is some
<div id="b">text</div>
</div>
This use case
To extract and print the text This is some from the text node using Selenium's python client, you have two ways as follows:
Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:
using xpath:
print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
using css_selector:
print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:
using xpath and firstChild:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
using xpath and childNodes[n]:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
Use:
def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text
You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
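That length-based idea can be sketched in plain Python; it assumes the children's text trails the parent's own text, as in the markup above, and in Selenium full_text and child_texts would come from element.text and the children's .text:

```python
def strip_child_text(full_text, child_texts):
    # Trim the combined length of all child text off the end of the
    # element's full text instead of doing per-child replace() calls.
    trailing = sum(len(t) for t in child_texts)
    return full_text[:len(full_text) - trailing] if trailing else full_text

# element.text for <div id="a">This is some<div id="b">text</div></div>
print(strip_child_text("This is some\ntext", ["text"]).strip())  # This is some
```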
Unfortunately, Selenium was only built to work with Elements, not Text nodes.
If you try to use a method like find_element_by_xpath to target the text nodes directly, Selenium will throw an InvalidSelectorException.
One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.
import bs4
from bs4 import BeautifulSoup
inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')
outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')
From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.
Here's a simple one-liner that may be sufficient:
inner_soup.find(text=True)
If that doesn't work, then you can loop through the element's child nodes with .contents and check their object type.
Beautiful Soup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.
contents = inner_soup.contents
for bs4_object in contents:
    if type(bs4_object) == bs4.Tag:
        print("This object is an Element.")
    elif type(bs4_object) == bs4.NavigableString:
        print("This object is a Text node.")
Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.
