Selenium: Get text in an element with child containing a specific class

My HTML page looks like this:
<div class="some class">
    <p>
        <i class="class1"></i>
        Some Text
    </p>
    <p>
        <i class="class2"></i>
        Some Text
    </p>
    . . .
    . . .
    . . .
</div>
I want to get "Some Text". Currently I am trying:
elem = browser.find_element_by_xpath("//div[@class='some class']")
text = elem.find_element_by_xpath("//p/i[@class='class1']").text
But it returns an empty string, and I can't understand why. I am new to Selenium. Please help.

You can use any of the XPath expressions below. Each one selects the p element, so elem.text returns its text:
# Find the "i" element with the "class1" CSS class and get its nearest "p" ancestor
elem = browser.find_element_by_xpath("//i[@class='class1']/ancestor::p[1]")
# Same as the previous one, restricted to the "div"
elem = browser.find_element_by_xpath("//div[@class='some class']//i[@class='class1']/ancestor::p[1]")
# Find the "p" element that has a child "i" element with the "class1" CSS class
elem = browser.find_element_by_xpath("//p[./i[@class='class1']]")
# Same as the previous one, restricted to the "div"
elem = browser.find_element_by_xpath("//div[@class='some class']//p[./i[@class='class1']]")

Your selector grabs the i element that has class="class1". That i element contains no text, which is why you get an empty string. To fix that, step up to its parent p:
elem = browser.find_element_by_xpath("//div[@class='some class']")
# Find the i element you want; the leading "." keeps the search inside elem
i_elem = elem.find_element_by_xpath(".//i[@class='class1']")
# Step up to the parent of i_elem, which is the p element
p_elem = i_elem.find_element_by_xpath("./..")
txt = p_elem.text
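To see why the i element yields an empty string while its parent p does not, here is a minimal sketch that runs without a browser. It swaps Selenium for the standard library's xml.etree.ElementTree, an assumption made only to keep the example self-contained, and uses a trimmed copy of the question's markup in which the second paragraph's text is changed to "Other Text" so the two are distinguishable.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the markup from the question.
# "Other Text" is a made-up value so the two paragraphs differ.
HTML = """
<div class="some class">
  <p><i class="class1"></i>Some Text</p>
  <p><i class="class2"></i>Other Text</p>
</div>
"""

root = ET.fromstring(HTML.strip())

# The <i> element itself contains no text ...
i_elem = root.find(".//i[@class='class1']")
i_text = i_elem.text or ""

# ... the text belongs to the parent <p>; "/.." steps up to it.
p_elem = root.find(".//i[@class='class1']/..")
p_text = "".join(p_elem.itertext()).strip()

print(repr(i_text))
print(repr(p_text))
```

The same "/.." step is valid in a Selenium XPath, so the idea carries over even though the library here is different.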

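You can also use execute_script to read the text node directly:
from selenium.common.exceptions import NoSuchElementException

xPath = "//div[@class='some class']/p[1]"
try:
    element = driver.find_element_by_xpath(xPath)
    # childNodes of the p element: whitespace text, the i element, then the text node
    b1Text = driver.execute_script("return arguments[0].childNodes[2].textContent", element)
    print(b1Text)
except NoSuchElementException as exc:
    print(exc)
The index depends on the whitespace in the markup, so try other values such as childNodes[1] if childNodes[2] does not return the text.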

Assuming that class1 and class2 are different, you can use the CSS selector
div.some.class > p:nth-child(1) to get the text. Because class="some class" assigns two classes, some and class, each one is written with its own dot. The text sits inside the <p> tag, so you can read it from the first <p>:
elem = browser.find_element_by_css_selector("div.some.class > p:nth-child(1)")
text = elem.text
This should return the text inside the element.

Related

Scraping the attribute of the first child from multiple div (selenium)

I'm trying to scrape the class name of the first child (a span) of multiple div elements.
Here is the HTML code:
<div class="ui_column is-9">
    <span class="name1"></span>
    <span class="...">...</span>
    ...
<div class="ui_column is-9">
    <span class="name2"></span>
    <span class="...">...</span>
    ...
<div class ..
URL of the page for the complete code.
I'm achieving this for the first five div elements with this code:
i = 0
liste = []
while i <= 4:
    parent = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")[i]
    child = parent.find_element_by_xpath("./child::*")
    class_name = child.get_attribute('class')
    i = i + 1
    liste.append(class_name)
But do you know if there is an easier way to do it?

You can get all of these first span elements directly and then extract their class attribute values as follows:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[1]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)
To extract the class attribute values from the first five elements only, limit the loop to five iterations, for example by iterating over first_spans[:5].
Update
After the update to your question, the answer becomes different and much simpler.
You can get the desired elements directly and extract their class attribute values as follows:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[contains(@class,'ui_bubble_rating')]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)
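The positional predicate span[1] can be tried out without a browser. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree instead of Selenium, the markup is a hypothetical, well-formed stand-in for the real page, and it uses /span[1] (direct children) rather than //span[1], because the question asks for the first child.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup in the question.
HTML = """
<body>
  <div class="ui_column is-9">
    <span class="name1"></span>
    <span class="other"></span>
  </div>
  <div class="ui_column is-9">
    <span class="name2"></span>
    <span class="other"></span>
  </div>
</body>
"""

root = ET.fromstring(HTML.strip())

# span[1] selects the first <span> child of each matching <div>.
first_spans = root.findall(".//div[@class='ui_column is-9']/span[1]")
liste = [span.get("class") for span in first_spans]

print(liste)
```

One query replaces the index-based while loop, and the list comprehension collects the class names in document order.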

How to get the desired text from an element when scraping LinkedIn with Python and Selenium

The 'a' element below has two text strings, "First Name" and "View First Names’s profile". With the Python code below, get_text() returns both strings, but I want only the first one, "First Name". Please let me know how to drop the second string, "View First Names’s profile".
all_classes = src.find_all('div', {'class':'mb1'})
for linkClass in all_classes:
    linkClass = linkClass.find_all('a', {'class': 'app-aware-link'})
    for element in linkClass:
        name = element.get_text().strip()
        Name.append(name)
HTML
<a class="app-aware-link" href="https://www.linkedin.com/in/shreyansjain-iitdhn?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABpqUi4Bg1wC5QB22-ydCRRB580Zd4gutQ8">
    <span dir="ltr">
        <span aria-hidden="true"><!-- -->First Name<!-- --></span><span class="visually-hidden"><!-- -->View First Names’s profile<!-- --></span>
    </span>
</a>

To extract the first name with Selenium, use this CSS selector:
.app-aware-link span[dir='ltr'] span:first-of-type
and this one for the profile text:
.app-aware-link span[dir='ltr'] span:last-of-type
Then read the text of the matched elements like this:
from selenium.webdriver.common.by import By

# First name
for name in driver.find_elements(By.CSS_SELECTOR, ".app-aware-link span[dir='ltr'] span:first-of-type"):
    print(name.text)

# Profile text
for profile_name in driver.find_elements(By.CSS_SELECTOR, ".app-aware-link span[dir='ltr'] span:last-of-type"):
    print(profile_name.text)

Try this, which reads only the span marked aria-hidden="true":
all_classes = src.find_all('div', {'class': 'mb1'})
for linkClass in all_classes:
    links = linkClass.find_all('a', {'class': 'app-aware-link'})
    for element in links:
        # The visible name sits in the span marked aria-hidden="true"
        first_name = element.find('span', {'aria-hidden': 'true'})
        if first_name is not None:
            name = first_name.get_text().strip()
            Name.append(name)
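The difference between reading the whole link and reading only the inner span can be checked on a small sample. This is a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup or Selenium, and the markup is a simplified copy of the anchor from the question (href shortened to "#", comments dropped, a plain apostrophe in the hidden text).

```python
import xml.etree.ElementTree as ET

# Simplified copy of the anchor from the question; the href value is a placeholder.
HTML = """
<a class="app-aware-link" href="#">
  <span dir="ltr">
    <span aria-hidden="true">First Name</span><span class="visually-hidden">View First Name's profile</span>
  </span>
</a>
"""

link = ET.fromstring(HTML.strip())

# Reading the whole link returns both strings, which is the problem in the question.
full_text = " ".join("".join(link.itertext()).split())

# Selecting only the span marked aria-hidden="true" returns just the name.
visible = link.find(".//span[@aria-hidden='true']")
name = visible.text.strip()

print(full_text)
print(name)
```

Targeting the attribute rather than the position of the span keeps the lookup working if the order of the two spans changes.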

Python selenium webdriver. Find elements with specified class name

I am using Selenium to parse a page containing markup that looks a bit like this:
<html>
    <head><title>Example</title></head>
    <body>
        <div>
            <span class="Fw(500) D(ib) Fz(42px)">1</span>
            <span class="Fw(500) D(ib) Fz(42px) Green XYZ">2</span>
        </div>
    </body>
</html>
I want to fetch all span elements that contain the class foobar.
I have tried both of these (the variable wd is an instance of selenium.webdriver):
elem = wd.find_elements_by_css_selector("span[class='Fw(500) D(ib) Fz(42px).']")
elem = wd.find_element_by_xpath("//span[starts-with(@class, 'Fw(500) D(ib) Fz(42px))]")
Neither of them works.
How can I select only the elements whose class starts with Fw(500) D(ib) Fz(42px),
i.e. both span elements in the sample markup given?

Try this:
elem = wd.find_elements_by_css_selector("span.foobar")
If there is a space between foo and bar in the class attribute, they are two separate classes, so try this:
elem = wd.find_elements_by_css_selector("span.foo.bar")
Edit: If your class contains non-alphabetical characters and you want to find elements whose class attribute starts with Fw(500) D(ib) Fz(42px), then try this:
elem = wd.find_elements_by_css_selector("span[class ^= 'Fw(500) D(ib) Fz(42px)']")
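The ^= operator matches on the start of the attribute value, whereas = requires the whole value to be equal. The sketch below illustrates that difference by hand, under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Selenium, does the prefix check with str.startswith (ElementTree has no starts-with support), and adds a third span with a made-up class "Other" to show a non-match.

```python
import xml.etree.ElementTree as ET

# Markup from the question plus one extra span ("Other") added for contrast.
HTML = """
<div>
  <span class="Fw(500) D(ib) Fz(42px)">1</span>
  <span class="Fw(500) D(ib) Fz(42px) Green XYZ">2</span>
  <span class="Other">3</span>
</div>
"""

PREFIX = "Fw(500) D(ib) Fz(42px)"
root = ET.fromstring(HTML.strip())

# Prefix match: what span[class ^= '...'] expresses in CSS.
prefix_matches = [s.text for s in root.iter("span")
                  if s.get("class", "").startswith(PREFIX)]

# Exact match: what span[class='...'] expresses; the longer class value is missed.
exact_matches = [s.text for s in root.findall(".//span[@class='Fw(500) D(ib) Fz(42px)']")]

print(prefix_matches)
print(exact_matches)
```

The exact comparison finds only the first span, which is why the attempts in the question missed the second one.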

Try to find the elements by XPath:
//span[@class='foobar']
This should work.

How to use beautifulsoup to get node text and children tag separately

My HTML looks like this:
<a class="title" href="">
    <b>name
        <span class="c-gray">position</span>
    </b>
</a>
I want to get the name and position strings separately, so my script looks like this:
lia = soup.find('a', attrs={'class': 'title'})
pos = lia.find('span').get_text()
lia.find('span').replace_with('')
name = lia.get_text()
print(name.strip() + ',' + pos)
Although it does the job, I don't think it is a beautiful way. Is there a better idea?

You can use the .contents attribute, which lists a tag's direct children, this way:
person = lia.find('b').contents
name = person[0].strip()
position = person[1].text

The idea is to locate the a element, then, for the name - get the first text node from an inner b element and, for the position - get the span element's text:
>>> a = soup.find("a", class_="title")
>>> name, position = a.b.find(text=True).strip(), a.b.span.get_text(strip=True)
>>> name, position
(u'name', u'position')
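The split between an element's own leading text and the text of a child element can also be seen with the standard library alone. This is a minimal sketch, assuming xml.etree.ElementTree as a stand-in for BeautifulSoup: there, an element's .text attribute holds only the text before its first child, which plays the role of the first text node.

```python
import xml.etree.ElementTree as ET

# Markup from the question.
HTML = """
<a class="title" href="">
  <b>name
    <span class="c-gray">position</span>
  </b>
</a>
"""

a = ET.fromstring(HTML.strip())
b = a.find("b")

# b.text is only the text before the first child of <b>, i.e. the name.
name = b.text.strip()
# The position is the text of the nested <span>.
position = b.find("span").text.strip()

print(name + "," + position)
```

Nothing is removed from the tree here, unlike the replace_with approach in the question.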

BeautifulSoup: How to skip a child node within a find_all?

I have the following code to scrape this page:
soup = BeautifulSoup(html)
result = u''
# Find the starting point
start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.find_all('p'):
The problem I'm facing is that it also grabs the <p> inside <div class="type-gallery">, which I would like to avoid, but I can't find a way to do it.
Any ideas, please?

You want direct children, not just any descendant, which is what element.find_all() returns. Your best bet here is to use a CSS selector instead:
for item in soup.select('div.main-content-column > div > p'):
The > operator limits this to p tags that are direct children of div tags that are themselves direct children of the div with the given class. You can make this as specific as you like, for example by adding the itemprop attribute:
for item in soup.select('div.main-content-column > div[itemprop="articleBody"] > p'):
The alternative is to loop over the element.children iterable:
start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.children:
        if item.name != 'div':
            # skip children that are not <div> tags
            continue
        for para in item.children:
            if para.name != 'p':
                # skip children that are not <p> tags
                continue
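The difference between "any descendant" and "direct child" can be checked on a small sample. The sketch below is an assumption-laden illustration: the markup is a hypothetical reduction of the page (the real page is not shown in the question), and it uses the standard library's xml.etree.ElementTree, where ".//p" behaves like find_all('p') and "./div/p" behaves like the > selector chain.

```python
import xml.etree.ElementTree as ET

# Hypothetical reduction of the page structure described in the question.
HTML = """
<div class="main-content-column">
  <div itemprop="articleBody">
    <p>first paragraph</p>
    <div class="type-gallery">
      <p>gallery caption</p>
    </div>
    <p>second paragraph</p>
  </div>
</div>
"""

start = ET.fromstring(HTML.strip())

# ".//p" walks every descendant, so the gallery caption is included.
all_paragraphs = [p.text for p in start.findall(".//p")]

# "./div/p" follows direct children only, so the nested gallery is skipped.
direct_paragraphs = [p.text for p in start.findall("./div/p")]

print(all_paragraphs)
print(direct_paragraphs)
```

Restricting each step to direct children is what excludes the paragraph nested inside the type-gallery div.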
