Scraping the attribute of the first child from multiple div (selenium)

I'm trying to scrape the class name of the first child (a span) of each of several div elements.
Here is the HTML:
<div class="ui_column is-9">
    <span class="name1"></span>
    <span class="...">...</span>
    ...
<div class="ui_column is-9">
    <span class="name2"></span>
    <span class="...">...</span>
    ...
<div class ..
URL of the page for the complete code.
This code does it for the first five divs:
i = 0
liste = []
while i <= 4:
    parent = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")[i]
    child = parent.find_element_by_xpath("./child::*")
    class_name = child.get_attribute('class')
    i = i + 1
    liste.append(class_name)
But is there an easier way to do it?

You can get all of these first span elements directly and then extract their class attribute values as follows:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[1]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)
You can also extract the class attribute values from only the first 5 elements by limiting the loop to 5 iterations.
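The five-element limit can also be applied with a slice instead of a counter. The following is a minimal sketch, not part of the original answer: first_n_classes and FakeElement are hypothetical names, and FakeElement only stands in for a Selenium WebElement so that the snippet runs without a browser.

```python
def first_n_classes(elements, limit=5):
    """Return the class attribute of at most `limit` elements, in order."""
    # Slicing never raises, even when fewer than `limit` elements were found.
    return [element.get_attribute("class") for element in elements[:limit]]


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, class_name):
        self._class_name = class_name

    def get_attribute(self, name):
        # Only the "class" attribute is modelled here.
        return self._class_name if name == "class" else None


fake_spans = [FakeElement(f"name{n}") for n in range(1, 8)]
liste = first_n_classes(fake_spans)
print(liste)
```

With a real driver, the same function should accept the list returned by the find_elements call, for example first_n_classes(first_spans).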
UPD
Now that you have updated your question, the answer is different and much simpler.
You can get the desired elements directly and extract their class attribute values as follows:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[contains(@class,'ui_bubble_rating')]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)

Related

BS4 text inside <span> which has no class

I am trying to scrape the 4.1 rating inside the span tag with this Python code, but it returns nothing.
for item in soup.select("._9uwBC wY0my"):
    n = soup.find("span").text()
    print(n)
---------------------------------------
<div class="_9uwBC wY0my">
    <span class="icon-star _537e4"></span>
    <span>4.1</span>
</div>

@Aditya, I think soup.find("span") will only return the first span, and you want the text from the second one.
I would try:
for item in soup.select("div._9uwBC.wY0my"):
    spans = item.find_all("span")
    for span in spans:
        n = span.text
        if n != '':
            print(n)
This should print the text of the non-empty span tags under the div you specified.
Does this accomplish what you want?

OK, here's one approach for getting the names and stars for each restaurant on the page. It's not necessarily the most elegant way to do it, but I've tried it a couple of times and it seems to work:
divs = soup.find_all('div')
for div in divs:
    if div.has_attr('class'):
        if div['class'] == ['nA6kb']:  # the class of the divs with the name
            name = div.text
            k = div.find_next('div')  # the next div
            l = k.find_next('div')  # the div with the stars
            spans = l.find_all('span')  # this part is the same as the answer above
            for span in spans:
                n = span.text
                if n != '':
                    print(name, n)
This assumes that the div that contains the stars span is always the second div after the div that contains the restaurant name. It looks like that's always the case, but I'm not positive that it never changes.

Remove redundant class names in HTML using BeautifulSoup

I want to convert:
<span class = "foo">data-1</span>
<span class = "foo">data-2</span>
<span class = "foo">data-3</span>
to
<span class = "foo"> data-1 data-2 data-3 </span>
Using BeautifulSoup in Python. This HTML fragment appears in several places in the page body, so I want to merge it before scraping it. The middle span originally had an em class, which is why the spans were separate.

Adapted from this answer to show how this could be used for your span tags:
# container is assumed to be the tag that encloses the spans
span_tags = container.find_all('span')
# combine the text from all span tags, separated by spaces
text = ' '.join(span.get_text(strip=True) for span in span_tags)
# choose the tag you want to preserve and update its text
span_main = span_tags[0]  # you can target it however you want; this takes the first one
span_main.string = text  # replace the text
# remove every other span
for tag in span_tags:
    if tag is not span_main:
        tag.decompose()

Selenium: Get text in an element with child containing a specific class

My HTML page looks like this:
<div class="some class">
    <p>
        <i class="class1"></i>
        Some Text
    </p>
    <p>
        <i class="class2"></i>
        Some Text
    </p>
    . . .
    . . .
    . . .
</div>
I want to get "Some Text". Currently I am trying:
elem = browser.find_element_by_xpath("//div[@class='some class']")
text = elem.find_element_by_xpath("//p/i[@class='class1']").text
But it returns an empty string, and I can't understand why. I am new to Selenium. Please help.

You can use any of the XPath expressions below:
# Find the "i" element with the "class1" CSS class and get its nearest "p" ancestor
elem = browser.find_element_by_xpath("//i[@class='class1']/ancestor::p[1]")
# Same as the previous one, restricted to the "div"
elem = browser.find_element_by_xpath("//div[@class='some class']//i[@class='class1']/ancestor::p[1]")
# Find the "p" element that has a child "i" element with the "class1" CSS class
elem = browser.find_element_by_xpath("//p[./i[@class='class1']]")
# Same as the previous one, restricted to the "div"
elem = browser.find_element_by_xpath("//div[@class='some class']//p[./i[@class='class1']]")

Your selector grabs the i element that has the attribute class="class1". That i element has no text, which is why you get an empty string. To fix that:
elem = browser.find_element_by_xpath("//div[@class='some class']")
# Find the i element you want (the leading dot keeps the search inside elem)
i_elem = elem.find_element_by_xpath(".//i[@class='class1']")
# Find the parent of i_elem, which is the p element
p_elem = i_elem.find_element_by_xpath("./..")
txt = p_elem.text
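To see why the text belongs to the p element rather than to the i element, here is a browser-free sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for the page DOM, which is an assumption made for illustration only: it is not Selenium, and its XPath support is a small subset. The markup is a well-formed copy of the snippet from the question.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the markup from the question.
markup = """
<div class="some class">
  <p><i class="class1"></i>Some Text</p>
  <p><i class="class2"></i>Some Text</p>
</div>
"""
root = ET.fromstring(markup)

# The i element itself is empty, so it carries no text of its own.
i_elem = root.find(".//i[@class='class1']")
i_text = i_elem.text or ""

# Stepping up to the parent p gives access to the text that follows the i.
p_elem = root.find(".//i[@class='class1']/..")
p_text = "".join(p_elem.itertext()).strip()

print(repr(i_text))
print(repr(p_text))
```

The same idea applies in Selenium: locate the i element to identify the right paragraph, then read the text from its parent.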

You can use execute_script:
xPath = "//div[@class='some class']"
try:
    element = driver.find_element_by_xpath(xPath)
    # Read the text of one child node of the div via JavaScript
    b1Text = driver.execute_script("return arguments[0].childNodes[2].textContent", element)
    print(b1Text)
except Exception:
    print()
Try changing the index inside childNodes[N], for example childNodes[2] or childNodes[1].

Assuming that your class1 and class2 are different, you can use the CSS selector
div.some.class > p:nth-child(1) to get the text. The class attribute "some class" holds two class names, so each one needs its own dot in the selector. Since the text is inside the <p> tag, you can read it from the first <p> element.
elem = browser.find_element_by_css_selector("div.some.class > p:nth-child(1)")
text = elem.text
This should get you the text inside the element.
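A class attribute such as "some class" is a whitespace-separated list of class names, and a CSS selector addresses each name with its own dot. The sketch below builds such a selector from an attribute value. css_for_classes is a hypothetical helper, and it assumes the class names contain no characters that would need escaping in CSS.

```python
def css_for_classes(tag, class_attr):
    """Build a CSS selector that requires every class name in class_attr."""
    # str.split() with no argument splits on any run of whitespace.
    class_names = class_attr.split()
    return tag + "".join("." + name for name in class_names)


selector = css_for_classes("div", "some class") + " > p:nth-child(1)"
print(selector)
```

The resulting string can then be passed to the driver's CSS selector lookup.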

How to find text of <div><span>text</span></div> in beautifulsoup?

This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do this?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text

I'd not go with getting it by the class directly, since I think "list_count" is too broad of a class value and might be used for other things on the page.
There are several options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use the "Followers" text label and get its next sibling:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Another very concise and reliable approach is to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
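Scraped counts often arrive with surrounding whitespace or thousands separators, which a plain int() call may reject. A minimal sketch of a more tolerant conversion follows. to_int is a hypothetical helper, and treating commas as thousands separators is an assumption that does not hold for every locale.

```python
def to_int(text):
    """Convert scraped count text such as ' 92 ' or '1,024' to an int."""
    # Assumption: commas are thousands separators, not decimal marks.
    cleaned = text.strip().replace(",", "")
    # int() raises ValueError if anything other than digits (and a sign) remains.
    return int(cleaned)


count = to_int(" 92\n")
print(count)
```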

Python selenium webdriver. Find elements with specified class name

I am using Selenium to parse a page containing markup that looks a bit like this:
<html>
<head><title>Example</title></head>
<body>
<div>
<span class="Fw(500) D(ib) Fz(42px)">1</span>
<span class="Fw(500) D(ib) Fz(42px) Green XYZ">2</span>
</div>
</body>
</html>
I want to fetch all span elements that contain the class foobar.
I have tried both of these (the variable wd is an instance of selenium.webdriver):
elem = wd.find_elements_by_css_selector("span[class='Fw(500) D(ib) Fz(42px).']")
elem = wd.find_element_by_xpath("//span[starts-with(@class, 'Fw(500) D(ib) Fz(42px))]")
Neither of them works.
How can I select only the elements whose class starts with Fw(500) D(ib) Fz(42px),
i.e. both span elements in the sample markup given?

Try this:
elem = wd.find_elements_by_css_selector("span.foobar")
If foo and bar are two separate classes (there is a space between them), try this:
elem = wd.find_elements_by_css_selector("span.foo.bar")
Edited: If your class names contain non-alphabetic characters and you want to find the elements whose class attribute starts with Fw(500) D(ib) Fz(42px), try this:
elem = wd.find_elements_by_css_selector("span[class ^= 'Fw(500) D(ib) Fz(42px)']")
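The difference between the selector forms comes down to how each one compares against the class attribute string. The sketch below emulates the three rules in plain Python on the two attribute values from the question. It is an illustration of the matching semantics only, not how Selenium or the browser implements them, and the function names are hypothetical.

```python
def matches_exact(class_attr, value):
    """Emulates span[class='value']: the whole attribute must be equal."""
    return class_attr == value


def matches_prefix(class_attr, value):
    """Emulates span[class^='value']: the attribute must start with value."""
    return class_attr.startswith(value)


def matches_token(class_attr, name):
    """Emulates span.name: name must be one of the whitespace-separated classes."""
    return name in class_attr.split()


spans = ["Fw(500) D(ib) Fz(42px)", "Fw(500) D(ib) Fz(42px) Green XYZ"]
target = "Fw(500) D(ib) Fz(42px)"

exact = [s for s in spans if matches_exact(s, target)]
prefix = [s for s in spans if matches_prefix(s, target)]
green = [s for s in spans if matches_token(s, "Green")]

print(len(exact), len(prefix), len(green))
```

Only the prefix rule selects both spans, which is why the ^= form fits the question.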

Try to find elements by XPath:
//span[@class='foobar']
This should work.
