How to select the html attribute inside a selenium object - python

I am learning web scraping with Selenium and I've run into an issue when trying to select an element inside a Selenium object. I can get the broader data if I just print elems.text inside the loop (this outputs the whole paragraph for each listing). However, when I try to access the h2 title tag of each listing by XPath inside this broader element, only the first listing is appended to the titles list, whereas I want all of them. I checked the XPaths and they are the same for each listing. How can I get all of the listings instead of just the first one?
titles = []
driver.get("https://www.sellmytimesharenow.com/timeshare/All+Timeshare/vacation/buy-timeshare/")
results = driver.find_elements(By.CLASS_NAME, "results-list")
for elems in results:
    print(elems.text)  # this prints out full description paragraphs
    elem_title = elems.find_element(By.XPATH, '//*[@id="search-page"]/div[3]/div/div/div[2]/div/div[2]/div/a[1]/div/div[1]/div/h2')
    titles.append(elem_title.text)

If you aren't limited to accessing the elements by XPATH only, then here is my solution:
results = driver.find_elements(By.CLASS_NAME, "result-box")
for elems in results:
    titles.append(elems.text.split("\n")[0])
When you try getting the listings, you use find_elements(By.CLASS_NAME, "results-list"), but on the website, there is only one element with the class name "results-list". This aggregates all the text in this div into one long string and therefore you can't get the heading.
However, there are multiple elements with the class name "result-box", so find_elements will store each as its own item in "results". Because the title of each listing is on the first line, you can split the text of each element by the newline.
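If the titles need to come from the h2 elements rather than from the first line of text, one option is to search relative to each "result-box" container. An XPath starting with // is evaluated from the document root even when it is called on an element, so find_element with such a path returns the first match on the page; a path starting with .// stays inside the current element. The sketch below illustrates the per-container pattern with the standard library's xml.etree.ElementTree on made-up markup (the sample HTML and the resort names are assumptions, not the real page). With Selenium, the analogous call inside the loop would be elems.find_element(By.XPATH, ".//h2").

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real results page.
page = """
<div class="results-list">
  <div class="result-box"><a><h2>Resort A</h2><p>2 bedrooms</p></a></div>
  <div class="result-box"><a><h2>Resort B</h2><p>1 bedroom</p></a></div>
  <div class="result-box"><a><h2>Resort C</h2><p>Studio</p></a></div>
</div>
"""

root = ET.fromstring(page.strip())

titles = []
# One container per listing; the search for h2 is relative to that container.
for box in root.findall(".//div[@class='result-box']"):
    heading = box.find(".//h2")
    if heading is not None:
        titles.append(heading.text)

print(titles)
```

Looping over the per-listing containers and searching inside each one keeps every title paired with its own listing, even if some listing has no h2 at all.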

Related

Webscraping specific sections of page without 'class' or 'id' identifiers

I am having issues web-scraping a tag element with BeautifulSoup4 in Python. Typically the elements have a class or id identifier, so I can use:
.find_all('p', class_='class-name')
to find the element. However, the elements I am trying to isolate are in a consecutive list of tags, none of which has an identifier.
Is there a way to choose every tag after a tag that has an identifier? Or maybe a way to isolate the specific tags I want without them having any shared class/id?

You could use find_next_sibling to find the classless next sibling of an element.
Consider this example HTML. The first div has the class "blah". The second div has no class but is beside the first div.
html='<div><div class="blah">1</div><div>no class</div></div>'
import bs4
soup = bs4.BeautifulSoup(html,'html.parser')
soup.find('div',{'class':"blah"}).find_next_sibling()
# outputs the second div, which has no class:
# <div>no class</div>
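To collect every tag that follows an identified one, rather than only the next one, bs4 also provides find_next_siblings() (plural), which returns all following siblings. The same idea can be sketched with only the standard library, which is what the block below does: it uses xml.etree.ElementTree instead of BeautifulSoup, and the sample markup and the helper name siblings_after are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup: one identified tag followed by tags with no class or id.
html = '<div><p class="intro">start</p><p>first</p><p>second</p></div>'
root = ET.fromstring(html)

def siblings_after(parent, marker):
    """Return the children of parent that come after marker."""
    children = list(parent)
    position = children.index(marker)
    return children[position + 1:]

# Find the one tag that does have an identifier...
marker = root.find("p[@class='intro']")
# ...then take everything that follows it under the same parent.
following = [p.text for p in siblings_after(root, marker)]

print(following)
```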

how do i access nested html elements using selenium?

I am using a school class schedule website and I want to access the div element that contains info on how many seats are in a class and who is teaching it, in order to scrape it. I first find the element that contains the div I want, and then I try to find that div using XPath. The problem is that when I use either find_element_by_xpath or find_elements_by_xpath to get the div, I get this error:
'list' object has no attribute 'find_element_by_xpath'
Is this error happening because the div element I want to find is nested? Is there a way to get nested elements using a div tag?
Here is the code I have currently:
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://app.testudo.umd.edu/soc/202008/INST"
driver.get(url)
section_container = driver.find_elements_by_id('INST366')
sixteen_grid = section_container.find_element_by_xpath(".//div[@class = 'sections sixteen colgrid']").text
The info I want is this:
<div class="sections sixteen colgrid"></div>
It's currently inside this tag with the id:
<div id="INST366" class="course"></div>
It would be greatly appreciated if anyone could help me out with this.

From the documentation of find_elements_by_id:
Returns : list of WebElement - a list with elements if any was found. An empty list if not
This means section_container is a list. You can't call find_element_by_xpath on a list, but you can call it on each element within the list, because each one is a WebElement.
What does the documentation say about find_element_by_id?
Returns : WebElement - the element if it was found
In this case you can use find_element_by_xpath directly. Which one should you use? It depends on your need: whether you want only the first match to keep digging for information, or all the matches.
After fixing that you will encounter a second problem: the information is only displayed after JavaScript runs when you click "Show Sections", so you need to click it before locating what you want. To do that, find the a element for that link and click it.
The new code will look like this:
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()
url = "https://app.testudo.umd.edu/soc/202008/INST"
driver.get(url)
section_container = driver.find_element_by_id('INST366')
section_container.find_element_by_xpath(".//a[@class='toggle-sections-link']").click()
sleep(1)
section_info = section_container.find_element_by_xpath(".//div[@class='sections sixteen colgrid']").text
driver.quit()
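A fixed sleep(1) can be too short on a slow connection and wasteful on a fast one. Selenium provides WebDriverWait (in selenium.webdriver.support.ui) for waiting on a condition instead. The underlying idea is plain polling, sketched below without Selenium so it runs anywhere; wait_until and its parameters are hypothetical names, "Section 0101" is a placeholder value, and sections_loaded stands in for a check such as "the sections div is now present".

```python
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)

# Stand-in for content that appears only after some JavaScript has run:
# the list is empty for the first two checks, then gets filled.
calls = {"count": 0}
sections = []

def sections_loaded():
    calls["count"] += 1
    if calls["count"] == 3:
        sections.append("Section 0101")
    return sections  # an empty list is falsy, a non-empty one is truthy

found = wait_until(sections_loaded, timeout=2.0, interval=0.01)
print(found)
```

The wait ends as soon as the condition holds, so the script does not pay the full timeout when the page responds quickly.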

Python/Selenium Finding a specific class element, analyzing if it contains a specific span class, if it does, copy the link

I am trying to create a script that loops through my inbox and finds all divs with the class "relative flex". If the div contains a span with the class "dn dib-l", the script should copy the following href link, save it to my list, and move on to the next div.
Here is the html code:
<div class="relative flex">
<span class="dn dib-l" style="left: -16px;"></span>
hey how are you?
Here is the code I have now:
link_list = []
sex_list = []
message = browser.find_elements_by_xpath('//*[@class="relative flex"]')
message_new = browser.find_elements_by_xpath('//*[@class="dn dib-l"]')
for item in message:
    link = item.find_element_by_xpath('.//a').get_attribute('href')
    if message_new in message:
        link_list.append(link)
Issue:
message, message_new all contain data when requested, however despite there being multiple messages with these classes, link variable only contains one element and link_list contains no elements. What changes do I need to make in my code in order for it to save all links within div classes that contain this span class?

I would restructure this code a bit to make it more efficient. It sounds like you want to analyze all of the div elements that have the class relative flex. Then, if the div contains a certain span element, you want to save the href attribute of the following a element. Here's how I would write this:
# locate the span elements which exist under your desired div
spans_to_iterate = browser.find_elements_by_xpath("//div[contains(@class, 'relative flex')]/span[contains(@class, 'dn dib-l')]")
link_list = []
# iterate over the span elements to save the href attribute of the a element
for span in spans_to_iterate:
    # get the href attribute, where the 'a' element is a following sibling of the span
    link_text = span.find_element_by_xpath("following-sibling::a").get_attribute("href")
    link_list.append(link_text)
The idea behind this code is that we first retrieve the span elements that exist in your desired div. In your problem description, you mentioned you only wanted to save the link if the div and span elements had specific class names. So, we query directly for the elements you mentioned, rather than finding the div first and then the span.
Then, we iterate over these span elements and use XPath's following-sibling axis to grab the a element that appears right after. We can use get_attribute to read the href attribute, and then append the link to the list.
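The filtering rule ("keep the link only if this div contains the span") can also be checked div by div, which is closer to what the loop in the question attempted. There, message_new in message asks whether a whole list is an item of another list, so the condition is not met and nothing is appended. Below is a minimal sketch of the per-div check using xml.etree.ElementTree on invented markup; the URLs and message texts are placeholders. In Selenium, the equivalent test would be whether item.find_elements_by_xpath("./span[@class='dn dib-l']") returns a non-empty list.

```python
import xml.etree.ElementTree as ET

# Invented inbox markup: only the first and third messages carry the marker span.
inbox = """
<div id="inbox">
  <div class="relative flex"><span class="dn dib-l"></span><a href="/msg/1">hey how are you?</a></div>
  <div class="relative flex"><a href="/msg/2">old message</a></div>
  <div class="relative flex"><span class="dn dib-l"></span><a href="/msg/3">hello</a></div>
</div>
"""

root = ET.fromstring(inbox.strip())

link_list = []
for div in root.findall("div[@class='relative flex']"):
    # Keep the link only when this particular div has the marker span as a child.
    if div.find("span[@class='dn dib-l']") is None:
        continue
    anchor = div.find(".//a")
    if anchor is not None:
        link_list.append(anchor.get("href"))

print(link_list)
```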

Try this:
xpth = "//div[@class='relative flex' and span[@class='dn dib-l']]//a[@href]"
links = [a.get_attribute("href") for a in browser.find_elements_by_xpath(xpth)]

get the second element from a list of elements with selenium in python

I want to get the inner HTML of an element (with get_attribute('innerHTML')), but it doesn't have an id or class and there are multiple elements with the same tag name.
test1=driver.find_elements_by_tag_name("td")
This gets the whole list of elements with the same tag name, but it doesn't work because get_attribute can't be called on a list of elements.
test2=driver.find_element_by_tag_name("td")
This works, but it gets the very first td element and I want the second td element.
How do I do this correctly?

You can use an XPath like the one below to get the second td of every row in a table.
driver.find_elements_by_xpath("//table//tr/td[2]")
Modify the XPath to point to your required table if you need it from a particular table.

As per your question, the following line of code returns the very first td element:
test2=driver.find_element_by_tag_name("td")
To retrieve the inner HTML of the second td element you can use either of the following lines of code:
xpath :
test2 = driver.find_element_by_xpath("(//table//td)[2]").get_attribute("innerHTML")
css_selector :
test2 = driver.find_element_by_css_selector("table tr td:nth-child(2)").get_attribute("innerHTML")
Note : (//table//td)[2] selects the second <td> in document order, and td:nth-child(2) selects a <td> that is the second child of its row. You may have to adjust the initial part of either locator to match your HTML DOM.
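The two readings of "the second td" give different results, so it helps to be explicit about which one is meant. Indexing the list returned by find_elements_by_tag_name("td") with [1] gives the second cell in the whole document, while a positional predicate such as tr/td[2] gives the second cell of every row. The sketch below shows the difference with xml.etree.ElementTree on a made-up table; it stands in for a browser session and is not Selenium code.

```python
import xml.etree.ElementTree as ET

# Made-up table used only to illustrate the two selections.
table = ET.fromstring(
    "<table>"
    "<tr><td>a1</td><td>a2</td><td>a3</td></tr>"
    "<tr><td>b1</td><td>b2</td><td>b3</td></tr>"
    "</table>"
)

# Python lists are zero-based, so index 1 is the second cell overall.
all_cells = table.findall(".//td")
second_overall = all_cells[1].text

# XPath positions are one-based, so td[2] is the second cell within each row.
second_per_row = [td.text for td in table.findall(".//tr/td[2]")]

print(second_overall)
print(second_per_row)
```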

Accessing content of all divs having same class name but different xpaths

I am trying to extract data from two divisions in XHTML having the same class name using python but when I try to take out their xpaths, they are different. I tried using
driver = webdriver.Chrome()
content = driver.find_element_by_class_name("abc")
print(content.text)
but it gives only the content of first div. I heard this can be done using xpath. The xpaths of the divs are as follows:
//*[@id="u_jsonp_2_o"]/div[2]/div[1]/div[3]
//*[@id="tl_unit_-5698935668596454905"]/div/div[2]/div[1]/div[3]
//*[@id="u_jsonp_3_c"]/div[2]/div[1]/div[3]
Since each XPath has the same ending, I wondered whether we can use this similarity and access the divisions in Python by writing [1], [2], [3], ... at the end of the XPath.
Also, I want to make content an array containing the content of all elements with the class abc. Moreover, I don't know how many abc elements exist! How can I collect the data of all of them in one content array?

In your case it doesn't matter whether you use class name or CSS: find_element only ever returns one element, but you want to find several.
You need to use find_elements_by_class_name (note the plural "elements"):
content = driver.find_elements_by_class_name("abc")
for element in content:
    # here you can work with every single element that has class "abc"
    # do whatever you want
    print(element.text)
