I'm trying to iterate through multiple nodes and retrieve various child nodes from the parent nodes. Assume I have something like the following structure:
<div class="wrapper">
<div class="item">
<div class="item-footer">
<div class="item-type">Some data in here</div>
</div>
</div>
<!-- More items listed here -->
</div>
I'm able to receive all child nodes of the wrapper container by using the following.
wrapper = driver.find_element(By.XPATH, '/html/body/div')
items = wrapper.find_elements(By.XPATH, './/*')
Anyway, I couldn't figure out how to get the inner HTML of the container holding the information about the item type. I've tried this, but it didn't work:
for item in items:
    item_type = item.find_element(By.XPATH, './/div/div').get_attribute('innerHTML')
    print(item_type)
This results in the following error:
NoSuchElementException: Message: Unable to locate element:
Does anybody know how I can do that?
If all the elements whose content you want are divs with class attribute value item-type located inside divs with class attribute value item-footer, you can simply do the following:
elements = driver.find_elements(By.XPATH, '//div[@class="item-footer"]//div[@class="item-type"]')
for element in elements:
    data = element.get_attribute('innerHTML')
    print(data)
You can use BeautifulSoup after getting page source from selenium to easily scrape the HTML data.
from bs4 import BeautifulSoup
# selenium code part
# ....
# ....
# driver.page_source is the HTML result from selenium
html_doc = BeautifulSoup(driver.page_source, 'html.parser')
items = html_doc.find_all('div', attrs={'class':'item'})
for item in items:
    text = item.find('div', attrs={'class': 'item-type'}).text
    print(text)
Output:
Some data in here
You need to just find the relative xpath to identify each element and then iterate it.
items = driver.find_elements(By.XPATH, "//div[@class='wrapper']//div[@class='item']//div[@class='item-type']")
for item in items:
    print(item.text)
    print(item.get_attribute('innerHTML'))
Or use the css selector
items = driver.find_elements(By.CSS_SELECTOR, ".wrapper > .item .item-type")
for item in items:
    print(item.text)
    print(item.get_attribute('innerHTML'))
I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPATH for the one constant parent element across all child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPATH: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that do not have an <a> tag and instead include a <span> tag with an inline text link using text element "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page (this eventually returns an IndexError, as only some of the listings have the <a> tag, so the final count does not match the total number of listings on the page indicated by len(name)).
s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))
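The IndexError comes from collecting the two lists independently: every listing yields a name, but only some listings yield an AutoCheck link, so the lists end up different lengths. A minimal sketch of the failure mode, with made-up counts for illustration:

```python
# Hypothetical counts: 5 listings, but only 3 of them have an <a> history-report tag.
names = [f"Listing {i}" for i in range(1, 6)]
links = ["carfax-1", "carfax-2", "carfax-3"]

# Indexing one list by the other list's length raises IndexError at i == 3:
try:
    rows = [(names[i], links[i]) for i in range(len(names))]
except IndexError:
    print("IndexError: the link list is shorter than the name list")
```

A common fix is to locate each listing's container element first and call find_elements inside that container; for a listing without the <a> tag, that returns an empty list you can check, instead of raising.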
I'm trying to fetch the budget using a CSS selector within Scrapy. I can get it when I use XPath, but with a CSS selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(res)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
How can I get the budgetary information using css selector when it is used within scrapy?
This selector .css("h4:contains('Budget:')::text") is selecting the h4 tag, but the text you want is in its parent, the div element.
You could use .css('div.txt-block::text'), but this would return several elements, as the page has several elements like that. CSS selectors don't have a parent pseudo-element; you could use .css('div.txt-block:nth-child(12)::text'), but if you are going to scrape more pages, this will probably fail on other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
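Note that getall() returns every text node directly under that div, including whitespace-only nodes, so you usually filter and strip the result. A sketch of that cleanup, where the list contents are an assumed rendering of the HTML snippet above:

```python
# Assumed result of response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
# for the snippet above: the div's own text nodes, not its children's.
texts = ['\n', '$25,000,000\n', '\n']

# Keep the first non-whitespace node, stripped:
budget = next((t.strip() for t in texts if t.strip()), None)
print(budget)  # $25,000,000
```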
I have the following HTML page. I want to get all the links inside a specific div. Here is my HTML code:
<div class="rec_view">
<a href='www.xyz.com/firstlink.html'>
<img src='imga.png'>
</a>
<a href='www.xyz.com/seclink.html'>
<img src='imgb.png'>
</a>
<a href='www.xyz.com/thrdlink.html'>
<img src='imgc.png'>
</a>
</div>
I want to get all the links that are present on the rec_view div. So those links that I want are,
www.xyz.com/firstlink.html
www.xyz.com/seclink.html
www.xyz.com/thrdlink.html
Here is the Python code which I tried with
from selenium import webdriver

webpage = r"https://www.testurl.com/page/123/"
driver = webdriver.Chrome(r"C:\chromedriver_win32\chromedriver.exe")
driver.get(webpage)
element = driver.find_element_by_css_selector("div[class='rec_view']>a")
link = element.get_attribute("href")
print(link)
How can I get those links using selenium on Python?
As per the HTML you have shared, to get the list of all the links present in the rec_view div you can use the following code block:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.testurl.com/page/123/')
elements = driver.find_elements_by_css_selector("div.rec_view a")
for element in elements:
    print(element.get_attribute("href"))
Note: as you need to collect all the href attributes from within the div tag, use find_elements_* instead of find_element_*. Additionally, > refers to an immediate <a> child node, whereas you need to traverse all the <a> descendant nodes, so the desired css_selector is div.rec_view a.
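The descendant traversal can be sketched outside the browser with the standard library: iterating over every <a> under the div (the equivalent of the space combinator in div.rec_view a) collects all three links. The markup below is a well-formed copy of the snippet from the question:

```python
import xml.etree.ElementTree as ET

html = """<div class="rec_view">
  <a href="www.xyz.com/firstlink.html"><img src="imga.png"/></a>
  <a href="www.xyz.com/seclink.html"><img src="imgb.png"/></a>
  <a href="www.xyz.com/thrdlink.html"><img src="imgc.png"/></a>
</div>"""

root = ET.fromstring(html)
# iter("a") walks all <a> descendants, like the CSS selector "div.rec_view a"
links = [a.get("href") for a in root.iter("a")]
print(links)
```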
I'm trying to use Python and Selenium to scrape multiple links on a web page. I'm using find_elements_by_xpath and I'm able to locate a list of elements but I'm having trouble changing the list that is returned to the actual href links. I know find_element_by_xpath works, but that only works for one element.
Here is my code:
path_to_chromedriver = 'path to chromedriver location'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get("file:///path to html file")
all_trails = []
# find all elements with the class 'text-truncate trail-name',
# then retrieve the <a> element; this seems to give us just the
# element reference, not the actual link
find_href = browser.find_elements_by_xpath('//div[@class="text truncate trail-name"]/a[1]')
all_trails.append(find_href)
print all_trails
This code is returning:
<selenium.webdriver.remote.webelement.WebElement
(session="dd178d79c66b747696c5d3750ea8cb17",
element="0.5700549730549636-1663")>,
<selenium.webdriver.remote.webelement.WebElement
(session="dd178d79c66b747696c5d3750ea8cb17",
element="0.5700549730549636-1664")>,
I expect the all_trails array to be a list of links like: www.google.com, www.yahoo.com, www.bing.com.
I've tried looping through the all_trails list and running the get_attribute('href') method on the list but I get the error:
Does anyone have any idea how to convert the selenium WebElement's to href links?
Any help would be greatly appreciated :)
Let us see what's happening in your code :
Without visibility into the concerned HTML, it seems the following line returns two WebElements into the list find_href, which are in turn appended to the all_trails list:
find_href = browser.find_elements_by_xpath('//div[@class="text truncate trail-name"]/a[1]')
Hence when we print the list all_trails, both WebElements are printed. Hence no error.
As per the error snapshot you have provided, you are trying to invoke the get_attribute("href") method on a list, which is not supported. Hence you see the error:
'List' Object has no attribute 'get_attribute'
Solution :
To get the href attribute, we have to iterate over the List as follows :
find_href = browser.find_elements_by_xpath('//your_xpath')
for my_href in find_href:
    print(my_href.get_attribute("href"))
If you have the following HTML:
<div class="text-truncate trail-name">
Link 1
</div>
<div class="text-truncate trail-name">
Link 2
</div>
<div class="text-truncate trail-name">
Link 3
</div>
<div class="text-truncate trail-name">
Link 4
</div>
Your code should look like:
all_trails = []
all_links = browser.find_elements_by_css_selector(".text-truncate.trail-name>a")
for link in all_links:
    all_trails.append(link.get_attribute("href"))
Where all_trails -- is a list of links (Link 1, Link 2 and so on).
Hope it helps you!
find_elements_by_css_selector returns a list of WebElements, so you cannot call get_attribute on the result directly. Either use the singular form find_element_by_css_selector to get a single WebElement, or loop through each WebElement in the list to read its attribute.
find_href = browser.find_elements_by_xpath('//div[@class="text truncate trail-name"]/a[1]')
for i in find_href:
    all_trails.append(i.get_attribute('href'))
get_attribute works on the elements of that list, not the list itself. For example:
def fetch_img_urls(search_query: str):
    driver.get('https://images.google.com/')
    search = driver.find_element(By.CLASS_NAME, "gLFyf.gsfi")
    search.send_keys(search_query)
    search.send_keys(Keys.RETURN)
    links = []
    try:
        time.sleep(5)
        urls = driver.find_elements(By.CSS_SELECTOR, 'a.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')
        for url in urls:
            # print(url.get_attribute("href"))
            links.append(url.get_attribute("href"))
        print(links)
    except Exception as e:
        print(f'error{e}')
    driver.quit()
I want to get the URL from the link in the a tag. I have located the element, which is of type selenium.webdriver.remote.webelement.WebElement, in Python:
elem = driver.find_elements_by_class_name("_5cq3")
and the html is:
<div class="_5cq3" data-ft='{"tn":"E"}'>
<a class="_4-eo" href="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1" rel="theater" ajaxify="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1&src=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-xfp1%2Ft31.0-8%2F11894571_10153954245456840_9038620401603938613_o.jpg&smallsrc=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-prn2%2Fv%2Ft1.0-9%2F11903991_10153954245456840_9038620401603938613_n.jpg%3Foh%3D0c837ce6b0498cd833f83cfbaeb577e7%26oe%3D567D8819&size=651%2C1000&fbid=10153954245456840&player_origin=profile" style="width:256px;">
<div class="uiScaledImageContainer _4-ep" style="width:256px;height:394px;" id="u_jsonp_2_r">
<img class="scaledImageFitWidth img" src="https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/v/t1.0-0/s526x395/11903991_10153954245456840_9038620401603938613_n.jpg?oh=15f59e964665efe28943d12bd00cefd9&oe=5667BDBA&__gda__=1448928574_a7c6da855842af4c152c2fdf8096e1ef" alt="9GAG's photo." width="256" height="395">
</div>
</a>
</div>
I want the href value of the a tag falling inside the class _5cq3.
Why not do it directly?
url = driver.find_element_by_class_name("_4-eo").get_attribute("href")
And if you need the div element first, you can do it this way:
divElement = driver.find_element_by_class_name("_5cq3")
url = divElement.find_element_by_class_name("_4-eo").get_attribute("href")
Or another way via XPath (given that there is only one link element inside your _5cq3 element):
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a").get_attribute("href")
You can use XPath for the same.
If you want the href of the <a> tag (the 2nd line of your HTML code), use:
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a[@class='_4-eo']").get_attribute("href")
If you want the src of the <img> tag (the 4th line of your HTML code; note that img tags have a src attribute, not href), use:
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a/div/img[@class='scaledImageFitWidth img']").get_attribute("src")
Use:
1) XPath to specify the path to the href first, then iterate over the returned list:
x = '//a[@class="_4-eo"]'
elements = driver.find_elements_by_xpath(x)
for element in elements:
    print(element.get_attribute("href"))
2) Use @drkthng's solution (the simplest).
3) You can find the parent first, then the links inside it:
parentElement = driver.find_element_by_class_name("_5cq3")
elementList = parentElement.find_elements_by_tag_name("a")
You can use whatever you want in Selenium; there are 2-3 more ways to find the same element.
And for the image src, use the below XPath:
img_path = '//div[@class="uiScaledImageContainer _4-ep"]//img[@src]'