Get links from a certain div using Selenium in Python

I have the following HTML page. I want to get all the links inside a specific div. Here is my HTML code:
<div class="rec_view">
<a href='www.xyz.com/firstlink.html'>
<img src='imga.png'>
</a>
<a href='www.xyz.com/seclink.html'>
<img src='imgb.png'>
</a>
<a href='www.xyz.com/thrdlink.html'>
<img src='imgc.png'>
</a>
</div>
I want to get all the links that are present in the rec_view div. The links that I want are:
www.xyz.com/firstlink.html
www.xyz.com/seclink.html
www.xyz.com/thrdlink.html
Here is the Python code I tried:
from selenium import webdriver
webpage = r"https://www.testurl.com/page/123/"
driver = webdriver.Chrome("C:\chromedriver_win32\chromedriver.exe")
driver.get(webpage)
element = driver.find_element_by_css_selector("div[class='rec_view']>a")
link = element.get_attribute("href")
print(link)
How can I get those links using Selenium in Python?

As per the HTML you have shared, to get the list of all the links present in the rec_view div you can use the following code block:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.testurl.com/page/123/')
elements = driver.find_elements_by_css_selector("div.rec_view a")
for element in elements:
    print(element.get_attribute("href"))
Note: As you need to collect all the href attributes from the div tag, you need to use find_elements_* instead of find_element_*. Additionally, > refers only to an immediate <a> child node, whereas you need to traverse all the <a> child nodes, so the desired css_selector is div.rec_view a.
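For illustration, a minimal sketch of the two combinators side by side (assuming the same driver and page as above):
# Child combinator: only <a> elements that are direct children of the div
direct_links = driver.find_elements_by_css_selector("div.rec_view > a")
# Descendant combinator: <a> elements at any depth inside the div
all_links = driver.find_elements_by_css_selector("div.rec_view a")
print(len(direct_links), len(all_links))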

Related

How to receive inner HTML of a child node in selenium (python)?

I'm trying to iterate through multiple nodes and receive various child nodes from the parent nodes. Assuming that I've something like the following structure:
<div class="wrapper">
<div class="item">
<div class="item-footer">
<div class="item-type">Some data in here</div>
</div>
</div>
<!-- More items listed here -->
</div>
I'm able to receive all child nodes of the wrapper container by using the following.
wrapper = driver.find_element(By.XPATH, '/html/body/div')
items = wrapper.find_elements(By.XPATH, './/*')
However, I couldn't figure out how to receive the inner HTML of the container holding the information about the item type. I've tried this, but it didn't work:
for item in items:
    item_type = item.find_element(By.XPATH, './/div/div').get_attribute('innerHTML')
    print(item_type)
This results in the following error:
NoSuchElementException: Message: Unable to locate element:
Does anybody know how I can do that?
In case all the elements whose content you want to get are divs with the class attribute value item-type located inside divs with the class attribute value item-footer, you can simply do the following:
elements = driver.find_elements(By.XPATH, '//div[@class="item-footer"]//div[@class="item-type"]')
for element in elements:
    data = element.get_attribute('innerHTML')
    print(data)
You can use BeautifulSoup after getting the page source from selenium to easily scrape the HTML data.
from bs4 import BeautifulSoup
# selenium code part
# ....
# ....
# driver.page_source is the HTML result from selenium
html_doc = BeautifulSoup(driver.page_source, 'html.parser')
items = html_doc.find_all('div', attrs={'class':'item'})
for item in items:
    text = item.find('div', attrs={'class':'item-type'}).text
    print(text)
Output:
Some data in here
You just need to find the relative XPath that identifies each element and then iterate over the results.
items = driver.find_elements(By.XPATH, "//div[@class='wrapper']//div[@class='item']//div[@class='item-type']")
for item in items:
    print(item.text)
    print(item.get_attribute('innerHTML'))
Or use the css selector:
items = driver.find_elements(By.CSS_SELECTOR, ".wrapper > .item .item-type")
for item in items:
    print(item.text)
    print(item.get_attribute('innerHTML'))

Extracting just the link in the <a> tag using selenium in Python

I'm using selenium in Python to do some web scraping and I want to get just the links here.
<ul class="liste-sous-menu">
<li class="target Menu" id="summary1">
<a href="../associations/formalites-administratives-association">Formalités administratives d'une
association</a>
<ul class="ul-dossier">
<li>
Création</li>
</ul>
</li>
</ul>
I'm interested only in the links inside the tag with id summary1, not the other links mentioned in the second unordered list. Since I have a long list of ids that start with summary, I wrote the code below, but when referring to it I get just the text and not the link. Do you have any other suggestion?
list_of_services = driver.find_elements_by_class_name("liste-sous-menu")
for service in list_of_services:
    # In each element, select the tags
    atags = service.find_elements_by_xpath("//li[starts-with(@id,'summary')]")
    for atag in atags:
        # In each atag, select the href
        href = atag.get_attribute('href')
        # Open a new window
        driver.execute_script("window.open('');")
        # Switch to the new window and open URL
        driver.switch_to.window(driver.window_handles[1])
        driver.get(href)
        sleep(3)
So when I want to get the link I get this error
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string
(Session info: chrome=88.0.4324.104)
What happens: You are just selecting the <li> and not the <a>, so you won't get a href. Try the following xpath to select the link instead:
//li[starts-with(@id,'summary')]/a
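A minimal sketch of how that xpath fits into your loop, keeping your find_elements_by_* calls (the leading dot keeps the search relative to each service element; everything else is assumed from your snippet):
atags = service.find_elements_by_xpath(".//li[starts-with(@id,'summary')]/a")
for atag in atags:
    href = atag.get_attribute('href')  # now taken from the <a>, so it is a string
    print(href)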

How do I extract text from a button using Beautiful Soup?

I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the html I am trying to navigate. I am attempting to retrieve 11.1K:
<ul class="list-unstyled m-meta-list m-meta-list--default">
<li class="m-meta-list-item">
<button class="text-stat disp-inline text-left a-button a-button--inline" data-element-id="btn_donors" type="button" data-analytic-event-listener="true">
<span class="text-stat-value text-underline">11.1K</span>
<span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_ = 'm-meta-list-item')
for donor in donors:
    print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar html and that value is dynamically retrieved. I would suggest using selenium and a css class selector:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
Alternatively, get the id from gofundme.com/f/{THEID} and call the API:
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
then process the data:
for people in apiResponse['references']['donations']:
    print(people['name'])
Use the browser console to find the API host.
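A minimal sketch of that approach with requests (the API host placeholder and the response layout are taken on faith from the snippet above; confirm both in the browser console for the page you scrape):
import requests

API_HOST = 'https://HOST-FROM-BROWSER-CONSOLE'  # placeholder: find the real host in the browser console
the_id = 'treatmentforsiyona'  # taken from gofundme.com/f/{THEID}

url = f'{API_HOST}/web-gateway/v1/feed/{the_id}/donations?sort=recent&limit=20&offset=20'
apiResponse = requests.get(url).json()

for people in apiResponse['references']['donations']:
    print(people['name'])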

Can't grab next sibling using css selector within scrapy

I'm trying to fetch the budget using a css selector within scrapy. I can get it when I use xpath, but in the case of a css selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(res)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
How can I get the budgetary information using a css selector within scrapy?
This selector .css("h4:contains('Budget:')::text") is selecting the h4 tag, and the text you want is in its parent, the div element.
You could use .css('div.txt-block::text') but this would return several elements, as the page has several elements like that. CSS selectors don't have a parent pseudo-element; I guess you could use .css('div.txt-block:nth-child(12)::text'), but if you are going to scrape more pages, this will probably fail on other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
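The text nodes returned by that xpath include surrounding whitespace, so a short follow-up sketch (assuming response is scrapy's response for the same IMDb page) joins and strips them:
raw = response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
budget = ''.join(part.strip() for part in raw)
print(budget)  # expected: $25,000,000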

Getting the href value of an <a> tag from a selenium web element

I want to get the url of the link in the <a> tag. I have fetched the element by class name, which is of type selenium.webdriver.remote.webelement.WebElement in Python:
elem = driver.find_elements_by_class_name("_5cq3")
and the html is:
<div class="_5cq3" data-ft="{"tn":"E"}">
<a class="_4-eo" href="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1" rel="theater" ajaxify="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1&src=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-xfp1%2Ft31.0-8%2F11894571_10153954245456840_9038620401603938613_o.jpg&smallsrc=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-prn2%2Fv%2Ft1.0-9%2F11903991_10153954245456840_9038620401603938613_n.jpg%3Foh%3D0c837ce6b0498cd833f83cfbaeb577e7%26oe%3D567D8819&size=651%2C1000&fbid=10153954245456840&player_origin=profile" style="width:256px;">
<div class="uiScaledImageContainer _4-ep" style="width:256px;height:394px;" id="u_jsonp_2_r">
<img class="scaledImageFitWidth img" src="https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/v/t1.0-0/s526x395/11903991_10153954245456840_9038620401603938613_n.jpg?oh=15f59e964665efe28943d12bd00cefd9&oe=5667BDBA&__gda__=1448928574_a7c6da855842af4c152c2fdf8096e1ef" alt="9GAG's photo." width="256" height="395">
</div>
</a>
</div>
I want the href value of the a tag falling inside the class _5cq3.
Why not do it directly?
url = driver.find_element_by_class_name("_4-eo").get_attribute("href")
And if you need the div element first you can do it this way:
divElement = driver.find_element_by_class_name("_5cq3")
url = divElement.find_element_by_class_name("_4-eo").get_attribute("href")
or another way via xpath (given that there is only one link element inside your _5cq3 element):
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a").get_attribute("href")
You can use xpath for the same.
If you want the href of the "a" tag (2nd line of your HTML code), use
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a[@class='_4-eo']").get_attribute("href")
If you want the src of the "img" tag (4th line of your HTML code), use
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a/div/img[@class='scaledImageFitWidth img']").get_attribute("src")
Use:
1) xpath to specify the path to the links first.
x = '//a[@class="_4-eo"]'
elements = driver.find_elements_by_xpath(x)
for element in elements:
    print(element.get_attribute("href"))
2) Use @drkthng's solution (the simplest).
3) You can use a parent element and then search inside it:
parentElement = driver.find_element_by_class_name("_5cq3")
url = parentElement.find_element_by_tag_name("a").get_attribute("href")
You can use whatever you want in Selenium; there are 2-3 more ways to find the same element.
And for the image src use the below xpath:
img_path = '//div[@class="uiScaledImageContainer _4-ep"]//img[@src]'
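A short usage sketch for that xpath (assuming the same driver session as above):
img_path = '//div[@class="uiScaledImageContainer _4-ep"]//img[@src]'
img_src = driver.find_element_by_xpath(img_path).get_attribute("src")
print(img_src)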
