Get href from link above another element with Selenium - python

I'm using Selenium and I need to get the href from a link that sits above (i.e. wraps) many other tags!
The only information I can rely on for sure is the text "Test text!" from the h3 tag!
Here is the example:
<a href="/link/post" class="link" >
<div class="inner">
<div class="header flex">
<h3 class="mb-0">
Test text!
</h3>
</div>
</div>
</a>

Try using the following xpath to locate the desired element:
//a[@href and .//h3[contains(text(),'Test text!')]]
So, to get the href value you can do:
from selenium.webdriver.common.by import By

link = driver.find_element(By.XPATH, "//a[@href and .//h3[contains(text(),'Test text!')]]")
href = link.get_attribute("href")

An alternative to the approach in Prophet's answer would be to use an XPath like
//h3[contains(text(),"Test text!")]/ancestor::a
i.e. first search for the h3 tag and then for an a tag above it.
Prophet's answer uses the opposite approach: first find all a tags and then only keep the one with the correct h3 tag below.
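For completeness, here is a minimal sketch of the ancestor approach (assuming driver is an already-initialized Selenium WebDriver sitting on the page in question):

from selenium.webdriver.common.by import By

# Locate the <h3> by its text, then walk up to the enclosing <a>
link = driver.find_element(By.XPATH, "//h3[contains(text(),'Test text!')]/ancestor::a")
print(link.get_attribute("href"))  # resolved absolute form of "/link/post"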

Related

Python Selenium How to find elements by XPATH with info from TAG and SUB TAG

HTML:
<div id="related">
<a class="123" href="url">
<h3 class="456">
<span id="id00" aria-label="TEXT HERE">
</span>
</h3>
</a>
<a class="123" href="url">
<h3 class="456">
<span id="id00" aria-label="NOT HERE">
</span>
</h3>
</a>
</div>
I'm trying to find & click on the <a> elements (inside the div with id="related") that have class="123" AND where the span's aria-label contains "TEXT":
items = driver.find_elements(By.XPATH, "//div[@id='related']//a[@class='123'][contains(@href, 'url')]//span[contains(@aria-label, 'TEXT']")
But it's not finding the href, it's only finding the span.
Then I want to do:
items[3].click()
How can I do that?
Your XPath has some typo problems.
Try this:
items = driver.find_elements(By.XPATH, "//div[@id='related']//a[@class='123'][contains(@href,'url')]//span[contains(@aria-label,'TEXT')]")
This will give you the span element inside the presented block.
To locate the a element you should use another XPath.
UPD
Finding all the a elements inside the div with id='related' that contain a span with a specific aria-label attribute translates cleanly to an XPath like this:
items = driver.find_elements(By.XPATH, "//div[#id='related']//a[#class='123' and .//span[contains(#aria-label,'TEXT')]]")

Find tag <a> and tag <img> when using bs4

I have the following source code:
<div class='aaa'>
  <div class='aaa-child'>
    <a>
      <img></img>
    </a>
  </div>
</div>
So the structure is an image inside a hyperlink.
I would like to find out whether the tags "a" and "img" exist inside the above divs. Any ideas? I tried find_all, but I get too many results that don't match my expectations.
Yeah, use a descendant CSS selector combined with a class selector:
soup.select('.aaa a img')
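As a self-contained sketch of that check (the HTML below is the snippet from the question):

from bs4 import BeautifulSoup

html = """
<div class='aaa'>
  <div class='aaa-child'>
    <a><img/></a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# non-empty result only if an <img> sits inside an <a> inside an element with class "aaa"
print(bool(soup.select('.aaa a img')))  # True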

Python webscraping with Selenium chrome driver

I'm trying to get the number of publications of an Instagram account, which is in a span tag, using Python Selenium with the Chrome driver. This is a part of the HTML code:
<!doctype html>
<html lang="fr" class="js logged-in client-root js-focus-visible sDN5V">
<head>…</head>
<body>
  <div id="react-root">
    <form enctype="multipart/form-data" method="POST" role="presentation">…</form>
    <section class="_9eogI E3X2T">
      <div></div>
      <main class="SCxLW o64aR" role="main">
        <div class="v9tJq AAaSh VfzDr">
          <header class="HVbuG">…</header>
          <div class="-vDIg">…</div>
          <div class="_4bSq7">…</div>
          <ul class="_3dEHb">
            <li class="LH36I">
              <span class="_81NM2">
                <span class="g47SY 10XF2">6 588</span>
                "publications"
              </span>
            </li>
THE PYTHON CODE
def get_publications_number(self, user):
    self.nav_user(user)
    sleep(16)
    publication = self.driver.find_element_by_xpath('//div[contains(id,"react-root")]/section/main/div/ul/li[1]/span/span')
THE ERROR MESSAGE
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element:
{"method":"xpath","selector":"//div[contains(id,"react-root")]/section/main/div/ul/li[1]/span/span"}
(Session info: chrome=80.0.3987.149)
IMPORTANT:
This XPath is pasted from the Chrome element inspector, so I don't think it's the problem. When I use self.driver.find_elements_by_xpath() (with an 's') there is no error, and if I do:
for value in publication:
    print(value.text)
there is no error either, but nothing gets printed.
SO THE QUESTION IS:
Why am I getting this error while the Xpath exists?
Try
'//div[@id="react-root"]//ul/li//span[contains(., "publications")]/span'
Explanation:
//div[@id="react-root"] << find the element which has the id "react-root"
//ul/li << inside the found react root, find li elements anywhere (//) which are children of a ul element
//span[contains(., "publications")] << in the found li elements, find span elements anywhere which contain "publications" as text
/span << get the direct child span of the found span
One more thing: find_element_by_xpath returns the first element which matches. In case you have more than one 'publications', you can collect them all with the XPath above if you use find_elements_by_xpath instead of find_element_by_xpath in Selenium.
Recently I found this page which is a quite good read to start mastering Xpath, check it out if you want to know more.
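Applied to the method from the question, a sketch could look like this (keeping the question's Selenium 3 style API, its nav_user helper, and its 16-second sleep):

def get_publications_number(self, user):
    self.nav_user(user)
    sleep(16)
    spans = self.driver.find_elements_by_xpath('//div[@id="react-root"]//ul/li//span[contains(., "publications")]/span')
    # the first match holds the count, e.g. "6 588"; None if nothing matched
    return spans[0].text if spans else None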
//div[contains(@id,"react-root")]/section/main/div/ul/li[1]/span/span
Use this XPath. It might work. I think you made a typo there (id instead of @id inside contains()).

Get all links from DOM except from a certain div tag selenium python

How can I get all links in the DOM except those inside a certain div tag?
This is the div I don't want links from:
<div id="yii-debug-toolbar">
<div class="yii-debug-toolbar_bar">
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
</div>
</div>
I get the links in my code like this:
links = driver.find_elements_by_xpath("//a[@href]")
But I don't want to get the ones from that div, how can I do that?
I'm not sure if there is a simple way to do this with just Selenium's XPath capabilities. However, a simple solution could be to parse the HTML with something like BeautifulSoup, get rid of all the <div id="yii-debug-toolbar">...</div> elements, and then select the remaining links.
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(wd.page_source, "html.parser")
# remove the toolbar subtree before collecting links
for div in soup.find_all("div", {"id": "yii-debug-toolbar"}):
    div.decompose()
links = soup.find_all("a", href=True)
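For what it's worth, a pure-XPath alternative also looks possible by excluding links that have that div as an ancestor; a sketch, assuming the same driver as above:

# every <a href> that is NOT inside the yii-debug-toolbar div
links = driver.find_elements_by_xpath("//a[@href and not(ancestor::div[@id='yii-debug-toolbar'])]")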

Scrapy: How do I select the first a tag inside a div element using XPath

I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I'm having is that I scrape both links when using the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is a Shopify store, the source code of their collections pages isn't exactly the same. So the depth of the a tag under the div element is inconsistent, and I'm not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract()
print(product_links[0]) # this is the href of your first a tag
Just use the extract_first() command to extract only the first matched element. A benefit of using this is that it avoids an IndexError and returns None when it doesn't find any element matching the selection.
So, it should be:
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
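Note that in more recent Scrapy/parsel versions extract_first() also has the shorter alias .get(), which behaves the same way:

>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').get()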
