lxml: get element with a particular child element? - python

Working in lxml, I want to get the href attribute of all links with an img child that has title="Go to next page".
So in the following snippet:
<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>
I'd like to get StdResults.aspx back.
I've got this far:
next_link = doc.xpath("//a/img[@title='Go to next page']")
print next_link[0].attrib['href']
But next_link is the img, not the a tag - how can I get the a tag?
Thanks.

Just change a/img... to a[img...]: (the brackets sort of mean "such that")
import lxml.html as lh
content='''<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>'''
doc=lh.fromstring(content)
for elt in doc.xpath("//a[img[@title='Go to next page']]"):
    print(elt.attrib['href'])
# StdResults.aspx
Or, you could go even further and use
"//a[img[@title='Go to next page']]/@href"
to retrieve the values of the href attributes.

You can also select the parent node or arbitrary ancestors by using //a/img[@title='Go to next page']/parent::a or //a/img[@title='Go to next page']/ancestor::a respectively as XPath expressions.
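To compare the approaches from both answers side by side, here is a minimal sketch using lxml.html. The second link and the wrapping div are made-up additions to the question's snippet, included only to show that non-matching links are skipped.

```python
import lxml.html as lh

# The question's snippet plus a hypothetical second link that must NOT match.
content = '''<div>
<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page">
</a>
<a href="Other.aspx"><img src="arrowl.gif" title="Go to previous page"></a>
</div>'''
doc = lh.fromstring(content)

# Predicate on the <a>: links that have a matching <img> child.
by_predicate = doc.xpath("//a[img[@title='Go to next page']]/@href")

# Start at the <img>, then step up to its parent <a>.
by_parent = doc.xpath("//a/img[@title='Go to next page']/parent::a/@href")

# Ancestor axis: also works when the <img> is nested deeper inside the <a>.
by_ancestor = doc.xpath("//img[@title='Go to next page']/ancestor::a/@href")

print(by_predicate, by_parent, by_ancestor)
```

All three expressions end in /@href, so each call returns a list of attribute values rather than elements.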

Related

Checking If Attribute Exists In Any Of Html Tags Selenium Python

I want to check whether any HTML tags have a style attribute, like <a style = ..> or <h1 style = ...> or <div style = ..> etc.
I used below code but it could not be run:
driver = webdriver.Chrome(web_driver_address, options=op)
driver.get(url)
elems = driver.find_elements_by_xpath("[@style]")
How can i fix this?
Your XPath is missing an element tag name.
In your case it can be any tag, but the syntax still requires one, so use * as a wildcard that matches any element.
You are also missing the leading //, which means the element can be anywhere on the page.
So the correct XPath expression will be something like this:
elems = driver.find_elements_by_xpath("//*[@style]")
Don't forget to add a wait or delay so the page can load all its elements before you query them.
XPath needs a tag name to be valid. If you don't want a specific tag, use *:
find_elements_by_xpath("//*[@style]")
Or with css_selector
find_elements_by_css_selector("[style]")
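The wildcard expression can be tried out without starting a browser. The following is only a sketch: it uses the standard library's xml.etree.ElementTree (not Selenium), which supports this subset of XPath when the path starts with a dot, and the sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the question).
sample = """<html><body>
<a style="color:red" href="#">link</a>
<h1 style="margin:0">title</h1>
<div>no inline style</div>
</body></html>"""

root = ET.fromstring(sample)

# ".//*[@style]": any descendant element, whatever its tag, with a style attribute.
styled = root.findall(".//*[@style]")
has_style = len(styled) > 0
styled_tags = [el.tag for el in styled]
print(has_style, styled_tags)
```

An empty result list means no element carries the attribute, which is the same check you would apply to the list returned by find_elements_by_xpath.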

How to skip over child element with Scrapy

I'm looking to scrape just the job description from this page: https://www.aha.io/company/careers/current-openings/customer_success_specialist_project_management_us
I'd like to get all of the text and HTML inside the div with the class of "container py2 content job", EXCEPT the button. It's in an <a> tag with the class of "btn btn-large btn-secondary".
I've got two different xpath selectors that I thought should work, but don't. The first doesn't exclude the button and the second gets rid of all of the other HTML, which I'd like to keep.
response.xpath('//div[@class="container py2 content job"][not(parent::a/@class="btn btn-large btn-secondary")]').extract()
response.xpath('//div[@class="container py2 content job"]/descendant::text()[not(parent::a/@class="btn btn-large btn-secondary")]').extract()
Neither is scraping all of the HTML in the div minus what's inside the a tag. I'm hoping there's something simple that I'm missing, but I can't find what I'm looking for in the documentation.
job_html = response.css('div.content *').extract()
job_html = [x for x in job_html if "Apply now" not in x]
print(job_html)

Get links from a certain div using Selenium in Python

I have the following HTML page. I want to get all the links inside a specific div. Here is my HTML code:
<div class="rec_view">
<a href='www.xyz.com/firstlink.html'>
<img src='imga.png'>
</a>
<a href='www.xyz.com/seclink.html'>
<img src='imgb.png'>
</a>
<a href='www.xyz.com/thrdlink.html'>
<img src='imgc.png'>
</a>
</div>
I want to get all the links that are present on the rec_view div. So those links that I want are,
www.xyz.com/firstlink.html
www.xyz.com/seclink.html
www.xyz.com/thrdlink.html
Here is the Python code which I tried with
from selenium import webdriver;
webpage = r"https://www.testurl.com/page/123/"
driver = webdriver.Chrome("C:\chromedriver_win32\chromedriver.exe")
driver.get(webpage)
element = driver.find_element_by_css_selector("div[class='rec_view']>a")
link = element.get_attribute("href")
print(link)
How can I get those links using selenium on Python?
As per the HTML you have shared to get the list of all the links that are present on the rec_view div you can use the following code block :
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.testurl.com/page/123/')
elements = driver.find_elements_by_css_selector("div.rec_view a")
for element in elements:
    print(element.get_attribute("href"))
Note: Because you need the href attribute of every link in the div, use find_elements_* instead of find_element_*, which returns only the first match. Additionally, > matches only <a> elements that are direct children of the div, whereas a space (the descendant combinator) matches <a> elements at any depth, so the css_selector becomes div.rec_view a

Finding an href link using Python, Selenium, and XPath

I want to get the href from a <p> tag using an XPath expression.
I want to use the text from <h1> tag ('Cable Stripe Knit L/S Polo') and simultaneously text from the <p> tag ('White') to find the href in the <p> tag.
Note: There are more colors of one item (more articles with different <p> tags, but the same <h1> tag)!
HTML source
<article>
<div class="inner-article">
<a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" style="height:150px;">
</a>
<h1>
<a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" class="name-link">Cable Stripe Knit L/S Polo
</a>
</h1>
<p>
White
</p>
</div>
</article>
I've tried this code, but it didn't work.
specificProductColor = driver.find_element_by_xpath("//div[@class='inner-article' and contains(text(), 'White') and contains(text(), 'Cable')]/p")
driver.get(specificProductColor.get_attribute("href"))
As per the HTML source, the XPath expression to get the hyperlink elements would be something like this:
specificProductColors = driver.find_elements_by_xpath("//div[@class='inner-article']//a[contains(text(), 'White') or contains(text(), 'Cable')]")
specificProductColors[0].get_attribute("href")
specificProductColors[1].get_attribute("href")
Since there are two hyperlink tags, you should be using find_elements_by_xpath which returns a list of elements. In this case it would return two hyperlink tags, and you could get their href using the get_attribute method.
I've got working code. It's not the fastest one - this part takes approximately 550 ms, but it works. If someone could simplify that, I'd be very thankful :)
It takes all products with the specified keyword (Cable) from the product page and all products with a specified color (White) from the product page as well. It compares href links and matches wanted product with wanted color.
I also want to simplify the loop - stop both for loops if the links match.
specificProduct = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + productKeyword[arrayCount] + "')]")
specificProductColor = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + desiredColor[arrayCount] + "')]")
for i in specificProductColor:
    specProductColor = i.get_attribute("href")
    for i in specificProduct:
        specProduct = i.get_attribute("href")
        if specProductColor == specProduct:
            print(specProduct)
            wantedProduct = specProduct
driver.get(wantedProduct)
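One possible way to simplify the nested loops and stop at the first match is to put one list of hrefs into a set and return from a helper function as soon as a common link is found. This is only a sketch: first_common_href is a hypothetical helper, and the href lists are invented stand-ins for [e.get_attribute("href") for e in elements].

```python
def first_common_href(product_hrefs, color_hrefs):
    """Return the first href in product_hrefs that is also in color_hrefs, else None."""
    # Skip None/empty values: elements such as <p> have no href attribute.
    wanted = {h for h in color_hrefs if h}
    for href in product_hrefs:
        if href in wanted:
            return href  # returning ends the search immediately
    return None


# Hypothetical hrefs standing in for the values read from the WebElements.
product_hrefs = ["/shop/a/1", None, "/shop/a/2", "/shop/a/3"]
color_hrefs = [None, "/shop/b/9", "/shop/a/2"]

match = first_common_href(product_hrefs, color_hrefs)
print(match)
```

Because the helper returns None when nothing matches, the caller can test the result before passing it to driver.get.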

Getting href value of a tag of selenium web element

I want to get the URL of the link in an <a> tag. I located the element by its class, which gives me objects of type selenium.webdriver.remote.webelement.WebElement in Python:
elem = driver.find_elements_by_class_name("_5cq3")
and the html is:
<div class="_5cq3" data-ft="{"tn":"E"}">
<a class="_4-eo" href="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1" rel="theater" ajaxify="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1&src=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-xfp1%2Ft31.0-8%2F11894571_10153954245456840_9038620401603938613_o.jpg&smallsrc=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-prn2%2Fv%2Ft1.0-9%2F11903991_10153954245456840_9038620401603938613_n.jpg%3Foh%3D0c837ce6b0498cd833f83cfbaeb577e7%26oe%3D567D8819&size=651%2C1000&fbid=10153954245456840&player_origin=profile" style="width:256px;">
<div class="uiScaledImageContainer _4-ep" style="width:256px;height:394px;" id="u_jsonp_2_r">
<img class="scaledImageFitWidth img" src="https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/v/t1.0-0/s526x395/11903991_10153954245456840_9038620401603938613_n.jpg?oh=15f59e964665efe28943d12bd00cefd9&oe=5667BDBA&__gda__=1448928574_a7c6da855842af4c152c2fdf8096e1ef" alt="9GAG's photo." width="256" height="395">
</div>
</a>
</div>
I want the href value of the a tag falling inside the class _5cq3.
Why not do it directly?
url = driver.find_element_by_class_name("_4-eo").get_attribute("href")
And if you need the div element first you can do it this way:
divElement = driver.find_element_by_class_name("_5cq3")
url = divElement.find_element_by_class_name("_4-eo").get_attribute("href")
or another way via XPath (given that there is only one link element inside your _5cq3 element):
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a").get_attribute("href")
You can use XPath for the same thing.
If you want the href of the "a" tag (the 2nd line of your HTML), use
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a[@class='_4-eo']").get_attribute("href")
If you want the URL of the "img" tag (the 4th line of your HTML), read its src attribute, since an img has no href:
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a/div/img[@class='scaledImageFitWidth img']").get_attribute("src")
Use:
1)
An XPath that selects the links first, then read the href from each element:
x = '//a[@class="_4-eo"]'
k = driver.find_elements_by_xpath(x)
for elem in k:
    print(elem.get_attribute("href"))
2) Use @drkthng's solution (the simplest).
3) You can use:
parentElement = driver.find_element_by_class_name("_5cq3")
elementList = parentElement.find_elements_by_tag_name("a")
You can use whichever you prefer in Selenium; there are two or three more ways to find the same element.
And for the image src, use the XPath below:
img_path = '//div[@class="uiScaledImageContainer _4-ep"]//img[@src]'
