I can't extract href from html webpage - python

I'm using Scrapy to fetch data from this webpage.
I'm relatively new to this. I need to get the href link of the next-page button (>) but can't find the solution.
Please help.
I tried this in the terminal:
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@href').extract()
but it just gives me [].
This is the html code of the button:
data-page="2" data-url="http://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" class="btn--pagination btn--pag-next pag-control" href="//www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=2&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" style="">
>
</a>

The issue is that the link element doesn't have an href attribute in the response Scrapy actually receives. Use the data-url attribute instead:
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@data-url').get()
Output:
'http://worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-21'
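For context, a minimal sketch of how the extracted data-url could be followed inside a spider (the spider name and callback wiring here are just illustrative, not from the question):
import scrapy

class ToplistSpider(scrapy.Spider):
    name = "toplist"  # hypothetical name for illustration
    start_urls = [
        "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior"
    ]

    def parse(self, response):
        # ... extract the table rows here ...
        next_url = response.xpath(
            '//a[@class="btn--pagination btn--pag-next pag-control"]/@data-url'
        ).get()
        if next_url:
            # follow the pagination URL taken from data-url
            yield response.follow(next_url, callback=self.parse)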

Related

Python Selenium - get href

I want to get only certain links on the page, but I get all links instead. How can I pull the links by specifying a selector?
For ex:
I'm using:
ids = browser.find_elements_by_xpath('//a[@href]')
for ii in ids:
    print(ii.get_attribute('href'))
Result: All Links
But I want just this specific link:
<a class="classifiedTitle" title="MONSTER TULPAR T5V13.1+ 15,6 EKRAN+İ7+6GB GTX1060+16RAM+256GB SS" href="/ilan/ikinci-el-ve-sifir-alisveris-bilgisayar-dizustu-notebook-monster-tulpar-t5v13.1-plus-15%2C6-ekran-plusi7-plus6gb-gtx1060-plus16ram-plus256gb-ss-793070526/detay">
MONSTER TULPAR T5V13.1+ 15,6 EKRAN+İ7+6GB GTX1060+16RAM+256GB SS</a>
So how can I narrow it down with a selector?
Thanks & Regards
Try the following css selector to get the specific link.
print(browser.find_element_by_css_selector("a.classifiedTitle[title='MONSTER TULPAR T5V13.1+ 15,6 EKRAN+İ7+6GB GTX1060+16RAM+256GB SS'][href*='/ilan/ikinci-el-ve-sifir-alisveris-bilgisayar-dizustu-notebook-monster-tulpar-']").get_attribute("href"))
If you want just the item in your example:
href = browser.find_element_by_xpath("//a[@title='MONSTER TULPAR T5V13.1+ 15,6 EKRAN+İ7+6GB GTX1060+16RAM+256GB SS' and @href='/ilan/ikinci-el-ve-sifir-alisveris-bilgisayar-dizustu-notebook-monster-tulpar-t5v13.1-plus-15%2C6-ekran-plusi7-plus6gb-gtx1060-plus16ram-plus256gb-ss-793070526/detay']").get_attribute('href')
There is obviously more than one way of identifying your element; this is simply an example using XPath.
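If the goal is instead to collect the href of every link that shares that class (rather than matching the full title), a short sketch using the same legacy Selenium API as the answers above:
links = browser.find_elements_by_css_selector("a.classifiedTitle")
for link in links:
    # print the href of each classified listing link
    print(link.get_attribute("href"))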

Href not visible in scrapy result but visible in html

Set-up
I have the next-page button element from this page,
<li class="Pagination-item Pagination-item--next Pagination-item--nextSolo ">
<button type="button" class="Pagination-link js-veza-stranica kist-FauxAnchor" data-page="2" data-href="https://www.njuskalo.hr/prodaja-kuca?page=2" role="link">Sljedeća <span aria-hidden="true" role="presentation">»</span></button>
</li>
I need to obtain the url in the data-href attribute.
Code
Using the following simple xpath to the button element in scrapy shell,
response.xpath('//*[#id="form_browse_detailed_search"]/div/div[1]/div[5]/div[1]/nav/ul/li[8]/button').extract_first()
I retrieve,
'<button type="button" class="Pagination-link js-veza-stranica" data-page="2">Sljedeća\xa0<span aria-hidden="true" role="presentation">»</span></button>'
Question
Where did the data-href attribute go to?
How do I obtain the url?
The data-href attribute is most likely being calculated by some JavaScript code running in your browser. If you look at the raw source code of this page ("view source code" option in your browser), you won't find that attribute there.
The output you see in the developer tools is the DOM rendered by your browser, so you can expect differences between your browser view and what Scrapy actually fetches (the raw HTML source). Keep in mind that Scrapy doesn't execute any JavaScript code.
Anyway, a way to solve this would be building the pagination URL based on the data-page attribute:
from w3lib.url import add_or_replace_parameter
...
next_page = response.css('.Pagination-item--nextSolo button::attr(data-page)').get()
next_page_url = add_or_replace_parameter(response.url, 'page', next_page)
w3lib is an open source library: https://github.com/scrapy/w3lib
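Put into a spider callback, that could look roughly like this (the spider name is made up and the item extraction is omitted):
import scrapy
from w3lib.url import add_or_replace_parameter

class HousesSpider(scrapy.Spider):
    name = "njuskalo_houses"  # hypothetical name for illustration
    start_urls = ["https://www.njuskalo.hr/prodaja-kuca"]

    def parse(self, response):
        # ... yield the items scraped from the current page here ...
        next_page = response.css(
            '.Pagination-item--nextSolo button::attr(data-page)').get()
        if next_page:
            next_page_url = add_or_replace_parameter(response.url, 'page', next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)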

how to get anchor tag href attribute when it's hidden with selenium python

I'm currently using Selenium to make a simple python crawler.
Is there any method by which I can get the link address (URL) of an anchor tag? When I look at the HTML source, it's hidden, like this:
<a id='foo' href='#'></a>
I can actually click and load the page and get the url address, but then I need to wait a while.
For the HTML you have shared:
<a id='foo' href='#'></a>
You can get the web element like this:
element = driver.find_element_by_id('foo')
Once you have the web element, you can get the attribute like this:
href_val = element.get_attribute("href")
print(href_val)
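If the real URL is only written into the attribute by JavaScript after a delay (which the "I need to wait a while" part suggests), one option is an explicit wait. A rough sketch, keeping in mind that get_attribute('href') returns the resolved URL, so a bare href='#' comes back as the current page URL plus '#':
from selenium.webdriver.support.ui import WebDriverWait

def href_is_populated(driver):
    # truthy once the anchor's href no longer ends in just "#"
    value = driver.find_element_by_id('foo').get_attribute('href')
    return value if value and not value.endswith('#') else False

# poll for up to 10 seconds, then print whatever href was filled in
href_val = WebDriverWait(driver, 10).until(href_is_populated)
print(href_val)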

Python selenium - get element by its xPath and access it

I am using Python and Selenium to scrape a webpage; in some cases, I can't get it to work.
I would like to access the element with text 'PInt', which is the second link in the code below.
The XPath for it (copied from the developer console) is: //*[@id="submenu1"]/a[2]
<div id="divTest" onscroll="SetDivPosition();" style="height: 861px;">
<div class="menuTitle" id="title1">
</div>
<div class="subtitle" id="submenu1">
<img src="images/spacer.gif" border="0" width="2px" height="12px">
<a href="...">Mov</a><br>
<img src="images/spacer.gif" border="0" width="2px" height="12px">
<a href="...">PInt</a><br>
<img src="images/spacer.gif" border="0" width="2px" height="12px">
<a href="...">SWAM / SWIF</a><br>
</div>
...
A snippet of my code is:
try:
    res = driver.find_elements_by_link_text('PInt')
    print("res1:{}".format(res))
    res = driver.find_element(By.XPATH, '//*[@id="submenu1"]/a[3]')
    print("res:{} type[0]:{}".format(res, res[0]))
    itm1 = res[0]
    itm1.click()
I get the error:
Unable to locate element:
{"method":"xpath","selector":"//*[#id="submenu1"]/a[2]"}
My question is, how can I get the right xPath of the element or any other way to access the element?
UPDATE:
This might be important: the error
Message: invalid selector: Unable to locate an element with the xpath expression
(and I've tried all the proposed solutions) might be related to the fact that this happens after authenticating on the webpage (user + password); before that, everything works.
I noticed that the URL (driver.current_url) after login is static (an asp page).
Also, the part I am trying to access is inside a frameset and frame:
html > frameset > frameset > frame:nth-child(1)
Thanks to @JeffC for pointing me in the right direction.
Since the page has some frames, I managed to access the element by first switching to the right frame (using XPath)
and then accessing the element.
driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element_by_xpath('html / frameset / frameset / frame[1]'))
driver.find_element_by_xpath("//a[contains(text(),'PInt')]").click()
BTW, in case you want to run the script from a crontab, you need to set up a display:
30 5 * * * export DISPLAY=:0; python /usr/.../main.py
To see a full list of all the ways of selecting elements using selenium, you can read all about it in the documentation.
Using xpath:
res = driver.find_element_by_xpath(u'//*[@id="submenu1"]/a[2]')
Using css selector:
res = driver.find_element_by_css_selector('#submenu1 a:nth-of-type(2)')
Try any of the XPaths below. Sometimes an automatically generated XPath does not work.
//a[contains(text(),'PInt')]
or
//div[@id='submenu1']//a[contains(text(),'PInt')]
Also, I would suggest setting some wait time before clicking on the above link, in case the XPath alone does not work; see the sketch below.
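For that wait, a sketch using Selenium's explicit wait (the 10-second timeout is arbitrary):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the link to become clickable, then click it
link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//a[contains(text(),'PInt')]"))
)
link.click()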
To find xPath in chrome:
Right-click the element you want.
Click Inspect, which will open the developer window and highlight the selected element.
Right-click the highlighted element and choose Copy > Copy XPath.
Here is a list of all the different ways to locate an element: Locating Elements

Find nested divs scrapy

I am trying to get the text from a div that is nested. Here is the code that I currently have:
sites = hxs.select('/html/body/div[@class="content"]/div[@class="container listing-page"]/div[@class="listing"]/div[@class="listing-heading"]/div[@class="price-container"]/div[@class="price"]')
But it is not returning a value. Is my syntax wrong? Essentially I just want the text out of <div class="price">
Any ideas?
The URL is here.
The price is inside an iframe, so you should scrape https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978 instead.
Once you request this URL:
hxs.select('//div[@class="price"]/text()').extract()[0]
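As a rough end-to-end sketch (using response.xpath in a spider rather than the old hxs selectors; the spider name is made up):
import scrapy

class PriceSpider(scrapy.Spider):
    name = "rentler_price"  # hypothetical name for illustration
    # request the iframe's URL directly, since that is where the price lives
    start_urls = [
        "https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"
    ]

    def parse(self, response):
        # grab the text inside <div class="price">
        yield {"price": response.xpath('//div[@class="price"]/text()').get()}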
