Href not visible in scrapy result but visible in html - python

Set-up
I have the next-page button element from this page,
<li class="Pagination-item Pagination-item--next Pagination-item--nextSolo ">
<button type="button" class="Pagination-link js-veza-stranica kist-FauxAnchor" data-page="2" data-href="https://www.njuskalo.hr/prodaja-kuca?page=2" role="link">Sljedeća <span aria-hidden="true" role="presentation">»</span></button>
</li>
I need to obtain the url in the data-href attribute.
Code
Using the following simple xpath to the button element in scrapy shell,
response.xpath('//*[@id="form_browse_detailed_search"]/div/div[1]/div[5]/div[1]/nav/ul/li[8]/button').extract_first()
I retrieve,
'<button type="button" class="Pagination-link js-veza-stranica" data-page="2">Sljedeća\xa0<span aria-hidden="true" role="presentation">»</span></button>'
Question
Where did the data-href attribute go to?
How do I obtain the url?

The data-href attribute is most likely being calculated by some JavaScript code running in your browser. If you look at the raw source code of this page ("view source code" option in your browser), you won't find that attribute there.
The output you see on developer tools is the DOM rendered by your browser, so you can expect differences between your browser view and what Scrapy actually fetches (which is the raw HTML source). Keep in mind that Scrapy doesn't execute any JavaScript code.
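A quick way to confirm this from scrapy shell is to search the raw body for the attribute (a simple sanity check, reusing the session from the question):
'data-href' in response.text  # False: the attribute is not in the HTML Scrapy received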
Anyway, a way to solve this would be building the pagination URL based on the data-page attribute:
from w3lib.url import add_or_replace_parameter
...
# read the target page number from the "next" button, then rebuild the URL with it
next_page = response.css('.Pagination-item--nextSolo button::attr(data-page)').get()
next_page_url = add_or_replace_parameter(response.url, 'page', next_page)
w3lib is an open source library: https://github.com/scrapy/w3lib
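For context, a minimal sketch of how this could fit into a spider callback (the spider name is hypothetical and the item extraction is elided):
import scrapy
from w3lib.url import add_or_replace_parameter

class HousesSpider(scrapy.Spider):
    name = 'houses'  # hypothetical name
    start_urls = ['https://www.njuskalo.hr/prodaja-kuca']

    def parse(self, response):
        # ... extract items from the current page here ...
        next_page = response.css('.Pagination-item--nextSolo button::attr(data-page)').get()
        if next_page:
            # follow the rebuilt pagination URL
            yield response.follow(add_or_replace_parameter(response.url, 'page', next_page))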

Related

I can't extract href from html webpage

I'm using scrapy to fetch data from this webpage.
I'm relatively new to this. I need to get the href link of the next-page button (">") but can't find the solution.
Please help
Tried this in the terminal
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@href').extract()
but it just gives me [].
This is the html code of the button:
data-page="2" data-url="http://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" class="btn--pagination btn--pag-next pag-control" href="//www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=2&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" style="">
>
</a>
The issue is that in the raw HTML Scrapy downloads, the link element has no href attribute (the browser adds it later via JavaScript). Use the data-url attribute instead:
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@data-url').get()
OUTPUT
'http://worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-21'
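Note that the data-url above still points at the current page (page=1). As with the data-page approach shown earlier, one way to build the next-page URL (a sketch, assuming w3lib is available, which it is as a Scrapy dependency) is:
from w3lib.url import add_or_replace_parameter

sel = '//a[@class="btn--pagination btn--pag-next pag-control"]'
data_url = response.xpath(sel + '/@data-url').get()
next_page = response.xpath(sel + '/@data-page').get()  # "2" in the snippet above
next_page_url = add_or_replace_parameter(data_url, 'page', next_page)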

How to get the download link of html a tag which has no explicit true link?

I encountered a web page that has many download signs (icons).
If I click on one of these download signs, the browser starts downloading a zip file.
However, it seems that these download signs are just images, with no explicit download link that can be copied.
I looked into the HTML source and figured out that each download sign belongs to a tr tag block, as below.
<tr>
<td title="aaa">
<span class="centerFile">
<img src="/images/downloadCenter/pic.png" />
</span>aaa
</td>
<td>2021-09-10 13:42</td>
<td>bbb</td>
<td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}' recordid="4099823" target="_blank" class="download_ic checkSafe" style="line-height:54px;"><img src="/images/down.png" /></a></td>
</tr>
Clicking this link downloads a zip file.
So my problem is: how do I get the download links behind these download signs without actually clicking them in the browser? In particular, I want to know how to do this in Python by analyzing the HTML source, so I can do batch downloading.
If you want to batch-download those files and cannot find the links by analyzing the HTML and JavaScript (probably because a JavaScript function creates the link, or makes a call to the backend), then you can use Selenium to simulate a user clicking.
You will need to do something like the code below, where I'm using the class names from the HTML you posted, which I believe trigger the JavaScript download call:
from selenium import webdriver
driver = webdriver.Chrome()
# URL of website
url = "https://www.yourwebsitelinkhere.com/"
driver.get(url)
# use class name to find anchor link
download_links = driver.find_elements_by_css_selector(".download_ic.checkSafe")
for link in download_links:
    link.click()
Example how it works for stackoverflow (in the day of writing this answer)
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
elements = driver.find_elements_by_css_selector('.-marketing-link.js-gps-track')
elements[0].click()
And this should lead you to the Stack Overflow "About" page.
[EDIT] Answer edited, as it seems compound class names are not supported by Selenium's find_elements_by_class_name; a Stack Overflow example was added.

How do I extract text from a button using Beautiful Soup?

I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the html I am trying to navigate. I am attempting to retrieve 11.1K,
<ul class="list-unstyled m-meta-list m-meta-list--default">
<li class="m-meta-list-item">
<button class="text-stat disp-inline text-left a-button a-button--inline" data-element-
id="btn_donors" type="button" data-analytic-event-listener="true">
<span class="text-stat-value text-underline">11.1K</span>
<span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_ = 'm-meta-list-item')
for donor in donors:
    print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar HTML, and that value is dynamically retrieved. I would suggest using Selenium and a CSS class selector:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
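Since the value is rendered dynamically, an explicit wait can be more reliable than reading it immediately after the page loads (same selector as above; the 10-second timeout is an arbitrary choice):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the stat element is present in the DOM before reading it
num = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.text-stat-value'))
).text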
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
Alternatively, get the id from gofundme.com/f/{THEID} and call the API:
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
Then process the data:
for people in apiResponse['references']['donations']:
    print(people['name'])
Use the browser console to find the API host.
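Putting those steps together, a minimal sketch with requests (the host below is an assumption; find the real one in your browser's network console):
import requests

API_HOST = 'https://gateway.gofundme.com'  # assumed host, verify in the browser console
fund_id = 'treatmentforsiyona'  # the {THEID} part of gofundme.com/f/{THEID}
apiResponse = requests.get(
    API_HOST + '/web-gateway/v1/feed/' + fund_id + '/donations',
    params={'sort': 'recent', 'limit': 20, 'offset': 20},
).json()
for people in apiResponse['references']['donations']:
    print(people['name'])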

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome(r'C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
Edit: here is a closer look at the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use a CSS selector with select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate=[]
for item in soup.find_all('span',attrs={"id": "LargeHeader_dateText","data-date": True}):
    processeddate=item['data-date']
    lastdatadate.append(processeddate)
Or with a CSS selector:
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both will give the same output; however, the latter executes faster.
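As a self-contained check, running the attribute lookup against the snippet from the question:
from bs4 import BeautifulSoup

html = '<span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('#LargeHeader_dateText[data-date]')['data-date'])  # prints 2/4/2020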

How to make a click using Selenium?

I got stuck with extracting href="/ttt/play" from the following HTML code.
<div class="collection-list-wrapper-2 w-dyn-list">
<div class="w-dyn-items">
<div typeof="ListItem" class="collection-item-2 w-clearfix w-dyn-item">
<div class="div-block-38 w-hidden-medium w-hidden-small w-hidden-tiny"><img src="https://global-uploads.webflow.com/59cf_home.svg" width="16" height="16" alt="Official Link" class="image-28">
<a property="url" href="/ttt/play" class="link-block-4 w-inline-block">
<div class="row-7 w-row"><div class="column-10 w-col w-col-2"><img height="25" property="image" src="https://global-fb0edc0001b4b11d/5a77ba9773fd490001ddaaaa_play.png" alt="Play" class="image-23"><h2 property="name" class="heading-34">Play</h2><div style="background-color:#d4af37;color:white" class="text-block-28">GOLD LEVEL</div><div class="text-block-30">HOT</div><div style="background-color:#d4af37;color:white" class="text-block-28 w-condition-invisible">SILVER LEVEL</div></div></div></a>
</div>
<div typeof="ListItem" class="collection-item-2 w-clearfix w-dyn-item">
This is my code in Python:
driver = webdriver.PhantomJS()
driver.implicitly_wait(20)
driver.set_window_size(1120, 550)
driver.get(website_url)
tag = driver.find_elements_by_class_name("w-dyn-item")[0]
tag.find_element_by_tag_name("a").click()
url = driver.current_url
print(url)
driver.quit()
When I print url using print(url), I want to see url equal to website_url/ttt/play, but instead of it I get website_url.
It looks like the click event does not work and the new link is not really opened.
When using .click(), the element must be "visible" (you're using PhantomJS) and not hidden, in a drop-down for example.
Also make sure the page is completely loaded.
As I see it, you have two options:
Either use Selenium to reveal the element, then click it.
Or use JavaScript to do the actual click.
I strongly suggest clicking with JavaScript; it's much faster and more reliable.
Here is a little wrapper to make things easier:
def execute_script(driver, xpath):
    """ wrapper for selenium driver execute_script
    :param driver: selenium driver
    :param xpath: (str) xpath to the element
    :return: execute_script result
    """
    # 9 is XPathResult.FIRST_ORDERED_NODE_TYPE: take the first matching node and click it
    execute_string = "window.document.evaluate('{}', document, null, 9, null).singleNodeValue.click();".format(xpath)
    return driver.execute_script(execute_string)
The wrapper basically implement this technique to click on elements with javascript.
then in your selenium script use the wrapper like so:
execute_script(driver, element_xpath)
You can also make it more general, to do not only clicks but scrolls and other magic.
P.S. In my example I use XPath, but you can also use a CSS path; basically, whatever runs in JavaScript.
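Applied to the HTML in the question, the call could look like this (the XPath is an assumption derived from the snippet shown):
execute_script(driver, '//a[@property="url" and @href="/ttt/play"]')
url = driver.current_url
print(url)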
