How to see hidden content in html selectors? - python

When I view the page source, it looks like this:
<li class="results__list-container-item"></li>
But when I click Inspect Element in Firefox I see something like this:
<li class="results__list-container-item"><div class="offer offer--normal"><a class="offer__click-area" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566"></a><div class="offer__info"><div class="offer-details"><div class="offer-logo"><img src="https://i.gpcdn.pl/oferty-loga-firm/wyniki-wyszukiwania/14032.png" alt="logo" class="offer-logo__image"></div><div class="offer-details__text"><h3 class="offer-details__title"><a class="offer-details__title-link" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566">Data Engineer for Bixby Voice Assistant</a></h3><p class="offer-company"><span class="offer-company__link-wrapper"></li>
Is it possible to extract this hidden content with a web scraper (BeautifulSoup4)?

Hidden content is usually generated by JavaScript. A plain HTTP request to the page will not return that HTML, because the page has to be loaded in a browser for the scripts to run and insert the content. We can get around this by using Selenium to open the page in a real browser and then read the HTML of the rendered page.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('https://example-url.com')  # placeholder URL; Selenium needs the scheme
html = browser.page_source
soup = BeautifulSoup(html,features='html.parser')
hidden_divs = soup.find_all('div', {'class':'offer offer--normal'})
Of course, we would need the URL you are looking at to actually test this but this is how it generally works.

Related

Why is the html content I got from inspector different from what I got from Request?

Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
The code I have is above.
It seems like the HTML content I got from the inspector is different from what I got from BeautifulSoup.
My guess is that they are preventing me from getting their data as they detected I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so backslashes are kept literally
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2:(loaded with devtool)
Any content that is loaded after the initial page load is unavailable to BS4 with your current method. This is because the content is loaded with an AJAX call via JavaScript, and the requests library only downloads the HTML; it cannot run JS code.
To achieve this you will have to look at something like Selenium, which controls a browser from Python or other languages. There is a separate driver for each browser, e.g. Firefox, Chrome etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
Download the correct driver for your version of Chrome.
Install selenium via pip.
Create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with bs4:
from selenium import webdriver
import time
# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, you want to look at the "Network" tab of your dev tools and filter for "fetch" requests.
This could be particularly interesting for you:
You don't need selenium or beautifulsoup at all; you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the selenium solution.
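The cURL call above can be reproduced from Python. This is a minimal sketch that uses only the standard library so that it has no dependencies; requests.get(url, headers=...) should work the same way. The URL, the collectionSlug parameter and the tenant header are taken from the cURL command. The function names are made up for illustration, and since the shape of the returned JSON is not shown in the answer, the sketch simply returns the decoded object.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://api.commerce7.com/v1/product/for-web"
TENANT = "one-stop-wine-shop"  # value of the required 'tenant' header


def build_request(collection_slug):
    """Build the GET request for one collection (hypothetical helper)."""
    query = urlencode({"collectionSlug": collection_slug})
    return Request(f"{API_URL}?{query}", headers={"tenant": TENANT})


def fetch_products(collection_slug="red-wines", timeout=10):
    """Send the request and return the decoded JSON body."""
    with urlopen(build_request(collection_slug), timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_products() performs the network request. Inspect the returned object before relying on any particular field.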

How to get the download link of html a tag which has no explicit true link?

I encountered a web page that has many download icons.
If I click one of these download icons, the browser starts downloading a zip file.
However, these download icons seem to be just images, with no explicit download link that can be copied.
I looked into the HTML source and figured out that each download icon belongs to a tr block like the one below.
<tr>
<td title="aaa">
<span class="centerFile">
<img src="/images/downloadCenter/pic.png" />
</span>aaa
</td>
<td>2021-09-10 13:42</td>
<td>bbb</td>
<td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}' recordid="4099823" target="_blank" class="download_ic checkSafe" style="line-height:54px;"><img src="/images/down.png" /></a></td>
</tr>
Clicking this link downloads a zip file.
So my problem is how to get the download links behind these icons without actually clicking them in the browser. In particular, I want to know how to do this in Python by analyzing the HTML source, so that I can batch download the files.
If you want to batch download those files and cannot work out the links by analyzing the HTML and JavaScript (the link is probably created by a JavaScript function, or by a JavaScript call to the backend), you can use Selenium to simulate a user.
You will need something like the code below. It uses the class names from the HTML you posted, which is where I think the JavaScript download function is attached:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
# URL of website
url = "https://www.yourwebsitelinkhere.com/"
driver.get(url)
# use a CSS selector matching both class names to find the anchor links
download_links = driver.find_elements(By.CSS_SELECTOR, ".download_ic.checkSafe")
for link in download_links:
    link.click()
Example of how it works for Stack Overflow (as of the day of writing this answer):
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
elements = driver.find_elements(By.CSS_SELECTOR, '.-marketing-link.js-gps-track')
elements[0].click()
This should take you to the Stack Overflow About page.
[EDIT] Answer edited to use a CSS selector, as it seems compound class names are not supported by Selenium's class-name lookup; example for Stack Overflow added.
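If you would rather analyze the HTML than click, the anchor in the question carries its identifiers in a JSON data attribute. The sketch below pulls those values out with the standard library only. The class name download_ic and the attribute names come from the posted HTML; the parser class and function names are made up for illustration. Which backend URL actually consumes these IDs is not shown in the question, so that part still has to be found in the browser's Network tab.

```python
import json
from html.parser import HTMLParser


class DownloadAnchorParser(HTMLParser):
    """Collect the id and JSON 'data' attribute of each download anchor."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "download_ic" in classes and attributes.get("data"):
            record = json.loads(attributes["data"])
            record["id"] = attributes.get("id")
            self.records.append(record)


def extract_download_records(html):
    """Return one dict per download anchor found in the HTML string."""
    parser = DownloadAnchorParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Shortened copy of the row from the question.
sample = """<tr>
<td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}' recordid="4099823" target="_blank" class="download_ic checkSafe"><img src="/images/down.png" /></a></td>
</tr>"""

print(extract_download_records(sample))
```

The same function can be fed driver.page_source or the text of a plain HTTP response, depending on whether the table is rendered by JavaScript.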

Selenium not able to load the given URL instead loading the main page

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"C:\Users\chromedriver.exe")
driver.get("https://www.bestbuy.com/site/promo/health-fitness-deals")
tag = driver.find_elements_by_tag_name('h4')
for a in tag:
    for link in a.find_elements_by_tag_name('a'):
        print(link.get_attribute("href"))
Main Page that is being loaded by the website :
The page that I want to scrape :
https://www.bestbuy.com/site/promo/health-fitness-deals
You can use this link instead: https://www.bestbuy.com/site/promo/health-fitness-deals?intl=nosplash
Basically, you add the intl=nosplash query parameter to the link.
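If you have several URLs to visit, the parameter can be appended programmatically instead of by hand. This is a small sketch with the standard library; add_query_param is a hypothetical helper name, and it assumes each key appears at most once in the query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url, key, value):
    """Return url with key=value set in its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[key] = value
    return urlunsplit(parts._replace(query=urlencode(query)))


url = add_query_param(
    "https://www.bestbuy.com/site/promo/health-fitness-deals", "intl", "nosplash"
)
print(url)
```

The resulting string can be passed to driver.get() in place of the original URL.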

How to render long html string with selenium

I want to use selenium to render the html of a request without having to save it on a file. I'm open to other ideas to solve this problem.
I need to render the html to save it with generated tags like <tbody> for example.
I tried to pass the HTML to the driver as a data: URL, but it doesn't work for long HTML.
from selenium import webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
html_test = '''<body>
<div>
<table>
<td><tr>
Hellow work
</tr></td>
</table>
</div>
</body>'''
driver.get(f'data:text/html;charset=utf-8,{html_test}')
rendered_html = driver.page_source
driver.close()
This code works with small samples of HTML, but not with the code of a full page.
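One possible cause, which is an assumption since the failing page is not shown, is that a full page usually contains characters such as # or % that have a special meaning inside a URL, so the browser cuts off or misreads an unescaped data: URL. A sketch of one workaround is to base64-encode the HTML before building the URL. The helper name html_to_data_url is made up for illustration. Very large pages may still run into the browser's URL length limit, in which case this will not help.

```python
import base64


def html_to_data_url(html):
    """Encode an HTML string as a base64 data: URL."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return f"data:text/html;charset=utf-8;base64,{encoded}"


# Sample containing characters that are unsafe in a plain data: URL.
html_test = "<body><table><tr><td>Hello #1: 100% done</td></tr></table></body>"
data_url = html_to_data_url(html_test)
# With a live driver this would replace the f-string URL in the question:
# driver.get(data_url)
```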

Download html of a webpage thats already loaded

I am writing a program using Python and selenium to automate logging into a website. The website asks a security question for additional verification. Clearly the answer I would send using "send_keys" would depend on the question asked so I need to figure out what is being asked based on the text. BeautifulSoup can be used to parse through the HTML but in all the examples I have seen you have to give a URL to then read the page content. How do I read the content of a page that's already open? The code I am using is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
loginElem = browser.find_element_by_id('bbbb')
loginElem.send_keys('cccc')
passwordElem = browser.find_element_by_id('dddd')
passwordElem.send_keys('eeee')
passwordElem.send_keys(Keys.RETURN)
The page with the security questions loads after this and that's the page I want the URL of.
I also tried finding the element directly, but for some reason it wasn't working, which is why I am trying a workaround. Below is the HTML for the entire div where the question is. Alternatively, maybe you can help me search for the right element.
<div class="answer-section">
<p> Please answer your challenge question so we can help
verify your identity.
</p> <label for="tlpvt-challenge-answer"> What is the name of your dog?
</label>
<input type="text" id="tlpvt-challenge-answer" class="tl-private gis- mask"
name="challengeQuestionAnswer" value=""/>
</div>
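Well, if you want to use BeautifulSoup, you can retrieve the source code from the webdriver and then parse it:
from selenium import webdriver
from bs4 import BeautifulSoup
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
# on Selenium 4+, pass the path via selenium.webdriver.chrome.service.Service instead
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
# (perform the login steps from the question here, so the question page is loaded)
# call page_source attr from a webdriver instance to
# retrieve HTML source code of the page that is currently open
html = browser.page_source
# parse it with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('label', {'for': 'tlpvt-challenge-answer'})
print(label.get_text(strip=True))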
output:
$ What is the name of your dog?
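Once the question text is available, choosing the answer can be a simple lookup. The sketch below normalizes whitespace and case first, because the label text in the posted HTML is surrounded by spaces and a line break. The questions' answers and the function names are hypothetical placeholders.

```python
def normalize(text):
    """Collapse whitespace and lowercase so label text compares reliably."""
    return " ".join(text.split()).lower()


# Hypothetical questions and answers; replace with your own.
ANSWERS = {
    "What is the name of your dog?": "rex",
    "What city were you born in?": "springfield",
}
LOOKUP = {normalize(question): answer for question, answer in ANSWERS.items()}


def answer_for(label_text):
    """Return the stored answer for a question label, or None if unknown."""
    return LOOKUP.get(normalize(label_text))


label_text = " What is the name of your dog?\n"  # raw label text with stray whitespace
print(answer_for(label_text))
```

The returned string can then be sent with send_keys to the input whose id is tlpvt-challenge-answer; a None result means the question is not in the table and should be handled separately.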
