How to download a complete webpage using Python Selenium

I have to write Python code that takes a URL, opens a Chrome/Firefox browser using Selenium, and downloads it as a "Complete Webpage", meaning with the CSS assets, for example.
I know the basics of using Selenium, like:
from selenium import webdriver
ff = webdriver.Firefox()
ff.get(URL)
ff.close()
How can I perform the download action (like automatically pressing CTRL+S in the browser)?

You can try the following code to save the HTML of the page to a file:
from selenium import webdriver
ff = webdriver.Firefox()
ff.get(URL)
with open('/path/to/file.html', 'w') as f:
    f.write(ff.page_source)
ff.close()
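Note that page_source only gives you the HTML document itself, not the CSS assets the question asks about. One way to grab those as well is to parse the saved HTML for stylesheet links and fetch each URL separately. A minimal sketch using only the standard library (StylesheetParser and stylesheet_urls are names made up for this example):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class StylesheetParser(HTMLParser):
    """Collects absolute URLs of <link rel="stylesheet"> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet' and attrs.get('href'):
            # resolve relative hrefs against the page URL
            self.links.append(urljoin(self.base_url, attrs['href']))

def stylesheet_urls(html, base_url):
    parser = StylesheetParser(base_url)
    parser.feed(html)
    return parser.links

# Each returned URL can then be downloaded (e.g. with
# urllib.request.urlretrieve) next to the saved .html file.
```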

Related

Python - Element item Xpath with Selenium

I'm creating a bot to download a PDF from a website. I used Selenium to open Google Chrome, and I can open the website window and select the XPath of the first item in the grid, but the click to download the PDF does not occur. I believe I'm getting the wrong XPath.
I've left the site I'm accessing and my code below. Could you tell me what I am doing wrong? Am I getting the correct XPath? Thank you very much in advance.
This site is an open government data site from my country, Brazil, and for those trying to access from outside, maybe the IP is blocked, but the page would be this:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

service = Service(ChromeDriverManager().install())
navegador = webdriver.Chrome(service=service)
try:
    navegador.get("https://www.tce.ce.gov.br/cidadao/diario-oficial-eletronico")
    time.sleep(2)
    elem = navegador.find_element(By.XPATH, '//*[@id="formUltimasEdicoes:consultaAvancadaDataTable:0:j_idt101"]/input[1]')
    elem.click()
    time.sleep(2)
finally:
    navegador.quit()
I think you'll need this PDF, right?:
<a class="maximenuck " href="https://www.tce.ce.gov.br/downloads/Jurisdicionado/CALENDARIO_DAS_OBRIGACOES_ESTADUAIS_2020_N.pdf" target="_blank"><span class="titreck">Estaduais</span></a>
You'll need to locate that element by XPath, and then download the PDF using the "href" value with requests.get(your_href_url).
The XPath in your source code is //*[@id="menu-principal"]/div[2]/ul/li[5]/div/div[2]/div/div[1]/ul/li[14]/div/div[2]/div/div[1]/ul/li[3]/a but that might not always be the same.
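The download step could be sketched like this (a rough illustration: filename_from_url and download_pdf are made-up names, and the standard library's urllib.request stands in for requests):

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen

def filename_from_url(url):
    """Local filename taken from the last path segment of the URL."""
    return os.path.basename(urlparse(url).path)

def download_pdf(href, dest_dir="."):
    """Fetch the file behind a link's href and write it to dest_dir.

    After locating the <a> element with Selenium you would call:
        download_pdf(elem.get_attribute("href"))
    """
    path = os.path.join(dest_dir, filename_from_url(href))
    with urlopen(href) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```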

Is it possible to scrape HTML from Inspect Element in Python?

I am trying to scrape a site that attempts to block scraping. Viewing the source code through Chrome, or requests, or requests_html results in it not showing the correct source code.
Here is an example:
from requests_html import HTMLSession
session = HTMLSession()
content = session.get('website')
content.html.render()
print(content.html.html)
It gives this page:
It looks like JavaScript is disabled or not supported by your browser.
Even though JavaScript is enabled. The same thing happens in an actual browser.
However, on my actual browser, when I go to inspect element, I can see the source code just fine. Is there a way to extract the HTML source from inspect element?
Thanks!
The issue you are facing is that the page is rendered by JavaScript on the front end. In this case, you need a JavaScript-enabled browser engine, and then you can easily read the HTML source.
Here's working code showing how I would do it (using Selenium):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)

# Ensure that the full URL path is given
URL = 'https://proper_url'

# The following step will launch a browser.
driver.get(URL)

# Now you can easily read the source HTML
HTML = driver.page_source
You will have to figure out the details of installing and setting up Selenium and the webdriver. Here's a good place to start.

Selenium + Chrome + Python

I want to save a webpage. This looks simple. I used the code below. It opens the browser, but the page is not saved.
Why?
When this works, where will the file be saved?
Thanks
Details:
Chrome 68.0.3440.106 - 64 bits
ChromeDriver 2.41
Code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Selenium\chromedriver.exe")
browser.get('https://automatetheboringstuff.com')
ActionChains(browser).key_down(Keys.CONTROL).send_keys('s').key_up(Keys.CONTROL).perform()
If you are looking to save the HTML of a page, you can get it from the page source:
html = browser.page_source
and if you want to write this to a file, you can do this:
with open('some_file_name.html', 'w') as html_file:
    html_file.write(html)

Open website and save as image

I am currently using the following code to open a website:
import webbrowser
webbrowser.open('http://test.com')
I am now trying to save the open webpage as a .gif; any advice on how to do this, please?
How about Selenium and its save_screenshot()?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://test.com")
driver.save_screenshot("screenshot.png")
driver.close()
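save_screenshot() always writes a PNG, but since the question asked for a .gif specifically, the file can be converted afterwards. A small sketch, assuming the Pillow package is installed (png_to_gif is a made-up helper name):

```python
from PIL import Image

def png_to_gif(png_path, gif_path):
    """Re-save a PNG screenshot in GIF format.

    Pillow picks the output format from the file extension and
    handles the palette conversion that GIF requires.
    """
    Image.open(png_path).convert("RGB").save(gif_path)
```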

Download the whole html page content using selenium

I need to download the whole content of HTML pages: images, CSS, JS.
First option:
Download the page with urllib or requests,
extract the page info with Beautiful Soup or lxml,
download all links, and
edit the links in the original page to relative ones.
Disadvantages:
multiple steps.
The downloaded page will never be identical to the remote page, possibly due to JS or AJAX content.
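The link-rewriting step of the first option could be sketched like this (a rough, standard-library-only illustration: localize_links is a made-up name, a regex stands in for Beautiful Soup, and a real page would need more robust parsing):

```python
import re
from urllib.parse import urljoin, urlparse

def localize_links(html, base_url):
    """Rewrite src/href attributes to bare local filenames and return
    the edited HTML together with the absolute asset URLs to download."""
    assets = []

    def repl(match):
        attr, url = match.group(1), match.group(2)
        absolute = urljoin(base_url, url)  # resolve relative URLs
        assets.append(absolute)
        local = urlparse(absolute).path.rsplit('/', 1)[-1]
        return '%s="%s"' % (attr, local)

    edited = re.sub(r'\b(src|href)="([^"]+)"', repl, html)
    return edited, assets
```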
Second option:
Some authors recommend automating the web browser to download the page, so the JavaScript and AJAX will be executed before the download.
scraping AJAX sites and JavaScript
I want to use this option.
First attempt
So I copied this piece of Selenium code to do two steps:
Open the URL in a Firefox browser.
Download the page.
The code
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', os.environ["HOME"])
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/html,text/webviewhtml,text/x-server-parsed-html,text/plaintext,application/octet-stream')

browser = webdriver.Firefox(profile)

def open_new_tab(url):
    ActionChains(browser).send_keys(Keys.CONTROL, "t").perform()
    browser.get(url)
    return browser.current_window_handle

# call the function
open_new_tab("https://www.google.com")

# Result: the browser is opened at the given URL, no download occurs
Result
Unfortunately no download occurs; it just opens the browser at the provided URL (the first step only).
Second attempt
I thought of downloading the page in a separate function, so I added this one.
The function added
def save_current_page():
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

# call the functions
open_new_tab("https://www.google.com")
save_current_page()
Result
# Again: the browser is opened at the given URL, but no download occurs.
Question
How can I automate downloading webpages with Selenium?
