I need to download the whole content of HTML pages: images, CSS, and JS.
First option
Download the page with urllib or requests,
extract the page info with Beautiful Soup or lxml,
download all the linked assets, and
edit the links in the original page to be relative (a sketch of this approach follows the disadvantages below).
Disadvantages
Multiple steps.
The downloaded page will never be identical to the remote page, possibly because of JS or AJAX content.
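A minimal sketch of this first option, assuming requests and Beautiful Soup; the output directory and the example URL are illustrative:

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def save_page_with_assets(url, out_dir="saved_page"):
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Map each tag of interest to the attribute that holds the asset URL
    targets = {"img": "src", "script": "src", "link": "href"}
    for tag_name, attr in targets.items():
        for tag in soup.find_all(tag_name):
            asset_url = tag.get(attr)
            if not asset_url:
                continue
            absolute = urljoin(url, asset_url)
            filename = os.path.basename(urlparse(absolute).path) or "asset"
            try:
                content = requests.get(absolute).content
            except requests.RequestException:
                continue  # skip assets that fail to download
            with open(os.path.join(out_dir, filename), "wb") as f:
                f.write(content)
            tag[attr] = filename  # rewrite the link to be relative
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(str(soup))

save_page_with_assets("https://www.example.com")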
Second option
Some authors recommend automating the web browser to download the page, so the JavaScript and AJAX are executed before the download:
Scraping AJAX sites and JavaScript
I want to use this option.
First attempt
So I copied this piece of Selenium code to do two steps:
Open the URL in the Firefox browser.
Download the page.
The code
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

# Configure Firefox to save downloads to $HOME without prompting
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', os.environ["HOME"])
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'text/html,text/webviewhtml,text/x-server-parsed-html,text/plaintext,application/octet-stream')
browser = webdriver.Firefox(profile)

def open_new_tab(url):
    # Open a new tab with Ctrl+T, then load the URL in it
    ActionChains(browser).send_keys(Keys.CONTROL, "t").perform()
    browser.get(url)
    return browser.current_window_handle

# call the function
open_new_tab("https://www.google.com")
# Result: the browser opens at the given URL; no download occurs
Result
Unfortunately, no download occurs; it just opens the browser at the URL provided (the first step).
Second attempt
I thought of downloading the page with a separate function, so I added this one.
The added function
def save_current_page():
    # Send Ctrl+S to open the browser's "Save Page As" dialog
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

# call the functions
open_new_tab("https://www.google.com")
save_current_page()
Result
# Nothing more happens: the browser opens at the given URL, and no download occurs.
Question
How can I automate downloading webpages with Selenium?
Related
A website loads part of its content after the page is opened. When I use libraries such as requests or urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser using Selenium and get the HTML that way, because the process should not be slowed down by a browser.
I tried httpx, httplib2, urllib, and urllib3, but I couldn't get the later-loaded section.
You can use the BeautifulSoup library or Selenium to simulate user-like page loading and wait for the additional HTML elements to load.
I would suggest using Selenium, since it provides the WebDriverWait class that can help you wait for and scrape the additional HTML elements.
This is my simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replace with the URL of the website you want
url = "https://www.example.com"

# Add the option for a headless browser
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Create a new instance of the Chrome webdriver
driver = webdriver.Chrome(options=options)
driver.get(url)

# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))

# Get the HTML
html = driver.page_source
print(html)

driver.close()
In the example above, I'm using an explicit wait of up to 10 seconds for a specific condition to occur. More specifically, I'm waiting until an element with the 'lazy-load' class is located by XPath, and then I retrieve the HTML.
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.
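For example, once Selenium has rendered the page, the HTML can be handed to BeautifulSoup for further parsing. A minimal sketch, continuing from the example above (the 'lazy-load' class is carried over from it):

from bs4 import BeautifulSoup

# 'html' is the page source retrieved by Selenium in the example above
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every element with the 'lazy-load' class
for element in soup.find_all(class_="lazy-load"):
    print(element.get_text(strip=True))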
I am trying to scrape a site that attempts to block scraping. Viewing the source code through Chrome, requests, or requests_html does not show the correct source code.
Here is an example:
from requests_html import HTMLSession
session = HTMLSession()
content = session.get('website')
content.html.render()
print(content.html.html)
It gives this page:
It looks like JavaScript is disabled or not supported by your browser.
Even though JavaScript is enabled. The same thing happens in an actual browser.
However, on my actual browser, when I go to inspect element, I can see the source code just fine. Is there a way to extract the HTML source from inspect element?
Thanks!
The issue you are facing is that the page is rendered by JavaScript on the front end. In this case, you need a JavaScript-enabled browser engine, and then you can easily read the HTML source.
Here's working code showing how I would do it (using Selenium):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)

# Ensure that the full URL path is given
URL = 'https://proper_url'

# The following step will launch a browser.
driver.get(URL)

# Now you can easily read the source HTML
HTML = driver.page_source
You will have to figure out the details of installing and setting up Selenium and the webdriver. Here's a good place to start.
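For instance, one common way to sidestep manual driver setup is the third-party webdriver-manager package. A sketch, with the caveat that this package is my suggestion and not something the answer above requires:

# pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads and caches a chromedriver binary
# matching the installed Chrome version
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://proper_url')
print(driver.page_source)
driver.quit()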
I'm implementing a TikTok crawler using Selenium and Scrapy:
start_urls = ['https://www.tiktok.com/trending']
....
def parse(self, response):
    # Launch Chrome with a random user agent and a fixed window size
    options = webdriver.ChromeOptions()
    from fake_useragent import UserAgent
    ua = UserAgent()
    user_agent = ua.random
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('window-size=800x841')
    driver = webdriver.Chrome(options=options)
    driver.get(response.url)
The crawler opens Chrome, but it does not load the videos (screenshot: image loading).
The same problem also happens using Firefox (screenshot: page not loading in Firefox).
The same problem occurs with a simple Selenium script:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
Did you try to navigate further within the Selenium browser window? If a 404 error appears on subsequent pages, I have a solution that worked for me:
I simply changed my user agent to "Naverbot", which is "allowed" by TikTok's robots.txt file.
After changing that, all pages and videos loaded properly.
Other user agents listed under the "Allow" section should work too, if you want to add rotation.
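A sketch of how that user-agent override might look with Chrome options ("Naverbot" is the agent suggested above; any agent in the "Allow" section should do):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Spoof a crawler user agent that TikTok's robots.txt allows
options.add_argument('user-agent=Naverbot')
driver = webdriver.Chrome(options=options)
driver.get('https://www.tiktok.com/trending')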
You can use Windows IE instead of Chrome or Firefox.
Videos will load in IE, but IE's layout for showing the feed is somewhat different from Chrome and Firefox.
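A minimal sketch, assuming a Windows machine with Internet Explorer and IEDriverServer available on the PATH:

from selenium import webdriver

# Requires IEDriverServer.exe on the PATH and Internet Explorer installed
driver = webdriver.Ie()
driver.get('https://www.tiktok.com/trending')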
Reasons why your page is not loading:
A few advanced web apps check your browser history, profile data, and cache to verify the authenticity of the user.
One other thing you can do is run your default profile within Selenium; it would be helpful (a sketch follows).
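A sketch of pointing Selenium at an existing Chrome profile; the profile path is machine-specific and purely illustrative:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Reuse an existing Chrome profile; adjust the path for your machine
options.add_argument('user-data-dir=/home/user/.config/google-chrome')
driver = webdriver.Chrome(options=options)
driver.get('https://www.tiktok.com/trending')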
I have to write Python code that takes a URL, opens a Chrome/Firefox browser using Selenium, and downloads it as a "Complete Webpage", meaning with the CSS assets, for example.
I know the basics of using Selenium, like:
from selenium import webdriver

ff = webdriver.Firefox()
ff.get(URL)
ff.close()
How can I perform the downloading action (like automatically pressing CTRL+S in the browser)?
You can try the following code to save the HTML page to a file:

from selenium import webdriver

ff = webdriver.Firefox()
ff.get(URL)
with open('/path/to/file.html', 'w') as f:
    f.write(ff.page_source)
ff.close()
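Note that page_source saves only the rendered HTML; images, CSS, and JS are still referenced from the remote server. To keep them locally, you would have to download each asset separately and rewrite the links, along the lines of the first option sketched at the top of this page.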
I am having a strange issue with PhantomJS, or maybe I am a newbie. I am trying to log in to NewEgg.com via Selenium using PhantomJS, with Python. The issue is that when I use Firefox as the driver it works well, but as soon as I set PhantomJS as the driver, it does not go to the next page and gives this message:
Exception Message: u'{"errorMessage":"Unable to find element with id \'UserName\'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"89","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55372","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\\"using\\": \\"id\\", \\"sessionId\\": \\"aaff4c40-6aaa-11e4-9cb1-7b8841e74090\\", \\"value\\": \\"UserName\\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/aaff4c40-6aaa-11e4-9cb1-7b8841e74090/element"}}' ; Screenshot: available via screen
After taking a screenshot, I found that PhantomJS could not navigate to the page and the script simply finished. How do I sort this out? The code snippet I tried is given below:
import requests
from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
import datetime
my_username = "user@mail.com"
my_password = "password"
driver = webdriver.PhantomJS('/Setups/phantomjs-1.9.7-macosx/bin/phantomjs')
firefox_profile = webdriver.FirefoxProfile()
#firefox_profile.set_preference('permissions.default.stylesheet', 2)
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
#driver = webdriver.Firefox(firefox_profile)
driver.set_window_size(1120, 550)
driver.get('http://newegg.com')
driver.find_element_by_link_text('Log in or Register').click()
driver.save_screenshot('screen.png')
I even added a sleep, but it did not make any difference.
I experienced this with PhantomJS when the content type of the second page was not correct. A normal browser would just interpret the content dynamically, but PhantomJS just dies, silently.
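One commonly suggested mitigation is to give PhantomJS a desktop browser user agent, since some sites serve a degraded page to PhantomJS's default agent. A sketch, assuming an older Selenium release that still ships PhantomJS support:

from selenium import webdriver

# Start from PhantomJS's default capabilities and override the user agent
caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
caps['phantomjs.page.settings.userAgent'] = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:33.0) '
    'Gecko/20100101 Firefox/33.0'
)
driver = webdriver.PhantomJS(
    '/Setups/phantomjs-1.9.7-macosx/bin/phantomjs',
    desired_capabilities=caps,
)
driver.get('http://newegg.com')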