A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser with Selenium to get the HTML, because the browser would slow the process down.
I tried httpx, httplib2, urllib, and urllib3, but I couldn't get the section that loads later.
You can use Selenium to load the page the way a user's browser would and wait for the additional HTML elements to load; BeautifulSoup can then parse the resulting HTML.
I would suggest using Selenium since it contains the WebDriverWait Class that can help you scrape the additional HTML elements.
This is my simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Replace with the URL of the website you want
url = "https://www.example.com"
# Run Chrome in headless mode (no visible browser window)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Create the Chrome webdriver once, with the headless option applied
driver = webdriver.Chrome(options=options)
driver.get(url)
# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))
# Get HTML
html = driver.page_source
print(html)
# Quit the driver to end the whole browser session, not just the window
driver.quit()
In the example above I'm using an explicit wait of up to 10 seconds for a specific condition to occur. More specifically, I'm waiting until the elements with the 'lazy-load' class are located via By.XPATH, and then I retrieve the page's HTML.
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.
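To make the idea of an explicit wait concrete, here is a minimal sketch of the polling loop behind it. This is not Selenium's implementation; `wait_until`, `WaitTimeout`, and the stand-in condition are hypothetical names used only for illustration. It only shows the general pattern: call a condition repeatedly until it returns something truthy or the timeout expires.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy within the timeout."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for "the lazy-loaded element is now present" (hypothetical).
attempts = []


def element_appears_on_third_check():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None


found = wait_until(element_appears_on_third_check, timeout=2.0, poll_interval=0.01)
print(found, len(attempts))  # prints: element 3
```

With Selenium itself, the condition would be one of the expected_conditions helpers and the loop is handled by WebDriverWait, so you normally never write this by hand.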
I am trying to get some information from a website. The Web Inspector shows the html source, with what JavaScript rendered into it. So I wanted to use chromedriver to render it for the purpose of extracting certain information, which cannot be accessed by simply requesting the website.
Now what seems confusing, is that even the driver is not returning anything.
My code looks like this:
driver = webdriver.Chrome('path/Chromedriver')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
And the website is:
https://www.amundietf.co.uk/professional/product/view/LU1681038243
Is there anything else that gets rendered into the html, when the Web Inspector is opened, which Chromedriver is not able to handle?
Thanks for your answers in advance!
At the very least, you need to accept the privacy settings and then click validateDisclaimer to enter the site:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "https://www.amundietf.co.uk/professional/product/view/LU1681038243"
# Selenium 4 style: the driver path goes through a Service object
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'))
driver.implicitly_wait(10)
driver.get(url)
# Accept the privacy settings, then confirm the disclaimer
driver.find_element(By.ID, "footer_tc_privacy_button_3").click()
driver.find_element(By.ID, "validateDisclaimer").click()
# Wait for the main content block to become visible before reading the page source
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fpFrame.fpBannerMore #blockleft>#part_principale_1")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
print(results)
After that, you need to wait for the page to load and correctly define the elements you are looking for.
Your question really contains several questions, which should be solved one by one.
I have just pointed out the first of the problems.
Update
I solved the issue.
You will need to parse the result yourself.
So, you had these problems:
Did not click two buttons.
Did not wait for a table you need to load.
Did not have any waits. In Selenium you must use them.
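As a rough illustration of the remaining parsing step, the sketch below collects the cell text of every `tr` row with the class `odd`. It uses only the standard library's html.parser so that it runs without extra dependencies; with BeautifulSoup you would loop over `soup.find_all("tr", class_="odd")` instead. The `OddRowParser` class, the `parse_odd_rows` helper, and the sample markup are hypothetical stand-ins for `driver.page_source`, not the site's real table.

```python
from html.parser import HTMLParser


class OddRowParser(HTMLParser):
    """Collect the cell text of every <tr class="odd"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "odd" in classes:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_odd_rows(html):
    """Return a list of rows; each row is a list of stripped cell strings."""
    parser = OddRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup standing in for driver.page_source.
sample = """
<table>
  <tr class="odd"><td>ISIN</td><td>LU1681038243</td></tr>
  <tr class="even"><td>skipped</td><td>row</td></tr>
  <tr class="odd"><td>Field</td><td> value </td></tr>
</table>
"""
rows = parse_odd_rows(sample)
print(rows)
```

Returning plain lists of strings keeps the result easy to feed into csv or a DataFrame later.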
driver = webdriver.Ie("C:\\IEDriverServer.exe")
driver.get(testurl)
driver.refresh()
time.sleep(5)
data = driver.find_element_by_id("__content0-value-scr")
So I'm trying to find an element by its id using Selenium (Python) and Internet Explorer, because I'm limited to Internet Explorer due to company regulations.
My problem is as follows:
On driver.get(testurl), Selenium loads the page, but IE first starts up with the IEDriver landing page.
Only after that, it loads the requested url.
The problem here is that Selenium recognizes the IE Driver landing page as the url to be loaded and therefore ignores the page I want to search on, which gets loaded after that.
Has anyone got an idea on how to work around this?
When you use Selenium, IEDriverServer and Internet Explorer, while IEDriverServer initiates a new IE Browser session, IE Browser first starts up with the IEDriver landing page and then loads the requested url.
If Selenium recognizes the IEDriverServer's landing page as the URL to be loaded, the solution is to use WebDriverWait to wait until the page title matches the actual page title of the AUT (Application Under Test).
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.ie.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
testurl = "https://www.facebook.com/" # replace with the url of the AUT
# Selenium 4 style: the driver path goes through a Service object
driver = webdriver.Ie(service=Service(r'C:\path\to\IEDriverServer.exe'))
driver.get(testurl)
# Block until the real page (not the IEDriver landing page) has loaded
WebDriverWait(driver, 10).until(EC.title_contains("Facebook")) # replace with the title of the AUT
data = driver.find_element(By.ID, "__content0-value-scr")
I've written a script in Python with Selenium to parse names from a webpage. The data on that site is not rendered by JavaScript. However, the next-page links are JavaScript. Since the next-page links are of no use with the requests library, I have used Selenium to parse the data while traversing 25 pages. The only problem I'm facing is that although my scraper clicks through all 25 pages and reaches the last one, it fetches the data from the first page only. Moreover, the scraper keeps running even after it has clicked through to the last page. The next-page links look exactly like javascript:nextPage();. By the way, the URL of that site never changes, even when I click the next-page button. How can I get all the names from the 25 pages? The CSS selector I've used in my scraper is flawless. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
while True:
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)
    try:
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break
driver.quit()
You don't have to handle "Next" button or somehow change page number - all entries are already in page source. Try below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))
driver.quit()
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html
r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
It appears this can actually be done more simply than the current approach. After calling driver.get, you can simply use the page_source property to get the HTML behind the page. From there you can pull out the data for all 25 pages at once. To see how it's structured, just right-click and choose "View source" in Chrome.
html_string=driver.page_source
I am using Ghost and BeautifulSoup to parse an HTML page. The problem is that the content of this page is dynamic (created with AngularJS). At first the HTML only shows something like "please wait! page loading". After a few seconds the real content appears. Using Ghost and BeautifulSoup I just get the HTML of the loading page, with only 2 small divs. The URL stays the same. Is it possible to wait until the "real" content is loaded?
Load the page in a real browser (headless like PhantomJS is also an option) automated by selenium, wait for the desired contents to appear, get the .page_source and pass it to BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Note: webdriver.PhantomJS was removed in Selenium 4; a headless Chrome or Firefox driver fills the same role there
driver = webdriver.PhantomJS()
driver.get("your url here")
# Wait until the element holding the real content is visible (adjust the locator for your page)
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "content")))
data = driver.page_source
driver.quit()
soup = BeautifulSoup(data, "html.parser")
Use PhantomJS to open the page.
Save it as a local file using the PhantomJS File System Module API.
Later, use this local file handle to create a BeautifulSoup object and then parse the page.
See http://www.kochi-coders.com/2014/05/06/scraping-a-javascript-enabled-web-page-using-beautiful-soup-and-phantomjs/
I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days many websites, e.g., Facebook and Pinterest, have infinite scrollers.
You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.
Step 1 : Install Selenium using pip
pip install selenium
Step 2 : use the code below to automate infinite scroll and extract the source code
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest
class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
    def tearDown(self):
        # Close the browser even if the test fails
        self.driver.quit()
    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so the page keeps loading more results
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        # Grab the fully expanded page once scrolling is done
        html_source = driver.page_source
        data = html_source.encode('utf-8')
if __name__ == "__main__":
    unittest.main()
Step 3 : Print the data if required.
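A fixed number of scrolls either wastes time or stops too early. One possible variation, sketched below under the assumption that the page height stops growing once no more content loads, is to scroll until `document.body.scrollHeight` no longer changes. `scroll_to_bottom` is a hypothetical helper, and `FakeDriver` is a stand-in used only so the sketch runs without a browser; a real Selenium driver can be passed in its place because only `execute_script` is used.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=100):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the site's JavaScript time to append content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new arrived, so assume the end was reached
        last_height = new_height
        loaded += 1
    return loaded


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll request: advance to the next recorded height, if any.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))  # prints: 2
```

The `max_rounds` cap is a safety net for feeds that never end; a slow site may also need a longer `pause` so that a late response is not mistaken for the end of the page.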
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
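Once you have found the URL in the console, the rest is ordinary request-and-parse code. The sketch below is only an illustration: the endpoint `https://example.com/api/feed`, the `page` and `size` parameters, and the `items`/`title` fields are all hypothetical, and a real site will use its own names. The response body is hard-coded here so the example runs without network access.

```python
import json
from urllib.parse import urlencode


def build_page_url(base_url, page, page_size=20):
    """Build the URL the page's JavaScript would request for one batch."""
    return base_url + "?" + urlencode({"page": page, "size": page_size})


def extract_titles(payload):
    """Pull the 'title' field out of each item in a JSON response body."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


# Hypothetical endpoint and response body, standing in for what the console shows.
url = build_page_url("https://example.com/api/feed", page=2)
sample_body = '{"items": [{"title": "first"}, {"title": "second"}], "next": 3}'

print(url)                          # https://example.com/api/feed?page=2&size=20
print(extract_titles(sample_body))  # ['first', 'second']
```

In a real script the body would come from something like `urllib.request.urlopen(url).read()` or `requests.get(url).text`, and you would keep requesting the next page until the response contains no more items.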
Finding the URL of the AJAX source is the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.