When I view the HTML source after manually navigating to the site in Chrome I can see the full page source, but when loading the page via Selenium I'm not getting the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
import sys,time
driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")
driver.get('http://www.magicbricks.com/')
driver.find_element_by_id("buyTab").click()
time.sleep(5)
driver.find_element_by_id("keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element_by_id("btnPropertySearch").click()
time.sleep(30)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"lxml")
print soup.prettify()
The website is possibly blocking or restricting the default user agent Selenium uses. An easy test is to change the user agent and see if that fixes it. More info in this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(chrome_options=opts)
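To confirm the override actually took effect, you can ask the browser which user agent it reports (a quick check, not part of the quoted answer):
# print the user agent the browser is currently reporting
print(driver.execute_script("return navigator.userAgent"))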
Try something like:
import time
time.sleep(5)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript before the full content appears in the DOM.
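If the missing content is simply still rendering, an explicit wait tends to be more reliable than fixed time.sleep() calls. A minimal sketch for the original script, assuming the search results are rendered as elements with a class such as resultBlockWrapper (a hypothetical selector; adjust it to the actual markup):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 30 seconds for at least one result card before reading the source
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".resultBlockWrapper")))  # hypothetical class
content = driver.page_source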
For a fun web-scraping project, I want to collect NHL data from https://www.nhl.com/stats/teams.
There is a clickable Excel Export tag which I can find using selenium and bs4.
Unfortunately, this is where it ends:
Since there is no href attribute it seems that I cannot access the data.
I got what I wanted by using pynput to simulate a mouse click, but I wonder:
Could I do that differently? It feels so clumsy.
-> The tag with the Export icon can be found here:
a class="styles__ExportIcon-sc-16o6kz0-0 dIDMgQ"
-> Here is my code:
import pynput
from pynput.mouse import Button, Controller
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path = r'somepath\chromedriver.exe')
URL = 'https://www.nhl.com/stats/teams'
driver.get(URL)
html = driver.page_source # DOM with JavaScript execution complete
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('body')
print(body.prettify())
mouse = Controller()
time.sleep(5) # Sleep for 5 seconds until page is loaded
mouse.position = (1204, 669) # thats where the icon is on my screen
mouse.click(Button.left, 1) # executes download
There is no href attribute; the download is triggered via JavaScript. While working with Selenium, find your element and use .click() to download the file:
driver.find_element(By.CSS_SELECTOR,'h2>a').click()
CSS selectors are used here to get the direct child <a> of the <h2>, or you can select it directly by a class starting with styles__ExportIcon:
driver.find_element(By.CSS_SELECTOR,'a[class^="styles__ExportIcon"]').click()
Example
You may have to deal with the OneTrust cookie banner, so click it first and then download the sheet.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.nhl.com/stats/teams'
driver.get(url)
driver.find_element(By.CSS_SELECTOR,'#onetrust-reject-all-handler').click()
driver.find_element(By.CSS_SELECTOR,'h2>a').click()
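If you want the exported file to land in a known folder, the Chrome download directory can be set through experimental prefs before the driver is created. A sketch, assuming standard Chrome preference keys (the target path below is only a placeholder):
from selenium.webdriver.chrome.options import Options
options = Options()
# send downloads to a fixed folder instead of the user's default Downloads
options.add_experimental_option("prefs", {
    "download.default_directory": r"C:\temp\nhl_exports",  # placeholder path
    "download.prompt_for_download": False,
})
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)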
A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser with Selenium to get the HTML, because that would slow the process down too much.
I tried httpx, httplib2, urllib and urllib3, but I couldn't get the later-loaded section.
You can use Selenium to simulate a user-like page load, wait for the additional HTML elements to appear, and then parse the result with the BeautifulSoup library.
I would suggest using Selenium since it contains the WebDriverWait class, which can help you scrape the additional HTML elements.
This is my simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Replace with the URL of the website you want
url = "https://www.example.com"
# Add the option for a headless browser
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Create a new instance of the Chrome webdriver with those options
driver = webdriver.Chrome(options=options)
driver.get(url)
# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))
# Get HTML
html = driver.page_source
print(html)
driver.close()
In the example above you can see that I'm using an explicit wait of up to 10 seconds for a specific condition to occur. More specifically, I'm waiting until the elements with the 'lazy-load' class are located via By.XPATH, and then I retrieve the HTML.
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.
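Since the recommendation is to combine the two, here is a minimal sketch of handing the rendered source to BeautifulSoup once the wait has completed (run this before driver.close(); the 'lazy-load' class is carried over from the example above):
from bs4 import BeautifulSoup
# parse the fully rendered DOM that Selenium returned
soup = BeautifulSoup(driver.page_source, "html.parser")
for element in soup.select(".lazy-load"):
    print(element.get_text(strip=True))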
I'm running into an issue when web-scraping a large web page. My scrape works fine for the first 30 href links, but then runs into a KeyError: 'href' at around 25% of the page contents.
The elements remain the same for the entire web page, i.e. there is no difference between the last scraped element and the next element that stops the script. Is this caused by the driver not loading the entire web page in time for the scrape to complete, or only partially loading the web page?
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
from random import randint
chromedriver_path = r"C:\Program Files (x86)\chromedriver.exe"
service = Service(chromedriver_path)
options = Options()
# options.headless = True
options.add_argument("--incognito")
driver = webdriver.Chrome(service=service, options=options)
url = 'https://hackerone.com/bug-bounty-programs'
driver.get(url)
sleep(randint(15,20))
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source,'html.parser')
# driver.quit()
links = soup.find_all("a")
for link in links:
    print(link['href'])
There is no need for Selenium if you only want the bounty links. That seems more desirable than grabbing every link off the page, and it also removes the duplicates you get when scraping all links.
Simply use the query-string endpoint that returns the bounties as JSON. You can then update the URLs to include the protocol and domain.
import requests
import pandas as pd
data = requests.get('https://hackerone.com/programs/search?query=bounties:yes&sort=name:ascending&limit=1000').json()
df = pd.DataFrame(data['results'])
df['url'] = 'https://hackerone.com' + df['url']
print(df.head())
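As a side note on the original error: the KeyError comes from anchor tags that have no href attribute at all (e.g. JavaScript-only links), not from a partially loaded page. If you do stick with the BeautifulSoup approach, restricting the search to anchors that actually carry the attribute avoids it:
# only match <a> tags that have an href attribute
links = soup.find_all("a", href=True)
for link in links:
    print(link['href'])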
I'm new to Selenium and I wrote this code that takes user input and searches on eBay, but I want to save the link of the new search results page so I can pass it on to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
search_ = input()
browser = webdriver.Chrome(r'C:\Users\Leila\Downloads\chromedriver_win32')
browser.get("https://www.ebay.com.au/sch/i.html?_from=R40&_trksid=p2499334.m570.l1311.R1.TR12.TRC2.A0.H0.Xphones.TRS0&_nkw=phones&_sacat=0")
Search = browser.find_element_by_id('kw')
Search.send_keys(search_)
Search.send_keys(Keys.ENTER)
#how do you write a code that gets the link of the new page it loads
To extract a link from an element on a webpage, you need to read its href attribute using the get_attribute() method.
This example from here illustrates how it would work.
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.w3.org/')
for a in driver.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))
In your case, do:
Search = browser.find_element_by_id('kw')
page_link = Search.get_attribute('href')
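If what you actually want is the URL of the results page after the search has been submitted (rather than an attribute of the search box), the driver exposes it directly. A small sketch:
import time
time.sleep(5)  # give the results page a moment to load
results_url = browser.current_url
print(results_url)  # this is the link you can pass on to BeautifulSoup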
I was trying to fill in a form using mechanize, but the problem is that the webpage needs JavaScript. Whenever I try to access the page, it redirects to an error page saying JavaScript is needed. Is there a way to enable JavaScript when using the mechanize browser?
Here is the code
import mechanize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://192.168.10.3/connect/PortalMain")
for f in br.forms():
    print f
Also, when I tried to extract the webpage (which works fine in my browser) using BeautifulSoup, I got the same problem: it redirects to a new page.
(I tried disabling JavaScript in my browser and got the same page that BeautifulSoup was showing me.)
Here is the BeautifulSoup code, in case it helps:
import ssl
import urllib2
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
page = urllib2.urlopen("https://192.168.10.3/connect/PortalMain")
soup = BeautifulSoup(page,'html.parser')
print soup
You could just go ahead and use Selenium instead:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
usernameStr = 'putYourUsernameHere'
passwordStr = 'putYourPasswordHere'
browser = webdriver.Chrome()
browser.get('https://192.168.10.3/connect/PortalMain')
# fill in username and hit the next button (replace selectors!)
username = browser.find_element_by_id('Username')
username.send_keys(usernameStr)
password = browser.find_element_by_id('Password')
password.send_keys(passwordStr)
loginButton = browser.find_element_by_id('login')
loginButton.click()
This will use the Chrome web driver to open the browser and log in; you can switch it to any other driver Selenium supports, e.g. Firefox.
Source: https://www.hongkiat.com/blog/automate-create-login-bot-python-selenium/
Remember you might need to make adjustments if the site is using a self-signed certificate.
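A minimal sketch of one such adjustment, assuming a reasonably recent Selenium release and that you are willing to accept the self-signed certificate (acceptInsecureCerts is a standard WebDriver capability):
options = webdriver.ChromeOptions()
options.accept_insecure_certs = True  # accept the portal's self-signed certificate
browser = webdriver.Chrome(options=options)
browser.get('https://192.168.10.3/connect/PortalMain')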