KeyError 'href' - Python / Selenium / Beautiful Soup

I'm running into an issue when web-scraping a large web page: the scrape works fine for the first 30 href links, but then raises a KeyError: 'href' at roughly 25% of the way through the page contents.
The elements look the same for the entire web page, i.e. there is no visible difference between the last scraped element and the element that stops the script. Is this caused by the driver not loading the entire web page in time for the scrape to complete, or by it only partially loading the web page?
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
from random import randint
chromedriver_path = r"C:\Program Files (x86)\chromedriver.exe"
service = Service(chromedriver_path)
options = Options()
# options.headless = True
options.add_argument("--incognito")
driver = webdriver.Chrome(service=service, options=options)
url = 'https://hackerone.com/bug-bounty-programs'
driver.get(url)
sleep(randint(15,20))
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source,'html.parser')
# driver.quit()
links = soup.find_all("a")
for link in links:
    print(link['href'])
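
For context on the error itself (independent of any timing issue): in BeautifulSoup, indexing a tag like link['href'] raises KeyError when that attribute is missing, and pages routinely contain <a> elements with no href at all (for example anchors used only as JavaScript hooks). A minimal guard is to request only anchors that actually carry the attribute, or to fall back to .get(), which returns None instead of raising:

# only anchors that have an href attribute
for link in soup.find_all("a", href=True):
    print(link["href"])

# or keep find_all("a") and use .get()
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)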

There is no need for Selenium if all you want are the bounty links. Retrieving just those seems more desirable than grabbing every link off the page, and it also removes the duplicates you get when scraping all links.
Simply use the query-string endpoint that returns the bounties as JSON. You can then update the URLs to include the protocol and domain.
import requests
import pandas as pd
data = requests.get('https://hackerone.com/programs/search?query=bounties:yes&sort=name:ascending&limit=1000').json()
df = pd.DataFrame(data['results'])
df['url'] = 'https://hackerone.com' + df['url']
print(df.head())
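
If you only need the links themselves, or want them on disk, a small follow-up on the same DataFrame might look like this; df.to_csv and the 'url' column are standard pandas usage, and the filename is just an example:

# print only the rebuilt bounty links
print(df['url'].tolist())

# or save the whole result for later use (example filename)
df.to_csv('bounty_programs.csv', index=False)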

Related

web scraping table with selenium gets only html elements but no content

I am trying to scrape tables using selenium and beautifulsoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without any text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help, what am I missing?
Thank you
The website takes time to load the data into the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
Or apply an explicit wait so that the rows are loaded in the table:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)
BeautifulSoup will not find the table because, from its point of reference, the table does not exist yet. With the line below you only tell Selenium to pause its own element matcher if it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try to get the HTML table itself. Since Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code execution to proceed. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is loaded asynchronously from the endpoint used in the snippet below (you can find it yourself with the Chrome DevTools Network tab). You can pair this with the json module to turn the response into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
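
As a small aside, requests can also decode the JSON for you via the response's .json() method, so the separate json import isn't strictly needed; the exact structure of the returned data depends on the endpoint, so inspect it before relying on particular keys:

from requests import get

# .json() parses the response body straight into Python objects
data_dictionary = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()
print(type(data_dictionary))
print(data_dictionary)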
You can use this as the foundation for further work:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
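
If you want only the cell text rather than the full <td> tags, calling get_text() on each element is enough:

for td in soup.find_all('td', class_=TDCLASS):
    print(td.get_text(strip=True))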

Python web scrape from asx - could not get the announcement table

I am trying to scrape the announcement table from the ASX page; however, when I use BeautifulSoup to parse the HTML, the table is not there.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://www2.asx.com.au/markets/trade-our-cash-market/announcements.cba'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')
The above code returns nothing in table, but there is an announcement table on the webpage. How can I scrape the table?
The data is dynamically loaded. Use Selenium, or another tool that allows the content to load, then pass the page source to bs4. You'll need to install selenium and download chromedriver.exe.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
url = 'https://www2.asx.com.au/markets/trade-our-cash-market/announcements.cba'
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all('table') # you should get a list of size one; if there is only one table, you might just want to use find instead
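
If the goal is a usable table rather than raw HTML, pandas can parse it directly from the rendered page source. This is a minimal sketch assuming pandas is installed and that the announcements table is the first (or only) <table> in the markup:

from io import StringIO
import pandas as pd

# read_html returns one DataFrame per <table> found in the markup
tables = pd.read_html(StringIO(html))
df = tables[0]          # assumes the announcements table is the first table
print(df.head())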

beautifulsoup scrape realtime values

I am trying to scrape the currency rates for a personal project. I used a CSS selector to get the class where the values are. There is JavaScript providing those values on the website, and it seems I am not too conversant with the developer console; I checked it out and could not see anything running in real time in the Network section. This is the code I wrote; so far it brings out a long list of dashes. Surprisingly, the dashes match the source code for those parts where the rates are supposed to show.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.ig.com/en/forex/markets-forex")
soup = BeautifulSoup(r.content, "html.parser")
results = soup.findAll("span",attrs={"data-field": "CPT"})
for span in results:
    print(span.text)
The span elements are filled in via JavaScript with dynamic values; when the page first loads, each span contains only '-'.
You need a JavaScript-capable driver so that you can wait for the elements to be filled and then read the values from the spans.
With Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    print(elm, elm.text)
Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home
Also, dryscrape + bs4 would work, but dryscrape seems outdated.
Modified:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
time.sleep(2)  # Maybe more or less, depending on how fast the page loads
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    if elm.text:
        print(elm, elm.text)
or
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
data = []
while not data:
    for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
        if elm.text and elm.text != '-':  # Maybe also check that the text contains a digit
            data.append(elm.text)
    time.sleep(1)
print(data)
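
Instead of polling in a while loop, you could also let Selenium do the waiting with an explicit wait. This is a sketch, not from the original answer, using a custom condition that waits until at least one rate span shows something other than the '-' placeholder:

from selenium.webdriver.support.wait import WebDriverWait

SELECTOR = "span[data-field=CPT]"
wait = WebDriverWait(driver, 20)
# until() keeps calling the lambda with the driver until it returns a truthy value or times out
wait.until(lambda d: any(e.text and e.text != '-'
                         for e in d.find_elements(By.CSS_SELECTOR, SELECTOR)))
data = [e.text for e in driver.find_elements(By.CSS_SELECTOR, SELECTOR) if e.text != '-']
print(data)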

Trying to Get Selenium to Download Data Based on JavaScript...I think

I am trying to download data from the following URL.
https://www.nissanusa.com/dealer-locator.html
I came up with this, but it doesn't actually grab any of the data.
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
I've done this a couple of times before, and it has always worked in the past. I'm guessing the data is dynamically generated by JavaScript, based on the filters that a user selects, but I don't know for sure. I've read that Selenium can be used to automate a web browser, but I have never used it, and I'm not really sure where to start. Ultimately, I am trying to get the data in the format shown in the image below. Either printed in the console window or downloaded to a CSV would be fine.
Finally, how the heck does the site get the data? Whether I enter New York City or San Francisco, the map and the data set change relative to the filter that is applied, but the URL does not change at all. Thanks in advance.
Use selenium to open/navigate to the page, then pass the page source to BeautifulSoup.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
url = 'https://www.nissanusa.com/dealer-locator.html'
browser.get(url)
time.sleep(10)  # wait for the page to finish loading
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
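
Since the question mentions that a CSV download would also be fine, here is a small follow-up using the standard csv module; the two-column layout (dealer text plus link) is just an illustration, not the site's actual schema:

import csv

# collect (dealer text, link) pairs instead of just printing them
rows = []
for div in data:                      # 'data' is the list of dealer-info divs from above
    dealer_text = div.get_text(" ", strip=True)
    for a in div.findAll('a'):
        rows.append([dealer_text, a.get('href')])

with open('dealers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['dealer_info', 'link'])   # illustrative header
    writer.writerows(rows)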

Can't View Complete Page Source in Selenium

When I view the source HTML after manually navigating to the site via Chrome, I can see the full page source, but when loading the page via Selenium I'm not getting the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
import sys,time
driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")
driver.get('http://www.magicbricks.com/')
driver.find_element_by_id("buyTab").click()
time.sleep(5)
driver.find_element_by_id("keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element_by_id("btnPropertySearch").click()
time.sleep(30)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"lxml")
print(soup.prettify())
The website is possibly blocking or restricting the user agent for selenium. An easy test is to change the user agent and see if that does it. More info at this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(chrome_options=opts)
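
In practice you would replace the placeholder above with a realistic browser user-agent string. The value below is only an example of what such a string looks like; if you go this route, copy the current one from your own browser (e.g. from chrome://version):

opts.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)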
Try something like:
import time
time.sleep(5)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript before their content appears in the DOM.
