Is it possible to go from Selenium to BeautifulSoup? - Python

I would like to scrape a website. I had to use Selenium to get past a login form, and I was wondering whether there is a way to use BeautifulSoup to scrape the site now that I'm logged in with Selenium.

A simple combination:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "url"  # placeholder: your target URL
browser = webdriver.Firefox()
browser.get(url)
# log in / scroll / click as needed with Selenium here

# hand the rendered HTML over to BeautifulSoup
full_page = browser.page_source
page_soup = BeautifulSoup(full_page, "html.parser")
# parse / find with BeautifulSoup from here on
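If you want to drop the browser entirely once the login is done, another pattern is to copy Selenium's session cookies into a requests session and continue with requests + BeautifulSoup. A minimal sketch; the URLs are placeholders, and this assumes the site's session works outside the browser:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests

browser = webdriver.Firefox()
browser.get("https://example.com/login")  # placeholder login URL
# ... drive the login form with Selenium here ...

# copy the authenticated cookies from the browser into a requests session
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])
browser.quit()

# from here on, plain requests + BeautifulSoup, no browser needed
response = session.get("https://example.com/protected-page")  # placeholder URL
page_soup = BeautifulSoup(response.text, "html.parser")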

Related

Why is my Beautiful Soup scraper returning no text? The script doesn't throw an error

# scraping ESPN
from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://www.espn.com/womens-college-basketball/scoreboard/_/date/20221107').text
soup = BeautifulSoup(html_text, 'lxml')
game = soup.find('ul', class_="ScoreCell__Competitors").text
print(game)
The text "Cleveland State" should be returned. I am a web-scraping novice; any help is appreciated.
Try using Selenium with Chrome. The scoreboard is rendered by JavaScript, so requests alone never sees it.
Download Chrome and ChromeDriver.
Install Selenium:
pip install selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
driver.get('https://www.espn.com/womens-college-basketball/scoreboard/_/date/20221107')
Then get the element by its class name using the driver:
h1 = driver.find_element(By.CLASS_NAME, 'ScoreCell__Competitors')
print(h1.text)
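Since the original parsing code was written for BeautifulSoup, you could also hand the rendered source back to bs4 rather than extracting text through Selenium. A sketch continuing from the driver above:
from bs4 import BeautifulSoup

# parse the JavaScript-rendered DOM that Selenium produced
soup = BeautifulSoup(driver.page_source, 'lxml')
game = soup.find('ul', class_="ScoreCell__Competitors")
if game is not None:
    print(game.text)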

KeyError 'href' - Python / Selenium / Beautiful Soup

I'm running into an issue when web-scraping a large web page: the scrape works fine for the first 30 href links but then hits a KeyError: 'href' about 25% of the way into the page contents.
The elements are the same for the entire page, i.e. there is no difference between the last successfully scraped element and the element that stops the script. Is this caused by the driver not loading the entire page in time for the scrape to complete, or by it only partially loading the page?
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
from random import randint

chromedriver_path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not treated as escapes
service = Service(chromedriver_path)
options = Options()
# options.headless = True
options.add_argument("--incognito")
driver = webdriver.Chrome(service=service, options=options)

url = 'https://hackerone.com/bug-bounty-programs'
driver.get(url)
sleep(randint(15, 20))
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source, 'html.parser')
# driver.quit()

links = soup.find_all("a")
for link in links:
    print(link['href'])  # raises KeyError for any <a> tag that has no href attribute
There is no need for Selenium if you just want the bounty links. That seems more desirable than grabbing every link off the page, and it also removes the duplicates you get when scraping all links.
Simply use the query-string endpoint that returns the bounties as JSON, then prepend the protocol and domain to the relative URLs.
import requests
import pandas as pd

# the search endpoint returns the bounty programs as JSON
data = requests.get('https://hackerone.com/programs/search?query=bounties:yes&sort=name:ascending&limit=1000').json()
df = pd.DataFrame(data['results'])
df['url'] = 'https://hackerone.com' + df['url']  # make the relative URLs absolute
print(df.head())
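If you would rather keep the original Selenium approach, note that the KeyError is not a loading problem: it is raised by any <a> tag that simply has no href attribute (anchors used as buttons, for example). A minimal fix:
# only match anchors that actually carry an href attribute
for link in soup.find_all("a", href=True):
    print(link["href"])

# or tolerate missing attributes with .get(), which returns None instead of raising
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)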

Python web scrape from asx - could not get the announcement table

I am trying to scrape the announcement table from an ASX page; however, when I use BeautifulSoup to parse the HTML, the table is not there.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://www2.asx.com.au/markets/trade-our-cash-market/announcements.cba'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')
The above code returns nothing in table, but there is an announcement table on the webpage. How do I scrape the table?
The data is dynamically loaded. Use Selenium or another tool that allows the content to load, then pass the rendered source to bs4. You'll need to install Selenium and download chromedriver.exe.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)

url = 'https://www2.asx.com.au/markets/trade-our-cash-market/announcements.cba'
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the table

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('table')  # should be a list of size one; with a single table you could just use find instead
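Since the target is a table, you could also let pandas parse it straight from the rendered source instead of walking it with bs4. A sketch, assuming lxml is installed and the rendered markup is a plain HTML table:
from io import StringIO
import pandas as pd

# pandas turns every <table> in the rendered HTML into a DataFrame
tables = pd.read_html(StringIO(driver.page_source))
print(tables[0].head())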

Can't View Complete Page Source in Selenium

When I view the source HTML after manually navigating to the site via Chrome, I can see the full page source, but when loading the page via Selenium I'm not getting the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(service=Service(r"C:\Python27\Scripts\chromedriver.exe"))
driver.get('http://www.magicbricks.com/')
driver.find_element(By.ID, "buyTab").click()
time.sleep(5)
driver.find_element(By.ID, "keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element(By.ID, "btnPropertySearch").click()
time.sleep(30)

content = driver.page_source
soup = BeautifulSoup(content, "lxml")
print(soup.prettify())
The website is possibly blocking or restricting Selenium's default user agent. An easy test is to change the user agent and see if that fixes it. More info at this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("user-agent=whatever you want")  # placeholder UA string
driver = webdriver.Chrome(options=opts)
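You can verify the override took effect by asking the browser what it is actually reporting:
# read back the user agent the browser is sending
print(driver.execute_script("return navigator.userAgent"))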
Try something like:
import time
time.sleep(5)
# pull the live DOM via JavaScript instead of the serialized page source
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript before the full markup exists in the DOM.
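A more robust alternative to the fixed time.sleep calls is an explicit wait, so the source is only grabbed once the results actually exist. A sketch; the selector is a placeholder you would need to replace with a real one from the results page:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 30 seconds for the (hypothetical) results container to appear
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".resultBlock")))  # placeholder selector
content = driver.page_source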

Crawling iframe using beautifulsoup and selenium in python

I want to crawl a website that contains an iframe:
see http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896. It has two options in the Chrome context menu ("view page source" and "view frame source").
But accessing the URL with Beautiful Soup, urllib2, or Selenium gives me only the page source, without the iframe contents.
How can I access the iframe source that I can see in Chrome?
The code below only retrieves the page source of that website.
from selenium import webdriver
from urllib.request import urlopen  # urllib2 in the original Python 2 code
from bs4 import BeautifulSoup

url = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896"
f = urlopen(url)
# or
browser = webdriver.Chrome()
browser.get(url)
html_source = browser.page_source
# both approaches return only the top-level page source, not the iframe contents
It was simply solved by accessing the URL below, which is the document the iframe points to:
http://dart.fss.or.kr/report/viewer.do?rcpNo=20150515001896&dcmNo=4671059&eleId=17&offset=1015699&length=132786&dtd=dart3.xsd
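More generally, when the document you need lives inside an iframe, Selenium can switch its context into the frame and hand that frame's rendered DOM to BeautifulSoup. A sketch, assuming the first iframe on the page is the one you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896")

# switch the driver's context into the iframe, then read its source
iframe = browser.find_element(By.TAG_NAME, "iframe")
browser.switch_to.frame(iframe)
frame_soup = BeautifulSoup(browser.page_source, "html.parser")

# return to the top-level document when done
browser.switch_to.default_content()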
