I want to get data from a website and store the html code using selenium. I wrote the following code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get(r'http://www.example.com')
driver.page_source #get the html code
What should I do?
Thank you.
Try this:
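from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(r'http://www.example.com')
html = driver.page_source  # the HTML of the current page as a string
# alternatively, read the outer HTML of the root element
elem = driver.find_element(By.XPATH, "//*")
source = elem.get_attribute("outerHTML")
driver.quit()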
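The answer reads the HTML into a variable but does not store it anywhere, which was part of the question. A minimal sketch of the saving step, assuming the page has already been loaded; `save_html` and the file name `page.html` are hypothetical names used for illustration, and the Selenium calls appear only as comments so the block runs without a browser:

```python
from pathlib import Path


def save_html(html: str, path: str) -> int:
    """Write the HTML text to `path` as UTF-8 and return the number of characters written."""
    return Path(path).write_text(html, encoding="utf-8")


# With a live driver the call would look like this (not run here):
#   driver.get("http://www.example.com")
#   save_html(driver.page_source, "page.html")
#   driver.quit()
```

Writing with an explicit UTF-8 encoding avoids surprises on systems whose default encoding cannot represent every character in the page.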
#scraping ESPN
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.espn.com/womens-college-basketball/scoreboard/_/date/20221107').text
soup = BeautifulSoup(html_text, 'lxml')
game = soup.find('ul', class_= "ScoreCell__Competitors").text
print(game)
#the text "Cleveland State" should be returned. I am a web scraping novice, any help is appreciated.
Try using Selenium with Chrome.
Download Chrome and ChromeDriver.
Install Selenium:
pip install selenium
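from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
DRIVER_PATH = '/path/to/chromedriver'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(DRIVER_PATH))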
driver.get('https://google.com')
Then get the element by its class name through the driver:
h1 = driver.find_element(By.CLASS_NAME, 'ScoreCell__Competitors')
I am working on an office project that checks the active status on different websites. When I fetch the data, it sometimes returns None and sometimes raises an AttributeError. I followed the steps from YouTube videos but still get this error. Please help.
//Python Code
from bs4 import BeautifulSoup
import requests
html_text = requests.get(
"https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw").text
soup = BeautifulSoup(html_text, 'lxml')
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)
The page is dynamic, meaning the data is populated by JavaScript, so you need an automation tool such as Selenium.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # crude wait so JavaScript has time to render the page
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)
Output:
Active
You have the most common problem: modern pages use JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
So soup.find('div', ...) returns None instead of the expected element, and the later call status.find('p') then fails because it is effectively None.find('p'), which raises the AttributeError.
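A quick way to confirm this is to check whether the class appears at all in the HTML that requests receives. A minimal sketch using only the standard library; `ClassFinder` and `has_class` are hypothetical helper names, and the two HTML strings are made-up stand-ins for the server response and the rendered page, not real output from the site:

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries the wanted CSS class."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.wanted in value.split():
                self.found = True


def has_class(html: str, wanted: str) -> bool:
    finder = ClassFinder(wanted)
    finder.feed(html)
    return finder.found


# Made-up examples: what the server might send vs. what the browser renders
raw = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered = '<div id="root"><div class="ValidatorInfo_statusBadge__PBIGr"><p>Active</p></div></div>'

print(has_class(raw, "ValidatorInfo_statusBadge__PBIGr"))       # False
print(has_class(rendered, "ValidatorInfo_statusBadge__PBIGr"))  # True
```

If the class is missing from the raw response, no selector will find it, and the element has to come from a browser or from the data source the page's JavaScript uses.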
You can use Selenium to control a real web browser, which can run JavaScript.
from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
#from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
url = "https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw"
# For Chrome, use ChromeDriverManager with selenium.webdriver.chrome.service.Service instead
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
driver.get(url)
#status = driver.find_element(By.XPATH, '//div[@class="ValidatorInfo_statusBadge__PBIGr"]')
# wait up to 10 seconds until the badge is visible
wait = WebDriverWait(driver, 10)
status = wait.until(EC.visibility_of_element_located((By.XPATH, '//div[@class="ValidatorInfo_statusBadge__PBIGr"]')))
print(status.text)
Alternatively, check whether the page offers an API for the data.
You can also use DevTools (Network tab) to see whether JavaScript reads the data from some URL, and then try that URL with requests. This can be faster than Selenium, but the server may detect the script/bot and block it.
JavaScript usually gets data as JSON, so you may not need to scrape HTML with BeautifulSoup at all.
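A minimal sketch of reading a value out of such a JSON response. The payload and its field names (`validator`, `status`) are made up for illustration; the real URL and structure have to be found in the Network tab:

```python
import json

# Made-up payload; the real endpoint and its field names must be found in DevTools
sample_body = '{"validator": {"moniker": "example", "status": "Active"}}'


def extract_status(body: str):
    """Return the nested status value, or None if the expected keys are missing."""
    data = json.loads(body)
    return data.get("validator", {}).get("status")


print(extract_status(sample_body))               # Active
print(extract_status('{"error": "not found"}'))  # None
```

With requests, `requests.get(api_url).json()` returns the same kind of parsed structure, so the lookup part stays the same.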
I am using the code below to print the soup variable, which is simply the source code of the page.
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
import re, time
yes_url = "https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card"
driver = webdriver.Chrome(executable_path="C:\\Users\\Hari\\Downloads\\chromedriver.exe")
driver.get(yes_url)
time.sleep(3)
# r = requests.get(yes_url)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
driver.close()
The link I am scraping the page source from is: https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card
After I run the code above, it keeps running for hours and I never get any output.
Please help me scrape the page source so that I get some output when I run the code.
Issue: You are dealing with a modern website that checks whether the browser itself is being controlled by an automation tool.
How can that be done?
Simply open your browser console and type the following:
navigator.webdriver
If it is false, your browser isn't controlled by an automation program such as Selenium.
If it is true, it is controlled.
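The same check can be run from Python through the driver's `execute_script` method. A minimal sketch; `is_flagged_as_automated` is a hypothetical helper name, and `FakeDriver` is a stand-in so the block runs without a browser (with Selenium you would pass the real driver object):

```python
def is_flagged_as_automated(driver) -> bool:
    """Ask the page's JavaScript whether it sees a WebDriver-controlled browser."""
    # execute_script runs JavaScript in the current page and returns its result
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        return self.flag


print(is_flagged_as_automated(FakeDriver(True)))  # True
print(is_flagged_as_automated(FakeDriver(None)))  # False
```

Running this before and after changing the browser options shows whether the change had any effect on what the page can see.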
In your case, you have to disable it in order to get past the website's detection mechanism.
The code below achieves your goal:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument("-headless")  # run Firefox without a visible window
# ask Firefox not to expose navigator.webdriver to the page
options.set_preference("dom.webdriver.enabled", False)
driver = webdriver.Firefox(options=options)
driver.get('https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card')
try:
    # wait up to 10 seconds for the page title to contain 'YES'
    element = WebDriverWait(driver, 10).until(
        EC.title_contains('YES'))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.prettify())
finally:
    driver.quit()
I am trying to learn data scraping using python and have been using the Requests and BeautifulSoup4 libraries. It works well for normal html websites. But when I tried to get some data out of websites where the data loads after some delay, I found that I get an empty value. An example would be
from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver
url = "https://www.example.com/;1"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('span', 'buy')
print(a)
I am trying to grab the from here:
(value)
I have already looked at a similar topic and tried writing my code along the lines of the solution provided there, but somehow it doesn't seem to work. I am a novice, so I need help getting this to work.
How to scrape html table only after data loads using Python Requests?
The table (content) is probably generated by JavaScript and thus can't be "seen". I am using python3.6 / PhantomJS / Selenium as proposed by a lot of answers here.
You have to run a headless browser to scrape content that loads after a delay. Please use Selenium.
Here is sample code. It uses the Chrome browser as the driver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
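browser = webdriver.Chrome()  # Selenium 4.6+ finds chromedriver itself; older versions need the chromedriver path
browser.set_window_size(1120, 550)
browser.get(link)  # link: the URL of the page to scrape
# wait up to 3 seconds for the element to appear in the DOM
element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "blabla"))
)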
data = element.get_attribute('data-blabla')
print(data)
browser.quit()
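To make clear what the explicit wait in this sample does, here is a rough standard-library sketch of the same polling idea. It is not Selenium's implementation: `wait_until` and `data_ready` are hypothetical names, the built-in `TimeoutError` stands in for Selenium's `TimeoutException`, and the sketch leaves out the exception handling that `WebDriverWait` adds:

```python
import time


def wait_until(condition, timeout=3.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Made-up condition: the "data" becomes available on the third check
attempts = {"count": 0}


def data_ready():
    attempts["count"] += 1
    return "loaded" if attempts["count"] >= 3 else None


print(wait_until(data_ready, timeout=2.0, poll=0.1))  # loaded
```

Unlike a fixed `time.sleep`, this returns as soon as the condition holds and fails loudly when it never does, which is why an explicit wait is usually the better choice for delayed content.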
You can get the desired values by requesting them directly from the API and parsing the JSON response.
import requests
import json
res = requests.get('https://api.example.com/api/')
d = json.loads(res.text)
print(d['market'])
I am trying to get the source code for a couple of links using selenium and BeautifulSoup. I open the first tab to get the source code which works fine, but the second tab gets stuck. I think it's something with BeautifulSoup. Does anyone know why or of an alternative for BeautifulSoup? Here is the code:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
links = []
driver = webdriver.Firefox()
driver.get('about:blank')
for link in links:
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
    browser.get(link)
    source = str(BeautifulSoup(browser.page_source))
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
driver.close()