Web scraping not getting complete source code data via Selenium/BS4


How do I scrape the data in the input tag's value attributes from the source I inspect as shown in image?
I have tried using BeautifulSoup and Selenium, and neither of them works for me.
Partial code is below:
html=driver.page_source
output=driver.find_element_by_css_selector('#bookingForm > div:nth-child(1) > div.bookingType > div:nth-child(15) > div.col-md-9 > input').get_attribute("value")
print(output)
This returns a NoSuchElementException error.
In fact, when I print(html), much of the source appears to be missing. I suspect a JavaScript-related issue, but Selenium, which usually handles JavaScript-rendered pages, is not working for me on this site. Any idea why?
I tried these as well:
html=driver.page_source
soup=bs4.BeautifulSoup(html,'lxml')
test = soup.find("input",{"class":"inputDisable"})
print(test)
print(soup)
print(test) returns None, and print(soup) returns the source with most input tags entirely missing.

First check that the element is actually present on the site by inspecting the page.
If it is there, Selenium is often simply too fast: it reads the page before it has finished loading. Try Selenium's explicit wait (WebDriverWait); that is very often the cause.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3  # seconds
try:
    # Block until the element exists in the DOM, or raise TimeoutException after `delay` seconds
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

Try to use find or find_all functions. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
from requests import get
from bs4 import BeautifulSoup
url = 'your url'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
bs = BeautifulSoup(response.text,"lxml")
test = bs.find("input",{"class":"inputDisable"})
print(test)
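Before reaching for a browser, it can help to confirm whether the value is present at all in the HTML that requests receives. What follows is only a minimal sketch: it swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages. The inputDisable class comes from the question, while InputValueCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Collect the value attribute of every <input> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.values.append(attr_map.get("value"))


# Hypothetical markup standing in for response.text
sample_html = """
<form id="bookingForm">
  <input class="inputDisable" value="ABC123">
  <input class="other" value="ignored">
</form>
"""

collector = InputValueCollector("inputDisable")
collector.feed(sample_html)
print(collector.values)
```

If the list comes back empty for the real page, the inputs are probably added later by JavaScript, and a plain HTTP request will not be enough.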

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib.request
import time
from bs4 import BeautifulSoup
from datetime import date
URL="https://yourUrl.com"
# Chrome session
driver = webdriver.Chrome("PathOfTheBrowserDriver")
driver.get(URL)
driver.implicitly_wait(100)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
The point is to pause BEFORE making the soup, as the code above does, so that the page's requests have time to finish (some late requests may contain what you're looking for).
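A fixed pause either wastes time or turns out too short. One possible alternative, sketched here under assumptions, is to re-read the page source until it contains what you need. The helper wait_for_source and its parameters are hypothetical names, not part of Selenium; with a real driver, get_source would be something like lambda: driver.page_source.

```python
import time


def wait_for_source(get_source, is_ready, timeout=10.0, poll=0.5):
    """Poll get_source() until is_ready(source) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if is_ready(source):
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not reach the expected state in time")
        time.sleep(poll)
```

For the question above, a call might look like wait_for_source(lambda: driver.page_source, lambda s: "inputDisable" in s), with the returned string then passed to BeautifulSoup.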

Related

AttributeError: 'NoneType' object has no attribute 'find' Web Scraping Python

I am working on an office project that fetches data to check active status on different websites. When I fetch the data, sometimes I get None and sometimes I get this AttributeError. I followed the steps from YouTube videos but still get the error. Please help.
//Python Code
from bs4 import BeautifulSoup
import requests
html_text = requests.get(
    "https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw").text
soup = BeautifulSoup(html_text, 'lxml')
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)

The page is dynamic, meaning the data is populated by JavaScript, so you need an automation tool such as Selenium.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
#driver.close()
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)
Output:
Active

You have the most common problem: modern pages use JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
So soup.find('div', ...) returns None instead of the expected element, and the later call fails because it amounts to None.find('p').
You can use Selenium to control a real web browser, which can run JavaScript.
from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
url = "https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw"
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
#status = driver.find_element(By.XPATH, '//div[@class="ValidatorInfo_statusBadge__PBIGr"]')
wait = WebDriverWait(driver, 10)
status = wait.until(EC.visibility_of_element_located((By.XPATH, '//div[@class="ValidatorInfo_statusBadge__PBIGr"]')))
print(status.text)
You should also check whether the page offers an API for getting the data.
You can also use DevTools (Network tab) to check whether JavaScript reads the data from some URL, and then try that URL with requests. This can be faster than Selenium, but the server may detect the script/bot and block it.
JavaScript usually gets data as JSON, so you may not need to scrape HTML with BeautifulSoup at all.
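Whichever way the HTML is obtained, the None failure described in this answer can be turned into an explicit fallback. This is a minimal sketch, assuming nodes that behave like BeautifulSoup tags (a find(name, attrs) method that returns None on a miss, and get_text); find_text is a hypothetical helper, not part of bs4.

```python
def find_text(root, *steps, default=None):
    """Follow a chain of find() calls, returning default as soon as one misses.

    Each step is a (tag_name, attrs_dict) pair passed straight to find().
    """
    node = root
    for name, attrs in steps:
        if node is None:
            return default
        node = node.find(name, attrs)
    if node is None:
        return default
    return node.get_text(strip=True)
```

With the soup from the question, the call would be find_text(soup, ("div", {"class": "ValidatorInfo_statusBadge__PBIGr"}), ("p", {}), default="unknown"), which yields "unknown" instead of raising AttributeError when the badge is missing.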

web scraping table with selenium gets only html elements but no content

I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without any text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help what am I missing?
Thank you

The website takes time to load the data into the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
Or apply an explicit wait so that the table rows are loaded before you read the page:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)

BeautifulSoup will not find the table because, from its point of view, the table doesn't exist. Here, you tell Selenium's element matcher to wait if it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try and get the HTML table itself. As Selenium will use the implicitly_wait you specified earlier, it will wait until the element exists, and only then let the rest of the code run. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is asynchronously loaded from this endpoint (you can use the Chrome DevTools to find it yourself). You can pair this with the json module to turn it into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
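Once the payload is decoded, the remaining work is ordinary dictionary handling. The sketch below is purely illustrative: the structure of sample_text, its field names, and its values are invented, and the real endpoint's layout has to be checked in DevTools before adapting rates_by_currency (a hypothetical helper).

```python
import json

# Invented payload for illustration only; the real endpoint's field names
# and values will differ and must be inspected first.
sample_text = (
    '{"rates": ['
    '{"currency": "EUR", "buy": "1.00", "sell": "2.00"}, '
    '{"currency": "USD", "buy": "3.00", "sell": "4.00"}]}'
)


def rates_by_currency(payload_text, list_key="rates", code_key="currency"):
    """Index the rows of a decoded JSON payload by their currency code."""
    data = json.loads(payload_text)
    return {row[code_key]: row for row in data.get(list_key, [])}


table = rates_by_currency(sample_text)
print(table["EUR"]["buy"])
```

Keying the rows by currency code makes later lookups direct, with no table parsing involved.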

You can use this as the foundation for further work:-
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)

Not able to scrape details using bs4, python, selenium

I am using the below code to print the soup variable that is nothing but the source code of the page.
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
import re, time
yes_url = "https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card"
driver = webdriver.Chrome(executable_path="C:\\Users\\Hari\\Downloads\\chromedriver.exe")
driver.get(yes_url)
time.sleep(3)
# r = requests.get(yes_url)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
driver.close()
Link I am scraping the page source from is : https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card
After running the above code, it keeps running for hours and I never get any output.
Please help me in scraping the page source, so that I get some output after I run the code.

Issue: You are dealing with a modern website that checks whether the browser is being controlled by an automation tool.
How can that be done?
Simply open your browser console and type the following:
navigator.webdriver
If it's false, your browser isn't controlled by an automation program such as Selenium.
If it's true, it is controlled.
In your case, you have to disable it in order to trick the website's checking mechanism.
The code below achieves this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True
options.set_preference("dom.webdriver.enabled", False)
driver = webdriver.Firefox(options=options)
driver.get('https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card')
try:
    element = WebDriverWait(driver, 10).until(
        EC.title_contains('YES'))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.prettify())
finally:
    driver.quit()

Selenium and BeautifulSoup can't fetch all HTML content

I'm scraping the bottom table labeled "Capacity : Operationally Available - Evening" on https://lngconnection.cheniere.com/#/ccpl
I am able to get all the HTML and everything shows up when I prettify() print the HTML but the parsers can't find it when I give a command to find the specific information I need.
Here's my script:
cc_driver = webdriver.Chrome('/Users/.../Desktop/chromedriver')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
Now when I run
print(cc_table.prettify())
The output is everything except the actual table data. Is there some error in my code or in their HTML that is hiding the actual table values? I'm able to see it when I print everything Selenium captures on the page. The HTML also doesn't have specific ID tags for any of the cell values.

You are looking at HTML that is not yet complete: not all of the elements have been rendered by the JavaScript yet. You can fix this with a WebDriverWait.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
cc_driver = webdriver.Chrome(r"path for driver")
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
WebDriverWait(cc_driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,
    '#capacityGrid > table > tbody')))
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
This will wait for the element to be present.
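Presence of the tbody does not by itself guarantee that its rows have been filled in. WebDriverWait.until accepts any callable that takes the driver, so a stricter condition can be sketched as below. The class rows_loaded is a hypothetical helper, and the selector that extends the one above with "> tr" is an assumption about the grid's markup.

```python
class rows_loaded:
    """Wait condition: truthy once at least `minimum` elements match the selector."""

    def __init__(self, css_selector, minimum=1):
        self.css_selector = css_selector
        self.minimum = minimum

    def __call__(self, driver):
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR
        rows = driver.find_elements("css selector", self.css_selector)
        # Returning False tells WebDriverWait to keep polling
        return rows if len(rows) >= self.minimum else False
```

It would be used as WebDriverWait(cc_driver, 10).until(rows_loaded('#capacityGrid > table > tbody > tr')), which returns the matched rows once they exist.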

This should help you get the table HTML:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
cc_driver = webdriver.Chrome('../chromedriver_win32/chromedriver.exe')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = bs(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', attrs={'class':'k-selectable'})
#print(cc_content.prettify())
print(cc_table.prettify())

Trouble Parsing Text using BeautifulSoup and Python

I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.
I am using BeautifulSoup and Python and have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source)
print soup
commentHolder = soup.find("div", {"class":"GGAAYMKDDNE"})
print commentHolder
When I execute "print soup" I get an output (albeit a messy one), but when I execute "print commentHolder" I get "None" as the output. I am not quite sure why this is happening and would appreciate any help. Thank you.
Note: I used Selenium webdriver to try and get around the Javascript - is this a correct approach?

You need to let PhantomJS explicitly wait for the element to become present before reading the page_source. Worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDGNE")))
