web scraping table with selenium gets only html elements but no content - python

I am trying to scrape tables using Selenium and BeautifulSoup from these three websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all three websites, the result is the HTML for the table but without any text content.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
What am I missing?
Thank you

The website takes time to load the data into the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
Or apply an explicit wait so that the rows are loaded in the table:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)

BeautifulSoup will not find the table because it doesn't exist yet from its reference point. Here, you tell Selenium to pause its element matcher whenever it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try to get the HTML table itself. As Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code to proceed. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is asynchronously loaded from this endpoint (you can use the Chrome DevTools to find it yourself). You can pair this with the json module to turn it into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
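The exact shape of that endpoint's JSON isn't reproduced here, so the field names below are assumptions for illustration only; parsing works the same way on any JSON text:

```python
from json import loads

# Hypothetical sample mimicking a currency-rate payload; the real
# endpoint's field names may differ.
sample = '[{"Currency": "EUR", "BuyRate": 7.51, "SellRate": 7.57}]'

rates = loads(sample)  # a list of dicts
for row in rates:
    print(row["Currency"], row["BuyRate"], row["SellRate"])
```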

You can use this as the foundation for further work:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

TDCLASS = 'ng-binding'

options = webdriver.ChromeOptions()
options.add_argument('--headless')

with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out, so assume the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)

Related

Webdriver not returning some data

I am trying to get some information from a website. The Web Inspector shows the html source, with what JavaScript rendered into it. So I wanted to use chromedriver to render it for the purpose of extracting certain information, which cannot be accessed by simply requesting the website.
What seems confusing is that even the driver is not returning anything.
My code looks like this:
driver = webdriver.Chrome('path/Chromedriver')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
And the website is:
https://www.amundietf.co.uk/professional/product/view/LU1681038243
Is there anything else that gets rendered into the html, when the Web Inspector is opened, which Chromedriver is not able to handle?
Thanks for your answers in advance!
First of all, you need to accept the privacy settings, then click validateDisclaimer to enter the site:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "https://www.amundietf.co.uk/professional/product/view/LU1681038243"
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_id("footer_tc_privacy_button_3").click()
driver.find_element_by_id("validateDisclaimer").click()
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fpFrame.fpBannerMore #blockleft>#part_principale_1")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
print(results)
After that you need to wait for the page to load, and to define the elements you are looking for correctly.
Your question really contains several questions that should be solved one by one; I just pointed out the first of the problems.
Update
I solved the issue. You will need to parse the result by yourself.
So, you had these problems:
1. You did not click the two buttons.
2. You did not wait for the table you need to load.
3. You did not use any waits. In Selenium you must use them.

beautifulsoup scrape realtime values

I am trying to scrape currency rates for a personal project. I used a CSS selector to get the class where the values are. The values are provided by JavaScript on the website, and I am not too conversant with the developer console; I checked it out and could not see anything running in real time in the Network section. This is the code I wrote; so far, it prints a long list of dashes. Surprisingly, the dashes match the source code for the parts where the rates are supposed to show.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.ig.com/en/forex/markets-forex")
soup = BeautifulSoup(r.content, "html.parser")
results = soup.findAll("span", attrs={"data-field": "CPT"})
for span in results:
    print(span.text)
The span elements are filled via JS with dynamic values; on initial load, each span element contains only '-'.
You need a JS-capable driver to wait for the elements to be filled, and then get the values from the spans.
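To see why requests alone prints dashes, here is a minimal offline sketch; the HTML snippet is a hypothetical reduction of what requests receives before any JS runs:

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the page's initial HTML, before JS fills the values
html = '''
<table>
  <tr><td><span data-field="CPT">-</span></td></tr>
  <tr><td><span data-field="CPT">-</span></td></tr>
</table>
'''
soup = BeautifulSoup(html, "html.parser")
values = [span.text for span in soup.find_all("span", attrs={"data-field": "CPT"})]
print(values)  # every value is still '-'
```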
With selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    print(elm, elm.text)
Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home
You could also use dryscrape + bs4, but dryscrape seems outdated.
Modified:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
time.sleep(2)  # Maybe more or less, depending on how fast the page loads
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    if elm.text:
        print(elm, elm.text)
or
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
data = []
while not data:
    for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
        if elm.text and elm.text != '-':  # Maybe also check that the text contains a digit
            data.append(elm.text)
    time.sleep(1)
print(data)

Selenium and BeautifulSoup can't fetch all HTML content

I'm scraping the bottom table labeled "Capacity : Operationally Available - Evening" on https://lngconnection.cheniere.com/#/ccpl
I am able to get all the HTML, and everything shows up when I prettify() print it, but the parser can't find the specific information when I ask for it.
Here's my script:
cc_driver = webdriver.Chrome('/Users/.../Desktop/chromedriver')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
Now, when I do
print(cc_table.prettify())
The output is everything except the actual table data. Is there some error in my code or in their HTML that is hiding the actual table values? I'm able to see it when I print everything Selenium captures on the page. The HTML also doesn't have specific ID tags for any of the cell values.
You are looking at HTML that is not yet complete: the elements have not yet been rendered by the JavaScript. So you can use a WebDriver wait.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
cc_driver = webdriver.Chrome(r"path for driver")
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
WebDriverWait(cc_driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,
'#capacityGrid > table > tbody')))
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
This will wait for the element to be present.
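Once the wait has succeeded, pulling the row data out of the soup is straightforward. A minimal sketch, using a hypothetical cut-down version of the grid's HTML (the real table has more columns):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the grid HTML captured from page_source
html = '''
<table class="k-selectable">
  <tbody>
    <tr><td>Location A</td><td>1200</td></tr>
    <tr><td>Location B</td><td>850</td></tr>
  </tbody>
</table>
'''
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="k-selectable")
# One list per <tr>, one string per <td>
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)
```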
This should help you get the table HTML:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
cc_driver = webdriver.Chrome('../chromedriver_win32/chromedriver.exe')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = bs(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', attrs={'class':'k-selectable'})
#print(cc_content.prettify())
print(cc_table.prettify())

Web scraping not getting complete source code data via Selenium/BS4

How do I scrape the data in the input tags' value attributes from the source I inspect, as shown in the image?
I have tried using BeautifulSoup and Selenium, and neither of them works for me.
Partial code is below:
html=driver.page_source
output=driver.find_element_by_css_selector('#bookingForm > div:nth-child(1) > div.bookingType > div:nth-child(15) > div.col-md-9 > input').get_attribute("value")
print(output)
This returns a NoSuchElementException error.
In fact, when I try to print(html), a lot of the source code appears to be missing. I suspect it could be a JS-related issue, but Selenium, which usually handles JS rendering, is not working for me on this site. Any idea why?
I tried these as well:
html=driver.page_source
soup=bs4.BeautifulSoup(html,'lxml')
test = soup.find("input",{"class":"inputDisable"})
print(test)
print(soup)
print(test) returns None, and print(soup) returns the source with most input tags entirely missing.
Check whether this element is present on the site by inspecting the page.
If it's there, Selenium is often simply too fast: the page sometimes doesn't manage to load completely. Try Selenium's wait functionality; many times that's the case.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
Try using the find or find_all functions (https://www.crummy.com/software/BeautifulSoup/bs4/doc/):
from requests import get
from bs4 import BeautifulSoup
url = 'your url'
response = get(url)
bs = BeautifulSoup(response.text, "lxml")
test = bs.find("input",{"class":"inputDisable"})
print(test)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib.request
import time
from bs4 import BeautifulSoup
from datetime import date
URL="https://yourUrl.com"
# Chrome session
driver = webdriver.Chrome("PathOfTheBrowserDriver")
driver.get(URL)
driver.implicitly_wait(100)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
Try, before making the soup, to pause your code in order to give the late requests time to do their job (some of them may contain what you're looking for).

Trying to Get Selenium to Download Data Based on JavaScript...I think

I am trying to download data from the following URL.
https://www.nissanusa.com/dealer-locator.html
I came up with this, but it doesn't actually grab any of the data.
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
I've done this a couple times before, and it has always worked in the past. I'm guessing the data is dynamically generated by JavaScript, based on the filters that a user selects, but I don't know for sure. I've read that Selenium can be used to automate a web browser, but I have never used it, and I'm not really sure where to start. Ultimately, I am trying to get the data in this format, in the image below. Either printed in the Console Window, or downloaded to a CSV, would be fine.
Finally, how the heck does the site get the data? Whether I enter New York City or San Francisco, the map and the data set changes relative to the filter that is applied, but the URL does not change at all. Thanks in advance.
Use selenium to open/navigate to the page, then pass the page source to BeautifulSoup.
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
url = 'https://www.nissanusa.com/dealer-locator.html'
browser.get(url)
time.sleep(10)  # wait for the page to finish loading
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
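Since CSV output would be fine for you, here is a sketch of writing the scraped links out with the stdlib csv module; the dealers list below is a placeholder for whatever you collect from the 'dealer-info' divs:

```python
import csv

# Placeholder rows; in practice, build these from the scraped 'dealer-info' divs
dealers = [
    ("Example Nissan", "https://example.com/dealer-1"),
    ("Sample Motors", "https://example.com/dealer-2"),
]

with open("dealers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "link"])   # header row
    writer.writerows(dealers)           # one row per dealer
```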
