beautifulsoup scrape realtime values - python

I am trying to scrape the currency rates for a personal project. I used a CSS selector to get the class where the values are. There is JavaScript providing those values on the website and, since I am not too conversant with the developer console, I checked it out but could not see anything running in real time in the Network section. This is the code I wrote so far; it prints a long list of dashes. Surprisingly, the dashes match the source code for those parts where the rates are supposed to show.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.ig.com/en/forex/markets-forex")
soup = BeautifulSoup(r.content, "html.parser")
results = soup.findAll("span",attrs={"data-field": "CPT"})
for span in results:
    print(span.text)

The span elements are filled via JS with dynamic values; on page load each span just contains '-'.
You need a JS-capable driver that waits for the elements to be filled, and then you can read the values from the spans.
With selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    print(elm, elm.text)
Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home
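Note: with recent Selenium releases (4.6+), Selenium Manager resolves a matching driver automatically, so (assuming an up-to-date install) the explicit chromedriver path can usually be dropped. A minimal sketch:
from selenium import webdriver
# Selenium 4.6+ downloads a matching chromedriver on its own
driver = webdriver.Chrome()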
Another option is dryscrape + bs4, but dryscrape seems outdated.
Modified:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
time.sleep(2)  # Adjust depending on how fast the page loads
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    if elm.text:
        print(elm, elm.text)
or
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
data = []
while not data:
    for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
        if elm.text and elm.text != '-':  # Maybe also check that the text contains a digit
            data.append(elm.text)
    time.sleep(1)
print(data)
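Instead of polling by hand, you could also lean on Selenium's explicit waits. A minimal sketch, assuming the same span[data-field=CPT] selector and that a loaded rate is anything other than '-':
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.ig.com/en/forex/markets-forex')
# Wait up to 30 s until the first rate span shows something other than '-'
WebDriverWait(driver, 30).until(
    lambda d: d.find_element(By.CSS_SELECTOR, "span[data-field=CPT]").text not in ('', '-')
)
rates = [elm.text for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]")]
print(rates)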

Related

KeyError 'href' - Python / Selenium / Beautiful Soup

I'm running into an issue when web-scraping a large web page: my scrape works fine for the first 30 href links, but it runs into a KeyError: 'href' about 25% of the way into the page contents.
The elements remain the same for the entire web page, i.e. there is no difference between the last scraped element and the next element that stops the script. Is this caused by the driver not loading the entire web page in time for the scrape to complete, or only partially loading the web page?
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
from random import randint
chromedriver_path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes aren't treated as escapes
service = Service(chromedriver_path)
options = Options()
# options.headless = True
options.add_argument("--incognito")
driver = webdriver.Chrome(service=service, options=options)
url = 'https://hackerone.com/bug-bounty-programs'
driver.get(url)
sleep(randint(15,20))
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source,'html.parser')
# driver.quit()
links = soup.find_all("a")
for link in links:
    print(link['href'])
There is no need for Selenium if you only want the bounty links, which seems more desirable than grabbing every link off the page; it also removes the duplicates you get when scraping all the links.
Simply use the query-string endpoint that returns the bounties as JSON, then update the urls to include the protocol and domain.
import requests
import pandas as pd
data = requests.get('https://hackerone.com/programs/search?query=bounties:yes&sort=name:ascending&limit=1000').json()
df = pd.DataFrame(data['results'])
df['url'] = 'https://hackerone.com' + df['url']
print(df.head())
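If you do stick with the Selenium approach, the KeyError itself just means that some <a> tags have no href attribute at all. A small sketch using .get(), which skips those instead of raising (same soup object as in the question):
# .get() returns None instead of raising KeyError when the attribute is missing
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)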

web scraping table with selenium gets only html elements but no content

I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without the text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help, what am I missing?
Thank you
The website takes time to load the data into the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
Or apply an explicit wait so that the rows are loaded in the table.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)
BeautifulSoup will not find the table because, from its reference point, it doesn't exist yet. Here, you tell Selenium's element matcher to pause if it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try and get the HTML table itself. Because Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code execution to proceed. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is asynchronously loaded from this endpoint (you can use the Chrome DevTools to find it yourself). You can pair this with the json module to turn it into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
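The exact shape of the JSON is up to the bank's endpoint, so inspect it before relying on any field names. A hedged sketch (requests' built-in .json() is equivalent to the get/loads pair above):
import requests

data = requests.get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()
# Inspect the structure before assuming any field names
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))
elif isinstance(data, list) and data:
    print(data[0])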
You can use this as the foundation for further work:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out, so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
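To get rows rather than loose <td> tags, you could group the cells by their parent <tr>. A minimal sketch, reusing the soup object and TDCLASS from the block above:
# Group the cells by row instead of printing loose <td> elements
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td', class_=TDCLASS)]
    if cells:
        print(cells)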

Web scraping not getting complete source code data via Selenium/BS4

How do I scrape the data in the input tags' value attributes from the source I inspect, as shown in the attached image?
I have tried using BeautifulSoup and Selenium, and neither of them works for me.
Partial code is below:
html=driver.page_source
output=driver.find_element_by_css_selector('#bookingForm > div:nth-child(1) > div.bookingType > div:nth-child(15) > div.col-md-9 > input').get_attribute("value")
print(output)
This raises a NoSuchElementException.
In fact, when I try to print(html), a lot of the source code appears to be missing. I suspect it could be a JS-related issue, but Selenium, which renders JS most of the time, is not working for me on this site. Any idea why?
I tried these as well:
html=driver.page_source
soup=bs4.BeautifulSoup(html,'lxml')
test = soup.find("input",{"class":"inputDisable"})
print(test)
print(soup)
print(test) returns None, and print(soup) returns the source with most input tags entirely missing.
Check whether this element is actually present on the site by inspecting the page.
If it's there, Selenium is often just too fast and the page sometimes doesn't manage to load completely; try Selenium's wait functions. Many times that's the case.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
Try to use find or find_all functions. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
from requests import get
from bs4 import BeautifulSoup

url = 'your url'
response = get(url)
bs = BeautifulSoup(response.text, "lxml")
test = bs.find("input", {"class": "inputDisable"})
print(test)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib.request
import time
from bs4 import BeautifulSoup
from datetime import date
URL="https://yourUrl.com"
# Chrome session
driver = webdriver.Chrome("PathOfTheBrowserDriver")
driver.get(URL)
driver.implicitly_wait(100)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
Try, BEFORE making the soup, to pause your code in order to give the page's own requests time to do their job (some late requests may contain what you're looking for).
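Rather than a fixed sleep, an explicit wait on the specific input would be more robust. A minimal sketch, assuming the input really does carry the inputDisable class mentioned in the question and reusing the driver from above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the input with class 'inputDisable' is present, then read its value
elem = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "input.inputDisable"))
)
print(elem.get_attribute("value"))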

Trying to Get Selenium to Download Data Based on JavaScript...I think

I am trying to download data from the following URL.
https://www.nissanusa.com/dealer-locator.html
I came up with this, but it doesn't actually grab any of the data.
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
I've done this a couple of times before, and it has always worked in the past. I'm guessing the data is dynamically generated by JavaScript, based on the filters that a user selects, but I don't know for sure. I've read that Selenium can be used to automate a web browser, but I have never used it, and I'm not really sure where to start. Ultimately, I am trying to get the data in the format shown in the attached image; either printed in the console window or downloaded to a CSV would be fine.
Finally, how the heck does the site get the data? Whether I enter New York City or San Francisco, the map and the data set changes relative to the filter that is applied, but the URL does not change at all. Thanks in advance.
Use selenium to open/navigate to the page, then pass the page source to BeautifulSoup.
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)

url = 'https://www.nissanusa.com/dealer-locator.html'
browser.get(url)
time.sleep(10)  # wait for the page to finish loading

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
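Since the code already builds a WebDriverWait, you could use it instead of the fixed sleep. A hedged sketch, assuming the dealer cards keep the dealer-info class:
try:
    # Block until at least one dealer card is present (up to the 10 s timeout)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "dealer-info")))
except TimeoutException:
    print("Dealer list did not load in time")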

Scrape html only after data loads with delay using Python Requests?

I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. It works well for normal HTML websites, but when I tried to get some data out of websites where the data loads after some delay, I found that I get an empty value. An example would be:
from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver
url = "https://www.example.com/;1"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('span', 'buy')
print(a)
I am trying to grab the value from the 'buy' span here: (value)
I have already referred to a similar topic and tried executing my code along the same lines as the solution provided there, but somehow it doesn't seem to work. I am a novice here, so I need help getting this to work.
How to scrape html table only after data loads using Python Requests?
The table (content) is probably generated by JavaScript and thus can't be "seen". I am using python3.6 / PhantomJS / Selenium as proposed by a lot of answers here.
You have to run a headless browser to scrape content that loads with a delay; please use Selenium.
Here is sample code; it uses the Chrome browser as the driver.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome("<chromedriver path here>")
browser.set_window_size(1120, 550)
browser.get(link)
element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "blabla"))
)
data = element.get_attribute('data-blabla')
print(data)
browser.quit()
You can access the desired values by requesting them directly from the API and analyzing the JSON response.
import requests
import json
res = requests.get('https://api.example.com/api/')
d = json.loads(res.text)
print(d['market'])
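Equivalently, requests can decode the JSON for you, which drops the need for the json module. A small sketch with the same hypothetical endpoint and 'market' key:
import requests

# .json() parses the response body directly
d = requests.get('https://api.example.com/api/').json()
print(d.get('market'))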
