Find href using Beautiful Soup - Python

I'm trying to extract the first link from a search results page using Beautiful Soup, but it can't find the link for some reason.
from requests import get
from bs4 import BeautifulSoup
import requests
band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
linkurl = soup.find_all("div", class_="wrapper")
for urls in linkurl:
    print(urls.get('href'))
# print(soup.a['href']) -- returns /
# print(soup.a['data-ctorig']) -- returns nothing
I would like to get the link from the data-ctorig or the href. Does this page have a script that is preventing me from reaching this information, or is it a problem with my code?

The website uses Google Programmable Search Engine (CSE) to return cached results. That requires JavaScript to run in a browser, which doesn't happen with requests.
It is far easier to use selenium and a more targeted CSS selector list to retrieve the results.
While the wait doesn't seem to be needed in this case, I have added it for good measure.
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'
d = webdriver.Chrome()
d.get(url)
links = WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".gsc-thumbnail-inside .gs-title[target]")))
links = [link.get_attribute('href') for link in links]
print(links[0])
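As an aside, the query is inserted into the URL unencoded here; browsers normally URL-encode it. A minimal sketch of doing the same with urllib.parse.quote_plus (not required for the answer above, just a precaution):
from urllib.parse import quote_plus

band = "it's my life bon jovi"
# Spaces become '+' and the apostrophe is percent-encoded
url = f'https://www.letras.mus.br/?q={quote_plus(band)}'
print(url)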

Related

web scraping table with selenium gets only html elements but no content

I am trying to scrape tables using selenium and beautifulsoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without the text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help, what am I missing?
Thank you.
The website is taking time to load the data in the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
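A complete version of that sleep-based approach might look roughly like this (the fixed 10-second pause is just a guess at how long the table takes to render):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)  # crude fixed pause while the table fills in
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find_all('table'))
driver.quit()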
Or apply an explicit wait such that the rows are loaded in the table:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)
BeautifulSoup will not find the table because it doesn't exist from its reference point. Here, you tell Selenium to pause the Selenium driver matcher if it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to try to get the HTML table itself. As Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code execution to proceed. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is asynchronously loaded from this endpoint (you can use the Chrome DevTools to find it yourself). You can pair this with the json module to turn it into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
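The exact shape of that JSON isn't shown here, so a cautious next step is to inspect what came back before relying on specific keys; for example:
# Inspect the top-level structure of the response before picking out fields
print(type(data_dictionary))
if isinstance(data_dictionary, dict):
    print(list(data_dictionary.keys()))
elif isinstance(data_dictionary, list) and data_dictionary:
    print(data_dictionary[0])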
You can use this as the foundation for further work:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out, so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
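If you then want the cells grouped by row rather than printed one by one, a small follow-up sketch (assuming the td elements sit inside ordinary tr rows) would be:
# Group the 'ng-binding' cells by their parent rows
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td', class_=TDCLASS)]
    if cells:
        print(cells)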

Scraping: cannot extract content from webpage

I am trying to scrape the news content from the following page, but with no success.
https://www.business-humanrights.org/en/latest-news/?&search=nike
I have tried with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup
but the content that I am looking for (the bits of news tagged as div class='card__content') does not appear in the soup output.
I also checked, but I could not find any frames to switch to.
Finally, I tried with PhantomJS and the following code, but with no success:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"
driver = webdriver.PhantomJS(executable_path=r'~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'card__content'})
print(container)
I am running out of options; can anyone help?
Use the API:
import requests
r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())
I'm not sure why you're facing this. I tried the same page, but not with requests and bs4; I used requests_html. XPath expressions can be used directly in this library, without any other libraries.
import requests_html
session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)
divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')
for item in divs_with_required_class:
    print(f'Div {divs_with_required_class.index(item) + 1}:\n', item.text, end='\n\n')
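If the page had turned out to need JavaScript after all, requests_html can also render it (the first call downloads a Chromium build via pyppeteer); a sketch of that variant, assuming the same selector:
res = session.get(URL)
res.html.render()  # executes the page's JavaScript; the first run downloads Chromium
divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')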
driver.page_source returns the initial HTML document content here no matter how long you wait (the time.sleep(7) has no effect).
Try the below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get(url)
cards = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='card__content' and normalize-space(.)]")))
texts = [card.text for card in cards]
print(texts)
driver.quit()

beautifulsoup scrape realtime values

I am trying to scrape the currency rates for a personal project. I used a CSS selector to get the class where the values are. There's a JavaScript providing those values on the website, and since I am not too conversant with the developer console, I checked it out and could not see anything running in real time in the Network section. This is the code I wrote; so far, it brings out a long list of dashes. Surprisingly, the dashes match the source code for those parts where the rates are supposed to show.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.ig.com/en/forex/markets-forex")
soup = BeautifulSoup(r.content, "html.parser")
results = soup.findAll("span",attrs={"data-field": "CPT"})
for span in results:
    print(span.text)
The span elements are filled via JS with dynamic values. On load, each span element contains '-'.
You need a JS-capable driver to wait for the elements to be filled and then get the values from the spans.
With selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    print(elm, elm.text)
Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home.
Alternatively, dryscrape + bs4 could also work, but dryscrape seems outdated.
Modified:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
time.sleep(2)  # Maybe more or less, depending on how fast the page loads
for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
    if elm.text:
        print(elm, elm.text)
or
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')
data = []
while not data:
    for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]"):
        if elm.text and elm.text != '-':  # Could also check that the text contains digits
            data.append(elm.text)
    time.sleep(1)
print(data)
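An alternative to the manual polling loop is an explicit wait with a custom condition; a sketch assuming the same span[data-field=CPT] selector:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.ig.com/en/forex/markets-forex')

# Wait until at least one span holds something other than the '-' placeholder
WebDriverWait(driver, 15).until(
    lambda d: any(elm.text and elm.text != '-'
                  for elm in d.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]")))

data = [elm.text for elm in driver.find_elements(By.CSS_SELECTOR, "span[data-field=CPT]")
        if elm.text and elm.text != '-']
print(data)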

Selenium and BeautifulSoup can't fetch all HTML content

I'm scraping the bottom table labeled "Capacity : Operationally Available - Evening" on https://lngconnection.cheniere.com/#/ccpl
I am able to get all the HTML, and everything shows up when I prettify() print it, but the parsers can't find it when I give a command to find the specific information I need.
Here's my script:
from selenium import webdriver
from bs4 import BeautifulSoup as soup

cc_driver = webdriver.Chrome('/Users/.../Desktop/chromedriver')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
Now, when I do
print(cc_table.prettify())
the output is everything except the actual table data. Is there some error in my code or in their HTML that is hiding the actual table values? I am able to see the data when I print everything Selenium captures on the page. The HTML also doesn't have specific ID tags for any of the cell values.
You are looking at HTML that is not yet complete; the elements have not yet been returned by the JavaScript. So you can use a WebDriverWait.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
cc_driver = webdriver.Chrome(r"path for driver")
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
WebDriverWait(cc_driver, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '#capacityGrid > table > tbody')))
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', class_='k-selectable')
#print(cc_content.prettify())
print(cc_table.prettify())
This will wait for the element to be present.
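As a follow-up, once the rows are present you could also hand the captured HTML to pandas instead of walking the soup yourself; a sketch assuming pandas (with a parser backend such as lxml) is installed:
import pandas as pd

# read_html parses every <table> in the HTML string into a DataFrame
tables = pd.read_html(cc_html)
print(tables[0].head())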
This should help you get the table HTML:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
cc_driver = webdriver.Chrome('../chromedriver_win32/chromedriver.exe')
cc_driver.get('https://lngconnection.cheniere.com/#/ccpl')
cc_html = cc_driver.page_source
cc_content = bs(cc_html, 'html.parser')
cc_driver.close()
cc_table = cc_content.find('table', attrs={'class':'k-selectable'})
#print(cc_content.prettify())
print(cc_table.prettify())

Trying to Get Selenium to Download Data Based on JavaScript...I think

I am trying to download data from the following URL.
https://www.nissanusa.com/dealer-locator.html
I came up with this, but it doesn't actually grab any of the data.
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
I've done this a couple of times before, and it has always worked in the past. I'm guessing the data is dynamically generated by JavaScript, based on the filters that a user selects, but I don't know for sure. I've read that Selenium can be used to automate a web browser, but I have never used it, and I'm not really sure where to start. Ultimately, I am trying to get the data in the format shown in the image below. Either printed in the console window or downloaded to a CSV would be fine.
Finally, how the heck does the site get the data? Whether I enter New York City or San Francisco, the map and the data set change relative to the filter that is applied, but the URL does not change at all. Thanks in advance.
Use selenium to open/navigate to the page, then pass the page source to BeautifulSoup.
import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
url = 'https://www.nissanusa.com/dealer-locator.html'
browser.get(url)
time.sleep(10)  # wait for the page to finish loading
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
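Since wait is already set up above but never used, one refinement would be to replace the fixed sleep with an explicit wait on the dealer blocks; a sketch assuming the div.dealer-info class from the question:
try:
    # Wait until at least one dealer block has rendered instead of sleeping blindly
    wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.dealer-info')))
except TimeoutException:
    pass  # fall back to whatever has loaded after 10 seconds
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")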
