Having trouble with web scraping using Beautiful Soup - python

I'm having hard time web scraping this web page top-programming-guru.
I was looking to have retrieve a list of all the youtube channels listed in the page.
I'm using BeautifulSoup, I took a look at source code of page, Then and try this code:
import requests
from bs4 import BeautifulSoup
URL = 'https://noonies.tech/award/top-programming-guru'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'lxml')
resluts = soup.find_all('div', class_='sc-jhAzac dldLgq')
resluts
But I always get an empty list.
Any ideas how to do this correctly?
this the tag I was looking for
<div class="sc-jhAzac dldLgq">
<p>
šŸ„
<em>And the winner is...</em>
</p>
<div class="sc-gZMcBi kTYIfA">
<div class="nomination-info">
<h3><i class="fad fa-trophy"></i>Programming with Mosh</h3>

The data is dynamically loaded. Use selenium or similar to allow javascript to load then scrape.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://noonies.tech/award/top-programming-guru'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(href=re.compile('youtube.com'))
Outputs a list of href with youtube.com in it. You may have to clean up the list if it captures youtube.com links you don't want or go back to your class search.
[Programming with Mosh,
Traversy Media,
Corey Schafer,
Tech With Tim,
Krish Naik,
freeCodeCamp.org,
Hitesh Choudhary,
Clever Programmer,
Caleb Curry,....

Related

Beautiful Soup Link Scraping [duplicate]

I am testing using the requests module to get the content of a webpage. But when I look at the content I see that it does not get the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also on the chrome web-browser if I look at the page source I do not see the full content.
Is there a way to get the full content of the example page that I have provided?
The page is rendered with JavaScript making more requests to fetch additional data. You can fetch the complete page with selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
For other solutions see my answer to Scraping Google Finance (BeautifulSoup)
Request is different from getting page source or visual elements of the web page, also viewing source from web page doesn't give you full access to everything that is on the web page including database requests and other back-end stuff. Either your question is not clear enough or you've misinterpreted how web browsing works.

How can I get information from a web site using BeautifulSoup in python?

I have to take the publication date displayed in the following web page with BeautifulSoup in python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search in the html code from 'inspect' the web page, I find the publication date fast, but when I search in the html code got with python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], while in the 'inspect' code of the online page, there is this tag.
What am I doing wrong to have the 'inspect' code that is different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem I believe is due to the content you are looking for being loaded by JavaScript after the initial page is loaded. requests will only show what the initial page content looked like before the DOM was modified by JavaScript.
For this you might try to install selenium and to then download a Selenium web driver for your specific browser. Install the driver in some directory that is in your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
# Wait (for up to 10 seconds) for the element we want to appear:
driver.implicitly_wait(10)
elem = driver.find_element(By.ID, 'biblio-publication-number-content')
# Now we can use soup:
soup = bs(driver.page_source, "html.parser")
print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>AĀ·2015-11-11</span>
Umberto if you are looking for an html element span use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
if you are looking for an html with the id 'biblio-publication-number-content' use the following code
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
in first case you are fetching all span html elements
in second case you are fetching all elements with an id 'biblio-publication-number-content'
I suggest you look into html tags and elements for deeper understanding on how they work and what are the semantics behind them.

Can selenium automation be used with BS4?

I am using selenium for both automation and scraping. Now I found that it's too slow on some of the sites. If i use beautifulSoup then I can scrape them faster but the automation can't be done.
Is there anyway where I can automate the website (button click events etc.) and can also scrape websites with it on beautifulSoup?
Can you give me an example of button/search automation with bs4 + selenium?
Any help would be appreciated...
Example
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com/questions/tagged/beautifulsoup+selenium")
page = Soup(driver.page_source, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
print(question.text.strip())
Or Just
import requests
from bs4 import BeautifulSoup as Soup
url = 'https://stackoverflow.com/questions/tagged/beautifulsoup+selenium'
response = requests.get(url=url)
page = Soup(response.text, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
print(question.text.strip())
Remember to read https://stackoverflow.com/robots.txt
Absolutely . you can do all the rendering using selenium and pass on the page source to beautifulsoup as follows :
from bs4 import BeautifulSoup as bs
soup = bs(driver.page_source,'html.parser')
This how-to make it live DOM and loaded js so, Enjoy and Save your time searching, the idea is to get the whole body, if you want also head do replace the body, It will be exactly as selenium get, I hope you like it all.
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
dri = webdriver.Chrome(options=options)
html = dri.find_element_by_tag_name("body").get_attribute('innerHTML')
soup = BeautifulSoup(html, features="lxml")

BeautifulSoup4 not finding div

I've been experimenting with BeautifulSoup4 lately and have found pretty good success across a range of different sites. Though I have stumbled into an issue when trying to scrape amazon.com.
Using the code below, when printing 'soup' I can see the div, but when I search for the div itself, BS4 brings back null. I think the issue is in how the html is being processed. I've tried with LXML and html5lib. Any ideas?
import bs4 as bs
import urllib3
urllib3.disable_warnings()
http = urllib3.PoolManager()
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
original = http.request('Get',url)
soup = bs.BeautifulSoup(original.data, 'lxml')
div = soup.find_all('div', {'class':'a-row padCenterContainer'})
You could use selenium to allow the javascript to load before grabbing the html.
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'class': 'a-row padCenterContainer'})
print(div.prettify())
The output of this script was too long to put in this question but here is a link to it

Trouble Scraping site with BS4

usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table enlisted for this research project I'm working on. I'm planning to verify the script working on one State before entering the URL of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
#I'm printing "Table" just to ensure that the table information I'm looking for is within this sections
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is within "&quot"if you look what Table outputs.
The text is rendered with JavaScript.
First render the page with dryscrape
(If you don't want to use dryscrape see Web-scraping JavaScript page with Python )
Then the text can be extracted, after it has been rendered, from a different position on the page i.e the place it has been rendered to.
As an example this code will extract HTML from the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federalĀ <i>Public Utility Regulatory Policies Act of 1978</i>Ā (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
So I finally managed to solve the issue, and successfuly grab the data from the Javascript page the code as follows worked for me if anyone encounters a same issue when trying to use python to scrape a javascript webpage using windows (dryscrape incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
trip = str(n.text)
data.append(trip)

Categories

Resources