I am using Selenium for both automation and scraping, but I found it's too slow on some sites. If I use BeautifulSoup I can scrape them faster, but then the automation can't be done.
Is there any way I can automate a website (button click events etc.) and also scrape it with BeautifulSoup?
Can you give me an example of button/search automation with bs4 + selenium?
Any help would be appreciated...
Example
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com/questions/tagged/beautifulsoup+selenium")
page = Soup(driver.page_source, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
    print(question.text.strip())
Or just, with requests alone:
import requests
from bs4 import BeautifulSoup as Soup
url = 'https://stackoverflow.com/questions/tagged/beautifulsoup+selenium'
response = requests.get(url=url)
page = Soup(response.text, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
    print(question.text.strip())
Remember to read https://stackoverflow.com/robots.txt
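For the button/search part of the question, here is a minimal sketch that drives Stack Overflow's search box with Selenium and then parses the rendered results with BeautifulSoup. The name="q" input and the .s-post-summary result markup are assumptions about the current page and may need adjusting:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com/")

# Type a query into the search box and submit it.
search_box = driver.find_element(By.NAME, "q")  # the name="q" input is an assumption
search_box.send_keys("beautifulsoup selenium")
search_box.send_keys(Keys.RETURN)
time.sleep(2)  # crude wait for the results page to render

# Hand the rendered page to BeautifulSoup for the scraping part.
page = Soup(driver.page_source, features='html.parser')
for title in page.select(".s-post-summary h3 a"):  # result markup is an assumption
    print(title.text.strip())

driver.quit()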
Absolutely, you can do all the rendering with Selenium and pass the page source on to BeautifulSoup as follows:
from bs4 import BeautifulSoup as bs
soup = bs(driver.page_source,'html.parser')
This gives you the live DOM with the JavaScript already executed, exactly what Selenium sees. The idea is to grab the whole body; if you also want the head, replace the body tag accordingly.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')  # first navigate to your target page
html = driver.find_element(By.TAG_NAME, 'body').get_attribute('innerHTML')
soup = BeautifulSoup(html, features='lxml')
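Once you have the soup you can query it as usual; continuing from the snippet above (the selector is only an example):
# Continue from the soup built above; 'a[href]' is only an example selector.
for link in soup.select('a[href]'):
    print(link['href'])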
Related
I have to extract the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I inspect the page in the browser I find the publication date immediately, but in the HTML I get with Python I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], even though the tag is there when I inspect the live page.
Why is the HTML I see in 'inspect' different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page load. requests will only show what the initial page content looked like before the DOM was modified by JavaScript.
For this you might install Selenium and then download a Selenium web driver for your specific browser. Install the driver in some directory that is on your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
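As a side note, if the implicit wait proves flaky, an explicit wait on that element is a common alternative; a sketch using the same driver and locator as above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until the span is present in the DOM, or raise after 10 seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'biblio-publication-number-content'))
)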
Umberto, if you are looking for an HTML span element, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
If you are looking for an element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case you are fetching all span elements; in the second case you are fetching all elements whose id is 'biblio-publication-number-content'.
I suggest you look into HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
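To see the difference concretely, here is a self-contained sketch on a toy document (the markup is made up for illustration):
from bs4 import BeautifulSoup

html = '<div><span id="price">42</span><span>other</span></div>'  # made-up markup
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('span'))      # every <span> element
print(soup.find_all(id='price'))  # only elements whose id attribute is 'price'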
Python BeautifulSoup requests
import requests
import re
import os
import csv
from bs4 import BeautifulSoup

# 'searche' and 'headers' are assumed to be defined earlier in the asker's script
for d in searche:
    truelink = d.replace(" ", "-")
    truelinkk = 'https://www.fb.com'  # URL truncated in the original post
    r = requests.get(truelinkk, headers=headers).text
    soup = BeautifulSoup(r, 'lxml')
    mobile = soup.find_all('li', class_='EIR5N')
I am a beginner in Python. I can't scrape a website whose URL doesn't change when more results are loaded via "load more", using requests and BeautifulSoup. Could someone visit the site and let me know the procedure for scraping such websites with BeautifulSoup and requests? Any answer would be appreciated. Thank you.
Please look at this link:
https://www.olx.in/hyderabad_g4058526/q-Note-9-max-pro?isSearchCall=true
You can use Selenium in headless mode instead of requests. Even though Selenium is meant for web automation, it can help you in this case.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.headless = True
options.add_argument('--log-level=3')
driver = webdriver.Chrome(options=options)
Since the URL doesn't change, you have to click the button you want by getting its XPath:
driver.find_element(By.XPATH, 'xpath code').click()
You can avoid requests entirely and get the source code of the page with:
from bs4 import BeautifulSoup

html_text = driver.page_source
soup = BeautifulSoup(html_text, 'lxml')
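Putting the pieces together, here is a minimal sketch that clicks a "load more" button a few times and then parses the rendered page; the XPath and the EIR5N class are assumptions about the OLX markup and may have changed:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.olx.in/hyderabad_g4058526/q-Note-9-max-pro?isSearchCall=true')

for _ in range(3):  # click "load more" a few times; adjust as needed
    try:
        driver.find_element(By.XPATH, '//button[@data-aut-id="btnLoadMore"]').click()  # XPath is an assumption
        time.sleep(2)  # crude wait for the new items to render
    except Exception:
        break  # button not found: nothing more to load

soup = BeautifulSoup(driver.page_source, 'lxml')
items = soup.find_all('li', class_='EIR5N')  # class name taken from the question
print(len(items))
driver.quit()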
I'm having a hard time scraping this web page: top-programming-guru.
I want to retrieve a list of all the YouTube channels listed on the page.
I'm using BeautifulSoup; I took a look at the source code of the page, then tried this code:
import requests
from bs4 import BeautifulSoup
URL = 'https://noonies.tech/award/top-programming-guru'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'lxml')
results = soup.find_all('div', class_='sc-jhAzac dldLgq')
results
But I always get an empty list.
Any ideas how to do this correctly?
This is the tag I was looking for:
<div class="sc-jhAzac dldLgq">
  <p>
    🥁
    <em>And the winner is...</em>
  </p>
  <div class="sc-gZMcBi kTYIfA">
    <div class="nomination-info">
      <h3><i class="fad fa-trophy"></i>Programming with Mosh</h3>
The data is dynamically loaded. Use Selenium or similar to allow the JavaScript to load, then scrape.
from bs4 import BeautifulSoup
import re
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://noonies.tech/award/top-programming-guru'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(href=re.compile('youtube.com'))
Outputs a list of hrefs with youtube.com in them. You may have to clean up the list if it captures youtube.com links you don't want, or go back to your class search.
[Programming with Mosh,
Traversy Media,
Corey Schafer,
Tech With Tim,
Krish Naik,
freeCodeCamp.org,
Hitesh Choudhary,
Clever Programmer,
Caleb Curry,....
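For the clean-up step mentioned above, a small sketch that continues from the soup built earlier and keeps only unique YouTube hrefs (the filtering rule is just one option):
# Deduplicate and sort the captured hrefs (continues from the soup above).
links = soup.find_all(href=re.compile('youtube.com'))
channels = sorted({a['href'] for a in links})
for href in channels:
    print(href)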
I am having a problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Maybe another framework or configuration is needed to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")

soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it's just something that downloads and parses HTML, so it can't see anything that's populated by JavaScript. You'd need a headless browser like Selenium to scrape something like that.
You need to use Selenium instead of requests, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()    # optional, if you want to maximize the browser
driver.implicitly_wait(60)  # optional, wait for the page to load
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup
I've been experimenting with BeautifulSoup4 lately and have had pretty good success across a range of sites, but I've stumbled into an issue when trying to scrape amazon.com.
Using the code below, when printing 'soup' I can see the div, but when I search for the div itself, BS4 brings back nothing. I think the issue is in how the HTML is being processed. I've tried LXML and html5lib. Any ideas?
import bs4 as bs
import urllib3
urllib3.disable_warnings()
http = urllib3.PoolManager()
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
original = http.request('GET', url)
soup = bs.BeautifulSoup(original.data, 'lxml')
div = soup.find_all('div', {'class':'a-row padCenterContainer'})
You could use Selenium to allow the JavaScript to load before grabbing the HTML.
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'class': 'a-row padCenterContainer'})
print(div.prettify())
The output of this script was too long to include in this answer, but here is a link to it