I am having a problem scraping https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, the players' names don't appear. Is there another framework or configuration I could use to extract the information?
My code is:
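from bs4 import BeautifulSoup
import urllib.error
import urllib.request

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError as exc:
    # Stop here: without a response there is nothing to parse
    raise SystemExit(f"An error occurred: {exc}")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)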
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it only parses the HTML it is given, so it can't see anything that JavaScript populates after the page loads. You'd need a real browser, driven by a tool like Selenium (optionally in headless mode), to scrape something like that.
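To see this limitation in isolation, here is a minimal sketch. The HTML string is made up for illustration (it is not bet365's markup): it contains an empty div and a script that a browser would run to fill it. BeautifulSoup parses the markup but never runs the script, so the div stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a browser would run the script and fill in #players.
html = """
<html><body>
  <div id="players"></div>
  <script>
    document.getElementById("players").textContent = "Player A, Player B";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
players_div = soup.find("div", id="players")

# The parser sees the element, but nothing has populated it.
rendered_text = players_div.get_text(strip=True)
print(repr(rendered_text))

# The names exist only as JavaScript source text inside the <script> tag.
script_source = soup.find("script").string
print("Player A" in script_source)
```

If the data you want shows up only inside script text like this, or not at all, the page needs a JavaScript-capable client.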
You need to use Selenium instead of urllib.request to load the page, and keep BeautifulSoup for parsing.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.bet365.com"
# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(r"the_path_of_driver"))
driver.get(url)
driver.maximize_window()  # optional: maximize the browser window
driver.implicitly_wait(60)  # optional: element lookups retry for up to 60 seconds
soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered HTML
driver.quit()
I am testing the requests module to get the content of a web page, but when I look at the result, it does not contain the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also, if I look at the page source in the Chrome browser, I do not see the full content there either.
Is there a way to get the full content of the example page that I have provided?
The page is rendered by JavaScript, which makes further requests to fetch additional data. You can fetch the fully rendered page with Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
For other solutions see my answer to Scraping Google Finance (BeautifulSoup)
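One such alternative, sketched here under assumptions: when the browser's network tab shows the page's JavaScript fetching a JSON document, that payload can be parsed directly without a browser. The payload below is invented sample data, and its field names (`products`, `name`, `price`) are hypothetical; a real site's response will be shaped differently.

```python
import json

# Invented sample of what a page's background request might return.
sample_payload = """
{"products": [
  {"name": "Dress A", "price": 59.0},
  {"name": "Dress B", "price": 120.5}
]}
"""

def product_names(payload: str) -> list[str]:
    """Return the product names from a JSON payload shaped like the sample."""
    data = json.loads(payload)
    return [item["name"] for item in data.get("products", [])]

names = product_names(sample_payload)
print(names)
```

Whether such a request exists, and whether it can be called outside the browser, depends entirely on the site.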
Making a request is different from getting the page source or the visual elements of a web page: the request returns only what the server initially sends. Viewing the source in a browser also doesn't give you access to everything behind the page, such as database requests and other back-end work. Either your question is not clear enough or you've misunderstood how web browsing works.
I need to extract the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML I get with Python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], even though this tag is present in the 'inspect' view of the online page.
Why is the HTML shown by 'inspect' different from the HTML I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem I believe is due to the content you are looking for being loaded by JavaScript after the initial page is loaded. requests will only show what the initial page content looked like before the DOM was modified by JavaScript.
For this you might install selenium and then download the Selenium web driver for your specific browser. Put the driver in a directory that is on your PATH and then run the following (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
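Since the question asks for the publication date itself, here is a minimal sketch of pulling it out of that element. It parses the printed span as a static string, so it runs without a browser, and it assumes the date always appears in `YYYY-MM-DD` form, which is only what this one example shows.

```python
import re
from bs4 import BeautifulSoup

# The element exactly as printed above.
html = ('<span id="biblio-publication-number-content">'
        '<span class="search">CN105030410</span>A·2015-11-11</span>')

soup = BeautifulSoup(html, "html.parser")
text = soup.find("span", id="biblio-publication-number-content").get_text()

# Look for an ISO-style date (YYYY-MM-DD) anywhere in the element's text.
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
publication_date = match.group(0) if match else None
print(publication_date)
```

In the Selenium script, the same two steps can be applied to the `soup` built from `driver.page_source`.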
Umberto, if you are looking for span HTML elements, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
If you are looking for an HTML element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case you are fetching all span HTML elements.
In the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you look into HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
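The difference between these lookups, and why the `id_=` keyword in the question's code matches nothing, can be checked on a small static snippet. The HTML below is a made-up fragment, not the live page. BeautifulSoup spells `class_` with a trailing underscore only because `class` is a reserved word in Python; `id` needs none, so `id_=` is treated as a filter on an attribute literally named `id_`.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = """
<div>
  <span id="biblio-publication-number-content">CN105030410A</span>
  <span class="label">Title</span>
  <p id="other">text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <span>, whatever its attributes.
all_spans = soup.find_all("span")
# Any tag whose id attribute matches.
by_id = soup.find_all(id="biblio-publication-number-content")
# Filters on an attribute named "id_", which no tag here has.
wrong_keyword = soup.find_all("span", id_="biblio-publication-number-content")

print(len(all_spans), len(by_id), len(wrong_keyword))
```

Note that these lookups only help once the element is present in the HTML being parsed.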
I am looking to scrape the following web page, where I wish to scrape all the text on the page, including all the clickable elements.
I've attempted to use requests:
import requests
response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
response.text
Which scrapes the meta-data but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?
Because the page is rendered with JavaScript, requests and bs4 alone return none of the actual data: neither of them can run JavaScript. You need an automation tool such as Selenium. Here I use Selenium with bs4 and it works fine. A minimal working example follows:
Code:
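from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver.maximize_window()
url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)  # crude fixed wait for the JavaScript to render the page
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
name = soup.find('div', class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p = soup.find('span', class_="DetailsHeader_value__1wPm8")
# The price element is missing when the item is not listed for sale
price = p.get_text(strip=True) if p else "Not for sale"
print([name, price])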
Output:
['Chimp #2424', 'Not for sale']
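The fixed `time.sleep(20)` above waits the full 20 seconds even when the page is ready sooner, and fails if rendering takes longer. A common alternative is to poll until the content appears. The helper below is a hypothetical, library-independent sketch of that idea (the names `wait_until`, `timeout`, and `interval` are mine); Selenium's own `WebDriverWait` provides the same pattern for real drivers. The fake page function stands in for a browser so the sketch runs on its own.

```python
import time

def wait_until(condition, timeout=20.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for a page whose content appears on the third check.
attempts = {"count": 0}

def fake_page_ready():
    attempts["count"] += 1
    return "Chimp #2424" if attempts["count"] >= 3 else None

title = wait_until(fake_page_ready, timeout=5.0, interval=0.01)
print(title, attempts["count"])
```

With a real driver, the condition could be something like `lambda: "DetailsHeader_title" in driver.page_source`.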
I'm trying to get some data from this web page. All the text is visible on the page; however, when I grab it with BeautifulSoup I can't find any of the numbers for the odds.
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.sportsbookreview.com/betting-odds/mlb-baseball/money-line/1st-half/?date=20160601"
driver = webdriver.Chrome()
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
opener = soup.find_all('span', {'class': 'opener'})
print(opener)
What's even weirder, it worked once, then stopped for no apparent reason or changes to the code. When I checked what soup scrapes from this page, the numbers for odds weren't even included.
Why don't I get all the data, and what can I do to get it?
I think you need Selenium so that the JavaScript runs; you can then extract the data you want. In this case the code below may help:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
# also look at the available options
driver = webdriver.Chrome(service=Service('address_of_your_driver'), options=options)
driver.get("your_URL_here")
# now you've got the rendered source
soup = BeautifulSoup(driver.page_source, "html.parser")
I'm trying to retrieve the price of the item from this amazon page, URL:
https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/
Source Code
from bs4 import BeautifulSoup
import requests

text = "https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/"
page = requests.get(text)
data = page.text
soup = BeautifulSoup(data, 'lxml')
web_text = soup.find_all('div')
print(web_text)
Every time I run the program, I get HTML output that looks nothing like the web page, saying things like:
" Sorry! Something went wrong on our end. Please go back and try again..."
I'm not sure what I'm doing wrong, any help would be much appreciated. I'm new to python and webscraping so I'm sorry if my issue is super obvious. Thanks! :)
The website serves its content dynamically, which requests cannot handle. Use Selenium instead:
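import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/'
# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
driver.get(url)
time.sleep(3)  # give the page's JavaScript time to run
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('span#priceblock_ourprice').get_text())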
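Whichever client fetches the page, it can help to confirm that the HTML is the product page and not the error page quoted in the question before trying to parse a price. A minimal sketch, assuming the phrase from the question's output is a usable marker; the function name `looks_like_error_page` is hypothetical, the sample HTML strings and the price in them are placeholders, and the real error page may be worded differently.

```python
# Phrase taken from the error output quoted in the question.
ERROR_MARKER = "Something went wrong on our end"

def looks_like_error_page(html: str) -> bool:
    """Return True if the HTML contains the error phrase from the question."""
    return ERROR_MARKER.lower() in html.lower()

# Placeholder documents standing in for the two kinds of response.
blocked_html = ("<html><body>Sorry! Something went wrong on our end. "
                "Please go back and try again...</body></html>")
product_html = ('<html><body><span id="priceblock_ourprice">$0.00</span>'
                "</body></html>")

print(looks_like_error_page(blocked_html))
print(looks_like_error_page(product_html))
```

Calling this on `driver.page_source` or `page.text` before `select_one` gives a clearer failure than an `AttributeError` on `None`.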