Scraping a JavaScript-enabled web page in Python

I am looking to scrape the following web page, where I want to capture all the text on the page, including all the clickable elements.
I've attempted to use requests:
import requests
response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
response.text
Which scrapes the meta-data but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?

As it's a JavaScript-enabled web page, you can't get the content using requests and bs4 alone, because they can't render JavaScript. You need a browser-automation tool such as Selenium. Here I use Selenium together with bs4, and it works fine. Please see the minimal working example below:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')  # path to your ChromeDriver binary
driver.maximize_window()

url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)  # give the JavaScript time to render the page

soup = BeautifulSoup(driver.page_source, 'lxml')
name = soup.find('div', class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p = soup.find('span', class_="DetailsHeader_value__1wPm8")
price = p.get_text(strip=True) if p else "Not for sale"
print([name, price])
Output:
['Chimp #2424', 'Not for sale']
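A fixed time.sleep(20) works but is fragile: it waits the full 20 seconds even when the page is ready sooner, and fails when the page is slower. As an alternative sketch (assuming the same class names as above), Selenium's WebDriverWait blocks only until the element actually appears:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://cronoschimp.club/market/details/2424?isHonorary=false')
# Wait up to 30 seconds for the title element to be rendered by the page's JavaScript
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'DetailsHeader_title__1NbGC'))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find('div', class_='DetailsHeader_title__1NbGC').get_text(strip=True))
driver.quit()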

Related

How can I get information from a web site using BeautifulSoup in python?

I need to extract the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML in the browser's 'inspect' view, I find the publication date quickly, but when I search the HTML fetched with Python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], while the tag is clearly there in the page's 'inspect' view.
Why is the 'inspect' HTML different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page load. requests only shows what the initial page content looked like before the DOM was modified by JavaScript.
For this you might install selenium and then download a Selenium web driver for your specific browser. Put the driver in some directory that is on your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
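Note that once find_element() has located the node, BeautifulSoup is optional here; the same content can be read directly from the Selenium element, e.g.:

print(elem.text)                        # the visible text of the span
print(elem.get_attribute('outerHTML'))  # the span's full markup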
Umberto, if you are looking for the HTML element span, use the following code:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
for result in results:
    print(result)
If you are looking for the element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
print(soup.find_all(id='biblio-publication-number-content'))
In the first case you are fetching all span elements; in the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you look into HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
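For completeness, the same two lookups can also be written as CSS selectors with select(); a small sketch, assuming the same soup object as above:

spans = soup.select('span')                                 # all span elements
number = soup.select('#biblio-publication-number-content')  # elements with that id
print(len(spans), number)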

How can I fetch the source code from a website that blocks requests made with Python bs4/Selenium?

I want to scrape the data present on this website "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105". I tried using Beautiful Soup and Selenium.
The first approach:
import requests
from bs4 import BeautifulSoup

url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')
print(soup)
This did not give the output I was expecting; the fetched page contains something like "Sorry, something about your browser or browsing activity made us think you were a robot."
The second approach:
from selenium import webdriver

url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
PATH = r"C:\Users\Vinay Edula\Desktop\xxxxxxxx\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get(url)
This approach works fine for one or two pages on that site, but after that the website starts blocking the requests.
Approach 3:
import time
import webbrowser

chrome_path = 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s'
for i in range(10):
    url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105&cursor=" + str(i * 10) + "&limit=10"
    webbrowser.get(chrome_path).open(url)
    time.sleep(10)
This code works fine: it opens the site in Chrome without any error or blocking, but I don't know how to fetch the source code from it.
Whenever the Python code tries to fetch the page itself, or accesses it through the browser that Selenium spawns, I get the error; yet when I open the page manually, or via Python's webbrowser module, I can see the contents. How can I solve this problem? My final aim is to fetch the contents of this paginated site: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105.
Any solution to this problem will be highly appreciated.
You can use the page_source property of Selenium as follows:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com")
print(browser.page_source)  # the rendered HTML of the current page
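Combining this with the cursor-based pagination from approach 3, a sketch of the full loop might look like the following; the cursor/limit parameters are taken from the question and may not match the live site, and the site's bot detection may still block automated visits:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
for i in range(10):
    url = ("https://www.findhelp.org/care/support-network--san-francisco-ca"
           "?postal=94105&cursor=" + str(i * 10) + "&limit=10")
    driver.get(url)
    time.sleep(10)  # give the page time to render
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.title.get_text(strip=True))  # replace with the elements you need
driver.quit()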

Not getting all the information when scraping bet365.com

I am having a problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Is there another framework or configuration that would extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless browser; it's just an HTML parser, so anything that's populated with JavaScript it can't see. You'd need a real browser, driven with something like Selenium, to scrape a page like that.
You need to use selenium instead of requests, along with BeautifulSoup as well.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()  # optional, if you want to maximize the browser
driver.implicitly_wait(60)  # optional, waits up to 60s for slow-loading elements
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup

Problem with BS4 printing only certain parts of the web page

I am having a problem with bs4 only finding some things in the HTML. To be specific, when I try to print span.nav2__menu-link-main-text it selects it and prints it without a problem, but when I try to select another part of the page, it seems to select it but doesn't print anything. I tried parsers other than lxml and none worked. Here is the code that prints, and the code that doesn't:
#This one prints
from bs4 import BeautifulSoup
import requests

url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('span.nav2__menu-link-main-text'):
    print(i.text)
#This one does not print
from bs4 import BeautifulSoup
import requests

url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('div.value-dispaly__value'):
    print(i.text)
I expect this program to print the current value of div.value-dispaly__value, but when I start the program it prints nothing, even though I can see the value is 4000 when I inspect the page.
It seems the content you are trying to get is added to the web page dynamically by JavaScript.
To have that JavaScript executed for you, one option is the requests-html library's render() method.
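A minimal sketch using requests-html (pip install requests-html); note that render() downloads a Chromium build on first use and then executes the page's JavaScript:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://osu.ppy.sh/users/12008062')
r.html.render(sleep=3)  # run the page's JavaScript, then wait 3s for it to settle
for el in r.html.find('div.value-display__value'):
    print(el.text)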
The page renders its data with JavaScript, so you need to use an automation library like Selenium. Download the Selenium web driver that matches your browser.
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Replace your code with this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://osu.ppy.sh/users/12008062')
time.sleep(3)  # give the JavaScript time to load

soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find_all('div', {"class": "value-display__value"}):
    print(i.get_text())
O/P:
#47,514
#108
11d 19h 49m
44
4,000
11d 19h 49m
44
4,000
#47,514
#108
0
0

BeautifulSoup - Can't get the content of the page

I've been using BeautifulSoup for a while and I haven't had many problems.
But now I'm trying to scrape a site that gives me some trouble.
My code is this:
import requests
from bs4 import BeautifulSoup

preSoup = requests.get('https://www.betbrain.com/football/world/')
soup = BeautifulSoup(preSoup.content, "lxml")
print(soup)
The content I get seems to be some sort of script and/or the API the site talks to, but not the real content of the web page I see in the browser.
I can't reach the games, for example. Does anyone know a way around this?
Thank you
Okay, requests gets only the HTML and doesn't load the JS; you have to use a webdriver for that.
You can use Chrome, Firefox, etc.; I use PhantomJS because it runs in the background as a "headless" browser. Below you will find some example code that will help you understand how to use it:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give it some time to load the js

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all("span", {"class": "Participant1"}):
    print(i.text)
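Since PhantomJS is no longer maintained, a similar setup can be sketched with headless Chrome instead (assuming chromedriver is installed and on your PATH):

from bs4 import BeautifulSoup
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give the JavaScript time to load

soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find_all("span", {"class": "Participant1"}):
    print(i.text)
driver.quit()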
