BeautifulSoup - Can't get the content of the page - python

I've been using BeautifulSoup for a while and haven't had many problems.
But now I'm trying to scrape a site that's giving me trouble.
My code is this:
import requests
from bs4 import BeautifulSoup

preSoup = requests.get('https://www.betbrain.com/football/world/')
soup = BeautifulSoup(preSoup.content, "lxml")
print(soup)
The content I get back seems to be some sort of script and/or the API the site talks to, not the actual content of the page I see in the browser.
I can't reach the games, for example. Does anyone know a way around this?
Thank you

Okay, requests only fetches the raw HTML and doesn't run the JavaScript.
You have to use a webdriver for that.
You can use Chrome, Firefox, etc.; I use PhantomJS because it runs in the background as a "headless" browser. Below is some example code that should help you understand how to use it:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give the page some time to load the JS
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for i in soup.findAll("span", {"class": "Participant1"}):
    print(i.text)
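Note that PhantomJS has since been deprecated and newer Selenium releases have dropped support for it, so headless Chrome or Firefox is the usual replacement today. Here is a minimal sketch assuming Selenium 4 and a local Chrome install; `extract_participants` is a hypothetical helper added for illustration:

```python
from bs4 import BeautifulSoup

def extract_participants(html):
    """Pull the text of every span.Participant1 out of rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [s.get_text(strip=True)
            for s in soup.find_all("span", {"class": "Participant1"})]

if __name__ == "__main__":
    # Selenium is only needed for the live scrape, so import it here.
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.betbrain.com/football/world/")
    time.sleep(5)  # crude wait for the JS to populate the page
    html = driver.page_source
    driver.quit()
    for name in extract_participants(html):
        print(name)
```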

Related

Scraping a Javascript enabled web page in Python

I am looking to scrape the following web page; I want all the text on the page, including the clickable elements.
I've attempted to use requests:
import requests
response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
response.text
Which scrapes the meta-data but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?
As it's a JavaScript-enabled web page, you can't get anything meaningful using requests and bs4 alone, because they can't render JavaScript. You need an automation tool such as Selenium. Here I use Selenium together with bs4 and it works fine. Please see the minimal working example below:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)
url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)  # give the client-side app time to render
soup = BeautifulSoup(driver.page_source, 'lxml')
name = soup.find('div', class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p = soup.find('span', class_="DetailsHeader_value__1wPm8")
price = p.get_text(strip=True) if p else "Not for sale"
print([name, price])
Output:
['Chimp #2424', 'Not for sale']

Why does BeautifulSoup miss <p> tags?

I am using BeautifulSoup, but the findAll method is missing <p> tags. When I run the code it returns an empty list, yet if I inspect the page I can clearly see them, as shown in the picture below.
I chose a random site as an example.
import requests
from bs4 import BeautifulSoup
#An example web site
url = 'https://www.kite.com/python/answers/how-to-extract-text-from-an-html-file-in-python'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.findAll("p"))
The output:
(env) pinux#main:~/dev$ python trial.py
[]
I inspect the page using the browser:
The text is clearly there. Why doesn't BeautifulSoup catch it? Can someone shed some light on what is going on?
It appears that parts of this webpage are rendered with JavaScript. You can try using Selenium, since it drives a real browser that executes the page's JavaScript before you read the source.
import bs4
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://url-to-webpage.com")
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
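One caveat: the driver waits for the initial document load, but content injected by JavaScript afterwards can still be missing at that point. An explicit wait for the <p> tags is more reliable. A sketch assuming Selenium 4; `paragraph_texts` is a hypothetical helper, not part of the original answer:

```python
import bs4

def paragraph_texts(html):
    """Return the stripped text of every <p> in the given HTML."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

if __name__ == "__main__":
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Firefox()
    browser.get("https://url-to-webpage.com")
    # Block (up to 10 s) until at least one <p> exists, instead of hoping
    # that the page-load event means the JavaScript has finished rendering.
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "p"))
    )
    print(paragraph_texts(browser.page_source))
    browser.quit()
```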

Not getting all the information when scraping bet365.com

I am having problems trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Is there another framework or configuration for extracting the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it just parses the HTML you hand it, so anything populated by JavaScript is invisible to it. You'd need a headless browser like Selenium to scrape something like that.
You need to use selenium instead of requests, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()  # optional, if you want to maximize the browser
driver.implicitly_wait(60)  # optional, wait for elements to load
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup

Python web scraping not returning the right text, and sometimes no text at all

I'm trying to retrieve the price of the item from this Amazon page, URL:
https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/
Source Code
from bs4 import BeautifulSoup
import requests

text = "https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/"
page = requests.get(text)
data = page.text
soup = BeautifulSoup(data, 'lxml')
web_text = soup.find_all('div')
print(web_text)
Every time I run the program, I get HTML output that's nothing like the web page, saying things like:
" Sorry! Something went wrong on our end. Please go back and try again..."
I'm not sure what I'm doing wrong; any help would be much appreciated. I'm new to Python and web scraping, so I'm sorry if my issue is super obvious. Thanks! :)
The website serves its content dynamically, which requests can't handle; use selenium instead:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://www.amazon.com/FANMATS-University-Longhorns-Chrome-Emblem/dp/B00EPDLL6U/'
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
time.sleep(3)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('span#priceblock_ourprice').get_text())
driver.close()

Web scraper only works for a few minutes after I've opened the web page I want to scrape

Here is the bit of relevant code:
from bs4 import BeautifulSoup
from selenium import webdriver
item = 'https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29'
driver = webdriver.Chrome()
driver.get(item)
res = driver.execute_script('return document.documentElement.outerHTML')
driver.quit()
soup = BeautifulSoup(res, 'lxml')
buyorder_table = soup.find('table', {'class': 'market_commodity_orders_table'})
print(buyorder_table)
When I run this code normally, it prints
None
But when I open the item URL in my browser and then run the code, it returns the table I want (and then I have code to parse it).
I found a seemingly helpful article, but I tried its suggested solution, the built-in HTML parser, and had the same issue.
Is there any way to fix this issue? Thanks in advance.
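The symptom is consistent with the buy-order table being injected by the page's JavaScript some time after the initial load, so reading the DOM immediately can race that AJAX call. One possible fix, sketched under the assumption that an explicit wait is sufficient (`find_buyorder_table` is a hypothetical helper, not from the question):

```python
from bs4 import BeautifulSoup

def find_buyorder_table(html):
    """Locate the buy-order table in the page HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"class": "market_commodity_orders_table"})

if __name__ == "__main__":
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    item = ('https://steamcommunity.com/market/listings/730/'
            'AK-47%20%7C%20Redline%20%28Field-Tested%29')
    driver = webdriver.Chrome()
    driver.get(item)
    # Wait (up to 15 s) for the JS to inject the buy-order table, instead
    # of reading the DOM immediately after the initial page load.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CLASS_NAME, "market_commodity_orders_table"))
    )
    html = driver.page_source
    driver.quit()
    print(find_buyorder_table(html))
```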

Categories

Resources