BeautifulSoup4 output with JS Filters

BeautifulSoup4 output with JS Filters - python

Newbie here. I'm trying to scrape some sports statistics off a website using BeautifulSoup4. The script below does output a table, but it's not actually the specific data that appears in the browser (the data that appears in browser is the data I'm after - goalscorer data for a season, not all time records).
#import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
#specify the url
stat_page = 'https://www.premierleague.com/stats/top/players/goals?se=79'
# query the website and return the html to the variable ‘page’
page = urlopen(stat_page)
#parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
# Take out the <div> of name and get its value
stats = soup.find('tbody', attrs={'class': 'statsTableContainer'})
name = stats.text.strip()
print(name)
It appears there is some filtering of data going on behind the scenes but I am not sure how I can filter the output with BeautifulSoup4. It would appear there is some Javascript filtering happening on top of the HTML.
I have tried to identify what this specific filter is, and it appears the filtering is done here.
<div class="current" data-dropdown-current="FOOTBALL_COMPSEASON" role="button" tabindex="0" aria-expanded="false" aria-labelledby="dd-FOOTBALL_COMPSEASON" data-listen-keypress="true" data-listen-click="true">2017/18</div>
I've had a read of the below link, but I'm not entirely sure how to apply it to my answer (again, beginner here).
Having problems understanding BeautifulSoup filtering
I've tried installing, importing and applying the different parsers, but I always get the same error (Couldn't find a Tree Builder). Any suggestions on how I can pull data off a website that appears to be using a JS filter?
Thanks.

In these cases, it's usually useful to track the network requests using your browser's developer tools, since the data is usually retrieved using AJAX and then displayed in the browser with JS.
In this case, it looks like the data you're looking for can be accessed at:
https://footballapi.pulselive.com/football/stats/ranked/players/goals?page=0&pageSize=20&compSeasons=79&comps=1&compCodeForActivePlayer=EN_PR&altIds=true
It has a standard JSON format so you should be able to parse and extract the data with minimal effort.
However, note that this endpoint requieres the Origin HTTP header to be set to https://www.premierleague.com in order for it to serve your request.

Related

How can i fix the find_all error while web scraping?

I am kinda a newbie in data world. So i tried to use bs4 and requests to scrap data from trending youtube videos. I have tried using soup.findall() method. To see if it works i displayed it. But it gives me an empty list. Can you help me fix it? Click here to see the spesific part of the html code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-
shelf-contents-renderer"})
print(trending)

This webpage is dynamic and contains scripts to load data. Whenever you make a request using requests.get("https://www.youtube.com/feed/explore"), it loads the initial source code file that only contains information like head, meta, etc, and scripts. In a real-world scenario, you will have to wait until scripts load data from the server. BeautifulSoup does not catch the interactions with DOM via JavaScript. That's why soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-shelf-contents-renderer"}) gives you empty list as there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For Youtube, you can use it's API as well.

Python Requests only pulling half of intented tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's html, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from curtis landers onwards is not included (I tried pasting the full output of page.contents but it's too long).
My best guess from reading this answer is that the website has javascripts that load the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to first load.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is wordpress generated and wordpress can set up delayed javascripts on even simple web sites.
My questions are:
1) do I really need to use Selenium to scrape a simple, word-press generated website like this? Or is there a way to get the full page to load with just Requests? Is there anyway to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!

Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.

Webscraping table with multiple pages using BeautifulSoup

I'm trying to scrape this webpage https://www.whoscored.com/Statistics using BeautifulSoup in order to obtain all the information of the player statistics table. I'm having lot of difficulties and was wondering if anyone would be able to help me.
url = 'https://www.whoscored.com/Statistics'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
text = [element.text for element in soup.find_all('div' {'id':"statistics-table-summary"})]
My problem lies in the fact that I don't know what the correct tag is to obtain that table. Also the table has several pages and I would like to scrape every single one. The only indication I've seen of a change of page in the table is the number in the code below:
<div id="statistics-table-summary" class="" data-fwsc="11">

It looks to me like that site loads their data in using Javascript. In order to grab the data, you'll have to mimic how a browser loads a page; the requests library isn't enough. I'd recommend taking a look at a tool like Selenium, which uses a "robotic browser" to load the page. After the page is loaded, you can then use BeautifulSoup to retrieve the data you need.
Here's a link to a helpful tutorial from RealPython.
Good luck!

BeautifulSoup4 fails to parse multiple tables

I'd like to systematically scrape the privacy breach data found here which is directly embedded in the HTML of the page. I've found various links on StackOverflow about missing HTML and not being able to scrape a table using BS4. Both of these threads seem very similar to the issue that I'm having, however i'm having a difficult time reconciling the differences.
Here's my problem: When I pull the HTML using either Requests or urllib (python 3.6) the second table does not appear in the soup. The second link above details that this can occur if the table/data is added in after the page loads using javascript. However when I examine the page source the data is all there, so that doesn't seem to be the issue. A snippet of my code is below.
url = 'https://www.privacyrights.org/data-breach/new?title=&page=1'
r = requests.get(url, verify=False)
soupy = BeautifulSoup(r.content, 'html5lib')
print(len(soupy.find_all('table')))
# only finds 1 table, there should be 2
This code snippet fails to find the table with the actual data in it. I've tried lmxl, html5lib, and html.parse parsers. I've tried urllib and Requests packages to pull down the page.
Why can't requests + BS4 find the table that I'm looking for?

Looking at the HTML delivered from the URL it appears that there only IS one table in it, which is precisely why Beautiful Soup can't find two!

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing using the default, lxml and HTML 5 parsers. I believe this has something to do with Javascript and I have been trying to implement PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example. The financeWrap includes the table information, but beautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup
url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = bs(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id":"financeWrap"})
print financial_tables
# Output: <div id="financeWrap">
# </div>

The issue is that you're trying to get data that is coming in through Ajax on the website. If you go to the link you provided, and looked at the source via the browser, you'll see that there should be no content with the data.
However, if you use a console manager, such as Firebug, you will see that there are Ajax requests made to the following URL, which is something you can parse via beautifulsoup (perhaps - I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

BeautifulSoup4 output with JS Filters - python

Related

How can i fix the find_all error while web scraping?

Python Requests only pulling half of intented tags

Webscraping table with multiple pages using BeautifulSoup

BeautifulSoup4 fails to parse multiple tables

BeautifulSoup4: Missing Parsed Table Data

Categories

Resources