Webscraping table with multiple pages using BeautifulSoup - python

I'm trying to scrape this webpage https://www.whoscored.com/Statistics using BeautifulSoup in order to obtain all the information from the player statistics table. I'm having a lot of difficulty and was wondering if anyone would be able to help.
import requests
from bs4 import BeautifulSoup

url = 'https://www.whoscored.com/Statistics'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
text = [element.text for element in soup.find_all('div', {'id': "statistics-table-summary"})]
My problem lies in the fact that I don't know what the correct tag is to obtain that table. Also the table has several pages and I would like to scrape every single one. The only indication I've seen of a change of page in the table is the number in the code below:
<div id="statistics-table-summary" class="" data-fwsc="11">

It looks to me like that site loads their data in using Javascript. In order to grab the data, you'll have to mimic how a browser loads a page; the requests library isn't enough. I'd recommend taking a look at a tool like Selenium, which uses a "robotic browser" to load the page. After the page is loaded, you can then use BeautifulSoup to retrieve the data you need.
Here's a link to a helpful tutorial from RealPython.
Good luck!
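To illustrate the hand-off: once Selenium (or any other tool) has given you the rendered HTML, the BeautifulSoup step might look like the sketch below. The table markup here is a made-up stand-in that only reuses the container id from the question; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# In practice the HTML would come from Selenium, e.g.:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://www.whoscored.com/Statistics")
#   html = driver.page_source
# Here a stand-in snippet with the same container id is used instead.
html = """
<div id="statistics-table-summary">
  <table><tbody>
    <tr><td>Player A</td><td>7.52</td></tr>
    <tr><td>Player B</td><td>7.31</td></tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "statistics-table-summary"})
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in container.find_all("tr")]
print(rows)
```

For the pagination, you would click the "next page" control with Selenium and re-read driver.page_source for each page before parsing.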

Related

beautifulsoup4 find_all not finding any data on Walmart grocery website

I am attempting to scrape some basic product information from the url linked here, but the bs4 find_all command isn't finding any data given the name of the class associated with the product div. Specifically, I am trying:
import requests
from bs4 import BeautifulSoup

url = 'https://www.walmart.com/grocery/browse/Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
product_list = soup.find_all('div', class_='productListTile')
print(product_list)
But this prints an empty list []. Having inspected the webpage on Chrome, I know that 'productListTile' is the correct class name. Any idea what I am doing wrong?
Most likely you will need to use Selenium: plain requests calls get redirected to a "Verify Your Identity" page before any product markup is served.
Here is a very similar question to this one, with code that uses Selenium and Beautiful Soup in concert to scrape Wal-Mart.
python web scraping using beautiful soup is not working
Web-scraping techniques vary from site to site. In this case you can use Selenium, which is a good option, but I am adding another method that has helped me a lot.
Inspect the web page, select the Network tab, and refresh the page.
Then sort the requests by type: the XHR entries are the API calls the page makes to fetch its data from the backend. You can call that backend API directly to fetch the player data instead of scraping the rendered HTML.
Check the "Headers" tab to see the API endpoint, and the "Preview" tab to see the API response in JSON format.
If you also want the images, check the response source: you will see the image URLs, which you can download and map to the player ids.
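As a sketch of what the last step looks like in code: the JSON below is a made-up stand-in for the kind of response the Preview tab shows (the field names are invented for illustration). With the real endpoint you would replace the literal string with requests.get(api_url).json().

```python
import json

# Made-up stand-in for the backend API's JSON response; the real field
# names will be whatever the Preview tab shows for the actual endpoint.
raw = ('{"players": ['
       '{"id": 1, "name": "Player A", "goals": 22}, '
       '{"id": 2, "name": "Player B", "goals": 18}]}')

data = json.loads(raw)
names = [p["name"] for p in data["players"]]
print(names)
```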

Scraping href with BeautifulSoup

I'm trying to scrape the CO2 per resource trend table data from this url: caiso.com/todaysoutlook/pages/emissions.html
The href attribute of one of the links contains the dataset for the chart (as a very long string). I was attempting to return this attribute, but my code returns an empty result for the following request, no matter how hard I try to google other suggestions.
import requests
from bs4 import BeautifulSoup

url = 'http://www.caiso.com/todaysoutlook/pages/emissions.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find("a", {"class": "dropdown-item"})
print(link)
Any thoughts are appreciated! Thanks.
It seems like what you really are asking about is why the element does not have an href attribute when you inspect it in your code. The reason is, when you request the HTML page from the server it actually returns a static page without any chart data. When you view this page in a web browser, it runs some JavaScript code that queries the backend and populates the chart data dynamically. So you'll need to modify your approach to get that data.
One option is to manually inspect the page in your browser, reverse-engineer how it fetches the data, and do the same thing in your code. Most web browsers have built-in developer tools that can help with this.
Another option is to use a browser automation tool like Selenium to load the page in a web browser environment and scrape the data from there. This is less efficient, but might be easier for someone inexperienced in web programming because you can treat the JavaScript functionality as a "black box" and interact with the page more like how a real user would.
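To illustrate the second option: once a tool like Selenium has executed the page's JavaScript, the original find call should start returning the link. The snippet below uses a stand-in string for the rendered HTML, with a short fake data URL in place of the very long real one.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the page's JavaScript has run;
# on the real page the href is a very long data string.
rendered = ('<a class="dropdown-item" '
            'href="data:text/csv,time,co2%0A00:00,123">CO2 trend</a>')

soup = BeautifulSoup(rendered, "html.parser")
link = soup.find("a", {"class": "dropdown-item"})
print(link["href"])
```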

BeautifulSoup4 output with JS Filters

Newbie here. I'm trying to scrape some sports statistics off a website using BeautifulSoup4. The script below does output a table, but it's not actually the specific data that appears in the browser (the data that appears in browser is the data I'm after - goalscorer data for a season, not all time records).
# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
stat_page = 'https://www.premierleague.com/stats/top/players/goals?se=79'

# query the website and return the html to the variable 'page'
page = urlopen(stat_page)

# parse the html using Beautiful Soup and store it in the variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# take out the <tbody> holding the stats and get its text
stats = soup.find('tbody', attrs={'class': 'statsTableContainer'})
name = stats.text.strip()
print(name)
It appears there is some filtering of data going on behind the scenes but I am not sure how I can filter the output with BeautifulSoup4. It would appear there is some Javascript filtering happening on top of the HTML.
I have tried to identify what this specific filter is, and it appears the filtering is done here.
<div class="current" data-dropdown-current="FOOTBALL_COMPSEASON" role="button" tabindex="0" aria-expanded="false" aria-labelledby="dd-FOOTBALL_COMPSEASON" data-listen-keypress="true" data-listen-click="true">2017/18</div>
I've had a read of the link below, but I'm not entirely sure how to apply it to my case (again, beginner here).
Having problems understanding BeautifulSoup filtering
I've tried installing, importing and applying the different parsers, but I always get the same error (Couldn't find a Tree Builder). Any suggestions on how I can pull data off a website that appears to be using a JS filter?
Thanks.
In these cases, it's usually useful to track the network requests using your browser's developer tools, since the data is usually retrieved using AJAX and then displayed in the browser with JS.
In this case, it looks like the data you're looking for can be accessed at:
https://footballapi.pulselive.com/football/stats/ranked/players/goals?page=0&pageSize=20&compSeasons=79&comps=1&compCodeForActivePlayer=EN_PR&altIds=true
It has a standard JSON format so you should be able to parse and extract the data with minimal effort.
However, note that this endpoint requires the Origin HTTP header to be set to https://www.premierleague.com in order for it to serve your request.
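A sketch of the call, using the endpoint above: the request is only built here, not sent, but sending it would just be requests.get(API, headers=headers).json().

```python
import requests

API = ("https://footballapi.pulselive.com/football/stats/ranked/players/goals"
       "?page=0&pageSize=20&compSeasons=79&comps=1"
       "&compCodeForActivePlayer=EN_PR&altIds=true")

# The endpoint refuses requests that lack this Origin header.
headers = {"Origin": "https://www.premierleague.com"}

# Build (but do not send) the request to show the shape of the call.
req = requests.Request("GET", API, headers=headers).prepare()
print(req.headers["Origin"])
```

Paging through all results is then just a matter of incrementing the page query parameter.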

BeautifulSoup4 fails to parse multiple tables

I'd like to systematically scrape the privacy breach data found here, which is directly embedded in the HTML of the page. I've found various links on Stack Overflow about missing HTML and not being able to scrape a table using BS4. Both of those threads seem very similar to my issue, but I'm having a difficult time reconciling the differences.
Here's my problem: When I pull the HTML using either Requests or urllib (python 3.6) the second table does not appear in the soup. The second link above details that this can occur if the table/data is added in after the page loads using javascript. However when I examine the page source the data is all there, so that doesn't seem to be the issue. A snippet of my code is below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.privacyrights.org/data-breach/new?title=&page=1'
r = requests.get(url, verify=False)
soupy = BeautifulSoup(r.content, 'html5lib')
print(len(soupy.find_all('table')))
# only finds 1 table; there should be 2
This code snippet fails to find the table with the actual data in it. I've tried the lxml, html5lib, and html.parser parsers, and both the urllib and requests packages to pull down the page.
Why can't requests + BS4 find the table that I'm looking for?
Looking at the HTML delivered from the URL it appears that there only IS one table in it, which is precisely why Beautiful Soup can't find two!
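A quick way to confirm that, independent of any parser, is to count table tags in the raw response before parsing. The string below stands in for r.text; with the real page you would use the actual response body.

```python
from bs4 import BeautifulSoup

# Stand-in for r.text from the real request.
raw = "<html><body><table><tr><td>breach data</td></tr></table></body></html>"

# If these two numbers agree, the parser is not dropping anything:
# the server simply never sent a second table.
raw_count = raw.count("<table")
parsed_count = len(BeautifulSoup(raw, "html.parser").find_all("table"))
print(raw_count, parsed_count)
```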

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing with the default, lxml, and html5lib parsers. I believe this has something to do with Javascript, and I have been trying to implement PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example. The financeWrap includes the table information, but beautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup

url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id": "financeWrap"})
print(financial_tables)
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that comes in through Ajax on the website. If you go to the link you provided and look at the source via the browser, you'll see that the content with the data isn't there.
However, if you use a console manager, such as Firebug, you will see that there are Ajax requests made to the following URL, which is something you can parse via beautifulsoup (perhaps - I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.
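The empty-wrapper symptom can be reproduced with a stand-in for the static HTML that requests receives: the financeWrap div exists, but there is nothing inside it until the browser's JavaScript fills it in.

```python
from bs4 import BeautifulSoup

# Stand-in for the static HTML; the real page's wrapper is likewise
# empty before the Ajax call runs in a browser.
static_html = '<html><body><div id="financeWrap"></div></body></html>'

soup = BeautifulSoup(static_html, "html.parser")
wrap = soup.find("div", {"id": "financeWrap"})
print(wrap is not None, len(wrap.find_all("table")))
```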
