I have been trying to scrape a table from here for quite some time, but have been unsuccessful. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I can scrape one element of that table, I can iterate through the columns I want and eventually end up with a pandas DataFrame.
Here is my code so far:
from bs4 import BeautifulSoup
import requests

# url that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')

# let's look at what the request content looks like
print(r.content)

# use BeautifulSoup on the content from the request
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup)

# prettify() indents the HTML like it appears in the page source,
# which can make it a little easier to read
print(soup.prettify())

# get the wrapper element for the "Team Per Game Stats" table
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)
Any help would be greatly appreciated.
That webpage employs a trick to try to stop search engines and other automated web clients (including scrapers) from finding the table data: the tables are stored in HTML comments:
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="section_heading">
<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2> <div class="section_heading_text">
<ul> <li><small>* Playoff teams</small></li>
</ul>
</div>
</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team-stats-per_game">
<table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>
...
</table>
</div>
</div>
-->
</div>
I note that the opening div has the setup_commented and commented classes. JavaScript code included in the page is executed by your browser; it loads the text from those comments and replaces the placeholder div with that text as new HTML for the browser to display.
You can extract the comment text like this:
from bs4 import BeautifulSoup, Comment

# r is the response object from the requests.get() call in the question
soup = BeautifulSoup(r.content, 'lxml')
# find the placeholder div inside the table's wrapper
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
# the table HTML lives in a comment that is a later sibling of the placeholder
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
# parse the comment text as a new document
table_soup = BeautifulSoup(comment, 'lxml')
then continue to parse the table HTML.
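If the end goal is the pandas DataFrame mentioned in the question, one option is to hand the comment text straight to pandas rather than walking the rows by hand; a minimal sketch, assuming pandas is installed:

import pandas as pd

# read_html parses every <table> in the given HTML string and
# returns a list of DataFrames; this comment contains just one table
df = pd.read_html(str(comment))[0]
print(df.head())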
This specific site has published both terms of use and a page on data use that you should probably read if you are going to use their data. Specifically, their terms state, under section 6. Site Content:
You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.
Scraping the data would fall under that heading.
Just to complete Martijn Pieters's answer (and without lxml)
from bs4 import BeautifulSoup, Comment
import requests
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
soup = BeautifulSoup(r.content, 'html.parser')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table = BeautifulSoup(comment, 'html.parser')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if cells:
        print([cell.text for cell in cells])
Partial output
[u'New Orleans Pelicans', u'71', u'240.0', u'43.6', u'91.7', u'.476', u'10.1', u'29.4', u'.344', u'33.5', u'62.4', u'.537', u'18.1', u'23.9', u'.760', u'11.0', u'36.0', u'47.0', u'27.0', u'7.5', u'5.5', u'14.5', u'21.4', u'115.5']
[u'Milwaukee Bucks*', u'69', u'241.1', u'43.3', u'90.8', u'.477', u'13.3', u'37.9', u'.351', u'30.0', u'52.9', u'.567', u'17.6', u'22.8', u'.773', u'9.3', u'40.1', u'49.4', u'26.0', u'7.4', u'6.0', u'14.0', u'19.8', u'117.6']
[u'Los Angeles Clippers', u'70', u'241.8', u'41.0', u'87.6', u'.469', u'9.8', u'25.2', u'.387', u'31.3', u'62.3', u'.502', u'22.8', u'28.8', u'.792', u'9.9', u'35.7', u'45.6', u'23.4', u'6.6', u'4.7', u'14.5', u'23.5', u'114.6']
I am trying to create a Python script to collect info from a website, and I am trying to work out how I can extract information from two or three div tags and print it.
Example of the HTML code:
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
I have managed to get them one at a time by running a for loop, but I am trying to get RunningCost and Time side by side.
Here is my Python script; I'm new to this, so I am playing around and trying a few different things:
import bs4, requests, time
while True:
    url = "https://www.website.com"
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    #soupTitle = soup.select('.RunningCost')
    soupDetail = soup.select('.Time')
    for soupDetailList in soupDetail:
        print(soupDetailList.text)
End goal for this script is a web monitor to list changes/updates
zip should do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>", "html.parser")
div = soup.find("div")
for r, t in zip(div.find_all("p", {"class": "RunningCost"}),
                div.find_all("p", {"class": "Time"})):
    print(r.string, t.string)
$4.44 peek
$2.33 Off-peek
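Since the stated end goal is a web monitor, the same zip pairing can be dropped into the original polling loop; a rough sketch, keeping the placeholder URL and using an arbitrary one-minute interval:

import bs4, requests, time

previous = None
while True:
    response = requests.get("https://www.website.com")
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # pair each RunningCost with its neighbouring Time, as above
    current = [(r.text, t.text) for r, t in zip(soup.select('.RunningCost'),
                                                soup.select('.Time'))]
    if previous is not None and current != previous:
        print("Changed:", current)
    previous = current
    time.sleep(60)  # poll once a minute; adjust as needed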
Assuming that soup is the parsed HTML:

PowerDetails = soup.find("div")
RunningCost = PowerDetails.find_all("p", class_="RunningCost")
Time = PowerDetails.find_all("p", class_="Time")
This is the part of the HTML that I am extracting on the platform, and it contains the snippet I want to get: the value of the href attribute of the tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library, I have this piece of code that tries to extract it, but it returns the name of the book, as that is what the code asks for. I have tried several ways to get only the href value, but none have worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import mechanize as mc
import Downloader as dw
import requests

def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))

# br is the mechanize browser created while logging in
getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")

with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()
    bsObj = bs4(contents, "lxml")
    aux = open("books.text", "w", encoding='utf8')
    officials = bsObj.find_all('a', {'class': 'bookTitle'})
    for text in officials:
        print(text.get_text())
        aux.write(text.get_text().format())
    aux.close()
Can you try this? (Sorry if it doesn't work, I am not on a PC with Python right now.)
for text in officials:
    print(text['href'])
BeautifulSoup works just fine with the HTML code that you provided. If you want the text of a tag, you simply use .text; if you want the href, you use .get('href'), or, if you are sure the tag has an href value, you can use ['href'].
Here is a simple, easy-to-understand example with your HTML code snippet.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you downloaded the HTML, saved it to disk, and then opened it again. If you just want to get some tag values, that round trip is totally unnecessary: you can save the HTML to a variable and then pass that variable to BeautifulSoup.
Now, I see that you imported the requests library but used mechanize instead. As far as I know, requests is the easiest and most modern library for getting data from web pages in Python. I also see that you imported Session from requests; a session is not necessary unless you want to make multiple requests and keep the connection to the server open for faster subsequent requests.
Also, if you open a file with the with statement, you are using a Python context manager, which handles closing the file for you, meaning you don't have to close the file at the end.
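As a quick illustration of both points, a requests Session is itself a context manager, so the two ideas combine naturally; a minimal sketch with placeholder URLs:

import requests

# the session reuses the underlying connection across requests,
# and the with-block closes it automatically at the end
with requests.Session() as session:
    first = session.get('https://example.com/page1')
    second = session.get('https://example.com/page2')
    print(first.status_code, second.status_code)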
So, simplifying your code without saving the downloaded HTML to disk, I would write it like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.goodreads.com/shelf/show/art'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')

# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})

# - Extract Book Title -
title = tag.text

# - Extract href from Tag -
href = tag.get('href')
Now, if you get multiple a tags with the same class name ('a', {'class': 'bookTitle'}), then you do it like this.

Get all the a tags first:

a_tags = soup.find_all('a', {'class': 'bookTitle'})

and then scrape each book's info and append it to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title': title, 'href': href})  # <-- add each book dict to the books list
        print(title)
        print(href)
    except:
        pass
To understand all this better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm
I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several pieces of data like that, and I need them all (e.g. 803). So I guess I need to use soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in Python (Django).
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
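Since the question says there are several counters like that on the page, you can also loop over every match instead of indexing the first one:

for div in soup.find_all('div', {'class': 'maincounter-number'}):
    # get_text(strip=True) trims the whitespace around values like '803 '
    print(div.get_text(strip=True))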
So I am trying to download a few eBooks that I have purchased through Humble Bundle. I am using BeautifulSoup and requests to try to parse the HTML and get the href links for the PDFs.
Python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", {"class": "js-all-downloads-holder"})
print(links)
I am going to put an Imgur link to the site and HTML layout, because I don't believe you can access the HTML page without being prompted to log in (which might be one of the reasons I am having this issue to start with): https://imgur.com/24x2X0m
HTML
<div class="flexbtn active noicon js-start-download">
<div class="right"></div>
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=fkuzzq6R8MA8ydEw&ttl=1521117317&t=b714bb732413a1f0532ec6aa72b282f9">
PDF
</a>
</div>
So the print statement should output the contents of the div, but that is not the case.
Output
python3 pdf_downloader.py
[]
Sorry for the long post; I have been up all night working on this, and at this point it would have been easier to just hit the download button 20+ times, but that is not how you learn.
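One way to narrow this down is to check what the raw response actually contains; a small diagnostic sketch, not a fix, since both a login wall and JavaScript-rendered content would leave that div empty or missing in what requests receives:

import requests

r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
print(r.status_code)  # an unexpected status here would suggest a login redirect

# if this class never appears in the raw HTML, the downloads list is most
# likely injected client-side by JavaScript after you log in
print("js-all-downloads-holder" in r.text)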
I’m having trouble scraping information from government travel advice websites for a research project I’m doing in Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"
The code I'm using is:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()
At the moment this is extracting all the HTML of the page. Having inspected the website, the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr">
<p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks
If you are only interested in the data located inside the govuk-govspeak direction-ltr class, you can try these steps:
Beautiful Soup supports the most commonly used CSS selectors: just pass a string into the .select() method of a Tag object or of the BeautifulSoup object itself. For a class use . and for an id use #.
data = soup.select('.govuk-govspeak.direction-ltr')
# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)

Output:

[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]

# extract p tags
p_tags = data[0].select('p')
print(p_tags)

Output:

[<p>The FCO advise against all travel to within 10 ...]
You can find that particular <div>, and then under that div you can find the <p> tags and get the data, like this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
for i in div.find_all("p"):
    # the encode/decode round trip drops any non-ascii characters
    data.append(i.get_text().encode("ascii", "ignore").decode("ascii"))
data = "\n".join(data)
Now data will contain the whole content, with the paragraphs separated by \n.
Note: the above code gives you only the paragraph text content; heading content will not be included.
If you want the headings along with the paragraph text, then you can extract both <h3> and <p> like this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
for i in div:
    if i.name == "h3":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n\n")
    if i.name == "p":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n")
data = "".join(data)
Now data will have both headings and paragraphs, where headings are separated by \n\n and paragraphs are separated by \n.