BeautifulSoup returning unrelated HTML - python

I'm trying to parse basketball stat data from pages like http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I'm using Python 2.7.6 and BeautifulSoup 4.3.2. I'm searching gamelogs like the above page for the class "sortable" in order to get access to the raw stat data contained within the tables. I am only interested in the "Basic Stats" for each team.
However, the HTML that BeautifulSoup is returning is not at all what I expect. Instead I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link showing the output, or I would.
Basically, there are four class "sortable" tables on the boxscore page. When I ask BS to find them by the only way I can think of to distinguish them from the other data, it instead returns completely irrelevant data and I can't even figure out where the returned data comes from.
Here's the code:
import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():
    def __init__(self):
        # the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
            'http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html'))
        # use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
            "boxscores"))
def scoredata(links, school):
    # for each link in the school's season
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        # remove extra link formatting to get just the filename alone
        l = l[59 + len(school):]
        # open a local file with that filename to store the results
        fo = open(str(l), "w")
        # create a list that will hold the box score data only
        output = gameSoup.findAll(class_="sortable")
        # write it line by line to the file that was just opened
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()
def getlinks(school):
    gamelogs = Gamelogs()
    # open a new file to store the output
    fo = open(school + '.txt', "w")
    # remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    # create the list that will hold each school's season-long boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        # make the list element a string so it can be sliced
        string = str(s)
        # remove extra link formatting
        string = string[9:]
        string = string[:-16]
        # create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
                        + school + string)
    scoredata(boxlinks, school)

if __name__ == '__main__':
    # for each school as a command-line argument
    for arg in sys.argv[1:]:
        school = arg
        getlinks(school)
Is this a problem with BS, my code, or the site?

It looks like this is an issue with your code. The page that you are getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir
Whenever I enter an invalid school name I am redirected to a page showing stats for 477 different teams. FYI: team names in the url are also case sensitive.
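If you want to fail fast on a bad school name instead of silently parsing the redirect page, one option is to check whether urllib2 followed a redirect before handing the page to BeautifulSoup. A minimal sketch of that idea (the helper name is mine; it relies on the redirect behaviour described above):

import urllib2

def fetch_gamelog_page(school):
    url = ('http://www.sports-reference.com/cbb/schools/' +
           school + '/2015-gamelogs.html')
    response = urllib2.urlopen(url)
    # urlopen follows redirects; geturl() returns the final URL,
    # so a mismatch means the school name was not recognized
    if response.geturl() != url:
        raise ValueError('Unrecognized school name: %r' % school)
    return response.read()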

Related

How to fix this web scraping program?

I'm a complete beginner with Python, and I wrote this program to scrape the closing odds of NHL games off The Score website and put that data in a file. The program runs, but for some reason only 2 games out of the roughly 200 I tried show up with incorrect data.
I think it is because of how I am searching through divs within a div; I wrote the code that returns the data in a way that it only stores the last div (which conveniently is the div I'm looking to scrape).
Also, I'm sure my way of writing to the file is poor for runtime; is there a better way to do this?
import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(response.content, "html.parser")
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1', {'class': "sr-only"})
    # Code to find div and search for div within
    divs = soup.find('div', {'class': 'col-sm-4'})
    for tag in divs:
        # find div
        target = tag.find_all("div", {"class": "GameDetailsCard__row--3rKYp"})
        for tag in target:
            # find divs within target div
            odds = tag.find("div", {"class": "GameDetailsCard__content--2L_KF"})
            # Call write_to_file -> add data scraped from web
            write_to_file(matchtitle.text + " " + odds.text)

# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    for x in range(26500, 26715):
        #print(x)
        url = link + str(x)
        #print(url)
        get_match_data(url)

def write_to_file(game_data):
    file = open("NHL_GAMES.txt", "a")
    file.write(game_data + "\n")
    file.close()

### Main(void) ?? idk what to call this portion of code not a python savant
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)
Here is one line in the text file with correct data:
Toronto Maple Leafs # New Jersey Devils on November 24, 2022 NJD -140, o/u 6.5
Here is one with incorrect data
Carolina Hurricanes # Anaheim Ducks on December 7, 2022 Justin St. Pierre, Chris Lee
Only 2/215 were wrong like this.
It looks like certain NHL game webpages (e.g. the Carolina one) do not contain a <div> section for the 'Odds'; this might be due to them being OT games? Regardless, the best bet is to add a clause to handle 'no odds found'. I have updated some of your code below:
import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    results = []
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1', {'class': "sr-only"})
    target = soup.find_all("div", {"class": "GameDetailsCard__row--3rKYp"})
    for tag in target:
        if "Odds" in str(tag.find("div", {"class": "GameDetailsCard__label--iBMhJ"})):
            odds = str(tag.find("div", {"class": "GameDetailsCard__content--2L_KF"}).text)
        else:
            odds = "No Odds found!"
    print(matchtitle.text + " " + odds)
    results.append(matchtitle.text + " " + odds)
    # Call write_to_file -> add data scraped from web
    write_to_file(results)

# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    print("Getting game details...")
    for x in range(26500, 26715):
        url = link + str(x)
        get_match_data(url)

def write_to_file(game_data):
    with open("NHL_GAMES.txt", "a") as file:
        for line in game_data:
            file.write(line + "\n")

### Main(void)
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)
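On the runtime concern: the version above still reopens NHL_GAMES.txt once per game. If that ever matters, you could open the file once for the whole run and write every line through a single handle. A minimal sketch of that variation (it assumes get_match_data is changed to return its results list instead of calling write_to_file itself):

def multi_games_url_handler(link):
    print("Getting game details...")
    # open the output file once, instead of once per game
    with open("NHL_GAMES.txt", "a") as outfile:
        for x in range(26500, 26715):
            url = link + str(x)
            for line in get_match_data(url):  # assumed to return its results list
                outfile.write(line + "\n")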

Stuck Scraping with Beautifulsoup

So I'm trying to scrape an HTML webpage. It has novel chapters and I'm trying to get the text and store it in text files to read offline. I don't have any previous experience with HTML or other related things either. The webpage I am trying to scrape is this, and the code I've been testing so far looks like this:
import sys
import requests
import time
import re
from bs4 import BeautifulSoup

def browse_and_scrape(seed_url, page_number=1):
    # Fetch the URL - We will be using this to append to images and info routes
    url_pat = re.compile(r"(http://.*\.org)")
    source_url = url_pat.search(seed_url).group(0)
    # Page_number from the argument gets formatted in the URL & fetched
    formatted_url = seed_url.format(str(page_number))
    # print(url_pat, source_url, formatted_url)
    try:
        html_text = requests.get(formatted_url).text
        # print(html_text)
        # Prepare the soup
        soup = BeautifulSoup(html_text, "html.parser")
        print(soup.find_all(id="chapterContent")[0]["style"])
        print(f"Now Scraping - {formatted_url}")
        # help = soup.find_all("div", class_="chapter-content text-normal")[0].text.strip().encode("ascii", "ignore").decode("ascii")
        # for node in soup.findAll("div", class_="chapter-content text-normal"):
        #     print(node)
        #     print(''.join(node.findAll(text=True)))
        # for node in soup.findAll("div"):
        #     print(node)
        #     print(''.join(node.findAll(text=True)))
        # help = soup.find_all("div", class_="chapter-content text-normal")[0]
        # print(''.join(help.findAll(text=True)))
        # print(help)
    except Exception as e:
        return e
    return True

if __name__ == "__main__":
    # seed_url = "http://books.toscrape.com/catalogue/page-{}.html"
    seed_url = "http://wnmtl.org/chapter/324909-heavenly-wolf-valley.html"
    # seed_url = "http://wnmtl.org/chapter/{}.html"
    print("Web scraping has begun")
    result = browse_and_scrape(seed_url)
    if result == True:
        print("Web scraping is now complete!")
    else:
        print(f"Oops, That doesn't seem right!!! - {result}")
All the commented-out stuff is things I've been trying to rip the text from the tag. From my inspection of the developer console in the browser, all the text is in the tag with the id chapterContent. My plan is to iteratively get the text, store it, get the link for the next page, and repeat, but I've been stuck for a bit now. Any suggestions?
Instead of scraping each page, you can directly get the text from this API endpoint using requests:
https://api.mystorywave.com/story-wave-backend/api/v1/content/chapters/324909
The last part of the above URL is the chapter ID (324909). You can navigate between chapters by passing in different chapter IDs.
The next and prev chapter IDs are present in the current chapter's API response. Have a look at the above URL in a browser to understand it better.
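A quick way to see those fields without a browser is a two-liner (the field names are the ones used in the full code below):

import requests
data = requests.get('https://api.mystorywave.com/story-wave-backend/api/v1/content/chapters/324909').json()['data']
print(data['preId'], data['nextId'], data['title'])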
Here is the full recursive code that writes the text from 3 pages to a file called novel.txt. You may change the number of pages and other details as per your need.
import requests

def get_data(chapter_id, pages):
    if pages == 0:
        return
    url = 'https://api.mystorywave.com/story-wave-backend/api/v1/content/chapters/' + str(chapter_id)
    r = requests.get(url)
    x = r.json()
    pre_id = x['data']['preId']
    next_id = x['data']['nextId']
    title = x['data']['title']
    content = x['data']['content']
    chapter_title = f'\n***** Chapter: {title} *****\n'
    with open('novel.txt', 'a') as f:
        f.write(chapter_title)
        f.write(content + '\n')
    print(f"Chapter: '{title}' written to file.")
    get_data(next_id, pages - 1)

curr_id = '324909'
get_data(curr_id, 3)
Output:
Chapter: 'Heavenly Wolf Valley' written to file.
Chapter: 'Leaving' written to file.
Chapter: 'Pure Fabrication' written to file.
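One caveat with the recursive version: Python's default recursion limit (roughly 1000 frames) would get in the way if you ever pulled thousands of chapters in one run. Here is the same logic sketched as a plain loop, under the same assumptions about the API:

import requests

def get_data_iterative(chapter_id, pages):
    base = 'https://api.mystorywave.com/story-wave-backend/api/v1/content/chapters/'
    with open('novel.txt', 'a') as f:
        for _ in range(pages):
            x = requests.get(base + str(chapter_id)).json()
            f.write(f"\n***** Chapter: {x['data']['title']} *****\n")
            f.write(x['data']['content'] + '\n')
            chapter_id = x['data']['nextId']  # follow the chain to the next chapter

get_data_iterative('324909', 3)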

Python - Beginner Scraping with Beautiful Soup 4 - onmouseover

I'm a beginner Python (3) user and I'm currently trying to scrape some sports stats for my fantasy football season. Previously I did this in a roundabout way (downloading in HTTrack, converting to Excel and using VBA to combine my data), but now I'm trying to learn Python to improve my coding abilities.
I want to scrape this page but am running into some difficulty in selecting only the rows/tables I want. Here is how my code currently stands; it still has a bit of leftover code from playing around with it.
from urllib.request import urlopen # import the library
from bs4 import BeautifulSoup # Import BS
from bs4 import SoupStrainer # Import Soup Strainer
page = urlopen('http://www.footywire.com/afl/footy/ft_match_statistics?mid=6172') # access the website
only_tables = SoupStrainer('table') # parse only table elements when parsing
soup = BeautifulSoup(page, 'html.parser') # parse the html
# for row in soup('table', {'class': 'tbody'}[0].tbody('tr')):
#     tds = row('td')
#     print(tds[0].string, tds[1].string)
# create variables to keep the data in
team = []
player = []
kicks = []
handballs = []
disposals = []
marks = []
goals = []
tackles = []
hitouts = []
inside50s = []
freesfor = []
freesagainst = []
fantasy = []
supercoach = []
table = soup.find_all('tr')
# print(soup.prettify())
print(table)
Right now I can select all 'tr' from the page, however I'm having trouble only selecting the rows which have the following attribute:
<tr bgcolor="#ffffff" onmouseout="this.bgColor='#ffffff';" onmouseover="this.bgColor='#cbcdd0';">
"onmouseover" seems to be the only attribute which is common/unique to the table I want to scrape.
Does anyone know how I can alter this line of code, to select this attribute?
table = soup.find_all('tr')
From here I am confident I can place the data into a dataframe which hopefully I can export to CSV.
Any help would be greatly appreciated as I have looked through the BS4 documentation with no luck.
As explained in the BeautifulSoup documentation, you may use this:
table = soup.findAll("tr", {"bgcolor": "#ffffff", "onmouseout": "this.bgColor='#ffffff';", "onmouseover": "this.bgColor='#cbcdd0';"})
Alternatively, you can also use the following approach:
tr_tag = soup.findAll(
    lambda tag: tag.name == "tr"
    and tag.get("bgcolor") == "#ffffff"
    and tag.get("onmouseout") == "this.bgColor='#ffffff';"
    and tag.get("onmouseover") == "this.bgColor='#cbcdd0';")
The advantage of this approach is that it uses the full power of BS4's filter functions: you can match on any combination of conditions, not just exact attribute values.
Check this:
soup.find_all("tr", attrs={"onmouseover" : "this.bgColor='#cbcdd0';"})

Using BeautifulSoup to scrape li's and id's in same method

How would I modify the parameters of the findAll method to read both lis and ids? lis are elements and ids are attributes, correct?
#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

###################### SURFLINE URL STRINGS AND TAG ###########################
slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'
slTagText = 'observed-wave-range'
slTag = 'id'

#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl, slCardiffUrl]
###############################################################################

#################### MAGICSEAWEED URL STRINGS AND TAG #########################
msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'
msTagText = 'rating-text text-dark'
msTag = 'li'

#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]
###############################################################################

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each break
tag:     the html tag where the actual report lives on the page

returns: a list of strings of each break's surf report
'''
def extract_Reports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []

    #loop thru URLs
    for url in urlList:
        try:
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where report lives
            reportTag = soup.findAll(id=tagText)

            for report in reportTag:
                reports.append(report.string.strip())

        #notify if fail
        except:
            print 'scrape failure'
            pass

    return reports
#END METHOD

slReports = extract_Reports(slRootUrl, slUrls, slTag, slTagText)
msReports = extract_Reports(msRootUrl, msUrls, msTag, msTagText)

print slReports
print msReports
As of right now, only slReports prints correctly because I have it explicitly set to id=tagText. I am also aware that my tag parameter is not currently used.
So the problem is that you want to search the parse tree for elements that have either a class name of rating-text (it turns out you do not need text-dark to identify the relevant elements in the case of Magicseaweed) or an ID of observed-wave-range, using a single findAll call.
You can use a filter function to achieve this:
def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')
Then change your extract_Reports function to read:
reportTag = soup.findAll(reportTagFilter)[0]
reports.append(reportTag.text.strip())
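As a side note, the same either/or match can also be written as a CSS selector group, since BeautifulSoup's select() accepts comma-separated selectors:

reportTag = soup.select('.rating-text, #observed-wave-range')[0]
reports.append(reportTag.get_text().strip())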

urlopen for loop with beautifulsoup

New user here. I'm starting to get the hang of Python syntax but keep getting thrown off by for loops. I understand each scenario I've read on SO thus far (and my previous examples), but can't seem to come up with one for my current scenario.
I am playing around with BeautifulSoup to extract features from app stores as an exercise.
I created a list of both GooglePlay and iTunes urls to play around with.
list = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
"https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
"https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
"https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
"https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
"https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
"https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
"https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
"https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
"https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
"https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}
To test out beautifulsoup (bs in my code), I used one app for each store:
gptest = bs(urllib.urlopen("https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en"))
ios = bs(urllib.urlopen("https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8"))
I found an app's category on iTunes using:
print ios.find(itemprop="applicationCategory").get_text()
...and on Google Play:
print gptest.find(itemprop="genre").get_text()
With this newfound confidence, I wanted to try to iterate through my entire list and output these values, but then I realized I suck at for loops...
Here's my attempt:
def opensite():
    for item in list:
        bs(urllib.urlopen())

for item in list:
    try:
        if "itunes.apple.com" in row:
            print "Category:", opensite.find(itemprop="applicationCategory").get_text()
        else if "play.google.com" in row:
            print "Category", opensite.find(itemprop="genre").get_text()
    except:
        pass
Note: Ideally I'd be passing a csv (called "sample" with one column "URL") so I believe my loop would start with
for row in sample.URL:
but I figured it was more helpful to show you a list rather than deal with a data frame.
Thanks in advance!
from __future__ import print_function            #
try:                                             #
    from urllib import urlopen                   # Support Python 2 and 3
except ImportError:                              #
    from urllib.request import urlopen           #
from bs4 import BeautifulSoup as bs

for line in open('urls.dat'):                    # Read urls from file line by line
    doc = bs(urlopen(line.strip()), 'html5lib')  # Strip \n from url, open it and parse
    if 'apple.com' in line:
        prop = 'applicationCategory'
    elif 'google.com' in line:
        prop = 'genre'
    else:
        continue
    print(doc.find(itemprop=prop).get_text())
Try this for reading urls from list:
from bs4 import BeautifulSoup as bs
import requests

list = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
        "https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
        "https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
        "https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
        "https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
        "https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
        "https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
        "https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
        "https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
        "https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
        "https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}

def opensite():
    for item in list:
        source = requests.get(item)
        text_new = source.text
        soup = bs(text_new, "html.parser")
        try:
            if "itunes.apple.com" in item:
                print item, "Category:", soup.find('span', {'itemprop': 'applicationCategory'}).text
            elif "play.google.com" in item:
                print item, "Category:", soup.find('span', {'itemprop': 'genre'}).text
        except:
            pass

opensite()
It will print something like:
https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en Category: Role Playing
https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en Category: Role Playing
https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en Category: Role Playing
https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en Category: Role Playing
https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8 Category: Games
https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en Category: Role Playing
https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en Category: Role Playing
https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8 Category: Games
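For the CSV case mentioned in the question (a file called "sample" with one column "URL"), a minimal pandas-based sketch of the same loop might look like this (Python 3 syntax; the filename sample.csv is an assumption):

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

sample = pd.read_csv('sample.csv')
for url in sample['URL']:
    soup = bs(requests.get(url).text, 'html.parser')
    # pick the itemprop that matches the store the URL belongs to
    prop = 'applicationCategory' if 'itunes.apple.com' in url else 'genre'
    tag = soup.find(itemprop=prop)
    if tag is not None:
        print(url, 'Category:', tag.get_text())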
