How to use the Beautiful Soup find function to extract HTML elements - Python

I am trying to use beautiful soup to pull the table corresponding to the HTML code below
<table class="sortable stats_table now_sortable" id="team_pitching" data-cols-to-freeze=",2">
<caption>Team Pitching</caption>
from https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2.
I was using the code
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
res = requests.get(url)
soup1 = BS(res.content, 'html.parser')
table1 = soup1.find('table',{'id':'team_pitching'})
table1
I can't seem to figure out how to get this working. The batting table on the page can be extracted with the line
table1 = soup1.find('table',{'id':'team_batting'})
and I figured similar code should work for the pitching table. Additionally, is there a way to extract this using the table class "sortable stats_table now_sortable" rather than the id?

The problem is that if you open the page normally it shows all the tables, but if you inspect what is actually loaded with Developer Tools, only the first table is present. So when you make your request, the remaining tables are not included in the HTML you get back. The table you're looking for is not shown until the "Show team pitching" button is pressed; to handle this you could use Selenium and get the fully rendered HTML.

That is because the table you are looking for - i.e. the <table> with id="team_pitching" - is present as a comment inside the soup. You can check this for yourself by printing the soup.
You need to:
1. Extract that comment from the soup
2. Convert it into a soup object
3. Extract the table data from that soup object
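The same three steps can be tried offline on a minimal snippet (the HTML here is made up for illustration, mimicking how the site ships its tables inside comments):

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup only: a table hidden inside an HTML comment
html = '''<div id="all_team_pitching">
<!--
<table id="team_pitching"><tr><td>Tyler Ahearn</td></tr></table>
-->
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Step 0: a direct search fails, because the table only exists inside the comment
print(soup.find('table', {'id': 'team_pitching'}))  # None

# Step 1: extract the comment from the surrounding <div>
comment = soup.find('div', {'id': 'all_team_pitching'}).find(
    string=lambda text: isinstance(text, Comment))

# Step 2: re-parse the comment text as its own soup
inner = BeautifulSoup(comment, 'html.parser')

# Step 3: now the table is searchable
print(inner.find('table', {'id': 'team_pitching'}).td.text)  # Tyler Ahearn
```

A plain find() returns None because BeautifulSoup does not descend into comments; only after re-parsing the comment text does the table become searchable.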
Here is the complete code that performs the above-mentioned steps.
from bs4 import BeautifulSoup, Comment
import requests

url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
main_div = soup.find('div', {'id': 'all_team_pitching'})

# Extracting the comment from the above selected <div>
for comment in main_div.find_all(text=lambda x: isinstance(x, Comment)):
    temp = comment.extract()

# Converting the above extracted comment to a soup object
s = BeautifulSoup(temp, 'lxml')
trs = s.find('table', {'id': 'team_pitching'}).find_all('tr')

# Printing the first four data rows of the table
for tr in trs[1:5]:
    print(list(tr.stripped_strings))
The first four data rows of the table:
['1', 'Tyler Ahearn', '21', '1', '0', '1.000', '1.93', '6', '0', '0', '1', '9.1', '8', '5', '2', '0', '4', '14', '0', '0', '0', '42', '1.286', '7.7', '0.0', '3.9', '13.5', '3.50']
['2', 'Jack Anderson', '20', '2', '0', '1.000', '0.79', '4', '1', '0', '0', '11.1', '6', '4', '1', '0', '3', '11', '1', '0', '0', '45', '0.794', '4.8', '0.0', '2.4', '8.7', '3.67']
['3', 'Shane Drohan', '*', '21', '0', '1', '.000', '4.08', '4', '4', '0', '0', '17.2', '15', '12', '8', '0', '11', '27', '1', '0', '2', '82', '1.472', '7.6', '0.0', '5.6', '13.8', '2.45']
['4', 'Conor Grady', '21', '2', '0', '1.000', '3.00', '4', '4', '0', '0', '15.0', '10', '5', '5', '3', '8', '15', '1', '0', '2', '68', '1.200', '6.0', '1.8', '4.8', '9.0', '1.88']

Related

How to find a specific pattern in a list

I'm trying to create a PDF reader in Python. I've already read the PDF and
have a list with its content, and I now want to get back the numbers with eleven digits, like 123.456.789-33 or 124.323.432.33
from PyPDF2 import PdfReader
import re
reader = PdfReader(r"\\abcdacd.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
num = re.findall(r'\d+', text)
print(num)
here's the output:
['01', '01', '2000', '26', '12', '2022', '04483203983', '044', '832', '039', '83', '20210002691450', '5034692', '79', '2020', '8', '24', '0038', '1', '670', '03', '2', '14', '2', '14', '1', '670', '03', '2', '14', '2', '14', '1', '1', '8', '21', '1']
If someone could help me, I'll be really thankful.
Change regex pattern to the following (to match groups of digits):
s = 'text text 123.456.789-33 or 124.323.432.33 text or 12323112333 or even 123,231,123,33 '
num = re.findall(r'\d{3}[.,]?\d{3}[.,]?\d{3}[.,-]?\d{2}', s)
print(num)
['123.456.789-33', '124.323.432.33', '12323112333', '123,231,123,33']
You can try:
\b(?:\d[.-]*){11}\b
import re
s = '''\
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9'''
pat = re.compile(r'\b(?:\d[.-]*){11}\b')
for m in pat.findall(s):
    print(m)
Prints:
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9
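As a side-by-side sanity check (not part of either answer), both patterns can be run over the asker's examples:

```python
import re

samples = 'text 123.456.789-33 or 124.323.432.33 text or 12323112333 or even 123,231,123,33 '

# First answer: three groups of three digits plus two digits, optional ./,/- separators
pat1 = re.findall(r'\d{3}[.,]?\d{3}[.,]?\d{3}[.,-]?\d{2}', samples)
print(pat1)  # ['123.456.789-33', '124.323.432.33', '12323112333', '123,231,123,33']

# Second answer: exactly eleven digits, each optionally followed by . or -
pat2 = re.findall(r'\b(?:\d[.-]*){11}\b', samples)
print(pat2)  # ['123.456.789-33', '124.323.432.33', '12323112333']
```

Note that the second pattern allows only `.` and `-` between digits, so the comma-separated variant is not matched by it; widen the character class if commas should count too.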

How to iterate through all tags of a website in Python with BeautifulSoup?

I'm a newbie in this sector. Here is the website I need to crawl: "http://py4e-data.dr-chuck.net/comments_1430669.html", and its source can be viewed at "view-source:http://py4e-data.dr-chuck.net/comments_1430669.html"
It's a simple website for practice. The HTML code looks something like:
<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
I need to get the numbers inside the <span class="comments"> elements (100, 100, 99)
Below is my code:
html=urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup=BeautifulSoup(html,'html.parser')
tag=soup.span
print(tag) #<span class="comments">100</span>
print(tag.string) #100
I got the number 100, but only the first one. Now I want to get all of them by iterating through a list or something like that. What is the method to do this with BeautifulSoup?
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html,'html.parser')
tags = soup.find_all("span")
for i in tags:
    print(i.string)
You can use the find_all() function and then iterate over the results to get the numbers.
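The same idea can be tried offline on the HTML fragment shown in the question (inlined here so no request is needed):

```python
from bs4 import BeautifulSoup

# The table fragment from the question, inlined for a self-contained demo
html = '''<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; .string gives each tag's text
numbers = [tag.string for tag in soup.find_all('span')]
print(numbers)  # ['100', '100', '99']
```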
If you also want the names, you can use a Python dictionary:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html,'html.parser')
tags = soup.find_all("span")
comments = {}
for index, tag in enumerate(tags):
    commentorName = tag.find_previous('tr').text
    commentorComments = tag.string
    comments[commentorName] = commentorComments
print(comments)
This will give you the output:
{'Melodie100': '100', 'Machaela100': '100', 'Rhoan99': '99', 'Murrough96': '96', 'Lilygrace93': '93', 'Ellenor93': '93', 'Verity89': '89', 'Karlie88': '88', 'Berlin85': '85', 'Skylar84': '84', 'Benny84': '84', 'Crispin81': '81', 'Asya79': '79', 'Kadi76': '76', 'Dua74': '74', 'Stephany73': '73', 'Eila71': '71', 'Jennah70': '70', 'Eduardo67': '67', 'Shannan61': '61', 'Chymari60': '60', 'Inez60': '60', 'Charlene59': '59', 'Rosalin54': '54', 'James53': '53', 'Rhy53': '53', 'Zein52': '52', 'Ayren50': '50', 'Marissa46': '46', 'Mcbride46': '46', 'Ruben45': '45', 'Mikee41': '41', 'Carmel38': '38', 'Idahosa37': '37', 'Brooklin37': '37', 'Betsy36': '36', 'Kayah34': '34', 'Szymon26': '26', 'Tea24': '24', 'Queenie24': '24', 'Nima23': '23', 'Eassan23': '23', 'Haleema21': '21', 'Rahma17': '17', 'Rob17': '17', 'Roma16': '16', 'Jeffrey14': '14', 'Yorgos12': '12', 'Denon11': '11', 'Jasmina7': '7'}
Try the following approach:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data.append(row[1])  # or data.append(row) for both
print(data)
Giving you data holding a list containing just the one column:
['Comments', '100', '100', '99', '96', '93', '93', '89', '88', '85', '84', '84', '81', '79', '76', '74', '73', '71', '70', '67', '61', '60', '60', '59', '54', '53', '53', '52', '50', '46', '46', '45', '41', '38', '37', '37', '36', '34', '26', '24', '24', '23', '23', '21', '17', '17', '16', '14', '12', '11', '7']
First locate all of the table <tr> rows. Then extract all of the <td> values for each row. As you only want the second one, append row[1] to a data list holding your values.
You can skip the first (header) entry if needed with data[1:].
This approach would also let you save the name at the same time by appending the whole row, e.g. use data.append(row) instead.
You could then display the entries using:
for name, comment in data[1:]:
    print(name, comment)
Giving output starting:
Melodie 100
Machaela 100
Rhoan 99
Murrough 96
Lilygrace 93
Ellenor 93
Verity 89
Karlie 88

I can't locate a recurring element from a bs4 object

The issue I am having is driving me crazy. I am trying to pull text from the Pro Football Reference website.
The information I need is in a td element displaying QB hurries, in the second section of the web page. The td element's data-stat attribute is qb_hurry. Here is what I have so far:
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
I tried
totalQbHurrys = soup.find('div', {'id':'all_detailed_defense'})
and I can see the information I need to pull when I parse through the beautiful soup object and print it. But when I try to retrieve the td element I need
totalQbHurrys = soup.find('div', {'id':'all_detailed_defense'}).find('td', {'data-stat':'qb_hurry'})
it returns None, I think the text I am looking for exists as a comment first, but I am having trouble getting to the actual HTML element I need. Would anyone know of a way to target the qb_hurry element successfully?
The issue is that this field is inside HTML comment tag.
Here is a resolution :
import bs4
import requests
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
extract = soup.find('div', {'id':'all_detailed_defense'})
for comments in extract.find_all(text=lambda text: isinstance(text, bs4.Comment)):
    comments.extract()
soup2 = bs4.BeautifulSoup(comments, 'html.parser')
totalQbHurrys = soup2.find('td', {'data-stat': 'qb_hurry'})
print(totalQbHurrys)
PS: I have used this trick : https://stackoverflow.com/a/52874885/2186074
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import pandas as pd
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://www.pro-football-reference.com/players/D/DonaAa00.htm")
df = pd.read_html(driver.page_source, attrs={
    'class': 'row_summable sortable stats_table now_sortable'}, header=1)[0]
print(df.loc[1, 'Hrry'])
driver.quit()
Output:
32
The HTML you need is inside a comment so will not be directly visible in the soup. You need to first grab the comment and then parse this as a new soup object. From this you can then locate the tr and th elements. For example:
from bs4 import BeautifulSoup, Comment
import requests
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find('div', {'id':'all_detailed_defense'})
comment_html = div.find(string=lambda text: isinstance(text, Comment))
comment_soup = BeautifulSoup(comment_html, 'html.parser')
for tr in comment_soup.find_all('tr'):
    row = [td.text for td in tr.find_all(['td', 'th'])]
    print(row)
Giving you:
['', 'Games', 'Pass Coverage', 'Pass Rush', 'Tackles']
['Year', 'Age', 'Tm', 'Pos', 'No.', 'G', 'GS', 'Int', 'Tgt', 'Cmp', 'Cmp%', 'Yds', 'Yds/Cmp', 'Yds/Tgt', 'TD', 'Rat', 'DADOT', 'Air', 'YAC', 'Bltz', 'Hrry', 'QBKD', 'Sk', 'Prss', 'Comb', 'MTkl', 'MTkl%']
['2018*+', '27', 'LAR', 'DT', '99', '16', '16', '0', '1', '0', '0.0%', '0', '', '0.0', '0', '39.6', '-2.0', '0', '0', '0', '30', '19', '20.5', '70', '59', '6', '9.2%']
['2019*+', '28', 'LAR', 'DT', '99', '16', '16', '0', '0', '0', '', '0', '', '', '0', '', '', '0', '0', '0', '32', '9', '12.5', '55', '48', '6', '11.1%']

How can I access a URL using python requests when the page changes when I request it, even though the parameters are in the URL?

I'm trying to scrape the following website:
http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=2018&batting_team=119&batter=571771&pitching_team=133&pitcher=641941
(this is an example URL with a certain pitcher/batter matchup)
I'm able to enter the player codes and team codes easily with this function:
def matchupURL(season, batter, batterTeam, pitcher, pitcherTeam):
    return ("http://mlb.mlb.com/stats/sortable_batter_vs_pitcher.jsp#season=" + str(season)
            + "&batting_team=" + str(teamNumDict[batterTeam]) + "&batter=" + str(batter)
            + "&pitching_team=" + str(teamNumDict[pitcherTeam]) + "&pitcher=" + str(pitcher))
which works nicely, and the returned string works when pasted into my browser.
But when I make a request like
newURL = matchupURL(2018, i.id, x.home_team, j.id, x.away_team)
print(i + " vs " + j)
newSes = requests.get(newURL)
html = BeautifulSoup(newSes.text, "lxml")
mydivs = html.findAll("td", {"class": "dg-ops"})
#do something with this div
I'm unable to find the div. In fact, the entire format of the returned HTML changes. Further, adding headers didn't help, nor did using urllib instead of requests.
This page is dynamic, i.e., the content is generated by JavaScript and rendered in the front end. That is the reason you can't detect the div tag.
But in this case you can scrape more easily. With the inspect tool of your browser you can see that the data comes from a GET request to a URL. For your example, you only have to provide the player ids:
import requests
url = 'http://lookup-service-prod.mlb.com/json/named.stats_batter_vs_pitcher_composed.bam'
params = {"sport_code":"'mlb'","game_type":"'R'","player_id":"571771","pitcher_id":"641941"}
resp = requests.get(url, params=params).json()
print(resp)
That prints:
{'stats_batter_vs_pitcher_composed': {'stats_batter_vs_pitcher_total': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'bats': 'R', 'xbh': '1', 'bb': '0', 'slg': '4.000', 'avg': '1.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sf': '0', 'tpa': '1', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'go': '0'}}}, 'copyRight': ' Copyright 2018 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt ', 'stats_batter_vs_pitcher': {'queryResults': {'created': '2018-04-12T22:21:47', 'totalSize': '1', 'row': {'hr': '1', 'gidp': '0', 'pitcher_first_last_html': 'Emilio Pagán', 'player': 'Hernandez, Enrique', 'np': '4', 'sac': '0', 'pitcher': 'Pagan, Emilio', 'rbi': '1', 'opponent': 'Oakland Athletics', 'player_first_last_html': 'Enrique Hernández', 'tb': '4', 'xbh': '1', 'bats': 'R', 'bb': '0', 'avg': '1.000', 'slg': '4.000', 'pitcher_id': '641941', 'ops': '5.000', 'hbp': '0', 'pitcher_html': 'Pagán, Emilio', 'g': '', 'd': '0', 'so': '0', 'throws': 'R', 'sport': 'MLB', 'sf': '0', 'team': 'Los Angeles Dodgers', 'tpa': '1', 'league': 'NL', 'h': '1', 'cs': '0', 'obp': '1.000', 't': '0', 'ao': '0', 'season': '2018', 'r': '1', 'go_ao': '-.--', 'sb': '0', 'opponent_league': 'AL', 'player_html': 'Hernández, Enrique', 'sbpct': '.---', 'player_id': '571771', 'ibb': '0', 'ab': '1', 'opponent_id': '133', 'team_id': '119', 'go': '0', 'opponent_sport': 'MLB'}}}}}
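Once you have the parsed JSON, the stats sit several dictionary levels deep; indexing into them looks like this (sketched on a trimmed copy of the response above, so it runs without a network call):

```python
# Trimmed copy of the JSON response shown above, for a self-contained example
resp = {'stats_batter_vs_pitcher_composed': {
    'stats_batter_vs_pitcher_total': {
        'queryResults': {'totalSize': '1',
                         'row': {'hr': '1', 'avg': '1.000', 'ops': '5.000'}}}}}

# Walk down the nested keys to reach the per-matchup stat row
row = resp['stats_batter_vs_pitcher_composed']['stats_batter_vs_pitcher_total']['queryResults']['row']
print(row['avg'], row['ops'])  # 1.000 5.000
```

Note the values come back as strings, so convert with float() before doing any arithmetic.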

How to get rid of [''] when reading .csv files in python

I am opening and reading one .csv file at a time from a folder and printing them out as follows:
ownerfiles = os.listdir(filepath)
for ownerfile in ownerfiles:
    if ownerfile.endswith(".csv"):
        eachfile = filepath + ownerfile  # loops over each file in ownerfiles
        with open(eachfile, 'r', encoding="UTF-8") as input_file:
            next(input_file)
            print(eachfile)
            for idx, line in enumerate(input_file.readlines()):
                line = line.strip().split(",")
                print(line)
However, when I do print(line) the files are printing as follows:
/Users/Sulz/Desktop/MSBA/Applied Data Analytics/Test_File/ownerfile_138.csv
['']
['2010-01-01 11:28:35', '16', '54', '59', '0000000040400', 'O.Coffee Hot Small', 'I', ' ', ' ', '14', '1', '0', '0.3241', '1.4900', '1.4900', '1.4900', '0.0000', '1', '0', '0', '0', '0.0000', '0.0000', '1', '44', '0', '0.00000000', '1', '0', '0', '0.0000', '0', '0', '', '0', '5', '0', '0', '0', '0', 'NULL', '0', 'NULL', '', '0', '20436', '1', '0', '0', '1']
How can I get rid of [''] before the list of all the data ??
EDIT:
I now tried reading it with the .csv module like this:
ownerfiles = os.listdir(filepath)
for ownerfile in ownerfiles:
    if ownerfile.endswith(".csv"):
        eachfile = filepath + ownerfile  # loops over each file in ownerfiles
        with open(eachfile, 'r', encoding="UTF-8") as input_file:
            next(input_file)
            reader = csv.reader(input_file, delimiter=',', quotechar='|')
            for row in reader:
                print(row)
However, it still prints output like this:
[]
['2010-01-01 11:28:35', '16', '54', '59', '0000000040400', 'O.Coffee Hot Small', 'I', ' ', ' ', '14', '1', '0', '0.3241', '1.4900', '1.4900', '1.4900', '0.0000', '1', '0', '0', '0', '0.0000', '0.0000', '1', '44', '0', '0.00000000', '1', '0', '0', '0.0000', '0', '0', '', '0', '5', '0', '0', '0', '0', 'NULL', '0', 'NULL', '', '0', '20436', '1', '0', '0', '1']
That's just Python's list syntax being printed. You are splitting each line on a comma which is generating a list. If you print the line before the split you'll probably get what you're looking for:
line = line.strip()
print(line)
line = line.split(",")
By the way, Python has a built in CSV module for reading and writing csv files, in case you didn't know.
EDIT: Sorry, I misread your question. Add this to the start of your readlines loop:
line = line.strip()
if not line:
    continue
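Equivalently, the csv-module version just needs to skip the empty rows; a minimal sketch using an in-memory file in place of the real .csv:

```python
import csv
import io

# Stand-in for one of the owner files: a blank line followed by a data row
data = "\n2010-01-01 11:28:35,16,54,59\n"

reader = csv.reader(io.StringIO(data))
rows = [row for row in reader if row]  # 'if row' drops the empty [] rows
print(rows)  # [['2010-01-01 11:28:35', '16', '54', '59']]
```

The blank line parses to an empty list, which is falsy, so filtering with `if row` removes it.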
