Cannot scrape some tables using Pandas - Python

I'm more than a noob in Python; I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using Pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing. Inspecting with Chrome they seem to be tables, and I can also get them in Excel by importing from the web.
What am i missing here?
Any help appreciated, thanks!

If you look at the page source (not by inspecting), you'd see those tables are within comments in the html. You can either a) edit the html string and remove the <!-- and --> markers, then let pandas parse it, or b) use bs4 to pull out the comments, then parse the tables from those.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
response = requests.get(url).text.replace("<!--","").replace("-->","")
dfs = pd.read_html(response, header=1)
You can see you now have 21 tables, with the 4th and 5th being the ones in question:

print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')

Output:
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
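As a variant on option 1, you can ask pandas for just those two tables by matching on the table's id attribute via read_html's attrs argument. A minimal sketch; the ids 'line_score' and 'four_factors' are read from the page source, so treat them as an assumption about this particular page:

import requests
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
# strip the comment markers so the hidden tables are visible to the parser
html = requests.get(url).text.replace('<!--', '').replace('-->', '')

# attrs filters read_html to tables whose attributes match;
# the ids below are an assumption taken from the page source
line_score = pd.read_html(html, header=1, attrs={'id': 'line_score'})[0]
four_factors = pd.read_html(html, header=1, attrs={'id': 'four_factors'})[0]
print(line_score)
print(four_factors)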
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

# the 19 visible tables
dfs = pd.read_html(result, header=1)

# the hidden tables live inside HTML comments
comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:
            continue
for each in other_tables:
    print(each, '\n')

Output:
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
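Whichever option you choose, the visible tables and the recovered ones can be combined into a single list if that's more convenient:

all_tables = dfs + other_tables  # 19 visible tables plus the 2 recovered from comments
print(len(all_tables))           # 21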

Related

Scrape web with info from several years and create a csv file for each year

I have scraped information with the results of the 2016 Chess Olympiad, using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets all the column headers of our table, but just for the first eleven columns in the webpage
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)

# We grab data from the fourth row on; the previous rows belong to the headers
for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row
I want to do the same thing for the results of 2014 and 2012 (the Olympiad is normally played every two years), automatically. I have gotten the code about halfway there, but I really don't know how to continue. This is what I've done so far.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets all the column headers of our table
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)

start_year = 2012
i = 2
end_year = 2016

def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # We grab data from the fourth row on; the previous rows belong to the headers
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row

while start_year < end_year:
    download_chess(start_year)
    start_year += i
download_chess(start_year)
I don't have much experience, so I don't quite understand the logic of building the filenames. I hope you can help me.
The following will retrieve information for a range of years - in this case, 2000 through 2018 - and save each table to csv as well:
import pandas as pd

years = range(2000, 2019, 2)
for y in years:
    try:
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]  # the real column names sit in the third row
        df = df[3:]              # drop the header rows from the data
        df.columns = new_header
        print(df)
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
This will print out the results table for each year:
  no.     team   Elo flag code pos.  pts   Buch  MP gms nan   +  =  - nan   +   =  - nan     % Eloav Elop ind.medals
3   1   Russia  2685  nan  RUS    1   38  457.5  20  56 nan   8  4  2 nan  23  30  3 nan  67.9  2561 2694  1 - 0 - 2
4   2  Germany  2604  nan  GER    2   37  455.5  22  56 nan  10  2  2 nan  21  32  3 nan  66.1  2568 2685  0 - 0 - 2
5   3  Ukraine  2638  nan  UKR    3  35½  457.5  21  56 nan   8  5  1 nan  18  35  3 nan  63.4  2558 2653  1 - 0 - 0
6   4  Hungary  2661  nan  HUN    4  35½  455.5  21  56 nan   8  5  1 nan  22  27  7 nan  63.4  2570 2665  0 - 0 - 0
7   5   Israel  2652  nan  ISR    5  34½  463.5  20  56 nan   7  6  1 nan  17  35  4 nan  61.6  2562 2649  0 - 0 - 0
[...]
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
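If you also want everything in one file, the per-year frames can be concatenated. A sketch, assuming the tables share the same layout across years; the 'year' column is added here purely for bookkeeping:

import pandas as pd

frames = []
for y in range(2000, 2019, 2):
    df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
    df.columns = df.iloc[2]  # promote the third row to column names
    df = df[3:]              # drop the header rows from the data
    df['year'] = y           # remember which Olympiad each row came from
    frames.append(df)

all_years = pd.concat(frames, ignore_index=True)
all_years.to_csv('chess_olympics_all.csv', index=False)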

Python Beautiful Soup Webscraping: Cannot get a full table to display

I am relatively new to Python and this is my first web scrape. I am trying to scrape a table and can only get the first column to show up. I am using the find method instead of find_all, which I am pretty sure is what is causing this, but when I use the find_all method I cannot get any text to display. Here is the url I am scraping from: https://www.fangraphs.com/teams/mariners/stats
I am trying to get the top table (Batting Stat Leaders) to work. My code is below:
from bs4 import BeautifulSoup
import requests
import time

htmlText = requests.get('https://www.fangraphs.com/teams/mariners/stats').text
soup = BeautifulSoup(htmlText, 'lxml')
playerTable = soup.find('div', class_='team-stats-table')
input = input("Would you like to see Batting, Starting Pitching, Relief Pitching, or Fielding Stats? \n")

def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all('tr')[1:55]:
        tds = tr.find('td').text
        print(tds)

if input == "Batting" or "batting":
    BattingStats()
You can use a list comprehension to get the text from every cell in each row:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fangraphs.com/teams/mariners/stats").text
soup = BeautifulSoup(html, "lxml")
playerTable = soup.find("div", class_="team-stats-table")

def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all("tr")[1:55]:
        tds = [td.text for td in tr.select("td")]
        print(tds)

BattingStats()
Prints:
BATTING STATS:
Player Name:
Mitch Haniger 30 94 406 25 0 6.7% 23.4% .257 .291 .268 .323 .524 .358 133 0.2 16.4 -6.5 2.4
Ty France 26 89 372 9 0 7.3% 16.9% .150 .314 .276 .355 .426 .341 121 0.0 9.5 -2.6 2.0
Kyle Seager 33 97 403 18 2 8.4% 25.8% .201 .246 .215 .285 .416 .302 95 -0.3 -2.9 5.4 1.6
...
Solution with pandas:
import pandas as pd
url = "https://www.fangraphs.com/teams/mariners/stats"
df = pd.read_html(url)[7]
print(df)
Prints:
Name Age G PA HR SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR
0 Mitch Haniger 30 94 406 25 0 6.7% 23.4% 0.257 0.291 0.268 0.323 0.524 0.358 133.0 0.2 16.4 -6.5 2.4
1 Ty France 26 89 372 9 0 7.3% 16.9% 0.150 0.314 0.276 0.355 0.426 0.341 121.0 0.0 9.5 -2.6 2.0
2 Kyle Seager 33 97 403 18 2 8.4% 25.8% 0.201 0.246 0.215 0.285 0.416 0.302 95.0 -0.3 -2.9 5.4 1.6
...
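Hardcoding index 7 is brittle if FanGraphs changes the page layout. As a sketch, you could instead pick the table by its columns; 'Name' and 'PA' are assumptions based on the output above:

import pandas as pd

url = "https://www.fangraphs.com/teams/mariners/stats"
dfs = pd.read_html(url)

# pick the first table whose columns include the batting headers;
# the column names here are an assumption, not guaranteed by the site
batting = next(df for df in dfs if {"Name", "PA"}.issubset(df.columns))
print(batting.head())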

How come my web scraping results won't print after looping through each web page?

I am writing a Python script that uses Selenium to parse each page of basketball stats on ESPN over the last 18 years (each year's stats are on their own web page). I am able to connect to the site and parse without a problem; however, my results are not printed in the terminal while the parsing is occurring. I used a regex checker to make sure the elements I am trying to grab (for now, just the value after "data-idx=" in the html) are correct, and they seem to be, so I am not sure what I am doing wrong. Please see the code below:
import requests
import pandas as pd
import re
import time
from selenium import webdriver

# Initializing parameters and tools
driver = webdriver.Chrome()
url = "https://www.espn.com/nba/stats/player/_/season/$NUM$/seasontype/2/table/offensive/sort/avgPoints/dir/desc"

# Parsing the starting page to calculate total number of pages
starting_URL = url.replace("$NUM$", str(2002))
print("Starting with: " + starting_URL)
driver.get(starting_URL)
starting_page_content = driver.page_source

# Collecting stats from all pages
for i in range(2001, 2020):
    page_URL = url.replace("$NUM$", str(i + 1))
    print("Collecting stats from: " + page_URL)
    driver.get(page_URL)
    time.sleep(1)  # a good practice is to wait a little time between each HTTP request
    page_content = driver.page_source  # getting HTML source of page i
    all_chunks = re.compile(r'Table__TR--sm(.*?)data-idx=\"([^\"]+)\"').findall(page_content)
    if len(all_chunks) > 0:  # if found any
        for chunk in all_chunks:
            # initialization
            player_index = ""
            # parsing index
            indexes = re.compile(r'data-idx=\"([^\"]+)\"stack ', re.S | re.I).findall(str(chunk))
            if len(indexes) > 0:
                player_index = indexes.group(1)[0]
            print(player_index)  # printing collected data to screen
driver.close()
You could use pandas.read_html to get the desired output. ESPN renders the stats as two side-by-side tables (rank/name and the numeric columns), so merging the two frames on their index stitches each row back together:
dfs = pd.read_html(page_content)
pd.DataFrame.merge(*dfs, left_index=True, right_index=True)
RK Name POS GP MIN PTS FGM FGA FG% 3PM ... FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 Allen IversonPHI SG 60 43.7 31.4 11.1 27.8 39.8 1.3 ... 9.8 81.2 4.5 5.5 2.8 0.2 4.0 4 1 0.0
1 2 Shaquille O'NealLAL C 67 36.1 27.2 10.6 18.3 57.9 0.0 ... 10.7 55.5 10.7 3.0 0.6 2.0 2.6 40 0 0.0
2 3 Paul PierceBOS SF 82 40.3 26.1 8.6 19.5 44.2 2.6 ... 7.8 80.9 6.9 3.2 1.9 1.0 2.9 17 0 0.0
A few other suggestions:
don't use regular expressions to parse HTML; see this famous answer
use Selenium's functionality, such as XPath, to locate your elements
you can replace placeholders in strings with Python's built-in string formatting, e.g. define the URL with a {} placeholder and call url.format(i)
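For example, a minimal sketch of the same loop with a {} placeholder instead of $NUM$:

url = ("https://www.espn.com/nba/stats/player/_/season/{}"
       "/seasontype/2/table/offensive/sort/avgPoints/dir/desc")

for year in range(2002, 2021):
    page_URL = url.format(year)  # e.g. .../season/2002/seasontype/2/...
    print("Collecting stats from: " + page_URL)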

Multiple table header <thead> in table <table> and how to scrape data from <thead> as a table row

I'm trying to scrape data from a website, but the table has two sets of rows: the first 2-3 rows of data are in thead and the rest are in tbody. I can easily extract data from one section at a time, but when I try both I get errors like TypeError or AttributeError. By the way, I'm using Python.
here is the code
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.worldometers.info/world-population/"
r = requests.get(url)
print(r)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
print()
print()

live_data = soup.find_all('div', id='maincounter-wrap')
print(live_data)
for i in live_data:
    print(i.text)

table_body = soup.find('thead')
table_rows = table_body.find_all('tr')
table_body_2 = soup.find('tbody')
table_rows_2 = soup.find_all('tr')

year_july1 = []
population = []
yearly_change_in_perchantage = []
yearly_change = []
median_age = []
fertillity_rate = []
density = []  # density (p/km²)
urban_population_in_perchantage = []
urban_population = []

for tr in table_rows:
    td = tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)

for tr in table_rows_2:
    td = tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)

headers = ['year_july1', 'population', 'yearly_change_in_perchantage', 'yearly_change', 'median_age', 'fertillity_rate', 'density', 'urban_population_in_perchantage', 'urban_population']
data_2 = pd.DataFrame(list(zip(year_july1, population, yearly_change_in_perchantage, yearly_change, median_age, fertillity_rate, density, urban_population_in_perchantage, urban_population)), columns=headers)
print(data_2)
data_2.to_csv("C:\\Users\\data_2.csv")
You can try the code below; it generates the required data. Do let me know if you need any clarification:
import requests
import pandas as pd
url = 'https://www.worldometers.info/world-population/'
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
#print(df)
df.to_csv("data.csv", index=False)
which gives me the output below:
print(df)
Year (July 1) Population ... Urban Pop % Urban Population
0 2020 7794798739 ... 56.2 % 4378993944
1 2019 7713468100 ... 55.7 % 4299438618
2 2018 7631091040 ... 55.3 % 4219817318
3 2017 7547858925 ... 54.9 % 4140188594
4 2016 7464022049 ... 54.4 % 4060652683
5 2015 7379797139 ... 54.0 % 3981497663
6 2010 6956823603 ... 51.7 % 3594868146
7 2005 6541907027 ... 49.2 % 3215905863
8 2000 6143493823 ... 46.7 % 2868307513
9 1995 5744212979 ... 44.8 % 2575505235
10 1990 5327231061 ... 43.0 % 2290228096
11 1985 4870921740 ... 41.2 % 2007939063
12 1980 4458003514 ... 39.3 % 1754201029
13 1975 4079480606 ... 37.7 % 1538624994
14 1970 3700437046 ... 36.6 % 1354215496
15 1965 3339583597 ... N.A. N.A.
16 1960 3034949748 ... 33.7 % 1023845517
17 1955 2773019936 ... N.A. N.A.
[18 rows x 9 columns]
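If you would rather stay with BeautifulSoup, you don't have to treat <thead> and <tbody> separately: calling find_all('tr') on the <table> itself returns rows from both sections. A minimal sketch, assuming the first <table> on the page is the one you want:

import requests
from bs4 import BeautifulSoup

url = 'https://www.worldometers.info/world-population/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# taking the first <table> on the page is an assumption about its layout
table = soup.find('table')

# find_all('tr') on the table returns rows from <thead> and <tbody> alike
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip rows that only contain <th> header cells
        rows.append(cells)
print(rows[:3])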

Python: Get html table data by xpath

I feel that extracting data from html tables is extremely difficult and requires a custom build for each site. I would very much like to be proved wrong here.
Is there a simple, pythonic way to extract strings and numbers out of a website using just the url and the xpath of the table of interest?
Example:
url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
xpath_str = '//*[@id="sortabletable"]'
I once had a script that could fetch data from this site, but lost it. As I recall, I was using the tag '' and some string logic... not very pretty.
I know that sites like thingspeak can do these things.
There is a fairly general pattern which you could use to parse many, though not all, tables.
import lxml.html as LH
import requests
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('//th')]        # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('//tr')]                   # 2
    data = [row for row in data if len(row) == len(header)]  # 3
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)
1. You can use table.xpath('//th') to find the column names.
2. table.xpath('//tr') returns the rows, and for each row, tr.xpath('td') returns the elements representing the "cells" of the table.
3. Sometimes you may need to filter out certain rows, such as, in this case, rows with fewer values than the header.
4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:
Pris Adresse Tidspunkt
0 8.04 Brovejen 18 5500 Middelfart 3 min 38 sek
1 7.88 Hovedvejen 11 5500 Middelfart 4 min 52 sek
2 7.88 Assensvej 105 5500 Middelfart 5 min 56 sek
3 8.23 Ejby Industrivej 111 2600 Glostrup 6 min 28 sek
4 8.15 Park Alle 125 2605 Brøndby 25 min 21 sek
5 8.09 Sletvej 36 8310 Tranbjerg J 25 min 34 sek
6 8.24 Vindinggård Center 29 7100 Vejle 27 min 6 sek
7 7.99 * Søndergade 116 8620 Kjellerup 31 min 27 sek
8 7.99 * Gertrud Rasks Vej 1 9210 Aalborg SØ 31 min 27 sek
9 7.99 * Sorøvej 13 4200 Slagelse 31 min 27 sek
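If you are happy to let pandas do the lookup itself, read_html can match tables by their attributes; it matches on tag attributes rather than xpath, but for an id selector the effect is the same:

import pandas as pd

# attrs={'id': ...} restricts parsing to the table with that id
df = pd.read_html('http://www.fdmbenzinpriser.dk/searchprices/5/',
                  attrs={'id': 'sortabletable'})[0]
print(df)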
If you mean all the text:
from bs4 import BeautifulSoup
import requests

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url_str).content
print([x.text for x in BeautifulSoup(r, 'html.parser').find_all("table", attrs={"id": "sortabletable"})])
['Pris\nAdresse\nTidspunkt\n\n\n\n\n* Denne pris er indberettet af selskabet Indberet pris\n\n\n\n\n\n\xa08.24\n\xa0Gladsaxe Møllevej 33 2860 Søborg\n7 min 4 sek \n\n\n\n\xa08.89\n\xa0Frederikssundsvej 356 2700 Brønshøj\n9 min 10 sek \n\n\n\n\xa07.98\n\xa0Gartnerivej 1 7500 Holstebro\n14 min 25 sek \n\n\n\n\xa07.99 *\n\xa0Søndergade 116 8620 Kjellerup\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Gertrud Rasks Vej 1 9210 Aalborg SØ\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Sorøvej 13 4200 Slagelse\n15 min 7 sek \n\n\n\n\xa08.08 *\n\xa0Tørholmsvej 95 9800 Hjørring\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Nordvej 6 9900 Frederikshavn\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Skelmosevej 89 6980 Tim\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Højgårdsvej 2 4000 Roskilde\n15 min 7 sek']
