I am trying to web scrape the second table from this website:
https://fbref.com/en/comps/9/stats/Premier-League-Stats
However, I have only ever managed to extract the information from the first table when trying to access the tables by finding the table tag. Would anyone be able to explain why I cannot access the second table, or show me how to do it?
import requests
from bs4 import BeautifulSoup

url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

tables = soup.find_all("table")
player_table = tables[0]
Something along these lines should do it:
tables = soup.find_all("table") # returns a list of tables
second_table = tables[1]
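Note that the second table may simply not be in the static HTML at all (see the next answer); a guard on the list length makes that failure obvious (a minimal sketch):

tables = soup.find_all("table")  # every <table> in the parsed HTML
if len(tables) > 1:
    second_table = tables[1]
else:
    print("only", len(tables), "table(s) found in the static HTML")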
The table is inside HTML comments <!-- ... -->.
To get the table from comments, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# the table markup sits inside an HTML comment next to the wrapper div; re-parse it
table = BeautifulSoup(soup.select_one('#all_stats_standard').find_next(string=lambda x: isinstance(x, Comment)), 'html.parser')

# print some information from the table to screen:
for tr in table.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    print('{:<30}{:<20}{:<10}'.format(tds[0], tds[3], tds[5]))
Prints:
Patrick van Aanholt Crystal Palace 1990
Max Aarons Norwich City 2000
Tammy Abraham Chelsea 1997
Che Adams Southampton 1996
Adrián Liverpool 1987
Sergio Agüero Manchester City 1988
Albian Ajeti West Ham 1997
Nathan Aké Bournemouth 1995
Marc Albrighton Leicester City 1989
Toby Alderweireld Tottenham 1989
...and so on.
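If you would rather not target a specific wrapper id, a more general approach (a sketch, assuming all hidden tables on the page sit inside HTML comments) is to collect every comment and re-parse it:

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# gather every <table> hidden inside an HTML comment
hidden_tables = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    hidden_tables.extend(BeautifulSoup(comment, 'html.parser').find_all('table'))

print(len(hidden_tables))  # number of commented-out tables on the page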
I have one of those nightmare tables with no class given for the tr and td tags.
A sample page is here: https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m
(You'll see in the code below that I'm getting multiple pages, but that's not the problem.)
I want the team name (nothing else) from each bracket. The output should be:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
etc.
I've been able to get every td in the specified tables. But every attempt to use [0] to get the first td of every row gives me an "index out of range" error.
The code is:
import requests
import csv
from bs4 import BeautifulSoup

batch_size = 2
urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m', 'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']

# iterate through urls
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # iterate through leagues and teams
    leagues = soup.find_all('table', class_='table table-bordered table-hover table-condensed')
    for league in leagues:
        rows = league.find_all('tr')
        for row in rows:
            team = row.find_all('td')
            teamName = team[0].text.strip()
            print(teamName)
After a couple of hours of work, I feel like I'm just one syntax change away from getting this right. Yes?
You can use a CSS Selector nth-of-type(n). It works for both links:
import requests
from bs4 import BeautifulSoup
url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
    print(tag.text.strip())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
...
Real Salt Lake U19
Real Colorado
Empire United Soccer Academy
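Since the question's code also imports csv, writing the names out is a small extension (a sketch reusing the same soup and selector; the filename is arbitrary):

import csv

with open('teams.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
        writer.writerow([tag.text.strip()])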
Each bracket corresponds to one "panel", and each panel contains several tables: the first lists all the teams in the bracket, and the rest hold the match schedules.
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for panel in soup.find_all("div", {"class": "panel-body"}):
        # the first <tbody> in each panel is the teams table
        for row in panel.find("tbody").find_all("tr"):
            print(row.find("td").text.strip())
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
Weston FC
Chargers SC
South Florida FA
Solar SC
RISE SC
...
I think the problem is the header of the table, which contains th elements instead of td elements. That leads to the index out of range error when you try to retrieve the first element of an empty list. Try adding a check on the length of the td list:
for row in rows:
    team = row.find_all('td')
    if len(team) > 0:
        teamName = team[0].text.strip()
        print(teamName)
It should print the team names.
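An equivalent sketch (assuming a bs4 version with soupsieve, which ships with modern releases) filters out header rows in the selector itself, so the length check disappears:

for row in league.select('tr:has(td)'):
    print(row.find('td').text.strip())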
I'm using BeautifulSoup and trying to print the href of every a tag that points to a company website. But my code is selecting other hrefs too. There are 71 company website links in total, and my code is not picking up all of them.
This is the source from which I'm extracting the data.
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.constructionplacements.com/top-construction-companies-in-india-2020/'

name_data = []
website_data = []

print(url)
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Loop to select and print all companies' titles
for h in soup.select('h4'):
    print(h.text)
    name_data.append(h.text)

# Loop to select and print all companies' website urls
for w in soup.select('p em a'):
    print(w['href'])
    website_data.append(w['href'])

df = pd.DataFrame({
    'Company Title': name_data,
    'Website': website_data
})

print(df)
df.to_csv('ata.csv')
To get all links to companies, you can use this example:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.constructionplacements.com/top-construction-companies-in-india-2020/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# company headings look like "1. Company Name"; match only h4 tags that start with a number
for h4 in soup.find_all(lambda t: t.name == 'h4' and re.search(r'^\d+\s*\.', t.text)):
    print('{:<75} {}'.format(h4.text, h4.find_next('a')['href']))
Prints:
1. L&T Engineering & Construction Division (L&T ECC), Chennai http://www.lntecc.com/
2. Tata Projects Ltd, Mumbai http://www.tataprojects.com/
3. Shapoorji Pallonji & Co Ltd, Mumbai https://www.shapoorjipallonji.com/
4. GMR Group, Mumbai http://www.gmrgroup.in/
5. Hindustan Construction Company (HCC), Mumbai http://www.hccindia.com/
6. Afcons Infrastructure Limited, Mumbai http://www.shapoorjipallonji.com/
7. JMC Projects, Mumbai https://www.jmcprojects.com/
8. Gammon India Ltd, Mumbai http://www.gammonindia.com
9. IVRCL, Hyderabad http://www.ivrcl.com/
10. J Kumar Infra, Mumbai http://www.jkumar.com/
11. Gammon Infrastructure Projects Limited (GIPL), Mumbai http://www.gammoninfra.com/
12. Reliance Infrastructure http://www.rinfra.com
13. Ashoka Buildcon, Nashik https://ashokabuildcon.com/
14. B L Kashyap & Sons Ltd (BLK), New Delhi http://www.blkashyap.com
15. Consolidated Construction Consortium Ltd (CCCL), Chennai http://www.ccclindia.com/
16. Essar Group, Mumbai https://www.essar.com/
...and so on.
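To get back to the DataFrame the question builds, the same filter can feed both lists (a sketch reusing the regex above; the output filename is arbitrary):

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.constructionplacements.com/top-construction-companies-in-india-2020/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

name_data, website_data = [], []
for h4 in soup.find_all(lambda t: t.name == 'h4' and re.search(r'^\d+\s*\.', t.text)):
    name_data.append(h4.text.strip())
    website_data.append(h4.find_next('a')['href'])

df = pd.DataFrame({'Company Title': name_data, 'Website': website_data})
df.to_csv('companies.csv', index=False)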
I am having trouble scraping this Wikipedia list of the neighborhoods of Los Angeles using Beautiful Soup. I am getting all the content of the body, not just the neighborhood list as I would like. I have seen a lot about how to scrape a table, but I got stuck on how to apply the table logic in this case.
This is the code I have been using:
import requests
import pandas as pd
from bs4 import BeautifulSoup

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

neighborhoodList = []

# append the data into the list
for row in soup.find_all("div", class_="mw-body")[0].findAll("li"):
    neighborhoodList.append(row.text.replace(', LA', ''))

df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})
If you review the page source, the neighborhood entries are within divs that have a class of "div-col", and each entry's link carries a "title" attribute.
Also, the replace on the text during the append doesn't appear to be needed.
The following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

neighborhoodList = []

# -- append the data into the list
for row in soup.find_all("div", class_="div-col"):
    for item in row.select("a"):
        if item.has_attr('title'):
            neighborhoodList.append(item.text)

df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})

print('First 10 Rows:')
print(df_neighborhood.head(n=10))
print('\nLast 10 Rows:')
print(df_neighborhood.tail(n=10))
Results:
First 10 Rows:
Neighborhood
0 Angelino Heights
1 Arleta
2 Arlington Heights
3 Arts District
4 Atwater Village
5 Baldwin Hills
6 Baldwin Hills/Crenshaw
7 Baldwin Village
8 Baldwin Vista
9 Beachwood Canyon
Last 10 Rows:
Neighborhood
186 Westwood Village
187 Whitley Heights
188 Wholesale District
189 Wilmington
190 Wilshire Center
191 Wilshire Park
192 Windsor Square
193 Winnetka
194 Woodland Hills
195 Yucca Corridor
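The same filter can be written as a single CSS selector, which also removes the has_attr check (a sketch assuming the class and attribute names above):

for item in soup.select('div.div-col a[title]'):
    neighborhoodList.append(item.text)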
I am trying to scrape this page, http://bifr.nic.in/asp/list.asp, with BeautifulSoup and get the table from it.
Following is my code:
from bs4 import BeautifulSoup
import urllib.request

base_url = "http://bifr.nic.in/asp/list.asp"
page = urllib.request.urlopen(base_url)
soup = BeautifulSoup(page, "html.parser")

table = soup.find("table", {"class": "forumline"})
tr = table.find_all("tr")
for rows in tr:
    print(rows.get_text())
It shows no error, but when I execute it I am only able to get the first row of content from the table.
List of Companies
Case
No
Company
Name
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
This is the result I am getting. I can't understand what's going on.
Try this to get all the data from that table:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://bifr.nic.in/asp/list.asp")
soup = BeautifulSoup(page, "html5lib")

table = soup.select_one("table.forumline")
# skip the four header rows, then join the cell text of each data row
for items in table.select("tr")[4:]:
    data = ' '.join([item.get_text(" ", strip=True) for item in items.select("td")])
    print(data)
Partial Output:
359 2000 A & F OVERSEAS LTD.
99 1988 A B C PRODUCTS LTD.
103 1989 A INFRASTRUCTURE LTD.
3 2006 A V ALLOYS LTD.
13 1988 A V J WIRES LTD.
The page's markup probably contains some HTML errors; try using html5lib instead of html.parser, but first you need to install it:
pip install html5lib
soup = BeautifulSoup(page, "html5lib")
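A quick way to confirm that broken markup is the problem is to compare how many rows each parser recovers (a sketch, assuming lxml and html5lib are both installed):

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://bifr.nic.in/asp/list.asp").read()
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(page, parser)
    table = soup.find("table", {"class": "forumline"})
    rows = table.find_all("tr") if table else []
    print(parser, len(rows))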
I'm trying to figure out why my web scrape stops halfway through.
My code:
import requests
from bs4 import BeautifulSoup

url = "http://odds.aussportsbetting.com/betting?competitionid=15"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html')

# get market
m_data = soup.find_all('div', {'class': 'tabContent'})
for items in m_data:
    all_rows = items.findAll('tr')
    for data in all_rows:
        game = data.findAll('a', {'title': 'Click To Compare Odds'})
        market = data.findAll('a', {'title': 'Click To Compare Odds Sorted By Best Bookmaker Odds'})
        for g_row in game:
            text = ''.join(g_row.findAll(text=True))
            g_data = text.strip()
            print(g_data)
        for g_row in market:
            text = ''.join(g_row.findAll(text=True))
            g_data = text.strip()
            print(g_data)
My output:
Cleveland # Cincinnati
102.93
Miami # Buffalo
102.27
Green Bay # Carolina
123.42
St Louis # Minnesota
102.92
Washington # New England
101.85
Tennessee # New Orleans
185.93
Jacksonville # New York Jets
189.21
Oakland # Pittsburgh
102.51
Atlanta # San Francisco
101.75
If you open the page in a browser, you will see there is a lot more data that needs to be scraped, yet the script stops early. Can you help determine why?
You may want to change your parser.
The built-in one, html, can fail fairly easily; use html.parser or lxml instead:
soup = BeautifulSoup(r.content, 'html.parser')
See the BeautifulSoup docs for more information about recommended parsers.
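In context, that one-line change is the whole fix (a sketch; lxml would need pip install lxml):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://odds.aussportsbetting.com/betting?competitionid=15")
soup = BeautifulSoup(r.content, 'html.parser')  # or 'lxml'

# with a stricter parser the whole document tree is built,
# so the row loop no longer stops partway through
print(len(soup.find_all('tr')))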