I have one of those nightmare tables with no class given for the tr and td tags.
A sample page is here: https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m
(You'll see in the code below that I'm getting multiple pages, but that's not the problem.)
I want the team name (nothing else) from each bracket. The output should be:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
etc.
I've been able to get every td in the specified tables. But every attempt to use [0] to get the first td of every row gives me an "index out of range" error.
The code is:
import requests
import csv
from bs4 import BeautifulSoup

batch_size = 2
urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m', 'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']

# iterate through urls
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # iterate through leagues and teams
    leagues = soup.find_all('table', class_='table table-bordered table-hover table-condensed')
    for league in leagues:
        row = ''
        rows = league.find_all('tr')
        for row in rows:
            team = row.find_all('td')
            teamName = team[0].text.strip()
            print(teamName)
After a couple of hours of work, I feel like I'm just one syntax change away from getting this right. Yes?
You can use the CSS selector nth-of-type(n). It works for both links:
import requests
from bs4 import BeautifulSoup

url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
    print(tag.text.strip())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
...
...
Real Salt Lake U19
Real Colorado
Empire United Soccer Academy
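Since the question also imports csv, here is a minimal sketch writing the selected names to a file (the filename "teams.csv" is made up):

import csv
import requests
from bs4 import BeautifulSoup

url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# one team name per row
with open("teams.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
        writer.writerow([tag.text.strip()])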
Each bracket corresponds to one "panel", and each panel has two rows: the first holds the table of all the teams, the second the match tables.
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for panel in soup.find_all("div", {"class": "panel-body"}):
        for row in panel.find("tbody").find_all("tr"):
            print(row.find("td").text.strip())
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
Weston FC
Chargers SC
South Florida FA
Solar SC
RISE SC
...
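If you want to cover both of the question's URLs, the same panel logic drops straight into the original loop; a brief sketch reusing the question's URL list:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m',
    'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m',
]
for url in urls:
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # the first cell of each row in a panel body is the team name
    for panel in soup.find_all("div", {"class": "panel-body"}):
        for row in panel.find("tbody").find_all("tr"):
            print(row.find("td").text.strip())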
I think the problem is with the header of the table, which contains th elements instead of td elements. That leads to the "index out of range" error when you try to retrieve the first element from an empty list. Try adding a check on the length of the td list:
for row in rows:
    team = row.find_all('td')
    if len(team) > 0:
        teamName = team[0].text.strip()
        print(teamName)
That should print the team names.
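An equivalent way to sidestep the header row is to grab the first td directly and skip rows where it is missing; a minimal sketch of that variant:

for row in rows:
    first_td = row.find('td')  # None for the header row, which only has th cells
    if first_td is not None:
        print(first_td.text.strip())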
I'm trying to get the main body data from this website.
I want to get a data frame (or any other object that makes life easier) as output, with the subheadings as column names and the body under each subheading as lines in that column.
My code is below:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd  # needed for pd.DataFrame / pd.Series below

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_="entry-content")

headings = []
lines = []
my_df = pd.DataFrame(index=range(100))

for strong in article.findAll('strong'):
    if strong.parent.name == 'p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
#headings

k = 0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines = lines + [""]
    my_df[k] = pd.Series(lines)
    k = k + 1
my_df
I want to use the "headings" list to get the data frame column names.
Clearly I'm not writing the correct logic. I explored nextSibling, descendants, and other attributes too, but I can't figure it out. Can someone please help?
Once you have a headline, use .find_next() to get that article's list, collect the items into a list stored under the headline as a dictionary key, and then use pd.concat() with ignore_index=False:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_="entry-content")

headlines = {}
news_headlines = article.find_all('p', text=re.compile("News"))
for news_headline in news_headlines:
    end_of_news = False
    sub_title = news_headline.find_next('p')
    headlines[news_headline.text] = []
    #print(news_headline.text)
    while end_of_news == False:
        headlines[news_headline.text].append(sub_title.text)
        articles = sub_title.find_next('ul')
        for li in articles.findAll('li'):
            headlines[news_headline.text].append(li.text)
            #print(li.text)
        sub_title = articles.find_next('p')
        if 'News' in sub_title.text or sub_title.text == '':
            end_of_news = True

df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings: lines})
    df_list.append(temp)
my_df = pd.concat(df_list, ignore_index=False, axis=1)
Output:
print(my_df)
National News ... Obituaries News
0 1. Cabinet approves 100% FDI under automatic r... ... 11. Eminent Kashmiri Writer Aziz Hajini passes...
1 The Union Cabinet, chaired by Prime Minister N... ... Noted writer and former secretary of Jammu and...
2 A total of 9 structural and 5 process reforms ... ... He has over twenty books in Kashmiri to his cr...
3 Change in the definition of AGR: The definitio... ... 12. Former India player and Mohun Bagan great ...
4 Rationalised Spectrum Usage Charges: The month... ... Former India footballer and Mohun Bagan captai...
5 Four-year Moratorium on dues: Moratorium has b... ... Bhabani Roy helped Mohun Bagan win the Rovers ...
6 Foreign Direct Investment (FDI): The governmen... ... 13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7 Auction calendar fixed: Spectrum auctions will... ... Double Olympic hammer throw gold medallist Yur...
8 Important takeaways for all competitive exams: ... He set the world record for the hammer throw w...
9 Minister of Communications: Ashwini Vaishnaw. ... He won his first gold medal at the 1976 Olympi...
[10 rows x 8 columns]
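As a side note, the axis=1 concat is what lets sections of different lengths share one frame; a minimal sketch with made-up values:

import pandas as pd

# shorter columns are padded with NaN when aligned on the index
a = pd.DataFrame({"National News": ["item 1", "item 2", "item 3"]})
b = pd.DataFrame({"Obituaries News": ["item 11"]})
print(pd.concat([a, b], ignore_index=False, axis=1))
# the "Obituaries News" column is NaN for rows 1 and 2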
I am trying to web scrape the second table from this website:
https://fbref.com/en/comps/9/stats/Premier-League-Stats
However, I have only ever managed to extract the information from the first table when searching by the table tag. Would anyone be able to explain why I cannot access the second table, or show me how to do it?
import requests
from bs4 import BeautifulSoup

url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
pl_table = soup.find_all("table")
player_table = pl_table[0]
Something along these lines should do it:
tables = soup.find_all("table") # returns a list of tables
second_table = tables[1]
The table is inside HTML comments <!-- ... -->.
To get the table from comments, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# the real table sits in a comment node under the wrapper div
table = BeautifulSoup(soup.select_one('#all_stats_standard').find_next(text=lambda x: isinstance(x, Comment)), 'html.parser')

# print some information from the table to screen:
for tr in table.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    print('{:<30}{:<20}{:<10}'.format(tds[0], tds[3], tds[5]))
Prints:
Patrick van Aanholt Crystal Palace 1990
Max Aarons Norwich City 2000
Tammy Abraham Chelsea 1997
Che Adams Southampton 1996
Adrián Liverpool 1987
Sergio Agüero Manchester City 1988
Albian Ajeti West Ham 1997
Nathan Aké Bournemouth 1995
Marc Albrighton Leicester City 1989
Toby Alderweireld Tottenham 1989
...and so on.
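As a variant (an assumption on my part, not part of the answer above): once the comment text is extracted, pandas can parse it straight into a DataFrame. StringIO keeps newer pandas versions from warning about literal-HTML input:

import requests
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# same comment node as above, under the #all_stats_standard wrapper
comment = soup.select_one('#all_stats_standard').find_next(
    text=lambda x: isinstance(x, Comment))

# read_html returns a list of DataFrames; the commented block holds one table
df = pd.read_html(StringIO(str(comment)))[0]
print(df.head())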
I am having trouble scraping this Wikipedia list of the neighborhoods of Los Angeles using Beautiful Soup. I am getting all the content of the body, not just the neighborhood list like I would like. I have seen a lot about how to scrape a table, but I got stuck on how to apply the table logic in this case.
This is the code I have been using:
import requests
import pandas as pd
from bs4 import BeautifulSoup

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

neighborhoodList = []
# append the data into the list
for row in soup.find_all("div", class_="mw-body")[0].findAll("li"):
    neighborhoodList.append(row.text.replace(', LA', ''))
df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})
If you review the page source, the neighborhood entries are within divs that have a class of "div-col", and each link carries a "title" attribute.
Also, the replace on the text during the append doesn't appear to be needed.
The following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

address = 'Los Angeles, United States'
url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

neighborhoodList = []
# append the data into the list
links = []
for row in soup.find_all("div", class_="div-col"):
    for item in row.select("a"):
        if item.has_attr('title'):
            neighborhoodList.append(item.text)
df_neighborhood = pd.DataFrame({"Neighborhood": neighborhoodList})

print(f'First 10 Rows:')
print(df_neighborhood.head(n=10))
print(f'\nLast 10 Rows:')
print(df_neighborhood.tail(n=10))
Results:
First 10 Rows:
Neighborhood
0 Angelino Heights
1 Arleta
2 Arlington Heights
3 Arts District
4 Atwater Village
5 Baldwin Hills
6 Baldwin Hills/Crenshaw
7 Baldwin Village
8 Baldwin Vista
9 Beachwood Canyon
Last 10 Rows:
Neighborhood
186 Westwood Village
187 Whitley Heights
188 Wholesale District
189 Wilmington
190 Wilshire Center
191 Wilshire Park
192 Windsor Square
193 Winnetka
194 Woodland Hills
195 Yucca Corridor
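As an aside, the nested loops and the has_attr check can collapse into one CSS attribute selector; a short sketch assuming the same "div-col"/"title" structure:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# 'a[title]' keeps only anchors that carry a title attribute
names = [a.text for a in soup.select("div.div-col a[title]")]
df_neighborhood = pd.DataFrame({"Neighborhood": names})
print(df_neighborhood.head(n=10))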
Total Python 3 beginner here. I can't seem to get just the names of the colleges to print out.
The class is nowhere near the college names, and I can't seem to narrow the find_all down to what I need, then print to a new csv file. Any ideas?
import requests
from bs4 import BeautifulSoup
import csv

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
colleges = soup.find_all("table", class_="wikitable sortable")
for college in colleges:
    first_level = college.find_all("tr")
    print(first_level)
You can use soup.select() to utilize css selectors and be more precise:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
l = soup.select(".mw-parser-output > table:nth-of-type(2) > tbody > tr > td:nth-of-type(1) a")
for each in l:
    print(each.text)
Printed result:
Brown University
Columbia University
Cornell University
Dartmouth College
Harvard University
University of Pennsylvania
Princeton University
Yale University
To put a single column into csv:
import pandas as pd
pd.DataFrame([e.text for e in l]).to_csv("your_csv.csv") # This will include index
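If you would rather stick with the csv module the question already imports, a minimal sketch (the filename "colleges.csv" is made up):

import csv

# write one college per row; l is the soup.select() result from above
with open("colleges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["College"])  # header row
    for each in l:
        writer.writerow([each.text])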
With:
colleges = soup.find_all("table", class_ = "wikitable sortable")
you are getting all the tables with that class (there are five), not the colleges inside a single table. So you can do something like this:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
college_table = soup.find("table", class_="wikitable sortable")
colleges = college_table.find_all("tr")
for college in colleges:
    college_row = college.find('td')
    college_link = college.find('a')
    if college_link is not None:
        college_name = college_link.text
        print(college_name)
EDIT: I added an if to discard the first row, which contains the table header.
I'm trying to figure out why my web scrape stops halfway through.
My code:
import requests, re
from bs4 import BeautifulSoup

url = "http://odds.aussportsbetting.com/betting?competitionid=15"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html')

# get market
m_data = soup.find_all('div', {'class': 'tabContent'})
for items in m_data:
    all_rows = items.findAll('tr')
    for data in all_rows:
        game = data.findAll('a', {'title': 'Click To Compare Odds'})
        market = data.findAll('a', {'title': 'Click To Compare Odds Sorted By Best Bookmaker Odds'})
        for g_row in game:
            text = ''.join(g_row.findAll(text=True))
            g_data = text.strip()
            print g_data
        for g_row in market:
            text = ''.join(g_row.findAll(text=True))
            g_data = text.strip()
            print g_data
My output:
Cleveland # Cincinnati
102.93
Miami # Buffalo
102.27
Green Bay # Carolina
123.42
St Louis # Minnesota
102.92
Washington # New England
101.85
Tennessee # New Orleans
185.93
Jacksonville # New York Jets
189.21
Oakland # Pittsburgh
102.51
Atlanta # San Francisco
101.75
If you look at the linked page, you will see there is a lot more data that needs to be scraped, but my scrape stops short. Can you help determine why?
You may want to change your parser.
The built-in one, 'html', can pretty easily fail; use 'html.parser' or 'lxml' instead:
soup = BeautifulSoup(r.content,'html.parser')
See the BeautifulSoup docs for more information about recommended parsers.
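Parsers also differ in how they repair broken markup, so it can be worth comparing them on a suspect fragment; a small sketch (html5lib needs a separate install, hence the try/except):

from bs4 import BeautifulSoup

# a fragment with unclosed <td> tags; each parser repairs it in its own way
broken = "<table><tr><td>a<td>b</table>"
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        cells = BeautifulSoup(broken, parser).find_all("td")
        print(parser, "->", [td.text for td in cells])
    except Exception as exc:  # raised if the parser is not installed
        print(parser, "skipped:", exc)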