How to extract player names from cricinfo using Python with BeautifulSoup

I'm learning Beautiful Soup. I want to extract the player names, i.e. the playing elevens for both teams, from cricinfo.com. The exact link is "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
The problem is that the website only lists players under the class "wrap batsmen" if they have batted; otherwise they are placed under the class "wrap dnb". I want to extract all the players regardless of whether they have batted or not. How can I maintain two arrays (one for each team) that dynamically search for players in "wrap batsmen" and "wrap dnb" (if required)?
This is my attempt:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

years = []
# Years we will be analyzing
for i in range(2010, 2018):
    years.append(i)

names = []
# URL of the page we will be scraping
url = "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")

for a in range(0, 1):
    names.append([a.getText() for a in soup.find_all("div", class_="cell batsmen")[1:][a].findAll('a', limit=1)])

soup = soup.find_all("div", class_="wrap dnb")
print(soup[0])

While this is possible with BeautifulSoup, it's not the best tool for the job. All that data (and much more) is available through the API; simply pull that and parse the JSON to get what you want (and more). Here's a quick script, though, to get the 11 players for each team.
You can find the API URL by using dev tools (Ctrl+Shift+I) and watching what requests the browser makes (look at Network -> XHR in the side panel; you may need to click around the page to make it fire the request/call).
import requests

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'
payload = {
    'contentorigin': 'espn',
    'event': '439146',
    'lang': 'en',
    'region': 'gb',
    'section': 'cricinfo'}

jsonData = requests.get(url, params=payload).json()
roster = jsonData['rosters']

players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)

print(players)
Output:
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}
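If you specifically want the two per-team arrays from your question, they are just the values of that dict:

west_indies_xi = players['West Indies']    # 11 names
south_africa_xi = players['South Africa']  # 11 names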

Related

How can Beautiful Soup loop through a list of URLs to scrape multiple text fields

I am attempting to loop through a stored list of URLs to scrape stats about footballers (age, name, club, etc.).
My list of URLs is stored as playerLinks:
playerLinks[:5]
['https://footystats.org/players/england/martyn-waghorn',
'https://footystats.org/players/norway/stefan-marius-johansen',
'https://footystats.org/players/england/grady-diangana',
'https://footystats.org/players/england/jacob-brown',
'https://footystats.org/players/england/josh-onomah']
If I attempt to scrape an individual link with the following code, I am able to retrieve a result.
testreq = Request('https://footystats.org/players/england/dominic-solanke', headers=headers)
html_test = urlopen(testreq)
testsoup = BeautifulSoup(html_test, "html.parser")
testname = testsoup.find('p','col-lg-7 lh14e').text
print(testname)
#Dominic Solanke
However, when I loop through my list of URLs, I receive errors. Below is the code I am using, to no avail.
names = []
# For each player page...
for i in range(len(playerLinks)):
    reqs2 = Request(playerLinks[i], headers=headers)
    html_page = urlopen(reqs2)
    psoup2 = BeautifulSoup(html_page, "html.parser")
    for x in psoup2.find('p','col-lg-7 lh14e').text
        names.append(x.get('text'))
Once I fix the name scrape I will need to repeat the process for other stats. I have pasted the HTML of the page below. Do I need to nest another loop within? At the moment I receive either 'invalid syntax' errors or 'no text object' errors.
"<div class="row cf lightGrayBorderBottom "> <p class="col-lg-5 semi-bold lh14e bbox mild-small">Full Name</p> <p class="col-lg-7 lh14e">Dominic Solanke</p></div>"
I'm getting the following output:
Code:
import bs4, requests
from bs4 import BeautifulSoup

playerLinks = ['https://footystats.org/players/england/martyn-waghorn',
               'https://footystats.org/players/norway/stefan-marius-johansen',
               'https://footystats.org/players/england/grady-diangana',
               'https://footystats.org/players/england/jacob-brown',
               'https://footystats.org/players/england/josh-onomah']

names = []
# For each player page...
for i in range(len(playerLinks)):
    reqs2 = requests.get(playerLinks[i])
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    for x in psoup2.find_all('p','col-lg-7 lh14e'):
        names.append(x.text)
        print(names)
#print(names)
Output:
['Martyn Waghorn']
['Martyn Waghorn', 'England']
['Martyn Waghorn', 'England', 'Forward']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)', '71st / 300 players']
Some of these links have blank content. I suspect you need to either be logged in and/or have a paid subscription (they do offer an API, but the free tier only allows one league).
But to correct that output, move your print statement to the end instead of printing every time you append to the list:
import requests
from bs4 import BeautifulSoup

playerLinks = ['https://footystats.org/players/england/martyn-waghorn',
               'https://footystats.org/players/norway/stefan-marius-johansen',
               'https://footystats.org/players/england/grady-diangana',
               'https://footystats.org/players/england/jacob-brown',
               'https://footystats.org/players/england/josh-onomah']

names = []
# For each player page...
for link in playerLinks:
    reqs2 = requests.get(link)
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    for x in psoup2.find_all('p','col-lg-7 lh14e'):
        names.append(x.text)
print(names)
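Since you mention repeating the process for other stats: going by the row structure in the HTML snippet you pasted (label in p.col-lg-5 ..., value in p.col-lg-7 lh14e), one sketch is to zip the label and value columns into a dict per player. This assumes those class names hold for every stat row, so verify against the live page:

import requests
from bs4 import BeautifulSoup

playerStats = {}
for link in playerLinks:
    psoup = BeautifulSoup(requests.get(link).content, "html.parser")
    # label cells, e.g. "Full Name" (class taken from the snippet in the question)
    labels = [p.text for p in psoup.select("p[class*='col-lg-5']")]
    # value cells, e.g. "Dominic Solanke"
    values = [p.text for p in psoup.select("p.col-lg-7.lh14e")]
    playerStats[link] = dict(zip(labels, values))
print(playerStats)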

Scraping author names from a website with try/except using Python

I am trying to use try/except in order to scrape through different pages of a URL containing author data. I need a set of author names from 10 subsequent pages of this website.
# Import packages
import requests
import bs4
from bs4 import BeautifulSoup as bs

# Output list
authors = []

# Website main page URL
URL = 'http://quotes.toscrape.com/'
res = requests.get(URL)
soup = bs4.BeautifulSoup(res.text, "lxml")

# Get the contents from the first page
for item in soup.select(".author"):
    authors.append(item.text)

page = 1
pagesearch = True

# Get the contents from pages 2-10
while pagesearch:
    # Check if page is available
    try:
        req = requests.get(URL + '/' + 'page/' + str(page) + '/')
        soup = bs(req.text, 'html.parser')
        page = page + 1
        for item in soup.select(".author"):  # Append the author class from the webpage html
            authors.append(item.text)
    except:
        print("Page not found")
        pagesearch == False
        break  # Break if no page is remaining

print(set(authors))  # Print the output as a unique set of author names
The first page doesn't have a page number in its URL, so I treated it separately. I'm using the try/except block to iterate through all of the possible pages, expecting it to throw an exception and break the loop once the last page has been scanned.
When I run the program, it enters an infinite loop, when it should print the "Page not found" message once the pages run out. When I interrupt the kernel, I see the correct result as a list and my exception statement, but nothing before that. I get the following result.
Page not found
{'Allen Saunders', 'J.K. Rowling', 'Pablo Neruda', 'J.R.R. Tolkien', 'Harper Lee', 'J.M. Barrie',
'Thomas A. Edison', 'J.D. Salinger', 'Jorge Luis Borges', 'Haruki Murakami', 'Dr. Seuss', 'George
Carlin', 'Alexandre Dumas fils', 'Terry Pratchett', 'C.S. Lewis', 'Ralph Waldo Emerson', 'Jim
Henson', 'Suzanne Collins', 'Jane Austen', 'E.E. Cummings', 'Jimi Hendrix', 'Khaled Hosseini',
'George Eliot', 'Eleanor Roosevelt', 'André Gide', 'Stephenie Meyer', 'Ayn Rand', 'Friedrich
Nietzsche', 'Mother Teresa', 'James Baldwin', 'W.C. Fields', "Madeleine L'Engle", 'William
Nicholson', 'George R.R. Martin', 'Marilyn Monroe', 'Albert Einstein', 'George Bernard Shaw',
'Ernest Hemingway', 'Steve Martin', 'Martin Luther King Jr.', 'Helen Keller', 'Charles M. Schulz',
'Charles Bukowski', 'Alfred Tennyson', 'John Lennon', 'Garrison Keillor', 'Bob Marley', 'Mark
Twain', 'Elie Wiesel', 'Douglas Adams'}
What can be the reason for this? Thanks.
I think that's because there literally is a page there. The exception would only arise if there were no page for the browser to show.
But when you make a request for this one:
http://quotes.toscrape.com/page/11/
the browser still returns a page that bs4 can parse; it just contains no author elements, so no exception is raised.
How do you stop at page 11? You can check for the presence of the Next page button.
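A minimal sketch of that idea, assuming the pager marks the next link with an <li class="next"> element (which it does on quotes.toscrape.com):

import requests
from bs4 import BeautifulSoup

authors = []
page = 1
while True:
    req = requests.get('http://quotes.toscrape.com/page/' + str(page) + '/')
    soup = BeautifulSoup(req.text, 'html.parser')
    for item in soup.select(".author"):
        authors.append(item.text)
    if soup.select_one("li.next") is None:  # no Next button means this was the last page
        break
    page += 1
print(set(authors))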
Thanks for reading.
Try using the built-in range() function to go from pages 1-10 instead:
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/page/{}/"
authors = []

for page in range(1, 11):
    response = requests.get(url.format(page))
    print("Requesting Page: {}".format(response.url))
    soup = BeautifulSoup(response.content, "html.parser")
    for tag in soup.select(".author"):
        authors.append(tag.text)

print(set(authors))

BeautifulSoup scrape the first title tag in each <li>

I have some code that goes through the cast list of a show or movie on Wikipedia, scraping all the actors' names and storing them. The current code finds every <a> in the list and stores its title attribute. It currently goes:
import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'html.parser')  # parse the fetched HTML
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)
    else:
        continue
While this partially works, there are two downsides:
It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
Is there a way I can get BeautifulSoup to scrape the first two words after each <li>? Or is there a better solution for what I am trying to do?
You can use css selectors to grab only the first <a> in a <li>:
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
Example
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))
Stars
Output
['Harrison Ford',
'Cate Blanchett',
'Karen Allen',
'Ray Winstone',
'John Hurt',
'Jim Broadbent',
'Shia LaBeouf']
You can use a regex to fetch all the names from the text content of each <li> and just take the first two; this will also fix the issue where an actor doesn't have a Wikipedia page hyperlink.
import re
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)", <text_content_from_li>)
Example:
text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)",text)
Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
There is considerable variation in the HTML for cast lists within the film listings on Wikipedia. Perhaps look to an API to get this info?
E.g. imdb8 allows for a reasonable number of calls which you could use with the following endpoint
https://imdb8.p.rapidapi.com/title/get-top-cast
There also seems to be a Python IMDb API.
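For instance, a minimal sketch with the IMDbPY package (pip install IMDbPY); the get_movie call and the 'cast' key are how that library exposed data at the time of writing, so treat this as an assumption to verify:

from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0367882')    # numeric part of the IMDb tt-id
for person in movie['cast'][:11]:  # main cast is listed first
    print(person['name'])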
Or choose something with more regular HTML. For example, if you take the IMDb film ids in a list, you can extract the full cast and the main actors from IMDb as follows. To get the shorter cast list, I filter out the rows which occur at/after the text "Rest" within "Rest of cast listed alphabetically:".
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'

with requests.Session() as s:
    for movie_id in movie_ids:
        link = f'https://www.imdb.com/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        # print(link)
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)

Trouble Looping with Beautiful soup

I am a newbie at Python and currently learning web scraping using BeautifulSoup. I am trying to get information from Steam and display each game's name, price, and genre. I can get my code to find all of this, but when I put it in a for loop, it doesn't work. Can you identify the problem?
Thank you so much for the help!
This will show everything I need (and more) on the page (name, price, genre):
from bs4 import BeautifulSoup
import requests
import json

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")
for item in content.findAll("div", attrs={"id": "tab_content_NewReleases"}):
    print(item.text)
This will only show the first game, therefore I believe it is not looping correctly:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")
for item in content.findAll("div", attrs={"id": "tab_content_NewReleases"}):
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
I'm expecting results like this, but more than one result:
{
    'name': 'Little Misfortune',
    'price': '$19.99',
    'genre': 'Adventure, Indie, Casual, Singleplayer'
}
The issue is that content.findAll("div", attrs={"id": "tab_content_NewReleases"}) holds all of the results you want in its very first element (results[0]), so you only get the first result. When you iterate over it, you search the one HTML block that contains the good stuff only once, hence the single result. The solution is to take that found block and split it into an iterable of individual results you can work with. Here is my solution:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")
bulk = content.find("div", attrs={"id": "tab_content_NewReleases"})  # Isolate the block you want
results = bulk.findAll('a', attrs={'class': 'tab_item'})  # Split it into the separate results
for item in results:
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
You got 90% of the way there, just missing that little bit.
Make sure you are working with the children, so add the child a elements to the selector. You could also make the parent the rows element, i.e. #NewReleasesRows a:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://store.steampowered.com/tags/en/Adventure/#p=0&tab=NewReleases'
response = requests.get(url, timeout=9)
content = BeautifulSoup(response.content, "html.parser")
for item in content.select('#NewReleasesRows a'):
    itemObject = {
        "name": item.find("div", attrs={"class": "tab_item_name"}).text,
        "price": item.find("div", attrs={"class": "discount_final_price"}).text,
        "genre": item.find("div", attrs={"class": "tab_item_top_tags"}).text
    }
    print(itemObject)
I think you are not selecting the right tag. Use 'NewReleasesRows' instead to find the table containing the rows of new releases.
So the code would look like this, using a CSS selector:
my_soup: BeautifulSoup = BeautifulSoup(my_page_text, 'lxml')  # my_page_text holds the fetched page HTML
print("mysoup type:", type(my_soup))
my_table_list = my_soup.select('#NewReleasesRows')
print('my_table_list size:', len(my_table_list))
Then you can look for the rows (after having checked that you got only one table; you could use select_one too):
print(BeautifulSoup.prettify(my_table_list[0]))
my_table_rows = my_table_list[0].select('.tab_item')
and from there you can iterate:
for my_row in my_table_rows:
    print(my_row.get_text(strip=True))
Resulting output:
R 130.00Little MisfortuneAdventure, Indie, Casual, Singleplayer
-33%R 150.00R 100.50TrailmakersBuilding, Sandbox, Multiplayer, LEGO
-10%R 105.00R 94.50Devil's Deck 恶魔秘境Early Access, RPG, Indie, Early Access
R 89.00Showdown BanditAction, Adventure, Indie, Horror
R 150.00HardlandAdventure, Indie, Open World, Singleplayer
R 120.00Aeon's EndCard Game, Strategy, Indie, Adventure
R 105.00Atomorf2Casual, Action, Indie, Adventure
-10%R 175.00R 157.50Daymare: 1998Indie, Action, Survival Horror, Horror
-25%R 79.00R 59.25Ling: A Road AloneAction, RPG, Indie, Gore
-10%R 105.00R 94.50NauticrawlIndie, Simulation, Atmospheric, Sci-fi
FreeOrpheus's DreamFree to Play, Adventure, Indie, Casual
-40%R 105.00R 63.00AVAEarly Access, Action, Early Access, Indie
-40%R 18.00R 10.80Angry GolfIndie, Casual, Sports, Adventure
-40%R 10.00R 6.00Death LiveIndie, Casual, Adventure, Anime
-30%R 130.00R 91.00Die YoungSurvival, Action, Open World, Gore
I hope that helps.
Best

BeautifulSoup: Get text, create dictionary

I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:
import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)
This produces the following for the first of many publications:
2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson
I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:
Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use for loops as well.
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})

print(papers[1])
{'Date': '2018-069',
'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
You could use regex to match each part of the string:
[-\d]+ matches the part made up of digits and - only
(?<=\s).*?(?=by) matches the part that starts after a blank and ends at "by" (which introduces the author)
(?<=by\s).* matches the author, the rest of the whole string
Full code:
import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
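If you eventually want all the papers in one table rather than a list of dicts, pandas accepts the list directly; a small usage sketch:

import pandas as pd

df = pd.DataFrame(datas)  # one row per paper: date, Title, Author(s)
print(df.head())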
