Unable to extract table from the ESPN website - python

I am trying to extract a table from the ESPN website using the code below, but I am unable to extract the whole table and write it to a new CSV file. I get the error AttributeError: 'list' object has no attribute 'to_csv'.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import requests

service = Service(r"C:\Users\Sai Ram\Documents\Python\Driver\chromedriver.exe")

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(service=service, options=options)
    driver.get("https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year")
    return driver

def main():
    get_driver()
    url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list
    df.to_csv(r'C:\Users\Sai Ram\Documents\Power BI\End to End T-20 World Cup DashBoard\CSV Data\scappying.csv')

main()
Could you please help me resolve this issue and, if possible, suggest the best way to extract this data using web scraping in Python?

You can simply do this:
import pandas as pd
url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
df = pd.read_html(url)[0] # <- this index is important
df.head()
Output
Team 1 Team 2 Winner Margin Ground Match Date Scorecard
0 West Indies England West Indies 9 wickets Bridgetown Jan 22, 2022 T20I # 1453
1 West Indies England England 1 run Bridgetown Jan 23, 2022 T20I # 1454
2 West Indies England West Indies 20 runs Bridgetown Jan 26, 2022 T20I # 1455
3 West Indies England England 34 runs Bridgetown Jan 29, 2022 T20I # 1456
4 West Indies England West Indies 17 runs Bridgetown Jan 30, 2022 T20I # 1457
To save your dataframe as a CSV file:
df.to_csv("your_data_file.csv")
By loading the webpage into pandas you get a list of 2 dataframes (for this particular site). The first dataframe is the one you want (at least that is what I understand from your post), with the large table of results. The second one is a smaller list. So by indexing with [0] you retrieve the first dataframe.
You get the AttributeError because you try to treat the list of (2) dataframes as a dataframe.
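Applied to your own script, main() could be reduced to something like this (just a sketch; the output filename is a placeholder, and the Selenium driver is not needed merely to read this table):
import pandas as pd

def main():
    url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
    df_list = pd.read_html(url)  # list of dataframes
    df = df_list[0]              # pick the first (large) results table
    df.to_csv("match_results.csv", index=False)  # placeholder filename

main()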

Related

create a table using pandas using dataset from wikipedia table

I am trying to tabulate data into three columns (title, release date and continuity) using pandas. I am trying to fetch my dataset by scraping the Released films section of this Wikipedia page, and I tried following the steps from this YouTube video.
here is my code
import requests as r
from bs4 import BeautifulSoup
import pandas as pd

response = r.get("https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies")
wiki_text = response.text
soup = BeautifulSoup(wiki_text, "html.parser")
table_soup = soup.find_all("table")
filtered_table_soup = [table for table in table_soup if table.th is not None]

required_table = None
for table in filtered_table_soup:
    if str(table.th.string).strip() == "Release date":
        required_table = table
        break

print(required_table)
Whenever I run the code, it always returns None instead of the Release date table.
I am new to web scraping by the way, so please go easy on me.
Thank You.
Unless BS4 is a requirement, you can just use pandas to fetch all HTML tables on that page. It will make a DataFrame of each table and store them in a list. You can then loop through the list to find the table of interest.
import pandas as pd

url = r"https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
tables = pd.read_html(url)  # Returns list of all tables on page
for tab in tables:
    if "Release date" in tab.columns:
        required_table = tab
It's actually really simple:
The table is the second <table> on the page, so use slicing to get the correct table:
import pandas as pd
URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
df = pd.read_html(URL, header=0)[1]
print(df.to_string())
Prints (truncated)
Title Release date Continuity Adapted from
0 Superman: Doomsday September 21, 2007 Standalone "The Death of Superman"
1 Justice League: The New Frontier February 26, 2008 Standalone DC: The New Frontier
2 Batman: Gotham Knight July 8, 2008 Nolanverse (unofficial)[2] Batman: "The Batman Nobody Knows"
3 Wonder Woman March 3, 2009 Standalone Wonder Woman: "Gods and Mortals"
4 Green Lantern: First Flight July 28, 2009 Standalone NaN
5 Superman/Batman: Public Enemies September 29, 2009 Superman/Batman[3] Superman/Batman: "Public Enemies"
6 Justice League: Crisis on Two Earths February 23, 2010 Crisis on Two Earths / Doom "Crisis on Earth-Three!" / JLA: Earth 2
7 Batman: Under the Red Hood July 7, 2010 Standalone Batman: "Under the Hood"
Or, if you want to specifically use BeautifulSoup, you can use a CSS selector to select the second table:
import requests
import pandas as pd
from bs4 import BeautifulSoup
URL = "https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
# find the second table
table = soup.select_one("table:nth-of-type(2)")
df = pd.read_html(str(table))[0]
print(df.to_string())
Try:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/DC_Universe_Animated_Original_Movies'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('h2:has(#Released_films) + table')
header = [th.text.strip() for th in table.select('th')]
data = []
for row in table.select('tr:has(td)'):
    tds = [td.text.strip() for td in row.select('td')]
    data.append(tds)
print(('{:<45}'*4).format(*header))
print('-' * (45*4))
for row in data:
    print(('{:<45}'*len(row)).format(*row))
Prints:
Title Release date Continuity Adapted from
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Superman: Doomsday September 21, 2007 Standalone "The Death of Superman"
Justice League: The New Frontier February 26, 2008 Standalone DC: The New Frontier
Batman: Gotham Knight July 8, 2008 Nolanverse (unofficial)[2] Batman: "The Batman Nobody Knows"
Wonder Woman March 3, 2009 Standalone Wonder Woman: "Gods and Mortals"
Green Lantern: First Flight July 28, 2009 Standalone
Superman/Batman: Public Enemies September 29, 2009 Superman/Batman[3] Superman/Batman: "Public Enemies"
Justice League: Crisis on Two Earths February 23, 2010 Crisis on Two Earths / Doom "Crisis on Earth-Three!" / JLA: Earth 2
Batman: Under the Red Hood July 7, 2010 Standalone Batman: "Under the Hood"
Superman/Batman: Apocalypse September 28, 2010 Superman/Batman[3] Superman/Batman: "The Supergirl from Krypton"
All-Star Superman February 22, 2011 Standalone All-Star Superman
Green Lantern: Emerald Knights July 7, 2011 Standalone "New Blood" / "What Price Honor?" / "Mogo Doesn't Socialize" / "Tygers"
Batman: Year One October 18, 2011 Year One / Dark Knight Returns[4][5] Batman: Year One
Justice League: Doom February 28, 2012 Crisis on Two Earths / Doom JLA: "Tower of Babel"
Superman vs. The Elite June 12, 2012 Standalone "What's So Funny About Truth, Justice & the American Way?"
...and so on.

Python / Selenium. How to find relationship between data?

I'm creating an Amazon web scraper which just returns the name and price of all products in the search results. It will filter through a dictionary of strings (products) and collect the titles and pricing for all results. I'm doing this to calculate the average / mean of a product's pricing and also to find the highest and lowest prices for that product found on Amazon.
So making the scraper was easy enough. Here's a snippet so you understand the code I am using.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1")

# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@class="a-size-base-plus a-color-base a-text-normal"]')
shoes_list = []
for s in range(len(shoes)):
    shoes_list.append(shoes[s].text)

# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@class="a-price"]')
price_list = []
for p in range(len(price)):
    price_list.append(price[p].text)

# prices are returned with a newline instead of a decimal
# example: £9\n99 instead of £9.99
# so fixing that
temp_price_list = []
for price in price_list:
    price = price.replace("\n", ".")
    temp_price_list.append(price)
price_list = temp_price_list
So here's the issue. Almost without fail, Amazon has a handful of products with no price. This really messes with things, because once I've sorted the data into a dataframe
title_and_price = list(zip(shoes_list[0:],price_list[0:]))
df = DataFrame(title_and_price, columns=['Product','Price'])
At some point the data gets mixed up and the price will be sorted next to the wrong product. I have left screenshots below for you to see.
Missing prices on Amazon site
Incorrect data
Unfortunately, when pulling the price data, it does not pull in a 'blank' entry when the price is missing; if it did, I wouldn't need to ask for help, as I could just display a blank price next to the item and everything would still remain in order.
Is there any way to alter the code so that it detects a non-displayed price and therefore keeps all the data in order? The data stays in order right up until there's a product with no price, which in every single case of an Amazon search, there is. Really appreciate any insight on this.
To make sure the price is married to the shoe name, you should locate the parent element of both the shoe name and the price, and add them as a tuple to a list (which will become a dataframe), like in the example below:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
df_list = []
url = 'https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1'
browser.get(url)
shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception as e:
        continue
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]')
    except Exception as e:
        continue
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns=['Shoe', 'Price'])
print(df)
This would return (depending on Amazon's appetite for serving ads in html tags similar to products):
Shoe Price
0 Nike NIKE AIR MAX MOTION 2, Men's Running Shoe... £79\n99
1 Nike Air Max 270 React Se GS Running Trainers ... £69\n99
2 NIKE Women's Air Max Low Sneakers £69\n99
3 NIKE Men's React Miler Running Shoes £109\n99
4 NIKE Men's Revolution 5 Flyease Running Shoe £38\n70
5 NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6 NIKE Men's Downshifter 10 Running Shoe £54\n99
7 NIKE Women's Court Vision Low Better Basketbal... £30\n00
8 NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac... £20\n72
9 NIKE Men's Air Max Wright Gs Running Shoe £68\n51
10 NIKE Men's Air Max Sc Trainers £54\n99
11 NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof... £134\n95
12 NIKE Women's W Superrep Go 2 Sneakers £54\n00
13 NIKE Boys Tanjun Running Shoes £35\n53
14 NIKE Women's Air Max Bella Tr 4 Gymnastics Sho... £28\n00
15 NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16 NIKE Men's Venture Runner Sneaker £45\n90
17 Nike Nike Court Borough Low 2 (gs), Boy's Bask... £24\n00
18 NIKE Men's Court Royale 2 Better Essential Tra... £25\n81
19 NIKE Men's Quest 4 Running Shoe £38\n00
20 Women Trainers Running Shoes - Air Cushion Sne... £35\n69
21 Men Women Walking Trainers Light Running Breat... £42\n99
22 JSLEAP Mens Running Shoes Fashion Non Slip Ath... £44\n99
[...]
You should pay attention to a couple of things:
I am waiting for the elements to load on the page before trying to locate them; see the imports (WebDriverWait etc.)
Your results may vary, depending on your advertising profile
You can select more details for each item and use different CSS/XPath/etc. selectors; this is meant to give you a head start only
I found the answer to this question. Thank you to @platipus_on_fire for the answer. You certainly guided me towards the correct place, but I did find that the issue persisted in that prices were still being mismatched with the names.
The solution I found was to find the price first, before the name. If there is no price, then it shouldn't log the name. In your example you had it search for the name first, which was still causing issues, as sometimes the name is there but not the price.
Here is the code that finally works for me:
for shoe in shoes:
    try:
        price_list.append(shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]').text)
        shoe_list.append(shoe.find_element(By.CSS_SELECTOR, ".a-text-normal").text)
    except Exception as e:
        continue
shoe_list and price_list now match regardless of whether some items don't display a price. Those items are now omitted from the lists entirely.
Thanks again to those who answered.
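Since the original goal was the average, highest and lowest price, here is a minimal sketch of how the matched lists could be turned into numbers (assuming the price text looks like "£54\n99", as in the outputs above):
import pandas as pd

# convert "£54\n99" style strings into floats, then summarise
prices = [float(p.replace("\n", ".").replace(",", "").lstrip("£")) for p in price_list]
df = pd.DataFrame({"Shoe": shoe_list, "Price": prices})
print(df["Price"].mean(), df["Price"].min(), df["Price"].max())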

Web scraping table data using beautiful soup

I am trying to learn web scraping in Python for a project using Beautiful Soup by doing the following:
Scraping the Kansas City Chiefs active roster player names with the college attended. This is the URL used: https://www.chiefs.com/team/players-roster/.
When I run it, I get an error saying "IndexError: list index out of range".
I don't know if the classes I set are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup
url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name, player_university)
TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries
Indexing
The Python Index Operator is represented by opening and closing square brackets: []. The syntax, however, requires you to put a number inside the brackets.
Example:
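A minimal illustration (a generic list, not tied to the scraped page):
cells = ['a', 'b', 'c']
print(cells[1])  # 'b' -> index 1 is the second element
print(cells[7])  # IndexError: list index out of range (only 3 elements)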
So [7] applies indexing to the preceding iterable (all found tds), to get the element with index 7. In Python indices are 0-based, so they start with 0 for the first element.
In your statement, you take all found cells (the <td> HTML elements with the specified classes) as the iterable and want to get the 8th element by indexing with [7].
row.find_all('td', class_='sorter-lastname selected')[7]
How to avoid index errors?
Are you sure there are any td elements found in the row?
If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raises an IndexError, e.g. in the given script at line 15:
Traceback (most recent call last):
File "<stdin>", line 15, in <module>
player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
Better test on length before indexing:
import requests
from bs4 import BeautifulSoup
url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
Element-queries
Once the index issue was fixed, the queried names still returned empty results like None, None.
We need to debug (thus I added the print inside the loop) and adjust the queries:
(1) for the university-name:
If you follow RJ's answer and choose the last cell without any class condition, then a negative index like -1 counts from the back, i.e. here: the last cell. The number of cells should then be at least 1 (greater than 0).
(2) for the player-name:
It appears to be in the first cell (also with a CSS class for sorting), nested either in a link title <a .. title="Player Name"> or in the following sibling as the inner text of span > a.
CSS selectors
You may use CSS selectors for that and bs4's select or select_one functions. Then you can select a path like td > ? > ? > a and get the title.
(Note: the ? placeholders are left as a challenging exercise for you.)
💡️ Tip: most browsers have an inspector (right click on the element, e.g. the player-name), then choose "inspect element" and an HTML source view opens selecting the element. Right-click again to "Copy" the element as "CSS selector".
Further Reading
About indexing, and the magic of negative numbers like [-1]:
AskPython: Indexing in Python - A Complete Beginners Guide
.. a bit further, about slicing:
Real Python: Indexing and Slicing
Research on Beautiful Soup here:
Using BeautifulSoup to extract the title of a link
Get text with BeautifulSoup CSS Selector
I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. scraping tables is extremely easy in pandas:
import pandas as pd
df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive; for example, print(df[0]):
Player # Pos HT WT Age Exp College
0 Josh Pederson NaN TE 6-5 235 24 R Louisiana-Monroe
1 Brandin Dandridge NaN WR 5-10 180 25 R Missouri Western
2 Justin Watson NaN WR 6-3 215 25 4 Pennsylvania
3 Jonathan Woodard NaN DE 6-5 271 28 3 Central Arkansas
4 Andrew Billings NaN DT 6-1 311 26 5 Baylor
.. ... ... .. ... ... ... .. ...
84 James Winchester 41.0 LS 6-3 242 32 7 Oklahoma
85 Travis Kelce 87.0 TE 6-5 256 32 9 Cincinnati
86 Marcus Kemp 85.0 WR 6-4 208 26 4 Hawaii
87 Chris Jones 95.0 DT 6-6 298 27 6 Mississippi State
88 Harrison Butker 7.0 K 6-4 196 26 5 Georgia Tech
[89 rows x 8 columns]

Getting all links from page, receiving javascript.void() error?

I am trying to get all the links from this page to the incident reports, in CSV format. However, they don't seem to be "real links" (if you open one in a new tab you get an "about:blank" page). They do have their own URLs - visible in inspect element. I'm pretty confused. I did find some code online to do this, but I just got "javascript:void()" for every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/
To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r"window\.open\('(.*?)'")
data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv (screenshot from LibreOffice).
Look at the onclick property of these links and get the "real" address from there.
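A minimal sketch of that idea with Selenium (assuming, as the answer above shows, that each link's onclick calls window.open('...'); if the site has changed, the selector and pattern may need adjusting):
import re
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.avalanche.state.co.us/caic/acc/acc_us.php?')

pattern = re.compile(r"window\.open\('(.*?)'")
for a in driver.find_elements(By.CSS_SELECTOR, 'a[onclick*="window"]'):
    m = pattern.search(a.get_attribute('onclick'))
    if m:
        print(a.text, m.group(1))  # link text and the "real" URL
driver.quit()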

PANDAS Web Scraping Multiple Pages

I was working on scraping data with Beautiful Soup across multiple pages of the website below and was able to do it. Can I scrape data from multiple pages using pandas? Following is the code to scrape a single page; the URL links to other pages as http://www.example.org/whats-on/calendar?page=3.
import pandas as pd
url = 'http://www.example.org/whats-on/calendar?page=3'
dframe = pd.read_html(url,header=0)
dframe[0]
dframe[0].to_csv('out.csv')
Simply loop over the range of page numbers and append to a list of dataframes. Afterwards, concatenate them into one large dataframe and write it to file. One other issue with your current code: header=0 treats the first row as the header, but these pages do not have column headers. Hence, use header=None and then rename the columns.
Below scrapes pages 0 - 3. Extend the loop limit for the other pages.
import pandas as pd
dfs = []
# PAGES 0 - 3 SCRAPE
url = 'http://www.lapl.org/whats-on/calendar?page={}'
for i in range(4):
    dframe = pd.read_html(url.format(i), header=None)[0]\
               .rename(columns={0: 'Date', 1: 'Topic', 2: 'Location',
                                3: 'People', 4: 'Category'})
    dfs.append(dframe)
finaldf = pd.concat(dfs)
finaldf.to_csv('Output.csv')
Output
print(finaldf.head())
# Date Topic Location People Category
# 0 Thu, Nov 09, 201710:00am to 12:30pm California Healthier Living : A Chronic Diseas... West Los Angeles Regional Library Seniors Health
# 1 Thu, Nov 09, 201710:00am to 11:30am Introduction to Microsoft WordLearn the basics... North Hollywood Amelia Earhart Regional Library Adults, Job Seekers, Seniors Computer Class
# 2 Thu, Nov 09, 201711:00am Board of Library Commissioners Central Library Adults Meeting
# 3 Thu, Nov 09, 201712:00pm to 1:00pm Tech TryOutCentral Library LobbyDid you know t... Central Library Adults, Teens Computer Class
# 4 Thu, Nov 09, 201712:00pm to 1:30pm Taller de Tejido/ Crochet WorkshopLearn how to... Benjamin Franklin Branch Library Adults, Seniors, Spanish Speakers Arts and Crafts, En Español
The code below will loop through the pages in the given range, read each page's table, and combine them into one dataframe with the selected field names.
import pandas as pd

def get_from_website():
    frames = []
    for num in range(1, 6):
        website = 'https://weburl/?page=' + str(num)
        datalist = pd.read_html(website)
        frames.append(datalist[0])
    # DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
    Sample = pd.concat(frames)
    Sample.columns = ['Field1', 'Field2', 'Field3', 'Field4', 'Field5',
                      'Field6', 'Time', 'Field7', 'Field8']
    return Sample
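For example, you could then call it like this (the URL inside the function is a placeholder, so substitute the real site first; the output filename is just an example):
df = get_from_website()
df.to_csv('multi_page_output.csv', index=False)  # example filename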
