I am trying to scrape the results table from the following url: https://utmbmontblanc.com/en/page/107/results.html
However when I run my code it says 'No Tables Found'
import pandas as pd
url = 'https://utmbmontblanc.com/en/page/107/results.html'
data = pd.read_html(url, header = 0)
data.head()
ValueError: No tables found
Having used developer tools I know that there is definitely a table in the html code. Why is it not being found? Any help is greatly appreciated. Thanks in advance
The page fetches the table through an Ajax request, so build the URL for that request and read it directly. For 2017 CCC it looks like this:
url = 'https://.......com/result.php?mode=edPass&ajax=true&annee=2017&course=ccc'
data = pd.read_html(url, header = 0)
print(data[0])
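The query string above can also be assembled with the standard library, which makes it easier to switch year or race. This is a minimal sketch, assuming the endpoint accepts the `mode`, `ajax`, `annee` and `course` parameters shown in the answer. `build_results_url` is a hypothetical helper, and the host below is a placeholder because the answer elides the real one.

```python
from urllib.parse import urlencode


def build_results_url(base, year, course):
    """Return the Ajax results URL for one year and race (hypothetical helper)."""
    # Parameter names and fixed values are copied from the URL in the answer.
    query = urlencode(
        {"mode": "edPass", "ajax": "true", "annee": year, "course": course}
    )
    return f"{base}?{query}"


# Placeholder host: substitute the real result.php address from the answer.
url = build_results_url("https://example.invalid/result.php", 2017, "ccc")
print(url)
```

The resulting string can then be passed to pd.read_html in the same way as the hand-written URL.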
You can also use selenium if you are unable to find any other hacks.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
from bs4 import BeautifulSoup as BSoup
import pandas as pd
url = "https://utmbmontblanc.com/en/page/107/results.html"
driver = webdriver.Chrome("/home/bitto/chromedriver")#change this to your chromedriver path
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
#change the index of div[@class='bloc'] to change the year - [1] for 2018, [2] for 2017 etc
#change the index of div[@class='row'] - [1], [2] for TDS etc
#change the @value of option to match your preferred option's value - you can find this with the inspect tool - the first two are Scratch and ScratchH
EC.presence_of_element_located((By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/select[@name='cat'][1]/option[@value='Scratch']"))
)
element.click()#select option
#make relevant changes you made in top here also
driver.find_element(By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/input").click()#click go
sleep(10)#not preferred but will do for now
table=pd.read_html(driver.page_source)
print(table)
Output
[ GeneralRanking Family name First name Club Cat. ... Time Difference/ 1st Nationality
0 1 3001 - HAWKS Hayden HOKA ONE ONE SEH ... 10:24:30 00:00:00 United States
1 2 3018 - ŚWIERC Marcin SALOMON SUUNTO TEAM POLAND SEH ... 10:42:49 00:18:19 Poland
2 3 3005 - POMMERET Ludovic TEAM HOKA V1H ... 10:50:47 00:26:17 France
3 4 3214 - EVANS Thomas COMPRESS SPORT SEH ... 10:57:44 00:33:14 United Kingdom
4 5 3002 - OWENS Tom SALOMON SEH ... 11:03:48 00:39:18 United Kingdom
5 6 3011 - JONSSON Thorbergur 66 NORTH SEH ... 11:14:22 00:49:52 Iceland
6 7 3026 - BOUVIER-GAZ Nicolas TEAM NEW BALANCE SEH ... 11:18:33 00:54:03 France
7 8 3081 - JONES Michael WWW.APEXRUNNING.CO SEH ... 11:31:50 01:07:20 United Kingdom
8 9 3020 - COLLET Aurélien HOKA ONE ONE SEH ... 11:33:10 01:08:40 France
9 10 3009 - MARAVILLA Jorge HOKA ONE ONE V1H ... 11:36:14 01:11:44 United States
10 11 3036 - PERRILLAT Christophe SEH ... 11:40:05 01:15:35 France
11 12 3070 - FRAGUELA BREIJO Alejandro STUDIO54 V1H ... 11:40:11 01:15:41 Spain
12 13 3092 - AIGROZ Mike TRUST SEH ... 11:41:53 01:17:23 Switzerland
13 14 3021 - O'LEARY Paddy THE NORTH FACE SEH ... 11:47:04 01:22:34 Ireland
14 15 3065 - PÉREZ TORREGLOSA Juan CLUB ULTRATRAIL ... SEH ... 11:47:51 01:23:21 Spain
15 16 3031 - SÁNCHEZ CEBRIÁN Miguel Ángel LURBEL-LI... V1H ... 11:49:15 01:24:45 Spain
16 17 3062 - ANDREWS Justin SEH ... 11:49:47 01:25:17 United States
17 18 3039 - PIANA Giulio TEAM MUD AND SNOW SEH ... 11:50:23 01:25:53 Italy
18 19 3047 - RONIMOISS Andris Inov8 / OSveikals.lv ... SEH ... 11:52:25 01:27:55 Latvia
19 20 3052 - DURAND Regis TEAM TRAIL ISOSTAR V1H ... 11:56:40 01:32:10 France
20 21 3027 - SANDES Ryan SALOMON SEH ... 12:04:39 01:40:09 South Africa
21 22 3014 - EL MORABITY Rachid ULTRA TRAIL ATLAS T... SEH ... 12:10:01 01:45:31 Morocco
22 23 3067 - JONES Harry RUNIVORE SEH ... 12:10:12 01:45:42 United Kingdom
23 24 3030 - CLAVERY Erik - SEH ... 12:12:56 01:48:26 France
24 25 3056 - JIMENEZ LLORENS Juan Maria GREEN POWER... SEH ... 12:13:18 01:48:48 Spain
25 26 3024 - GALLAGHER Clare THE NORTH FACE SEF ... 12:13:57 01:49:27 United States
26 27 3136 - ASSEL Garry LICENCE INDIVIDUELLE LUXEM... SEH ... 12:20:46 01:56:16 Luxembourg
27 28 3071 - RIGODANZA Francesco SPIRITO TRAIL TEAM SEH ... 12:22:49 01:58:19 Italy
28 29 3118 - POLASZEK Christophe CHARTRES VERTICAL V1H ... 12:24:49 02:00:19 France
29 30 3125 - CALERO RODRIGUEZ David Altmann Sports/... SEH ... 12:25:07 02:00:37 Spain
... ... ... ... ... ... ... ...
1712 1713 5734 - GOT Hang Fai V2H ... 26:25:01 16:00:31 Hong Kong, China
1713 1714 4154 - RAMOS Liliana NIKE RUNNING CLUB V3F ... 26:26:22 16:01:52 Argentina
1714 1715 5448 - BECKRICH Xavier PHOENIX57 V1H ... 26:26:45 16:02:15 France
1715 1716 5213 - BARBERIO ARNOULT Isabelle PHOENIX57 V1F ... 26:26:49 16:02:19 France
1716 1717 4704 - ZHANG Zheng XIAOMABENTENG SEH ... 26:28:37 16:04:07 China
1717 1718 5282 - GUISOLAN Frédéric SEH ... 26:28:46 16:04:16 Switzerland
1718 1719 5306 - MEDINA Rafael V1H ... 26:29:26 16:04:56 Mexico
1719 1720 5379 - PENTCHEFF Nicolas SEH ... 26:33:05 16:08:35 France
1720 1721 4665 - GONZALEZ SUANCES Israel BAR ES PUIG V1H ... 26:33:58 16:09:28 Spain
1721 1722 4389 - TONANNY Marie SEF ... 26:34:51 16:10:21 France
1722 1723 5616 - GLORIAN Thierry V2H ... 26:35:47 16:11:17 France
1723 1724 5684 - CHEUNG Ho FAITHWALKERS V1H ... 26:37:09 16:12:39 Hong Kong, China
1724 1725 5719 - GANDER Pascal JEFF B TRAIL SEH ... 26:39:04 16:14:34 France
1725 1726 4555 - JURGIELEWICZ Urszula SEF ... 26:39:44 16:15:14 Poland
1726 1727 4722 - HIDALGO José Miguel C.D. ATLETISMO SAN... V1H ... 26:40:27 16:15:57 Spain
1727 1728 4425 - JITTIWUTIKARN Gif V1F ... 26:41:02 16:16:32 Thailand
1728 1729 4556 - ZHU Jing SEF ... 26:41:12 16:16:42 China
1729 1730 4314 - HU Dongli V1H ... 26:41:27 16:16:57 China
1730 1731 4239 - DURET Estelle OXYGENE BELBEUF V1F ... 26:41:51 16:17:21 France
1731 1732 4525 - MAGLIERI Fabrice ATHLETIC CLUB PAYS DE... V1H ... 26:42:11 16:17:41 France
1732 1733 4433 - ANDERSEN Laura Jentsch RUN DEM CREW SEF ... 26:42:27 16:17:57 Denmark
1733 1734 4563 - CHEUNG Annie On Nai FAITHWALKERS V1F ... 26:45:35 16:21:05 Hong Kong, China
1734 1735 4355 - KHALED Naïm GENEVE AEROPORT SEH ... 26:47:50 16:23:20 Algeria
1735 1736 4749 - STELLA Sara COURMAYEUR TRAILERS V1F ... 26:48:07 16:23:37 Italy
1736 1737 4063 - LALIMAN Leslie SEF ... 26:48:09 16:23:39 France
1737 1738 5702 - BURKE Tony Alchester/CTR/Bicester Tri V2H ... 26:50:52 16:26:22 Ireland
1738 1739 5146 - OLIVEIRA Sandra BUDEGUITA RUNNERS V1F ... 26:52:23 16:27:53 Portugal
1739 1740 5545 - VELLANDI Emilio TEAM PEGGIORI SCARPA MICO V1H ... 26:55:32 16:31:02 Italy
1740 1741 5543 - GASPAROVIC Bernard STADE FRANCAIS V3H ... 26:56:31 16:32:01 France
1741 1742 4760 - MENDONCA Carine ASPTT COMPIEGNE V2F ... 27:19:15 16:54:45 Belgium
[1742 rows x 7 columns]]
The code is supposed to find the matching books when it is run: you enter a category, then a comparison sign such as '==', and then the value of that category you are looking for. For instance, if I want the books published by Vertigo, I would input 'publisher', then '==', and then 'Vertigo'. My professor says that there are three errors in my code, but I have no idea what he's talking about. Can someone help me find these errors and show me how to fix them in Python?
import csv
def convertIfPossible(value):
# Convert a string value to integer or float if possible
try:
val = int(value)
return val
except:
try:
val = float(value)
return val
except:
return value
def readCSVIntoDictionary(fname):
# read a CSV file and make it into a dictionary
myDictionary = dict()
with open(fname) as f:
freader = csv.reader(f)
headers = next(freader)
ID = 1
for row in freader:
itemDictionary = dict()
for i in range(0,len(row)):
itemDictionary[headers[i]] = convertIfPossible(row[i])
myDictionary[ID] = itemDictionary
ID += 1
f.close()
return myDictionary
def printItem(item, indent=0):
# print out a dictionary item formatted somewhat nicely
for k, v in item.items():
for i in range(0,indent):
print(" |",end="")
print(f' {k:<20}',end="")
if (type(v) is dict):
print()
printItem(v,indent+1)
else:
print(f': {v}')
def lookupIDs(myDict,key,value,matchType="=="):
# find the IDs of all the items that match for category key on the
# value value using the comparison matchType
matchingKeys = []
for i in myDict.keys():
if (type(myDict[i][key]) == type(value)):
if (matchType == "=="):
if (myDict[i][key] == value):
matchingKeys.append(i)
elif (matchType == "<="):
if (myDict[i][key] <= value):
matchingKeys.append(i)
elif (matchType == "<="):
if (myDict[i][key] >= value):
matchingKeys.append(i)
elif (matchType == "<"):
if (myDict[i][key] < value):
matchingKeys.append(i)
elif (matchType == ">"):
if (myDict[i][key] < value):
matchingKeys.append(i)
return matchingKeys
def categoryExists(myDict,category):
# check if a particular category occurs in the first item
firstKey = list(myDict.keys())[0]
if (myDict[firstKey].get(category,None) == None):
return False
return True
def printMatches(myDict,matches):
# print the list of matches or No matches if none
if (len(matches) <= 1):
print("No matches")
return
else:
print("Matches")
for m in matches:
printItem(myDict[m])
print()
bookDictionary = readCSVIntoDictionary(r'C:\Users\jsric\OneDrive\Desktop\PopularBooks.csv')
print(f'There are {len(bookDictionary.keys())} items in the file')
print('These are the items:')
printItem(bookDictionary)
while True:
category = input("Category to look up: ")
if (categoryExists(bookDictionary,category)):
comparisonToDo = input("Comparison to do: ")
valueToCompare = convertIfPossible(input("Value to compare: "))
matches = lookupIDs(bookDictionary,category,valueToCompare,comparisonToDo)
'''
Here's the file I'm using
'''
bookID title authors average_rating isbn language_code num_pages ratings_count text_reviews_count publication_date publisher
24812 The Complete Calvin and HobbesBill Watterson 4.82 7.41E+08 eng 1456 32213 930 9/6/05 Andrews McMeel Publishing
8 Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5)J.K. Rowling/Mary GrandPré4.78 4.4E+08 eng 2690 41428 164 9/13/04 Scholastic
24814 It's a Magical World (Calvin and Hobbes #11)Bill Watterson 4.76 8.36E+08 eng 176 23875 303 9/1/96 Andrews McMeel Publishing
10 Harry Potter Collection (Harry Potter #1-6)J.K. Rowling 4.73 4.4E+08 eng 3342 28242 808 9/12/05 Scholastic
6550 Early ColorSaul Leiter/Martin Harrison4.73 3.87E+09 eng 156 144 8 1/15/06 Steidl
24816 Homicidal Psycho Jungle Cat (Calvin and Hobbes #9)Bill Watterson 4.72 8.36E+08 eng 176 15365 290 9/6/94 Andrews McMeel Publishing
34545 Elliott Erwitt: SnapsMurray Sayle/Charles Flowers/Elliott Erwitt4.72 071484330Xen-GB 544 102 6 6/1/03 Phaidon Press
24820 Calvin and Hobbes: Sunday Pages 1985-1995: An Exhibition CatalogueBill Watterson 4.71 7.41E+08 eng 96 3613 85 9/17/01 Andrews McMeel Publishing
20749 Study Bible: NIVAnonymous 4.7 3.11E+08 eng 2198 4166 186 10/1/02 Zondervan Publishing House
24520 The Complete Aubrey/Maturin Novels (5 Volumes)Patrick O'Brian 4.7 039306011Xeng 6576 1338 81 10/17/04 W. W. Norton Company
44826 The Price of the Ticket: Collected Nonfiction 1948-1985James Baldwin 4.7 3.13E+08 eng 712 404 30 9/15/85 St. Martin's Press
24818 The Days Are Just PackedBill Watterson 4.69 8.36E+08 eng 176 20308 244 9/1/93 Andrews McMeel Publishing
26805 The Sibley Field Guide to Birds of Western North AmericaDavid Allen Sibley4.69 6.79E+08 en-US 473 730 36 4/29/03 Alfred A. Knopf
5309 The Life and Times of Scrooge McDuckDon Rosa 4.67 9.12E+08 eng 266 2467 149 6/1/05 Gemstone Publishing
23753 The Absolute Sandman Volume OneNeil Gaiman/Mike Dringenberg/Chris Bachalo/Michael Zulli/Kelly Jones/Charles Vess/Colleen Doran/Malcolm Jones III/Steve Parkhouse/Daniel Vozzo/Lee Loughridge/Steve Oliff/Todd Klein/Dave McKean/Sam Kieth4.65 1.4E+09 eng 612 15640 512 11/1/06 Vertigo
3582 The New Annotated Sherlock Holmes: The Complete Short StoriesArthur Conan Doyle/Leslie S. Klinger4.64 3.93E+08 eng 1878 1411 54 11/30/04 W. W. Norton & Company
27204 The Gospel According to LukeAnonymous/Thomas Cahill4.64 8.02E+08 eng 81 169 17 10/29/99 Grove Press
39661 The Shawshank Redemption: The Shooting ScriptFrank Darabont/Stephen King4.64 1.56E+09 eng 184 2406 29 9/30/04 Newmarket Press
13206 The Collected Autobiographies of Maya AngelouMaya Angelou 4.63 6.8E+08 eng 1184 991 55 9/21/04 Modern Library
17280 The Feynman Lectures on Physics Vol 3Richard P. Feynman/Robert B. Leighton/Matthew L. Sands4.63 8.05E+08 eng 384 716 15 9/1/05 Addison Wesley Publishing Company
24813 The Calvin and Hobbes Tenth Anniversary BookBill Watterson 4.63 8.36E+08 eng 208 49122 368 9/5/95 Andrews McMeel Publishing
24819 The Calvin And Hobbes: Tenth Anniversary BookBill Watterson 4.63 7.52E+08 eng 208 303 12 2/1/08 Time Warner Books UK
19333 The World of Peter Rabbit (Original Peter Rabbit Books 1-23)Beatrix Potter 4.62 7.23E+08 eng 1072 220 14 5/4/06 Warne
23721 Nausicaä of the Valley of the Wind Vol. 6 (Nausicaä of the Valley of the Wind #6)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.62 1.59E+09 eng 159 1776 33 8/10/04 VIZ Media
313 100 Years of LynchingsRalph Ginzburg 4.61 9.33E+08 eng 270 88 4 11/22/96 Black Classic Press
17142 Collected Essays: Notes of a Native Son / Nobody Knows My Name / The Fire Next Time / No Name in the Street / The Devil Finds Work / Other EssaysJames Baldwin/Toni Morrison4.61 1.88E+09 eng 869 1495 75 2/1/98 Library of America (NY)
17334 The Complete C.S. Lewis Signature ClassicsC.S. Lewis 4.61 61208493 eng 746 925 40 2/6/07 HarperOne
19445 A Guide to the Words of My Perfect TeacherNgawang Pelzang/Padmakara Translation Group/Patrul Rinpoche/Alak Zenkar4.61 1.59E+09 eng 336 79 0 6/22/04 Shambhala
23318 Discovery of the Presence of God: Devotional NondualityDavid R. Hawkins4.61 9.72E+08 en-US 296 186 11 6/28/07 Veritas Publishing
23722 Nausicaä of the Valley of the Wind Vol. 5 (Nausicaä of the Valley of the Wind #5)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.61 1.59E+09 eng 151 1904 28 6/30/04 VIZ Media
23724 Nausicaä of the Valley of the Wind Vol. 7 (Nausicaä of the Valley of the Wind #7)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.61 1.59E+09 eng 223 1716 79 9/7/04 VIZ Media
5545 The Feynman Lectures on Physics 3 VolsRichard P. Feynman/Robert B. Leighton/Matthew L. Sands4.6 2.02E+08 en-US 3 78 7 1/1/89 Addison Wesley Publishing Company
9325 Fullmetal Alchemist Vol. 10Hiromu Arakawa/Akira Watanabe4.6 1.42E+09 eng 200 8989 151 11/21/06 VIZ Media LLC
26426 Fullmetal Alchemist Vol. 12 (Fullmetal Alchemist #12)Hiromu Arakawa/Akira Watanabe4.6 1.42E+09 eng 192 7480 119 3/20/07 VIZ Media LLC
30 J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the RingsJ.R.R. Tolkien 4.59 3.46E+08 eng 1728 101233 1550 9/25/12 Ballantine Books
119 The Lord of the Rings: The Art of the Fellowship of the RingGary Russell 4.59 6.18E+08 eng 192 26153 102 6/12/02 Houghton Mifflin Harcourt
3529 The World's First Love: Mary Mother of GodFulton J. Sheen 4.59 8.99E+08 en-US 276 641 63 9/1/96 Ignatius Press
15336 The Lord of the Rings / The HobbitJ.R.R. Tolkien 4.59 7144083 eng 1600 141 5 10/7/02 Collins Modern Classics
23506 Fullmetal Alchemist Vol. 11 (Fullmetal Alchemist #11)Hiromu Arakawa/Akira Watanabe4.59 1.42E+09 eng 192 7655 129 1/16/07 VIZ Media LLC
23720 Nausicaä of the Valley of the Wind Vol. 4 (Nausicaä of the Valley of the Wind #4)Hayao Miyazaki/David Lewis/Toren Smith/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.59 1.59E+09 eng 134 1956 38 6/2/04 VIZ Media
26422 Fullmetal Alchemist Vol. 14 (Fullmetal Alchemist #14)Hiromu Arakawa/Akira Watanabe4.59 142151379Xeng 192 8634 118 8/14/07 VIZ Media LLC
16602 Coffin: The Art of Vampire Hunter DYoshitaka Amano4.58 1.6E+09 eng 199 596 16 11/1/06 Dark Horse Books
17961 Collected FictionsJorge Luis Borges/Andrew Hurley4.58 1.4E+08 eng 565 18874 791 9/30/99 Penguin Classics Deluxe Edition
33342 The More Than Complete Hitchhiker's Guide (Hitchhiker's Guide #1-4 + short story)Douglas Adams 4.58 6.81E+08 en-US 624 433 34 11/1/89 Longmeadow Press
41468 Dewdrops on a Lotus Leaf: Zen Poems of RyokanRyōkan/John Stevens4.58 1.59E+09 eng 120 183 19 4/13/04 Shambhala
44734 Fullmetal Alchemist Vol. 6 (Fullmetal Alchemist #6)Hiromu Arakawa/Akira Watanabe4.58 1.42E+09 eng 200 10052 201 3/21/06 VIZ Media LLC
1 Harry Potter and the Half-Blood Prince (Harry Potter #6)J.K. Rowling/Mary GrandPré4.57 4.4E+08 eng 652 2095690 27591 9/16/06 Scholastic Inc.
866 Fullmetal Alchemist Vol. 9 (Fullmetal Alchemist #9)Hiromu Arakawa/Akira Watanabe4.57 142150460Xeng 192 9013 153 9/19/06 VIZ Media LLC
869 Fullmetal Alchemist Vol. 8 (Fullmetal Alchemist #8)Hiromu Arakawa/Akira Watanabe4.57 1.42E+09 eng 192 11451 161 7/18/06 VIZ Media LLC
'''
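One way to check the comparison logic in lookupIDs is to map each operator string to a function from the standard operator module, so that every symbol appears exactly once. In the posted code the second "<=" branch tests >=, the ">" branch tests <, and printMatches reports "No matches" when there is exactly one match; these look like likely candidates for the three errors, though only the professor can confirm that. The sketch below is an assumption about the intended behaviour, not the graded answer. `lookup_ids`, `COMPARATORS` and the three-item `books` dictionary are hypothetical, with values borrowed from the file above.

```python
import operator

# Each comparison symbol maps to exactly one function.
COMPARATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def lookup_ids(my_dict, key, value, match_type="=="):
    """Return the IDs whose `key` field compares true against `value`."""
    compare = COMPARATORS[match_type]
    return [
        item_id
        for item_id, item in my_dict.items()
        # Only compare values of the same type, as the posted code does.
        if type(item[key]) == type(value) and compare(item[key], value)
    ]


# Tiny illustrative sample, not the full CSV.
books = {
    1: {"publisher": "Vertigo", "average_rating": 4.65},
    2: {"publisher": "Scholastic", "average_rating": 4.78},
    3: {"publisher": "Steidl", "average_rating": 4.73},
}

print(lookup_ids(books, "publisher", "Vertigo"))
print(lookup_ids(books, "average_rating", 4.7, ">"))
```

With this layout, a single match is still a match, so the caller would test for an empty list rather than for a length of one or less.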
I am trying to create 2 columns in my dataframe for Longitude and Latitude which I want to find by using my address column called 'Details'.
I have tried
from geopy.extra.rate_limiter import RateLimiter
locator=Nominatim(user_agent="MyGeocoder")
results['location']=results['Details'].apply
results['point']=results['location'].apply(lambda loc:tuple(loc['point']) if loc else None)
results[['latitude', 'longitude',]]=pd.DataFrame(results['point'].tolist(), index=results.index)
But this gives the error "method object is not subscriptable"
I want to create a loop to get all coordinates for each address
Details Sale Price Post Code Year Sold
1 53 Eastbury Grove, London, W4 2JT Flat, Lease... 450000.0 W4 2020
2 Flat 148 Wedgwood House Lambeth Walk, London, ... 325000.0 E11 2020
3 63 Russell Road, Wimbledon, London, SW19 1QN ... 800000.0 W19 2020
4 Flat 2 9 Queens Gate Place, London, SW7 5NX F... 400000.0 W7 2020
5 83 Chingford Mount Road, London, E4 8LU Freeh... 182000.0 E4 2020
... ... ... ... ...
47 702 Rutherford Heights Rodney Road, London, SE... 554750.0 E17 2015
48 Flat 48 Highlands Court Highland Road, London,... 340000.0 E19 2015
49 5 Mount Nod Road, London, SW16 2LQ Flat, Leas... 395000.0 W16 2015
50 6 Woodmill Street, London, SE16 3GG Terraced,... 1010000.0 E16 2015
51 402 Rutherford Heights Rodney Road, London, SE... 403200.0 E17 2015
300 rows × 4 columns
Try this:
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
locator = Nominatim(user_agent="myGeocoder")
# wait at least one second between requests
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
def lat_long(row):
    # geocode returns None when the address cannot be found
    loc = geocode(row["Details"])
    row["latitude"] = loc.latitude if loc else None
    row["longitude"] = loc.longitude if loc else None
    return row
# apply returns a new dataframe, so assign it back
results = results.apply(lat_long, axis=1)
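The row-wise approach above can also be written column-wise, with the geocoding function passed in as an argument so it can be tried without network access. This is a sketch under stated assumptions: `add_coordinates` and `fake_geocode` are hypothetical names, the sample address and coordinates are made up, and the only thing assumed about a real geocoder result is that it exposes `latitude` and `longitude` attributes or is None.

```python
from types import SimpleNamespace

import pandas as pd


def add_coordinates(frame, geocode, address_col="Details"):
    """Return a copy of `frame` with latitude and longitude columns added."""
    # One geocoder call per address; a failed lookup is expected to be None.
    locations = frame[address_col].apply(geocode)
    out = frame.copy()
    out["latitude"] = locations.apply(
        lambda loc: loc.latitude if loc is not None else None
    )
    out["longitude"] = locations.apply(
        lambda loc: loc.longitude if loc is not None else None
    )
    return out


def fake_geocode(address):
    """Stand-in for a real geocoder, with made-up coordinates."""
    known = {
        "10 Example Street, London": SimpleNamespace(latitude=51.5, longitude=-0.12)
    }
    return known.get(address)


sample = pd.DataFrame({"Details": ["10 Example Street, London", "unknown place"]})
out = add_coordinates(sample, fake_geocode)
print(out)
```

With geopy, the same function could be called as add_coordinates(results, RateLimiter(locator.geocode, min_delay_seconds=1)), so that requests are spaced out.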
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
For some reason, the last line with header one gives me an error, "list index out of range". I am not too sure what is causing this error to happen, but I know I need this line. Here is a link to the website I am using for the data, https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. The specific table I want is the one that is below the horizontal bar chart.
Traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-47-67ef2aac7bf3> in <module>
28 data_tables.append(td.findAll('table'))
29
---> 30 header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
31
32 header1
IndexError: list index out of range
Use pandas.read_html
Read HTML tables into a list of DataFrame objects.
This answer side-steps the question to provide a more efficient method for extracting tables from Wikipedia and gives the OP the desired end result.
The following code will more easily get the desired table from the Wikipedia page.
.read_html will return a list of dataframes.
The table you're interested in is at index 4.
Clean the table
Select the rows and columns with valid data.
This method does return the table headers, but the column names are multi-level so we'll rename them.
Before renaming the columns, if you need the original data from the column names, use us_covid_data.columns which will return a list of tuples with all the column name values.
import pandas as pd
# get list of dataframes and select index 4
us_covid_data = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')[4]
# select rows and columns
us_covid_data = us_covid_data.iloc[0:56, 1:6]
# rename columns
us_covid_data.columns = ['state_territory', 'cases', 'deaths', 'recovered', 'hospitalized']
# display(us_covid_data)
state_territory cases deaths recovered hospitalized
0 Alabama 45785 1033 22082 2961
1 Alaska 1184 17 560 78
2 American Samoa 0 0 – –
3 Arizona 116892 2082 – 5272
4 Arkansas 24253 292 17834 1604
5 California 296499 6711 – –
6 Colorado 34316 1704 – 5527
7 Connecticut 46976 4338 – –
8 Delaware 12293 512 6778 –
9 District of Columbia 10569 561 1465 –
10 Florida 244151 4102 – 15150
11 Georgia 111211 2965 – 11919
12 Guam 1272 6 179 –
13 Hawaii 1012 19 746 116
14 Idaho 8222 94 2886 350
15 Illinois 151767 7144 – –
16 Indiana 49560 2698 36788 7139
17 Iowa 31906 725 24242 –
18 Kansas 17618 282 – 1269
19 Kentucky 17526 623 4785 2662
20 Louisiana 66435 3296 43026 –
21 Maine 3440 110 2787 354
22 Maryland 70497 3246 – 10939
23 Massachusetts 111110 8296 88725 10985
24 Michigan 73403 6225 52841 –
25 Minnesota 38606 1511 33907 4112
26 Mississippi 31257 1114 22167 2881
27 Missouri 24985 1077 – –
28 Montana 1249 23 678 89
29 Nebraska 20053 286 14641 1224
30 Nevada 22930 537 – –
31 New Hampshire 5914 382 4684 558
32 New Jersey 174628 15479 31014 –
33 New Mexico 14549 539 6181 2161
34 New York 400299 32307 71371 –
35 North Carolina 81331 1479 55318 –
36 North Dakota 3858 89 3350 218
37 Northern Mariana Islands 31 2 19 –
38 Ohio 57956 2927 – 7292
39 Oklahoma 16362 399 12432 1676
40 Oregon 10402 218 2846 1069
41 Pennsylvania 93876 6880 – –
42 Puerto Rico 8714 157 – –
43 Rhode Island 16991 960 – 1922
44 South Carolina 47214 838 – –
45 South Dakota 7105 97 6062 689
46 Tennessee 51509 646 31020 2860
47 Texas 240111 3013 122996 9610
48 Virgin Islands 112 6 79 –
49 Utah 25563 190 14448 1565
50 Vermont 1251 56 1022 –
51 Virginia 66740 1881 – 9549
52 Washington 38517 1370 – 4463
53 West Virginia 3461 95 2518 –
54 Wisconsin 35318 805 25542 3574
55 Wyoming 1675 20 1172 253
Addressing the original issue:
data is an empty list generated from data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
With data = data_table.tbody.findAll('tr', recursive=False)[1] and then data = [v for v in data.get_text().split('\n') if v], you will get the headers.
The output of data will be ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']
Since data_tables is generated from iterating through data, it is also empty.
header1 is generated from iterating data_tables[0], so IndexError occurs because data_tables is empty.
I have the following dict which I converted to dataframe
players_info = {'Afghanistan': {'Asghar Stanikzai': 809.0,
'Mohammad Nabi': 851.0,
'Mohammad Shahzad': 1713.0,
'Najibullah Zadran': 643.0,
'Samiullah Shenwari': 774.0},
'Australia': {'AJ Finch': 1082.0,
'CL White': 988.0,
'DA Warner': 1691.0,
'GJ Maxwell': 822.0,
'SR Watson': 1465.0},
'England': {'AD Hales': 1340.0,
'EJG Morgan': 1577.0,
'JC Buttler': 985.0,
'KP Pietersen': 1176.0,
'LJ Wright': 759.0}}
pd.DataFrame(players_info)
The resulting output is
But I want the columns to be mapped with rows like the following
Player Team Score
Mohammad Nabi Afghanistan 851.0
Mohammad Shahzad Afghanistan 1713.0
Najibullah Zadran Afghanistan 643.0
JC Buttler England 985.0
KP Pietersen England 1176.0
LJ Wright England 759.0
I tried reset_index, but it does not give the result I want. How can I do that?
You need:
df = df.stack().reset_index()
df.columns=['Player', 'Team', 'Score']
Output of df.head(5):
Player Team Score
0 AD Hales England 1340.0
1 AJ Finch Australia 1082.0
2 Asghar Stanikzai Afghanistan 809.0
3 CL White Australia 988.0
4 DA Warner Australia 1691.0
Let's take a stab at this using melt. Should be pretty fast.
df.rename_axis('Player').reset_index().melt('Player').dropna()
Player variable value
2 Asghar Stanikzai Afghanistan 809.0
10 Mohammad Nabi Afghanistan 851.0
11 Mohammad Shahzad Afghanistan 1713.0
12 Najibullah Zadran Afghanistan 643.0
14 Samiullah Shenwari Afghanistan 774.0
16 AJ Finch Australia 1082.0
18 CL White Australia 988.0
19 DA Warner Australia 1691.0
21 GJ Maxwell Australia 822.0
28 SR Watson Australia 1465.0
30 AD Hales England 1340.0
35 EJG Morgan England 1577.0
37 JC Buttler England 985.0
38 KP Pietersen England 1176.0
39 LJ Wright England 759.0
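The melt call can also name the output columns directly, which avoids a separate renaming step. A minimal sketch, assuming a cut-down version of the players_info dict from the question; `long_df` is a hypothetical name, and `var_name` and `value_name` are standard parameters of DataFrame.melt.

```python
import pandas as pd

# Cut-down version of the dict from the question.
players_info = {
    "Afghanistan": {"Mohammad Nabi": 851.0, "Mohammad Shahzad": 1713.0},
    "England": {"JC Buttler": 985.0, "LJ Wright": 759.0},
}

df = pd.DataFrame(players_info)

long_df = (
    df.rename_axis("Player")   # name the index so reset_index creates "Player"
      .reset_index()
      .melt(id_vars="Player", var_name="Team", value_name="Score")
      .dropna()                # drop player/team pairs that do not exist
      .reset_index(drop=True)
)
print(long_df)
```

The dropna step matters because the wide frame holds a NaN for every player who does not belong to a given team.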
Hi, I am trying to aggregate some data in a dataframe using agg, but my initial statement raised the warning "FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version". I rewrote it based on the pandas documentation, but instead of getting the right column label I am getting a function label, for example "<function size at 0x0000000002DE9950>". How can I correct the output so that the labels match the deprecated output above, with column names std, mean, size, sum?
Deprecated Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
.agg({'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
Deprecated Syntax Output:
std mean size sum
Continent
Asia 6.790979e+08 5.797333e+08 5.0 2.898666e+09
Australia NaN 2.331602e+07 1.0 2.331602e+07
Europe 3.464767e+07 7.632161e+07 6.0 4.579297e+08
North America 1.996696e+08 1.764276e+08 2.0 3.528552e+08
South America NaN 2.059153e+08 1.0 2.059153e+08
New Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
.agg(['size', 'sum', 'mean', 'std'])\
.rename(columns={'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
New Syntax Output:
<function size at 0x0000000002DE9950> <function sum at 0x0000000002DE90D0> <function mean at 0x0000000002DE9AE8> <function std at 0x0000000002DE9B70>
Continent
Asia 5 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1 2.331602e+07 2.331602e+07 NaN
Europe 6 4.579297e+08 7.632161e+07 3.464767e+07
North America 2 3.528552e+08 1.764276e+08 1.996696e+08
South America 1 2.059153e+08 2.059153e+08 NaN
Dataframe:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 Pop Est Continent
Country
China 1 127050 126767 597237 411683 4.70 138 1.271910e+11 93.0 19.754910 3.992331e+12 4.559041e+12 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12 1.367645e+09 Asia
United States 2 96661 94747 792274 265436 8.20 230 9.083800e+10 286.0 11.570980 1.479230e+13 1.505540e+13 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13 3.176154e+08 North America
Japan 3 30504 30287 223024 61554 7.31 134 1.898400e+10 149.0 10.232820 5.496542e+12 5.617036e+12 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12 1.274094e+08 Asia
United Kingdom 4 20944 20357 206091 37874 9.84 139 7.920000e+09 124.0 10.600470 2.419631e+12 2.482203e+12 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12 6.387097e+07 Europe
Russian Federation 5 18534 18301 34266 12422 1.85 57 3.070900e+10 214.0 17.288680 1.385793e+12 1.504071e+12 1.583004e+12 1.459199e+12 1.524917e+12 1.589943e+12 1.645876e+12 1.666934e+12 1.678709e+12 1.616149e+12 1.435000e+08 Europe
Canada 6 17899 17620 215003 40930 12.01 149 1.043100e+10 296.0 61.945430 1.564469e+12 1.596740e+12 1.612713e+12 1.565145e+12 1.613406e+12 1.664087e+12 1.693133e+12 1.730688e+12 1.773486e+12 1.792609e+12 3.523986e+07 North America
Germany 7 17027 16831 140566 27426 8.26 126 1.326100e+10 165.0 17.901530 3.332891e+12 3.441561e+12 3.478809e+12 3.283340e+12 3.417298e+12 3.542371e+12 3.556724e+12 3.567317e+12 3.624386e+12 3.685556e+12 8.036970e+07 Europe
India 8 15005 14841 128763 37209 8.58 115 3.319500e+10 26.0 14.969080 1.265894e+12 1.374865e+12 1.428361e+12 1.549483e+12 1.708459e+12 1.821872e+12 1.924235e+12 2.051982e+12 2.200617e+12 2.367206e+12 1.276731e+09 Asia
France 9 13153 12973 130632 28601 9.93 114 1.059700e+10 166.0 17.020280 2.607840e+12 2.669424e+12 2.674637e+12 2.595967e+12 2.646995e+12 2.702032e+12 2.706968e+12 2.722567e+12 2.729632e+12 2.761185e+12 6.383735e+07 Europe
South Korea 10 11983 11923 114675 22595 9.57 104 1.100700e+10 221.0 2.279353 9.410199e+11 9.924316e+11 1.020510e+12 1.027730e+12 1.094499e+12 1.134796e+12 1.160809e+12 1.194429e+12 1.234340e+12 1.266580e+12 4.980543e+07 Asia
Italy 11 10964 10794 111850 26661 10.20 106 6.530000e+09 109.0 33.667230 2.202170e+12 2.234627e+12 2.211154e+12 2.089938e+12 2.125185e+12 2.137439e+12 2.077184e+12 2.040871e+12 2.033868e+12 2.049316e+12 5.990826e+07 Europe
Spain 12 9428 9330 123336 23964 13.08 115 4.923000e+09 106.0 37.968590 1.414823e+12 1.468146e+12 1.484530e+12 1.431475e+12 1.431673e+12 1.417355e+12 1.380216e+12 1.357139e+12 1.375605e+12 1.419821e+12 4.644340e+07 Europe
Iran 13 8896 8819 57470 19125 6.46 72 9.172000e+09 119.0 5.707721 3.895523e+11 4.250646e+11 4.289909e+11 4.389208e+11 4.677902e+11 4.853309e+11 4.532569e+11 4.445926e+11 4.639027e+11 NaN 7.707563e+07 Asia
Australia 14 8831 8725 90765 15606 10.28 107 5.386000e+09 231.0 11.810810 1.021939e+12 1.060340e+12 1.099644e+12 1.119654e+12 1.142251e+12 1.169431e+12 1.211913e+12 1.241484e+12 1.272520e+12 1.301251e+12 2.331602e+07 Australia
Brazil 15 8668 8596 60702 14396 7.00 86 1.214900e+10 59.0 69.648030 1.845080e+12 1.957118e+12 2.056809e+12 2.054215e+12 2.208872e+12 2.295245e+12 2.339209e+12 2.409740e+12 2.412231e+12 2.319423e+12 2.059153e+08 South America
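Drop the rename call. agg already labels the columns with the strings you pass in, and the rename maps those labels to the NumPy function objects, which is why their repr shows up as the header. Try using just this: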
Top15.set_index('Continent').groupby(level=0)['Pop Est'].agg(['size', 'sum', 'mean', 'std'])
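If the column order of the deprecated output (std, mean, size, sum) matters, the columns can be reordered after aggregating. A minimal sketch, assuming a three-row stand-in for Top15 built from population values in the question; `top15` and `stats` are hypothetical names.

```python
import pandas as pd

# Three-row stand-in for Top15, using values from the question's dataframe.
top15 = pd.DataFrame(
    {
        "Continent": ["North America", "North America", "Australia"],
        "Pop Est": [3.176154e08, 3.523986e07, 2.331602e07],
    },
    index=["United States", "Canada", "Australia"],
)

# String names keep the column labels as plain strings.
stats = top15.groupby("Continent")["Pop Est"].agg(["size", "sum", "mean", "std"])

# Reorder to match the column order of the deprecated output.
stats = stats[["std", "mean", "size", "sum"]]
print(stats)
```

A continent with a single country gets NaN for std, as in the question's output, because the sample standard deviation is undefined for one value.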