Scrape entire table from wikipedia using beautifulsoup and then load into pandas

I'm scraping the Wikipedia page https://en.wikipedia.org/wiki/Cargo_aircraft, which has a single table, under the Comparisons section. I want to scrape the entire table and load it into pandas. I can get the first column, Aircraft, but I'm having trouble scraping the remaining columns, starting with Volume.
How can I add every row (or every column) of the table to the DataFrame? I'm not sure which approach is better.
from bs4 import BeautifulSoup
import requests
import pandas as pd
#this will use request library to call wikipedia
page = requests.get('https://en.wikipedia.org/wiki/Cargo_aircraft')
#create beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', attrs={'class':'wikitable sortable'})
tabledata = table.findAll('tbody')
links = table.findAll('a')
aircraft = []
for link in links:
    aircraft.append(link.get('title'))
print(aircraft)
#pull table from Wikipedia
df = pd.DataFrame()
df['Aircraft'] = aircraft
df['Test'] = 'test'

Using pandas.read_html
Bypass BeautifulSoup and read the table directly into pandas.
pandas.read_html reads the HTML tables on a page into a list of DataFrame objects.
In this case the table is at index 1 of that list.
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
# df view
Aircraft Volume Payload Cruise Range Usage
0 Airbus A400M 270 m³ 37,000 kg (82,000 lb) 780 km/h (420 kn) 6,390 km (3,450 nmi) Military
1 Airbus A300-600F 391.4 m³ 48,000 kg (106,000 lb) – 7,400 km (4,000 nmi) Commercial
2 Airbus A330-200F 475 m³ 70,000 kg (154,000 lb) 871 km/h (470 kn) 7,400 km (4,000 nmi) Commercial
3 Airbus Beluga 1210 m³ 47,000 kg (104,000 lb) – 4,632 km (2,500 nmi) Commercial
4 Airbus Beluga XL 2615 m³ 53,000 kg (117,000 lb) – 4,074 km (2,200 nmi) Commercial
5 Antonov An-124 1028 m³ 150,000 kg (331,000 lb) 800 km/h (430 kn) 5,400 km (2,900 nmi) Both
6 Antonov An-225 1300 m³ 250,000 kg (551,000 lb) 800 km/h (430 kn) 15,400 km (8,316 nmi) Commercial
7 Boeing C-17 – 77,519 kg (170,900 lb) 830 km/h (450 kn) 4,482 km (2,420 nmi) Military
8 Boeing 737-700C 107.6 m³ 18,200 kg (40,000 lb) 931 km/h (503 kn) 5,330 km (2,880 nmi) Commercial
9 Boeing 757-200F 239 m³ 39,780 kg (87,700 lb) 955 km/h (516 kn) 5,834 km (3,150 nmi) Commercial
10 Boeing 747-8F 854.5 m³ 134,200 kg (295,900 lb) 908 km/h (490 kn) 8,288 km (4,475 nmi) Commercial
11 Boeing 747 LCF 1840 m³ 83,325 kg (183,700 lb) 878 km/h (474 kn) 7,800 km (4,200 nmi) Commercial
12 Boeing 767-300F 438.2 m³ 52,700 kg (116,200 lb) 850 km/h (461 kn) 6,025 km (3,225 nmi) Commercial
13 Boeing 777F 653 m³ 103,000 kg (227,000 lb) 896 km/h (484 kn) 9,070 km (4,900 nmi) Commercial
14 Bombardier Dash 8-100 39 m³ 4,700 kg (10,400 lb) 491 km/h (265 kn) 2,039 km (1,100 nmi) Commercial
15 Lockheed C-5 – 122,470 kg (270,000 lb) 919 km/h 4,440 km (2,400 nmi) Military
16 Lockheed C-130 – 20,400 kg (45,000 lb) 540 km/h (292 kn) 3,800 km (2,050 nmi) Military
17 Douglas DC-10-30 – 77,000 kg (170,000 lb) 908 km/h (490 kn) 5,790 km (3,127 nmi) Commercial
18 McDonnell Douglas MD-11 440 m³ 91,670 kg (202,100 lb) 945 km/h (520 kn) 7,320 km (3,950 nmi) Commercial
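Picking the table by position (`[1]`) breaks if the page gains or loses a table. As a hedged alternative, `pandas.read_html` accepts a `match` argument that keeps only the tables whose text matches a string or regex. The sketch below runs on a small inline HTML snippet (two rows copied from the output above, plus a made-up unrelated table) rather than on the live page, and it assumes a parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page: one made-up unrelated table, then a cut-down
# copy of the comparison table with two rows taken from the output above.
html = """
<table><tr><th>Note</th></tr><tr><td>unrelated</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Aircraft</th><th>Volume</th><th>Payload</th></tr>
  <tr><td>Airbus A400M</td><td>270 m³</td><td>37,000 kg (82,000 lb)</td></tr>
  <tr><td>Boeing C-17</td><td>–</td><td>77,519 kg (170,900 lb)</td></tr>
</table>
"""

# Keep only the tables whose text contains "Payload".
tables = pd.read_html(StringIO(html), match="Payload")
df = tables[0]
print(df)
```

Against the live page the same call would be `pd.read_html(url, match='Payload')`. Whether "Payload" appears in only one table there is an assumption worth checking.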

You can try converting the measurement columns to numbers by keeping only the first token of each cell:
df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
df['Volume'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Volume'].str.split()]).astype(float)
df['Payload'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Payload'].str.split()]).astype(int)
df['Cruise'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Cruise'].str.split()]).astype(float)
df['Range'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Range'].str.split()]).astype(int)
Result:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
Aircraft 19 non-null object
Volume 15 non-null float64
Payload 19 non-null int64
Cruise 16 non-null float64
Range 19 non-null int64
Usage 19 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB
print(df)
Aircraft Volume Payload Cruise Range Usage
0 Airbus A400M 270.0 37000 780.0 6390 Military
1 Airbus A300-600F 391.4 48000 NaN 7400 Commercial
2 Airbus A330-200F 475.0 70000 871.0 7400 Commercial
3 Airbus Beluga 1210.0 47000 NaN 4632 Commercial
4 Airbus Beluga XL 2615.0 53000 NaN 4074 Commercial
5 Antonov An-124 1028.0 150000 800.0 5400 Both
6 Antonov An-225 1300.0 250000 800.0 15400 Commercial
7 Boeing C-17 NaN 77519 830.0 4482 Military
8 Boeing 737-700C 107.6 18200 931.0 5330 Commercial
9 Boeing 757-200F 239.0 39780 955.0 5834 Commercial
10 Boeing 747-8F 854.5 134200 908.0 8288 Commercial
11 Boeing 747 LCF 1840.0 83325 878.0 7800 Commercial
12 Boeing 767-300F 438.2 52700 850.0 6025 Commercial
13 Boeing 777F 653.0 103000 896.0 9070 Commercial
14 Bombardier Dash 8-100 39.0 4700 491.0 2039 Commercial
15 Lockheed C-5 NaN 122470 919.0 4440 Military
16 Lockheed C-130 NaN 20400 540.0 3800 Military
17 Douglas DC-10-30 NaN 77000 908.0 5790 Commercial
18 McDonnell Douglas MD-11 440.0 91670 945.0 7320 Commercial
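The list comprehensions above work for this table, but `.astype(int)` would fail if a Payload or Range cell were ever missing. A possible variation, sketched here on three rows copied from the table, pulls the leading number out with `str.extract` and lets `pd.to_numeric` turn anything unparseable into NaN. The helper name `leading_number` is made up for this example.

```python
import pandas as pd

# Three rows copied from the table above; "–" marks a missing value.
df = pd.DataFrame({
    "Aircraft": ["Airbus A400M", "Airbus A300-600F", "Boeing C-17"],
    "Volume": ["270 m³", "391.4 m³", "–"],
    "Payload": ["37,000 kg (82,000 lb)", "48,000 kg (106,000 lb)", "77,519 kg (170,900 lb)"],
})

def leading_number(column):
    """Return the number at the start of each cell, or NaN if there is none."""
    # Drop thousands separators so "37,000" becomes "37000".
    without_commas = column.str.replace(",", "", regex=False)
    # Capture an integer or decimal at the start of the string.
    first_number = without_commas.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(first_number, errors="coerce")

df["Volume"] = leading_number(df["Volume"])
df["Payload"] = leading_number(df["Payload"])
print(df)
```

Columns that contain a missing value come back as floats, because NaN cannot be stored in a plain integer column.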

Related

How to web scrape Economic Calendar data from TradingView and load it into a DataFrame?

I want to scrape the Economic Calendar data from the TradingView link below and load it into a DataFrame.
Link: https://in.tradingview.com/economic-calendar/
Filter-1: Select Data for India and United States
Filter-2: Data for This Week

You can request this url: https://economic-calendar.tradingview.com/events
import pandas as pd
import requests
url = 'https://economic-calendar.tradingview.com/events'
today = pd.Timestamp.today().normalize()
payload = {
    'from': (today + pd.offsets.Hour(23)).isoformat() + '.000Z',
    'to': (today + pd.offsets.Day(7) + pd.offsets.Hour(22)).isoformat() + '.000Z',
    'countries': ','.join(['US', 'IN'])
}
data = requests.get(url, params=payload).json()
df = pd.DataFrame(data['result'])
Output:
>>> df
id title country ... ticker comment scale
0 312843 3-Month Bill Auction US ... NaN NaN NaN
1 312844 6-Month Bill Auction US ... NaN NaN NaN
2 316430 LMI Logistics Managers Index Current US ... USLMIC The Logistics Managers Survey is a monthly stu... NaN
3 316503 Exports US ... USEXP The United States is the world's third biggest... B
4 316504 Imports US ... USIMP The United States is the world's second-bigges... B
5 316505 Balance of Trade US ... USBOT The United States has been running consistent ... B
6 312845 Redbook YoY US ... USRI The Johnson Redbook Index is a sales-weighted ... NaN
7 316509 IBD/TIPP Economic Optimism US ... USEOI IBD/TIPP Economic Optimism Index measures Amer... NaN
8 337599 Fed Chair Powell Speech US ... USINTR In the United States, the authority to set int... NaN
9 334599 3-Year Note Auction US ... NaN NaN NaN
10 337600 Fed Barr Speech US ... USINTR In the United States, the authority to set int... NaN
11 316449 Consumer Credit Change US ... USCCR In the United States, Consumer Credit refers t... B
12 312846 API Crude Oil Stock Change US ... USCSC Stocks of crude oil refer to the weekly change... M
13 316575 Cash Reserve Ratio IN ... INCRR Cash Reserve Ratio is a specified minimum frac... NaN
14 334653 RBI Interest Rate Decision IN ... ININTR In India, interest rate decisions are taken by... NaN
15 312847 MBA 30-Year Mortgage Rate US ... USMR MBA 30-Year Mortgage Rate is average 30-year f... NaN
16 312848 MBA Mortgage Applications US ... USMAPL In the US, the MBA Weekly Mortgage Application... NaN
17 312849 MBA Mortgage Refinance Index US ... USMRI The MBA Weekly Mortgage Application Survey is ... NaN
18 312850 MBA Mortgage Market Index US ... USMMI The MBA Weekly Mortgage Application Survey is ... NaN
19 312851 MBA Purchase Index US ... USPIND NaN NaN
20 337604 Fed Williams Speech US ... USINTR In the United States, the authority to set int... NaN
21 316553 Wholesale Inventories MoM US ... USWI The Wholesale Inventories are the stock of uns... NaN
22 337601 Fed Barr Speech US ... USINTR In the United States, the authority to set int... NaN
23 312852 EIA Refinery Crude Runs Change US ... USRCR Crude Runs refer to the volume of crude oil co... M
24 312853 EIA Crude Oil Stocks Change US ... USCOSC Stocks of crude oil refer to the weekly change... M
25 312854 EIA Distillate Stocks Change US ... USDFS NaN M
26 312855 EIA Heating Oil Stocks Change US ... USHOS NaN M
27 312856 EIA Gasoline Production Change US ... USGPRO NaN M
28 312857 EIA Crude Oil Imports Change US ... USCOI NaN M
29 312858 EIA Gasoline Stocks Change US ... USGSCH Stocks of gasoline refers to the weekly change... M
30 312859 EIA Cushing Crude Oil Stocks Change US ... USCCOS Change in the number of barrels of crude oil h... M
31 312860 EIA Distillate Fuel Production Change US ... USDFP NaN M
32 337598 17-Week Bill Auction US ... NaN NaN NaN
33 334575 WASDE Report US ... NaN NaN NaN
34 334586 10-Year Note Auction US ... NaN Generally, a government bond is issued by a na... NaN
35 337602 Fed Waller Speech US ... USINTR In the United States, the authority to set int... NaN
36 312933 M3 Money Supply YoY IN ... INM3 India Money Supply M3 includes M2 plus long-te... NaN
37 312863 Jobless Claims 4-week Average US ... USJC4W NaN K
38 312864 Continuing Jobless Claims US ... USCJC Continuing Jobless Claims refer to actual numb... K
39 312865 Initial Jobless Claims US ... USIJC Initial jobless claims have a big impact in fi... K
40 312866 EIA Natural Gas Stocks Change US ... USNGSC Natural Gas Stocks Change refers to the weekly... B
41 312867 8-Week Bill Auction US ... NaN NaN NaN
42 312868 4-Week Bill Auction US ... NaN NaN NaN
43 334602 30-Year Bond Auction US ... NaN NaN NaN
44 312827 Deposit Growth YoY IN ... INDG In India, deposit growth refers to the year-ov... NaN
45 312869 Foreign Exchange Reserves IN ... INFER In India, Foreign Exchange Reserves are the fo... B
46 337022 Bank Loan Growth YoY IN ... INLG In India, bank loan growth refers to the year-... NaN
47 316685 Industrial Production YoY IN ... INIPYY In India, industrial production measures the o... NaN
48 316687 Manufacturing Production YoY IN ... INMPRYY Manufacturing production measures the output o... NaN
49 312902 Michigan Consumer Expectations Prel US ... USMCE The Index of Consumer Expectations focuses on ... NaN
50 312903 Michigan Current Conditions Prel US ... USMCEC The Index of Consumer Expectations focuses on ... NaN
51 312904 Michigan 5 Year Inflation Expectations Prel US ... USMIE5Y The Index of Consumer Expectations focuses on ... NaN
52 312905 Michigan Inflation Expectations Prel US ... USMIE1Y The Index of Consumer Expectations focuses on ... NaN
53 312906 Michigan Consumer Sentiment Prel US ... USCCI The Index of Consumer Expectations focuses on ... NaN
54 337603 Fed Waller Speech US ... USINTR In the United States, the authority to set int... NaN
55 312870 Baker Hughes Oil Rig Count US ... USCOR US Crude Oil Rigs refer to the number of activ... NaN
56 335652 Baker Hughes Total Rig Count US ... NaN US Total Rigs refer to the number of active US... NaN
57 335824 Monthly Budget Statement US ... USGBV Federal Government budget balance is the diffe... B
58 337605 Fed Harker Speech US ... USINTR In the United States, the authority to set int... NaN
[59 rows x 16 columns]
Info:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 59 non-null object
1 title 59 non-null object
2 country 59 non-null object
3 indicator 59 non-null object
4 period 59 non-null object
5 source 59 non-null object
6 actual 0 non-null object
7 previous 51 non-null float64
8 forecast 9 non-null float64
9 currency 59 non-null object
10 unit 28 non-null object
11 importance 59 non-null int64
12 date 59 non-null object
13 ticker 49 non-null object
14 comment 44 non-null object
15 scale 20 non-null object
dtypes: float64(2), int64(1), object(13)
memory usage: 7.5+ KB
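The `from`/`to` window in the request above is built inline from today's date. As a small sketch, the same arithmetic can be wrapped in a helper so the window and country list are easy to change. The function name `build_payload`, its `days` parameter, and the fixed start date are assumptions made for this example; the parameter names `from`, `to`, and `countries` are the ones used in the answer.

```python
import pandas as pd

def build_payload(start, countries, days=7):
    """Build the query parameters for a window of `days` days beginning on `start`."""
    day = pd.Timestamp(start).normalize()  # midnight of the start date
    return {
        "from": (day + pd.offsets.Hour(23)).isoformat() + ".000Z",
        "to": (day + pd.offsets.Day(days) + pd.offsets.Hour(22)).isoformat() + ".000Z",
        "countries": ",".join(countries),
    }

# An arbitrary fixed date keeps the example reproducible.
payload = build_payload("2023-06-05", ["US", "IN"])
print(payload)
```

Passing `pd.Timestamp.today()` as `start` reproduces the payload from the answer.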

Debugging my PopularBooks code in Python?

The code is supposed to find the matching books: you run it, enter a category, then a comparison such as '==', and then the value of that category you're looking for. For instance, to find the books published by Vertigo, I would enter 'publisher', then '==', and then 'Vertigo'. My professor says there are three errors in my code, but I have no idea what he's talking about. Can someone help me find these errors and show how to fix them in Python?
import csv

def convertIfPossible(value):
    # Convert a string value to integer or float if possible
    try:
        val = int(value)
        return val
    except:
        try:
            val = float(value)
            return val
        except:
            return value

def readCSVIntoDictionary(fname):
    # read a CSV file and make it into a dictionary
    myDictionary = dict()
    with open(fname) as f:
        freader = csv.reader(f)
        headers = next(freader)
        ID = 1
        for row in freader:
            itemDictionary = dict()
            for i in range(0,len(row)):
                itemDictionary[headers[i]] = convertIfPossible(row[i])
            myDictionary[ID] = itemDictionary
            ID += 1
        f.close()
    return myDictionary

def printItem(item, indent=0):
    # print out a dictionary item formatted somewhat nicely
    for k, v in item.items():
        for i in range(0,indent):
            print(" |",end="")
        print(f' {k:<20}',end="")
        if (type(v) is dict):
            print()
            printItem(v,indent+1)
        else:
            print(f': {v}')

def lookupIDs(myDict,key,value,matchType="=="):
    # find the IDs of all the items that match for category key on the
    # value value using the comparison matchType
    matchingKeys = []
    for i in myDict.keys():
        if (type(myDict[i][key]) == type(value)):
            if (matchType == "=="):
                if (myDict[i][key] == value):
                    matchingKeys.append(i)
            elif (matchType == "<="):
                if (myDict[i][key] <= value):
                    matchingKeys.append(i)
            elif (matchType == "<="):
                if (myDict[i][key] >= value):
                    matchingKeys.append(i)
            elif (matchType == "<"):
                if (myDict[i][key] < value):
                    matchingKeys.append(i)
            elif (matchType == ">"):
                if (myDict[i][key] < value):
                    matchingKeys.append(i)
    return matchingKeys

def categoryExists(myDict,category):
    # check if a particular category occurs in the first item
    firstKey = list(myDict.keys())[0]
    if (myDict[firstKey].get(category,None) == None):
        return False
    return True

def printMatches(myDict,matches):
    # print the list of matches or No matches if none
    if (len(matches) <= 1):
        print("No matches")
        return
    else:
        print("Matches")
        for m in matches:
            printItem(myDict[m])
            print()

bookDictionary = readCSVIntoDictionary(r'C:\Users\jsric\OneDrive\Desktop\PopularBooks.csv')
print(f'There are {len(bookDictionary.keys())} items in the file')
print('These are the items:')
printItem(bookDictionary)
while True:
    category = input("Category to look up: ")
    if (categoryExists(bookDictionary,category)):
        comparisonToDo = input("Comparison to do: ")
        valueToCompare = convertIfPossible(input("Value to compare: "))
        matches = lookupIDs(bookDictionary,category,valueToCompare,comparisonToDo)
'''
Here's the file I'm using
'''
bookID title authors average_rating isbn language_code num_pages ratings_count text_reviews_count publication_date publisher
24812 The Complete Calvin and HobbesBill Watterson 4.82 7.41E+08 eng 1456 32213 930 9/6/05 Andrews McMeel Publishing
8 Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5)J.K. Rowling/Mary GrandPré4.78 4.4E+08 eng 2690 41428 164 9/13/04 Scholastic
24814 It's a Magical World (Calvin and Hobbes #11)Bill Watterson 4.76 8.36E+08 eng 176 23875 303 9/1/96 Andrews McMeel Publishing
10 Harry Potter Collection (Harry Potter #1-6)J.K. Rowling 4.73 4.4E+08 eng 3342 28242 808 9/12/05 Scholastic
6550 Early ColorSaul Leiter/Martin Harrison4.73 3.87E+09 eng 156 144 8 1/15/06 Steidl
24816 Homicidal Psycho Jungle Cat (Calvin and Hobbes #9)Bill Watterson 4.72 8.36E+08 eng 176 15365 290 9/6/94 Andrews McMeel Publishing
34545 Elliott Erwitt: SnapsMurray Sayle/Charles Flowers/Elliott Erwitt4.72 071484330Xen-GB 544 102 6 6/1/03 Phaidon Press
24820 Calvin and Hobbes: Sunday Pages 1985-1995: An Exhibition CatalogueBill Watterson 4.71 7.41E+08 eng 96 3613 85 9/17/01 Andrews McMeel Publishing
20749 Study Bible: NIVAnonymous 4.7 3.11E+08 eng 2198 4166 186 10/1/02 Zondervan Publishing House
24520 The Complete Aubrey/Maturin Novels (5 Volumes)Patrick O'Brian 4.7 039306011Xeng 6576 1338 81 10/17/04 W. W. Norton Company
44826 The Price of the Ticket: Collected Nonfiction 1948-1985James Baldwin 4.7 3.13E+08 eng 712 404 30 9/15/85 St. Martin's Press
24818 The Days Are Just PackedBill Watterson 4.69 8.36E+08 eng 176 20308 244 9/1/93 Andrews McMeel Publishing
26805 The Sibley Field Guide to Birds of Western North AmericaDavid Allen Sibley4.69 6.79E+08 en-US 473 730 36 4/29/03 Alfred A. Knopf
5309 The Life and Times of Scrooge McDuckDon Rosa 4.67 9.12E+08 eng 266 2467 149 6/1/05 Gemstone Publishing
23753 The Absolute Sandman Volume OneNeil Gaiman/Mike Dringenberg/Chris Bachalo/Michael Zulli/Kelly Jones/Charles Vess/Colleen Doran/Malcolm Jones III/Steve Parkhouse/Daniel Vozzo/Lee Loughridge/Steve Oliff/Todd Klein/Dave McKean/Sam Kieth4.65 1.4E+09 eng 612 15640 512 11/1/06 Vertigo
3582 The New Annotated Sherlock Holmes: The Complete Short StoriesArthur Conan Doyle/Leslie S. Klinger4.64 3.93E+08 eng 1878 1411 54 11/30/04 W. W. Norton & Company
27204 The Gospel According to LukeAnonymous/Thomas Cahill4.64 8.02E+08 eng 81 169 17 10/29/99 Grove Press
39661 The Shawshank Redemption: The Shooting ScriptFrank Darabont/Stephen King4.64 1.56E+09 eng 184 2406 29 9/30/04 Newmarket Press
13206 The Collected Autobiographies of Maya AngelouMaya Angelou 4.63 6.8E+08 eng 1184 991 55 9/21/04 Modern Library
17280 The Feynman Lectures on Physics Vol 3Richard P. Feynman/Robert B. Leighton/Matthew L. Sands4.63 8.05E+08 eng 384 716 15 9/1/05 Addison Wesley Publishing Company
24813 The Calvin and Hobbes Tenth Anniversary BookBill Watterson 4.63 8.36E+08 eng 208 49122 368 9/5/95 Andrews McMeel Publishing
24819 The Calvin And Hobbes: Tenth Anniversary BookBill Watterson 4.63 7.52E+08 eng 208 303 12 2/1/08 Time Warner Books UK
19333 The World of Peter Rabbit (Original Peter Rabbit Books 1-23)Beatrix Potter 4.62 7.23E+08 eng 1072 220 14 5/4/06 Warne
23721 Nausicaä of the Valley of the Wind Vol. 6 (Nausicaä of the Valley of the Wind #6)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.62 1.59E+09 eng 159 1776 33 8/10/04 VIZ Media
313 100 Years of LynchingsRalph Ginzburg 4.61 9.33E+08 eng 270 88 4 11/22/96 Black Classic Press
17142 Collected Essays: Notes of a Native Son / Nobody Knows My Name / The Fire Next Time / No Name in the Street / The Devil Finds Work / Other EssaysJames Baldwin/Toni Morrison4.61 1.88E+09 eng 869 1495 75 2/1/98 Library of America (NY)
17334 The Complete C.S. Lewis Signature ClassicsC.S. Lewis 4.61 61208493 eng 746 925 40 2/6/07 HarperOne
19445 A Guide to the Words of My Perfect TeacherNgawang Pelzang/Padmakara Translation Group/Patrul Rinpoche/Alak Zenkar4.61 1.59E+09 eng 336 79 0 6/22/04 Shambhala
23318 Discovery of the Presence of God: Devotional NondualityDavid R. Hawkins4.61 9.72E+08 en-US 296 186 11 6/28/07 Veritas Publishing
23722 Nausicaä of the Valley of the Wind Vol. 5 (Nausicaä of the Valley of the Wind #5)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.61 1.59E+09 eng 151 1904 28 6/30/04 VIZ Media
23724 Nausicaä of the Valley of the Wind Vol. 7 (Nausicaä of the Valley of the Wind #7)Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.61 1.59E+09 eng 223 1716 79 9/7/04 VIZ Media
5545 The Feynman Lectures on Physics 3 VolsRichard P. Feynman/Robert B. Leighton/Matthew L. Sands4.6 2.02E+08 en-US 3 78 7 1/1/89 Addison Wesley Publishing Company
9325 Fullmetal Alchemist Vol. 10Hiromu Arakawa/Akira Watanabe4.6 1.42E+09 eng 200 8989 151 11/21/06 VIZ Media LLC
26426 Fullmetal Alchemist Vol. 12 (Fullmetal Alchemist #12)Hiromu Arakawa/Akira Watanabe4.6 1.42E+09 eng 192 7480 119 3/20/07 VIZ Media LLC
30 J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the RingsJ.R.R. Tolkien 4.59 3.46E+08 eng 1728 101233 1550 9/25/12 Ballantine Books
119 The Lord of the Rings: The Art of the Fellowship of the RingGary Russell 4.59 6.18E+08 eng 192 26153 102 6/12/02 Houghton Mifflin Harcourt
3529 The World's First Love: Mary Mother of GodFulton J. Sheen 4.59 8.99E+08 en-US 276 641 63 9/1/96 Ignatius Press
15336 The Lord of the Rings / The HobbitJ.R.R. Tolkien 4.59 7144083 eng 1600 141 5 10/7/02 Collins Modern Classics
23506 Fullmetal Alchemist Vol. 11 (Fullmetal Alchemist #11)Hiromu Arakawa/Akira Watanabe4.59 1.42E+09 eng 192 7655 129 1/16/07 VIZ Media LLC
23720 Nausicaä of the Valley of the Wind Vol. 4 (Nausicaä of the Valley of the Wind #4)Hayao Miyazaki/David Lewis/Toren Smith/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers4.59 1.59E+09 eng 134 1956 38 6/2/04 VIZ Media
26422 Fullmetal Alchemist Vol. 14 (Fullmetal Alchemist #14)Hiromu Arakawa/Akira Watanabe4.59 142151379Xeng 192 8634 118 8/14/07 VIZ Media LLC
16602 Coffin: The Art of Vampire Hunter DYoshitaka Amano4.58 1.6E+09 eng 199 596 16 11/1/06 Dark Horse Books
17961 Collected FictionsJorge Luis Borges/Andrew Hurley4.58 1.4E+08 eng 565 18874 791 9/30/99 Penguin Classics Deluxe Edition
33342 The More Than Complete Hitchhiker's Guide (Hitchhiker's Guide #1-4 + short story)Douglas Adams 4.58 6.81E+08 en-US 624 433 34 11/1/89 Longmeadow Press
41468 Dewdrops on a Lotus Leaf: Zen Poems of RyokanRyōkan/John Stevens4.58 1.59E+09 eng 120 183 19 4/13/04 Shambhala
44734 Fullmetal Alchemist Vol. 6 (Fullmetal Alchemist #6)Hiromu Arakawa/Akira Watanabe4.58 1.42E+09 eng 200 10052 201 3/21/06 VIZ Media LLC
1 Harry Potter and the Half-Blood Prince (Harry Potter #6)J.K. Rowling/Mary GrandPré4.57 4.4E+08 eng 652 2095690 27591 9/16/06 Scholastic Inc.
866 Fullmetal Alchemist Vol. 9 (Fullmetal Alchemist #9)Hiromu Arakawa/Akira Watanabe4.57 142150460Xeng 192 9013 153 9/19/06 VIZ Media LLC
869 Fullmetal Alchemist Vol. 8 (Fullmetal Alchemist #8)Hiromu Arakawa/Akira Watanabe4.57 1.42E+09 eng 192 11451 161 7/18/06 VIZ Media LLC
'''
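Reading the code above, three lines look like plausible candidates for the errors the professor means; this is an outside reading, not a confirmed answer. In lookupIDs, the second `"<="` branch can never run and probably should test `">="`. The `">"` branch compares with `<`. And printMatches reports "No matches" when exactly one match exists, because it tests `len(matches) <= 1` rather than `len(matches) == 0`. Separately, the loop at the end never calls printMatches, so nothing is shown even when matches are found. A minimal sketch of the lookup with a table of operators, using hypothetical snake_case names and three rows taken from the file:

```python
import operator

# Map each comparison string to the function that implements it.
COMPARATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}

def lookup_ids(my_dict, key, value, match_type="=="):
    """Return the IDs of items whose `key` field compares true against `value`."""
    compare = COMPARATORS.get(match_type)
    if compare is None:  # unknown comparison string
        return []
    return [
        item_id
        for item_id, item in my_dict.items()
        # Only compare values of the same type, as the original code does.
        if type(item.get(key)) == type(value) and compare(item[key], value)
    ]

# Three rows from the file, reduced to three fields each.
books = {
    1: {"title": "The Absolute Sandman Volume One", "publisher": "Vertigo", "num_pages": 612},
    2: {"title": "Early Color", "publisher": "Steidl", "num_pages": 156},
    3: {"title": "Study Bible: NIV", "publisher": "Zondervan Publishing House", "num_pages": 2198},
}

print(lookup_ids(books, "publisher", "Vertigo"))
print(lookup_ids(books, "num_pages", 600, ">"))
```

With a dictionary of operators, each comparison string appears exactly once, which makes a duplicated or mismatched branch harder to write by accident.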

How to resolve "list index out of range" when scraping a website?

from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
For some reason, the last line, which builds header1, raises "list index out of range". I'm not sure what causes this error, but I know I need this line. Here is a link to the website I am using for the data: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. The specific table I want is the one below the horizontal bar chart.
Traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-47-67ef2aac7bf3> in <module>
28 data_tables.append(td.findAll('table'))
29
---> 30 header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
31
32 header1
IndexError: list index out of range

Use pandas.read_html
Read HTML tables into a list of DataFrame objects.
This answer side-steps the question to provide a more efficient method for extracting tables from Wikipedia and gives the OP the desired end result.
The following code will more easily get the desired table from the Wikipedia page.
pd.read_html returns a list of DataFrames.
The table you're interested in is at index 4.
Clean the table:
Select the rows and columns with valid data.
This method does return the table headers, but the column names are multi-level, so we'll rename them.
Before renaming the columns, if you need the original values from the column names, use us_covid_data.columns.tolist(), which returns a list of tuples with all the column name values.
import pandas as pd
# get list of dataframes and select index 4
us_covid_data = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')[4]
# select rows and columns
us_covid_data = us_covid_data.iloc[0:56, 1:6]
# rename columns
us_covid_data.columns = ['state_territory', 'cases', 'deaths', 'recovered', 'hospitalized']
# display(us_covid_data)
state_territory cases deaths recovered hospitalized
0 Alabama 45785 1033 22082 2961
1 Alaska 1184 17 560 78
2 American Samoa 0 0 – –
3 Arizona 116892 2082 – 5272
4 Arkansas 24253 292 17834 1604
5 California 296499 6711 – –
6 Colorado 34316 1704 – 5527
7 Connecticut 46976 4338 – –
8 Delaware 12293 512 6778 –
9 District of Columbia 10569 561 1465 –
10 Florida 244151 4102 – 15150
11 Georgia 111211 2965 – 11919
12 Guam 1272 6 179 –
13 Hawaii 1012 19 746 116
14 Idaho 8222 94 2886 350
15 Illinois 151767 7144 – –
16 Indiana 49560 2698 36788 7139
17 Iowa 31906 725 24242 –
18 Kansas 17618 282 – 1269
19 Kentucky 17526 623 4785 2662
20 Louisiana 66435 3296 43026 –
21 Maine 3440 110 2787 354
22 Maryland 70497 3246 – 10939
23 Massachusetts 111110 8296 88725 10985
24 Michigan 73403 6225 52841 –
25 Minnesota 38606 1511 33907 4112
26 Mississippi 31257 1114 22167 2881
27 Missouri 24985 1077 – –
28 Montana 1249 23 678 89
29 Nebraska 20053 286 14641 1224
30 Nevada 22930 537 – –
31 New Hampshire 5914 382 4684 558
32 New Jersey 174628 15479 31014 –
33 New Mexico 14549 539 6181 2161
34 New York 400299 32307 71371 –
35 North Carolina 81331 1479 55318 –
36 North Dakota 3858 89 3350 218
37 Northern Mariana Islands 31 2 19 –
38 Ohio 57956 2927 – 7292
39 Oklahoma 16362 399 12432 1676
40 Oregon 10402 218 2846 1069
41 Pennsylvania 93876 6880 – –
42 Puerto Rico 8714 157 – –
43 Rhode Island 16991 960 – 1922
44 South Carolina 47214 838 – –
45 South Dakota 7105 97 6062 689
46 Tennessee 51509 646 31020 2860
47 Texas 240111 3013 122996 9610
48 Virgin Islands 112 6 79 –
49 Utah 25563 190 14448 1565
50 Vermont 1251 56 1022 –
51 Virginia 66740 1881 – 9549
52 Washington 38517 1370 – 4463
53 West Virginia 3461 95 2518 –
54 Wisconsin 35318 805 25542 3574
55 Wyoming 1675 20 1172 253
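The renaming step replaces multi-level column names with flat ones. As a small illustration of what that looks like, the sketch below builds a two-level header by hand. The top-level labels `Location` and `Totals` are invented for the example, since the real page's upper header row is not shown here; the lower labels and the two data rows come from this answer.

```python
import pandas as pd

# Hand-built two-level header. The top-level labels are hypothetical.
columns = pd.MultiIndex.from_tuples([
    ("Location", "U.S. state or territory[i]"),
    ("Totals", "Cases[ii]"),
    ("Totals", "Deaths"),
])
us_covid_data = pd.DataFrame(
    [["Alabama", 45785, 1033], ["Alaska", 1184, 17]],
    columns=columns,
)

# Keep the original names as a list of tuples before overwriting them.
original_names = us_covid_data.columns.tolist()

# Assigning a plain list replaces the MultiIndex with flat names.
us_covid_data.columns = ["state_territory", "cases", "deaths"]
print(original_names)
print(us_covid_data)
```

The new list must have exactly as many names as the frame has columns, which is why the answer selects the columns with `iloc` before renaming.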
Addressing the original issue:
data is an empty list generated from data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
With data = data_table.tbody.findAll('tr', recursive=False)[1] and then data = [v for v in data.get_text().split('\n') if v], you will get the headers.
The output of data will be ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']
Since data_tables is generated from iterating through data, it is also empty.
header1 is generated from iterating data_tables[0], so IndexError occurs because data_tables is empty.
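The general pattern behind this error is indexing the result of `find_all` without checking that anything was found. A hedged sketch of a guard follows, run on a tiny inline table rather than on the saved Wikipedia file. The markup is made up to mimic a table that has no `thead` element, and the row values are taken from the output above.

```python
from bs4 import BeautifulSoup

# Made-up markup: the header cells sit in the first row of tbody, with no thead.
html = """
<table>
  <tbody>
    <tr><th>U.S. state or territory</th><th>Cases</th><th>Deaths</th></tr>
    <tr><td>Alabama</td><td>45785</td><td>1033</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all returns an empty list when nothing matches, so [0] would raise IndexError.
theads = table.find_all("thead")
if theads:
    header_cells = theads[0].find_all("th")
else:
    # Fall back to the th cells of the first row.
    first_row = table.find("tr")
    header_cells = first_row.find_all("th") if first_row else []

headers = [th.get_text(strip=True) for th in header_cells]
print(headers)
```

Printing the length of each intermediate list, as the question already does for `sources_list`, is a quick way to find which step comes back empty.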

No tables found error when making an AJAX request

I am trying to scrape the results table from the following url: https://utmbmontblanc.com/en/page/107/results.html
However when I run my code it says 'No Tables Found'
import pandas as pd
url = 'https://utmbmontblanc.com/en/page/107/results.html'
data = pd.read_html(url, header = 0)
data.head()
ValueError: No tables found
Having used developer tools I know that there is definitely a table in the html code. Why is it not being found? Any help is greatly appreciated. Thanks in advance

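The results table is loaded by an Ajax request, so it is not in the HTML that read_html receives from the page URL. Build the URL for the Ajax request instead; for 2017 CCC it looks like this: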
url = 'https://.......com/result.php?mode=edPass&ajax=true&annee=2017&course=ccc'
data = pd.read_html(url, header = 0)
print(data[0])
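To request other years or races, the query string can be assembled from its parts. This is only a sketch: the host in the answer is elided, so `BASE_URL` below is a placeholder and not the real address, and the helper name `results_url` is made up. The parameter names and values are the ones shown in the URL above.

```python
from urllib.parse import urlencode

# Placeholder host: the real one is elided in the answer above.
BASE_URL = "https://example.invalid/result.php"

def results_url(year, course, base_url=BASE_URL):
    """Assemble the Ajax URL for one year and one race."""
    params = {"mode": "edPass", "ajax": "true", "annee": year, "course": course}
    return f"{base_url}?{urlencode(params)}"

print(results_url(2017, "ccc"))
```

The resulting string can then be passed to `pd.read_html` as in the answer, provided the endpoint still returns an HTML table.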

You can also use Selenium if no simpler workaround is available.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
from bs4 import BeautifulSoup as BSoup
import pandas as pd
url = "https://utmbmontblanc.com/en/page/107/results.html"
driver = webdriver.Chrome("/home/bitto/chromedriver")#change this to your chromedriver path
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
    # change the index of div[@class='bloc'] to change the year - [1] for 2018, [2] for 2017 etc
    # change the index of div[@class='row'] - [1], [2] for TDS etc
    # change the @value of option to match your preferred option's value - you can find this with the inspect tool - the first two are Scratch and ScratchH
    EC.presence_of_element_located((By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/select[@name='cat'][1]/option[@value='Scratch']"))
)
element.click()  # select option
# make the same index changes here that you made above
driver.find_element(By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/input").click()  # click go
sleep(10)#not preferred but will do for now
table=pd.read_html(driver.page_source)
print(table)
Output
[ GeneralRanking Family name First name Club Cat. ... Time Difference/ 1st Nationality
0 1 3001 - HAWKS Hayden HOKA ONE ONE SEH ... 10:24:30 00:00:00 United States
1 2 3018 - ŚWIERC Marcin SALOMON SUUNTO TEAM POLAND SEH ... 10:42:49 00:18:19 Poland
2 3 3005 - POMMERET Ludovic TEAM HOKA V1H ... 10:50:47 00:26:17 France
3 4 3214 - EVANS Thomas COMPRESS SPORT SEH ... 10:57:44 00:33:14 United Kingdom
4 5 3002 - OWENS Tom SALOMON SEH ... 11:03:48 00:39:18 United Kingdom
5 6 3011 - JONSSON Thorbergur 66 NORTH SEH ... 11:14:22 00:49:52 Iceland
6 7 3026 - BOUVIER-GAZ Nicolas TEAM NEW BALANCE SEH ... 11:18:33 00:54:03 France
7 8 3081 - JONES Michael WWW.APEXRUNNING.CO SEH ... 11:31:50 01:07:20 United Kingdom
8 9 3020 - COLLET Aurélien HOKA ONE ONE SEH ... 11:33:10 01:08:40 France
9 10 3009 - MARAVILLA Jorge HOKA ONE ONE V1H ... 11:36:14 01:11:44 United States
10 11 3036 - PERRILLAT Christophe SEH ... 11:40:05 01:15:35 France
11 12 3070 - FRAGUELA BREIJO Alejandro STUDIO54 V1H ... 11:40:11 01:15:41 Spain
12 13 3092 - AIGROZ Mike TRUST SEH ... 11:41:53 01:17:23 Switzerland
13 14 3021 - O'LEARY Paddy THE NORTH FACE SEH ... 11:47:04 01:22:34 Ireland
14 15 3065 - PÉREZ TORREGLOSA Juan CLUB ULTRATRAIL ... SEH ... 11:47:51 01:23:21 Spain
15 16 3031 - SÁNCHEZ CEBRIÁN Miguel Ángel LURBEL-LI... V1H ... 11:49:15 01:24:45 Spain
16 17 3062 - ANDREWS Justin SEH ... 11:49:47 01:25:17 United States
17 18 3039 - PIANA Giulio TEAM MUD AND SNOW SEH ... 11:50:23 01:25:53 Italy
18 19 3047 - RONIMOISS Andris Inov8 / OSveikals.lv ... SEH ... 11:52:25 01:27:55 Latvia
19 20 3052 - DURAND Regis TEAM TRAIL ISOSTAR V1H ... 11:56:40 01:32:10 France
20 21 3027 - SANDES Ryan SALOMON SEH ... 12:04:39 01:40:09 South Africa
21 22 3014 - EL MORABITY Rachid ULTRA TRAIL ATLAS T... SEH ... 12:10:01 01:45:31 Morocco
22 23 3067 - JONES Harry RUNIVORE SEH ... 12:10:12 01:45:42 United Kingdom
23 24 3030 - CLAVERY Erik - SEH ... 12:12:56 01:48:26 France
24 25 3056 - JIMENEZ LLORENS Juan Maria GREEN POWER... SEH ... 12:13:18 01:48:48 Spain
25 26 3024 - GALLAGHER Clare THE NORTH FACE SEF ... 12:13:57 01:49:27 United States
26 27 3136 - ASSEL Garry LICENCE INDIVIDUELLE LUXEM... SEH ... 12:20:46 01:56:16 Luxembourg
27 28 3071 - RIGODANZA Francesco SPIRITO TRAIL TEAM SEH ... 12:22:49 01:58:19 Italy
28 29 3118 - POLASZEK Christophe CHARTRES VERTICAL V1H ... 12:24:49 02:00:19 France
29 30 3125 - CALERO RODRIGUEZ David Altmann Sports/... SEH ... 12:25:07 02:00:37 Spain
... ... ... ... ... ... ... ...
1712 1713 5734 - GOT Hang Fai V2H ... 26:25:01 16:00:31 Hong Kong, China
1713 1714 4154 - RAMOS Liliana NIKE RUNNING CLUB V3F ... 26:26:22 16:01:52 Argentina
1714 1715 5448 - BECKRICH Xavier PHOENIX57 V1H ... 26:26:45 16:02:15 France
1715 1716 5213 - BARBERIO ARNOULT Isabelle PHOENIX57 V1F ... 26:26:49 16:02:19 France
1716 1717 4704 - ZHANG Zheng XIAOMABENTENG SEH ... 26:28:37 16:04:07 China
1717 1718 5282 - GUISOLAN Frédéric SEH ... 26:28:46 16:04:16 Switzerland
1718 1719 5306 - MEDINA Rafael V1H ... 26:29:26 16:04:56 Mexico
1719 1720 5379 - PENTCHEFF Nicolas SEH ... 26:33:05 16:08:35 France
1720 1721 4665 - GONZALEZ SUANCES Israel BAR ES PUIG V1H ... 26:33:58 16:09:28 Spain
1721 1722 4389 - TONANNY Marie SEF ... 26:34:51 16:10:21 France
1722 1723 5616 - GLORIAN Thierry V2H ... 26:35:47 16:11:17 France
1723 1724 5684 - CHEUNG Ho FAITHWALKERS V1H ... 26:37:09 16:12:39 Hong Kong, China
1724 1725 5719 - GANDER Pascal JEFF B TRAIL SEH ... 26:39:04 16:14:34 France
1725 1726 4555 - JURGIELEWICZ Urszula SEF ... 26:39:44 16:15:14 Poland
1726 1727 4722 - HIDALGO José Miguel C.D. ATLETISMO SAN... V1H ... 26:40:27 16:15:57 Spain
1727 1728 4425 - JITTIWUTIKARN Gif V1F ... 26:41:02 16:16:32 Thailand
1728 1729 4556 - ZHU Jing SEF ... 26:41:12 16:16:42 China
1729 1730 4314 - HU Dongli V1H ... 26:41:27 16:16:57 China
1730 1731 4239 - DURET Estelle OXYGENE BELBEUF V1F ... 26:41:51 16:17:21 France
1731 1732 4525 - MAGLIERI Fabrice ATHLETIC CLUB PAYS DE... V1H ... 26:42:11 16:17:41 France
1732 1733 4433 - ANDERSEN Laura Jentsch RUN DEM CREW SEF ... 26:42:27 16:17:57 Denmark
1733 1734 4563 - CHEUNG Annie On Nai FAITHWALKERS V1F ... 26:45:35 16:21:05 Hong Kong, China
1734 1735 4355 - KHALED Naïm GENEVE AEROPORT SEH ... 26:47:50 16:23:20 Algeria
1735 1736 4749 - STELLA Sara COURMAYEUR TRAILERS V1F ... 26:48:07 16:23:37 Italy
1736 1737 4063 - LALIMAN Leslie SEF ... 26:48:09 16:23:39 France
1737 1738 5702 - BURKE Tony Alchester/CTR/Bicester Tri V2H ... 26:50:52 16:26:22 Ireland
1738 1739 5146 - OLIVEIRA Sandra BUDEGUITA RUNNERS V1F ... 26:52:23 16:27:53 Portugal
1739 1740 5545 - VELLANDI Emilio TEAM PEGGIORI SCARPA MICO V1H ... 26:55:32 16:31:02 Italy
1740 1741 5543 - GASPAROVIC Bernard STADE FRANCAIS V3H ... 26:56:31 16:32:01 France
1741 1742 4760 - MENDONCA Carine ASPTT COMPIEGNE V2F ... 27:19:15 16:54:45 Belgium
[1742 rows x 7 columns]]

sorting and filtering pandas pivot table

Using this data
import pandas as pd
import numpy as np
df=pd.read_excel(
"https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True"
)
df["date"] = pd.to_datetime(df['date'])
I used the following code to extract the year, month and day:
df['year'],df['month'],df['day'] = df.date.dt.year, df.date.dt.month, df.date.dt.day
account number name sku quantity \
0 740150 Barton LLC B1-20000 39
1 714466 Trantow-Barrows S2-77896 -1
2 218895 Kulas Inc B1-69924 23
3 307599 Kassulke, Ondricka and Metz S1-65481 41
4 412290 Jerde-Hilpert S2-34077 6
unit price ext price date year month day
0 86.69 3380.91 2014-01-01 07:21:51 2014 1 1
1 63.16 -63.16 2014-01-01 10:00:47 2014 1 1
2 90.70 2086.10 2014-01-01 13:24:58 2014 1 1
3 21.05 863.05 2014-01-01 15:05:22 2014 1 1
4 83.21 499.26 2014-01-01 23:26:55 2014 1 1
Then I used the following code to build a pivot table:
df.pivot_table(index=['year','month','name'],values='ext price',aggfunc=np.sum).head(25)
ext price
year month name
2014 1 Barton LLC 6177.57
Cronin, Oberbrunner and Spencer 1141.75
Frami, Hills and Schmidt 5112.34
Fritsch, Russel and Anderson 15130.77
Halvorson, Crona and Champlin 9997.17
Herman LLC 10749.84
Jerde-Hilpert 11274.33
Kassulke, Ondricka and Metz 7322.83
Keeling LLC 6847.86
Kiehn-Spinka 8097.50
Koepp Ltd 10768.33
Kuhn-Gusikowski 7309.54
Kulas Inc 15398.87
Pollich LLC 1004.22
Purdy-Kunde 4689.37
Sanford and Sons 9544.13
Stokes LLC 5809.34
Trantow-Barrows 14328.26
White-Trantow 13703.77
Will LLC 20953.87
2 Barton LLC 12218.03
Cronin, Oberbrunner and Spencer 13976.26
Frami, Hills and Schmidt 4124.53
Fritsch, Russel and Anderson 9595.35
Halvorson, Crona and Champlin 7082.15
I would like to change my pivot table so that it keeps only the top 5 names (by ext price) for each month, sorted by ext price.
I'm trying to get something like this:
year month name
2014 1 Barton LLC 6177.57
Cronin, Oberbrunner and Spencer 1141.75
Frami, Hills and Schmidt 5112.34
Fritsch, Russel and Anderson 15130.77
Halvorson, Crona and Champlin 9997.17
2 Barton LLC 12218.03
Cronin, Oberbrunner and Spencer 13976.26
Frami, Hills and Schmidt 4124.53
Fritsch, Russel and Anderson 9595.35
Halvorson, Crona and Champlin 7082.15
... ...
11 Koepp Ltd 4882.27
Kuhn-Gusikowski 7197.89
Kulas Inc 4149.34
Pollich LLC 6334.21
12 Barton LLC 2772.90
Cronin, Oberbrunner and Spencer 7640.60
Frami, Hills and Schmidt 16249.81
Fritsch, Russel and Anderson 12345.64
I've tried using groupby with sorting, but I still can't get this result.

Is this what you're looking for?
>>> df.sort_values('ext price', ascending = False).groupby(
['year', 'month']).head(5).set_index(['year', 'month'])['name']
year month
2014 7 Kiehn-Spinka
7 Kuhn-Gusikowski
12 Koepp Ltd
7 Pollich LLC
3 Kulas Inc
2 Barton LLC
3 Keeling LLC
10 Koepp Ltd
7 Trantow-Barrows
9 Kassulke, Ondricka and Metz
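
Note that the snippet above ranks individual sales rows, not the per-name monthly totals that the pivot table shows. If the goal is the top names by summed ext price within each month, one possible approach is to aggregate first and then take the head of each (year, month) group. The following is only a minimal sketch: the small DataFrame is made-up stand-in data, and top_n_per_month is a hypothetical helper name, not something from the question or from pandas.

```python
import pandas as pd

# Made-up stand-in for the sales data; names and values are illustrative only.
df = pd.DataFrame({
    "year": [2014] * 8,
    "month": [1, 1, 1, 1, 2, 2, 2, 2],
    "name": ["A", "A", "B", "C", "A", "B", "B", "C"],
    "ext price": [100.0, 50.0, 120.0, 30.0, 10.0, 40.0, 5.0, 70.0],
})


def top_n_per_month(frame, n=5):
    """Hypothetical helper: top n names by summed 'ext price' per (year, month)."""
    # Sum 'ext price' per (year, month, name), like the pivot table does.
    totals = frame.groupby(["year", "month", "name"], as_index=False)["ext price"].sum()
    # Within each month, order the names from largest to smallest total.
    ranked = totals.sort_values(
        ["year", "month", "ext price"], ascending=[True, True, False]
    )
    # Keep the first n rows of every (year, month) group.
    top = ranked.groupby(["year", "month"]).head(n)
    return top.set_index(["year", "month", "name"])


top2 = top_n_per_month(df, n=2)
print(top2)
```

With the real data, calling top_n_per_month(df) with the default n=5 should give five names per month, ordered by their monthly total.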
