The code is supposed to find the correct books when you run it and enter a category, then a comparison sign such as '==', and then the value of that category we're looking for. For instance, if I want the books published by Vertigo, I would input 'publisher', then '==', and then 'Vertigo'. My professor says that there are three errors in my code, but I have no idea what he's talking about. Can someone help me find these errors and show me how to fix them in Python?
import csv

def convertIfPossible(value):
    # Convert a string value to integer or float if possible
    try:
        val = int(value)
        return val
    except:
        try:
            val = float(value)
            return val
        except:
            return value

def readCSVIntoDictionary(fname):
    # read a CSV file and make it into a dictionary
    myDictionary = dict()
    with open(fname) as f:
        freader = csv.reader(f)
        headers = next(freader)
        ID = 1
        for row in freader:
            itemDictionary = dict()
            for i in range(0,len(row)):
                itemDictionary[headers[i]] = convertIfPossible(row[i])
            myDictionary[ID] = itemDictionary
            ID += 1
        f.close()
    return myDictionary

def printItem(item, indent=0):
    # print out a dictionary item formatted somewhat nicely
    for k, v in item.items():
        for i in range(0,indent):
            print(" |",end="")
        print(f' {k:<20}',end="")
        if (type(v) is dict):
            print()
            printItem(v,indent+1)
        else:
            print(f': {v}')

def lookupIDs(myDict,key,value,matchType="=="):
    # find the IDs of all the items that match for category key on the
    # value value using the comparison matchType
    matchingKeys = []
    for i in myDict.keys():
        if (type(myDict[i][key]) == type(value)):
            if (matchType == "=="):
                if (myDict[i][key] == value):
                    matchingKeys.append(i)
            elif (matchType == "<="):
                if (myDict[i][key] <= value):
                    matchingKeys.append(i)
            elif (matchType == "<="):
                if (myDict[i][key] >= value):
                    matchingKeys.append(i)
            elif (matchType == "<"):
                if (myDict[i][key] < value):
                    matchingKeys.append(i)
            elif (matchType == ">"):
                if (myDict[i][key] < value):
                    matchingKeys.append(i)
    return matchingKeys

def categoryExists(myDict,category):
    # check if a particular category occurs in the first item
    firstKey = list(myDict.keys())[0]
    if (myDict[firstKey].get(category,None) == None):
        return False
    return True

def printMatches(myDict,matches):
    # print the list of matches or No matches if none
    if (len(matches) <= 1):
        print("No matches")
        return
    else:
        print("Matches")
        for m in matches:
            printItem(myDict[m])
            print()

bookDictionary = readCSVIntoDictionary(r'C:\Users\jsric\OneDrive\Desktop\PopularBooks.csv')
print(f'There are {len(bookDictionary.keys())} items in the file')
print('These are the items:')
printItem(bookDictionary)
while True:
    category = input("Category to look up: ")
    if (categoryExists(bookDictionary,category)):
        comparisonToDo = input("Comparison to do: ")
        valueToCompare = convertIfPossible(input("Value to compare: "))
        matches = lookupIDs(bookDictionary,category,valueToCompare,comparisonToDo)
'''
Here's the file I'm using
'''
bookID,title,authors,average_rating,isbn,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
24812,The Complete Calvin and Hobbes,Bill Watterson,4.82,7.41E+08,eng,1456,32213,930,9/6/05,Andrews McMeel Publishing
8,Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5),J.K. Rowling/Mary GrandPré,4.78,4.4E+08,eng,2690,41428,164,9/13/04,Scholastic
24814,It's a Magical World (Calvin and Hobbes #11),Bill Watterson,4.76,8.36E+08,eng,176,23875,303,9/1/96,Andrews McMeel Publishing
10,Harry Potter Collection (Harry Potter #1-6),J.K. Rowling,4.73,4.4E+08,eng,3342,28242,808,9/12/05,Scholastic
6550,Early Color,Saul Leiter/Martin Harrison,4.73,3.87E+09,eng,156,144,8,1/15/06,Steidl
24816,Homicidal Psycho Jungle Cat (Calvin and Hobbes #9),Bill Watterson,4.72,8.36E+08,eng,176,15365,290,9/6/94,Andrews McMeel Publishing
34545,Elliott Erwitt: Snaps,Murray Sayle/Charles Flowers/Elliott Erwitt,4.72,071484330X,en-GB,544,102,6,6/1/03,Phaidon Press
24820,Calvin and Hobbes: Sunday Pages 1985-1995: An Exhibition Catalogue,Bill Watterson,4.71,7.41E+08,eng,96,3613,85,9/17/01,Andrews McMeel Publishing
20749,Study Bible: NIV,Anonymous,4.7,3.11E+08,eng,2198,4166,186,10/1/02,Zondervan Publishing House
24520,The Complete Aubrey/Maturin Novels (5 Volumes),Patrick O'Brian,4.7,039306011X,eng,6576,1338,81,10/17/04,W. W. Norton Company
44826,The Price of the Ticket: Collected Nonfiction 1948-1985,James Baldwin,4.7,3.13E+08,eng,712,404,30,9/15/85,St. Martin's Press
24818,The Days Are Just Packed,Bill Watterson,4.69,8.36E+08,eng,176,20308,244,9/1/93,Andrews McMeel Publishing
26805,The Sibley Field Guide to Birds of Western North America,David Allen Sibley,4.69,6.79E+08,en-US,473,730,36,4/29/03,Alfred A. Knopf
5309,The Life and Times of Scrooge McDuck,Don Rosa,4.67,9.12E+08,eng,266,2467,149,6/1/05,Gemstone Publishing
23753,The Absolute Sandman Volume One,Neil Gaiman/Mike Dringenberg/Chris Bachalo/Michael Zulli/Kelly Jones/Charles Vess/Colleen Doran/Malcolm Jones III/Steve Parkhouse/Daniel Vozzo/Lee Loughridge/Steve Oliff/Todd Klein/Dave McKean/Sam Kieth,4.65,1.4E+09,eng,612,15640,512,11/1/06,Vertigo
3582,The New Annotated Sherlock Holmes: The Complete Short Stories,Arthur Conan Doyle/Leslie S. Klinger,4.64,3.93E+08,eng,1878,1411,54,11/30/04,W. W. Norton & Company
27204,The Gospel According to Luke,Anonymous/Thomas Cahill,4.64,8.02E+08,eng,81,169,17,10/29/99,Grove Press
39661,The Shawshank Redemption: The Shooting Script,Frank Darabont/Stephen King,4.64,1.56E+09,eng,184,2406,29,9/30/04,Newmarket Press
13206,The Collected Autobiographies of Maya Angelou,Maya Angelou,4.63,6.8E+08,eng,1184,991,55,9/21/04,Modern Library
17280,The Feynman Lectures on Physics Vol 3,Richard P. Feynman/Robert B. Leighton/Matthew L. Sands,4.63,8.05E+08,eng,384,716,15,9/1/05,Addison Wesley Publishing Company
24813,The Calvin and Hobbes Tenth Anniversary Book,Bill Watterson,4.63,8.36E+08,eng,208,49122,368,9/5/95,Andrews McMeel Publishing
24819,The Calvin And Hobbes: Tenth Anniversary Book,Bill Watterson,4.63,7.52E+08,eng,208,303,12,2/1/08,Time Warner Books UK
19333,The World of Peter Rabbit (Original Peter Rabbit Books 1-23),Beatrix Potter,4.62,7.23E+08,eng,1072,220,14,5/4/06,Warne
23721,Nausicaä of the Valley of the Wind Vol. 6 (Nausicaä of the Valley of the Wind #6),Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers,4.62,1.59E+09,eng,159,1776,33,8/10/04,VIZ Media
313,100 Years of Lynchings,Ralph Ginzburg,4.61,9.33E+08,eng,270,88,4,11/22/96,Black Classic Press
17142,Collected Essays: Notes of a Native Son / Nobody Knows My Name / The Fire Next Time / No Name in the Street / The Devil Finds Work / Other Essays,James Baldwin/Toni Morrison,4.61,1.88E+09,eng,869,1495,75,2/1/98,Library of America (NY)
17334,The Complete C.S. Lewis Signature Classics,C.S. Lewis,4.61,61208493,eng,746,925,40,2/6/07,HarperOne
19445,A Guide to the Words of My Perfect Teacher,Ngawang Pelzang/Padmakara Translation Group/Patrul Rinpoche/Alak Zenkar,4.61,1.59E+09,eng,336,79,0,6/22/04,Shambhala
23318,Discovery of the Presence of God: Devotional Nonduality,David R. Hawkins,4.61,9.72E+08,en-US,296,186,11,6/28/07,Veritas Publishing
23722,Nausicaä of the Valley of the Wind Vol. 5 (Nausicaä of the Valley of the Wind #5),Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers,4.61,1.59E+09,eng,151,1904,28,6/30/04,VIZ Media
23724,Nausicaä of the Valley of the Wind Vol. 7 (Nausicaä of the Valley of the Wind #7),Hayao Miyazaki/Matt Thorn/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers,4.61,1.59E+09,eng,223,1716,79,9/7/04,VIZ Media
5545,The Feynman Lectures on Physics 3 Vols,Richard P. Feynman/Robert B. Leighton/Matthew L. Sands,4.6,2.02E+08,en-US,3,78,7,1/1/89,Addison Wesley Publishing Company
9325,Fullmetal Alchemist Vol. 10,Hiromu Arakawa/Akira Watanabe,4.6,1.42E+09,eng,200,8989,151,11/21/06,VIZ Media LLC
26426,Fullmetal Alchemist Vol. 12 (Fullmetal Alchemist #12),Hiromu Arakawa/Akira Watanabe,4.6,1.42E+09,eng,192,7480,119,3/20/07,VIZ Media LLC
30,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings,J.R.R. Tolkien,4.59,3.46E+08,eng,1728,101233,1550,9/25/12,Ballantine Books
119,The Lord of the Rings: The Art of the Fellowship of the Ring,Gary Russell,4.59,6.18E+08,eng,192,26153,102,6/12/02,Houghton Mifflin Harcourt
3529,The World's First Love: Mary Mother of God,Fulton J. Sheen,4.59,8.99E+08,en-US,276,641,63,9/1/96,Ignatius Press
15336,The Lord of the Rings / The Hobbit,J.R.R. Tolkien,4.59,7144083,eng,1600,141,5,10/7/02,Collins Modern Classics
23506,Fullmetal Alchemist Vol. 11 (Fullmetal Alchemist #11),Hiromu Arakawa/Akira Watanabe,4.59,1.42E+09,eng,192,7655,129,1/16/07,VIZ Media LLC
23720,Nausicaä of the Valley of the Wind Vol. 4 (Nausicaä of the Valley of the Wind #4),Hayao Miyazaki/David Lewis/Toren Smith/Kaori Inoue/Joe Yamazaki/Walden Wong/Izumi Evers,4.59,1.59E+09,eng,134,1956,38,6/2/04,VIZ Media
26422,Fullmetal Alchemist Vol. 14 (Fullmetal Alchemist #14),Hiromu Arakawa/Akira Watanabe,4.59,142151379X,eng,192,8634,118,8/14/07,VIZ Media LLC
16602,Coffin: The Art of Vampire Hunter D,Yoshitaka Amano,4.58,1.6E+09,eng,199,596,16,11/1/06,Dark Horse Books
17961,Collected Fictions,Jorge Luis Borges/Andrew Hurley,4.58,1.4E+08,eng,565,18874,791,9/30/99,Penguin Classics Deluxe Edition
33342,The More Than Complete Hitchhiker's Guide (Hitchhiker's Guide #1-4 + short story),Douglas Adams,4.58,6.81E+08,en-US,624,433,34,11/1/89,Longmeadow Press
41468,Dewdrops on a Lotus Leaf: Zen Poems of Ryokan,Ryōkan/John Stevens,4.58,1.59E+09,eng,120,183,19,4/13/04,Shambhala
44734,Fullmetal Alchemist Vol. 6 (Fullmetal Alchemist #6),Hiromu Arakawa/Akira Watanabe,4.58,1.42E+09,eng,200,10052,201,3/21/06,VIZ Media LLC
1,Harry Potter and the Half-Blood Prince (Harry Potter #6),J.K. Rowling/Mary GrandPré,4.57,4.4E+08,eng,652,2095690,27591,9/16/06,Scholastic Inc.
866,Fullmetal Alchemist Vol. 9 (Fullmetal Alchemist #9),Hiromu Arakawa/Akira Watanabe,4.57,142150460X,eng,192,9013,153,9/19/06,VIZ Media LLC
869,Fullmetal Alchemist Vol. 8 (Fullmetal Alchemist #8),Hiromu Arakawa/Akira Watanabe,4.57,1.42E+09,eng,192,11451,161,7/18/06,VIZ Media LLC
'''
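Reading the posted code, three lines stand out as likely candidates for the errors the professor means, though that is a guess: the second "<=" branch in lookupIDs tests ">=" values and so can never be reached, the ">" branch compares with "<", and printMatches prints "No matches" when there is exactly one match because it checks len(matches) <= 1. Below is a minimal sketch of one way to fix them. The COMPARISONS table and the small books sample are my own additions for illustration, and printMatches prints the raw dictionary instead of calling printItem so the example stays self-contained.

```python
import operator

# Hypothetical lookup table (not in the original code): one entry per
# comparison string, so no branch can be duplicated or mismatched.
COMPARISONS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}

def lookupIDs(myDict, key, value, matchType="=="):
    # find the IDs of all items whose `key` entry satisfies the comparison
    compare = COMPARISONS.get(matchType)
    if compare is None:
        return []  # unknown comparison string: nothing can match
    matchingKeys = []
    for i, item in myDict.items():
        # only compare values of the same type, as the original does
        if type(item[key]) == type(value) and compare(item[key], value):
            matchingKeys.append(i)
    return matchingKeys

def printMatches(myDict, matches):
    # a single match is still a match, so test for an empty list
    if len(matches) == 0:
        print("No matches")
        return
    print("Matches")
    for m in matches:
        print(myDict[m])

# Tiny made-up sample in the same shape readCSVIntoDictionary produces
books = {
    1: {"publisher": "Vertigo", "num_pages": 612},
    2: {"publisher": "Scholastic", "num_pages": 2690},
}

printMatches(books, lookupIDs(books, "publisher", "Vertigo"))
printMatches(books, lookupIDs(books, "num_pages", 612, ">"))
```

The while loop in the question also never calls printMatches(bookDictionary, matches) after the lookup, so it is worth checking whether that line was lost when pasting.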
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
For some reason, the last line, the one that builds header1, gives me the error "list index out of range". I am not sure what is causing it, but I know I need this line. Here is a link to the page I am using for the data: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. The specific table I want is the one below the horizontal bar chart.
Traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-47-67ef2aac7bf3> in <module>
28 data_tables.append(td.findAll('table'))
29
---> 30 header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
31
32 header1
IndexError: list index out of range
Use pandas.read_html
Read HTML tables into a list of DataFrame objects.
This answer side-steps the question to provide a more efficient method for extracting tables from Wikipedia and gives the OP the desired end result.
The following code will more easily get the desired table from the Wikipedia page.
.read_html will return a list of dataframes.
The table you're interested in is at index 4.
Clean the table
Select the rows and columns with valid data.
This method does return the table headers, but the column names are multi-level so we'll rename them.
Before renaming the columns, if you need the original data from the column names, use us_covid_data.columns which will return a list of tuples with all the column name values.
import pandas as pd
# get list of dataframes and select index 4
us_covid_data = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')[4]
# select rows and columns
us_covid_data = us_covid_data.iloc[0:56, 1:6]
# rename columns
us_covid_data.columns = ['state_territory', 'cases', 'deaths', 'recovered', 'hospitalized']
# display(us_covid_data)
state_territory cases deaths recovered hospitalized
0 Alabama 45785 1033 22082 2961
1 Alaska 1184 17 560 78
2 American Samoa 0 0 – –
3 Arizona 116892 2082 – 5272
4 Arkansas 24253 292 17834 1604
5 California 296499 6711 – –
6 Colorado 34316 1704 – 5527
7 Connecticut 46976 4338 – –
8 Delaware 12293 512 6778 –
9 District of Columbia 10569 561 1465 –
10 Florida 244151 4102 – 15150
11 Georgia 111211 2965 – 11919
12 Guam 1272 6 179 –
13 Hawaii 1012 19 746 116
14 Idaho 8222 94 2886 350
15 Illinois 151767 7144 – –
16 Indiana 49560 2698 36788 7139
17 Iowa 31906 725 24242 –
18 Kansas 17618 282 – 1269
19 Kentucky 17526 623 4785 2662
20 Louisiana 66435 3296 43026 –
21 Maine 3440 110 2787 354
22 Maryland 70497 3246 – 10939
23 Massachusetts 111110 8296 88725 10985
24 Michigan 73403 6225 52841 –
25 Minnesota 38606 1511 33907 4112
26 Mississippi 31257 1114 22167 2881
27 Missouri 24985 1077 – –
28 Montana 1249 23 678 89
29 Nebraska 20053 286 14641 1224
30 Nevada 22930 537 – –
31 New Hampshire 5914 382 4684 558
32 New Jersey 174628 15479 31014 –
33 New Mexico 14549 539 6181 2161
34 New York 400299 32307 71371 –
35 North Carolina 81331 1479 55318 –
36 North Dakota 3858 89 3350 218
37 Northern Mariana Islands 31 2 19 –
38 Ohio 57956 2927 – 7292
39 Oklahoma 16362 399 12432 1676
40 Oregon 10402 218 2846 1069
41 Pennsylvania 93876 6880 – –
42 Puerto Rico 8714 157 – –
43 Rhode Island 16991 960 – 1922
44 South Carolina 47214 838 – –
45 South Dakota 7105 97 6062 689
46 Tennessee 51509 646 31020 2860
47 Texas 240111 3013 122996 9610
48 Virgin Islands 112 6 79 –
49 Utah 25563 190 14448 1565
50 Vermont 1251 56 1022 –
51 Virginia 66740 1881 – 9549
52 Washington 38517 1370 – 4463
53 West Virginia 3461 95 2518 –
54 Wisconsin 35318 805 25542 3574
55 Wyoming 1675 20 1172 253
Addressing the original issue:
data is an empty list generated from data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
With data = data_table.tbody.findAll('tr', recursive=False)[1] and then data = [v for v in data.get_text().split('\n') if v], you will get the headers.
The output of data will be ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']
Since data_tables is generated from iterating through data, it is also empty.
header1 is generated from iterating data_tables[0], so IndexError occurs because data_tables is empty.
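The chain of empty lists described above can be reproduced without any HTML at all. This is only an illustrative sketch: data stands in for the empty result of the findAll call, and first_or_none is a hypothetical helper, not part of BeautifulSoup.

```python
# Stand-in for the result of the failing findAll(...) call: an empty list.
data = []

# Same loop shape as in the question; with no cells, nothing is appended.
data_tables = []
for td in data:
    data_tables.append(td)

def first_or_none(items):
    """Return the first element, or None when the list is empty (hypothetical helper)."""
    return items[0] if items else None

# Indexing the empty list directly is what raises the IndexError.
try:
    data_tables[0]
    raised = False
except IndexError:
    raised = True

print(raised)                      # True
print(first_or_none(data_tables))  # None
```

Checking for an empty list before indexing, as first_or_none does, turns the crash into a value you can test for and report.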
I am trying to scrape the results table from the following url: https://utmbmontblanc.com/en/page/107/results.html
However when I run my code it says 'No Tables Found'
import pandas as pd
url = 'https://utmbmontblanc.com/en/page/107/results.html'
data = pd.read_html(url, header = 0)
data.head()
ValueError: No tables found
Having used developer tools I know that there is definitely a table in the html code. Why is it not being found? Any help is greatly appreciated. Thanks in advance
The results table is loaded by a separate Ajax request, so build the URL for that request and read it instead. For 2017 - CCC it looks like this:
url = 'https://.......com/result.php?mode=edPass&ajax=true&annee=2017&course=ccc'
data = pd.read_html(url, header = 0)
print(data[0])
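To switch year or race without editing the string by hand, the query part can be assembled with the standard library. This is a rough sketch: the host in BASE_URL is a placeholder because the real one is elided above, build_results_url is a name I made up, and the parameter names are simply the ones visible in the URL in this answer.

```python
from urllib.parse import urlencode

# Placeholder only: the real host is elided in the answer above.
BASE_URL = "https://example.invalid/result.php"

def build_results_url(year, course, base_url=BASE_URL):
    """Assemble the Ajax URL from the query parameters shown in the answer."""
    params = {"mode": "edPass", "ajax": "true", "annee": year, "course": course}
    return f"{base_url}?{urlencode(params)}"

url = build_results_url(2017, "ccc")
print(url)
```

The resulting string can then be passed to pd.read_html(url, header = 0) exactly as in the snippet above.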
You can also use selenium if you are unable to find any other hacks.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
from bs4 import BeautifulSoup as BSoup
import pandas as pd
url = "https://utmbmontblanc.com/en/page/107/results.html"
driver = webdriver.Chrome("/home/bitto/chromedriver")#change this to your chromedriver path
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
    # change the index of div[@class='bloc'] to change the year - [1] for 2018, [2] for 2017 etc
    # change the index of div[@class='row'] - [1], [2] for TDS etc
    # change @value of option to match your preferred option's value - you can find this from the inspect tool - the first two are Scratch and ScratchH
    EC.presence_of_element_located((By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/select[@name='cat'][1]/option[@value='Scratch']"))
)
element.click()  # select option
# make the same changes you made above here as well
driver.find_element(By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/input").click()  # click go
sleep(10)#not preferred but will do for now
table=pd.read_html(driver.page_source)
print(table)
Output
[ GeneralRanking Family name First name Club Cat. ... Time Difference/ 1st Nationality
0 1 3001 - HAWKS Hayden HOKA ONE ONE SEH ... 10:24:30 00:00:00 United States
1 2 3018 - ŚWIERC Marcin SALOMON SUUNTO TEAM POLAND SEH ... 10:42:49 00:18:19 Poland
2 3 3005 - POMMERET Ludovic TEAM HOKA V1H ... 10:50:47 00:26:17 France
3 4 3214 - EVANS Thomas COMPRESS SPORT SEH ... 10:57:44 00:33:14 United Kingdom
4 5 3002 - OWENS Tom SALOMON SEH ... 11:03:48 00:39:18 United Kingdom
5 6 3011 - JONSSON Thorbergur 66 NORTH SEH ... 11:14:22 00:49:52 Iceland
6 7 3026 - BOUVIER-GAZ Nicolas TEAM NEW BALANCE SEH ... 11:18:33 00:54:03 France
7 8 3081 - JONES Michael WWW.APEXRUNNING.CO SEH ... 11:31:50 01:07:20 United Kingdom
8 9 3020 - COLLET Aurélien HOKA ONE ONE SEH ... 11:33:10 01:08:40 France
9 10 3009 - MARAVILLA Jorge HOKA ONE ONE V1H ... 11:36:14 01:11:44 United States
10 11 3036 - PERRILLAT Christophe SEH ... 11:40:05 01:15:35 France
11 12 3070 - FRAGUELA BREIJO Alejandro STUDIO54 V1H ... 11:40:11 01:15:41 Spain
12 13 3092 - AIGROZ Mike TRUST SEH ... 11:41:53 01:17:23 Switzerland
13 14 3021 - O'LEARY Paddy THE NORTH FACE SEH ... 11:47:04 01:22:34 Ireland
14 15 3065 - PÉREZ TORREGLOSA Juan CLUB ULTRATRAIL ... SEH ... 11:47:51 01:23:21 Spain
15 16 3031 - SÁNCHEZ CEBRIÁN Miguel Ángel LURBEL-LI... V1H ... 11:49:15 01:24:45 Spain
16 17 3062 - ANDREWS Justin SEH ... 11:49:47 01:25:17 United States
17 18 3039 - PIANA Giulio TEAM MUD AND SNOW SEH ... 11:50:23 01:25:53 Italy
18 19 3047 - RONIMOISS Andris Inov8 / OSveikals.lv ... SEH ... 11:52:25 01:27:55 Latvia
19 20 3052 - DURAND Regis TEAM TRAIL ISOSTAR V1H ... 11:56:40 01:32:10 France
20 21 3027 - SANDES Ryan SALOMON SEH ... 12:04:39 01:40:09 South Africa
21 22 3014 - EL MORABITY Rachid ULTRA TRAIL ATLAS T... SEH ... 12:10:01 01:45:31 Morocco
22 23 3067 - JONES Harry RUNIVORE SEH ... 12:10:12 01:45:42 United Kingdom
23 24 3030 - CLAVERY Erik - SEH ... 12:12:56 01:48:26 France
24 25 3056 - JIMENEZ LLORENS Juan Maria GREEN POWER... SEH ... 12:13:18 01:48:48 Spain
25 26 3024 - GALLAGHER Clare THE NORTH FACE SEF ... 12:13:57 01:49:27 United States
26 27 3136 - ASSEL Garry LICENCE INDIVIDUELLE LUXEM... SEH ... 12:20:46 01:56:16 Luxembourg
27 28 3071 - RIGODANZA Francesco SPIRITO TRAIL TEAM SEH ... 12:22:49 01:58:19 Italy
28 29 3118 - POLASZEK Christophe CHARTRES VERTICAL V1H ... 12:24:49 02:00:19 France
29 30 3125 - CALERO RODRIGUEZ David Altmann Sports/... SEH ... 12:25:07 02:00:37 Spain
... ... ... ... ... ... ... ...
1712 1713 5734 - GOT Hang Fai V2H ... 26:25:01 16:00:31 Hong Kong, China
1713 1714 4154 - RAMOS Liliana NIKE RUNNING CLUB V3F ... 26:26:22 16:01:52 Argentina
1714 1715 5448 - BECKRICH Xavier PHOENIX57 V1H ... 26:26:45 16:02:15 France
1715 1716 5213 - BARBERIO ARNOULT Isabelle PHOENIX57 V1F ... 26:26:49 16:02:19 France
1716 1717 4704 - ZHANG Zheng XIAOMABENTENG SEH ... 26:28:37 16:04:07 China
1717 1718 5282 - GUISOLAN Frédéric SEH ... 26:28:46 16:04:16 Switzerland
1718 1719 5306 - MEDINA Rafael V1H ... 26:29:26 16:04:56 Mexico
1719 1720 5379 - PENTCHEFF Nicolas SEH ... 26:33:05 16:08:35 France
1720 1721 4665 - GONZALEZ SUANCES Israel BAR ES PUIG V1H ... 26:33:58 16:09:28 Spain
1721 1722 4389 - TONANNY Marie SEF ... 26:34:51 16:10:21 France
1722 1723 5616 - GLORIAN Thierry V2H ... 26:35:47 16:11:17 France
1723 1724 5684 - CHEUNG Ho FAITHWALKERS V1H ... 26:37:09 16:12:39 Hong Kong, China
1724 1725 5719 - GANDER Pascal JEFF B TRAIL SEH ... 26:39:04 16:14:34 France
1725 1726 4555 - JURGIELEWICZ Urszula SEF ... 26:39:44 16:15:14 Poland
1726 1727 4722 - HIDALGO José Miguel C.D. ATLETISMO SAN... V1H ... 26:40:27 16:15:57 Spain
1727 1728 4425 - JITTIWUTIKARN Gif V1F ... 26:41:02 16:16:32 Thailand
1728 1729 4556 - ZHU Jing SEF ... 26:41:12 16:16:42 China
1729 1730 4314 - HU Dongli V1H ... 26:41:27 16:16:57 China
1730 1731 4239 - DURET Estelle OXYGENE BELBEUF V1F ... 26:41:51 16:17:21 France
1731 1732 4525 - MAGLIERI Fabrice ATHLETIC CLUB PAYS DE... V1H ... 26:42:11 16:17:41 France
1732 1733 4433 - ANDERSEN Laura Jentsch RUN DEM CREW SEF ... 26:42:27 16:17:57 Denmark
1733 1734 4563 - CHEUNG Annie On Nai FAITHWALKERS V1F ... 26:45:35 16:21:05 Hong Kong, China
1734 1735 4355 - KHALED Naïm GENEVE AEROPORT SEH ... 26:47:50 16:23:20 Algeria
1735 1736 4749 - STELLA Sara COURMAYEUR TRAILERS V1F ... 26:48:07 16:23:37 Italy
1736 1737 4063 - LALIMAN Leslie SEF ... 26:48:09 16:23:39 France
1737 1738 5702 - BURKE Tony Alchester/CTR/Bicester Tri V2H ... 26:50:52 16:26:22 Ireland
1738 1739 5146 - OLIVEIRA Sandra BUDEGUITA RUNNERS V1F ... 26:52:23 16:27:53 Portugal
1739 1740 5545 - VELLANDI Emilio TEAM PEGGIORI SCARPA MICO V1H ... 26:55:32 16:31:02 Italy
1740 1741 5543 - GASPAROVIC Bernard STADE FRANCAIS V3H ... 26:56:31 16:32:01 France
1741 1742 4760 - MENDONCA Carine ASPTT COMPIEGNE V2F ... 27:19:15 16:54:45 Belgium
[1742 rows x 7 columns]]
I have a dataframe with city, name and members. I need to find the top 5 groups (name) in terms of highest member ('members') count per city.
This is what I get when I use:
clust.groupby(['city','name']).agg({'members':sum})
members
city name
Bath AWS Bath User Group 346
Agile Bath & Bristol 957
Bath Crypto Chat 47
Bath JS 142
Bath Machine Learning Meetup 435
Belfast 4th Industrial Revolution Challenge 609
Belfast Adobe Meetup 66
Belfast Azure Meetup 205
Southampton Crypto Currency Trading SouthCoast 50
Southampton Bitcoin and Altcoin Meetup 50
Southampton Functional Programming Meetup 28
Southampton Virtual Reality Meetup 248
Sunderland Sunderland Digital 287
I need the top 5 but as you can see the member count doesn't seem to be ordered, i.e. 346 before 957 etc.
I've also tried sorting the values before-hand and do:
clust.sort_values(['city', 'name'], axis=0).groupby('city').head(5)
But that returns a similar series.
I've used this one too clust.groupby(['city', 'name']).head(5)
but it gives me all the rows and not top 5. It also isn't structured so not in alphabetical order.
Please help. Thanks
I think you need to add ascending=[True, False] to sort_values and sort by the members column instead of name:
clust = clust.groupby(['city','name'], as_index=False)['members'].sum()
df = clust.sort_values(['city', 'members'], ascending=[True, False]).groupby('city').head(5)
print (df)
city name members
1 Bath Agile Bath & Bristol 957
4 Bath Machine Learning Meetup 435
0 Bath AWS Bath User Group 346
3 Bath JS 142
2 Bath Crypto Chat 47
5 Belfast 4th Industrial Revolution Challenge 609
7 Belfast Azure Meetup 205
6 Belfast Adobe Meetup 66
11 Southampton Virtual Reality Meetup 248
8 Southampton Crypto Currency Trading SouthCoast 50
9 Southampton Bitcoin and Altcoin Meetup 50
10 Southampton Functional Programming Meetup 28
12 Sunderland Sunderland Digital 287
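Here is the same idea as a small self-contained sketch you can run. The rows are a hand-picked subset of the groups shown in the question, and top_n_per_city is a hypothetical helper name; n is 2 only to keep the output short.

```python
import pandas as pd

# Small sample taken from the groups shown in the question
clust = pd.DataFrame({
    "city": ["Bath", "Bath", "Bath", "Belfast", "Belfast"],
    "name": ["AWS Bath User Group", "Agile Bath & Bristol", "Bath JS",
             "Belfast Adobe Meetup", "Belfast Azure Meetup"],
    "members": [346, 957, 142, 66, 205],
})

def top_n_per_city(df, n):
    """Sum members per (city, name), then keep the n largest groups in each city."""
    totals = df.groupby(["city", "name"], as_index=False)["members"].sum()
    # city ascending, members descending, so head(n) keeps the largest groups
    ordered = totals.sort_values(["city", "members"], ascending=[True, False])
    return ordered.groupby("city").head(n)

top2 = top_n_per_city(clust, 2)
print(top2)
```

The sort has to happen before groupby('city').head(n), because head only keeps the first n rows of each group in their current order.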
I have a dataset where I am trying to extract the simple town name from the longer messy version shown here. Most of them are followed by parentheses " (.*", but some do not follow this pattern and end in ":" (see line 200). Finally, there are some that do not have parentheses but split parts with a comma "," (see line 240, 246).
'Region'
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater (Bridgewater State College)[2]
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill (Boston College)
200 The Colleges of Worcester Consortium:
201 Dudley (Nichols College)
240 Faribault, South Central College
241 Mankato (Minnesota State University, Mankato),...
242 Marshall (Southwest Minnesota State University...
243 Moorhead (Minnesota State University, Moorhead...
244 Morris (University of Minnesota Morris)[2]
245 Northfield (Carleton College, St. Olaf College...
246 North Mankato, South Central College
247 St. Cloud (St. Cloud State University, The Col...
248 St. Joseph (College of Saint Benedict)[2]
249 St. Peter (Gustavus Adolphus College)[2]
What I would ideally like to see is:
'RegionName'
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
My code currently is:
df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)
But this gives me the weird result of not getting the parentheses right:
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato (Minnesota State University, Mankato)
242 Marshall
243 Moorhead (Minnesota State University, Moorhead
244 Morris
245 Northfield (Carleton College
246 North Mankato
247 St. Cloud (St. Cloud State University
248 St. Joseph
249 St. Peter
I have also tried:
df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)
I am not sure exactly how to extract the string using all three patterns at the same time. Would be open to a two line solution as well.
Thanks (apologies if this is formatted poorly!)
You may just extract any 0 or more chars other than :, , or ( at the beginning of a string with
df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
If you are working with Python 2.x, use (?u) at the beginning of the pattern so that the word boundary \b could also match the right places in a Unicode string.
Details
^ - start of a string
([^:(,]*) - Group 1: zero or more (*) consecutive occurrences of any char other than (the [^...] forms a negated character class) :, ( and ,.
\b - a word boundary.
See the regex demo and a Python 3 demo below:
>>> from pandas import DataFrame
>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df['RegionName']
RegionName
0 Boston
1 Bridgewater
2 Cambridge
3 Chestnut Hill
4 The Colleges of Worcester Consortium
5 Dudley
6 Faribault
7 Mankato
8 Marshall
9 Moorhead
10 Morris
11 Northfield
12 North Mankato
13 St. Cloud
14 St. Joseph
15 St. Peter
>>>
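If you want to check the pattern on a few strings without building a DataFrame, the same regex can be tried with the standard re module. This is just a sketch; the sample strings are copied from the question and extract_name is a hypothetical helper.

```python
import re

# Same pattern as in the answer: everything before the first ':', '(' or ','
# with the trailing space dropped by the word boundary.
PATTERN = re.compile(r'^([^:(,]*)\b')

def extract_name(region):
    """Return the leading town name, or None if the pattern does not match."""
    match = PATTERN.match(region)
    return match.group(1) if match else None

samples = [
    "Boston (Boston University, Boston College, Bos...",
    "The Colleges of Worcester Consortium:",
    "Faribault, South Central College",
    "St. Cloud (St. Cloud State University, The Col...",
]

for s in samples:
    print(extract_name(s))
```

The word boundary matters for the parenthesised rows: the character class alone would capture "Boston " with its trailing space, and \b makes the match back off to the last letter.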
Since you only have three possible delimiters, you can take advantage of chained split(), since split will return the unmodified string if the delimiter is not found.
>>> s = """196 Boston (Boston University, Boston College, Bos...
... 197 Bridgewater (Bridgewater State College)[2]
... 198 Cambridge (Harvard University, Massachusetts I...
... 199 Chestnut Hill (Boston College)
... 200 The Colleges of Worcester Consortium:
... 201 Dudley (Nichols College)
... 240 Faribault, South Central College
... 241 Mankato (Minnesota State University, Mankato),...
... 242 Marshall (Southwest Minnesota State University...
... 243 Moorhead (Minnesota State University, Moorhead...
... 244 Morris (University of Minnesota Morris)[2]
... 245 Northfield (Carleton College, St. Olaf College...
... 246 North Mankato, South Central College
... 247 St. Cloud (St. Cloud State University, The Col...
... 248 St. Joseph (College of Saint Benedict)[2]
... 249 St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
... number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ',1)
... print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
You can use df.apply to do the same transformation for your strings.
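A minimal sketch of that, with the chained split wrapped in a function so it can be handed to apply. simple_town_name is a name I made up, and the strings here have no leading row number, unlike the pasted text above.

```python
def simple_town_name(region):
    """Cut the text at the first '(', ',' or ':' and trim surrounding spaces."""
    # split() returns the whole string in a one-item list when the
    # delimiter is absent, so the three splits can be chained safely
    return region.split('(')[0].split(',')[0].split(':')[0].strip()

regions = [
    "Bridgewater (Bridgewater State College)[2]",
    "The Colleges of Worcester Consortium:",
    "North Mankato, South Central College",
    "Mankato (Minnesota State University, Mankato),...",
]

names = [simple_town_name(r) for r in regions]
print(names)
```

Assuming df is the DataFrame from the question, df['RegionName'] = df['Region'].apply(simple_town_name) should give the same column as the loop above.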
Use this regular expression:
([\w\s.]+)(?<!\s)
You can remove the negative look-behind (?<!\s) at the end if you don't care about trailing white spaces.
per_of_runs_all_bowl is a series that looks like this;
Abdur Razzak 44.915254
Ajit Agarkar 31.250000
Albie Morkel 41.538462
Alok Kapali 16.666667
Andre Nel 50.000000
Andrew Flintoff 43.636364
Andrew Symonds 20.833333
Brad Hodge 41.666667
Brett Lee 42.763158
Chamara Silva 41.666667
Chaminda Vaas 49.541284
Chamu Chibhabha 44.736842
Chris Gayle 25.000000
Chris Martin 50.000000
Chris Schofield 38.461538
...
data1.groupby(['bowler']).size() looks like this;
Abdur Razzak 118
Ajit Agarkar 48
Albie Morkel 65
Alok Kapali 12
Andre Nel 24
Andrew Flintoff 110
Andrew Symonds 72
Brad Hodge 12
Brett Lee 152
Chamara Silva 12
Chaminda Vaas 109
Chamu Chibhabha 38
Chris Gayle 24
Chris Martin 92
Chris Schofield 78
...
per_of_runs_all_bowl.loc[(data1.groupby(['bowler']).size() > 60)] returns the 'percent of runs' where the .size() is greater than 60, like this;
Abdur Razzak 44.915254
Albie Morkel 41.538462
Andrew Flintoff 43.636364
Andrew Symonds 20.833333
Brett Lee 42.763158
Chaminda Vaas 49.541284
Chris Martin 50.000000
Chris Schofield 38.461538
Daniel Vettori 42.758621
Dilhara Fernando 61.467890
Dimitri Mascarenhas 30.952381
Gayan Wijekoon 25.000000
Harbhajan Singh 32.394366
Irfan Pathan 45.652174
Jacob Oram 23.750000
James Anderson 48.484848
...
How do I get the 'percent of runs' returned along with the size like this?
Abdur Razzak 44.915254 118
Albie Morkel 41.538462 65
Andrew Flintoff 43.636364 110
Andrew Symonds 20.833333 72
Brett Lee 42.763158 152
I couldn't think of anything else but this;
I've created a new DataFrame;
new_df = pd.DataFrame({'size':data1.groupby(['bowler']).size(),'per':list(per_of_runs_all_bowl.values)})
and then filtered based on the size;
new_df_fil = new_df[new_df['size'] > 60]
per size
bowler
Abdur Razzak 4.237288 118
Albie Morkel 9.230769 65
Andrew Flintoff 8.181818 110
Andrew Symonds 15.277778 72
Brett Lee 10.526316 152
But is this efficient? I'm sure there must be 'pythonic' & 'panda-ic' ways of doing this!
Try this:
df = pd.DataFrame({'size': data1.groupby(['bowler']).size(), 'percent of runs':per_of_runs_all_bowl})
df[df['size'] > 60]
And this solution is fairly efficient.
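One detail worth pointing out: passing the Series itself (rather than list(per_of_runs_all_bowl.values) as in the question) lets pandas line the two columns up by bowler name, which may be why the 'per' values in the question's attempt do not match the expected ones. A small sketch with made-up Series in the same shape; the variable names per_of_runs and sizes are stand-ins for the real objects.

```python
import pandas as pd

# Stand-ins for per_of_runs_all_bowl and data1.groupby(['bowler']).size();
# the second Series is deliberately in a different order.
per_of_runs = pd.Series({"Abdur Razzak": 44.915254,
                         "Ajit Agarkar": 31.250000,
                         "Albie Morkel": 41.538462})
sizes = pd.Series({"Albie Morkel": 65,
                   "Abdur Razzak": 118,
                   "Ajit Agarkar": 48})

# Building the frame from Series aligns rows on the index (bowler name),
# not on position.
combined = pd.DataFrame({"percent of runs": per_of_runs, "size": sizes})
filtered = combined[combined["size"] > 60]
print(filtered)
```

Because the alignment is by label, the order of the two inputs does not matter, and each bowler keeps the right pair of values.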