Scraping data from TeamRankings.com - python

I want to scrape some NBA data from TeamRankings.com for my program in python. Here is an example link:
https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03
I only need the "Last 3" column. I want to be able to set the date to whatever I want with a constant variable. There are a few other data points I need that live on different links, but I can figure that part out once this is working.
I have tried using https://github.com/tymiguel/TeamRankingsWebScraper but it is outdated and did not work for me.

The easiest way is to use pandas.read_html:
import pandas as pd
url = 'https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03'
df = pd.read_html(url)[0]
print(df)
Prints:
Rank Team 2022 Last 3 Last 1 Home Away 2021
0 1 Brooklyn 58.8% 64.5% 68.3% 59.4% 58.1% 54.2%
1 2 Denver 57.8% 62.8% 52.2% 59.5% 56.4% 55.5%
2 3 Boston 56.8% 54.6% 51.1% 58.2% 55.1% 54.0%
3 4 Sacramento 56.3% 56.9% 48.4% 59.1% 53.4% 52.5%
4 5 Golden State 56.3% 53.2% 52.5% 56.9% 55.6% 55.4%
5 6 Dallas 56.0% 59.5% 50.0% 55.8% 56.2% 54.0%
6 7 Portland 55.5% 58.6% 65.5% 57.3% 54.3% 51.5%
7 8 Minnesota 55.3% 52.1% 59.2% 55.7% 54.9% 53.8%
8 9 Utah 55.3% 53.9% 53.7% 58.1% 53.0% 55.1%
9 10 Philadelphia 55.3% 57.3% 56.4% 54.5% 56.2% 53.6%
10 11 Cleveland 55.1% 57.7% 60.9% 56.7% 53.1% 53.7%
11 12 Washington 54.6% 61.4% 56.9% 54.7% 54.5% 53.2%
12 13 Chicago 54.6% 57.3% 54.7% 55.7% 53.5% 53.7%
13 14 Indiana 54.5% 60.3% 53.8% 56.1% 52.8% 53.1%
14 15 New Orleans 54.4% 52.5% 56.5% 56.2% 52.5% 51.8%
15 16 Phoenix 54.1% 51.6% 44.8% 54.8% 53.5% 55.0%
16 17 LA Clippers 54.1% 57.8% 52.2% 52.3% 55.8% 53.0%
17 18 LA Lakers 54.0% 56.6% 53.8% 53.7% 54.3% 53.7%
18 19 San Antonio 53.1% 54.6% 47.4% 53.4% 52.8% 52.7%
19 20 Orlando 52.9% 48.0% 44.5% 54.6% 50.9% 50.2%
20 21 Milwaukee 52.8% 45.5% 42.2% 55.0% 50.4% 54.0%
21 22 Memphis 52.8% 54.0% 51.0% 53.8% 51.8% 52.1%
22 23 Miami 52.6% 54.6% 52.9% 53.1% 52.1% 54.0%
23 24 New York 52.2% 51.4% 57.4% 53.9% 50.6% 51.3%
24 25 Atlanta 52.2% 51.5% 53.7% 51.5% 53.0% 54.2%
25 26 Okla City 52.2% 50.9% 44.6% 52.6% 51.7% 49.7%
26 27 Detroit 51.5% 52.3% 45.1% 52.7% 50.5% 49.4%
27 28 Toronto 51.1% 51.3% 52.7% 51.3% 50.8% 51.0%
28 29 Houston 51.0% 50.0% 51.8% 50.2% 51.6% 53.4%
29 30 Charlotte 50.3% 52.0% 51.1% 49.3% 51.2% 54.3%
If you want only the Last 3 column (with the team names for reference):
print(df[['Team', 'Last 3']])
Prints:
Team Last 3
0 Brooklyn 64.5%
1 Denver 62.8%
2 Boston 54.6%
3 Sacramento 56.9%
...
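To set the date with a constant, as asked, here is a minimal sketch. The DATE and STAT names are placeholders I am introducing; the URL pattern comes from the example link above:
import pandas as pd

# Hypothetical constants; change DATE to whatever date you need.
DATE = '2023-01-03'
STAT = 'effective-field-goal-pct'

url = f'https://www.teamrankings.com/nba/stat/{STAT}?date={DATE}'
df = pd.read_html(url)[0]
print(df[['Team', 'Last 3']])
Swapping STAT lets you reuse the same pattern for the other stats pages you mentioned.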

Related

Extract a table from website and save it as csv file

I want to extract the table inside the div element with class ind-mp_info to a CSV file. You can find it when you expand the "COVID-19 Statewise Status" tab.
The website link is: https://www.mygov.in/covid-19/
The code:
# importing the libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv
import pandas as pd

html = urlopen("https://www.mygov.in/covid-19/")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "ind-mp_info"})
rows = table.findAll("tr")

with open("editors.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
That table is most likely built client-side with JavaScript, so it is not in the static HTML that urlopen returns (which is why soup.find(...) comes back empty). You can get the JSON the page loads directly and convert it to a dataframe.
import requests
import pandas as pd
import time

url = 'https://www.mygov.in/sites/default/files/covid/vaccine/vaccine_counts_today.json'
payload = {'timestamp': int(time.time())}

jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['vacc_st_data'])
df.to_csv('editors.csv', index=False)
Output:
print(df.to_string())
st_name state_id covid_state_name covid_state_id dose1 dose2 total_doses last_dose1 last_dose2 last_total_doses
0 Andaman and Nicobar 1 Andaman and Nicobar 35 216987 95049 312036 216053 94601 310654
1 Andhra Pradesh 2 Andhra Pradesh 28 17826178 6298023 24124201 17633289 6216600 23849889
2 Arunachal Pradesh 3 Arunachal Pradesh 12 694525 189292 883817 692532 186724 879256
3 Assam 4 Assam 18 10706685 2259908 12966593 10510804 2209605 12720409
4 Bihar 5 Bihar 10 23504063 4521066 28025129 23366654 4488385 27855039
5 Chandigarh 6 Chandigarh 4 706582 226814 933396 700329 223569 923898
6 Chhattisgarh 7 Chhattisgarh 22 10000779 2632861 12633640 9979765 2611797 12591562
7 Dadra and Nagar Haveli and Daman and Diu 8 Dadra and Nagar Haveli and Daman and Diu 26 587610 81895 669505 584400 80828 665228
8 Delhi 9 Delhi 7 7930829 3057218 10988047 7835664 3000596 10836260
9 Goa 10 Goa 30 1098302 307364 1405666 1094394 302521 1396915
10 Gujarat 11 Gujarat 24 28535938 9139961 37675899 28113725 9054592 37168317
11 Haryana 12 Haryana 6 10152822 2974775 13127597 10090001 2924714 13014715
12 Himachal Pradesh 13 Himachal Pradesh 2 4335980 1406772 5742752 4249932 1382642 5632574
13 Jammu and Kashmir 14 Jammu and Kashmir 1 5376054 1508641 6884695 5325806 1491262 6817068
14 Jharkhand 15 Jharkhand 20 8450135 2018319 10468454 8388534 1997186 10385720
15 Karnataka 16 Karnataka 29 26000864 7509346 33510210 25860894 7437119 33298013
16 Kerala 17 Kerala 32 15759471 6442507 22201978 15672348 6427551 22099899
17 Ladakh 18 Ladakh 37 188876 70779 259655 188699 70337 259036
18 Lakshadweep 19 Lakshadweep 31 51371 17296 68667 51165 17170 68335
19 Madhya Pradesh 20 Madhya Pradesh 23 29817764 5750317 35568081 29752302 5736096 35488398
20 Maharashtra 21 Maharashtra 27 35261712 12239857 47501569 35044144 12114068 47158212
21 Manipur 22 Manipur 14 1163534 251078 1414612 1159499 246753 1406252
22 Meghalaya 23 Meghalaya 17 946600 238582 1185182 938984 232152 1171136
23 Mizoram 24 Mizoram 15 656018 209089 865107 654946 206780 861726
24 Nagaland 25 Nagaland 13 634479 162621 797100 632129 159436 791565
25 Odisha 26 Odisha 21 14222570 4264500 18487070 13971009 4202596 18173605
26 Puducherry 27 Puducherry 34 604872 152636 757508 601608 151744 753352
27 Punjab 28 Punjab 3 8222725 2303559 10526284 8202118 2287403 10489521
28 Rajasthan 29 Rajasthan 8 27226185 8464839 35691024 27017475 8377435 35394910
29 Sikkim 30 Sikkim 11 498609 152574 651183 497851 151538 649389
30 Tamil Nadu 31 Tamil Nadu 33 21024528 4734496 25759024 20857302 4689811 25547113
31 Telangana 32 Telengana 36 11714148 4019069 15733217 11649833 3966309 15616142
32 Tripura 33 Tripura 16 2417276 808369 3225645 2411801 804137 3215938
33 Uttar Pradesh 34 Uttar Pradesh 9 46430534 8618231 55048765 45976210 8518342 54494552
34 Uttarakhand 35 Uttarakhand 5 5171059 1622492 6793551 5071246 1596762 6668008
35 West Bengal 36 West Bengal 19 23559058 9184539 32743597 23264439 9134008 32398447
36 Miscellaneous 37 Miscellaneous 38 1900366 1549702 3450068 1900173 1549042 3449215
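If you only need some of those columns, here is a small follow-up sketch; the column names are taken from the output above:
# Keep only the state name and dose counts before saving.
subset = df[['st_name', 'dose1', 'dose2', 'total_doses']]
subset.to_csv('editors.csv', index=False)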

inner join not working in pandas dataframes

I have the following 2 pandas dataframes:
city Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
city W/L Ratio
0 Boston 2.500000
1 Buffalo 0.555556
2 Calgary 1.057143
3 Chicago 0.846154
4 Columbus 1.500000
5 Dallas–Fort Worth 1.312500
6 Denver 1.433333
7 Detroit 0.769231
8 Edmonton 0.900000
9 Las Vegas 2.125000
10 Los Angeles 1.655862
11 Miami–Fort Lauderdale 1.466667
12 Minneapolis-Saint Paul 1.730769
13 Montreal 0.725000
14 Nashville 2.944444
15 New York 1.517241
16 New York City 0.908870
17 Ottawa 0.651163
18 Philadelphia 1.615385
19 Phoenix 0.707317
20 Pittsburgh 1.620690
21 Raleigh 1.028571
22 San Francisco Bay Area 1.666667
23 St. Louis 1.375000
24 Tampa Bay 2.347826
25 Toronto 1.884615
26 Vancouver 0.775000
27 Washington, D.C. 1.884615
28 Winnipeg 2.600000
And I do a join like this:
result = pd.merge(df, nhl_df, on="city")
The result should have 28 rows; instead I get 24.
One of the missing ones, for example, is Miami-Fort Lauderdale.
I have double-checked both dataframes and there are NO typographical errors. So why isn't it in the resulting dataframe?
city Population W/L Ratio
0 New York City 20153634 0.908870
1 Los Angeles 13310447 1.655862
2 San Francisco Bay Area 6657982 1.666667
3 Chicago 9512999 0.846154
4 Dallas–Fort Worth 7233323 1.312500
5 Washington, D.C. 6131977 1.884615
6 Philadelphia 6070500 1.615385
7 Boston 4794447 2.500000
8 Denver 2853077 1.433333
9 Phoenix 4661537 0.707317
10 Detroit 4297617 0.769231
11 Toronto 5928040 1.884615
12 Pittsburgh 2342299 1.620690
13 St. Louis 2807002 1.375000
14 Nashville 1865298 2.944444
15 Buffalo 1132804 0.555556
16 Montreal 4098927 0.725000
17 Vancouver 2463431 0.775000
18 Columbus 2041520 1.500000
19 Calgary 1392609 1.057143
20 Ottawa 1323783 0.651163
21 Edmonton 1321426 0.900000
22 Winnipeg 778489 2.600000
23 Las Vegas 2155664 2.125000
24 Raleigh 1302946 1.028571
You can check whether the characters really match by comparing their code points with ord. Here the dashes are different: one has code 150 and the other 8211 (the Unicode en dash), which is why the values do not match:
a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale
print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale
print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
You can fix it by copying the actual dash characters from each dataframe into replace, so that you substitute the correct one:
#first – is copied from b, second – from a
df2['city'] = df2['city'].replace('–','–', regex=True)
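A more defensive option (a sketch, using the df1/df2 names from the snippet above and assuming you would rather normalize both dataframes than copy invisible characters around) is to replace both dash code points with a plain hyphen before merging:
# chr(150) and chr(8211) are the two dash code points shown by ord() above.
dash_pattern = f'[{chr(150)}{chr(8211)}]'
for frame in (df1, df2):
    frame['city'] = frame['city'].str.replace(dash_pattern, '-', regex=True)

result = pd.merge(df1, df2, on='city')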

Pandas - read a text file

I have a text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
I'm trying to read it into a pandas data frame like this:
df = pd.read_table('assist1.txt',
                   sep='\s+',
                   skiprows=6,
                   header=0)
This code throws an exception - pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 31, saw 8.
I guess that's because of the space between the first and last name of the player (the full name should be a single value in the Player column).
Is there a way to achieve this?
Furthermore, it is a part of a larger text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Table
================================================================================================
Pos Team Pld Won Drn Lst For Ag Won Drn Lst For Ag Pts
--------------------------------------------------------------------------------------------------
1st C Man Utd 38 15 4 0 41 4 10 4 5 34 20 83
--------------------------------------------------------------------------------------------------
2nd Arsenal 38 15 2 2 38 9 11 3 5 28 14 83
3rd Leeds 38 15 4 0 33 8 9 4 6 36 37 80
4th Liverpool 38 13 4 2 25 7 9 2 8 26 24 72
5th Chelsea 38 16 1 2 44 18 4 5 10 24 33 66
6th Newcastle 38 11 5 3 40 23 7 3 9 25 33 62
7th Blackburn 38 11 3 5 36 24 5 5 9 23 30 56
8th Middlesbrough 38 9 7 3 31 19 5 6 8 20 29 55
9th Sunderland 38 8 5 6 31 30 8 2 9 22 25 55
10th West Ham 38 11 3 5 31 17 3 7 9 14 29 52
11th Tottenham 38 10 3 6 35 26 4 5 10 23 35 50
12th Leicester 38 7 5 7 23 20 6 4 9 26 28 48
13th Fulham 38 7 5 7 39 35 5 7 7 33 44 48
14th Ipswich 38 9 4 6 23 22 3 3 13 14 34 43
15th Charlton 38 5 5 9 18 26 5 4 10 16 30 39
16th Everton 38 8 4 7 30 28 1 5 13 11 36 36
17th Aston Villa 38 2 8 9 19 28 5 6 8 21 26 35
--------------------------------------------------------------------------------------------------
18th R Derby 38 6 4 9 25 28 3 3 13 14 39 34
19th R Southampton 38 5 7 7 34 34 1 4 14 12 35 29
20th R Bolton 38 6 3 10 25 31 1 4 14 15 40 28
================================================================================================
2001/2 Goals
================================================================================================
Pos Player Club Apps Gls
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 25
2nd Alan Shearer Newcastle 36 25
3rd Ruud van Nistelrooy Man Utd 26 23
4th Steve Marlet Fulham 38 20
5th Jimmy Floyd Hasselbaink Chelsea 30 (1) 20
6th Les Ferdinand Sunderland 27 (2) 17
7th Kevin Phillips Sunderland 36 17
8th Frédéric Kanouté West Ham 32 (3) 14
9th Marcus Bent Blackburn 28 (4) 13
10th Alen Boksic Middlesbrough 36 13
11th Eidur Gudjohnsen Chelsea 28 (3) 13
12th Luis Boa Morte Fulham 36 13
13th Michael Owen Liverpool 32 (1) 12
14th Dwight Yorke Man Utd 29 (1) 11
15th Henrik Pedersen Bolton 36 11
16th Juan Pablo Angel Aston Villa 34 (2) 11
17th Juan Sebastián Verón Man Utd 29 (2) 11
18th Shaun Bartlett Charlton 35 10
19th Matt Jansen Blackburn 28 (5) 10
20th Duncan Ferguson Everton 28 (5) 10
21st Ian Harte Leeds 37 10
22nd Bosko Balaban Aston Villa 36 10
23rd Robbie Fowler Liverpool 25 (3) 10
24th Georgi Kinkladze Derby 36 (1) 10
25th Hamilton Ricard Middlesbrough 28 (2) 10
26th Robert Pires Arsenal 24 (3) 9
27th Andrew Cole Man Utd 15 (5) 9
28th Rod Wallace Bolton 31 9
29th James Beattie Southampton 28 (1) 9
30th Robbie Keane Leeds 28 (8) 9
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
================================================================================================
2001/2 Average Rating
================================================================================================
Pos Player Club Apps Av R
-------------------------------------------------------------------------
1st Ruud van Nistelrooy Man Utd 26 8.54
2nd Thierry Henry Arsenal 34 8.09
3rd Alan Shearer Newcastle 36 7.97
4th Kieron Dyer Newcastle 33 7.94
5th Steve Marlet Fulham 38 7.89
6th Ian Harte Leeds 37 7.86
7th Andrew Cole Man Utd 15 (5) 7.85
8th Roy Keane Man Utd 19 7.84
9th Les Ferdinand Sunderland 27 (2) 7.83
10th Juan Sebastián Verón Man Utd 29 (2) 7.81
11th Eidur Gudjohnsen Chelsea 28 (3) 7.77
12th Jesper Grønkjær Chelsea 34 7.76
13th Michaël Silvestre Man Utd 32 7.72
14th Dean Gordon Middlesbrough 30 (1) 7.71
15th Michael Owen Liverpool 32 (1) 7.70
16th Patrick Vieira Arsenal 29 7.69
17th Robert Pires Arsenal 24 (3) 7.67
18th Ryan Giggs Man Utd 32 7.66
19th Dwight Yorke Man Utd 29 (1) 7.63
20th Mario Stanic Chelsea 29 (3) 7.63
21st Frédéric Kanouté West Ham 32 (3) 7.57
22nd Mark Viduka Leeds 21 7.57
23rd David Beckham Man Utd 29 7.55
24th Jimmy Floyd Hasselbaink Chelsea 30 (1) 7.55
25th Martin Taylor Blackburn 14 (8) 7.55
26th Titus Bramble Ipswich 33 7.55
27th Sol Campbell Arsenal 20 (1) 7.52
28th Mario Melchiot Chelsea 19 (2) 7.52
29th Stephane Henchoz Liverpool 29 7.52
30th Rio Ferdinand Leeds 36 (1) 7.51
================================================================================================
2001/2 Man of Match
================================================================================================
Pos Player Club Apps MoM
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 8
2nd Ruud van Nistelrooy Man Utd 26 8
3rd Kieron Dyer Newcastle 33 6
4th Les Ferdinand Sunderland 27 (2) 6
5th Steve Marlet Fulham 38 6
6th Eidur Gudjohnsen Chelsea 28 (3) 6
7th Ian Harte Leeds 37 5
8th Richie Wellens Leicester 20 (9) 5
9th Henrik Pedersen Bolton 36 5
10th Alan Shearer Newcastle 36 5
11th Michael Owen Liverpool 32 (1) 4
12th Dean Gordon Middlesbrough 30 (1) 4
13th Matt Jansen Blackburn 28 (5) 4
14th Marcus Bent Blackburn 28 (4) 4
15th Kevin Campbell Everton 27 (4) 4
16th Titus Bramble Ipswich 33 4
17th Roy Keane Man Utd 19 4
18th Frédéric Kanouté West Ham 32 (3) 4
19th Patrick Vieira Arsenal 29 4
20th Hermann Hreidarsson Ipswich 34 4
21st Dennis Bergkamp Arsenal 22 (9) 4
22nd Jimmy Floyd Hasselbaink Chelsea 30 (1) 4
23rd Claus Lundekvam Southampton 27 (2) 4
24th Robert Pires Arsenal 24 (3) 3
25th Shaun Bartlett Charlton 35 3
26th Kevin Phillips Sunderland 36 3
27th Lucas Radebe Leeds 31 (1) 3
28th Ragnvald Soma West Ham 27 (3) 3
29th Dean Richards Tottenham 34 3
30th Wayne Quinn Liverpool 25 (4) 3
Ideally I would like to run a function that creates a data frame out of each table above, but can't figure it out.
Thanks
Another way: you can specify the separator as more than one space and pass skiprows as a list of rows. I tried this and it gave me your expected output. You can write a simple script to work out which lines should be skipped and which should be kept.
df = pd.read_table('assist1.txt', sep=r'\s\s+', skiprows=[0,1,2,3,4,5,6,7,8,10], header=0, engine='python')
You're using whitespace as a delimiter, but this file is fixed-width, not whitespace-delimited. Look up fixed-width parsing; pandas has read_fwf for this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html
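To get a data frame out of each table in the larger file, one possible approach is sketched below. It assumes every table sits between lines of '=' (as in the sample above) and that read_fwf can infer the column widths; 'season.txt' is a placeholder file name.
import re
from io import StringIO
import pandas as pd

def read_sections(path):
    """Return a dict mapping each section title to a DataFrame."""
    text = open(path, encoding='utf-8').read()
    # Each section looks like: ====\n<title>\n====\n<header row>\n----\n<data rows>
    pattern = re.compile(r'={5,}\n(.+?)\n={5,}\n(.*?)(?=\n={5,}|\Z)', re.S)
    frames = {}
    for title, body in pattern.findall(text):
        # Drop blank lines and the dashed rule lines, keep header + data rows.
        lines = [line for line in body.splitlines()
                 if line.strip() and not line.strip().startswith('-')]
        frames[title.strip()] = pd.read_fwf(StringIO('\n'.join(lines)))
    return frames

tables = read_sections('season.txt')   # placeholder file name
print(tables['2001/2 Assists'].head())
Depending on how read_fwf infers the widths, the bracketed substitute appearances (e.g. '30 (1)') may land in their own column; pass explicit colspecs if you need exact boundaries.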

How to parse downloaded HTML files and create lists

I've downloaded some HTML files I am wanting to parse. I was able to parse the files but now I want to make some lists so I can make a scatter plot. I'm totally new to Python so I am not sure how to make these into lists.
I tried setting a variable equal to the text I got from the column.
# imports assumed for this snippet (not shown in the original)
import time
import random
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

for y in range(1977, 2020, 1):
    tmp = random.random() * 5.0
    print('Sleep for ', tmp, ' seconds')
    time.sleep(tmp)
    url = 'https://www.basketball-reference.com/teams/IND/' + str(y) + '_games.html'
    print('Download from :', url)
    # download
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    fileout = 'YEARS/' + str(y) + '.html'
    print('Save to : ', fileout, '\n')
    # save file to disk
    f = open(fileout, 'w')
    f.write(html.decode('utf-8'))
    f.close()

# parse
for year in range(1977, 2019, 1):
    filein = 'YEARS/' + str(year) + '.html'
    soup = BeautifulSoup(open(filein), 'lxml')
    entries = soup.find_all('tr', attrs={'class': ''})
    for entry in entries:
        # print(entry)
        columns = entry.find_all('td')
        if len(columns) > 4:
            # print('C0: ', columns[4])
            where = columns[4].get_text()
            # print('C1: ', columns[5])
            opponent = columns[5].get_text()
            # print('C2: ', columns[6])
            WL = columns[6].get_text()
            # print('C3: ', columns[8])
            PacerScore = columns[8].get_text()
            # print('C4: ', columns[9])
            OpponentScore = columns[9].get_text()
            tt = where + '|::|' + opponent + '|::|' + WL + '|::|' + PacerScore + '|::|' + OpponentScore
            print(tt)
            x = PacerScore
            y = OpponentScore

# area and colors are not defined in the snippet as posted
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
I tried using read_html from pandas as well but I was screwing something up and could not get it to work. It kept telling me feature not found.
# parse
for y in range(1977, 2019, 1):
    filein = 'YEARS/' + str(y) + '.html'
    soup = BeautifulSoup(open(filein), 'r')
    table = BeautifulSoup(open('YEARS/' + str(y) + '.html', 'r').read()).find('table')
    df = pd.read_html(table)
Any advice or pointers would be greatly appreciated.
If you are using pandas' .read_html(), you don't need BeautifulSoup to find the table tags; pandas does that for you. You're also doing a lot of work to first save the HTML and then parse it. Why not parse the HTML straight away, and then, if you want, just save that table?
Then you can plot using the table.
import requests
import pandas as pd
import numpy as np
import time
import random

headers = {'User-Agent': 'Mozilla/5.0'}

for year in range(1977, 2020, 1):
    tmp = random.random() * 5.0
    print('Sleep for ', tmp, ' seconds')
    time.sleep(tmp)

    url = 'https://www.basketball-reference.com/teams/IND/' + str(year) + '_games.html'
    response = requests.get(url, headers=headers)

    # read_html finds the <table> tags in the downloaded page itself
    tables = pd.read_html(response.text)
    table = tables[0]

    # drop the repeated header rows, keep and rename the columns of interest
    table = table[table['G'] != 'G']
    table = table[['Unnamed: 5', 'Opponent', 'Unnamed: 7', 'Tm', 'Opp']]
    table.columns = ['Where', 'Opponent', 'WL', 'PacerScore', 'OpponentScore']
    table['Where'] = np.where(table.Where == '@', 'Away', 'Home')  # '@' marks away games
    table['Season'] = year  # season column, as shown in the output below

    print('Download table from :', url)
    table.to_csv('YEARS/' + str(year) + '.csv')
Your table will look like this, and then you can just do:
x = table['PacerScore']
y = table['OpponentScore']
to get your x and y values for your scatter plot.
Output:
print (table.to_string())
Where Opponent WL PacerScore OpponentScore Season
0 Home Memphis Grizzlies W 111 83 2019
1 Away Milwaukee Bucks L 101 118 2019
2 Home Brooklyn Nets W 132 112 2019
3 Away Minnesota Timberwolves L 91 101 2019
4 Away San Antonio Spurs W 116 96 2019
5 Away Cleveland Cavaliers W 119 107 2019
6 Home Portland Trail Blazers L 93 103 2019
7 Away New York Knicks W 107 101 2019
8 Away Chicago Bulls W 107 105 2019
9 Home Boston Celtics W 102 101 2019
10 Home Houston Rockets L 94 98 2019
11 Home Philadelphia 76ers L 94 100 2019
12 Away Miami Heat W 110 102 2019
13 Away Houston Rockets L 103 115 2019
14 Home Miami Heat W 99 91 2019
15 Home Atlanta Hawks W 97 89 2019
16 Home Utah Jazz W 121 94 2019
17 Away Charlotte Hornets L 109 127 2019
18 Home San Antonio Spurs L 100 111 2019
19 Away Utah Jazz W 121 88 2019
21 Away Phoenix Suns W 109 104 2019
22 Away Los Angeles Lakers L 96 104 2019
23 Away Sacramento Kings L 110 111 2019
24 Home Chicago Bulls W 96 90 2019
25 Away Orlando Magic W 112 90 2019
26 Home Sacramento Kings W 107 97 2019
27 Home Washington Wizards W 109 101 2019
28 Home Milwaukee Bucks W 113 97 2019
29 Away Philadelphia 76ers W 113 101 2019
30 Home New York Knicks W 110 99 2019
31 Home Cleveland Cavaliers L 91 92 2019
32 Away Toronto Raptors L 96 99 2019
33 Away Brooklyn Nets W 114 106 2019
34 Home Washington Wizards W 105 89 2019
35 Away Atlanta Hawks W 129 121 2019
36 Home Detroit Pistons W 125 88 2019
37 Home Atlanta Hawks W 116 108 2019
38 Away Chicago Bulls W 119 116 2019
39 Away Toronto Raptors L 105 121 2019
40 Away Cleveland Cavaliers W 123 115 2019
42 Away Boston Celtics L 108 135 2019
43 Away New York Knicks W 121 106 2019
44 Home Phoenix Suns W 131 97 2019
45 Home Philadelphia 76ers L 96 120 2019
46 Home Dallas Mavericks W 111 99 2019
47 Home Charlotte Hornets W 120 95 2019
48 Home Toronto Raptors W 110 106 2019
49 Away Memphis Grizzlies L 103 106 2019
50 Home Golden State Warriors L 100 132 2019
51 Away Washington Wizards L 89 107 2019
52 Away Orlando Magic L 100 107 2019
53 Away Miami Heat W 95 88 2019
54 Away New Orleans Pelicans W 109 107 2019
55 Home Los Angeles Lakers W 136 94 2019
56 Home Los Angeles Clippers W 116 92 2019
57 Home Cleveland Cavaliers W 105 90 2019
58 Home Charlotte Hornets W 99 90 2019
59 Home Milwaukee Bucks L 97 106 2019
60 Home New Orleans Pelicans W 126 111 2019
61 Away Washington Wizards W 119 112 2019
63 Away Detroit Pistons L 109 113 2019
64 Away Dallas Mavericks L 101 110 2019
65 Home Minnesota Timberwolves W 122 115 2019
66 Home Orlando Magic L 112 117 2019
67 Home Chicago Bulls W 105 96 2019
68 Away Milwaukee Bucks L 98 117 2019
69 Away Philadelphia 76ers L 89 106 2019
70 Home New York Knicks W 103 98 2019
71 Home Oklahoma City Thunder W 108 106 2019
72 Away Denver Nuggets L 100 102 2019
73 Away Portland Trail Blazers L 98 106 2019
74 Away Los Angeles Clippers L 109 115 2019
75 Away Golden State Warriors L 89 112 2019
76 Home Denver Nuggets NaN NaN NaN 2019
77 Away Oklahoma City Thunder NaN NaN NaN 2019
78 Away Boston Celtics NaN NaN NaN 2019
79 Home Orlando Magic NaN NaN NaN 2019
80 Home Detroit Pistons NaN NaN NaN 2019
81 Away Detroit Pistons NaN NaN NaN 2019
82 Home Boston Celtics NaN NaN NaN 2019
84 Home Brooklyn Nets NaN NaN NaN 2019
85 Away Atlanta Hawks NaN NaN NaN 2019
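From there, a small follow-up sketch for the scatter plot itself; the file name assumes the to_csv() call above, and the scores are converted to numeric because they come back from the CSV as plain objects:
import pandas as pd
import matplotlib.pyplot as plt

# Load one saved season and plot Pacers score vs opponent score.
table = pd.read_csv('YEARS/2019.csv')
x = pd.to_numeric(table['PacerScore'], errors='coerce')
y = pd.to_numeric(table['OpponentScore'], errors='coerce')

plt.scatter(x, y, alpha=0.5)
plt.xlabel('PacerScore')
plt.ylabel('OpponentScore')
plt.show()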

Count number of counties per state using python {census}

I am having trouble counting the number of counties using the well-known census.csv data.
Task: count the number of counties in each state.
I think the problem is with the comparison (please read below).
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique()  # gives the unique state names
serr = pd.Series(dfd)           # converting to a Series (from an array)
After this, I've tried two approaches:
1:
df[df['STNAME'] == serr]  # ERROR: series length must match
2:
i = 0
for name in serr:  # this raises an error at 'Alabama'
    df['STNAME'] == name
    for i in serr:
        serr[i] == serr[name]
        print(serr[name].count)
        i += 1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution; however, the code needs a minor modification.
The rows with 'SUMLEV' == 40 (equivalently, 'COUNTY' == 0) are state-level summary rows, not counties, and should be filtered out. Otherwise every state's county count is too big by one.
So the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
@Bakhtawar - this is a very simple way:
df.groupby(df['STNAME']).count().COUNTY
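An equivalent alternative, sketched under the same assumption about census.csv (SUMLEV == 50 marks the county-level rows): filter first, then count rows per state.
import pandas as pd

df = pd.read_csv('census.csv')

# Keep only county-level rows, then count how many each state has.
county_rows = df[df['SUMLEV'] == 50]
counts = county_rows['STNAME'].value_counts().sort_index()
print(counts.head())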
