Reformatting scraped selenium table

I'm scraping a table that displays info for a sporting league. So far so good for a selenium beginner:
from selenium import webdriver
import re
import pandas as pd
driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []
for i in infotable:
    ilist.append(i.text)
infolist = ilist[0]
for i in matches:
    match.append(i.text)
driver.close()
home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])
df = pd.DataFrame({'home' : home, 'away' : away})
date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)
In the last line, date scrapes all the dates in the table but I can't link them to the corresponding game.
My thinking is: for child/element "under the date", date = last_found_date.
The ultimate goal is to have two more columns in df: one with the date of the match and one with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).
Should I be incorporating another program/method to retain order of tags/elements of the table?

You would need to change the way you extract the match information. Instead of separately extracting home and away teams, do it in one loop also extracting the dates and events:
from selenium import webdriver
import pandas as pd
driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
data = []
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    # the nearest preceding header cell holds the date (and optional event) for this match
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
    if " - " in date:
        date, event = date.split(" - ")
    else:
        event = "Not specified"
    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })
driver.close()
df = pd.DataFrame(data)
print(df)
Prints:
away date event home
0 Washington Capitals 25 Apr 2015 Play Offs New York Islanders
1 Minnesota Wild 25 Apr 2015 Play Offs St.Louis Blues
2 Ottawa Senators 25 Apr 2015 Play Offs Montreal Canadiens
3 Pittsburgh Penguins 25 Apr 2015 Play Offs New York Rangers
4 Calgary Flames 24 Apr 2015 Play Offs Vancouver Canucks
5 Chicago Blackhawks 24 Apr 2015 Play Offs Nashville Predators
6 Tampa Bay Lightning 24 Apr 2015 Play Offs Detroit Red Wings
7 New York Islanders 24 Apr 2015 Play Offs Washington Capitals
8 St.Louis Blues 23 Apr 2015 Play Offs Minnesota Wild
9 Anaheim Ducks 23 Apr 2015 Play Offs Winnipeg Jets
10 Montreal Canadiens 23 Apr 2015 Play Offs Ottawa Senators
11 New York Rangers 23 Apr 2015 Play Offs Pittsburgh Penguins
12 Vancouver Canucks 22 Apr 2015 Play Offs Calgary Flames
13 Nashville Predators 22 Apr 2015 Play Offs Chicago Blackhawks
14 Washington Capitals 22 Apr 2015 Play Offs New York Islanders
15 Tampa Bay Lightning 22 Apr 2015 Play Offs Detroit Red Wings
16 Anaheim Ducks 21 Apr 2015 Play Offs Winnipeg Jets
17 St.Louis Blues 21 Apr 2015 Play Offs Minnesota Wild
18 New York Rangers 21 Apr 2015 Play Offs Pittsburgh Penguins
19 Vancouver Canucks 20 Apr 2015 Play Offs Calgary Flames
20 Montreal Canadiens 20 Apr 2015 Play Offs Ottawa Senators
21 Nashville Predators 19 Apr 2015 Play Offs Chicago Blackhawks
22 Washington Capitals 19 Apr 2015 Play Offs New York Islanders
23 Winnipeg Jets 19 Apr 2015 Play Offs Anaheim Ducks
24 Pittsburgh Penguins 19 Apr 2015 Play Offs New York Rangers
25 Minnesota Wild 18 Apr 2015 Play Offs St.Louis Blues
26 Detroit Red Wings 18 Apr 2015 Play Offs Tampa Bay Lightning
27 Calgary Flames 18 Apr 2015 Play Offs Vancouver Canucks
28 Chicago Blackhawks 18 Apr 2015 Play Offs Nashville Predators
29 Ottawa Senators 18 Apr 2015 Play Offs Montreal Canadiens
30 New York Islanders 18 Apr 2015 Play Offs Washington Capitals
31 Winnipeg Jets 17 Apr 2015 Play Offs Anaheim Ducks
32 Minnesota Wild 17 Apr 2015 Play Offs St.Louis Blues
33 Detroit Red Wings 17 Apr 2015 Play Offs Tampa Bay Lightning
34 Pittsburgh Penguins 17 Apr 2015 Play Offs New York Rangers
35 Calgary Flames 16 Apr 2015 Play Offs Vancouver Canucks
36 Chicago Blackhawks 16 Apr 2015 Play Offs Nashville Predators
37 Ottawa Senators 16 Apr 2015 Play Offs Montreal Canadiens
38 New York Islanders 16 Apr 2015 Play Offs Washington Capitals
39 Edmonton Oilers 12 Apr 2015 Not specified Vancouver Canucks
40 Anaheim Ducks 12 Apr 2015 Not specified Arizona Coyotes
41 Chicago Blackhawks 12 Apr 2015 Not specified Colorado Avalanche
42 Nashville Predators 12 Apr 2015 Not specified Dallas Stars
43 Boston Bruins 12 Apr 2015 Not specified Tampa Bay Lightning
44 Pittsburgh Penguins 12 Apr 2015 Not specified Buffalo Sabres
45 Detroit Red Wings 12 Apr 2015 Not specified Carolina Hurricanes
46 New Jersey Devils 12 Apr 2015 Not specified Florida Panthers
47 Columbus Blue Jackets 12 Apr 2015 Not specified New York Islanders
48 Montreal Canadiens 12 Apr 2015 Not specified Toronto Maple Leafs
49 Calgary Flames 11 Apr 2015 Not specified Winnipeg Jets
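
The asker's own idea ("date = last_found_date") can also be sketched without Selenium or XPath. The snippet below is only an illustration: the `rows` list of `(kind, text)` tuples is a hypothetical stand-in for table rows collected in page order, not something the site or Selenium provides in this shape.

```python
import pandas as pd

# Hypothetical stand-in for rows scraped in page order: a "header" row
# carries the date text, the "match" rows after it belong to that date.
rows = [
    ("header", "25 Apr 2015 - Play Offs"),
    ("match", "New York Islanders - Washington Capitals"),
    ("match", "St.Louis Blues - Minnesota Wild"),
    ("header", "12 Apr 2015"),
    ("match", "Vancouver Canucks - Edmonton Oilers"),
]


def rows_to_frame(rows):
    data = []
    date, event = "", "Not specified"
    for kind, text in rows:
        if kind == "header":
            # Remember the most recent date and the optional event label.
            date, _, event = text.partition(" - ")
            event = event or "Not specified"
        else:
            home, away = text.split(" - ")
            data.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": date.strip(),
                "event": event.strip(),
            })
    return pd.DataFrame(data)


df = rows_to_frame(rows)
print(df)
```

Each match row simply inherits whatever date header was seen last, which is the same relationship the XPath `preceding::th[...][1]` lookup expresses.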

Related

Scraping data from TeamRankings.com

I want to scrape some NBA data from TeamRankings.com for my program in python. Here is an example link:
https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03
I only need the "Last 3" column data. I want to be able to set the date to whatever I want with a constant variable. There are a few other data points I want that are on different links but I will be able to figure that part out if this gets figured out.
I have tried using https://github.com/tymiguel/TeamRankingsWebScraper but it is outdated and did not work for me.

The easiest way will be to use pandas.read_html:
import pandas as pd
url = 'https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03'
df = pd.read_html(url)[0]
print(df)
Prints:
Rank Team 2022 Last 3 Last 1 Home Away 2021
0 1 Brooklyn 58.8% 64.5% 68.3% 59.4% 58.1% 54.2%
1 2 Denver 57.8% 62.8% 52.2% 59.5% 56.4% 55.5%
2 3 Boston 56.8% 54.6% 51.1% 58.2% 55.1% 54.0%
3 4 Sacramento 56.3% 56.9% 48.4% 59.1% 53.4% 52.5%
4 5 Golden State 56.3% 53.2% 52.5% 56.9% 55.6% 55.4%
5 6 Dallas 56.0% 59.5% 50.0% 55.8% 56.2% 54.0%
6 7 Portland 55.5% 58.6% 65.5% 57.3% 54.3% 51.5%
7 8 Minnesota 55.3% 52.1% 59.2% 55.7% 54.9% 53.8%
8 9 Utah 55.3% 53.9% 53.7% 58.1% 53.0% 55.1%
9 10 Philadelphia 55.3% 57.3% 56.4% 54.5% 56.2% 53.6%
10 11 Cleveland 55.1% 57.7% 60.9% 56.7% 53.1% 53.7%
11 12 Washington 54.6% 61.4% 56.9% 54.7% 54.5% 53.2%
12 13 Chicago 54.6% 57.3% 54.7% 55.7% 53.5% 53.7%
13 14 Indiana 54.5% 60.3% 53.8% 56.1% 52.8% 53.1%
14 15 New Orleans 54.4% 52.5% 56.5% 56.2% 52.5% 51.8%
15 16 Phoenix 54.1% 51.6% 44.8% 54.8% 53.5% 55.0%
16 17 LA Clippers 54.1% 57.8% 52.2% 52.3% 55.8% 53.0%
17 18 LA Lakers 54.0% 56.6% 53.8% 53.7% 54.3% 53.7%
18 19 San Antonio 53.1% 54.6% 47.4% 53.4% 52.8% 52.7%
19 20 Orlando 52.9% 48.0% 44.5% 54.6% 50.9% 50.2%
20 21 Milwaukee 52.8% 45.5% 42.2% 55.0% 50.4% 54.0%
21 22 Memphis 52.8% 54.0% 51.0% 53.8% 51.8% 52.1%
22 23 Miami 52.6% 54.6% 52.9% 53.1% 52.1% 54.0%
23 24 New York 52.2% 51.4% 57.4% 53.9% 50.6% 51.3%
24 25 Atlanta 52.2% 51.5% 53.7% 51.5% 53.0% 54.2%
25 26 Okla City 52.2% 50.9% 44.6% 52.6% 51.7% 49.7%
26 27 Detroit 51.5% 52.3% 45.1% 52.7% 50.5% 49.4%
27 28 Toronto 51.1% 51.3% 52.7% 51.3% 50.8% 51.0%
28 29 Houston 51.0% 50.0% 51.8% 50.2% 51.6% 53.4%
29 30 Charlotte 50.3% 52.0% 51.1% 49.3% 51.2% 54.3%
If you want only the Last 3 column:
print(df[['Team', 'Last 3']])
Prints:
Team Last 3
0 Brooklyn 64.5%
1 Denver 62.8%
2 Boston 54.6%
3 Sacramento 56.9%
...
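
Since the question asks for the date to be controlled by a constant, here is a minimal sketch of one way to do that. The names `BASE_URL`, `DATE`, `build_url` and `fetch_last3` are illustrative, and the sketch assumes the stats table is still the first table on the page.

```python
import pandas as pd

BASE_URL = "https://www.teamrankings.com/nba/stat/effective-field-goal-pct"
DATE = "2023-01-03"  # change this constant to query another day


def build_url(date, base_url=BASE_URL):
    # The page takes the date as a ?date=YYYY-MM-DD query parameter.
    return f"{base_url}?date={date}"


def fetch_last3(date):
    # Assumption: the stats table is the first <table> on the page.
    df = pd.read_html(build_url(date))[0]
    return df[["Team", "Last 3"]]


url = build_url(DATE)
print(url)
```

Calling `fetch_last3(DATE)` performs the actual request. Other stat pages could be handled by passing a different `base_url` to `build_url`, assuming they accept the same `date` parameter.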

Combining Pandas DataFrames With Multiple Reference Columns

I'm trying to combine two pandas DataFrames to update the first one based on criteria from the second. Here is a sample of the two dataframes:
df1
year
2016 CALIFORNIA CLINTON, HILLARY
2016 CALIFORNIA TRUMP, DONALD J.
2016 CALIFORNIA JOHNSON, GARY
2016 CALIFORNIA STEIN, JILL
2016 CALIFORNIA WRITE-IN
2016 CALIFORNIA LA RIVA, GLORIA ESTELLA
2016 TEXAS TRUMP, DONALD J.
2016 TEXAS CLINTON, HILLARY
2016 TEXAS JOHNSON, GARY
2016 TEXAS STEIN, JILL
...
state candidate
year
1988 CALIFORNIA BUSH, GEORGE H.W.
1988 CALIFORNIA DUKAKIS, MICHAEL
1988 CALIFORNIA PAUL, RONALD ""RON""
1988 CALIFORNIA FULANI, LENORA
1988 TEXAS BUSH, GEORGE H.W.
1988 TEXAS DUKAKIS, MICHAEL
1988 TEXAS PAUL, RONALD ""RON""
1988 TEXAS FULANI, LENORA
df2
year
1988 CALIFORNIA 47
1988 TEXAS 29
...
2016 CALIFORNIA 55
2016 TEXAS 38
There are values for every election year from 2020 to 1972 that includes all candidates and all states in a similar format. There are other columns in df1 but they aren't relevant to what I'm trying to do.
My expected result is:
year
2016 CALIFORNIA CLINTON, HILLARY 55
2016 CALIFORNIA TRUMP, DONALD J. 55
2016 CALIFORNIA JOHNSON, GARY 55
2016 CALIFORNIA STEIN, JILL 55
2016 CALIFORNIA WRITE-IN 55
2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
2016 TEXAS TRUMP, DONALD J. 38
2016 TEXAS CLINTON, HILLARY 38
2016 TEXAS JOHNSON, GARY 38
2016 TEXAS STEIN, JILL 38
...
state candidate
year
1988 CALIFORNIA BUSH, GEORGE H.W. 47
1988 CALIFORNIA DUKAKIS, MICHAEL 47
1988 CALIFORNIA PAUL, RONALD ""RON"" 47
1988 CALIFORNIA FULANI, LENORA 47
1988 TEXAS BUSH, GEORGE H.W. 29
1988 TEXAS DUKAKIS, MICHAEL 29
1988 TEXAS PAUL, RONALD ""RON"" 29
1988 TEXAS FULANI, LENORA 29
I want to match up the electoral_votes column in df2 with the year and state columns in df1 so it puts the correct value. I got some assistance and was able to match it up when there is only one column being matched (you can see the question and answer here) but I am having trouble matching it up with the two points of reference (year and state). If I use the code linked as is it returns the error:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have tried apply, map, applymap, merge, etc and haven't been able to figure it out. Thanks in advance for the help!

I believe what you are looking for is a left merge. Specify the common columns that the merge should be based on with on=[...].
# Imports
import pandas as pd
# Specify two columns in the "on".
pd.merge(df1,
         df2,
         how='left',
         on=['year','state'])
Out[1821]:
year state candidate votes
0 2016 CALIFORNIA CLINTON, HILLARY 55
1 2016 CALIFORNIA TRUMP, DONALD J. 55
2 2016 CALIFORNIA JOHNSON, GARY 55
3 2016 CALIFORNIA STEIN, JILL 55
4 2016 CALIFORNIA WRITE-IN 55
5 2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
6 2016 TEXAS TRUMP, DONALD J. 38
7 2016 TEXAS CLINTON, HILLARY 38
8 2016 TEXAS JOHNSON, GARY 38
9 2016 TEXAS STEIN, JILL 38
10 1988 CALIFORNIA BUSH, GEORGE H.W. 47
11 1988 CALIFORNIA DUKAKIS, MICHAEL 47
12 1988 CALIFORNIA PAUL, RONALD ""RON"" 47
13 1988 CALIFORNIA FULANI, LENORA 47
14 1988 TEXAS BUSH, GEORGE H.W. 29
15 1988 TEXAS DUKAKIS, MICHAEL 29
16 1988 TEXAS PAUL, RONALD ""RON"" 29
17 1988 TEXAS FULANI, LENORA 29
The above code could be written as:
pd.merge(df1,
         df2,
         how='left',
         left_on=['year','state'],
         right_on=['year','state'])
but since the columns are the same in the 2 dfs, we can use on = ['year', 'state']
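
A self-contained sketch of the two-key merge, using a handful of rows from the question. It assumes, as the question's printout suggests, that `year` is the index of both frames, so the index is reset first. The `validate="many_to_one"` argument is an optional extra check that each (year, state) pair appears only once in `df2`.

```python
import pandas as pd

df1 = pd.DataFrame({
    "year": [2016, 2016, 1988, 1988],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA", "TEXAS"],
    "candidate": ["CLINTON, HILLARY", "TRUMP, DONALD J.",
                  "BUSH, GEORGE H.W.", "DUKAKIS, MICHAEL"],
}).set_index("year")

df2 = pd.DataFrame({
    "year": [1988, 1988, 2016, 2016],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA", "TEXAS"],
    "electoral_votes": [47, 29, 55, 38],
}).set_index("year")

# Turn the year index back into a column so both keys are ordinary columns.
merged = df1.reset_index().merge(
    df2.reset_index(),
    on=["year", "state"],
    how="left",
    validate="many_to_one",  # raises if df2 has duplicate (year, state) pairs
)
print(merged)
```

A left merge keeps every row of `df1` in its original order and copies the matching `electoral_votes` value onto each candidate row.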

An alternative way to write it:
merged_df = df1.merge(df2, on=['year', 'state'], how='left')
If you want to use only 3 columns from df1:
df1 = pd.read_csv('<name_of_the_CSV_file>', usecols=['year', 'state', 'candidate'])

inner join not working in pandas dataframes

I have the following 2 pandas dataframes:
city Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
city W/L Ratio
0 Boston 2.500000
1 Buffalo 0.555556
2 Calgary 1.057143
3 Chicago 0.846154
4 Columbus 1.500000
5 Dallas–Fort Worth 1.312500
6 Denver 1.433333
7 Detroit 0.769231
8 Edmonton 0.900000
9 Las Vegas 2.125000
10 Los Angeles 1.655862
11 Miami–Fort Lauderdale 1.466667
12 Minneapolis-Saint Paul 1.730769
13 Montreal 0.725000
14 Nashville 2.944444
15 New York 1.517241
16 New York City 0.908870
17 Ottawa 0.651163
18 Philadelphia 1.615385
19 Phoenix 0.707317
20 Pittsburgh 1.620690
21 Raleigh 1.028571
22 San Francisco Bay Area 1.666667
23 St. Louis 1.375000
24 Tampa Bay 2.347826
25 Toronto 1.884615
26 Vancouver 0.775000
27 Washington, D.C. 1.884615
28 Winnipeg 2.600000
And I do a join like this:
result = pd.merge(df, nhl_df , on="city")
The result should have 28 rows, instead I have 24 rows.
One of the missing rows is, for example, Miami-Fort Lauderdale.
I have double-checked both dataframes and there are NO typographical errors. So why isn't it in the final dataframe?
city Population W/L Ratio
0 New York City 20153634 0.908870
1 Los Angeles 13310447 1.655862
2 San Francisco Bay Area 6657982 1.666667
3 Chicago 9512999 0.846154
4 Dallas–Fort Worth 7233323 1.312500
5 Washington, D.C. 6131977 1.884615
6 Philadelphia 6070500 1.615385
7 Boston 4794447 2.500000
8 Denver 2853077 1.433333
9 Phoenix 4661537 0.707317
10 Detroit 4297617 0.769231
11 Toronto 5928040 1.884615
12 Pittsburgh 2342299 1.620690
13 St. Louis 2807002 1.375000
14 Nashville 1865298 2.944444
15 Buffalo 1132804 0.555556
16 Montreal 4098927 0.725000
17 Vancouver 2463431 0.775000
18 Columbus 2041520 1.500000
19 Calgary 1392609 1.057143
20 Ottawa 1323783 0.651163
21 Edmonton 1321426 0.900000
22 Winnipeg 778489 2.600000
23 Las Vegas 2155664 2.125000
24 Raleigh 1302946 1.028571

You can check whether two strings really contain the same characters by printing the integer code of each character with ord. Here the dashes differ: the one in the first frame has code 150 and the one in the second has code 8211 (an en dash), which is why the values do not match:
a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale
print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale
print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
You can replace one dash with the other so that both columns use the same character:
# the first dash (code 8211) is the one in b, the second (code 150) is the one in a
df2['city'] = df2['city'].replace('\u2013', '\x96', regex=True)
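
A more general variant is to normalize every dash-like character to a plain hyphen in both frames before merging. This also covers rows such as Minneapolis-Saint Paul, which uses an ASCII hyphen in one table. The sketch below uses a few hypothetical rows, and the set of characters in `DASHES` is an assumption about which variants might occur.

```python
import pandas as pd

# Dash-like characters to normalize: code 150 plus common Unicode dashes.
DASHES = "[\x96\u2010\u2011\u2012\u2013\u2014\u2212]"


def normalize_city(s):
    # Replace any dash variant with "-" and trim surrounding whitespace.
    return s.str.replace(DASHES, "-", regex=True).str.strip()


# Hypothetical sample rows with mismatched dash characters.
df = pd.DataFrame({
    "city": ["Miami\x96Fort Lauderdale", "Minneapolis\u2013Saint Paul", "Boston"],
    "Population": [6066387, 3551036, 4794447],
})
nhl_df = pd.DataFrame({
    "city": ["Miami\u2013Fort Lauderdale", "Minneapolis-Saint Paul", "Boston"],
    "W/L Ratio": [1.466667, 1.730769, 2.5],
})

df["city"] = normalize_city(df["city"])
nhl_df["city"] = normalize_city(nhl_df["city"])
result = pd.merge(df, nhl_df, on="city")
print(result)
```

Normalizing both sides avoids having to know which frame holds which variant.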

Pandas - read a text file

I have a text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
I'm trying to read it into a pandas data frame like this:
df = pd.read_table('assist1.txt',
                   sep='\s+',
                   skiprows=6,
                   header=0,)
This code throws an exception - pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 31, saw 8.
I guess that's because of the space between the first and last name of the player (should be the value of the Player column).
Is there a way to achieve this?
Furthermore, it is a part of a larger text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Table
================================================================================================
Pos Team Pld Won Drn Lst For Ag Won Drn Lst For Ag Pts
--------------------------------------------------------------------------------------------------
1st C Man Utd 38 15 4 0 41 4 10 4 5 34 20 83
--------------------------------------------------------------------------------------------------
2nd Arsenal 38 15 2 2 38 9 11 3 5 28 14 83
3rd Leeds 38 15 4 0 33 8 9 4 6 36 37 80
4th Liverpool 38 13 4 2 25 7 9 2 8 26 24 72
5th Chelsea 38 16 1 2 44 18 4 5 10 24 33 66
6th Newcastle 38 11 5 3 40 23 7 3 9 25 33 62
7th Blackburn 38 11 3 5 36 24 5 5 9 23 30 56
8th Middlesbrough 38 9 7 3 31 19 5 6 8 20 29 55
9th Sunderland 38 8 5 6 31 30 8 2 9 22 25 55
10th West Ham 38 11 3 5 31 17 3 7 9 14 29 52
11th Tottenham 38 10 3 6 35 26 4 5 10 23 35 50
12th Leicester 38 7 5 7 23 20 6 4 9 26 28 48
13th Fulham 38 7 5 7 39 35 5 7 7 33 44 48
14th Ipswich 38 9 4 6 23 22 3 3 13 14 34 43
15th Charlton 38 5 5 9 18 26 5 4 10 16 30 39
16th Everton 38 8 4 7 30 28 1 5 13 11 36 36
17th Aston Villa 38 2 8 9 19 28 5 6 8 21 26 35
--------------------------------------------------------------------------------------------------
18th R Derby 38 6 4 9 25 28 3 3 13 14 39 34
19th R Southampton 38 5 7 7 34 34 1 4 14 12 35 29
20th R Bolton 38 6 3 10 25 31 1 4 14 15 40 28
================================================================================================
2001/2 Goals
================================================================================================
Pos Player Club Apps Gls
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 25
2nd Alan Shearer Newcastle 36 25
3rd Ruud van Nistelrooy Man Utd 26 23
4th Steve Marlet Fulham 38 20
5th Jimmy Floyd Hasselbaink Chelsea 30 (1) 20
6th Les Ferdinand Sunderland 27 (2) 17
7th Kevin Phillips Sunderland 36 17
8th Frédéric Kanouté West Ham 32 (3) 14
9th Marcus Bent Blackburn 28 (4) 13
10th Alen Boksic Middlesbrough 36 13
11th Eidur Gudjohnsen Chelsea 28 (3) 13
12th Luis Boa Morte Fulham 36 13
13th Michael Owen Liverpool 32 (1) 12
14th Dwight Yorke Man Utd 29 (1) 11
15th Henrik Pedersen Bolton 36 11
16th Juan Pablo Angel Aston Villa 34 (2) 11
17th Juan Sebastián Verón Man Utd 29 (2) 11
18th Shaun Bartlett Charlton 35 10
19th Matt Jansen Blackburn 28 (5) 10
20th Duncan Ferguson Everton 28 (5) 10
21st Ian Harte Leeds 37 10
22nd Bosko Balaban Aston Villa 36 10
23rd Robbie Fowler Liverpool 25 (3) 10
24th Georgi Kinkladze Derby 36 (1) 10
25th Hamilton Ricard Middlesbrough 28 (2) 10
26th Robert Pires Arsenal 24 (3) 9
27th Andrew Cole Man Utd 15 (5) 9
28th Rod Wallace Bolton 31 9
29th James Beattie Southampton 28 (1) 9
30th Robbie Keane Leeds 28 (8) 9
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
================================================================================================
2001/2 Average Rating
================================================================================================
Pos Player Club Apps Av R
-------------------------------------------------------------------------
1st Ruud van Nistelrooy Man Utd 26 8.54
2nd Thierry Henry Arsenal 34 8.09
3rd Alan Shearer Newcastle 36 7.97
4th Kieron Dyer Newcastle 33 7.94
5th Steve Marlet Fulham 38 7.89
6th Ian Harte Leeds 37 7.86
7th Andrew Cole Man Utd 15 (5) 7.85
8th Roy Keane Man Utd 19 7.84
9th Les Ferdinand Sunderland 27 (2) 7.83
10th Juan Sebastián Verón Man Utd 29 (2) 7.81
11th Eidur Gudjohnsen Chelsea 28 (3) 7.77
12th Jesper Grønkjær Chelsea 34 7.76
13th Michaël Silvestre Man Utd 32 7.72
14th Dean Gordon Middlesbrough 30 (1) 7.71
15th Michael Owen Liverpool 32 (1) 7.70
16th Patrick Vieira Arsenal 29 7.69
17th Robert Pires Arsenal 24 (3) 7.67
18th Ryan Giggs Man Utd 32 7.66
19th Dwight Yorke Man Utd 29 (1) 7.63
20th Mario Stanic Chelsea 29 (3) 7.63
21st Frédéric Kanouté West Ham 32 (3) 7.57
22nd Mark Viduka Leeds 21 7.57
23rd David Beckham Man Utd 29 7.55
24th Jimmy Floyd Hasselbaink Chelsea 30 (1) 7.55
25th Martin Taylor Blackburn 14 (8) 7.55
26th Titus Bramble Ipswich 33 7.55
27th Sol Campbell Arsenal 20 (1) 7.52
28th Mario Melchiot Chelsea 19 (2) 7.52
29th Stephane Henchoz Liverpool 29 7.52
30th Rio Ferdinand Leeds 36 (1) 7.51
================================================================================================
2001/2 Man of Match
================================================================================================
Pos Player Club Apps MoM
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 8
2nd Ruud van Nistelrooy Man Utd 26 8
3rd Kieron Dyer Newcastle 33 6
4th Les Ferdinand Sunderland 27 (2) 6
5th Steve Marlet Fulham 38 6
6th Eidur Gudjohnsen Chelsea 28 (3) 6
7th Ian Harte Leeds 37 5
8th Richie Wellens Leicester 20 (9) 5
9th Henrik Pedersen Bolton 36 5
10th Alan Shearer Newcastle 36 5
11th Michael Owen Liverpool 32 (1) 4
12th Dean Gordon Middlesbrough 30 (1) 4
13th Matt Jansen Blackburn 28 (5) 4
14th Marcus Bent Blackburn 28 (4) 4
15th Kevin Campbell Everton 27 (4) 4
16th Titus Bramble Ipswich 33 4
17th Roy Keane Man Utd 19 4
18th Frédéric Kanouté West Ham 32 (3) 4
19th Patrick Vieira Arsenal 29 4
20th Hermann Hreidarsson Ipswich 34 4
21st Dennis Bergkamp Arsenal 22 (9) 4
22nd Jimmy Floyd Hasselbaink Chelsea 30 (1) 4
23rd Claus Lundekvam Southampton 27 (2) 4
24th Robert Pires Arsenal 24 (3) 3
25th Shaun Bartlett Charlton 35 3
26th Kevin Phillips Sunderland 36 3
27th Lucas Radebe Leeds 31 (1) 3
28th Ragnvald Soma West Ham 27 (3) 3
29th Dean Richards Tottenham 34 3
30th Wayne Quinn Liverpool 25 (4) 3
Ideally I would like to run a function that creates a data frame out of each table above, but can't figure it out.
Thanks

Another way is to specify the separator as two or more whitespace characters and to pass skiprows as a list of row numbers. I tried this and it gave me your expected output. You can write a simple script to find which lines should be skipped and which should be kept.
df = pd.read_table('assist1.txt', sep=r'\s\s+', skiprows=[0,1,2,3,4,5,6,7,8,10], header=0, engine='python')
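
The "simple script" could look roughly like the sketch below, which splits the larger file into one data frame per section. It assumes every section starts with a rule of `=`, a title line and another rule of `=`, and that fields are separated by at least two spaces. The function name `split_tables` and the shortened sample text are illustrative only.

```python
import io
import re

import pandas as pd

RULE = re.compile(r"[-*=]+")  # lines made only of -, * or = are decoration


def split_tables(text):
    """Return {section title: DataFrame} for every '=' delimited section."""
    lines = [ln.strip() for ln in text.splitlines()]
    sections = {}   # title -> list of table lines (header line first)
    current = None
    i = 0
    while i < len(lines):
        # A section starts with: rule of '=', title line, rule of '='.
        if (lines[i].startswith("=") and i + 2 < len(lines)
                and lines[i + 2].startswith("=")):
            current = sections.setdefault(lines[i + 1], [])
            i += 3
            continue
        # Keep non-empty, non-decoration lines once a section has begun.
        if current is not None and lines[i] and not RULE.fullmatch(lines[i]):
            current.append(lines[i])
        i += 1
    return {
        title: pd.read_csv(io.StringIO("\n".join(body)),
                           sep=r"\s\s+", engine="python")
        for title, body in sections.items() if body
    }


sample = """\
****************
English Premier Division - Saturday 25th May 2002
****************
================
2001/2 Goals
================
Pos   Player          Club        Apps    Gls
---------------------------------------------
1st   Thierry Henry   Arsenal     34      25
2nd   Alan Shearer    Newcastle   36      25
================
2001/2 Assists
================
Pos   Player          Club            Apps     Asts
---------------------------------------------------
1st   David Beckham   Man Utd         29       15
2nd   Dean Gordon     Middlesbrough   30 (1)   11
"""

tables = split_tables(sample)
print(tables["2001/2 Assists"])
```

The league table section would need extra care, because its header repeats column names and some rows carry an additional flag after the position.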

You're using whitespace as a delimiter, but this file is fixed-width formatted, not whitespace delimited. Look into fixed-width parsing, e.g. pandas.read_fwf: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html.
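
A minimal sketch of the fixed-width approach with `pandas.read_fwf`. The column widths are assumptions, since the exact positions in the original file cannot be recovered from the post, so the sample text is built with those same widths to keep the example consistent.

```python
import io

import pandas as pd

# Assumed column widths (Pos, Player, Club, Apps, Asts); adjust to the real file.
widths = [6, 20, 16, 9, 4]

rows = [
    ("Pos", "Player", "Club", "Apps", "Asts"),
    ("1st", "David Beckham", "Man Utd", "29", "15"),
    ("2nd", "Dean Gordon", "Middlesbrough", "30 (1)", "11"),
    ("3rd", "John Collins", "Fulham", "32", "11"),
]

# Build a small fixed-width sample: header, a dashed rule, then the data rows.
lines = ["".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows]
lines.insert(1, "-" * sum(widths))
text = "\n".join(lines)

# skiprows=[1] drops the dashed rule under the header line.
df = pd.read_fwf(io.StringIO(text), widths=widths, skiprows=[1])
print(df)
```

Because the fields are cut by position, spaces inside a player name, a club name or an Apps value such as "30 (1)" do not create extra columns.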

Pandas - How to get index values from a dataframe

I have a pandas dataframe of the form
Start Date End Date President Party
0 04 March 1921 02 August 1923 Warren G Harding Republican
1 03 August 1923 04 March 1929 Calvin Coolidge Republican
2 05 March 1929 04 March 1933 Herbert Hoover Republican
3 05 March 1933 12 April 1945 Franklin D Roosevelt Democratic
4 13 April 1945 20 January 1953 Harry S Truman Democratic
5 21 January 1953 20 January 1961 Dwight D Eisenhower Republican
6 21 January 1961 22 November 1963 John F Kennedy Democratic
7 23 November 1963 20 January 1969 Lydon B Johnson Democratic
8 21 January 1969 09 August 1974 Richard Nixon Republican
9 10 August 1974 20 January 1977 Gerald Ford Republican
10 21 January 1977 20 January 1981 Jimmy Carter Democratic
11 21 January 1981 20 January 1989 Ronald Reagan Republican
12 21 January 1989 20 January 1993 George H W Bush Republican
13 21 January 1993 20 January 2001 Bill Clinton Democratic
14 21 January 2001 20 January 2009 George W Bush Republican
15 21 January 2009 20 January 2017 Barack Obama Democratic
16 21 January 2017 20 May 2017 Donald Trump Republican
I want to extract the index values for Party=Republican and store them in a list.
Is there a Pandas function to do this quickly?

df.index[df.Party == 'Republican']
You can call .tolist() on the result if you want.
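
A short runnable illustration using a few rows from the question (the frame is trimmed to two columns for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "President": ["Warren G Harding", "Franklin D Roosevelt",
                  "Harry S Truman", "Dwight D Eisenhower"],
    "Party": ["Republican", "Democratic", "Democratic", "Republican"],
})

# The boolean mask selects the matching index labels; tolist() makes a plain list.
republican_idx = df.index[df["Party"] == "Republican"].tolist()
print(republican_idx)
```

The bracket form `df["Party"]` behaves the same as `df.Party` here and also works for column names that are not valid Python identifiers.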
