Pandas - How to get index values from a dataframe - python

I have a pandas dataframe of the form
Start Date End Date President Party
0 04 March 1921 02 August 1923 Warren G Harding Republican
1 03 August 1923 04 March 1929 Calvin Coolidge Republican
2 05 March 1929 04 March 1933 Herbert Hoover Republican
3 05 March 1933 12 April 1945 Franklin D Roosevelt Democratic
4 13 April 1945 20 January 1953 Harry S Truman Democratic
5 21 January 1953 20 January 1961 Dwight D Eisenhower Republican
6 21 January 1961 22 November 1963 John F Kennedy Democratic
7 23 November 1963 20 January 1969 Lyndon B Johnson Democratic
8 21 January 1969 09 August 1974 Richard Nixon Republican
9 10 August 1974 20 January 1977 Gerald Ford Republican
10 21 January 1977 20 January 1981 Jimmy Carter Democratic
11 21 January 1981 20 January 1989 Ronald Reagan Republican
12 21 January 1989 20 January 1993 George H W Bush Republican
13 21 January 1993 20 January 2001 Bill Clinton Democratic
14 21 January 2001 20 January 2009 George W Bush Republican
15 21 January 2009 20 January 2017 Barack Obama Democratic
16 21 January 2017 20 May 2017 Donald Trump Republican
I want to extract the index values for Party=Republican and store them in a list.
Is there a Pandas function to do this quickly?

df.index[df.Party == 'Republican']
You can call .tolist() on the result if you want.
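As a minimal sketch of the answer above, assuming a small stand-in frame (the four rows and the variable name `republican_idx` are illustrative, not the full table from the question):

```python
import pandas as pd

# Small stand-in for the question's frame; only the Party column matters here.
df = pd.DataFrame({
    "President": ["Warren G Harding", "Franklin D Roosevelt",
                  "Harry S Truman", "Richard Nixon"],
    "Party": ["Republican", "Democratic", "Democratic", "Republican"],
})

# The boolean mask picks the matching labels out of the index;
# .tolist() converts the resulting Index into a plain Python list.
republican_idx = df.index[df["Party"] == "Republican"].tolist()
print(republican_idx)  # [0, 3]
```

Because the mask is applied to `df.index` itself, this should also work when the index holds labels other than 0..n-1.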

Related

How to create a dictionary such that output contains the list and the paragraph values in a dictionary [closed]

It would be a great help to know how to resolve this so that I get the proper output in a JSON file.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.iomfsa.im/enforcement/disqualified-directors/')
soup = BeautifulSoup(r.content, 'html.parser')
paragraphs = []
length = soup.findAll("strong")
for leng in length:
    paragraphs.append(leng.next_sibling)
paragraph = [i for i in paragraphs if i is not None]
print(paragraph)
list=['name','address','DOB','POD','DOD','Particulars of Disqualification Order or Undertaking']
expected output
You can do something like below:
from bs4 import BeautifulSoup
import requests
import json
import re
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
url = 'https://www.iomfsa.im/enforcement/disqualified-directors/'
df_list = []
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
# print(soup)
dismissed_dirs = soup.select('section.accordion-item')
for d in dismissed_dirs:
    # print(d)
    name = d.find('strong', text=re.compile('^Name:')).next_sibling
    address = d.find('strong', text=re.compile('^Address')).next_sibling
    dob = d.find('strong', text=re.compile('^Date of Birth:')).next_sibling
    pod = d.find('strong', text=re.compile('^Period of Disqualification:')).next_sibling
    dod = d.find('strong', text=re.compile('^Dates of Disqualification:')).parent.text
    particulars = d.find('strong', text=re.compile('^Particulars')).find_next('a').text
    df_list.append((name, address, dob, pod, dod, particulars))
df = pd.DataFrame(df_list, columns = ['name', 'address', 'dob', 'pod', 'dod', 'particulars'])
print(df)
print('--------------')
print(df.to_dict(orient='records'))
This will return both a dataframe (which makes more sense, visually), as well as a dictionary, as you requested:
| | name | address | dob | pod | dod | particulars |
| 0 | John Trevor Roche Baines | c/o Isle of Man Prison, Jurby, Isle of Man | 19 Dec 1939 | 15 Years 0 Months 0 Days | Dates of Disqualification: From 15 Jul 2010 To 15 Jul 2025 | Section 2 Company Officers (Disqualification) Act 2009 |
| 1 | Ralph Stephen Brunswick | Valdfrieden, Ballaugh Glen, Ballaugh, Isle of Man IM7 5JB | None | 13 Years 6 Months 0 Days | Dates of Disqualification: From 4 Mar 2009 To 4 Sep 2022 | Section 2 Company Officers (Disqualification) Act 2009 |
| 2 | Fenella Jane Carter | 30 North Quay, Douglas, Isle of Man IM1 4LD | 6 August 1984 | 6 Years 0 Months 0 Days | Dates of Disqualification: From 5 July 2018 to 5 July 2024 | Section 2 Company Officers (Disqualification) Act 2009 |
| 3 | Richard Alan Costain | St Patrick's Close, Jurby, Isle of Man | 13 Sep 1951 | 15 Years 0 Months 0 Days | Dates of Disqualification: From 16 Feb 2017 To 16 Feb 2032 | Section 2 Company Officers (Disqualification) Act 2009 |
| 4 | Paul Deighton | Isle of Man Prison, St Patrick’s Close, Coast Road IM7 3JP | 23 December 1966 | 12 Years 0 Months 0 Days | Dates of Disqualification: From 8 August 2020 to 7 August 2032 | Section 4 Company Officers (Disqualification) Act 2009 |
| 5 | Jamie Alexander Irving | Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN | Not known | 7 Years 0 Months 0 Days | Dates of Disqualification: From 26 Feb 2018 To 26 Feb 2025 | Section 4 Company Officers (Disqualification) Act 2009 |
| 6 | Jonathan Frank Edward Irving | Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN | Not known | 8 Years 0 Months 0 Days | Dates of Disqualification: From 26 Feb 2018 To 26 Feb 2026 | Section 4 Company Officers (Disqualification) Act 2009 |
| 7 | Duncan Frank Ellis Jones | Harpers Glen, Hillberry Green, Douglas, Isle of Man IM2 6DE | 10 Aug 1959 | 13 Years 0 Months 0 Days | Dates of Disqualification: From 25 Apr 2011 To 25 Apr 2024 | Section 2 Company Officers (Disqualification) Act 2009 |
| 8 | Lynn Keig | Croit-e-Quill, Lonan, Isle of Man, IM4 7JG | 28 June 1956 | Years 0 Months 0 Days | Dates of Disqualification: From 29 Jun 2017 To 28 Jun 2023 | Section 2 Company Officers (Disqualification) Act 2009 |
| 9 | Richard Ian Kissack | 6 Falcon Cliff Court, Douglas, Isle of Man, IM2 4AQ (currently Mr Kissack is residing at HM IOM Prison, Jurby, Isle of Man) | 30 October 1968 | 5 Years 11 Months 13 Days | Dates of Disqualification: From 31 December 2021 to 13 December 2027 | Section 2 Company Officers (Disqualification) Act 2009 |
| 10 | Alan Louis | Not known | 23 November 1965 | 12 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2031 | Section 2 Company Officers (Disqualification) Act 2009 |
| 11 | Phillip Sean McCarthy | Cedar Lodge, Main Road, Crosby IM4 4BH | 5 October 1979 | 8 Years 0 Months 0 Days | Dates of Disqualification: From 28 November 2019 to 27 November 2027 | Section 2 Company Officers (Disqualification) Act 2009 |
| 12 | John McCauley | Not known | 30 March 1955 | 5 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2024 | Section 2 Company Officers (Disqualification) Act 2009 |
| 13 | Dirk Frederik Mudge | 92, Daan Bekker Street, Windhoek, Namibia | 23 December 1976 | 8 Years 0 Months 0 Days | Dates of Disqualification: From 17 November 2018 to 16 November 2026 | Section 2 Company Officers (Disqualification) Act 2009 |
| 14 | Lukas Nakos | Not known | 6 March 1976 | 6 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2025 | Section 2 Company Officers (Disqualification) Act 2009 |
| 15 | Andrew Mark Rouse | 13 Reayrt Ny Chrink, Crosby, Isle of Man IM4 2EA | 24 Jan 1977 | 5 Years 0 Months 0 Days | Dates of Disqualification: From 28 Feb 2018 To 28 Feb 2023 | Section 2 Company Officers (Disqualification) Act 2009 |
--------------
[{'name': 'John Trevor Roche Baines', 'address': 'c/o Isle of Man Prison, Jurby, Isle of Man', 'dob': '19 Dec 1939', 'pod': '15 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 15 Jul 2010 To 15 Jul 2025', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Ralph Stephen Brunswick', 'address': 'Valdfrieden, Ballaugh Glen, Ballaugh, Isle of Man IM7 5JB', 'dob': None, 'pod': '13 Years 6 Months 0 Days', 'dod': 'Dates of Disqualification: From 4 Mar 2009 To 4 Sep 2022', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Fenella Jane Carter', 'address': '30 North Quay, Douglas, Isle of Man IM1 4LD', 'dob': '6 August 1984', 'pod': '6 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa05 July 2018 to 5 July 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Richard Alan Costain', 'address': "St Patrick's Close, Jurby, Isle of Man", 'dob': '13 Sep 1951', 'pod': '15 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 16 Feb 2017 To 16 Feb 2032', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Paul Deighton', 'address': 'Isle of Man Prison, St Patrick’s Close, Coast Road IM7 3JP', 'dob': '23 December 1966', 'pod': '12 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa08 August 2020 to 7 August 2032', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Jamie Alexander Irving', 'address': 'Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN', 'dob': 'Not known', 'pod': '7 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa026 Feb 2018 To\xa026 Feb 2025', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Jonathan Frank Edward Irving', 'address': 'Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN', 'dob': 'Not known', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of 
Disqualification: From\xa026 Feb 2018 To\xa026 Feb 2026', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Duncan Frank Ellis Jones', 'address': 'Harpers Glen, Hillberry Green, Douglas, Isle of Man IM2 6DE', 'dob': '10 Aug 1959', 'pod': '13 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 25 Apr 2011 To 25 Apr 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Lynn Keig', 'address': 'Croit-e-Quill, Lonan, Isle of Man, IM4 7JG', 'dob': '28 June 1956', 'pod': ' Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 Jun 2017 To 28 Jun 2023', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Richard Ian Kissack', 'address': '6 Falcon Cliff Court, Douglas, Isle of Man, IM2 4AQ (currently Mr Kissack is residing at HM IOM Prison, Jurby, Isle of Man)', 'dob': <strong>30 October 1968</strong>, 'pod': '5 Years 11 Months 13 Days', 'dod': 'Dates of Disqualification: From\xa031 December 2021 to 13 December 2027', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Alan Louis', 'address': 'Not known', 'dob': '23 November 1965', 'pod': '12 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2031', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Phillip Sean McCarthy', 'address': 'Cedar Lodge, Main Road, Crosby IM4 4BH', 'dob': '5 October 1979', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa028 November 2019 to 27 November 2027', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' John McCauley', 'address': 'Not known', 'dob': '30 March 1955', 'pod': '5 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Dirk Frederik Mudge', 'address': 
'92, Daan Bekker Street, Windhoek, Namibia', 'dob': '23 December 1976', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa017 November 2018 to\xa016 November 2026', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Lukas Nakos', 'address': 'Not known', 'dob': '6 March 1976', 'pod': '6 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2025', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Andrew Mark Rouse', 'address': '13 Reayrt Ny Chrink, Crosby, Isle of Man IM4 2EA', 'dob': '24 Jan 1977', 'pod': '5 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa028 Feb 2018 To\xa028 Feb 2023', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}]
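Since the question asks for a JSON file, here is a rough sketch of one way to tidy records like those above and serialise them. It assumes the records are already a list of dictionaries; the helper `clean` and the two trimmed-down sample records are illustrative only, not part of the answer's code.

```python
import json

# Two sample records shaped like the dictionary output above (fields trimmed).
records = [
    {"name": " Fenella Jane Carter", "dob": "6 August 1984",
     "dod": "Dates of Disqualification: From\xa05 July 2018 to 5 July 2024"},
    {"name": "Ralph Stephen Brunswick", "dob": None,
     "dod": "Dates of Disqualification: From 4 Mar 2009 To 4 Sep 2022"},
]

def clean(value):
    """Replace non-breaking spaces and trim whitespace; keep None as None."""
    if value is None:
        return None
    # str() also turns non-string values (e.g. a bs4 element) into text.
    return str(value).replace("\xa0", " ").strip()

cleaned = [{key: clean(val) for key, val in rec.items()} for rec in records]

# ensure_ascii=False keeps characters such as ’ readable in the output.
json_text = json.dumps(cleaned, indent=2, ensure_ascii=False)
print(json_text)
```

Passing the same list to `json.dump` together with a file handle opened with `encoding="utf-8"` would write it to disk instead of printing it.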

Combining Pandas DataFrames With Multiple Reference Columns

I'm trying to combine two pandas DataFrames to update the first one based on criteria from the second. Here is a sample of the two dataframes:
df1
year
2016 CALIFORNIA CLINTON, HILLARY
2016 CALIFORNIA TRUMP, DONALD J.
2016 CALIFORNIA JOHNSON, GARY
2016 CALIFORNIA STEIN, JILL
2016 CALIFORNIA WRITE-IN
2016 CALIFORNIA LA RIVA, GLORIA ESTELLA
2016 TEXAS TRUMP, DONALD J.
2016 TEXAS CLINTON, HILLARY
2016 TEXAS JOHNSON, GARY
2016 TEXAS STEIN, JILL
...
state candidate
year
1988 CALIFORNIA BUSH, GEORGE H.W.
1988 CALIFORNIA DUKAKIS, MICHAEL
1988 CALIFORNIA PAUL, RONALD ""RON""
1988 CALIFORNIA FULANI, LENORA
1988 TEXAS BUSH, GEORGE H.W.
1988 TEXAS DUKAKIS, MICHAEL
1988 TEXAS PAUL, RONALD ""RON""
1988 TEXAS FULANI, LENORA
df2
year
1988 CALIFORNIA 47
1988 TEXAS 29
...
2016 CALIFORNIA 55
2016 TEXAS 38
There are values for every election year from 1972 to 2020, covering all candidates and all states in a similar format. There are other columns in df1 but they aren't relevant to what I'm trying to do.
My expected result is:
year
2016 CALIFORNIA CLINTON, HILLARY 55
2016 CALIFORNIA TRUMP, DONALD J. 55
2016 CALIFORNIA JOHNSON, GARY 55
2016 CALIFORNIA STEIN, JILL 55
2016 CALIFORNIA WRITE-IN 55
2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
2016 TEXAS TRUMP, DONALD J. 38
2016 TEXAS CLINTON, HILLARY 38
2016 TEXAS JOHNSON, GARY 38
2016 TEXAS STEIN, JILL 38
...
state candidate
year
1988 CALIFORNIA BUSH, GEORGE H.W. 47
1988 CALIFORNIA DUKAKIS, MICHAEL 47
1988 CALIFORNIA PAUL, RONALD ""RON"" 47
1988 CALIFORNIA FULANI, LENORA 47
1988 TEXAS BUSH, GEORGE H.W. 29
1988 TEXAS DUKAKIS, MICHAEL 29
1988 TEXAS PAUL, RONALD ""RON"" 29
1988 TEXAS FULANI, LENORA 29
I want to match up the electoral_votes column in df2 with the year and state columns in df1 so it puts the correct value. I got some assistance and was able to match it up when there is only one column being matched (you can see the question and answer here) but I am having trouble matching it up with the two points of reference (year and state). If I use the code linked as is it returns the error:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have tried apply, map, applymap, merge, etc and haven't been able to figure it out. Thanks in advance for the help!
I believe what you are looking for is a left merge. Specify the common columns that the merge should be based on within on=[...].
# Imports
import pandas as pd
# Specify two columns in the "on".
pd.merge(df1,
         df2,
         how='left',
         on=['year','state'])
Out[1821]:
year state candidate votes
0 2016 CALIFORNIA CLINTON, HILLARY 55
1 2016 CALIFORNIA TRUMP, DONALD J. 55
2 2016 CALIFORNIA JOHNSON, GARY 55
3 2016 CALIFORNIA STEIN, JILL 55
4 2016 CALIFORNIA WRITE-IN 55
5 2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
6 2016 TEXAS TRUMP, DONALD J. 38
7 2016 TEXAS CLINTON, HILLARY 38
8 2016 TEXAS JOHNSON, GARY 38
9 2016 TEXAS STEIN, JILL 38
10 1988 CALIFORNIA BUSH, GEORGE H.W. 47
11 1988 CALIFORNIA DUKAKIS, MICHAEL 47
12 1988 CALIFORNIA PAUL, RONALD ""RON"" 47
13 1988 CALIFORNIA FULANI, LENORA 47
14 1988 TEXAS BUSH, GEORGE H.W. 29
15 1988 TEXAS DUKAKIS, MICHAEL 29
16 1988 TEXAS PAUL, RONALD ""RON"" 29
17 1988 TEXAS FULANI, LENORA 29
The above code could be written as:
pd.merge(df1,
         df2,
         how='left',
         left_on=['year','state'],
         right_on=['year','state'])
but since the key columns have the same names in both DataFrames, we can use on=['year', 'state'].
An alternative way to write this:
merged_df = df1.merge(df2, on=['year', 'state'], how='left')
If you want to use only 3 columns from df1:
df1 = pd.read_csv('<name_of_the_CSV_file>', usecols=['year', 'state', 'candidate'])
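A small self-contained sketch of the two-key left merge, assuming `year` and `state` are ordinary columns (if `year` is the index, as the printed frames suggest, `df1.reset_index()` would turn it into one). The sample rows are cut down from the question, and the `validate` argument is an optional extra that the answer does not use:

```python
import pandas as pd

df1 = pd.DataFrame({
    "year": [2016, 2016, 1988, 1988],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA", "TEXAS"],
    "candidate": ["CLINTON, HILLARY", "TRUMP, DONALD J.",
                  "BUSH, GEORGE H.W.", "DUKAKIS, MICHAEL"],
})
df2 = pd.DataFrame({
    "year": [1988, 1988, 2016, 2016],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA", "TEXAS"],
    "electoral_votes": [47, 29, 55, 38],
})

# how="left" keeps every row of df1, in df1's order.
# validate="many_to_one" raises a MergeError if a (year, state) pair
# appears more than once in df2, which would otherwise duplicate rows.
merged = df1.merge(df2, on=["year", "state"], how="left",
                   validate="many_to_one")
print(merged)
```

Each candidate row picks up the electoral votes of its own (year, state) pair, so the row count of `merged` matches that of `df1`.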

Extract Date, Append by Number of Games

I am currently web scraping the college football schedule by week.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
time = [c.text.replace("\n", "").replace(' ','').replace(' ',' ') for c in soup.find_all('div', class_='CellGame')]
import pandas as pd
schedule = pd.DataFrame(
{
'away': away,
'home': home,
'time': time,
})
schedule
I would like a date column. I am having difficulty extracting the dates, repeating each date once for every game played on that date, and appending the results to a Python list.
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        date.append(a.text.replace('\n \n ','').replace('\n \n ',''))
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
Dates are like headers for each table. I would like each date to be matched to the correct game, and to also include "Postponed" for the postponed games.
My plan is to automate this code for each week.
Thanks ahead.
*Post Answer
Beautiful and well done. How would I pull venues especially with postponed, using your code?
My original code was:
venue = [v.text.replace('\n','').replace(' ','').replace(' ','').strip('—').strip() for v in soup.find_all('td', text=lambda x: x and "Field" or x and 'Stadium' in x) if v != '' ]
venues = [x for x in venue if x]
missing = len(away) - len(venues)
words = ['Postponed' for x in range(missing) if len(away)> len(venues)]
venues = venues + words
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time date
0 Campbell Wake Forest WAKE 66 - CAMP 14 Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 10 - 2nd ESPU Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 17, ARKST 14 - 2nd ESP2 Saturday, October 3, 2020
4 Missouri Tennessee TENN 21, MIZZOU 6 - 2nd SECN Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 2nd ABC Saturday, October 3, 2020
6 TCU Texas TCU 14, TEXAS 14 - 2nd FOX Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 10 - 2nd ACCN Saturday, October 3, 2020
8 South Carolina Florida FLA 17, SC 14 - 2nd ESPN Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 7, TXSA 3 - 2nd Saturday, October 3, 2020
10 North Alabama Liberty NAL 0, LIB 0 - 1st ESP3 Saturday, October 3, 2020
11 Abil Christian Army 1:30 pm CBSSN Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Saturday, October 3, 2020
32 Rice Marshall Postponed Saturday, October 3, 2020
33 Troy South Alabama Postponed Saturday, October 3, 2020
And saves data.csv.
EDIT: To parse the "Venue" column, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time venue date
0 Campbell Wake Forest WAKE 66 - CAMP 14 - Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 - Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 13 - 3rd ESPU Center Parc Stadium Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 31, ARKST 14 - 3rd ESP2 Brooks Stadium Saturday, October 3, 2020
4 Missouri Tennessee TENN 28, MIZZOU 6 - 3rd SECN Neyland Stadium Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 3rd ABC Mountaineer Field at Milan Puskar Stadium Saturday, October 3, 2020
6 TCU Texas TCU 20, TEXAS 14 - 2nd FOX DKR-Texas Memorial Stadium Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 13 - 3rd ACCN Heinz Field Saturday, October 3, 2020
8 South Carolina Florida FLA 31, SC 14 - 3rd ESPN Florida Field at Ben Hill Griffin Stadium Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 14, TXSA 6 - 2nd Legion Field Saturday, October 3, 2020
10 North Alabama Liberty LIB 7, NAL 0 - 2nd ESP3 Williams Stadium Saturday, October 3, 2020
11 Abil Christian Army ARMY 7, ABIL 0 - 1st CBSSN Blaik Field at Michie Stadium Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Bryant-Denny Stadium Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Bill Snyder Family Stadium Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Alumni Stadium Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Nippert Stadium Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN David Booth Kansas Memorial Stadium Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Gerald J. Ford Stadium Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU FAU Stadium Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Bobby Bowden Field at Doak Campbell Stadium Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Brooks Field at Wallace Wade Stadium Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Kroger Field Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Johnny (Red) Floyd Stadium Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Falcon Stadium Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ JPS Field at James L. Malone Stadium Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Sanford Stadium Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Davis Wade Stadium at Scott Field Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Vanderbilt Stadium Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Jack Trice Stadium Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Apogee Stadium Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Spectrum Stadium Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Memorial Stadium Saturday, October 3, 2020
32 Rice Marshall Postponed - Saturday, October 3, 2020
33 Troy South Alabama Postponed - Saturday, October 3, 2020
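To see the .find_previous() idea in isolation, here is a minimal offline sketch. The HTML snippet and the `game` class are made up for illustration and are much simpler than the real page; only the bs4 calls themselves are the same as in the answer.

```python
from bs4 import BeautifulSoup

# Made-up markup: each date heading precedes the table of that day's games.
html = """
<div><h4>Friday, October 2, 2020</h4>
<table><tr class="game"><td>Campbell</td><td>Wake Forest</td></tr></table></div>
<div><h4>Saturday, October 3, 2020</h4>
<table><tr class="game"><td>Rice</td><td>Marshall</td></tr>
<tr class="game"><td>Troy</td><td>South Alabama</td></tr></table></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("tr.game"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    # find_previous walks backwards through the document to the nearest <h4>,
    # so every row gets the heading that precedes its table.
    date = tr.find_previous("h4").get_text(strip=True)
    rows.append((cells[0], cells[1], date))

for row in rows:
    print(row)
```

Both Saturday rows end up with the same heading, which gives the "repeat the date once per game" behaviour the question asks for without counting games per table.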

Pandas - read a text file

I have a text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
I'm trying to read it into a pandas data frame like this:
df = pd.read_table('assist1.txt',
                   sep=r'\s+',
                   skiprows=6,
                   header=0)
This code throws an exception - pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 31, saw 8.
I guess that's because of the space between the first and last name of the player (should be the value of the Player column).
Is there a way to achieve this?
Furthermore, it is a part of a larger text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Table
================================================================================================
Pos Team Pld Won Drn Lst For Ag Won Drn Lst For Ag Pts
--------------------------------------------------------------------------------------------------
1st C Man Utd 38 15 4 0 41 4 10 4 5 34 20 83
--------------------------------------------------------------------------------------------------
2nd Arsenal 38 15 2 2 38 9 11 3 5 28 14 83
3rd Leeds 38 15 4 0 33 8 9 4 6 36 37 80
4th Liverpool 38 13 4 2 25 7 9 2 8 26 24 72
5th Chelsea 38 16 1 2 44 18 4 5 10 24 33 66
6th Newcastle 38 11 5 3 40 23 7 3 9 25 33 62
7th Blackburn 38 11 3 5 36 24 5 5 9 23 30 56
8th Middlesbrough 38 9 7 3 31 19 5 6 8 20 29 55
9th Sunderland 38 8 5 6 31 30 8 2 9 22 25 55
10th West Ham 38 11 3 5 31 17 3 7 9 14 29 52
11th Tottenham 38 10 3 6 35 26 4 5 10 23 35 50
12th Leicester 38 7 5 7 23 20 6 4 9 26 28 48
13th Fulham 38 7 5 7 39 35 5 7 7 33 44 48
14th Ipswich 38 9 4 6 23 22 3 3 13 14 34 43
15th Charlton 38 5 5 9 18 26 5 4 10 16 30 39
16th Everton 38 8 4 7 30 28 1 5 13 11 36 36
17th Aston Villa 38 2 8 9 19 28 5 6 8 21 26 35
--------------------------------------------------------------------------------------------------
18th R Derby 38 6 4 9 25 28 3 3 13 14 39 34
19th R Southampton 38 5 7 7 34 34 1 4 14 12 35 29
20th R Bolton 38 6 3 10 25 31 1 4 14 15 40 28
================================================================================================
2001/2 Goals
================================================================================================
Pos Player Club Apps Gls
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 25
2nd Alan Shearer Newcastle 36 25
3rd Ruud van Nistelrooy Man Utd 26 23
4th Steve Marlet Fulham 38 20
5th Jimmy Floyd Hasselbaink Chelsea 30 (1) 20
6th Les Ferdinand Sunderland 27 (2) 17
7th Kevin Phillips Sunderland 36 17
8th Frédéric Kanouté West Ham 32 (3) 14
9th Marcus Bent Blackburn 28 (4) 13
10th Alen Boksic Middlesbrough 36 13
11th Eidur Gudjohnsen Chelsea 28 (3) 13
12th Luis Boa Morte Fulham 36 13
13th Michael Owen Liverpool 32 (1) 12
14th Dwight Yorke Man Utd 29 (1) 11
15th Henrik Pedersen Bolton 36 11
16th Juan Pablo Angel Aston Villa 34 (2) 11
17th Juan Sebastián Verón Man Utd 29 (2) 11
18th Shaun Bartlett Charlton 35 10
19th Matt Jansen Blackburn 28 (5) 10
20th Duncan Ferguson Everton 28 (5) 10
21st Ian Harte Leeds 37 10
22nd Bosko Balaban Aston Villa 36 10
23rd Robbie Fowler Liverpool 25 (3) 10
24th Georgi Kinkladze Derby 36 (1) 10
25th Hamilton Ricard Middlesbrough 28 (2) 10
26th Robert Pires Arsenal 24 (3) 9
27th Andrew Cole Man Utd 15 (5) 9
28th Rod Wallace Bolton 31 9
29th James Beattie Southampton 28 (1) 9
30th Robbie Keane Leeds 28 (8) 9
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
================================================================================================
2001/2 Average Rating
================================================================================================
Pos Player Club Apps Av R
-------------------------------------------------------------------------
1st Ruud van Nistelrooy Man Utd 26 8.54
2nd Thierry Henry Arsenal 34 8.09
3rd Alan Shearer Newcastle 36 7.97
4th Kieron Dyer Newcastle 33 7.94
5th Steve Marlet Fulham 38 7.89
6th Ian Harte Leeds 37 7.86
7th Andrew Cole Man Utd 15 (5) 7.85
8th Roy Keane Man Utd 19 7.84
9th Les Ferdinand Sunderland 27 (2) 7.83
10th Juan Sebastián Verón Man Utd 29 (2) 7.81
11th Eidur Gudjohnsen Chelsea 28 (3) 7.77
12th Jesper Grønkjær Chelsea 34 7.76
13th Michaël Silvestre Man Utd 32 7.72
14th Dean Gordon Middlesbrough 30 (1) 7.71
15th Michael Owen Liverpool 32 (1) 7.70
16th Patrick Vieira Arsenal 29 7.69
17th Robert Pires Arsenal 24 (3) 7.67
18th Ryan Giggs Man Utd 32 7.66
19th Dwight Yorke Man Utd 29 (1) 7.63
20th Mario Stanic Chelsea 29 (3) 7.63
21st Frédéric Kanouté West Ham 32 (3) 7.57
22nd Mark Viduka Leeds 21 7.57
23rd David Beckham Man Utd 29 7.55
24th Jimmy Floyd Hasselbaink Chelsea 30 (1) 7.55
25th Martin Taylor Blackburn 14 (8) 7.55
26th Titus Bramble Ipswich 33 7.55
27th Sol Campbell Arsenal 20 (1) 7.52
28th Mario Melchiot Chelsea 19 (2) 7.52
29th Stephane Henchoz Liverpool 29 7.52
30th Rio Ferdinand Leeds 36 (1) 7.51
================================================================================================
2001/2 Man of Match
================================================================================================
Pos Player Club Apps MoM
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 8
2nd Ruud van Nistelrooy Man Utd 26 8
3rd Kieron Dyer Newcastle 33 6
4th Les Ferdinand Sunderland 27 (2) 6
5th Steve Marlet Fulham 38 6
6th Eidur Gudjohnsen Chelsea 28 (3) 6
7th Ian Harte Leeds 37 5
8th Richie Wellens Leicester 20 (9) 5
9th Henrik Pedersen Bolton 36 5
10th Alan Shearer Newcastle 36 5
11th Michael Owen Liverpool 32 (1) 4
12th Dean Gordon Middlesbrough 30 (1) 4
13th Matt Jansen Blackburn 28 (5) 4
14th Marcus Bent Blackburn 28 (4) 4
15th Kevin Campbell Everton 27 (4) 4
16th Titus Bramble Ipswich 33 4
17th Roy Keane Man Utd 19 4
18th Frédéric Kanouté West Ham 32 (3) 4
19th Patrick Vieira Arsenal 29 4
20th Hermann Hreidarsson Ipswich 34 4
21st Dennis Bergkamp Arsenal 22 (9) 4
22nd Jimmy Floyd Hasselbaink Chelsea 30 (1) 4
23rd Claus Lundekvam Southampton 27 (2) 4
24th Robert Pires Arsenal 24 (3) 3
25th Shaun Bartlett Charlton 35 3
26th Kevin Phillips Sunderland 36 3
27th Lucas Radebe Leeds 31 (1) 3
28th Ragnvald Soma West Ham 27 (3) 3
29th Dean Richards Tottenham 34 3
30th Wayne Quinn Liverpool 25 (4) 3
Ideally I would like to run a function that creates a data frame out of each table above, but can't figure it out.
Thanks
Another way: you can specify the separator as two or more whitespace characters and pass skiprows as a list of row numbers. I tried this and it gave me your expected output. You can write a simple script to work out which lines to skip and which to keep.
df = pd.read_table('assist1.txt', sep=r'\s\s+', skiprows=[0,1,2,3,4,5,6,7,8,10], header=0, engine='python')
You're using whitespace as a delimiter, but this file is fixed-width, not whitespace-delimited. Look into fixed-width parsing, e.g. pandas.read_fwf: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html.
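The "simple script" mentioned in the first answer could look roughly like the sketch below. It assumes every table follows the layout shown in the question (a ruler of `=` characters, a title, another ruler, a header row, a dashed ruler, then the data rows) and that columns are separated by at least two spaces in the original file. The `split_tables` name and the `SAMPLE` text are made up for illustration; all values are kept as strings, so numeric columns would still need converting.

```python
import re
import pandas as pd

# Hypothetical excerpt in the same layout as the file in the question.
SAMPLE = """
==============================================================
2001/2 Average Rating
==============================================================
Pos   Player               Club        Apps     Av R
--------------------------------------------------------------
1st   Ruud van Nistelrooy  Man Utd     26       8.54
2nd   Thierry Henry        Arsenal     34       8.09
==============================================================
2001/2 Man of Match
==============================================================
Pos   Player               Club        Apps     MoM
--------------------------------------------------------------
1st   Thierry Henry        Arsenal     34       8
4th   Les Ferdinand        Sunderland  27 (2)   6
"""

def is_ruler(line):
    """True for a line made up only of '=' characters."""
    return set(line.strip()) == {"="}

def split_cells(line):
    """Split a row on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())

def split_tables(text):
    """Return {table title: DataFrame} for every table found in text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tables = {}
    i = 0
    while i < len(lines):
        # A table starts with: ruler, title, ruler, header, dashed ruler.
        if is_ruler(lines[i]) and i + 4 < len(lines) and is_ruler(lines[i + 2]):
            title = lines[i + 1].strip()
            header = split_cells(lines[i + 3])
            i += 5
            rows = []
            # Collect data rows until the next '=' ruler or the end of the text.
            while i < len(lines) and not is_ruler(lines[i]):
                rows.append(split_cells(lines[i]))
                i += 1
            tables[title] = pd.DataFrame(rows, columns=header)
        else:
            i += 1
    return tables

tables = split_tables(SAMPLE)
```

With the real file, `split_tables(open('assist1.txt').read())` would be the equivalent call. If the columns are not reliably separated by two or more spaces, `pd.read_fwf` with explicit `colspecs` on each block is the safer route.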

Reformatting scraped selenium table

I'm scraping a table that displays info for a sporting league. So far so good for a selenium beginner:
from selenium import webdriver
import re
import pandas as pd
driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []
for i in infotable:
    ilist.append(i.text)
infolist = ilist[0]
for i in matches:
    match.append(i.text)
driver.close()
home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])
df = pd.DataFrame({'home' : home, 'away' : away})
date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)
In the last line, date scrapes all the dates in the table but I can't link them to the corresponding game.
My thinking is: for child/element "under the date", date = last_found_date.
The ultimate goal is to have two more columns in df: one with the date of the match and another with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).
Should I be incorporating another program/method to retain order of tags/elements of the table?
You would need to change the way you extract the match information. Instead of separately extracting home and away teams, do it in one loop also extracting the dates and events:
from selenium import webdriver
import pandas as pd
driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
data = []
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    # The nearest preceding header cell holds the date and, optionally, the event
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
    if " - " in date:
        date, event = date.split(" - ")
    else:
        event = "Not specified"
    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })
driver.close()
df = pd.DataFrame(data)
print(df)
Prints:
away date event home
0 Washington Capitals 25 Apr 2015 Play Offs New York Islanders
1 Minnesota Wild 25 Apr 2015 Play Offs St.Louis Blues
2 Ottawa Senators 25 Apr 2015 Play Offs Montreal Canadiens
3 Pittsburgh Penguins 25 Apr 2015 Play Offs New York Rangers
4 Calgary Flames 24 Apr 2015 Play Offs Vancouver Canucks
5 Chicago Blackhawks 24 Apr 2015 Play Offs Nashville Predators
6 Tampa Bay Lightning 24 Apr 2015 Play Offs Detroit Red Wings
7 New York Islanders 24 Apr 2015 Play Offs Washington Capitals
8 St.Louis Blues 23 Apr 2015 Play Offs Minnesota Wild
9 Anaheim Ducks 23 Apr 2015 Play Offs Winnipeg Jets
10 Montreal Canadiens 23 Apr 2015 Play Offs Ottawa Senators
11 New York Rangers 23 Apr 2015 Play Offs Pittsburgh Penguins
12 Vancouver Canucks 22 Apr 2015 Play Offs Calgary Flames
13 Nashville Predators 22 Apr 2015 Play Offs Chicago Blackhawks
14 Washington Capitals 22 Apr 2015 Play Offs New York Islanders
15 Tampa Bay Lightning 22 Apr 2015 Play Offs Detroit Red Wings
16 Anaheim Ducks 21 Apr 2015 Play Offs Winnipeg Jets
17 St.Louis Blues 21 Apr 2015 Play Offs Minnesota Wild
18 New York Rangers 21 Apr 2015 Play Offs Pittsburgh Penguins
19 Vancouver Canucks 20 Apr 2015 Play Offs Calgary Flames
20 Montreal Canadiens 20 Apr 2015 Play Offs Ottawa Senators
21 Nashville Predators 19 Apr 2015 Play Offs Chicago Blackhawks
22 Washington Capitals 19 Apr 2015 Play Offs New York Islanders
23 Winnipeg Jets 19 Apr 2015 Play Offs Anaheim Ducks
24 Pittsburgh Penguins 19 Apr 2015 Play Offs New York Rangers
25 Minnesota Wild 18 Apr 2015 Play Offs St.Louis Blues
26 Detroit Red Wings 18 Apr 2015 Play Offs Tampa Bay Lightning
27 Calgary Flames 18 Apr 2015 Play Offs Vancouver Canucks
28 Chicago Blackhawks 18 Apr 2015 Play Offs Nashville Predators
29 Ottawa Senators 18 Apr 2015 Play Offs Montreal Canadiens
30 New York Islanders 18 Apr 2015 Play Offs Washington Capitals
31 Winnipeg Jets 17 Apr 2015 Play Offs Anaheim Ducks
32 Minnesota Wild 17 Apr 2015 Play Offs St.Louis Blues
33 Detroit Red Wings 17 Apr 2015 Play Offs Tampa Bay Lightning
34 Pittsburgh Penguins 17 Apr 2015 Play Offs New York Rangers
35 Calgary Flames 16 Apr 2015 Play Offs Vancouver Canucks
36 Chicago Blackhawks 16 Apr 2015 Play Offs Nashville Predators
37 Ottawa Senators 16 Apr 2015 Play Offs Montreal Canadiens
38 New York Islanders 16 Apr 2015 Play Offs Washington Capitals
39 Edmonton Oilers 12 Apr 2015 Not specified Vancouver Canucks
40 Anaheim Ducks 12 Apr 2015 Not specified Arizona Coyotes
41 Chicago Blackhawks 12 Apr 2015 Not specified Colorado Avalanche
42 Nashville Predators 12 Apr 2015 Not specified Dallas Stars
43 Boston Bruins 12 Apr 2015 Not specified Tampa Bay Lightning
44 Pittsburgh Penguins 12 Apr 2015 Not specified Buffalo Sabres
45 Detroit Red Wings 12 Apr 2015 Not specified Carolina Hurricanes
46 New Jersey Devils 12 Apr 2015 Not specified Florida Panthers
47 Columbus Blue Jackets 12 Apr 2015 Not specified New York Islanders
48 Montreal Canadiens 12 Apr 2015 Not specified Toronto Maple Leafs
49 Calgary Flames 11 Apr 2015 Not specified Winnipeg Jets
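The "date = last_found_date" idea from the question can also be sketched without a browser, which may help if the table text has already been scraped row by row. This is only an illustration: the `attach_dates` function and the `rows` list are hypothetical, and it assumes header rows start with a date such as "25 Apr 2015" (optionally followed by " - " and an event), while match rows look like "Home - Away".

```python
import re
import pandas as pd

# Matches dates such as "25 Apr 2015" at the start of a header row.
DATE_RE = re.compile(r"\d{2}\s\w{3}\s\d{4}")

def attach_dates(rows):
    """Pair each match row with the most recent date header above it."""
    records = []
    last_date, last_event = None, "Not specified"
    for text in rows:
        found = DATE_RE.match(text)
        if found:
            # Header row: remember its date and optional event for the rows below.
            last_date = found.group(0)
            rest = text[found.end():].strip(" -")
            last_event = rest or "Not specified"
        elif " - " in text:
            # Match row: split into teams and reuse the last header seen.
            home, away = text.split(" - ", 1)
            records.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": last_date,
                "event": last_event,
            })
    return pd.DataFrame(records)

# Hypothetical row texts in document order.
rows = [
    "25 Apr 2015 - Play Offs",
    "New York Islanders - Washington Capitals",
    "St.Louis Blues - Minnesota Wild",
    "12 Apr 2015",
    "Vancouver Canucks - Edmonton Oilers",
]
df = attach_dates(rows)
print(df)
```

The XPath in the answer does the same job inside the browser: for each match row it looks back to the closest preceding header cell instead of tracking the last one seen.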
