I am trying to scrape historical NFL game results (scores only) from the ESPN website into a CSV file using Python. I'm unable to find a way to add the dates shown in the desired output. Could someone help me get from the current output to the desired output? The website I am scraping and the desired output are below:
NFL Website:
https://www.espn.com/nfl/scoreboard/_/week/17/year/2022/seasontype/2
Current Output:
Week #, Away Team, Away Score, Home Team, Home Score
Week 17, Cowboys, 27, Titans, 13
Week 17, Cardinals, 19, Falcons, 20
Week 17, Bears, 10, Lions, 41
Desired Game Results Output:
Week #, Date, Away Team, Away Score, Home Team, Home Score
Week 17, 12/29/2022, Cowboys, 27, Titans, 13
Week 17, 1/1/2023, Cardinals, 19, Falcons, 20
Week 17, 1/1/2023, Bears, 10, Lions, 41
Code:
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Build the scoreboard URLs for weeks 1-18
daterange = 1
url_list = []
while daterange < 19:
    url = "https://www.espn.com/nfl/scoreboard/_/week/" + str(daterange) + "/year/2022/seasontype/2"
    url_list.append(url)
    daterange = daterange + 1

j = 1
away_team = []
home_team = []
away_team_score = []
home_team_score = []
week = []
for url in url_list:
    urlname = requests.get(url)
    bs = bs4.BeautifulSoup(urlname.text, 'lxml')
    print(url)
    i = 0
    while True:
        try:
            name = bs.findAll('div', {'class': 'ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db'})[i]
        except Exception:
            break
        name = name.get_text()
        try:
            score = bs.findAll('div', {'class': 'ScoreCell__Score h4 clr-gray-01 fw-heavy tar ScoreCell_Score--scoreboard pl2'})[i]
        except Exception:
            break
        score = score.get_text()
        if i % 2 == 0:
            away_team.append(name)
            away_team_score.append(score)
        else:
            home_team.append(name)
            home_team_score.append(score)
            week.append("week " + str(j))
        i = i + 1
    j = j + 1

web_scraping = list(zip(week, home_team, home_team_score, away_team, away_team_score))
web_scraping_df = pd.DataFrame(web_scraping, columns=['week', 'home_team', 'home_team_score', 'away_team', 'away_team_score'])
web_scraping_df
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup

week = 17
url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for board in soup.select('.ScoreboardScoreCell'):
    title = board.find_previous(class_='Card__Header__Title').text
    teams = [t.text for t in board.select('.ScoreCell__TeamName')]
    scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
    all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))

df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
print(df.to_markdown(index=False))
Prints:
| Week | Date                        | Team 1    | Score 1 | Team 2     | Score 2 |
|-----:|:----------------------------|:----------|:--------|:-----------|:--------|
|   17 | Thursday, December 29, 2022 | Cowboys   | 27      | Titans     | 13      |
|   17 | Sunday, January 1, 2023     | Cardinals | 19      | Falcons    | 20      |
|   17 | Sunday, January 1, 2023     | Bears     | 10      | Lions      | 41      |
|   17 | Sunday, January 1, 2023     | Broncos   | 24      | Chiefs     | 27      |
|   17 | Sunday, January 1, 2023     | Dolphins  | 21      | Patriots   | 23      |
|   17 | Sunday, January 1, 2023     | Colts     | 10      | Giants     | 38      |
|   17 | Sunday, January 1, 2023     | Saints    | 20      | Eagles     | 10      |
|   17 | Sunday, January 1, 2023     | Panthers  | 24      | Buccaneers | 30      |
|   17 | Sunday, January 1, 2023     | Browns    | 24      | Commanders | 10      |
|   17 | Sunday, January 1, 2023     | Jaguars   | 31      | Texans     | 3       |
|   17 | Sunday, January 1, 2023     | 49ers     | 37      | Raiders    | 34      |
|   17 | Sunday, January 1, 2023     | Jets      | 6       | Seahawks   | 23      |
|   17 | Sunday, January 1, 2023     | Vikings   | 17      | Packers    | 41      |
|   17 | Sunday, January 1, 2023     | Rams      | 10      | Chargers   | 31      |
|   17 | Sunday, January 1, 2023     | Steelers  | 16      | Ravens     | 13      |
|   17 | Monday, January 2, 2023     | Bills     | -       | Bengals    | -       |
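The Date column above keeps ESPN's long card titles (e.g. "Sunday, January 1, 2023"). To get the M/D/YYYY dates shown in the desired output, the title can be parsed with the standard datetime module; a minimal sketch (`to_short_date` is a hypothetical helper name):

```python
from datetime import datetime

def to_short_date(title):
    # ESPN card titles look like "Sunday, January 1, 2023"
    dt = datetime.strptime(title, '%A, %B %d, %Y')
    # Build M/D/YYYY manually to avoid platform-specific strftime flags
    return f'{dt.month}/{dt.day}/{dt.year}'

print(to_short_date('Thursday, December 29, 2022'))  # 12/29/2022
```

Applying this to `title` before appending to `all_data` yields rows like `Week 17, 12/29/2022, Cowboys, 27, Titans, 13`.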
I'm a newbie seeking help.
I've tried the following without success.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []

# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load the data you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
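Note that the JSON serves the numeric columns as comma-formatted strings (e.g. "3,250"), so you may want to convert them before doing any arithmetic. A small sketch on a hand-copied sample of the rounds data (not a live request):

```python
import pandas as pd

# Two rows copied from data["rounds"] for illustration
rounds = [
    {"drawNumber": "231", "drawDate": "2022-09-14", "drawSize": "3,250", "drawCRS": "510"},
    {"drawNumber": "230", "drawDate": "2022-08-31", "drawSize": "2,750", "drawCRS": "516"},
]
df = pd.DataFrame(rounds)

# Strip the thousands separators, then cast to numeric dtypes
for col in ("drawNumber", "drawSize", "drawCRS"):
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))
df["drawDate"] = pd.to_datetime(df["drawDate"])

print(df["drawSize"].sum())  # 6000
```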
I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website. However, I am unsure how to transform the Selenium text into a DataFrame and export it to CSV. Below is my code.
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')

while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except TimeoutException:
        break

driver.quit()
If you're using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to turn it into a DataFrame.
Then append each page's table to an initially empty DataFrame and export to CSV.
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")

dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        dfbase = pd.concat([dfbase, df], ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except TimeoutException:
        # No clickable "next" button left: we are on the last page
        break

print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
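To see the outerHTML → read_html step in isolation (no browser required), here is the same parsing applied to a static HTML snippet shaped like the BCSC table; the two rows are sample values copied from the output above, not live data:

```python
from io import StringIO

import pandas as pd

# Stand-in for the outerHTML string Selenium returns
table_html = """
<table id="inlineSearchTable">
  <thead><tr><th>Name</th><th>Date Added to the List</th></tr></thead>
  <tbody>
    <tr><td>24option.com</td><td>Aug 6, 2013</td></tr>
    <tr><td>3storich</td><td>Aug 20, 2020</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(table_html))[0]
print(df)
```

Note that pd.read_html needs an HTML parser backend (lxml or html5lib) installed.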
I am working with this PDF document of about 80 pages. It lists all 1,984 US senators from US history in chronological order. I have extracted the text of the document using PyPDF2. The text is now assigned to a variable as a single, long string. Here is a segment:
Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)
Notice that the name, party affiliation, state, end of service date, and rank of each senator normally appear in a patterned segment. Here are some examples:
Rodney, Daniel (Adams-DE) January 12, 1827 282
Bateman, Ephraim (Adams-NJ) January 12, 1829 283
Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293
But there are also some exceptions, such as these:
Smith, William (R-SC) March 3, 1831 (First served 1816-1823)
Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827)
In these cases the rank is given when the senator is first listed.
My question is, how can I extract the basic information on each senator (name, party, state, end of service, rank)? I believe I need to loop through the string, finding all instances of a regular expression that captures the patterns, and assign each instance to a list within a list. The end result would be a list of lists that I could transform into a dataframe in pandas.
You can try the following approach:
import re
import pandas as pd

d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
Workflow:
Split the input on a comma surrounded by two names:
Use the regex [a-zA-Z]+\,\s+[a-zA-Z]+
Surround the regex with parentheses because the split keys (i.e. the names) need to be kept
Apply the regex using re.split
Remove the first element, which is an empty string
At this point we have all the lines, but each is split into two elements, so we need to join consecutive pairs. The topic Create a 2D list out of 1D list covers this step.
Now the content can be extracted from each row. Here, we use re.findall with the regex (.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$. There are 4 groups:
Group 1 selects everything up to an opening parenthesis: (.*?)\s+\(
Group 2 selects everything up to the closing parenthesis: (.*?)\)
Group 3 selects everything up to a year (i.e. 4 digits): (.*?\d{4})
Group 4 selects everything up to the end: (.*?)$
For a better understanding of the regexes, I advise you to use an online tester such as regex101.com to visualize the results.
Create the dataframe
Next, apply more specific cleaning and splitting to the dataset, such as removing the trailing comma from names with:
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
Code + illustration
# import module
import pandas as pd
import re
d = "Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)"
# Step 1
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
print(d)
# ['Silsbee, Nathaniel', ' (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel', ' (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim', ' (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John', ' (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William', ' (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry', ' M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac', ' D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan', ' (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel', ' A. (Adams/AJ-CT) March 3, 1833 287 ',
# 'McLane, Louis', ' (J-DE) April 16, 1829 288 ',
# 'Parris, Albion', ' K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John', ' (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel', ' (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver', ' H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob', ' (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James', ' (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles', ' E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John', ' (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon', ' (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 2
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
print(d)
# ['Silsbee, Nathaniel, (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel, (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim, (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John, (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William, (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry, M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac, D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan, (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel, A. (Adams/AJ-CT) March 3, 1833 287 ', 'McLane, Louis, (J-DE) April 16, 1829 288 ',
# 'Parris, Albion, K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John, (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel, (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver, H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob, (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James, (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles, E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John, (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon, (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 3
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
[print(_) for _ in d]
# ('Silsbee, Nathaniel,', 'Adams/AJ-MA', 'March 3, 1835', ' 281 November 8 ')
# ('Rodney, Daniel,', 'Adams-DE', 'January 12, 1827', ' 282 November 9 ')
# ('Bateman, Ephraim,', 'Adams-NJ', 'January 12, 1829', ' 283 November 27 ')
# ('McKinley, John,', 'J-AL', 'March 3, 1831', ' 284 (Served again 1837) November 29 ')
# ('Smith, William,', 'R-SC', 'March 3, 1831', ' (First served 1816-1823) * * * 1827 * * * January 12 ')
# ('Ridgely, Henry, M.', 'J-DE', 'March 3, 1829', ' 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ')
# ('Barnard, Isaac, D.', 'J-PA', 'December 6, 1831', ' 286 ')
# ('Ellis, Powhatan,', 'J-MS', 'July 16, 1832', ' (First served 1825-1826) ')
# ('Foot, Samuel, A.', 'Adams/AJ-CT', 'March 3, 1833', ' 287 ')
# ('McLane, Louis,', 'J-DE', 'April 16, 1829', ' 288 ')
# ('Parris, Albion, K.', 'J-ME', 'August 26, 1828', ' 289 ')
# ('Tyler, John,', 'J/AJ-VA', 'February 29, 1836', ' 290 December 17 ')
# ('Webster, Daniel,', 'Adams/AJ/W-MA', 'February 22, 1841', ' 291 (Served again 1845) * * * 1828 * * * November 7 ')
# ('Prince, Oliver, H.', 'J-GA', 'March 3, 1829', ' 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ')
# ('Burnet, Jacob,', 'Adams/AJ-OH', 'March 3, 1831', ' 293 December 15 ')
# ('Iredell, James,', 'J-NC', 'March 3, 1831', ' 294 * * * 1829 * * * January 15 ')
# ('Dudley, Charles, E.', 'J-NY', 'March 3, 1833', ' 295 ')
# ('Holmes, John,', 'Adams/AJ-ME', 'March 3, 1833', ' (First served 1820-1827) January 30 ')
# ('Dickerson, Mahlon,', 'R/CR/J-NJ', 'March 3, 1833', ' (First served 1817-1829)')
# Step 4
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
print(df)
# name party/state date_to_convert rank_to_clean
# 0 Silsbee, Nathaniel Adams/AJ-MA March 3, 1835 281 November 8
# 1 Rodney, Daniel Adams-DE January 12, 1827 282 November 9
# 2 Bateman, Ephraim Adams-NJ January 12, 1829 283 November 27
# 3 McKinley, John J-AL March 3, 1831 284 (Served again 1837) November 29
# 4 Smith, William R-SC March 3, 1831 (First served 1816-1823) * * * 1827 * * * ...
# 5 Ridgely, Henry, M. J-DE March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO M...
# 6 Barnard, Isaac, D. J-PA December 6, 1831 286
# 7 Ellis, Powhatan J-MS July 16, 1832 (First served 1825-1826)
# 8 Foot, Samuel, A. Adams/AJ-CT March 3, 1833 287
# 9 McLane, Louis J-DE April 16, 1829 288
# 10 Parris, Albion, K. J-ME August 26, 1828 289
# 11 Tyler, John J/AJ-VA February 29, 1836 290 December 17
# 12 Webster, Daniel Adams/AJ/W-MA February 22, 1841 291 (Served again 1845) * * * 1828 * * * ...
# 13 Prince, Oliver, H. J-GA March 3, 1829 292 Start of Initial Senate Service Name/...
# 14 Burnet, Jacob Adams/AJ-OH March 3, 1831 293 December 15
# 15 Iredell, James J-NC March 3, 1831 294 * * * 1829 * * * January 15
# 16 Dudley, Charles, E. J-NY March 3, 1833 295
# 17 Holmes, John Adams/AJ-ME March 3, 1833 (First served 1820-1827) January 30
# 18 Dickerson, Mahlon R/CR/J-NJ March 3, 1833 (First served 1817-1829)
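From here, the remaining columns can be cleaned in the same spirit; a sketch using two rows of the intermediate result (the column names rank, party, state, and end_of_service are my own choices, not from the original post):

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("Smith, William,", "R-SC", "March 3, 1831", " (First served 1816-1823) "),
        ("Burnet, Jacob,", "Adams/AJ-OH", "March 3, 1831", " 293 December 15 "),
    ],
    columns=["name", "party/state", "date_to_convert", "rank_to_clean"],
)

# Drop the trailing comma left on names
df["name"] = df["name"].str.replace(r"\,$", "", regex=True)
# Rank is the leading integer, absent for senators listed with "(First served ...)"
df["rank"] = df["rank_to_clean"].str.extract(r"^\s*(\d+)", expand=False)
# State is the two-letter code after the last hyphen; everything before it is the party
df[["party", "state"]] = df["party/state"].str.extract(r"^(.*)-([A-Z]{2})$")
# Parse "March 3, 1831" into a real date
df["end_of_service"] = pd.to_datetime(df["date_to_convert"], format="%B %d, %Y")

print(df[["name", "party", "state", "end_of_service", "rank"]])
```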