Getting an error while scraping dates - Python

I am scraping a list of US presidents using BeautifulSoup and requests. I want to scrape both dates for each presidency (the start date and the end date), but for some reason I'm getting a list index out of range error. I'll provide the link so you can understand better.
Website link: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html, 'html.parser')
containers = page_soup.find_all('table', class_='wikitable')
#print(containers[0])
#print(len(containers))
#print(containers[0].prettify())
container = containers[0]
date = container.find_all('span', attrs={'class': 'date'})
#print(len(date))
#print(date[0].text)
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)

The find_all function can return an empty list, and indexing an empty list with date_container[0] is exactly what raises the IndexError. You can simply collect whatever dates each table has instead:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
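If you only want the first date from each table, a minimal sketch with an explicit emptiness check (using the same containers list as above) avoids the IndexError:
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    if date_container:  # some wikitables contain no date spans at all
        print(date_container[0].text)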

Since your last lines of code already store all of the date spans from the first "wikitable" table, you can use a list comprehension:
date = [x.text for x in container.find_all('span', attrs={'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
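If you need real date objects rather than strings, the standard library can parse them. A small sketch, assuming the date list built above and that every entry follows the "Month D, YYYY" form:
from datetime import datetime
parsed = [datetime.strptime(x, '%B %d, %Y').date() for x in date]
print(parsed[0])  # 1789-04-30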

Since the page uses <table> tags, have you considered pandas' .read_html()? It uses BeautifulSoup under the hood, takes a lot of the work out, and puts the data straight into a dataframe for you. The only work then needed is any manipulation, cleanup, or filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output of print(df.to_string()):
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
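If you want Start and End as real datetimes, the footnote markers (e.g. [d]) and trailing notes such as (Died in office) have to be stripped first. A minimal sketch, reusing the df built above:
# Remove bracketed footnotes and parenthesised notes, then parse;
# 'Incumbent' has no date and becomes NaT via errors='coerce'.
for col in ['Start', 'End']:
    cleaned = df[col].str.replace(r'\[.*?\]|\(.*?\)', '', regex=True).str.strip()
    df[col] = pd.to_datetime(cleaned, errors='coerce')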

Related

Web Scraping ESPN NFL webpage with Python

I am trying to scrape the ESPN website with Python to extract historical NFL game results (scores only) into a CSV file. I can't find a way to add the dates as displayed in the desired output. Could someone help me get from the current output to the desired output? The website I am scraping and the desired output are below:
NFL Website:
https://www.espn.com/nfl/scoreboard/_/week/17/year/2022/seasontype/2
Current Output:
Week #, Away Team, Away Score, Home Team, Home Score
Week 17, Cowboys, 27, Titans, 13
Week 17, Cardinals, 19, Falcons, 20
Week 17, Bears, 10, Lions, 41
Desired Game Results Output:
Week #, Date, Away Team, Away Score, Home Team, Home Score
Week 17, 12/29/2022, Cowboys, 27, Titans, 13
Week 17, 1/1/2023, Cardinals, 19, Falcons, 20
Week 17, 1/1/2023, Bears, 10, Lions, 41
Code:
import bs4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
daterange = 1
url_list = []
while daterange < 19:
    url = "https://www.espn.com/nfl/scoreboard/_/week/" + str(daterange) + "/year/2022/seasontype/2"
    url_list.append(url)
    daterange = daterange + 1
j = 1
away_team = []
home_team = []
away_team_score = []
home_team_score = []
week = []
for url in url_list:
    response = urlopen(url)
    urlname = requests.get(url)
    bs = bs4.BeautifulSoup(urlname.text, 'lxml')
    print(response.url)
    i = 0
    while True:
        try:
            name = bs.findAll('div', {'class': 'ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db'})[i]
        except Exception:
            break
        name = name.get_text()
        try:
            score = bs.findAll('div', {'class': 'ScoreCell__Score h4 clr-gray-01 fw-heavy tar ScoreCell_Score--scoreboard pl2'})[i]
        except Exception:
            break
        score = score.get_text()
        if i % 2 == 0:
            away_team.append(name)
            away_team_score.append(score)
        else:
            home_team.append(name)
            home_team_score.append(score)
        week.append("week " + str(j))
        i = i + 1
    j = j + 1
web_scraping = list(zip(week, home_team, home_team_score, away_team, away_team_score))
web_scraping_df = pd.DataFrame(web_scraping, columns=['week', 'home_team', 'home_team_score', 'away_team', 'away_team_score'])
web_scraping_df
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
week = 17
url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for board in soup.select('.ScoreboardScoreCell'):
    title = board.find_previous(class_='Card__Header__Title').text
    teams = [t.text for t in board.select('.ScoreCell__TeamName')]
    scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
    all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))
df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
print(df.to_markdown(index=False))
Prints:
| Week | Date | Team 1 | Score 1 | Team 2 | Score 2 |
|------|------|--------|---------|--------|---------|
| 17 | Thursday, December 29, 2022 | Cowboys | 27 | Titans | 13 |
| 17 | Sunday, January 1, 2023 | Cardinals | 19 | Falcons | 20 |
| 17 | Sunday, January 1, 2023 | Bears | 10 | Lions | 41 |
| 17 | Sunday, January 1, 2023 | Broncos | 24 | Chiefs | 27 |
| 17 | Sunday, January 1, 2023 | Dolphins | 21 | Patriots | 23 |
| 17 | Sunday, January 1, 2023 | Colts | 10 | Giants | 38 |
| 17 | Sunday, January 1, 2023 | Saints | 20 | Eagles | 10 |
| 17 | Sunday, January 1, 2023 | Panthers | 24 | Buccaneers | 30 |
| 17 | Sunday, January 1, 2023 | Browns | 24 | Commanders | 10 |
| 17 | Sunday, January 1, 2023 | Jaguars | 31 | Texans | 3 |
| 17 | Sunday, January 1, 2023 | 49ers | 37 | Raiders | 34 |
| 17 | Sunday, January 1, 2023 | Jets | 6 | Seahawks | 23 |
| 17 | Sunday, January 1, 2023 | Vikings | 17 | Packers | 41 |
| 17 | Sunday, January 1, 2023 | Rams | 10 | Chargers | 31 |
| 17 | Sunday, January 1, 2023 | Steelers | 16 | Ravens | 13 |
| 17 | Monday, January 2, 2023 | Bills | - | Bengals | - |
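Since the original goal was the whole 2022 season, the same logic can be looped over all 18 weeks. A sketch, assuming every week's page has the same structure (the CSV filename is just an example):
import requests
import pandas as pd
from bs4 import BeautifulSoup
all_data = []
for week in range(1, 19):  # 18 regular-season weeks in 2022
    url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for board in soup.select('.ScoreboardScoreCell'):
        title = board.find_previous(class_='Card__Header__Title').text
        teams = [t.text for t in board.select('.ScoreCell__TeamName')]
        scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
        all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))
df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
df.to_csv('nfl_2022_results.csv', index=False)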

Python: Getting a table in CSV from a website without a table class

I'm a newbie seeking help.
I've tried the following without success.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load the data you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
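Once the JSON is in a DataFrame, any further selection is ordinary pandas. For instance, a short sketch (column names taken from the JSON above; the output filename is just an example) that keeps only the Provincial Nominee Program rounds and saves them:
pnp = df[df["drawName"] == "Provincial Nominee Program"]
pnp.to_csv("pnp_rounds.csv", index=False)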

Delete first of two headers - Python

I have a dataframe that looks like this:
SP_removed
Date Removed
Date Ticker Security
0 January 7, 2021 TIF Tiffany & Co.
1 December 21, 2020 AIV Apartment Investment & Management
2 October 12, 2020 NBL Noble Energy
3 October 9, 2020 NaN NaN
4 October 7, 2020 ETFC E*TRADE Financial
... ... ... ...
258 December 5, 2000 OI Owens-Illinois
259 December 5, 2000 GRA W.R. Grace
260 December 5, 2000 CCK Crown Holdings
261 July 27, 2000 RAD RiteAid
262 December 7, 1999 LDW Laidlaw
263 rows × 3 columns
I want to delete the first header row (Date and Removed), but everything I've tried so far hasn't worked.
Thanks!
Try droplevel() to get rid of the first level of the column names:
df.columns = df.columns.droplevel()
Example
import pandas as pd
import numpy as np
header = pd.MultiIndex.from_product([['location1', 'location2'],
                                     ['S1', 'S2', 'S3']],
                                    names=['loc', 'S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=header)
df.columns = df.columns.droplevel()
df
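Alternatively, pandas exposes the same operation on the frame itself, so an equivalent sketch that drops level 0 of the column MultiIndex without reassigning df.columns would be:
df = df.droplevel(0, axis=1)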

Python Selenium Text Convert into Data Frame

I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website, but I am unsure how to transform the Selenium text into a DataFrame and export it as a CSV. Below is my code.
import requests
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you are using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the dataframe.
Then append each page's dataframe to an empty dataframe and export to CSV.
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        dfbase = dfbase.append(df, ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
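One caveat: DataFrame.append was deprecated and then removed in pandas 2.0. On a recent pandas, the usual pattern is to collect the per-page frames in a list and concatenate once at the end; a sketch of the replacement for the dfbase lines above:
frames = []
# inside the while loop, instead of dfbase = dfbase.append(df, ignore_index=True):
#     frames.append(df)
# after the loop:
dfbase = pd.concat(frames, ignore_index=True)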

Extract instances of a patterned text sequence from a very long string using Python

I am working with this PDF document of about 80 pages. It lists all 1,984 US senators from US history in chronological order. I have extracted the text of the document using PyPDF2. The text is now assigned to a variable as a single, long string. Here is a segment:
Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)
Notice that the name, party affiliation, state, end of service date, and rank of each senator normally appear in a patterned segment. Here are some examples:
Rodney, Daniel (Adams-DE) January 12, 1827 282
Bateman, Ephraim (Adams-NJ) January 12, 1829 283
Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293
But there are also some exceptions, such as these:
Smith, William (R-SC) March 3, 1831 (First served 1816-1823)
Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827)
In these cases the rank is given when the senator is first listed.
My question is, how can I extract the basic information on each senator (name, party, state, end of service, rank)? I believe I need to loop through the string, finding all instances of a regular expression that captures the patterns, and assign each instance to a list within a list. The end result would be a list of lists that I could transform into a dataframe in pandas.
You can try the following approach:
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$','')
Workflow:
Split the input on a comma surrounded by two names:
Use the regex [a-zA-Z]+\,\s+[a-zA-Z]+
Surround the regex with parentheses because the split key (i.e. the names) needs to be kept
Apply regex using re.split
Remove the first element, which is an empty string
At this point we have all the lines, but each one is split into two elements, so consecutive pairs need to be joined back together. The topic Create a 2D list out of 1D list covers this step.
Now the content can be extracted from each row. Here, we use re.findall with the regex (.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$. There are 4 groups:
Group 1 selects everything till a parenthesis: (.*?)\s+\(
Group 2 selects everything till the closing parenthesis: (.*?)\)
Group 3 selects everything till a year (e.g. 4 numbers): (.*?\d{4})
Group 4 selects everything till the end: (.*?)$
For a better understanding of the regex, I advise you to use an online regex tester such as regex101.com to visualize the results.
Create the dataframe
Next steps: apply more specific cleaning and separation to the dataset, such as removing the trailing comma from the name with:
df["name"] = df["name"].str.replace(r',$', '', regex=True)
Code + illustration
# import module
import pandas as pd
import re
d = "Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)"
# Step 1
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
print(d)
# ['Silsbee, Nathaniel', ' (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel', ' (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim', ' (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John', ' (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William', ' (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry', ' M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac', ' D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan', ' (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel', ' A. (Adams/AJ-CT) March 3, 1833 287 ',
# 'McLane, Louis', ' (J-DE) April 16, 1829 288 ',
# 'Parris, Albion', ' K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John', ' (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel', ' (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver', ' H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob', ' (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James', ' (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles', ' E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John', ' (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon', ' (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 2
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
print(d)
# ['Silsbee, Nathaniel, (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel, (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim, (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John, (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William, (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry, M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac, D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan, (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel, A. (Adams/AJ-CT) March 3, 1833 287 ', 'McLane, Louis, (J-DE) April 16, 1829 288 ',
# 'Parris, Albion, K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John, (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel, (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver, H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob, (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James, (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles, E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John, (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon, (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 3
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
[print(_) for _ in d]
# ('Silsbee, Nathaniel,', 'Adams/AJ-MA', 'March 3, 1835', ' 281 November 8 ')
# ('Rodney, Daniel,', 'Adams-DE', 'January 12, 1827', ' 282 November 9 ')
# ('Bateman, Ephraim,', 'Adams-NJ', 'January 12, 1829', ' 283 November 27 ')
# ('McKinley, John,', 'J-AL', 'March 3, 1831', ' 284 (Served again 1837) November 29 ')
# ('Smith, William,', 'R-SC', 'March 3, 1831', ' (First served 1816-1823) * * * 1827 * * * January 12 ')
# ('Ridgely, Henry, M.', 'J-DE', 'March 3, 1829', ' 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ')
# ('Barnard, Isaac, D.', 'J-PA', 'December 6, 1831', ' 286 ')
# ('Ellis, Powhatan,', 'J-MS', 'July 16, 1832', ' (First served 1825-1826) ')
# ('Foot, Samuel, A.', 'Adams/AJ-CT', 'March 3, 1833', ' 287 ')
# ('McLane, Louis,', 'J-DE', 'April 16, 1829', ' 288 ')
# ('Parris, Albion, K.', 'J-ME', 'August 26, 1828', ' 289 ')
# ('Tyler, John,', 'J/AJ-VA', 'February 29, 1836', ' 290 December 17 ')
# ('Webster, Daniel,', 'Adams/AJ/W-MA', 'February 22, 1841', ' 291 (Served again 1845) * * * 1828 * * * November 7 ')
# ('Prince, Oliver, H.', 'J-GA', 'March 3, 1829', ' 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ')
# ('Burnet, Jacob,', 'Adams/AJ-OH', 'March 3, 1831', ' 293 December 15 ')
# ('Iredell, James,', 'J-NC', 'March 3, 1831', ' 294 * * * 1829 * * * January 15 ')
# ('Dudley, Charles, E.', 'J-NY', 'March 3, 1833', ' 295 ')
# ('Holmes, John,', 'Adams/AJ-ME', 'March 3, 1833', ' (First served 1820-1827) January 30 ')
# ('Dickerson, Mahlon,', 'R/CR/J-NJ', 'March 3, 1833', ' (First served 1817-1829)')
# Step 4
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$','')
print(df)
# name party/state date_to_convert rank_to_clean
# 0 Silsbee, Nathaniel Adams/AJ-MA March 3, 1835 281 November 8
# 1 Rodney, Daniel Adams-DE January 12, 1827 282 November 9
# 2 Bateman, Ephraim Adams-NJ January 12, 1829 283 November 27
# 3 McKinley, John J-AL March 3, 1831 284 (Served again 1837) November 29
# 4 Smith, William R-SC March 3, 1831 (First served 1816-1823) * * * 1827 * * * ...
# 5 Ridgely, Henry, M. J-DE March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO M...
# 6 Barnard, Isaac, D. J-PA December 6, 1831 286
# 7 Ellis, Powhatan J-MS July 16, 1832 (First served 1825-1826)
# 8 Foot, Samuel, A. Adams/AJ-CT March 3, 1833 287
# 9 McLane, Louis J-DE April 16, 1829 288
# 10 Parris, Albion, K. J-ME August 26, 1828 289
# 11 Tyler, John J/AJ-VA February 29, 1836 290 December 17
# 12 Webster, Daniel Adams/AJ/W-MA February 22, 1841 291 (Served again 1845) * * * 1828 * * * ...
# 13 Prince, Oliver, H. J-GA March 3, 1829 292 Start of Initial Senate Service Name/...
# 14 Burnet, Jacob Adams/AJ-OH March 3, 1831 293 December 15
# 15 Iredell, James J-NC March 3, 1831 294 * * * 1829 * * * January 15
# 16 Dudley, Charles, E. J-NY March 3, 1833 295
# 17 Holmes, John Adams/AJ-ME March 3, 1833 (First served 1820-1827) January 30
# 18 Dickerson, Mahlon R/CR/J-NJ March 3, 1833 (First served 1817-1829)
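From here, further cleanup is ordinary pandas. A hedged sketch (column names as created above; the new column names are just examples) that parses the end-of-service date, pulls the leading rank digits out of rank_to_clean, and splits party from state:
df["end_of_service"] = pd.to_datetime(df["date_to_convert"], errors="coerce")
df["rank"] = df["rank_to_clean"].str.extract(r'^\s*(\d+)')  # NaN where no rank was printed
df[["party", "state"]] = df["party/state"].str.rsplit("-", n=1, expand=True)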
