I am currently web scraping the college football schedule by week.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
time = [' '.join(c.text.split()) for c in soup.find_all('div', class_='CellGame')]  # collapse runs of whitespace
import pandas as pd
schedule = pd.DataFrame(
    {
        'away': away,
        'home': home,
        'time': time,
    })
schedule
I would like a date column, but I am having difficulty extracting each date, duplicating it once for every game on that date, and appending the result to a Python list.
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        date.append(' '.join(a.text.split()))  # strip stray newlines and indentation
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
The dates act as headers for each table. I would like each date matched to its corresponding games, and "Postponed" recorded for the postponed games.
My plan is to automate this code for each week.
Thanks ahead.
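Since the plan is to automate this per week, one approach is to parameterize the URL: the CBS Sports path appears to follow a `.../FBS/<year>/regular/<week>/` pattern, so the week number can simply be formatted in. A minimal sketch (the `week_url` helper is hypothetical, not part of the original code):

```python
def week_url(year, week):
    """Build the CBS Sports schedule URL for a given season year and week number."""
    return f'https://www.cbssports.com/college-football/schedule/FBS/{year}/regular/{week}/'

# One URL per regular-season week; each could then be fed to the scraping code above.
urls = [week_url(2020, w) for w in range(1, 16)]
print(urls[4])  # week 5, the URL used in the question
```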
Comment added after the answer was posted:
Beautiful and well done. How would I pull venues, especially for the postponed games, using your code?
My original code was:
venue = [v.text.replace('\n', '').strip('— ').strip()
         for v in soup.find_all('td', text=lambda x: x and ('Field' in x or 'Stadium' in x))]
venues = [x for x in venue if x]
missing = len(away) - len(venues)
words = ['Postponed'] * max(0, missing)  # pad with 'Postponed' when venues come up short
venues = venues + words
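The padding idea in the original code can be sketched generically: when the venue list is shorter than the game list, fill the tail with a placeholder. (The `pad_to` helper below is illustrative, not part of the original code.)

```python
def pad_to(values, length, filler='Postponed'):
    """Right-pad values with filler until it reaches the given length."""
    return values + [filler] * max(0, length - len(values))

venues = ['Heinz Field', 'Kroger Field']
print(pad_to(venues, 4))  # ['Heinz Field', 'Kroger Field', 'Postponed', 'Postponed']
```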
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    date = row.find_previous('h4')

    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time date
0 Campbell Wake Forest WAKE 66 - CAMP 14 Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 10 - 2nd ESPU Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 17, ARKST 14 - 2nd ESP2 Saturday, October 3, 2020
4 Missouri Tennessee TENN 21, MIZZOU 6 - 2nd SECN Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 2nd ABC Saturday, October 3, 2020
6 TCU Texas TCU 14, TEXAS 14 - 2nd FOX Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 10 - 2nd ACCN Saturday, October 3, 2020
8 South Carolina Florida FLA 17, SC 14 - 2nd ESPN Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 7, TXSA 3 - 2nd Saturday, October 3, 2020
10 North Alabama Liberty NAL 0, LIB 0 - 1st ESP3 Saturday, October 3, 2020
11 Abil Christian Army 1:30 pm CBSSN Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Saturday, October 3, 2020
32 Rice Marshall Postponed Saturday, October 3, 2020
33 Troy South Alabama Postponed Saturday, October 3, 2020
And it saves data.csv.
EDIT: To parse the "Venue" column as well, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    date = row.find_previous('h4')

    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time venue date
0 Campbell Wake Forest WAKE 66 - CAMP 14 - Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 - Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 13 - 3rd ESPU Center Parc Stadium Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 31, ARKST 14 - 3rd ESP2 Brooks Stadium Saturday, October 3, 2020
4 Missouri Tennessee TENN 28, MIZZOU 6 - 3rd SECN Neyland Stadium Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 3rd ABC Mountaineer Field at Milan Puskar Stadium Saturday, October 3, 2020
6 TCU Texas TCU 20, TEXAS 14 - 2nd FOX DKR-Texas Memorial Stadium Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 13 - 3rd ACCN Heinz Field Saturday, October 3, 2020
8 South Carolina Florida FLA 31, SC 14 - 3rd ESPN Florida Field at Ben Hill Griffin Stadium Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 14, TXSA 6 - 2nd Legion Field Saturday, October 3, 2020
10 North Alabama Liberty LIB 7, NAL 0 - 2nd ESP3 Williams Stadium Saturday, October 3, 2020
11 Abil Christian Army ARMY 7, ABIL 0 - 1st CBSSN Blaik Field at Michie Stadium Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Bryant-Denny Stadium Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Bill Snyder Family Stadium Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Alumni Stadium Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Nippert Stadium Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN David Booth Kansas Memorial Stadium Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Gerald J. Ford Stadium Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU FAU Stadium Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Bobby Bowden Field at Doak Campbell Stadium Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Brooks Field at Wallace Wade Stadium Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Kroger Field Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Johnny (Red) Floyd Stadium Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Falcon Stadium Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ JPS Field at James L. Malone Stadium Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Sanford Stadium Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Davis Wade Stadium at Scott Field Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Vanderbilt Stadium Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Jack Trice Stadium Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Apogee Stadium Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Spectrum Stadium Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Memorial Stadium Saturday, October 3, 2020
32 Rice Marshall Postponed - Saturday, October 3, 2020
33 Troy South Alabama Postponed - Saturday, October 3, 2020
It would be a great help to know how to resolve the issues in the code below so that I can get the proper output into a JSON file.
r = requests.get('https://www.iomfsa.im/enforcement/disqualified-directors/')
soup = BeautifulSoup(r.content, 'html.parser')
paragraphs=[]
length = soup.findAll("strong")
for leng in length:
    paragraphs.append(leng.next_sibling)
paragraph = [i for i in paragraphs if i is not None]
print(paragraph)
fields = ['name', 'address', 'DOB', 'POD', 'DOD', 'Particulars of Disqualification Order or Undertaking']  # avoid shadowing the built-in `list`
expected output
You can do something like below:
from bs4 import BeautifulSoup
import requests
import json
import re
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
url = 'https://www.iomfsa.im/enforcement/disqualified-directors/'
df_list = []
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'xml')
# print(soup)
dismissed_dirs = soup.select('section.accordion-item')
for d in dismissed_dirs:
    # print(d)
    name = d.find('strong', text=re.compile('^Name:')).next_sibling
    address = d.find('strong', text=re.compile('^Address')).next_sibling
    dob = d.find('strong', text=re.compile('^Date of Birth:')).next_sibling
    pod = d.find('strong', text=re.compile('^Period of Disqualification:')).next_sibling
    dod = d.find('strong', text=re.compile('^Dates of Disqualification:')).parent.text
    particulars = d.find('strong', text=re.compile('^Particulars')).find_next('a').text
    df_list.append((name, address, dob, pod, dod, particulars))
df = pd.DataFrame(df_list, columns = ['name', 'address', 'dob', 'pod', 'dod', 'particulars'])
print(df)
print('--------------')
print(df.to_dict(orient='records'))
This will return both a dataframe (which makes more sense, visually), as well as a dictionary, as you requested:
   name | address | dob | pod | dod | particulars
0  John Trevor Roche Baines | c/o Isle of Man Prison, Jurby, Isle of Man | 19 Dec 1939 | 15 Years 0 Months 0 Days | Dates of Disqualification: From 15 Jul 2010 To 15 Jul 2025 | Section 2 Company Officers (Disqualification) Act 2009
1  Ralph Stephen Brunswick | Valdfrieden, Ballaugh Glen, Ballaugh, Isle of Man IM7 5JB | None | 13 Years 6 Months 0 Days | Dates of Disqualification: From 4 Mar 2009 To 4 Sep 2022 | Section 2 Company Officers (Disqualification) Act 2009
2  Fenella Jane Carter | 30 North Quay, Douglas, Isle of Man IM1 4LD | 6 August 1984 | 6 Years 0 Months 0 Days | Dates of Disqualification: From 5 July 2018 to 5 July 2024 | Section 2 Company Officers (Disqualification) Act 2009
3  Richard Alan Costain | St Patrick's Close, Jurby, Isle of Man | 13 Sep 1951 | 15 Years 0 Months 0 Days | Dates of Disqualification: From 16 Feb 2017 To 16 Feb 2032 | Section 2 Company Officers (Disqualification) Act 2009
4  Paul Deighton | Isle of Man Prison, St Patrick’s Close, Coast Road IM7 3JP | 23 December 1966 | 12 Years 0 Months 0 Days | Dates of Disqualification: From 8 August 2020 to 7 August 2032 | Section 4 Company Officers (Disqualification) Act 2009
5  Jamie Alexander Irving | Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN | Not known | 7 Years 0 Months 0 Days | Dates of Disqualification: From 26 Feb 2018 To 26 Feb 2025 | Section 4 Company Officers (Disqualification) Act 2009
6  Jonathan Frank Edward Irving | Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN | Not known | 8 Years 0 Months 0 Days | Dates of Disqualification: From 26 Feb 2018 To 26 Feb 2026 | Section 4 Company Officers (Disqualification) Act 2009
7  Duncan Frank Ellis Jones | Harpers Glen, Hillberry Green, Douglas, Isle of Man IM2 6DE | 10 Aug 1959 | 13 Years 0 Months 0 Days | Dates of Disqualification: From 25 Apr 2011 To 25 Apr 2024 | Section 2 Company Officers (Disqualification) Act 2009
8  Lynn Keig | Croit-e-Quill, Lonan, Isle of Man, IM4 7JG | 28 June 1956 | Years 0 Months 0 Days | Dates of Disqualification: From 29 Jun 2017 To 28 Jun 2023 | Section 2 Company Officers (Disqualification) Act 2009
9  Richard Ian Kissack | 6 Falcon Cliff Court, Douglas, Isle of Man, IM2 4AQ (currently Mr Kissack is residing at HM IOM Prison, Jurby, Isle of Man) | 30 October 1968 | 5 Years 11 Months 13 Days | Dates of Disqualification: From 31 December 2021 to 13 December 2027 | Section 2 Company Officers (Disqualification) Act 2009
10 Alan Louis | Not known | 23 November 1965 | 12 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2031 | Section 2 Company Officers (Disqualification) Act 2009
11 Phillip Sean McCarthy | Cedar Lodge, Main Road, Crosby IM4 4BH | 5 October 1979 | 8 Years 0 Months 0 Days | Dates of Disqualification: From 28 November 2019 to 27 November 2027 | Section 2 Company Officers (Disqualification) Act 2009
12 John McCauley | Not known | 30 March 1955 | 5 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2024 | Section 2 Company Officers (Disqualification) Act 2009
13 Dirk Frederik Mudge | 92, Daan Bekker Street, Windhoek, Namibia | 23 December 1976 | 8 Years 0 Months 0 Days | Dates of Disqualification: From 17 November 2018 to 16 November 2026 | Section 2 Company Officers (Disqualification) Act 2009
14 Lukas Nakos | Not known | 6 March 1976 | 6 Years 0 Months 0 Days | Dates of Disqualification: From 29 April 2019 to 28 April 2025 | Section 2 Company Officers (Disqualification) Act 2009
15 Andrew Mark Rouse | 13 Reayrt Ny Chrink, Crosby, Isle of Man IM4 2EA | 24 Jan 1977 | 5 Years 0 Months 0 Days | Dates of Disqualification: From 28 Feb 2018 To 28 Feb 2023 | Section 2 Company Officers (Disqualification) Act 2009
--------------
[{'name': 'John Trevor Roche Baines', 'address': 'c/o Isle of Man Prison, Jurby, Isle of Man', 'dob': '19 Dec 1939', 'pod': '15 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 15 Jul 2010 To 15 Jul 2025', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Ralph Stephen Brunswick', 'address': 'Valdfrieden, Ballaugh Glen, Ballaugh, Isle of Man IM7 5JB', 'dob': None, 'pod': '13 Years 6 Months 0 Days', 'dod': 'Dates of Disqualification: From 4 Mar 2009 To 4 Sep 2022', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Fenella Jane Carter', 'address': '30 North Quay, Douglas, Isle of Man IM1 4LD', 'dob': '6 August 1984', 'pod': '6 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa05 July 2018 to 5 July 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Richard Alan Costain', 'address': "St Patrick's Close, Jurby, Isle of Man", 'dob': '13 Sep 1951', 'pod': '15 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 16 Feb 2017 To 16 Feb 2032', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Paul Deighton', 'address': 'Isle of Man Prison, St Patrick’s Close, Coast Road IM7 3JP', 'dob': '23 December 1966', 'pod': '12 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa08 August 2020 to 7 August 2032', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Jamie Alexander Irving', 'address': 'Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN', 'dob': 'Not known', 'pod': '7 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa026 Feb 2018 To\xa026 Feb 2025', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Jonathan Frank Edward Irving', 'address': 'Meadowcourt, The Links, Douglas Road, Peel, Isle of Man, IM5 1LN', 'dob': 'Not known', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of 
Disqualification: From\xa026 Feb 2018 To\xa026 Feb 2026', 'particulars': 'Section\xa04 Company Officers (Disqualification) Act 2009'}, {'name': 'Duncan Frank Ellis Jones', 'address': 'Harpers Glen, Hillberry Green, Douglas, Isle of Man IM2 6DE', 'dob': '10 Aug 1959', 'pod': '13 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From 25 Apr 2011 To 25 Apr 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Lynn Keig', 'address': 'Croit-e-Quill, Lonan, Isle of Man, IM4 7JG', 'dob': '28 June 1956', 'pod': ' Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 Jun 2017 To 28 Jun 2023', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Richard Ian Kissack', 'address': '6 Falcon Cliff Court, Douglas, Isle of Man, IM2 4AQ (currently Mr Kissack is residing at HM IOM Prison, Jurby, Isle of Man)', 'dob': <strong>30 October 1968</strong>, 'pod': '5 Years 11 Months 13 Days', 'dod': 'Dates of Disqualification: From\xa031 December 2021 to 13 December 2027', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Alan Louis', 'address': 'Not known', 'dob': '23 November 1965', 'pod': '12 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2031', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Phillip Sean McCarthy', 'address': 'Cedar Lodge, Main Road, Crosby IM4 4BH', 'dob': '5 October 1979', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa028 November 2019 to 27 November 2027', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' John McCauley', 'address': 'Not known', 'dob': '30 March 1955', 'pod': '5 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2024', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Dirk Frederik Mudge', 'address': 
'92, Daan Bekker Street, Windhoek, Namibia', 'dob': '23 December 1976', 'pod': '8 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa017 November 2018 to\xa016 November 2026', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': ' Lukas Nakos', 'address': 'Not known', 'dob': '6 March 1976', 'pod': '6 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa029 April 2019 to 28 April 2025', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}, {'name': 'Andrew Mark Rouse', 'address': '13 Reayrt Ny Chrink, Crosby, Isle of Man IM4 2EA', 'dob': '24 Jan 1977', 'pod': '5 Years 0 Months 0 Days', 'dod': 'Dates of Disqualification: From\xa028 Feb 2018 To\xa028 Feb 2023', 'particulars': 'Section 2 Company Officers (Disqualification) Act 2009'}]
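The question specifically asked for a JSON file; once the records are a list of dicts (as produced by df.to_dict(orient='records') above), json.dump writes them out directly. A minimal sketch with stand-in records and a hypothetical filename (note that one dob in the real run came back as a bs4 Tag rather than a string, so wrap values in str() or call .get_text() before dumping):

```python
import json

# Stand-in records mirroring the shape of df.to_dict(orient='records').
records = [
    {'name': 'John Trevor Roche Baines', 'dob': '19 Dec 1939'},
    {'name': 'Ralph Stephen Brunswick', 'dob': None},  # missing fields serialize as null
]
with open('directors.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```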
This is the website: https://www.rfpmart.in/
I want all the text from this website saved as a csv file with the following column names
Description | Posted Date | Expiry Date
import requests as rq
from bs4 import BeautifulSoup
url = "https://www.rfpmart.in/"
response = rq.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    div_text = soup.find('div', {'id': 'home'}).text
I am a beginner and I have gotten this far.
Thanks in advance
Good start. Now you need to iterate through the specific tags containing that content within the first <div> you are pulling, then construct the dataframe.
Code:
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.rfpmart.in/"
response = rq.get(url)
soup = BeautifulSoup(response.text,'html.parser')
endPage = int(soup.find('p',{'class':'counter'}).text.split()[-1])
desc = []
post = []
exp = []
urls = []
for page in range(1, endPage+1):
    print('Acquiring page: %s' % page)
    url = "https://www.rfpmart.in/home-%s.html" % page
    response = rq.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        div = soup.find('div', {'id': 'home'})
        items = div.find_all('li')
        for each in items:
            desc.append(each.find('div', {'class': 'description-category'}).text.strip())
            post.append(each.find('span', {'class': 'post-date'}).text.strip())
            exp.append(each.find('span', {'class': 'expiry-date'}).text.strip())
            urls.append(each.find('a')['href'])

df = pd.DataFrame(list(zip(desc, post, exp, urls)),
                  columns=['Description', 'Posted Date', 'Expiry Date', 'URL'])
Output:
print (df)
Description ... Expiry Date
0 EXTRA-8365-USA (Oakland, California) - Perinat... ... Wednesday, 15 July, 2020
1 EXTRA-8364-USA (Santa Cruz, California) - Prof... ... Thursday, 25 June, 2020
2 SW-33969-USA (California) - Java Software Deve... ... Wednesday, 10 June, 2020
3 SW-33968-USA (California) - Electronic Agenda ... ... Monday, 15 June, 2020
4 WRT-1475-USA (Beebe, Arkansas) - Title III Gra... ... Monday, 15 June, 2020
5 DE-3174-USA (California) - Paper to Digital Do... ... Thursday, 9 July, 2020
6 TRANSLATION-3588-USA (California) - Translatio... ... Tuesday, 9 June, 2020
7 ACCT-9172-USA (San Angelo, Texas) - Audit Serv... ... Friday, 12 June, 2020
8 ACCT-9173-USA (Texas) - Audit Services for Tax... ... Monday, 1 June, 2020
9 ESTATE-3249-USA (San Antonio, Texas) - Real Es... ... Thursday, 25 June, 2020
10 SW-33980-USA (New Jersey) - Document Managemen... ... Thursday, 28 May, 2020
11 FOOD-5243-USA (Garden City, New York) - Food S... ... Thursday, 11 June, 2020
12 EXTRA-8369-USA (Camden, New Jersey) - Port Pla... ... Friday, 19 June, 2020
13 TRANSPORT-3411- USA (Aztec, New Mexico) - Inma... ... Thursday, 4 June, 2020
14 CSE-7613-USA (Lincoln, Nebraska) - Event Secur... ... Thursday, 11 June, 2020
15 TRANSPORT-3410-USA (Bedford, Texas) - Third Pa... ... Tuesday, 16 June, 2020
16 ITES-2658-USA (Salem, New Jersey) - Informatio... ... Friday, 12 June, 2020
17 MRB-16554-USA (North Carolina) - Paid Media Su... ... Friday, 19 June, 2020
18 MB-4214-USA (New York) - Claims Auditor Servic... ... Tuesday, 9 June, 2020
19 TRANS-1529-USA (New York) - Court Reporting an... ... Thursday, 18 June, 2020
20 CSE-7612-USA (Frederick, Maryland) - Security ... ... Thursday, 18 June, 2020
21 CSE-7611-USA (Fall River, Massachusetts) - Vid... ... Thursday, 11 June, 2020
22 SW-33960-USA (North Carolina) - Communications... ... Friday, 26 June, 2020
23 SW-33958-USA (Colorado) - Cloud Based Human Re... ... Tuesday, 30 June, 2020
24 SURG-1921-USA (Louisiana) - N95 Mask Supplies-... ... Wednesday, 27 May, 2020
25 MRB-16555-USA (Kentucky) - Legislative Video P... ... Tuesday, 2 June, 2020
26 SW-33959-USA (North Carolina) - Transit Mobile... ... Friday, 12 June, 2020
27 SW-33975-USA (Massachusetts) - Comprehensive S... ... Friday, 12 June, 2020
28 MB-4215-USA (Massachusetts) - Ambulance Billin... ... Thursday, 11 June, 2020
29 NET-3016-USA (Leander, Texas) - Signal Communi... ... Thursday, 4 June, 2020
.. ... ... ...
70 ACCT-9168-Canada (Alberta) - Occupational Heal... ... Wednesday, 10 June, 2020
71 ACCT-9167-USA (Atlanta, Georgia) - Audit Servi... ... Thursday, 25 June, 2020
72 STAFF-4366-USA (Houston, Texas) - RFI for Mobi... ... Thursday, 28 May, 2020
73 MEDI-1919-USA (Macon, Georgia) - Medical Clini... ... Thursday, 25 June, 2020
74 CSE-7607-USA (Washington, DC) - RFI for Securi... ... Saturday, 6 June, 2020
75 STAFF-4367-USA (Calhoun, Georgia) - Nursing Se... ... Monday, 22 June, 2020
76 CR-0991-USA (Saint Paul, Minnesota) - Court Re... ... Tuesday, 2 June, 2020
77 SW-33956-USA (Washington) - Asset Tracking Sof... ... Friday, 29 May, 2020
78 SW-33957-USA (Georgia) - Automated Weather Obs... ... Friday, 26 June, 2020
79 SW-33955-USA (Texas) - Moodle Programming Serv... ... Thursday, 4 June, 2020
80 MRB-16551-USA (Texas) - Comprehensive Integrat... ... Thursday, 28 May, 2020
81 ITES-2656-USA (Fort Wayne, Indiana) - RFI for ... ... Monday, 8 June, 2020
82 CSE-7608-USA (Hawaii) - Library Rules Enforcem... ... Friday, 5 June, 2020
83 DRA-1532-USA (Hawaii) - Arrest Charge Disposit... ... Wednesday, 10 June, 2020
84 MRB-16553-USA (Michigan) - Outreach and Affirm... ... Tuesday, 16 June, 2020
85 FOOD-5240-USA (Washington, DC.) - Inmate Food ... ... Wednesday, 17 June, 2020
86 SW-33970-USA (Washington, DC.) - ADA Compliant... ... Friday, 29 May, 2020
87 PM-12591-USA (Philadelphia, Pennsylvania) - Pr... ... Monday, 1 June, 2020
88 PM-12590-USA (Washington, DC.) - Printing Serv... ... Monday, 1 June, 2020
89 EXTRA-8363-USA (Fort Lauderdale, Florida) - Ma... ... Monday, 22 June, 2020
90 SW-33963-USA (Florida) - Information Managemen... ... Monday, 15 June, 2020
91 EXTRA-8362-USA (Sarasota, Florida) - American ... ... Wednesday, 17 June, 2020
92 SW-33954-USA (Wisconsin) - RFI for Application... ... Tuesday, 12 May, 2020
93 LEGAL-4987-USA (Charleston, West Virginia) - O... ... Wednesday, 24 June, 2020
94 PM-12588-USA (Encinitas, California) - Managed... ... Wednesday, 10 June, 2020
95 ESTATE-3246-USA (Charlotte, North Carolina) - ... ... Tuesday, 16 June, 2020
96 DISPOS-3793-USA (Fortana, California) -Removal... ... Thursday, 4 June, 2020
97 SW-33952-USA (New York) - GPS Tracking System-... ... Thursday, 11 June, 2020
98 SW-33953-Canada (Ontario) - VMware Support Ser... ... Thursday, 4 June, 2020
99 SW-33951- USA (South Carolina) - Professional ... ... Thursday, 25 June, 2020
[100 rows x 3 columns]
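Since the stated goal was a CSV file with those column names, one to_csv call finishes the job once df exists. A self-contained sketch with a stand-in row and a hypothetical filename:

```python
import pandas as pd

# Stand-in frame with the same column layout as the scraped df above.
df = pd.DataFrame(
    [('EXTRA-8365-USA (Oakland, California) - ...', '25 May, 2020', 'Wednesday, 15 July, 2020')],
    columns=['Description', 'Posted Date', 'Expiry Date'])
df.to_csv('rfpmart.csv', index=False)  # index=False keeps the row numbers out of the file
```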
I have a pandas dataframe of the form
Start Date End Date President Party
0 04 March 1921 02 August 1923 Warren G Harding Republican
1 03 August 1923 04 March 1929 Calvin Coolidge Republican
2 05 March 1929 04 March 1933 Herbert Hoover Republican
3 05 March 1933 12 April 1945 Franklin D Roosevelt Democratic
4 13 April 1945 20 January 1953 Harry S Truman Democratic
5 21 January 1953 20 January 1961 Dwight D Eisenhower Republican
6 21 January 1961 22 November 1963 John F Kennedy Democratic
7 23 November 1963 20 January 1969 Lyndon B Johnson Democratic
8 21 January 1969 09 August 1974 Richard Nixon Republican
9 10 August 1974 20 January 1977 Gerald Ford Republican
10 21 January 1977 20 January 1981 Jimmy Carter Democratic
11 21 January 1981 20 January 1989 Ronald Reagan Republican
12 21 January 1989 20 January 1993 George H W Bush Republican
13 21 January 1993 20 January 2001 Bill Clinton Democratic
14 21 January 2001 20 January 2009 George W Bush Republican
15 21 January 2009 20 January 2017 Barack Obama Democratic
16 21 January 2017 20 May 2017 Donald Trump Republican
I want to extract the index values for Party=Republican and store them in a list.
Is there a Pandas function to do this quickly?
df.index[df.Party == 'Republican']
You can call .tolist() on the result if you want.
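A self-contained check of the indexing idiom above, on a tiny frame modeled on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'President': ['Warren G Harding', 'Franklin D Roosevelt',
                                 'Dwight D Eisenhower'],
                   'Party': ['Republican', 'Democratic', 'Republican']})
# Boolean mask selects the matching positions; .tolist() turns the Index into a plain list.
idx = df.index[df.Party == 'Republican'].tolist()
print(idx)  # [0, 2]
```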
Consider the following sites (site1, site2, site3) which have a number of different tables.
I am using read_html to scrape the tables into a single table as follows:
import multiprocessing
import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']

def process_url(url):
    return pd.concat(pd.read_html(url), ignore_index=False)

pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)
With the above procedure I get a single table. Although this is what I expected, it would be helpful to add a flag or "table counter" so I don't lose track of which row came from which table. So, how do I add the table number to each row?
Something like this, the same single table, but with a table_num column:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date table_num
1 Allied Bank Mulberry AR 91.0 Today's Bank September 23, 2016 October 17, 2016 1
2 The Woodbury Banking Company Woodbury GA 11297.0 United Bank August 19, 2016 October 17, 2016 1
3 First CornerStone Bank King of Prussia PA 35312.0 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016 1
4 Trust Company Bank Memphis TN 9956.0 The Bank of Fayette County April 29, 2016 September 6, 2016 2
5 North Milwaukee State Bank Milwaukee WI 20364.0 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 2
6 Hometown National Bank Longview WA 35156.0 Twin City Bank October 2, 2015 April 13, 2016 3
7 The Bank of Georgia Peachtree City GA 35259.0 Fidelity Bank October 2, 2015 October 24, 2016 3
8 Premier Bank Denver CO 34112.0 United Fidelity Bank, fsb July 10, 2015 August 17, 2016 3
9 Edgebrook Bank Chicago IL 57772.0 Republic Bank of Chicago May 8, 2015 July 12, 2016 3
10 Doral Bank NaN NaN NaN NaN NaN NaN 4
11 En Espanol San Juan PR 32102.0 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 4
12 Capitol City Bank & Trust Company Atlanta GA 33938.0 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015 4
13 Valley Bank Fort Lauderdale FL 21793.0 Landmark Bank, National Association June 20, 2014 June 29, 2015 5
14 Valley Bank Moline IL 10450.0 Great Southern Bank June 20, 2014 June 26, 2015 5
15 Slavie Federal Savings Bank Bel Air MD 32368.0 Bay Bank, FSB May 3, 2014 June 15, 2015 5
16 Columbia Savings Bank Cincinnati OH 32284.0 United Fidelity Bank, fsb May 23, 2014 November 10, 2016 6
17 AztecAmerica Bank NaN NaN NaN NaN NaN NaN 6
18 En Espanol Berwyn IL 57866.0 Republic Bank of Chicago May 16, 2014 October 20, 2016 6
For instance, if there are two tables in site1, the function must assign 0 to all the rows of table1, and with regards to table2 in site1 the function must assign 1 to all the rows of table2.
On the other hand, if site2 has two tables, the function must assign 3 to all the rows of its table1 and 4 to all the rows of its table2, and so on for all the tables that live in site2.
Also, is it possible to use assign() or other method to get the reference of each row (e.g. the table of provenance)?
Try changing your process_url() function as follows:
def process_url(url):
    return pd.concat([x.assign(table_num=i)
                      for i, x in enumerate(pd.read_html(url))],
                     ignore_index=False)
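What the rewritten process_url does can be demonstrated on in-memory frames instead of pd.read_html, so it runs without a network; the two stand-in tables below are hypothetical:

```python
import pandas as pd

# Two "scraped" tables, as pd.read_html would return for one URL.
tables = [pd.DataFrame({'Bank Name': ['Allied Bank']}),
          pd.DataFrame({'Bank Name': ['Trust Company Bank', 'Premier Bank']})]

# assign() stamps each table's enumerate() position onto every one of its rows.
df = pd.concat([t.assign(table_num=i) for i, t in enumerate(tables)],
               ignore_index=True)
print(df['table_num'].tolist())  # [0, 1, 1]
```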
I'm scraping a table that displays info for a sporting league. So far so good for a selenium beginner:
from selenium import webdriver
import re
import pandas as pd
driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []
for i in infotable:
    ilist.append(i.text)
infolist = ilist[0]
for i in matches:
    match.append(i.text)
driver.close()
home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])
df = pd.DataFrame({'home' : home, 'away' : away})
date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)  # raw string avoids invalid-escape warnings
In the last line, date captures all the dates in the table, but I can't link each date to its corresponding game.
My thinking is: for child/element "under the date", date = last_found_date.
Ultimate goal is to have two more columns in df, one with the date of the match and the next if any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).
Should I be incorporating another program/method to retain order of tags/elements of the table?
You would need to change the way you extract the match information. Instead of separately extracting home and away teams, do it in one loop also extracting the dates and events:
from selenium import webdriver
import pandas as pd
driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
data = []
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
    if " - " in date:
        date, event = date.split(" - ")
    else:
        event = "Not specified"

    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })
driver.close()
df = pd.DataFrame(data)
print(df)
Prints:
away date event home
0 Washington Capitals 25 Apr 2015 Play Offs New York Islanders
1 Minnesota Wild 25 Apr 2015 Play Offs St.Louis Blues
2 Ottawa Senators 25 Apr 2015 Play Offs Montreal Canadiens
3 Pittsburgh Penguins 25 Apr 2015 Play Offs New York Rangers
4 Calgary Flames 24 Apr 2015 Play Offs Vancouver Canucks
5 Chicago Blackhawks 24 Apr 2015 Play Offs Nashville Predators
6 Tampa Bay Lightning 24 Apr 2015 Play Offs Detroit Red Wings
7 New York Islanders 24 Apr 2015 Play Offs Washington Capitals
8 St.Louis Blues 23 Apr 2015 Play Offs Minnesota Wild
9 Anaheim Ducks 23 Apr 2015 Play Offs Winnipeg Jets
10 Montreal Canadiens 23 Apr 2015 Play Offs Ottawa Senators
11 New York Rangers 23 Apr 2015 Play Offs Pittsburgh Penguins
12 Vancouver Canucks 22 Apr 2015 Play Offs Calgary Flames
13 Nashville Predators 22 Apr 2015 Play Offs Chicago Blackhawks
14 Washington Capitals 22 Apr 2015 Play Offs New York Islanders
15 Tampa Bay Lightning 22 Apr 2015 Play Offs Detroit Red Wings
16 Anaheim Ducks 21 Apr 2015 Play Offs Winnipeg Jets
17 St.Louis Blues 21 Apr 2015 Play Offs Minnesota Wild
18 New York Rangers 21 Apr 2015 Play Offs Pittsburgh Penguins
19 Vancouver Canucks 20 Apr 2015 Play Offs Calgary Flames
20 Montreal Canadiens 20 Apr 2015 Play Offs Ottawa Senators
21 Nashville Predators 19 Apr 2015 Play Offs Chicago Blackhawks
22 Washington Capitals 19 Apr 2015 Play Offs New York Islanders
23 Winnipeg Jets 19 Apr 2015 Play Offs Anaheim Ducks
24 Pittsburgh Penguins 19 Apr 2015 Play Offs New York Rangers
25 Minnesota Wild 18 Apr 2015 Play Offs St.Louis Blues
26 Detroit Red Wings 18 Apr 2015 Play Offs Tampa Bay Lightning
27 Calgary Flames 18 Apr 2015 Play Offs Vancouver Canucks
28 Chicago Blackhawks 18 Apr 2015 Play Offs Nashville Predators
29 Ottawa Senators 18 Apr 2015 Play Offs Montreal Canadiens
30 New York Islanders 18 Apr 2015 Play Offs Washington Capitals
31 Winnipeg Jets 17 Apr 2015 Play Offs Anaheim Ducks
32 Minnesota Wild 17 Apr 2015 Play Offs St.Louis Blues
33 Detroit Red Wings 17 Apr 2015 Play Offs Tampa Bay Lightning
34 Pittsburgh Penguins 17 Apr 2015 Play Offs New York Rangers
35 Calgary Flames 16 Apr 2015 Play Offs Vancouver Canucks
36 Chicago Blackhawks 16 Apr 2015 Play Offs Nashville Predators
37 Ottawa Senators 16 Apr 2015 Play Offs Montreal Canadiens
38 New York Islanders 16 Apr 2015 Play Offs Washington Capitals
39 Edmonton Oilers 12 Apr 2015 Not specified Vancouver Canucks
40 Anaheim Ducks 12 Apr 2015 Not specified Arizona Coyotes
41 Chicago Blackhawks 12 Apr 2015 Not specified Colorado Avalanche
42 Nashville Predators 12 Apr 2015 Not specified Dallas Stars
43 Boston Bruins 12 Apr 2015 Not specified Tampa Bay Lightning
44 Pittsburgh Penguins 12 Apr 2015 Not specified Buffalo Sabres
45 Detroit Red Wings 12 Apr 2015 Not specified Carolina Hurricanes
46 New Jersey Devils 12 Apr 2015 Not specified Florida Panthers
47 Columbus Blue Jackets 12 Apr 2015 Not specified New York Islanders
48 Montreal Canadiens 12 Apr 2015 Not specified Toronto Maple Leafs
49 Calgary Flames 11 Apr 2015 Not specified Winnipeg Jets