Scrape data from all the pages in a website - python

This is the website
I want all the text from this website saved as a csv file with the following column names
Description | Posted Date | Expiry Date
import requests as rq
from bs4 import BeautifulSoup

url = "https://www.rfpmart.in/"
response = rq.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    div_text = soup.find('div', {'id': 'home'}).text
I am a beginner and this is as far as I have gotten.
Thanks in advance

Good start. Now you need to iterate through the specific tags that hold the content inside the <div> you are already pulling, then construct the DataFrame.
Code:
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.rfpmart.in/"
response = rq.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the pagination counter on the first page tells us how many pages exist
endPage = int(soup.find('p', {'class': 'counter'}).text.split()[-1])

desc = []
post = []
exp = []
urls = []
for page in range(1, endPage + 1):
    print('Acquiring page: %s' % page)
    url = "https://www.rfpmart.in/home-%s.html" % page
    response = rq.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        div = soup.find('div', {'id': 'home'})
        items = div.find_all('li')
        for each in items:
            desc.append(each.find('div', {'class': 'description-category'}).text.strip())
            post.append(each.find('span', {'class': 'post-date'}).text.strip())
            exp.append(each.find('span', {'class': 'expiry-date'}).text.strip())
            urls.append(each.find('a')['href'])

df = pd.DataFrame(list(zip(desc, post, exp, urls)),
                  columns=['Description', 'Posted Date', 'Expiry Date', 'URL'])
Output:
print (df)
Description ... Expiry Date
0 EXTRA-8365-USA (Oakland, California) - Perinat... ... Wednesday, 15 July, 2020
1 EXTRA-8364-USA (Santa Cruz, California) - Prof... ... Thursday, 25 June, 2020
2 SW-33969-USA (California) - Java Software Deve... ... Wednesday, 10 June, 2020
3 SW-33968-USA (California) - Electronic Agenda ... ... Monday, 15 June, 2020
4 WRT-1475-USA (Beebe, Arkansas) - Title III Gra... ... Monday, 15 June, 2020
5 DE-3174-USA (California) - Paper to Digital Do... ... Thursday, 9 July, 2020
6 TRANSLATION-3588-USA (California) - Translatio... ... Tuesday, 9 June, 2020
7 ACCT-9172-USA (San Angelo, Texas) - Audit Serv... ... Friday, 12 June, 2020
8 ACCT-9173-USA (Texas) - Audit Services for Tax... ... Monday, 1 June, 2020
9 ESTATE-3249-USA (San Antonio, Texas) - Real Es... ... Thursday, 25 June, 2020
10 SW-33980-USA (New Jersey) - Document Managemen... ... Thursday, 28 May, 2020
11 FOOD-5243-USA (Garden City, New York) - Food S... ... Thursday, 11 June, 2020
12 EXTRA-8369-USA (Camden, New Jersey) - Port Pla... ... Friday, 19 June, 2020
13 TRANSPORT-3411- USA (Aztec, New Mexico) - Inma... ... Thursday, 4 June, 2020
14 CSE-7613-USA (Lincoln, Nebraska) - Event Secur... ... Thursday, 11 June, 2020
15 TRANSPORT-3410-USA (Bedford, Texas) - Third Pa... ... Tuesday, 16 June, 2020
16 ITES-2658-USA (Salem, New Jersey) - Informatio... ... Friday, 12 June, 2020
17 MRB-16554-USA (North Carolina) - Paid Media Su... ... Friday, 19 June, 2020
18 MB-4214-USA (New York) - Claims Auditor Servic... ... Tuesday, 9 June, 2020
19 TRANS-1529-USA (New York) - Court Reporting an... ... Thursday, 18 June, 2020
20 CSE-7612-USA (Frederick, Maryland) - Security ... ... Thursday, 18 June, 2020
21 CSE-7611-USA (Fall River, Massachusetts) - Vid... ... Thursday, 11 June, 2020
22 SW-33960-USA (North Carolina) - Communications... ... Friday, 26 June, 2020
23 SW-33958-USA (Colorado) - Cloud Based Human Re... ... Tuesday, 30 June, 2020
24 SURG-1921-USA (Louisiana) - N95 Mask Supplies-... ... Wednesday, 27 May, 2020
25 MRB-16555-USA (Kentucky) - Legislative Video P... ... Tuesday, 2 June, 2020
26 SW-33959-USA (North Carolina) - Transit Mobile... ... Friday, 12 June, 2020
27 SW-33975-USA (Massachusetts) - Comprehensive S... ... Friday, 12 June, 2020
28 MB-4215-USA (Massachusetts) - Ambulance Billin... ... Thursday, 11 June, 2020
29 NET-3016-USA (Leander, Texas) - Signal Communi... ... Thursday, 4 June, 2020
.. ... ... ...
70 ACCT-9168-Canada (Alberta) - Occupational Heal... ... Wednesday, 10 June, 2020
71 ACCT-9167-USA (Atlanta, Georgia) - Audit Servi... ... Thursday, 25 June, 2020
72 STAFF-4366-USA (Houston, Texas) - RFI for Mobi... ... Thursday, 28 May, 2020
73 MEDI-1919-USA (Macon, Georgia) - Medical Clini... ... Thursday, 25 June, 2020
74 CSE-7607-USA (Washington, DC) - RFI for Securi... ... Saturday, 6 June, 2020
75 STAFF-4367-USA (Calhoun, Georgia) - Nursing Se... ... Monday, 22 June, 2020
76 CR-0991-USA (Saint Paul, Minnesota) - Court Re... ... Tuesday, 2 June, 2020
77 SW-33956-USA (Washington) - Asset Tracking Sof... ... Friday, 29 May, 2020
78 SW-33957-USA (Georgia) - Automated Weather Obs... ... Friday, 26 June, 2020
79 SW-33955-USA (Texas) - Moodle Programming Serv... ... Thursday, 4 June, 2020
80 MRB-16551-USA (Texas) - Comprehensive Integrat... ... Thursday, 28 May, 2020
81 ITES-2656-USA (Fort Wayne, Indiana) - RFI for ... ... Monday, 8 June, 2020
82 CSE-7608-USA (Hawaii) - Library Rules Enforcem... ... Friday, 5 June, 2020
83 DRA-1532-USA (Hawaii) - Arrest Charge Disposit... ... Wednesday, 10 June, 2020
84 MRB-16553-USA (Michigan) - Outreach and Affirm... ... Tuesday, 16 June, 2020
85 FOOD-5240-USA (Washington, DC.) - Inmate Food ... ... Wednesday, 17 June, 2020
86 SW-33970-USA (Washington, DC.) - ADA Compliant... ... Friday, 29 May, 2020
87 PM-12591-USA (Philadelphia, Pennsylvania) - Pr... ... Monday, 1 June, 2020
88 PM-12590-USA (Washington, DC.) - Printing Serv... ... Monday, 1 June, 2020
89 EXTRA-8363-USA (Fort Lauderdale, Florida) - Ma... ... Monday, 22 June, 2020
90 SW-33963-USA (Florida) - Information Managemen... ... Monday, 15 June, 2020
91 EXTRA-8362-USA (Sarasota, Florida) - American ... ... Wednesday, 17 June, 2020
92 SW-33954-USA (Wisconsin) - RFI for Application... ... Tuesday, 12 May, 2020
93 LEGAL-4987-USA (Charleston, West Virginia) - O... ... Wednesday, 24 June, 2020
94 PM-12588-USA (Encinitas, California) - Managed... ... Wednesday, 10 June, 2020
95 ESTATE-3246-USA (Charlotte, North Carolina) - ... ... Tuesday, 16 June, 2020
96 DISPOS-3793-USA (Fortana, California) -Removal... ... Thursday, 4 June, 2020
97 SW-33952-USA (New York) - GPS Tracking System-... ... Thursday, 11 June, 2020
98 SW-33953-Canada (Ontario) - VMware Support Ser... ... Thursday, 4 June, 2020
99 SW-33951- USA (South Carolina) - Professional ... ... Thursday, 25 June, 2020
[100 rows x 3 columns]
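Since the goal is a CSV file, the finished DataFrame can be written straight out; the filename here is just an example:
df.to_csv('rfp_listings.csv', index=False)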

Related

How to join or merge two dataframes based on a different condition?

I want to merge or join two DataFrames based on dates: join each Completed date with every earlier (or equal) Start date. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
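(For reference, the two frames used in the answers below can be built like this; note the Complted_date spelling is kept as given in the question:)
import pandas as pd
df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2012, 2015, 2016, 2017, 2018, 2019, 2020, 2021]})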
Check out merge, which gives you the expected output:
(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('Complted_date >= Start_date')
    .drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
              right_on='Complted_date',
              left_on='Start_date',
              direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
You can do a cross join and pick the records where Complted_date >= Start_date:
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', axis=1)
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
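On pandas 1.2 or newer the dummy tmp column is unnecessary, since merge supports how='cross' for the cartesian product directly; a short sketch using the same column names:
res = (df1.merge(df2, how='cross')
          .query("Complted_date >= Start_date"))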
Here is another way, using pd.Series() and explode():
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
You could use conditional_join from pyjanitor to get rows where compltd_date is >= start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.

Extract Date, Append by Number of Games

I am currently web scraping the college football schedule by week.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
time = [c.text.replace("\n", "").replace(' ','').replace(' ',' ') for c in soup.find_all('div', class_='CellGame')]
import pandas as pd

schedule = pd.DataFrame(
    {
        'away': away,
        'home': home,
        'time': time,
    })
schedule
I would like a date column. I am having difficulty extracting each date, duplicating it to match the number of games played on that date, and appending it to a Python list.
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        date.append(a.text.replace('\n \n ','').replace('\n \n ',''))
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
Dates act as headers for each table. I would like each date matched to the correct game, and also to include "Postponed" for the postponed games.
My plan is to automate this code for each week.
Thanks ahead.
Follow-up from the asker (posted after the answer below):
Beautiful and well done. How would I pull venues especially with postponed, using your code?
My original code was:
venue = [v.text.replace('\n','').replace(' ','').replace(' ','').strip('—').strip() for v in soup.find_all('td', text=lambda x: x and "Field" or x and 'Stadium' in x) if v != '' ]
venues = [x for x in venue if x]
missing = len(away) - len(venues)
words = ['Postponed' for x in range(missing) if len(away)> len(venues)]
venues = venues + words
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time date
0 Campbell Wake Forest WAKE 66 - CAMP 14 Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 10 - 2nd ESPU Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 17, ARKST 14 - 2nd ESP2 Saturday, October 3, 2020
4 Missouri Tennessee TENN 21, MIZZOU 6 - 2nd SECN Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 2nd ABC Saturday, October 3, 2020
6 TCU Texas TCU 14, TEXAS 14 - 2nd FOX Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 10 - 2nd ACCN Saturday, October 3, 2020
8 South Carolina Florida FLA 17, SC 14 - 2nd ESPN Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 7, TXSA 3 - 2nd Saturday, October 3, 2020
10 North Alabama Liberty NAL 0, LIB 0 - 1st ESP3 Saturday, October 3, 2020
11 Abil Christian Army 1:30 pm CBSSN Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Saturday, October 3, 2020
32 Rice Marshall Postponed Saturday, October 3, 2020
33 Troy South Alabama Postponed Saturday, October 3, 2020
It also saves the data to data.csv.
EDIT: To parse the "Venue" column, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    # rows without a venue cell (e.g. postponed games) only have 3 <td>s
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time venue date
0 Campbell Wake Forest WAKE 66 - CAMP 14 - Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 - Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 13 - 3rd ESPU Center Parc Stadium Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 31, ARKST 14 - 3rd ESP2 Brooks Stadium Saturday, October 3, 2020
4 Missouri Tennessee TENN 28, MIZZOU 6 - 3rd SECN Neyland Stadium Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 3rd ABC Mountaineer Field at Milan Puskar Stadium Saturday, October 3, 2020
6 TCU Texas TCU 20, TEXAS 14 - 2nd FOX DKR-Texas Memorial Stadium Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 13 - 3rd ACCN Heinz Field Saturday, October 3, 2020
8 South Carolina Florida FLA 31, SC 14 - 3rd ESPN Florida Field at Ben Hill Griffin Stadium Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 14, TXSA 6 - 2nd Legion Field Saturday, October 3, 2020
10 North Alabama Liberty LIB 7, NAL 0 - 2nd ESP3 Williams Stadium Saturday, October 3, 2020
11 Abil Christian Army ARMY 7, ABIL 0 - 1st CBSSN Blaik Field at Michie Stadium Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Bryant-Denny Stadium Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Bill Snyder Family Stadium Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Alumni Stadium Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Nippert Stadium Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN David Booth Kansas Memorial Stadium Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Gerald J. Ford Stadium Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU FAU Stadium Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Bobby Bowden Field at Doak Campbell Stadium Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Brooks Field at Wallace Wade Stadium Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Kroger Field Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Johnny (Red) Floyd Stadium Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Falcon Stadium Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ JPS Field at James L. Malone Stadium Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Sanford Stadium Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Davis Wade Stadium at Scott Field Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Vanderbilt Stadium Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Jack Trice Stadium Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Apogee Stadium Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Spectrum Stadium Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Memorial Stadium Saturday, October 3, 2020
32 Rice Marshall Postponed - Saturday, October 3, 2020
33 Troy South Alabama Postponed - Saturday, October 3, 2020
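Since the plan is to automate this for each week, the week number in the URL can be made a parameter. A minimal sketch, assuming the site keeps the same /2020/regular/<week>/ URL pattern and reusing the row-parsing logic from the answer above:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_week(week):
    # build the URL for the requested week and parse its schedule rows
    url = f'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/{week}/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    all_data = []
    for row in soup.select('.TableBase-bodyTr'):
        home = row.select_one('.TeamLogoNameLockup')
        away = home.find_next(class_='TeamLogoNameLockup')
        time = row.select_one('.CellGame')
        date = row.find_previous('h4')
        all_data.append({
            'home': home.get_text(strip=True),
            'away': away.get_text(strip=True),
            'time': time.get_text(strip=True, separator=' '),
            'date': date.get_text(strip=True),
        })
    return pd.DataFrame(all_data)

# example: collect weeks 1-5 into one DataFrame with a 'week' column
frames = [scrape_week(w).assign(week=w) for w in range(1, 6)]
season = pd.concat(frames, ignore_index=True)
season.to_csv('schedule.csv', index=False)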

Pandas: Combine and average data in a column based on length of month

I have a dataframe which consists of the department, year, invoice month, invoice date and value.
I have offset the invoice dates by business days. What I am now trying to achieve is to combine all the months that have the same number of working days (the 'count' of each month by year) and average the value for each day.
The data I have is as follows:
Department Year Month Invoice Date Value
0 Sales 2019 March 2019-03-25 1000.00
1 Sales 2019 March 2019-03-26 2000.00
2 Sales 2019 March 2019-03-27 3000.00
3 Sales 2019 March 2019-03-28 4000.00
4 Sales 2019 March 2019-03-29 5000.00
... ... ... ... ... ...
2435 Specialist 2020 August 2020-08-27 6000.00
2436 Specialist 2020 August 2020-08-28 7000.00
2437 Specialist 2020 September 2020-09-01 8000.00
2438 Specialist 2020 September 2020-09-02 9000.00
2439 Specialist 2020 September 2020-09-07 1000.00
The count of each month is as follows:
Year Month
2019 April 21
August 21
December 20
July 23
June 20
March 5
May 21
November 21
October 23
September 21
2020 April 21
August 20
February 20
January 22
July 23
June 22
March 22
May 19
September 5
My hope is that using this count I could aggregate the data from the original df and average for example April, August, May, November, September (2019) along with April (2020) as they all have 21 working days in the month.
Producing one dataframe with each day of the month an average of the months combined for each # of days.
I hope that makes sense.
Note: Please ignore the 5 days length, just incomplete data for those months...
Thank you
EDIT: I just realised that the days won't line up for each month, so my plan is to aggregate based on whether it is the first business day of the month, then the second, the third, etc., regardless of the actual date.
ALSO (SORRY): I was hoping it could be by department!
Department Month Length Day Number Average Value
0 Sales 21 1 20000
1 Sales 21 2 5541
2 Sales 21 3 87485
3 Sales 21 4 1863
4 Sales 21 5 48687
5 Sales 21 6 486996
6 Sales 21 7 892
7 Sales 21 8 985
8 Sales 21 9 14169
9 Sales 21 10 20000
10 Sales 21 11 5541
11 Sales 21 12 87485
12 Sales 21 13 1863
13 Sales 21 14 48687
14 Sales 21 15 486996
15 Sales 21 16 892
16 Sales 21 17 985
17 Sales 21 18 14169
......
To explain it a bit better: take Sales and all the months which have 21 working days in them; for each day in those 21-day months I am hoping to get the average of the value, producing a table like the one above.
So 'day 1' is an average of all 'day 1s' in the 21-day months (as seen in the count df). This will let me plot a line chart showing the average revenue value on each given day of a 21-day month. I hope this is a better explanation; apologies.
I am not really sure whether I understand your question. Maybe you could add an expected df to your question?
In the meantime, would this point you in the direction you are looking for:
import pandas as pd
from random import randint
from calendar import month_name
df = pd.DataFrame({'years': [randint(1998, 2020) for x in range(10000)],
                   'months': [month_name[randint(1, 12)] for x in range(10000)],
                   'days': [randint(1, 30) for x in range(10000)],
                   'revenue': [randint(0, 1000) for x in range(10000)]}
                  )
print(df.groupby(['months', 'days'])['revenue'].mean())
Output is:
months days
April 1 475.529412
2 542.870968
3 296.045455
4 392.416667
5 475.571429
...
September 26 516.888889
27 539.583333
28 513.500000
29 480.724138
30 456.500000
Name: revenue, Length: 360, dtype: float64
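The EDIT (grouping by department, month length in working days, and business-day position) is not directly covered by the answer above. Here is a minimal sketch of that grouping, assuming the column names from the question ('Department', 'Invoice Date', 'Value') and that each row is one business day's invoiced value:
import pandas as pd

# assuming df already holds the question's data
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])
df['Period'] = df['Invoice Date'].dt.to_period('M')

# position of each invoice date within its department/month (1 = first business day)
df['Day Number'] = (df.groupby(['Department', 'Period'])['Invoice Date']
                      .rank(method='dense').astype(int))
# number of distinct invoice days in that department/month
df['Month Length'] = (df.groupby(['Department', 'Period'])['Invoice Date']
                        .transform('nunique'))

# average the value for each day position across all months of equal length
result = (df.groupby(['Department', 'Month Length', 'Day Number'])['Value']
            .mean()
            .reset_index(name='Average Value'))
print(result)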

Web Scraping with BS4, how to set a range of where to look

I am trying to scrape the "Events" section of this wikipedia page: https://en.wikipedia.org/wiki/2020. The page does not have the easiest HTML to navigate as most of the tags are not nested, but are siblings.
I want to ensure that the only data I scrape is between the two h2 tags shown below.
Here is the condensed HTML:
<h2> #I ONLY WANT TO SEARCH BETWEEN HERE
<span id="Events">Events</span>
</h2>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>
<li>
<a title="June 17"</a> #My code below is looking for this, if not found it jumps to another section
</li>
</ul>
<h3>...</h3>
<ul>...</ul>
<h2> #AND HERE. DON'T WANT TO GO PAST HERE
<span id="Predicted_and_scheduled_events">Predicted_and_scheduled_events</span>
</h2>
If it's not clear, every tag (except for span) is a sibling. My code currently works if the date is present between the two h2 tags; however, if the date is not present it will go to another section of the page to pull data, which I do not want.
Here is my code:
import sys
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")
todaysNews = soup.find('a', {"title": "June 17"}) #goes to date's stories
BeautifulSoup has many useful functions and parameters; it is worth reading the whole documentation.
It has functions to get the parent element, the next siblings, elements which have any title, etc.
First I search for <span id="Events">Events</span>, then get its parent element <h2>, and that is the start of the data.
Next I iterate over next_siblings in a for loop until I reach an item whose name is h2, which is the end of the data.
Inside the loop I check all ul elements and search for direct li elements without nested li elements (recursive=False), and inside each li I get the first a which has a title with any text ({"title": True}).
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# found start of data `h2`
start = soup.find('span', {'id': 'Events'}).parent

# check sibling items
for item in start.next_siblings:
    # found end of data `h2`
    if item.name == 'h2':
        break
    if item.name == 'ul':
        # only direct `li` without nested `li`
        for li in item.find_all('li', recursive=False):
            # `a` which have `title`
            a = li.find('a', {'title': True})
            if a:
                print(a['title'])
Result:
January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
You can use a CSS selector with "," (selecting both the list items and the stopping <h2>) and then check the tag name.
The CSS selector h2:contains("Events") ~ ul > li selects all ul > li siblings of the <h2> which contains the string "Events".
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tag in soup.select('h2:contains("Events") ~ ul > li, h2:contains("Predicted and scheduled events")'):
    if tag.name == 'li':
        print(tag.a.text)
    else:
        break
Prints:
January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
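A side note: newer versions of soupsieve (the library Beautiful Soup uses for CSS selection) deprecate the :contains() pseudo-class in favour of :-soup-contains(), so the equivalent selector there should be:
soup.select('h2:-soup-contains("Events") ~ ul > li, h2:-soup-contains("Predicted and scheduled events")')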

Pandas - How to get index values from a dataframe

I have a pandas dataframe of the form
Start Date End Date President Party
0 04 March 1921 02 August 1923 Warren G Harding Republican
1 03 August 1923 04 March 1929 Calvin Coolidge Republican
2 05 March 1929 04 March 1933 Herbert Hoover Republican
3 05 March 1933 12 April 1945 Franklin D Roosevelt Democratic
4 13 April 1945 20 January 1953 Harry S Truman Democratic
5 21 January 1953 20 January 1961 Dwight D Eisenhower Republican
6 21 January 1961 22 November 1963 John F Kennedy Democratic
7 23 November 1963 20 January 1969 Lydon B Johnson Democratic
8 21 January 1969 09 August 1974 Richard Nixon Republican
9 10 August 1974 20 January 1977 Gerald Ford Republican
10 21 January 1977 20 January 1981 Jimmy Carter Democratic
11 21 January 1981 20 January 1989 Ronald Reagan Republican
12 21 January 1989 20 January 1993 George H W Bush Republican
13 21 January 1993 20 January 2001 Bill Clinton Democratic
14 21 January 2001 20 January 2009 George W Bush Republican
15 21 January 2009 20 January 2017 Barack Obama Democratic
16 21 January 2017 20 May 2017 Donald Trump Republican
I want to extract the index values for Party=Republican and store them in a list.
Is there a Pandas function to do this quickly?
df.index[df.Party == 'Republican']
You can call .tolist() on the result if you want.
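For example, a small sketch reusing the dataframe and column name from the question:
republican_rows = df.index[df['Party'] == 'Republican'].tolist()
print(republican_rows)  # [0, 1, 2, 5, 8, 9, 11, 12, 14, 16] for the data above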
