I want to merge or join two DataFrames based on a date condition: join each Completed date with every Start date that is earlier than or equal to it. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And the desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
Check out merge, which gives you the expected output:
(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('Complted_date >= Start_date')
    .drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
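On pandas 1.2+, merge also supports how='cross' directly, so the helper key isn't needed (a minimal sketch of the same idea):
(df1.merge(df2, how='cross')
    .query('Complted_date >= Start_date')
)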
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
              right_on='Complted_date',
              left_on='Start_date',
              direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
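direction='forward' pairs each Start_date with the nearest Complted_date greater than or equal to it, which is why 2021 gets NaN and the column is upcast to float. If you'd rather drop the unmatched row and restore the integer dtype, a small follow-up sketch:
res = pd.merge_asof(df2, df1,
                    right_on='Complted_date',
                    left_on='Start_date',
                    direction='forward')
res = res.dropna(subset=['Complted_date']).astype({'Complted_date': int})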
You can do a cross join and pick the records which have Complted_date >= Start_date.
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', axis=1)
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
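Note that the surviving rows keep their index from the cross join (0, 1, 2, 3, 10, ...); chain .reset_index(drop=True) onto res if you want consecutive row numbers.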
Here is another way, using pd.Series() and explode():
# put the whole list of start dates in the first row, then forward-fill it to every row
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
# expand the lists into rows, keeping pairs where the completed date is >= the start date
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
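A variant of the same idea that avoids mutating df1 in place (a sketch, assuming the frames from the question):
(df1.assign(Start_date=[df2['Start_date'].tolist()] * len(df1))
    .explode('Start_date')
    .query('Complted_date >= Start_date')
    .reset_index(drop=True)
)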
You could use conditional_join from pyjanitor to get rows where Complted_date is >= Start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just a binary search (searchsorted); the aim is to avoid a cartesian join and, hopefully, reduce memory usage.
I am currently web scraping the college football schedule by week.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
time = [c.text.replace("\n", "").replace(' ','').replace(' ',' ') for c in soup.find_all('div', class_='CellGame')]
import pandas as pd
schedule = pd.DataFrame({
    'away': away,
    'home': home,
    'time': time,
})
schedule
I would like a date column. I am having difficulty extracting the date, duplicating it to match the number of games on that date, and appending it to a Python list.
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        date.append(a.text.replace('\n \n ', '').replace('\n \n ', ''))
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
Dates are like headers for each table. I would like each date to correspond to the correct game, and to also include "Postponed" for the postponed games.
My plan is to automate this code for each week.
Thanks ahead.
Comment posted after the answer below:
Beautiful and well done. How would I pull venues, especially with the postponed games, using your code?
My original code was:
venue = [v.text.replace('\n', '').replace(' ', '').replace(' ', '').strip('—').strip()
         for v in soup.find_all('td', text=lambda x: x and "Field" or x and 'Stadium' in x)
         if v != '']
venues = [x for x in venue if x]
missing = len(away) - len(venues)
words = ['Postponed' for x in range(missing) if len(away) > len(venues)]
venues = venues + words
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time date
0 Campbell Wake Forest WAKE 66 - CAMP 14 Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 10 - 2nd ESPU Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 17, ARKST 14 - 2nd ESP2 Saturday, October 3, 2020
4 Missouri Tennessee TENN 21, MIZZOU 6 - 2nd SECN Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 2nd ABC Saturday, October 3, 2020
6 TCU Texas TCU 14, TEXAS 14 - 2nd FOX Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 10 - 2nd ACCN Saturday, October 3, 2020
8 South Carolina Florida FLA 17, SC 14 - 2nd ESPN Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 7, TXSA 3 - 2nd Saturday, October 3, 2020
10 North Alabama Liberty NAL 0, LIB 0 - 1st ESP3 Saturday, October 3, 2020
11 Abil Christian Army 1:30 pm CBSSN Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Saturday, October 3, 2020
32 Rice Marshall Postponed Saturday, October 3, 2020
33 Troy South Alabama Postponed Saturday, October 3, 2020
It also saves the data to data.csv.
EDIT: To parse the "Venue" column, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time venue date
0 Campbell Wake Forest WAKE 66 - CAMP 14 - Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 - Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 13 - 3rd ESPU Center Parc Stadium Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 31, ARKST 14 - 3rd ESP2 Brooks Stadium Saturday, October 3, 2020
4 Missouri Tennessee TENN 28, MIZZOU 6 - 3rd SECN Neyland Stadium Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 3rd ABC Mountaineer Field at Milan Puskar Stadium Saturday, October 3, 2020
6 TCU Texas TCU 20, TEXAS 14 - 2nd FOX DKR-Texas Memorial Stadium Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 13 - 3rd ACCN Heinz Field Saturday, October 3, 2020
8 South Carolina Florida FLA 31, SC 14 - 3rd ESPN Florida Field at Ben Hill Griffin Stadium Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 14, TXSA 6 - 2nd Legion Field Saturday, October 3, 2020
10 North Alabama Liberty LIB 7, NAL 0 - 2nd ESP3 Williams Stadium Saturday, October 3, 2020
11 Abil Christian Army ARMY 7, ABIL 0 - 1st CBSSN Blaik Field at Michie Stadium Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Bryant-Denny Stadium Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Bill Snyder Family Stadium Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Alumni Stadium Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Nippert Stadium Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN David Booth Kansas Memorial Stadium Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Gerald J. Ford Stadium Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU FAU Stadium Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Bobby Bowden Field at Doak Campbell Stadium Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Brooks Field at Wallace Wade Stadium Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Kroger Field Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Johnny (Red) Floyd Stadium Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Falcon Stadium Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ JPS Field at James L. Malone Stadium Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Sanford Stadium Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Davis Wade Stadium at Scott Field Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Vanderbilt Stadium Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Jack Trice Stadium Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Apogee Stadium Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Spectrum Stadium Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Memorial Stadium Saturday, October 3, 2020
32 Rice Marshall Postponed - Saturday, October 3, 2020
33 Troy South Alabama Postponed - Saturday, October 3, 2020
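Since you mentioned automating this for each week: the week number is the last path segment of the URL, so you can wrap the same parsing loop in a for loop over weeks (a sketch, assuming the URL pattern holds for the other weeks):
import requests
import pandas as pd
from bs4 import BeautifulSoup

all_data = []
for week in range(1, 17):  # adjust to the weeks you need
    url = f'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/{week}/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for row in soup.select('.TableBase-bodyTr'):
        home = row.select_one('.TeamLogoNameLockup')
        away = home.find_next(class_='TeamLogoNameLockup')
        time = row.select_one('.CellGame')
        date = row.find_previous('h4')
        all_data.append({
            'week': week,
            'home': home.get_text(strip=True),
            'away': away.get_text(strip=True),
            'time': time.get_text(strip=True, separator=' '),
            'date': date.get_text(strip=True),
        })
df = pd.DataFrame(all_data)
df.to_csv('data.csv', index=False)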
I have a dataframe which consists of departments, year, the month of invoice, the invoice date and the value.
I have offset the Invoice dates by business days, and now what I am trying to achieve is to combine all the months that have the same number of working days (the 'count' of each month by year) and average the value for each day.
The data I have is as follows:
Department Year Month Invoice Date Value
0 Sales 2019 March 2019-03-25 1000.00
1 Sales 2019 March 2019-03-26 2000.00
2 Sales 2019 March 2019-03-27 3000.00
3 Sales 2019 March 2019-03-28 4000.00
4 Sales 2019 March 2019-03-29 5000.00
... ... ... ... ... ...
2435 Specialist 2020 August 2020-08-27 6000.00
2436 Specialist 2020 August 2020-08-28 7000.00
2437 Specialist 2020 September 2020-09-01 8000.00
2438 Specialist 2020 September 2020-09-02 9000.00
2439 Specialist 2020 September 2020-09-07 1000.00
The count of each month is as follows:
Year Month
2019 April 21
August 21
December 20
July 23
June 20
March 5
May 21
November 21
October 23
September 21
2020 April 21
August 20
February 20
January 22
July 23
June 22
March 22
May 19
September 5
My hope is that, using this count, I could aggregate the data from the original df and average, for example, April, August, May, November, September (2019) along with April (2020), as they all have 21 working days in the month.
This would produce one dataframe where each day of the month is an average over the combined months with the same number of days.
I hope that makes sense.
Note: please ignore the months with a length of 5; that's just incomplete data for those months...
Thank you
EDIT: I just realised that the days won't line up for each month, so my plan is to aggregate based on whether it's the first business day of the month, then the second, the third, etc., regardless of the actual date.
ALSO (SORRY): I was hoping it could be by department!
Department Month Length Day Number Average Value
0 Sales 21 1 20000
1 Sales 21 2 5541
2 Sales 21 3 87485
3 Sales 21 4 1863
4 Sales 21 5 48687
5 Sales 21 6 486996
6 Sales 21 7 892
7 Sales 21 8 985
8 Sales 21 9 14169
9 Sales 21 10 20000
10 Sales 21 11 5541
11 Sales 21 12 87485
12 Sales 21 13 1863
13 Sales 21 14 48687
14 Sales 21 15 486996
15 Sales 21 16 892
16 Sales 21 17 985
17 Sales 21 18 14169
......
So, to explain it a bit better: let's take Sales and all the months which have 21 days in them; for each day in those 21-day months I am hoping to get the average of the value, producing a table like the one above.
So 'day 1' is an average of all 'day 1s' in the 21-day months (as seen in the count df). This is to allow me to plot a line chart profile showing the average revenue value on each given day of a 21-day month. I hope this is a bit of a better explanation, apologies.
I am not really sure whether I understand your question. Maybe you could add an expected df to your question?
In the meantime, would this point you in the direction you are looking for:
import pandas as pd
from random import randint
from calendar import month_name
df = pd.DataFrame({'years': [randint(1998, 2020) for x in range(10000)],
                   'months': [month_name[randint(1, 12)] for x in range(10000)],
                   'days': [randint(1, 30) for x in range(10000)],
                   'revenue': [randint(0, 1000) for x in range(10000)]})
print(df.groupby(['months', 'days'])['revenue'].mean())
Output is:
months days
April 1 475.529412
2 542.870968
3 296.045455
4 392.416667
5 475.571429
...
September 26 516.888889
27 539.583333
28 513.500000
29 480.724138
30 456.500000
Name: revenue, Length: 360, dtype: float64
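If I read your edit correctly, you want each invoice numbered by its business-day position within its month, then averaged across all months of the same length, per department. Here is a sketch against the frame from your question (assuming it is named df and has columns Department, Year, Month, 'Invoice Date' and Value, with one row per business day):
import pandas as pd

df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])
# 1-based business-day position of each invoice within its month
df['Day Number'] = (df.groupby(['Department', 'Year', 'Month'])['Invoice Date']
                      .rank(method='dense').astype(int))
# number of invoiced business days in that month
df['Month Length'] = (df.groupby(['Department', 'Year', 'Month'])['Day Number']
                        .transform('max'))
# average across all months of the same length, per department and day number
res = (df.groupby(['Department', 'Month Length', 'Day Number'], as_index=False)['Value']
         .mean()
         .rename(columns={'Value': 'Average Value'}))
print(res)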
I am trying to scrape the "Events" section of this wikipedia page: https://en.wikipedia.org/wiki/2020. The page does not have the easiest HTML to navigate as most of the tags are not nested, but are siblings.
I want to ensure that the only data I scrape is between the two h2 tags shown below.
Here is the condensed HTML:
<h2> <!-- I ONLY WANT TO SEARCH BETWEEN HERE -->
<span id="Events">Events</span>
</h2>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>
<li>
<a title="June 17"></a> <!-- My code below is looking for this; if not found it jumps to another section -->
</li>
</ul>
<h3>...</h3>
<ul>...</ul>
<h2> <!-- AND HERE. DON'T WANT TO GO PAST HERE -->
<span id="Predicted_and_scheduled_events">Predicted_and_scheduled_events</span>
</h2>
If it's not clear: every tag (except for span) is a sibling. My code currently works if the date is present between the two h2 tags; however, if the date is not present, it will go to another section of the page to pull data, which I do not want.
Here is my code:
import sys
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")
todaysNews = soup.find('a', {"title": "June 17"}) #goes to date's stories
BeautifulSoup has many useful functions and parameters; it is worth reading the whole documentation.
It has functions to get the parent element, the next siblings, elements which have any title, etc.
First I search for <span id="Events">Events</span>, then I get its parent element <h2>, and I have the start of the data.
Next I iterate over next_siblings in a for loop until I get an item with the name h2, which is the end of the data.
In the for loop I check all ul elements and search for direct li elements without nested li elements (recursive=False), and inside each li I get the first a which has a title with any text ({"title": True}).
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# find the start of the data: the parent `h2` of the Events span
start = soup.find('span', {'id': 'Events'}).parent

# check sibling items
for item in start.next_siblings:
    # found the end of the data: the next `h2`
    if item.name == 'h2':
        break
    if item.name == 'ul':
        # only direct `li`, without nested `li`
        for li in item.find_all('li', recursive=False):
            # the first `a` which has a `title`
            a = li.find('a', {'title': True})
            if a:
                print(a['title'])
Result:
January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
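If you later need the full event text rather than just the date titles, the same loop works with li.get_text(' ', strip=True) instead of reading a['title'] (just an alternative, depending on what you want to collect).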
You can use a CSS selector with "," and then check the tag name.
The CSS selector h2:contains("Events") ~ ul > li selects all ul > li siblings of the <h2> which contains the string "Events".
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tag in soup.select('h2:contains("Events") ~ ul > li, h2:contains("Predicted and scheduled events")'):
    if tag.name == 'li':
        print(tag.a.text)
    else:
        break
Prints:
January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
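One caveat: :contains() is a non-standard selector, and newer versions of soupsieve (the selector engine BeautifulSoup uses) deprecate it in favour of :-soup-contains(), so you may need to swap the name depending on your installed version.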