I am somewhat new to coding in Pandas and I have what I think to be a simple problem that I can't find an answer to. I have a list of students, the college they went to and what year they entered college.
Name     College    Year
Mary     Princeton  2017
Joe      Harvard    2018
Bill     Princeton  2016
Louise   Princeton  2020
Michael  Harvard    2019
Penny    Yale       2018
Harry    Yale       2015
I need the data to be ordered by year but grouped by college. However, if I sort by year alone, the years are in order but the colleges aren't together; if I sort by college alone, the colleges are together in alphabetical order but the years aren't sorted. Similarly, sorting by year then college doesn't keep the colleges together, and sorting by college then year doesn't guarantee that the college with the most recent year comes first. What I want the table to look like is:
Name     College    Year
Louise   Princeton  2020
Mary     Princeton  2017
Bill     Princeton  2016
Michael  Harvard    2019
Joe      Harvard    2018
Penny    Yale       2018
Harry    Yale       2015
So Princeton is first because it has the most recent year, and all the Princeton rows are together. Then Harvard comes next because its most recent year, 2019, is greater than 2018, the most recent year for Yale, so its two rows follow. Yale comes last since 2020 > 2019 > 2018. I appreciate all your ideas and help! Thank you!
Add a temporary extra column with the max year per group and sort on multiple columns:
out = (df
       .assign(max_year=df.groupby('College')['Year'].transform('max'))
       .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
       .drop(columns='max_year')
       )
Output:

      Name    College  Year
3   Louise  Princeton  2020
0     Mary  Princeton  2017
2     Bill  Princeton  2016
4  Michael    Harvard  2019
1      Joe    Harvard  2018
5    Penny       Yale  2018
6    Harry       Yale  2015
With the temporary column:

      Name    College  Year  max_year
3   Louise  Princeton  2020      2020
0     Mary  Princeton  2017      2020
2     Bill  Princeton  2016      2020
4  Michael    Harvard  2019      2019
1      Joe    Harvard  2018      2019
5    Penny       Yale  2018      2018
6    Harry       Yale  2015      2018
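An equivalent approach, if you'd rather avoid the helper column entirely, is to compute the group order first and then concatenate the groups. A minimal sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mary", "Joe", "Bill", "Louise", "Michael", "Penny", "Harry"],
    "College": ["Princeton", "Harvard", "Princeton", "Princeton", "Harvard", "Yale", "Yale"],
    "Year": [2017, 2018, 2016, 2020, 2019, 2018, 2015],
})

# Colleges ordered by their most recent year, descending
order = df.groupby("College")["Year"].max().sort_values(ascending=False).index

# Concatenate each college's rows, sorted by year within the group
out = pd.concat(
    [df[df["College"] == c].sort_values("Year", ascending=False) for c in order],
    ignore_index=True,
)
print(out)
```

This makes the group ordering explicit at the cost of one pass per college, which is fine for small frames.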
You first want to sort by "College" then "Year", then keep the "College" values together using .groupby:
import pandas as pd
data = [
    ["Mary", "Princeton", 2017],
    ["Joe", "Harvard", 2018],
    ["Bill", "Princeton", 2016],
    ["Louise", "Princeton", 2020],
    ["Michael", "Harvard", 2019],
    ["Penny", "Yale", 2018],
    ["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()
You'd get this output (.head() with no argument keeps up to five rows per group, so every group here is kept in full):
Name     College    Year
Penny    Yale       2018
Harry    Yale       2015
Louise   Princeton  2020
Mary     Princeton  2017
Bill     Princeton  2016
Michael  Harvard    2019
Joe      Harvard    2018
You will have to first find the maximum among each group and set that as a column.
You can then sort by values based on max and year.
df = pd.read_table('./table.txt')
df["max"] = df.groupby("College")["Year"].transform("max")
df.sort_values(by=["max", "Year"], ascending=False).drop(columns="max").reset_index(drop=True)
Output:
Name College Year
0 Louise Princeton 2020
1 Mary Princeton 2017
2 Bill Princeton 2016
3 Michael Harvard 2019
4 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
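One caveat with sorting only on ["max", "Year"]: if two colleges happen to share the same maximum year, their rows can interleave. Adding "College" as a tiebreaker keeps each group contiguous. A small sketch with made-up data that has such a tie:

```python
import pandas as pd

# Toy data where Yale and Brown share the same most recent year (2020)
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "College": ["Yale", "Brown", "Yale", "Brown"],
    "Year": [2020, 2020, 2018, 2019],
})

df["max"] = df.groupby("College")["Year"].transform("max")

# "College" breaks the tie on "max" so each group's rows stay together
out = (df.sort_values(by=["max", "College", "Year"], ascending=[False, True, False])
         .drop(columns="max")
         .reset_index(drop=True))
print(out)
```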
I'm trying to combine two pandas DataFrames to update the first one based on criteria from the second. Here is a sample of the two dataframes:
df1

           state                candidate
year
2016  CALIFORNIA         CLINTON, HILLARY
2016  CALIFORNIA         TRUMP, DONALD J.
2016  CALIFORNIA            JOHNSON, GARY
2016  CALIFORNIA              STEIN, JILL
2016  CALIFORNIA                 WRITE-IN
2016  CALIFORNIA  LA RIVA, GLORIA ESTELLA
2016       TEXAS         TRUMP, DONALD J.
2016       TEXAS         CLINTON, HILLARY
2016       TEXAS            JOHNSON, GARY
2016       TEXAS              STEIN, JILL
...
1988  CALIFORNIA        BUSH, GEORGE H.W.
1988  CALIFORNIA         DUKAKIS, MICHAEL
1988  CALIFORNIA     PAUL, RONALD ""RON""
1988  CALIFORNIA           FULANI, LENORA
1988       TEXAS        BUSH, GEORGE H.W.
1988       TEXAS         DUKAKIS, MICHAEL
1988       TEXAS     PAUL, RONALD ""RON""
1988       TEXAS           FULANI, LENORA
df2

           state  electoral_votes
year
1988  CALIFORNIA               47
1988       TEXAS               29
...
2016  CALIFORNIA               55
2016       TEXAS               38
There are values for every election year from 2020 to 1972 that includes all candidates and all states in a similar format. There are other columns in df1 but they aren't relevant to what I'm trying to do.
My expected result is:
           state                candidate  electoral_votes
year
2016  CALIFORNIA         CLINTON, HILLARY               55
2016  CALIFORNIA         TRUMP, DONALD J.               55
2016  CALIFORNIA            JOHNSON, GARY               55
2016  CALIFORNIA              STEIN, JILL               55
2016  CALIFORNIA                 WRITE-IN               55
2016  CALIFORNIA  LA RIVA, GLORIA ESTELLA               55
2016       TEXAS         TRUMP, DONALD J.               38
2016       TEXAS         CLINTON, HILLARY               38
2016       TEXAS            JOHNSON, GARY               38
2016       TEXAS              STEIN, JILL               38
...
1988  CALIFORNIA        BUSH, GEORGE H.W.               47
1988  CALIFORNIA         DUKAKIS, MICHAEL               47
1988  CALIFORNIA     PAUL, RONALD ""RON""               47
1988  CALIFORNIA           FULANI, LENORA               47
1988       TEXAS        BUSH, GEORGE H.W.               29
1988       TEXAS         DUKAKIS, MICHAEL               29
1988       TEXAS     PAUL, RONALD ""RON""               29
1988       TEXAS           FULANI, LENORA               29
I want to match up the electoral_votes column in df2 with the year and state columns in df1 so it fills in the correct value. I got some assistance and was able to match it up when there was only one column being matched (you can see the question and answer here), but I am having trouble matching on the two points of reference (year and state). If I use the linked code as is, it returns the error:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have tried apply, map, applymap, merge, etc and haven't been able to figure it out. Thanks in advance for the help!
I believe what you are looking for is a left merge. You should specify the common columns the merge should be based on within on=[...].
# Imports
import pandas as pd

# Specify two columns in the "on".
pd.merge(df1,
         df2,
         how='left',
         on=['year', 'state'])
Output:
year state candidate votes
0 2016 CALIFORNIA CLINTON, HILLARY 55
1 2016 CALIFORNIA TRUMP, DONALD J. 55
2 2016 CALIFORNIA JOHNSON, GARY 55
3 2016 CALIFORNIA STEIN, JILL 55
4 2016 CALIFORNIA WRITE-IN 55
5 2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
6 2016 TEXAS TRUMP, DONALD J. 38
7 2016 TEXAS CLINTON, HILLARY 38
8 2016 TEXAS JOHNSON, GARY 38
9 2016 TEXAS STEIN, JILL 38
10 1988 CALIFORNIA BUSH, GEORGE H.W. 47
11 1988 CALIFORNIA DUKAKIS, MICHAEL 47
12 1988 CALIFORNIA PAUL, RONALD ""RON"" 47
13 1988 CALIFORNIA FULANI, LENORA 47
14 1988 TEXAS BUSH, GEORGE H.W. 29
15 1988 TEXAS DUKAKIS, MICHAEL 29
16 1988 TEXAS PAUL, RONALD ""RON"" 29
17 1988 TEXAS FULANI, LENORA 29
The above code could be written as:
pd.merge(df1,
         df2,
         how='left',
         left_on=['year', 'state'],
         right_on=['year', 'state'])
but since the columns are named the same in the two DataFrames, we can use on=['year', 'state'].
An alternate way to write it:
merged_df = df1.merge(df2, on=['year', 'state'], how='left')
If you want to load only three columns for df1:
df1 = pd.read_csv('<name_of_the_CSV_file>', usecols=['year', 'state', 'candidate'])
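If df2 ever contains duplicate (year, state) rows, a left merge would silently duplicate rows of df1. pandas' merge supports validate= to fail fast in that case, and indicator=True to flag rows that found no match. A sketch on a cut-down version of the data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "year": [2016, 2016, 1988],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA"],
    "candidate": ["CLINTON, HILLARY", "TRUMP, DONALD J.", "BUSH, GEORGE H.W."],
})
df2 = pd.DataFrame({
    "year": [2016, 2016, 1988],
    "state": ["CALIFORNIA", "TEXAS", "CALIFORNIA"],
    "electoral_votes": [55, 38, 47],
})

# validate='many_to_one' raises if (year, state) is not unique in df2;
# indicator=True adds a _merge column showing which rows matched
merged = df1.merge(df2, on=["year", "state"], how="left",
                   validate="many_to_one", indicator=True)
print(merged)
```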
I am currently web scraping the college football schedule by week.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
time = [c.text.replace("\n", "").replace(' ', '').replace(' ', ' ') for c in soup.find_all('div', class_='CellGame')]

import pandas as pd

schedule = pd.DataFrame({
    'away': away,
    'home': home,
    'time': time,
})
schedule
I would like a date column. I am having difficulty extracting the date, duplicating it to match the number of games on that date, and appending it to a Python list.
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        date.append(a.text.replace('\n \n ', '').replace('\n \n ', ''))
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
Dates are like headers for each table. I would like each date to correspond to the correct game, and also to include "Postponed" for the postponed games.
My plan is to automate this code for each week.
Thanks ahead.
Edit, after the answer was posted:
Beautiful and well done. How would I pull venues, especially with postponed games, using your code?
My original code was:
venue = [v.text.replace('\n', '').strip('—').strip()
         for v in soup.find_all('td', text=lambda x: x and ('Field' in x or 'Stadium' in x))
         if v != '']
venues = [x for x in venue if x]
missing = len(away) - len(venues)
words = ['Postponed' for x in range(missing) if len(away) > len(venues)]
venues = venues + words
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time date
0 Campbell Wake Forest WAKE 66 - CAMP 14 Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 10 - 2nd ESPU Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 17, ARKST 14 - 2nd ESP2 Saturday, October 3, 2020
4 Missouri Tennessee TENN 21, MIZZOU 6 - 2nd SECN Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 2nd ABC Saturday, October 3, 2020
6 TCU Texas TCU 14, TEXAS 14 - 2nd FOX Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 10 - 2nd ACCN Saturday, October 3, 2020
8 South Carolina Florida FLA 17, SC 14 - 2nd ESPN Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 7, TXSA 3 - 2nd Saturday, October 3, 2020
10 North Alabama Liberty NAL 0, LIB 0 - 1st ESP3 Saturday, October 3, 2020
11 Abil Christian Army 1:30 pm CBSSN Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Saturday, October 3, 2020
32 Rice Marshall Postponed Saturday, October 3, 2020
33 Troy South Alabama Postponed Saturday, October 3, 2020
And saves data.csv.
EDIT: To parse the "Venue" column, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    date = row.find_previous('h4')
    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
home away time venue date
0 Campbell Wake Forest WAKE 66 - CAMP 14 - Friday, October 2, 2020
1 Louisiana Tech BYU BYU 45 - LATECH 14 - Friday, October 2, 2020
2 East Carolina Georgia St. GAST 35, ECU 13 - 3rd ESPU Center Parc Stadium Saturday, October 3, 2020
3 Arkansas St. Coastal Carolina CSTCAR 31, ARKST 14 - 3rd ESP2 Brooks Stadium Saturday, October 3, 2020
4 Missouri Tennessee TENN 28, MIZZOU 6 - 3rd SECN Neyland Stadium Saturday, October 3, 2020
5 Baylor West Virginia BAYLOR 7, WVU 7 - 3rd ABC Mountaineer Field at Milan Puskar Stadium Saturday, October 3, 2020
6 TCU Texas TCU 20, TEXAS 14 - 2nd FOX DKR-Texas Memorial Stadium Saturday, October 3, 2020
7 NC State Pittsburgh NCST 17, PITT 13 - 3rd ACCN Heinz Field Saturday, October 3, 2020
8 South Carolina Florida FLA 31, SC 14 - 3rd ESPN Florida Field at Ben Hill Griffin Stadium Saturday, October 3, 2020
9 UT-San Antonio UAB UAB 14, TXSA 6 - 2nd Legion Field Saturday, October 3, 2020
10 North Alabama Liberty LIB 7, NAL 0 - 2nd ESP3 Williams Stadium Saturday, October 3, 2020
11 Abil Christian Army ARMY 7, ABIL 0 - 1st CBSSN Blaik Field at Michie Stadium Saturday, October 3, 2020
12 Texas A&M Alabama 3:30 pm Bryant-Denny Stadium Saturday, October 3, 2020
13 Texas Tech Kansas St. 3:30 pm FS1 Bill Snyder Family Stadium Saturday, October 3, 2020
14 North Carolina Boston College 3:30 pm ABC Alumni Stadium Saturday, October 3, 2020
15 South Florida Cincinnati 3:30 pm ESP+ Nippert Stadium Saturday, October 3, 2020
16 Oklahoma St. Kansas 3:30 pm ESPN David Booth Kansas Memorial Stadium Saturday, October 3, 2020
17 Memphis SMU 3:30 pm ESP2 Gerald J. Ford Stadium Saturday, October 3, 2020
18 Charlotte FAU 4:00 pm ESPU FAU Stadium Saturday, October 3, 2020
19 Jacksonville St. Florida St. 4:00 pm Bobby Bowden Field at Doak Campbell Stadium Saturday, October 3, 2020
20 Virginia Tech Duke 4:00 pm ACCN Brooks Field at Wallace Wade Stadium Saturday, October 3, 2020
21 Ole Miss Kentucky 4:00 pm SECN Kroger Field Saturday, October 3, 2020
22 W. Kentucky Middle Tenn. 5:00 pm ESP3 Johnny (Red) Floyd Stadium Saturday, October 3, 2020
23 Navy Air Force 6:00 pm CBSSN Falcon Stadium Saturday, October 3, 2020
24 Ga. Southern UL-Monroe 7:00 pm ESP+ JPS Field at James L. Malone Stadium Saturday, October 3, 2020
25 Auburn Georgia 7:30 pm ESPN Sanford Stadium Saturday, October 3, 2020
26 Arkansas Miss. State 7:30 pm SECN Davis Wade Stadium at Scott Field Saturday, October 3, 2020
27 LSU Vanderbilt 7:30 pm SECN Vanderbilt Stadium Saturday, October 3, 2020
28 Oklahoma Iowa St. 7:30 pm ABC Jack Trice Stadium Saturday, October 3, 2020
29 So. Miss North Texas 7:30 pm Apogee Stadium Saturday, October 3, 2020
30 Tulsa UCF 7:30 pm ESP2 Spectrum Stadium Saturday, October 3, 2020
31 Virginia Clemson 8:00 pm ACCN Memorial Stadium Saturday, October 3, 2020
32 Rice Marshall Postponed - Saturday, October 3, 2020
33 Troy South Alabama Postponed - Saturday, October 3, 2020
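For the planned week-by-week automation, it may also help to convert the text dates into real datetimes for sorting or filtering. A sketch of the conversion, assuming the date strings keep the "Friday, October 2, 2020" format shown above:

```python
import pandas as pd

df = pd.DataFrame({"date": ["Friday, October 2, 2020", "Saturday, October 3, 2020"]})

# Parse strings like "Friday, October 2, 2020" into datetime64 values
df["date"] = pd.to_datetime(df["date"], format="%A, %B %d, %Y")
print(df["date"].dt.date.tolist())
```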
I have a pandas dataframe with three columns:
a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Atlanta    2019
Ozuna      St. Louis  2019
Torres     New York   2019
I want to identify all column a names that have more than one column c value, and then replace all of their column b instances with the first value in the dataframe, like this:
a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Minnesota  2019
Ozuna      Atlanta    2019
Torres     New York   2019
This is definitely inefficient, but here's what I tried so far:
# get a df of just names and cities and deduplicate
df_names = df[['a', 'b']].drop_duplicates()

# find any multiple column b values and put them in a list
a_matches = pd.DataFrame(df_names.groupby('a')['b'].nunique())
multi_b = a_matches.index[a_matches['b'] > 1].tolist()
This gives me ['Donaldson','Ozuna'], but now I am stuck. I can't come up with a good way to generate a replacement dictionary for their corresponding values in c. I think there must be a more elegant way to get to this.
IIUC, you can try groupby + transform with np.where:
import numpy as np

g = df.groupby('a')
c = g['c'].transform('nunique').gt(1)  # column a names that have >1 column c value
df['b'] = np.where(c, g['b'].transform('first'), df['b'])
# for a new df: new = df.assign(b=np.where(c, g['b'].transform('first'), df['b']))
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
For the given example, as @ALloz correctly pointed out, you can just use:
df['b'] = df.groupby('a')['b'].transform('first')
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
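One caveat: transform('first') broadcasts whichever row happens to appear first in the frame, which works here only because each name's most recent year comes first. If that ordering is not guaranteed, sorting by column c first makes it explicit. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["Donaldson", "Ozuna", "Betts", "Donaldson", "Ozuna", "Torres"],
    "b": ["Minnesota", "Atlanta", "Boston", "Atlanta", "St. Louis", "New York"],
    "c": [2020, 2020, 2019, 2019, 2019, 2019],
})

# Sort so each group's most recent year comes first, then broadcast that row's b;
# the assignment realigns the result to df's original index
df["b"] = df.sort_values("c", ascending=False).groupby("a")["b"].transform("first")
print(df)
```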