I am somewhat new to coding in pandas and I have what I think is a simple problem that I can't find an answer to. I have a list of students, the college they went to, and the year they entered college.
Name     College    Year
Mary     Princeton  2017
Joe      Harvard    2018
Bill     Princeton  2016
Louise   Princeton  2020
Michael  Harvard    2019
Penny    Yale       2018
Harry    Yale       2015
I need the data to be ordered by year but grouped by college. However, if I sort by year I get the years in order but the colleges are not together, and if I sort by college I get the colleges together in alphabetical order but the years out of order. Similarly, if I sort by year then college I won't get the colleges together, and if I sort by college then year I can't guarantee that the most recent year comes first. What I want the table to look like is:
Name     College    Year
Louise   Princeton  2020
Mary     Princeton  2017
Bill     Princeton  2016
Michael  Harvard    2019
Joe      Harvard    2018
Penny    Yale       2018
Harry    Yale       2015
So we see Princeton is first because it has the most recent year (2020), and all the Princeton rows are together. Then Harvard is next, because its most recent year, 2019, beats 2018, the most recent year for Yale, so the two Harvard rows follow. Yale comes last, since 2020 > 2019 > 2018. I appreciate all your ideas and help! Thank you!
Add a temporary extra column with the max year per group and sort on multiple columns:
out = (df
       # broadcast each college's most recent year to all of its rows
       .assign(max_year=df.groupby('College')['Year'].transform('max'))
       # newest group first, colleges kept together, newest year first within each
       .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
       .drop(columns='max_year')
)
output:
Name College Year
3 Louise Princeton 2020
0 Mary Princeton 2017
2 Bill Princeton 2016
4 Michael Harvard 2019
1 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
with temporary column:
Name College Year max_year
3 Louise Princeton 2020 2020
0 Mary Princeton 2017 2020
2 Bill Princeton 2016 2020
4 Michael Harvard 2019 2019
1 Joe Harvard 2018 2019
5 Penny Yale 2018 2018
6 Harry Yale 2015 2018
You first want to sort by "College" then "Year", then keep the "College" values together by using .groupby:
import pandas as pd
data = [
["Mary", "Princeton", 2017],
["Joe", "Harvard", 2018],
["Bill", "Princeton", 2016],
["Louise", "Princeton", 2020],
["Michael", "Harvard", 2019],
["Penny", "Yale", 2018],
["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()
You'd get this output:
Name College Year
Penny Yale 2018
Harry Yale 2015
Louise Princeton 2020
Mary Princeton 2017
Bill Princeton 2016
Michael Harvard 2019
Joe Harvard 2018
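Note that sorting by "College" descending puts Yale first alphabetically, rather than Princeton first by most recent year as in the desired output. If the groups should be ordered by their most recent year, one option is a stable two-pass sort (a sketch, not part of the original answer; sort_values(key=...) needs pandas 1.1+):

# Sketch: sort rows by year (newest first), then stably sort whole colleges
# by their most recent year, so the within-college order is preserved.
most_recent = df.groupby("College")["Year"].max()
out = (df.sort_values("Year", ascending=False, kind="mergesort")
         .sort_values("College", key=lambda s: s.map(most_recent),
                      ascending=False, kind="mergesort"))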
You will first have to find the maximum year within each group and set that as a column.
You can then sort by that max and by year.
df = pd.read_table('./table.txt')
df["max"] = df.groupby("College")["Year"].transform("max")
df.sort_values(by=["max", "Year"], ascending=False).drop(columns="max").reset_index(drop=True)
Output:
Out[60]:
Name College Year
0 Louise Princeton 2020
1 Mary Princeton 2017
2 Bill Princeton 2016
3 Michael Harvard 2019
4 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
I'm trying to find the name of the person who submitted the most applications in any given year over a series of years.
Each application is its own row in the dataframe. It comes with the year it was submitted, and the applicant's name.
I tried using groupby to organize the data by year and name, then a variety of methods such as value_counts(), count(), max(), etc...
This is the closest I've gotten:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
It produces the following output:
app_year_start name total_apps
2015 John Smith 622
2013 John Smith 614
2014 Jane Doe 611
2016 Jon Snow 549
My desired output:
app_year_start name total_apps
2015 top_applicant max_num
2014 top_applicant max_num
2013 top_applicant max_num
2012 top_applicant max_num
Some lines of dummy data:
app_year_start name
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
I've consulted the following SO posts:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
Pandas groupby nlargest sum
Get max of count() function on pandas groupby objects
Some other attempts I've made:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
df3.groupby(['app_year_start','name']).count()
Any help would be appreciated. I'm also open to entirely different solutions as well.
Cross-tabulate and find max values.
(
    # cross-tabulate to get each applicant's number of applications per year
    pd.crosstab(df['app_year_start'], df['name'])
    # the applicant with the most applications, and their count
    .agg(['idxmax', 'max'], axis=1)
    # change column names
    .set_axis(['name', 'total_apps'], axis=1)
    # flatten df
    .reset_index()
)
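With the dummy data above, this should produce something like:

   app_year_start        name  total_apps
0            2012  John Smith           8
1            2013    Jane Doe           2
2            2014  John Smith           2
3            2015   John Snow           4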
You can use mode per group:
df.groupby('app_year_start')['name'].agg(lambda x: x.mode().iloc[0])
Or, if you want all values joined as a single string in case of a tie:
df.groupby('app_year_start')['name'].agg(lambda x: ', '.join(x.mode()))
Output:
app_year_start
2012 John Smith
2013 Jane Doe
2014 John Smith
2015 John Snow
Name: name, dtype: object
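If you also want the counts alongside the mode (an addition to the original answer, breaking ties arbitrarily): value_counts sorts counts descending within each group, so keeping the first row per year gives the top applicant and their total:

top = (df.groupby('app_year_start')['name'].value_counts()
         .groupby(level=0).head(1)
         .reset_index(name='total_apps'))
# reset_index(name=...) renames the counts column and avoids a clash
# with the 'name' index level.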
Variant of your initial code:
(df
 .groupby(['app_year_start', 'name'])['name']
 .agg(total_apps='count')
 .sort_values(by='total_apps', ascending=False)
 .reset_index()
 .groupby('app_year_start', as_index=False)
 .first()
)
Output:
app_year_start name total_apps
0 2012 John Smith 8
1 2013 Jane Doe 2
2 2014 John Smith 2
3 2015 John Snow 4
With value_counts and a groupby:
dfc = (df.value_counts().reset_index()
         .groupby('app_year_start').max()
         .sort_index(ascending=False).reset_index()
         .rename(columns={0: 'total_apps'}))
print(dfc)
Result
app_year_start name total_apps
0 2015 John Snow 4
1 2014 John Smith 2
2 2013 Jane Doe 2
3 2012 John Snow 8
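One caveat with this approach: .max() is applied to each column independently, so the name and the count can come from different rows. In the 2012 row above, John Snow is paired with John Smith's count of 8. A variant that keeps each row intact (a sketch, not part of the original answer):

# Sort all (year, name) counts once, then keep the top row per year.
dfc = (df.value_counts().reset_index(name='total_apps')
         .sort_values('total_apps', ascending=False)
         .drop_duplicates('app_year_start')
         .sort_values('app_year_start', ascending=False)
         .reset_index(drop=True))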
I have a pandas df like the following. Here is the input data:
[{'Region/Province': 'PHILIPPINES', 'Commodity': 'Atis [Sugarapple]', '2018 January': '..', '2018 February': '..'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Avocado', '2018 January': '..', '2018 February': '..'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Banana Bungulan, green', '2018 January': '12.57', '2018 February': '12.48'}, {'Region/Province': 'PHILIPPINES', 'Commodity': 'Banana Cavendish', '2018 January': '9.96', '2018 February': '8.8'}]
The columns after Commodity continue like this: 2018 January, 2018 February, ..., 2018 Annual, all the way up to 2021.
But I need the Commodity names repeated, split out by year/month, with the Amount as its own column. I've tried pd.wide_to_long() and it's close to what I need, but the years become their own columns.
Any help is much appreciated
Try stack with str.split:
stacked = (
    df.set_index(['Region/Province', 'Commodity'])
      .stack()
      .reset_index(name='Amount')
)
stacked[['Year', 'Month']] = stacked['level_2'].str.split(expand=True)
stacked = stacked.drop('level_2', axis=1)
stacked:
Region/Province Commodity Amount Year Month
0 PHILIPPINES Atis [Sugarapple] .. 2018 January
1 PHILIPPINES Atis [Sugarapple] .. 2018 February
2 PHILIPPINES Avocado .. 2018 January
3 PHILIPPINES Avocado .. 2018 February
4 PHILIPPINES Banana Bungulan, green 12.57 2018 January
5 PHILIPPINES Banana Bungulan, green 12.48 2018 February
6 PHILIPPINES Banana Cavendish 9.96 2018 January
7 PHILIPPINES Banana Cavendish 8.8 2018 February
or melt and str.split:
melt = df.melt(['Region/Province', 'Commodity'], value_name='Amount')
melt[['Year', 'Month']] = melt['variable'].str.split(expand=True)
melt = melt.drop('variable', axis=1)
melt:
Region/Province Commodity Amount Year Month
0 PHILIPPINES Atis [Sugarapple] .. 2018 January
1 PHILIPPINES Avocado .. 2018 January
2 PHILIPPINES Banana Bungulan, green 12.57 2018 January
3 PHILIPPINES Banana Cavendish 9.96 2018 January
4 PHILIPPINES Atis [Sugarapple] .. 2018 February
5 PHILIPPINES Avocado .. 2018 February
6 PHILIPPINES Banana Bungulan, green 12.48 2018 February
7 PHILIPPINES Banana Cavendish 8.8 2018 February
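In both results Amount is still a string column because of the '..' placeholders. If those are meant to be missing values (an assumption on my part), a follow-up coercion makes the column numeric:

# '..' (and anything else non-numeric) becomes NaN
melt['Amount'] = pd.to_numeric(melt['Amount'], errors='coerce')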
I'm trying to combine two pandas DataFrames to update the first one based on criteria from the second. Here is a sample of the two dataframes:
df1

      state       candidate
year
2016  CALIFORNIA  CLINTON, HILLARY
2016  CALIFORNIA  TRUMP, DONALD J.
2016  CALIFORNIA  JOHNSON, GARY
2016  CALIFORNIA  STEIN, JILL
2016  CALIFORNIA  WRITE-IN
2016  CALIFORNIA  LA RIVA, GLORIA ESTELLA
2016  TEXAS       TRUMP, DONALD J.
2016  TEXAS       CLINTON, HILLARY
2016  TEXAS       JOHNSON, GARY
2016  TEXAS       STEIN, JILL
...
1988  CALIFORNIA  BUSH, GEORGE H.W.
1988  CALIFORNIA  DUKAKIS, MICHAEL
1988  CALIFORNIA  PAUL, RONALD ""RON""
1988  CALIFORNIA  FULANI, LENORA
1988  TEXAS       BUSH, GEORGE H.W.
1988  TEXAS       DUKAKIS, MICHAEL
1988  TEXAS       PAUL, RONALD ""RON""
1988  TEXAS       FULANI, LENORA
df2

      state       electoral_votes
year
1988  CALIFORNIA  47
1988  TEXAS       29
...
2016  CALIFORNIA  55
2016  TEXAS       38
There are entries for every election year from 1972 through 2020, covering all candidates and all states, in a similar format. There are other columns in df1 but they aren't relevant to what I'm trying to do.
My expected result is:
      state       candidate                electoral_votes
year
2016  CALIFORNIA  CLINTON, HILLARY         55
2016  CALIFORNIA  TRUMP, DONALD J.         55
2016  CALIFORNIA  JOHNSON, GARY            55
2016  CALIFORNIA  STEIN, JILL              55
2016  CALIFORNIA  WRITE-IN                 55
2016  CALIFORNIA  LA RIVA, GLORIA ESTELLA  55
2016  TEXAS       TRUMP, DONALD J.         38
2016  TEXAS       CLINTON, HILLARY         38
2016  TEXAS       JOHNSON, GARY            38
2016  TEXAS       STEIN, JILL              38
...
1988  CALIFORNIA  BUSH, GEORGE H.W.        47
1988  CALIFORNIA  DUKAKIS, MICHAEL         47
1988  CALIFORNIA  PAUL, RONALD ""RON""     47
1988  CALIFORNIA  FULANI, LENORA           47
1988  TEXAS       BUSH, GEORGE H.W.        29
1988  TEXAS       DUKAKIS, MICHAEL         29
1988  TEXAS       PAUL, RONALD ""RON""     29
1988  TEXAS       FULANI, LENORA           29
I want to match the electoral_votes column in df2 to df1 using the year and state columns so each row gets the correct value. I got some assistance and was able to match it up when there is only one column being matched (you can see the question and answer here), but I am having trouble matching on the two points of reference (year and state). If I use the linked code as-is, it returns the error:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have tried apply, map, applymap, merge, etc and haven't been able to figure it out. Thanks in advance for the help!
I believe what you are looking for is a left merge. You should specify the common columns the merge should be based on within on=[...].
# Imports
import pandas as pd
# Specify two columns in the "on".
pd.merge(df1,
         df2,
         how='left',
         on=['year', 'state'])
Out[1821]:
year state candidate votes
0 2016 CALIFORNIA CLINTON, HILLARY 55
1 2016 CALIFORNIA TRUMP, DONALD J. 55
2 2016 CALIFORNIA JOHNSON, GARY 55
3 2016 CALIFORNIA STEIN, JILL 55
4 2016 CALIFORNIA WRITE-IN 55
5 2016 CALIFORNIA LA RIVA, GLORIA ESTELLA 55
6 2016 TEXAS TRUMP, DONALD J. 38
7 2016 TEXAS CLINTON, HILLARY 38
8 2016 TEXAS JOHNSON, GARY 38
9 2016 TEXAS STEIN, JILL 38
10 1988 CALIFORNIA BUSH, GEORGE H.W. 47
11 1988 CALIFORNIA DUKAKIS, MICHAEL 47
12 1988 CALIFORNIA PAUL, RONALD ""RON"" 47
13 1988 CALIFORNIA FULANI, LENORA 47
14 1988 TEXAS BUSH, GEORGE H.W. 29
15 1988 TEXAS DUKAKIS, MICHAEL 29
16 1988 TEXAS PAUL, RONALD ""RON"" 29
17 1988 TEXAS FULANI, LENORA 29
The above code could be written as:
pd.merge(df1,
         df2,
         how='left',
         left_on=['year', 'state'],
         right_on=['year', 'state'])
but since the key columns have the same names in the two DataFrames, we can simply use on=['year', 'state'].
An alternate way to write it:
merged_df = df1.merge(df2, on=['year', 'state'], how='left')
If you want to use only 3 columns from df1:
df1 = pd.read_csv('<name_of_the_CSV_file>', usecols=['year', 'state', 'candidate'])
I've got a Python/pandas groupby that is grouped on name and looks like this:
name  gender  year  city           city total
jane  female  2011  Detroit        1
              2015  Chicago        1
dan   male    2009  Lexington      1
bill  male    2001  New York       1
              2003  Buffalo        1
              2000  San Francisco  1
and I want it to look like this:

name  gender  year1  city1      year2  city2     year3  city3    city total
jane  female  2011   Detroit    2015   Chicago                   2
dan   male    2009   Lexington                                   1
bill  male    2000   Chico      2001   New York  2003   Buffalo  3
So I want to keep the grouping by name, order by year, and have each name occupy only one row. Is it a variation on dummy variables, maybe? I'm not even sure how to summarize it. (A sketch of one possible approach follows.)
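One way to get there (a sketch, not from the original thread; pivot with a list index needs pandas 1.1+) is to number each name's rows by year with cumcount, then pivot those positions into wide columns:

import pandas as pd

# hypothetical flat version of the grouped data above
df = pd.DataFrame({
    'name':   ['jane', 'jane', 'dan', 'bill', 'bill', 'bill'],
    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],
    'year':   [2011, 2015, 2009, 2001, 2003, 2000],
    'city':   ['Detroit', 'Chicago', 'Lexington', 'New York', 'Buffalo', 'San Francisco'],
})

df = df.sort_values(['name', 'year'])
df['pos'] = df.groupby('name').cumcount() + 1            # 1, 2, 3 ... within each name

wide = df.pivot(index=['name', 'gender'], columns='pos', values=['year', 'city'])
wide.columns = [f'{col}{i}' for col, i in wide.columns]  # year1, year2, ..., city1, ...
wide = wide.reset_index()
wide['city total'] = wide['name'].map(df.groupby('name').size())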
Consider the following sites (site1, site2, site3), which have a number of different tables.
I am using read_html to scrape the tables into a single table as follows:
import multiprocessing
import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']

def process_url(url):
    return pd.concat(pd.read_html(url), ignore_index=False)

pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)
With the above procedure I get a single table. Although this is what I expected, it would be helpful to add a flag or a "table counter" so as not to lose the reference of each table (e.g. which row belongs or corresponds to which table). So, how can I add the number of the source table to each row?
Something like this, the same single table, but with a table_num column:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date table_num
1 Allied Bank Mulberry AR 91.0 Today's Bank September 23, 2016 October 17, 2016 1
2 The Woodbury Banking Company Woodbury GA 11297.0 United Bank August 19, 2016 October 17, 2016 1
3 First CornerStone Bank King of Prussia PA 35312.0 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016 1
4 Trust Company Bank Memphis TN 9956.0 The Bank of Fayette County April 29, 2016 September 6, 2016 2
5 North Milwaukee State Bank Milwaukee WI 20364.0 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 2
6 Hometown National Bank Longview WA 35156.0 Twin City Bank October 2, 2015 April 13, 2016 3
7 The Bank of Georgia Peachtree City GA 35259.0 Fidelity Bank October 2, 2015 October 24, 2016 3
8 Premier Bank Denver CO 34112.0 United Fidelity Bank, fsb July 10, 2015 August 17, 2016 3
9 Edgebrook Bank Chicago IL 57772.0 Republic Bank of Chicago May 8, 2015 July 12, 2016 3
10 Doral Bank NaN NaN NaN NaN NaN NaN 4
11 En Espanol San Juan PR 32102.0 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 4
12 Capitol City Bank & Trust Company Atlanta GA 33938.0 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015 4
13 Valley Bank Fort Lauderdale FL 21793.0 Landmark Bank, National Association June 20, 2014 June 29, 2015 5
14 Valley Bank Moline IL 10450.0 Great Southern Bank June 20, 2014 June 26, 2015 5
15 Slavie Federal Savings Bank Bel Air MD 32368.0 Bay Bank, FSB May 3, 2014 June 15, 2015 5
16 Columbia Savings Bank Cincinnati OH 32284.0 United Fidelity Bank, fsb May 23, 2014 November 10, 2016 6
17 AztecAmerica Bank NaN NaN NaN NaN NaN NaN 6
18 En Espanol Berwyn IL 57866.0 Republic Bank of Chicago May 16, 2014 October 20, 2016 6
For instance, if there are two tables in site1, the function must assign 0 to all the rows of table1 and 1 to all the rows of table2.
Likewise, if site2 has two tables, the function must assign 3 to all the rows of its table1 and 4 to those of its table2, and so on for all the tables that live in site2.
Also, is it possible to use assign() or some other method to keep the reference of each row (e.g. its table of provenance)?
Try changing your process_url() function as follows:

def process_url(url):
    # tag every row of each table with that table's position on the page
    return pd.concat([x.assign(table_num=i)
                      for i, x in enumerate(pd.read_html(url))],
                     ignore_index=False)
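Note that this restarts the counter at 0 for each URL. If the numbering should instead keep increasing across sites, as the question suggests, one option (a sketch under that assumption) is to number the tables after collecting them:

# fetch each site's list of tables in parallel, then number them globally
per_site = pool.map(pd.read_html, links)             # list of lists of DataFrames
flat = [t for site in per_site for t in site]        # flatten, preserving site order
df = pd.concat([t.assign(table_num=i) for i, t in enumerate(flat)],
               ignore_index=True)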