python groupby: how to move hierarchical grouped data from columns into rows?

I've got a Python/pandas groupby that is grouped on name and looks like this:
name  gender  year  city           city total
jane  female  2011  Detroit        1
              2015  Chicago        1
dan   male    2009  Lexington      1
bill  male    2001  New York       1
              2003  Buffalo        1
              2000  San Francisco  1
and I want it to look like this:
name  gender  year1  city1          year2  city2     year3  city3    city total
jane  female  2011   Detroit        2015   Chicago                   2
dan   male    2009   Lexington                                       1
bill  male    2000   San Francisco  2001   New York  2003   Buffalo  3
So I want to keep the grouping by name, order by year within each name, and collapse each name down to a single row. Maybe it's a variation on dummy variables? I'm not even sure how to summarize it.
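To make the goal concrete, here is a minimal sketch of one way to do this reshape (built from the sample data above, assuming a flat DataFrame with columns name, gender, year, city), using groupby + cumcount to number each person's entries by year and then pivoting those numbers into suffixed columns:

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['jane', 'jane', 'dan', 'bill', 'bill', 'bill'],
    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],
    'year':   [2011, 2015, 2009, 2001, 2003, 2000],
    'city':   ['Detroit', 'Chicago', 'Lexington', 'New York', 'Buffalo', 'San Francisco'],
})

# number each person's entries 1, 2, 3, ... after sorting by year
df = df.sort_values(['name', 'year'])
df['n'] = df.groupby('name').cumcount() + 1

# pivot year/city against that number, giving year1, year2, ..., city1, city2, ...
wide = df.pivot(index=['name', 'gender'], columns='n', values=['year', 'city'])
wide.columns = [f'{col}{i}' for col, i in wide.columns]

# count of cities per person, aligned on the (name, gender) index
wide['city total'] = df.groupby(['name', 'gender']).size()
wide = wide.reset_index()
```

Missing slots (e.g. jane's year3/city3) come out as NaN, so the year columns are upcast to float; that can be cleaned up afterwards if integer years matter.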

Related

Ordering Pairs of Data by date - Pandas

I am somewhat new to coding in Pandas and I have what I think to be a simple problem that I can't find an answer to. I have a list of students, the college they went to and what year they entered college.
Name     College    Year
Mary     Princeton  2017
Joe      Harvard    2018
Bill     Princeton  2016
Louise   Princeton  2020
Michael  Harvard    2019
Penny    Yale       2018
Harry    Yale       2015
I need the data ordered by year but grouped by college. If I sort by year, the years are in order but the colleges aren't together; if I sort by college, the colleges are together but in alphabetical order, with the years out of order. Similarly, sorting by year then college doesn't keep the colleges together, and sorting by college then year can't guarantee that the most recent year comes first. What I want the table to look like is:
Name     College    Year
Louise   Princeton  2020
Mary     Princeton  2017
Bill     Princeton  2016
Michael  Harvard    2019
Joe      Harvard    2018
Penny    Yale       2018
Harry    Yale       2015
So Princeton comes first because it has the most recent year, and all the Princeton rows stay together. Then Harvard is next, because 2019 > 2018 (Yale's most recent year), so the two Harvard rows follow. Yale comes last, since 2020 > 2019 > 2018. I appreciate all your ideas and help! Thank you!
Add a temporary extra column with the max year per group and sort on multiple columns:
out = (df
       .assign(max_year=df.groupby('College')['Year'].transform('max'))
       .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
       .drop(columns='max_year')
       )
output:
Name College Year
3 Louise Princeton 2020
0 Mary Princeton 2017
2 Bill Princeton 2016
4 Michael Harvard 2019
1 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
with temporary column:
Name College Year max_year
3 Louise Princeton 2020 2020
0 Mary Princeton 2017 2020
2 Bill Princeton 2016 2020
4 Michael Harvard 2019 2019
1 Joe Harvard 2018 2019
5 Penny Yale 2018 2018
6 Harry Yale 2015 2018
You first want to sort by "College" and then "Year"; then keep the "College" values together using .groupby:
import pandas as pd
data = [
["Mary", "Princeton", 2017],
["Joe", "Harvard", 2018],
["Bill", "Princeton", 2016],
["Louise", "Princeton", 2020],
["Michael", "Harvard", 2019],
["Penny", "Yale", 2018],
["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()
You'd get this output:
Name College Year
Penny Yale 2018
Harry Yale 2015
Louise Princeton 2020
Mary Princeton 2017
Bill Princeton 2016
Michael Harvard 2019
Joe Harvard 2018
First find the maximum year within each group and set it as a column; then sort by that max and by year.
df = pd.read_table('./table.txt')
df["max"] = df.groupby("College")["Year"].transform("max")
df.sort_values(by=["max", "Year"], ascending=False).drop(columns="max").reset_index(drop=True)
Output:
Name College Year
0 Louise Princeton 2020
1 Mary Princeton 2017
2 Bill Princeton 2016
3 Michael Harvard 2019
4 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
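Another way to sketch the same idea, assuming the sample data above, is to turn College into an ordered Categorical ranked by each college's most recent year, so a single sort produces the desired order:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mary', 'Joe', 'Bill', 'Louise', 'Michael', 'Penny', 'Harry'],
    'College': ['Princeton', 'Harvard', 'Princeton', 'Princeton', 'Harvard', 'Yale', 'Yale'],
    'Year': [2017, 2018, 2016, 2020, 2019, 2018, 2015],
})

# rank colleges by their most recent year, most recent first
order = (df.groupby('College')['Year'].max()
           .sort_values(ascending=False).index)

# an ordered Categorical sorts in that ranking rather than alphabetically
df['College'] = pd.Categorical(df['College'], categories=order, ordered=True)
out = (df.sort_values(['College', 'Year'], ascending=[True, False])
         .reset_index(drop=True))
```

This avoids the temporary helper column, at the cost of changing the dtype of College.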

Finding the most frequent strings and their counts for each group using pandas

I'm trying to find the name of the person who submitted the most applications in any given year over a series of years.
Each application is its own row in the dataframe. It comes with the year it was submitted, and the applicant's name.
I tried using groupby to organize the data by year and name, then a variety of methods such as value_counts(), count(), max(), etc...
This is the closest I've gotten:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
It produces the following output:
app_year_start name total_apps
2015 John Smith 622
2013 John Smith 614
2014 Jane Doe 611
2016 Jon Snow 549
My desired output:
app_year_start name total_apps
2015 top_applicant max_num
2014 top_applicant max_num
2013 top_applicant max_num
2012 top_applicant max_num
Some lines of dummy data:
app_year_start name
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
I've consulted the following SO posts:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
Pandas groupby nlargest sum
Get max of count() function on pandas groupby objects
Some other attempts I've made:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
df3.groupby(['app_year_start','name']).count()
Any help would be appreciated. I'm also open to entirely different solutions as well.
Cross-tabulate and find max values.
(
    # cross-tabulate to get each applicant's number of applications
    pd.crosstab(df['app_year_start'], df['name'])
    # the applicant with the most applications and their count
    .agg(['idxmax', 'max'], axis=1)
    # rename the columns
    .set_axis(['name', 'total_apps'], axis=1)
    # flatten the frame
    .reset_index()
)
You can use mode per group:
df.groupby('app_year_start')['name'].agg(lambda x: x.mode().iloc[0])
Or, if you want all values joined as a single string in case of a tie:
df.groupby('app_year_start')['name'].agg(lambda x: ', '.join(x.mode()))
Output:
app_year_start
2012 John Smith
2013 Jane Doe
2014 John Smith
2015 John Snow
Name: name, dtype: object
Variant of your initial code:
(df
.groupby(['app_year_start', 'name'])['name']
.agg(total_apps='count')
.sort_values(by='total_apps', ascending=False)
.reset_index()
.groupby('app_year_start', as_index=False)
.first()
)
Output:
app_year_start name total_apps
0 2012 John Smith 8
1 2013 Jane Doe 2
2 2014 John Smith 2
3 2015 John Snow 4
With value_counts and a groupby (sorting by count first, then taking the first row per year, so that the name and count stay paired):
dfc = (df.value_counts()
         .reset_index(name='total_apps')
         .sort_values('total_apps', ascending=False)
         .groupby('app_year_start', as_index=False).first()
         .sort_values('app_year_start', ascending=False)
         .reset_index(drop=True)
       )
print(dfc)
Result
   app_year_start        name  total_apps
0            2015   John Snow           4
1            2014  John Smith           2
2            2013    Jane Doe           2
3            2012  John Smith           8

pandas update specific rows in specific columns in one dataframe based on another dataframe

I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
this is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is small:
>>>ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big with info from Small, matching on the ID number. I also want to change only specific columns, age and city, and not name/country/hobby.
so the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know to use update, but in this case I don't want to change all the columns in each row, only specific ones. Is there a way to do that?
Use DataFrame.update with ID converted to the index, selecting only the columns to process (here age and city):
df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age','city']]
df11.update(df22)
df = df11.reset_index()
print (df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0
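One side effect worth noting: update upcasts the touched numeric columns to float, which is why the output above shows 24.0 rather than 24. A minimal sketch of casting back afterwards, using a cut-down version of the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [12, 15], 'city': ['Lima', 'new-york'], 'age': [212, 34]})
df2 = pd.DataFrame({'ID': [12], 'city': ['Lima'], 'age': [24]})

df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age', 'city']]
df11.update(df22)                        # age for ID 12 becomes 24.0 (float)

df11['age'] = df11['age'].astype(int)    # restore the integer dtype
df = df11.reset_index()
```

This is safe here because update never introduces NaN into the updated columns; if it could, the float dtype would have to stay.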

Compare two data frames (Source Vs Target) and leave empty row if records not found in Target table (having same index number as source)

I want to compare the data in the "Source" DataFrame against the "Target" DataFrame by 'Index' number. If a searched index is not found in Target, a blank row with the same index key as in Source should be printed in the Target table. Is there a way to achieve this without a loop? I need to compare a dataset of 500,000 records.
Below are the Source, Target, and expected data frames. Source has a record for index number 3, whereas Target does not; I want a blank row there with the same index number as Source.
Source:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3 5680 Johy 25 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Target:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Expected:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Please suggest a way to do this without looping, as I need to compare a dataset of 500,000 records.
You can reindex and fill the gaps with a blank string using fillna(''):
Target.reindex(Source.index).fillna('')
Or:
Target.reindex(Source.index,fill_value='')
If Index is a column and not actually an index, set it as index:
Source=Source.set_index('Index')
Target=Target.set_index('Index')
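Putting those pieces together, a minimal self-contained sketch (with a cut-down version of the frames above) might look like:

```python
import pandas as pd

source = pd.DataFrame(
    {'Employee ID': [5678, 5679, 5680, 5681],
     'Employee Name': ['John', 'Sam', 'Johy', 'Rose']},
    index=pd.Index([1, 2, 3, 4], name='Index'),
)
target = source.drop(3)  # Target is missing Index 3

# align Target to Source's index; the missing row is filled with blanks
result = target.reindex(source.index, fill_value='')
```

Note that filling an integer column with '' upcasts it to object dtype; if you'd rather keep numeric dtypes, reindex without fill_value and leave the gaps as NaN.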
Not the best way, I prefer #anky_91's way:
>>> df = pd.concat([source, target]).drop_duplicates(keep='first')
>>> df.loc[~df['Index'].isin(source['Index']) | ~df['Index'].isin(target['Index']), df.columns.drop('Index')] = ''
>>> df
  Index Employee ID Employee Name Age      City    Country
0     1        5678          John  30  New york        USA
1     2        5679           Sam  35  New york        USA
2     3
3     4        5681          Rose  70  New york        USA
4     5        5682           Tom  28  New york        USA
5     6        5683          Nick  49  New york        USA
6     7        5684         Ricky  20    Sydney  Australia

python pandas groupby sort rank/top n

I have a dataframe that is grouped by state and aggregated to total revenue, with sector and name ignored. I would now like to break the underlying dataset out to show state, sector, name, and the top 2 by revenue, in a certain order (I have created an index from a previous dataframe that lists states in a specific order). Using the example below, I would like to apply my sorted index (Kentucky, California, New York) and list only the top two results per state, ordered by Revenue:
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15
You could use a groupby in conjunction with apply:
df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
Sector Name Revenue
State State
California California 2 Jim 40
California 3 Roger 30
Kentucky Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York New York 1 Sally 50
New York 3 Harry 15
Then you can drop the first level of the resulting MultiIndex to get the frame you're after:
out = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
out.index = out.index.droplevel()
Output:
Sector Name Revenue
State
California 2 Jim 40
California 3 Roger 30
Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York 1 Sally 50
New York 3 Harry 15
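On recent pandas versions you can instead pass group_keys=False to groupby, which skips adding the group key to the index so the droplevel step isn't needed. A minimal sketch with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['California', 'California', 'California', 'California',
              'Kentucky', 'Kentucky', 'Kentucky', 'New York', 'New York'],
    'Sector': [1, 2, 3, 2, 2, 1, 3, 1, 3],
    'Name': ['Tom', 'Harry', 'Roger', 'Jim', 'Bob', 'Roger', 'Jill', 'Sally', 'Harry'],
    'Revenue': [10, 20, 30, 40, 15, 25, 45, 50, 15],
})

# group_keys=False keeps the original flat index, so no droplevel() is needed
top2 = df.groupby('State', group_keys=False).apply(lambda grp: grp.nlargest(2, 'Revenue'))
```

The states still come out in groupby's default (alphabetical) order; applying a custom state order, as the question asks, would take an extra sort afterwards.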
You can sort_values, then use groupby + head:
df.sort_values('Revenue',ascending=False).groupby('State').head(2)
Out[208]:
        State  Sector   Name  Revenue
7    New York       1  Sally       50
6    Kentucky       3   Jill       45
3  California       2    Jim       40
2  California       3  Roger       30
5    Kentucky       1  Roger       25
8    New York       3  Harry       15
