python pandas groupby sort rank/top n - python

I have a dataframe that is grouped by state and aggregated to total revenue, ignoring sector and name. I would now like to break the underlying dataset out to show state, sector, name and the top 2 rows by revenue per state, with the states in a specific order (I have created an index from a previous dataframe that lists the states in that order). Using the example below, I would like to use my sorted index (Kentucky, California, New York) and list only the top two results per state, ordered by Revenue within each state:
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15

You could use a groupby in conjunction with apply:
df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
Sector Name Revenue
State State
California California 2 Jim 40
California 3 Roger 30
Kentucky Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York New York 1 Sally 50
New York 3 Harry 15
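Then you can drop the first level of the MultiIndex to get the result you're after. Assign the groupby result to a variable first (here result), because the original df is not modified by the call above:
result = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
result.index = result.index.droplevel()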
Output:
Sector Name Revenue
State
California 2 Jim 40
California 3 Roger 30
Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York 1 Sally 50
New York 3 Harry 15

You can use sort_values followed by groupby + head:
df.sort_values('Revenue',ascending=False).groupby('State').head(2)
Output:
State Sector Name Revenue
7 NewYork 1 Sally 50
6 Kentucky 3 Jill 45
3 California 2 Jim 40
2 California 3 Roger 30
5 Kentucky 1 Roger 25
8 NewYork 3 Harry 15
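Neither answer above puts the states in the custom order the question asks for (Kentucky, California, New York). A minimal sketch of one way to do that, assuming the order is available as a plain list (state_order is a name made up for this example), is to turn State into an ordered categorical and sort on it:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "State": ["California"] * 4 + ["Kentucky"] * 3 + ["New York"] * 2,
    "Sector": [1, 2, 3, 2, 2, 1, 3, 1, 3],
    "Name": ["Tom", "Harry", "Roger", "Jim", "Bob", "Roger", "Jill", "Sally", "Harry"],
    "Revenue": [10, 20, 30, 40, 15, 25, 45, 50, 15],
})

# Assumption: the desired state order is known as a list.
state_order = ["Kentucky", "California", "New York"]

# Top 2 rows per state by Revenue.
top2 = df.sort_values("Revenue", ascending=False).groupby("State").head(2)

# An ordered categorical makes sort_values follow state_order
# instead of alphabetical order.
top2 = top2.assign(
    State=pd.Categorical(top2["State"], categories=state_order, ordered=True)
)
top2 = top2.sort_values(["State", "Revenue"], ascending=[True, False])
top2 = top2.reset_index(drop=True)
print(top2)
```

If the order lives in the index of another dataframe, passing list(other_df.index) as state_order should work the same way.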

Related

pandas update specific rows in specific columns in one dataframe based on another dataframe

I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
this is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is small:
>>>ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big based on the info from Small, matching on the ID number. I would also like to change only specific columns, the age and the city, and not the name/country/hobby.
so the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know how to use update, but in this case I don't want to change all the columns in each row, only specific ones. Is there a way to do that?
Use DataFrame.update after converting ID to the index and selecting only the columns to process, here age and city (df1 is Big and df2 is Small):
df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age','city']]
df11.update(df22)
df = df11.reset_index()
print (df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0
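The output above shows age as floats after the update. As a self-contained sketch of the same approach (the frame names big and small and the list cols_to_update are chosen for this example), the column can be cast back to integers afterwards:

```python
import pandas as pd

big = pd.DataFrame({
    "ID": [12, 15, 34, 23, 44],
    "name": ["Meli", "Saya", "Aitel", "Tanya", "Gil"],
    "country": ["Peru", "USA", "Jordan", "Russia", "Spain"],
    "city": ["Lima", "new-york", "Amman", "Moscow", "Madrid"],
    "hobby": ["eating", "drinking", "riding", "sports", "paella"],
    "age": [212, 34, 51, 75, 743],
})
small = pd.DataFrame({
    "ID": [12, 44],
    "name": ["Melinda", "Gil"],
    "country": ["Peru", "Spain"],
    "city": ["Lima", "Barcelona"],
    "hobby": ["eating", "friends"],
    "age": [24, 21],
})

# Only these columns are allowed to change.
cols_to_update = ["age", "city"]

updated = big.set_index("ID")
# update() aligns on the index (ID) and modifies `updated` in place.
updated.update(small.set_index("ID")[cols_to_update])
updated = updated.reset_index()

# Depending on the pandas version, age may have become float; cast it back.
updated["age"] = updated["age"].astype(int)
print(updated)
```

Because only age and city are passed to update, the name "Melinda" and the hobby "friends" from small are ignored.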

I want a pandas script to line up values from one excel sheet to another based on the values in the first spreadsheet

This question may seem vague, and similar questions have been asked before, but I'm still confused. Any help will be much appreciated. I don't have a strong programming background, so my wording is not great; I hope you understand:
I have two Excel spreadsheets with tens of thousands of data points. The first (Sheet A) has addresses, e.g.:
house_number street suburb
0 43 Smith Street Frewville
1 45 Smith Street Frewville
2 47 Smith Street Frewville
3 49 Smith Street Frewville
4 51 Smith Street Frewville
5 53 Smith Street Frewville
6 1 Flinders St Kensington
7 3 Flinders St Kensington
8 5 Flinders St Kensington
9 7 Flinders St Kensington
The second (Sheet B) has some of the same streets but with an ID column too, e.g.:
ID house_number street suburb
0 5509 43 Smith Street Frewville
1 5120 26 Taylor Avenue Glenside
2 4731 34 Brussels Street Frewville
3 4342 12 Brussels Street Frewville
4 3953 1 Roger Court Clifton
5 12098 4 Elizabeth St Clifton
6 2024 7 Flinders St Kensington
7 28388 10 Queens Rd Kensington
8 36533 13 Queens Rd Kensington
9 4478 346 Jefcott Street Glenside
10 52823 19 Jefcott Street Glenside
I want a pandas script that adds a column to Sheet A with the ID from Sheet B if the address is in both Excel sheets.
Trying to remove the NaN rows:
import pandas as pd
df1 = pd.read_excel('sheeta.xlsx')
df2 = pd.read_excel('sheetb.xlsx')
sheet1 = df1.join(
(df1.reset_index() # make a column of pandas index so join can work
.merge(df2, on=["house_number","street","suburb"], how="inner") # find full matches
.set_index("index") # make index same as original sheeta
.loc[:,["ID"]] # only want ID column to go back into join
.astype("Int64") # force the types that support int NaN
)).dropna(subset=["ID"]) # drop the Sheet A rows that did not get an ID
print(sheet1)
The code is commented to explain the approach. With the sample data, two addresses get an ID from Sheet B back onto Sheet A.
import io
import pandas as pd
sheeta = pd.read_csv(io.StringIO(""" house_number street suburb
0 43 Smith Street Frewville
1 45 Smith Street Frewville
2 47 Smith Street Frewville
3 49 Smith Street Frewville
4 51 Smith Street Frewville
5 53 Smith Street Frewville
6 1 Flinders St Kensington
7 3 Flinders St Kensington
8 5 Flinders St Kensington
9 7 Flinders St Kensington"""), sep=r"\s\s+", engine="python")
sheetb = pd.read_csv(io.StringIO("""ID house_number street suburb
0 5509 43 Smith Street Frewville
1 5120 26 Taylor Avenue Glenside
2 4731 34 Brussels Street Frewville
3 4342 12 Brussels Street Frewville
4 3953 1 Roger Court Clifton
5 12098 4 Elizabeth St Clifton
6 2024 7 Flinders St Kensington
7 28388 10 Queens Rd Kensington
8 36533 13 Queens Rd Kensington
9 4478 346 Jefcott Street Glenside
10 52823 19 Jefcott Street Glenside"""), sep=r"\s\s+", engine="python")
sheet1 = sheeta.join(
(sheeta.reset_index() # make a column of pandas index so join can work
.merge(sheetb, on=["house_number","street","suburb"], how="inner") # find fullmatches
.set_index("index") # make index same as original sheeta
.loc[:,["ID"]] # only want ID column to go back into join
.astype("Int64") # force the types that support int NaN
))
print(sheet1.to_string())
output
house_number street suburb ID
0 43 Smith Street Frewville 5509
1 45 Smith Street Frewville <NA>
2 47 Smith Street Frewville <NA>
3 49 Smith Street Frewville <NA>
4 51 Smith Street Frewville <NA>
5 53 Smith Street Frewville <NA>
6 1 Flinders St Kensington <NA>
7 3 Flinders St Kensington <NA>
8 5 Flinders St Kensington <NA>
9 7 Flinders St Kensington 2024
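The join above can also be written as a single left merge on the three address columns. This is a swapped-in alternative, not the answer's own code; a minimal sketch with a reduced sample (the names sheet_a, sheet_b and keys are chosen for this example):

```python
import pandas as pd

sheet_a = pd.DataFrame({
    "house_number": [43, 45, 7],
    "street": ["Smith Street", "Smith Street", "Flinders St"],
    "suburb": ["Frewville", "Frewville", "Kensington"],
})
sheet_b = pd.DataFrame({
    "ID": [5509, 5120, 2024],
    "house_number": [43, 26, 7],
    "street": ["Smith Street", "Taylor Avenue", "Flinders St"],
    "suburb": ["Frewville", "Glenside", "Kensington"],
})

keys = ["house_number", "street", "suburb"]

# how="left" keeps every row of sheet_a; unmatched rows get a missing ID.
merged = sheet_a.merge(sheet_b, on=keys, how="left")

# The nullable Int64 dtype keeps IDs as integers while still allowing <NA>.
merged["ID"] = merged["ID"].astype("Int64")

# Optional: keep only the addresses that were found in sheet_b.
matched_only = merged.dropna(subset=["ID"])
print(merged)
print(matched_only)
```

One caveat: if Sheet B lists the same address more than once, a left merge repeats the matching Sheet A row, so duplicates in Sheet B may need to be dropped first.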

Conditionally populating a column with a row value in pandas

I have a data set of wages for male and female workers, identified by their names.
Male Female Male_Wage Female_Wage
James Lori 8 9
Mike Nancy 10 8
Ron Cathy 11 12
Jon Ruth 15 9
Jason Jackie 10 10
In pandas I would like to create a new column in the data frame that displays the name of the highest-paid person in each row. If both are paid the same, the value should be Same.
Male Female Male_Wage Female_Wage Highest_Paid
James Lori 8 9 Lori
Mike Nancy 10 8 Mike
Ron Cathy 11 12 Cathy
Jon Ruth 15 9 Jon
Jason Jackie 10 10 Same
I have been able to add a column and populate it with values, calculate a value based on other columns, etc., but not to fill the new column conditionally based on the values of other columns. The extra condition of Same when the wages are equal is what is causing me trouble. I have searched for an answer quite a bit and have not found anything that covers all the elements of this situation.
Thanks for the help.
You can do this with .loc assignments, one per condition:
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
df.loc[df['Male_Wage'] > df['Female_Wage'], 'Highest_Paid'] = df['Male']
df.loc[df['Male_Wage'] < df['Female_Wage'], 'Highest_Paid'] = df['Female']
Method 1: np.select:
We can specify our conditions and, based on them, pick the value of Male or Female; rows matching neither condition get default='Same'.
import numpy as np
conditions = [
df['Male_Wage'] > df['Female_Wage'],
df['Female_Wage'] > df['Male_Wage']
]
choices = [df['Male'], df['Female']]
df['Highest_Paid'] = np.select(conditions, choices, default='Same')
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
Method 2: np.where + loc
Using np.where and .loc to conditionally assign the correct value:
df['Highest_Paid'] = np.where(df['Male_Wage'] > df['Female_Wage'],
df['Male'],
df['Female'])
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
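For reference, a self-contained version of Method 1 that builds the sample frame from the question, so it can be run as is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Male": ["James", "Mike", "Ron", "Jon", "Jason"],
    "Female": ["Lori", "Nancy", "Cathy", "Ruth", "Jackie"],
    "Male_Wage": [8, 10, 11, 15, 10],
    "Female_Wage": [9, 8, 12, 9, 10],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df["Male_Wage"] > df["Female_Wage"],
    df["Female_Wage"] > df["Male_Wage"],
]
choices = [df["Male"], df["Female"]]

# Rows where neither condition holds (equal wages) fall back to the default.
df["Highest_Paid"] = np.select(conditions, choices, default="Same")
print(df)
```

np.select scales to more than two conditions without nesting, which is the main reason to prefer it over chained np.where calls.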

Compare two data frames (Source Vs Target) and leave empty row if records not found in Target table (having same index number as source)

I want to compare the data in the dataframe "Source", by 'Index' number, against the dataframe "Target". If an index from Source is not found in Target, a blank row should appear in the Target table with the same index key as in Source. Is there a way to achieve this without a loop? I need to compare a dataset of 500,000 records.
Below are the source, target and expected data frames. Source has a record for index number 3, whereas Target does not, so I want a blank row with that index number.
Source:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3 5680 Johy 25 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Syney Australia
Target:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Syney Australia
Expected:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Syney Australia
Please suggest if there is any way to do it without looping as I need to compare dataset of 500,000 records.
You can reindex and then fillna() with an empty string '':
Target.reindex(Source.index).fillna('')
Or:
Target.reindex(Source.index,fill_value='')
If Index is a column and not actually the index, set it as the index first:
Source=Source.set_index('Index')
Target=Target.set_index('Index')
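A small runnable sketch of the reindex approach, using a reduced version of the sample data (the lowercase names source, target and aligned are chosen for this example):

```python
import pandas as pd

source = pd.DataFrame(
    {
        "Employee ID": [5678, 5679, 5680, 5681],
        "Employee Name": ["John", "Sam", "Johy", "Rose"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Target is the same data with index 3 missing.
target = source.drop(index=3)

# reindex adds a row for every label in source.index that target lacks;
# fill_value='' fills that row with empty strings instead of NaN.
aligned = target.reindex(source.index, fill_value="")
print(aligned)
```

This is vectorized, so no Python-level loop is involved. Note that filling with '' turns numeric columns into object dtype; keeping NaN (plain reindex without fill_value) may be preferable if the numbers are needed later.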
Not the best way (I prefer @anky_91's approach), but it also works:
>>> df = pd.concat([source, target]).drop_duplicates(keep='first')
>>> df.loc[~df['Index'].isin(source['Index']) | ~df['Index'].isin(target['Index']), df.columns.drop('Index')] = ''
>>> df
Index Employee ID Employee Name Age City Country
0 1 5678 John 30 New york USA
1 2 5679 Sam 35 New york USA None
2 3
3 4 5681 Rose 70 New york USA
4 5 5682 Tom 28 New york USA None
5 6 5683 Nick 49 New york USA
6 7 5684 Ricky 20 Syney Australia
>>>

How to sort a multiindex pandas dataframe pivot table by the totals of an agg='sum' at each level of the index

Example generated with something like
df = pd.pivot_table(df, index=['name1','name2','name3'], values='total', aggfunc='sum')
No logical sort initially
name1 name2 name3 total
Bob Mario Luigi 5
John Dan 16
Dave Tom Jim 2
Joe 6
Jack 3
Jill Frank 6
Kevin 7
Should become
name1 name2 name3 total
Dave Jill Kevin 7
Frank 6
Tom Joe 6
Jack 3
Jim 2
Bob John Dan 16
Mario Luigi 5
Dave is on top because his "total of totals" (24) is higher than Bob's (21). This propagates to each subsequent index level as well, so Jill's 13 > Tom's 11, etc. I have been messing around with groupby(), sort_values() and sort_index() and have determined that I don't really know what I'm doing.
What you could do is create additional columns and then do a multicolumn sort.
For the additional column, transform will return a Series with the index aligned to the df so you can then add it as a new column:
from pandas import DataFrame
mydf = DataFrame({
    "name1": ["bob", "bob", "dave", "dave", "dave", "dave", "dave"],
    "name2": ["mario", "john", "tom", "tom", "tom", "jill", "jill"],
    "name3": ["luigi", "dan", "jim", "joe", "jack", "frank", "kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})
# Pass the string "sum"; recent pandas versions warn when given the builtin sum.
# Group by the full path of names so equal names under different parents stay separate.
mydf["tot1"] = mydf.groupby("name1")["total"].transform("sum")
mydf["tot2"] = mydf.groupby(["name1", "name2"])["total"].transform("sum")
mydf["tot3"] = mydf.groupby(["name1", "name2", "name3"])["total"].transform("sum")
mydf.sort_values(by=["tot1", "tot2", "tot3"], ascending=False)
Which yields:
name1 name2 name3 total tot1 tot2 tot3
6 dave jill kevin 7 24 13 7
5 dave jill frank 6 24 13 6
3 dave tom joe 6 24 11 6
4 dave tom jack 3 24 11 3
2 dave tom jim 2 24 11 2
1 bob john dan 16 21 16 16
0 bob mario luigi 5 21 5 5
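The same idea can be applied directly to the pivot table, keeping its MultiIndex. A sketch under the assumption that the pivot has a single value column named total; the helper column names (_tot1, _tot2, ...) are invented for this example and are dropped at the end:

```python
import pandas as pd

raw = pd.DataFrame({
    "name1": ["bob", "bob", "dave", "dave", "dave", "dave", "dave"],
    "name2": ["mario", "john", "tom", "tom", "tom", "jill", "jill"],
    "name3": ["luigi", "dan", "jim", "joe", "jack", "frank", "kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})
pt = pd.pivot_table(raw, index=["name1", "name2", "name3"],
                    values="total", aggfunc="sum")

levels = list(pt.index.names)
helpers = []
for depth in range(1, len(levels) + 1):
    col = f"_tot{depth}"
    # Total of each group formed by the first `depth` index levels,
    # broadcast back to every row of that group.
    pt[col] = pt.groupby(level=levels[:depth])["total"].transform("sum")
    helpers.append(col)

# Sort by the totals from the outermost level inward, then drop the helpers.
sorted_pt = pt.sort_values(helpers, ascending=False).drop(columns=helpers)
print(sorted_pt)
```

The loop works for any number of index levels, so the helper columns do not have to be written out by hand.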
