Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find the country with the highest total sales.
2) Within that country, find the region with the highest total sales.
3) Within that region, find the city with the highest total sales.
I tried dff.groupby(['Product', 'City']).apply(lambda x: x.nlargest(1)), but this doesn't work: it would propose city "C", which has the highest sales of any single city globally, but China is not the country with the highest total sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()   # 'India' (2100 beats USA's 2000 and China's 1800)
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()   # 'India_North'
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print(max_city)
G
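The question also mentions other products; a minimal sketch (an extension, not part of the original answer) that repeats the same idxmax chain within each product group, assuming the columns from the sample data:

def best_city(g):
    country = g.groupby('Country')['Sales'].sum().idxmax()
    g = g[g['Country'] == country]
    region = g.groupby('Region')['Sales'].sum().idxmax()
    g = g[g['Region'] == region]
    return g.groupby('City')['Sales'].sum().idxmax()

# One best city per product; with the sample data this prints 'G' for 'Chair'.
print(dff.groupby('Product').apply(best_city))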
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)

factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')

res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
   City Country Product       Region  Sales  Country_Total  Region_Total  City_Total
6     G   India   Chair  India_North    500           2100          1200         500
7     H   India   Chair  India_North    350           2100          1200         350
8     I   India   Chair  India_North    350           2100          1200         350
10    K   India   Chair   India_West    700           2100           800         700
9     J   India   Chair   India_West    100           2100           800         100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.
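If you sell several products, a hedged variant of the same idea is to compute each total within the product group and sort with Product first, so the preference ordering is applied per product:

# Sketch under the assumption that totals should be computed per product.
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(['Product', factor])['Sales'].transform('sum')
res = df.sort_values(['Product'] + [f'{x}_Total' for x in factors],
                     ascending=[True, False, False, False])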
Related
I am trying to optimize a function that returns the value of a variable (wage) for the school district with the largest enrollment within each MSA, for every year. I thought combining apply and lambda would be efficient, but my actual dataset is large (shape 321681x272), which makes the computation extremely slow. Is there a faster way? I think vectorizing the operations instead of iterating through the dataframe could be a solution, but I am unsure how to structure it as an alternative to df.apply with a lambda.
df = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                   'msa': ['NYC-Newark', 'NYC-Newark', 'NYC-Newark', 'NYC-Newark'],
                   'leaname': ['NYC School District', 'Newark School District',
                               'NYC School District', 'Newark School District'],
                   'enroll': [100000, 50000, 110000, 60000],
                   'wage': [5, 2, 7, 3]})
def function1(x, y, var):
    '''
    Returns the selected variable's value for the school district
    with the largest enrollment in a given year
    '''
    t = df[(df['msa'] == x) & (df['year'] == y)]
    e = pd.DataFrame(t.groupby(['msa', var]).mean()['enroll'])
    return e.loc[e.groupby(level=[0])['enroll'].idxmax()].reset_index()[var]

df['main_city_wage'] = df.apply(lambda x: function1(x['msa'], x['year'], 'wage'), axis=1)
Sample Output
year msa leaname enroll wage main_wage
0 2000 NYC-Newark NYC School District 100000 5 5
1 2000 NYC-Newark Newark School District 50000 2 5
2 2001 NYC-Newark NYC School District 110000 7 7
3 2001 NYC-Newark Newark School District 60000 3 7
Something like
df['main_wage'] = df.set_index('wage').groupby(['year', 'msa'])['enroll'].transform('idxmax').values
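This works because setting 'wage' as the index makes idxmax return the wage label of the max-enroll row in each (year, msa) group, transform broadcasts that label to every row, and .values strips the index so the assignment aligns positionally. An equivalent sketch without the index trick, using idxmax plus a merge:

# Pick the max-enroll row per (year, msa), then merge its wage back on.
top = df.loc[df.groupby(['year', 'msa'])['enroll'].idxmax(), ['year', 'msa', 'wage']]
df = df.merge(top.rename(columns={'wage': 'main_wage'}), on=['year', 'msa'], how='left')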
How can I select the row of values that has the max count after grouping by a column?
Examples:
STATE COUNTY POPULATION
1 5571 1000
2 3421 2000
3 6781 3000
2 1234 4000
2 3344 6600
1 5566 9900
I want to find the STATE with the max count of counties, and show only STATE and COUNTY (the count), without POPULATION.
The answer should be as below, but I don't know how to code it in Python. Thanks for the help.
STATE COUNTY
2 3
Try:
u = df.groupby('STATE')['COUNTY'].size()
v = u[u.index == u.idxmax()].reset_index()
v:
   STATE  COUNTY
0      2       3
Approach:
1. Group by STATE, then use nunique if you want to count distinct values, or size, on the COUNTY column.
2. Get the index of the row where the count is the max.
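A minimal sketch of that approach, assuming the column is named COUNTY as in the example:

counts = df.groupby('STATE')['COUNTY'].size()   # or .nunique() for distinct counties
result = pd.DataFrame({'STATE': [counts.idxmax()], 'COUNTY': [counts.max()]})
print(result)   # STATE 2, COUNTY 3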
Working in python, in a Jupyter notebook. I am given this dataframe
congress chamber state party
80 house TX D
80 house TX D
80 house NJ D
80 house TX R
80 senate KY R
of every congressperson since the 80th congressional term, with a bunch of information; I've narrowed it down to what's needed for this question. I want to alter the dataframe so that I have a single row for every unique combination of congressional term, chamber, state, and party affiliation, and then a new column with the number of rows of the associated party divided by the number of rows where everything else besides party is the same. For example, this
congress chamber state party perc
80 house TX D 0.66
80 house NJ D 1
80 house TX R 0.33
80 senate KY R 1
is what I'd want my result to look like. The perc column is the percentage of, for example, democrats elected to congress in TX in the 80th congressional election.
I've tried a few different methods I've found on here, but most of them divide the number of rows by the number of rows in the entire dataframe, rather than by just the rows that meet the 3 given criteria. Here's the latest thing I've tried:
term=80
newdf = pd.crosstab(index=df['party'], columns=df['state']).stack()/len(df[df['congress']==term])
I define term because I'll only care about one term at a time for each dataframe.
A method I tried using groupby involved the following:
newdf = df.groupby(['congress', 'chamber', 'state']).agg({'party': 'count'})
state_pcts = newdf.groupby('party').apply(lambda x: 100 * x / float(x.sum()))
And it does group by term, chamber, and state, but it returns numbers that don't mean anything to me when I check against what the actual results should be.
Basically, you can do the following using value_counts for each group:
def func(f):
    return f['party'].value_counts(normalize=True)

df = (df
      .groupby(['congress', 'chamber', 'state'])
      .apply(func)
      .reset_index()
      .rename(columns={'party': 'perc', 'level_3': 'party'}))
print(df)
congress chamber state party perc
0 80 house NJ D 1.000000
1 80 house TX D 0.666667
2 80 house TX R 0.333333
3 80 senate KY R 1.000000
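An alternative sketch that avoids apply entirely: count every full combination with size, then divide by the (congress, chamber, state) group totals via transform:

counts = df.groupby(['congress', 'chamber', 'state', 'party']).size().reset_index(name='n')
counts['perc'] = counts['n'] / counts.groupby(['congress', 'chamber', 'state'])['n'].transform('sum')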
I need to sum the values of one column, grouping by another column, and overwrite the dataframe with those values.
I have tried:
df.groupby('S/T name')['Age group (Years)Total Persons'].sum()
Dataframe to implement the sum on:
S/T code S/T name city name population
1 NSW Greater sydney 1000
1 NSW rest of nsw 100
1 NSW rest of nsw 2000
2 Victoria Geelong 1200
2 Victoria Melbourne 1300
2 Victoria Melbourne 1000
Required output:
S/T code S/T name population
1 NSW 3100
2 Victoria 3500
You seem to be summing the wrong column in your example; switching to population would have got you most of the way:
df.groupby('S/T name')['population'].sum()
Since you want to retain the S/T code column, though, you can use agg, calling sum on the population column and mean on the S/T code column:
df.groupby('S/T name').agg({'population': 'sum', 'S/T code': 'mean'})
Output:
S/T name S/T code population
NSW 1 3100
Victoria 2 3500
Try the following code:
Solution 1
grouped_df = df.groupby('S/T name')['population'].sum()
print(grouped_df)
The above code groups results by the S/T name column and gives the sum of the population column.
Solution 2
grouped_df1 = df.groupby('S/T name').agg({'S/T code': 'unique', 'population': 'sum'})
grouped_df1
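If the goal is to match the required output exactly, a minimal sketch is to group on both key columns, which keeps S/T code as an ordinary column without aggregating it:

out = df.groupby(['S/T code', 'S/T name'], as_index=False)['population'].sum()
print(out)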
I have this real estate data:
neighborhood type_property type_negotiation price
Smallville house rent 2000
Oakville apartment for sale 100000
King Bay house for sale 250000
...
I have this groupby that identifies which values in the data set are a house for sale, and then returns the 10th and 90th percentile and quantity of these houses for each neighborhood in a new data frame called df_breakdown. The result looks like this:
neighborhood tenthpercentile ninetiethpercentile Quantity
King Bay 250000.0 250000.0 1
Smallville 99000.0 120000.0 8
Oakville 45000.0 160000.0 6
...
I now want to take this information back to my original real estate data set and filter out any listing that is a house for sale priced above the 90th percentile or below the 10th percentile computed for its neighborhood. For example, a house in the Oakville neighborhood priced at 350000 would be filtered out.
I have used this argument before:
df1 = df[df.price < df.price.quantile(.90)]
But I don't know how to utilize it for differing values for each neighborhood, or even if it is useful to use. Thank you in advance for the help.
Probably not the most elegant but you could join the percentile aggregations to each of the real estate data.
df.join(df.groupby('neighborhood').quantile([0.1, 0.9]), on='neighborhood')
On mobile, so forgive me if the syntax isn’t perfect.
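One wrinkle: groupby(...).quantile([0.1, 0.9]) returns a Series with a MultiIndex of (neighborhood, quantile), so the join above may need the quantiles flattened into columns first. A sketch of that, with hypothetical column names p10/p90:

q = df.groupby('neighborhood')['price'].quantile([0.1, 0.9]).unstack()
q.columns = ['p10', 'p90']   # hypothetical names for the 0.1 and 0.9 columns
merged = df.join(q, on='neighborhood')
filtered = merged[merged['price'].between(merged['p10'], merged['p90'])]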
You can set them to have the same indexes, broadcast the percentiles, and just use .between.
So first,
df2 = df2.set_index('neighborhood')
df = df.set_index('neighborhood')
Then, broadcast using loc
df.loc[:, 't'], df.loc[:, 'n'] = df2.tenthpercentile, df2.ninetiethpercentile
Finally,
df.price.between(df.t, df.n)
which yields
neighborhood
Smallville False
Oakville True
King Bay True
King Bay False
dtype: bool
So to filter, just slice
df[df.price.between(df.t, df.n)]