sum using group by not giving expected result - python

I need to sum values of one column using group by on another column and override the dataframe with those values
I have tried-
df.groupby('S/T name')['Age group (Years)Total Persons'].sum()
Dataframe to implement sum on -
S/T code S/T name city name population
1 NSW Greater sydney 1000
1 NSW rest of nsw 100
1 NSW rest of nsw 2000
2 Victoria Geelong 1200
2 Victoria Melbourne 1300
2 Victoria Melbourne 1000
Required ouput-
S/T code S/T name population
1 NSW 3100
2 Victoria 3500

You seem to be summing on the wrong column in your example, switching to population would have got you most of the way:
df.groupby('S/T name')['population'].sum()
Since you want to retain the S/T code column though you can use agg. Calling sum on your population column and mean on your S/T code column:
df.groupby('S/T name').agg({'population': 'sum', 'S/T code': 'mean'})
Output:
S/T name S/T code population
NSW 1 3100
Victoria 2 3500

Try the following code:
Solution 1
grouped_df = df.groupby('S/T name')['population'].sum()
print(grouped_df)
The above code will group results by column S/T name and give the sum of population column.
Solution 2
grouped_df1 = df.groupby('S/T name').agg({'S/Tcode':'unique','population': 'sum'})
grouped_df1

Related

How can I create a dictionary to map individual missing values with aggregate function in Pandas?

I'm not even sure how to describe this, but I am looking for a method to sum the values of others and replace the NaN with a particular value.
My data consists of Organization, Population and Square Miles. I have population and square miles data for each county in the state, however some Organizations span across multiple counties. I merged the two data sets (organization info with pop/square miles data) but am obviously left with NaNs for the organizations that span across multiple counties.
If I create a dictionary like the following:
counties = {'Job a': ['county a', 'county b', 'county c'],
'Job b': ['county d', 'county e', 'county f'],
'Job c': ['county g', 'county h']} etc..
If I have table 1 like this:
county sq_mile population
0 County a 2500 15000
1 County b 750 400
2 County c 4000 3500
3 County d 4300 4500
4 County e 2000 1500
5 County f 1000 1500
6 County g 4300 3500
7 County h 4400 3200
8 County i 4000 3500
How can I take table 2 (which has just organizations) and fill missing data using the dictionary and associated column values?
organization sq_mile population
0 job a 7250 18900
1 job b 7300 7500
2 job c 8700 6700
3 county i 4000 3500
Try this:
counties_r = {i: k for k in counties.keys() for i in counties[k]}
df.groupby(df['County'].str.lower().map(counties_r).fillna(df['County']))\
.sum().reset_index()
Output:
County Sq. Mile Population
0 County i 4000 3500
1 Job a 7250 18900
2 Job b 7300 7500
3 Job c 8700 6700
Swap the keys and values in your dictionary the map that swapped dictionary to the county column making everything lowercase then use groupby with sum.

Python pandas to select a row value after group by

May I know how to select a row of values which has max count number after grouping by a column
Examples:
STATE COUNTY POPULATION
1 5571 1000
2 3421 2000
3 6781 3000
2 1234 4000
2 3344 6600
1 5566 9900
I want to find the STATE with max number of count of county, select STATE and COUNTY to show only, without POPULATION.
Answer should be, but i dont know how to code it in python. Thanks for help
STATE COUNTY
2 3
Try:
u = df.groupby('STATE')['COUNTRY'].size()
v = u[u.index==u.idxmax()].reset_index()
v:
STATE COUNTRY 0 2 3
Approach:
Group by STATE and then use nunique if you want to count distinct values or size on COUNTRY Column.
get the index of the row where count is the max.

Multiply columns based on two columns conditions from different dataframes?

I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the columns Country and City from both dataframes, and when these match such as:
US Washington & US Washington in both dataframes, it takes the Pop value and divides it by Density, as to get a new column area in dfB with the resulting division. Example of first row results dfB['area km2'] = 100
I have tried with np.where() but it is nit working. Any hints on how to achieve this?
Using index matching and div
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.assign(ratio=dfA.Pop.div(df.set_index(['Country', 'City'])['Density (pop/km2)']))
Country City
US Washington 100.0
Texas 100.0
CH Geneva 100.0
Zurich 100.0
dtype: float64
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0
you can also use merge like below
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB

How to run a groupby based on result of other/previous groupby?

Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
df={
'Product':'Chair',
'Country': ['USA','USA', 'China','China','China','China','India',
'India','India','India','India','India', 'India'],
'Region': ['USA_West','USA_East', 'China_West','China_East','China_South','China_South', 'India_North','India_North', 'India_North','India_West','India_West','India_East','India_South'],
'City': ['A','B', 'C','D','E', 'F', 'G','H','I', 'J','K', 'L', 'M'],
'Sales':[1000,1000, 1200,200,200, 200,500 ,350,350,100,700,50,50]
}
dff=pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried: groupby('Product', 'City').apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". This is the city with highest sales globally, but China is not the Country with highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
Idea is aggregate sum per each level with Series.idxmax for top1 value, what is used for filtering for next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame if total count of a particular column occurs only 1 time
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series occur more than 1 time; and that the code returned will look something like the following:
Population
True
GDP
True
#McDonalds
False
#Schools
False
#Cars
False
#Tshirts
False
I want to write something like the following so that my new DataFrame drops column values from df.Series that occur only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array by either list comprehensions or using DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't work for any moderately large dataframes. A much faster and more "dataframe" way is to add a value count column and filter out count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count < 1 for the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]

Categories

Resources