How can I select the row with the maximum count after grouping by a column?
Examples:
STATE COUNTY POPULATION
1 5571 1000
2 3421 2000
3 6781 3000
2 1234 4000
2 3344 6600
1 5566 9900
I want to find the STATE with the maximum count of counties, and show only STATE and the count of COUNTY, without POPULATION. The answer should be as below, but I don't know how to code it in Python. Thanks for the help.
STATE COUNTY
2     3
Try (note the column is COUNTY, not COUNTRY):
u = df.groupby('STATE')['COUNTY'].size()
v = u[u.index == u.idxmax()].reset_index()
v:
   STATE  COUNTY
0      2       3
Approach:
Group by STATE, then use nunique on the COUNTY column if you want to count distinct values, or size to count rows.
Get the index of the row where the count is the max.
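In case COUNTY values could repeat within a state, here is a minimal sketch of the nunique variant mentioned above, built from the example data in the question (u == u.max() also keeps ties, unlike idxmax):
import pandas as pd

df = pd.DataFrame({
    'STATE': [1, 2, 3, 2, 2, 1],
    'COUNTY': [5571, 3421, 6781, 1234, 3344, 5566],
    'POPULATION': [1000, 2000, 3000, 4000, 6600, 9900],
})

u = df.groupby('STATE')['COUNTY'].nunique()  # distinct counties per state
v = u[u == u.max()].reset_index()            # keep the state(s) with the top count
print(v)
#    STATE  COUNTY
# 0      2       3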
I have a table. There are numbers in the column 'Para'. I have to find the index of a particular sentence from the column 'Country_Title' such that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
Suppose my keyword is China, and I want to get the index of the sentence with 'China', but only the one where 'Para' = 2.
Consider the rows at indexes 4 and 7; both mention China. But I want to obtain the index of the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from above table as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country title as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to iterate through the elements in 'c' and find the index of a particular country in the table 'df_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error.
I want to get the index, but this doesn't work.
Please post your suggestions on how I can approach this.
You need two equals signs in your condition: == for comparison, not =.
If you need only the 'index', that is, the value from your first column called Index, then you can convert the index of the frame returned by .loc[] to a list and take the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
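For what it's worth, the loop can be avoided entirely by filtering on both conditions at once; a sketch assuming df_countries as posted, with the Index column as the frame's index:
import pandas as pd

df_countries = pd.DataFrame({
    'Sequence': [5, 6, 7, 9, 10, 11, 13, 14],
    'Para': [4, 6, 2, 3, 2, 1, 2, 3],
    'Country_Title': ['India is seventh largest country',
                      'Australia is a continent country',
                      'Canada is the 2nd largest country',
                      'UAE is a country in Western Asia',
                      'China is a country in East Asia',
                      'Germany is in Central Europe',
                      'Russia is the largest country',
                      'Capital city of China is Beijing'],
})

# Boolean mask: Para equals 2 AND the title mentions the keyword
mask = (df_countries['Para'] == 2) & df_countries['Country_Title'].str.contains('China')
ind = df_countries.index[mask].to_list()[0]
print(ind)  # 4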
I have a dataframe shown in image 1. It is a sample of pubs in London, UK (3337 pubs/rows), and the geometry is at the LSOA level. In some LSOAs there is more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is a key plus one column, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to group the dataframe so that there is only one row for every LSOA, saying how many pubs are in it?
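A minimal sketch of one way to do this, assuming psdf as described (the pub_count column name is just a placeholder): the value_counts Series already holds the numbers, so it only needs converting back into a two-column frame, or the same table can be built directly with groupby and size:
# from the value_counts Series
pub_counts = (psdf['lsoa11nm'].value_counts()
                              .rename_axis('lsoa11nm')
                              .reset_index(name='pub_count'))

# equivalent, via groupby
pub_counts = (psdf.groupby('lsoa11nm')
                  .size()
                  .reset_index(name='pub_count'))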
I need to sum the values of one column, grouping by another column, and overwrite the dataframe with those values.
I have tried:
df.groupby('S/T name')['Age group (Years)Total Persons'].sum()
Dataframe to implement sum on:
S/T code S/T name city name population
1 NSW Greater sydney 1000
1 NSW rest of nsw 100
1 NSW rest of nsw 2000
2 Victoria Geelong 1200
2 Victoria Melbourne 1300
2 Victoria Melbourne 1000
Required output:
S/T code S/T name population
1 NSW 3100
2 Victoria 3500
You seem to be summing on the wrong column in your example; switching to population would have got you most of the way:
df.groupby('S/T name')['population'].sum()
Since you want to retain the S/T code column, though, you can use agg, calling sum on your population column and mean on your S/T code column:
df.groupby('S/T name').agg({'population': 'sum', 'S/T code': 'mean'})
Output:
          S/T code  population
S/T name
NSW              1        3100
Victoria         2        3500
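If S/T name needs to come back as an ordinary column to match the required output exactly, a reset_index can be chained on. A small sketch using the same df (taking 'first' instead of 'mean' is my judgment call: S/T code is constant within each state, and mean would return a float):
out = (df.groupby('S/T name')
         .agg({'S/T code': 'first', 'population': 'sum'})
         .reset_index())
print(out)
#    S/T name  S/T code  population
# 0       NSW         1        3100
# 1  Victoria         2        3500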
Try the following code:
Solution 1
grouped_df = df.groupby('S/T name')['population'].sum()
print(grouped_df)
The above code groups results by the S/T name column and gives the sum of the population column.
Solution 2
grouped_df1 = df.groupby('S/T name').agg({'S/T code': 'unique', 'population': 'sum'})
grouped_df1
Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
import pandas as pd

df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried groupby(['Product', 'City']).apply(lambda x: x.nlargest(1)), but this doesn't work, because it would propose city "C". That is the city with the highest sales globally, but China is not the country with the highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take Series.idxmax for the top value, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
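For the multi-product case mentioned in the question, a hedged sketch is to run the same drill-down once per product with groupby.apply (drill_down is a hypothetical helper name; assumes dff may hold several products):
def drill_down(g):
    # g holds the rows of a single product
    country = g.groupby('Country')['Sales'].sum().idxmax()
    g = g[g['Country'] == country]
    region = g.groupby('Region')['Sales'].sum().idxmax()
    g = g[g['Region'] == region]
    return g.groupby('City')['Sales'].sum().idxmax()

best_city = dff.groupby('Product').apply(drill_down)
print(best_city)
# Product
# Chair    G
# dtype: object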
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.
I have the following use case:
I want to make a dataframe where, for each row, I have a column showing how many interactions there have been for this ID (user) across the categories. The hardest part for me is that interactions can't be double counted, while a match in just one of the categories is enough to count as 1.
So for example I have:
richtingen id
0 Marketing, Sales 1110
1 Marketing, Sales 1110
2 Finance 220
3 Marketing, Engineering 1110
4 IT 3300
Now I want to create a third column where I can see how many times this ID has interacted with any of these categories in total. Each comma-separated value is a category on its own, so for example "Marketing, Sales" is the two categories Marketing and Sales. To get a +1 you only need a match with another row where the ID is the same and one of the categories matches; for example, for index 0 it would be 3 (indexes 0, 1 and 3 match). The output data for the example should be:
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
The hard part for me seems to be that I can't split all categories into new rows, as then you might start double counting. For example, index 0 matches both Marketing and Sales of index 1, and I want that to add just 1, not 2.
The code I have so far is:
df['freq'] = df.groupby(['id', 'richtingen'])['id'].transform('count')
this only matches identical combination of categories though.
Other things I've tried:
- creating a new column with all vacancies split into an array:
df['splitted'] = df.richtingen.apply(lambda x: str(x.split(",")))
and then the plan was to use something along the lines of this code, in combination with groupby on id, to count the number of times it is true per item:
if any(t < 0 for t in x):
# do something
I couldn't get this to work either.
I tried splitting categories in new rows, or columns but then got an issue of double counting.
For example using code suggested:
df['richtingen'].str.split(', ',expand=True)
Gives me the following:
0 1 id
0 Marketing Sales 1110
1 Marketing Sales 1110
2 dDD None 220
3 Marketing Engineering 1110
4 ddsad None 3300
But then I would need code that goes over every row, checks the ID, lists the values in the columns, and checks whether they are contained in any of the other rows (where the ID is the same); if one of them matches, it adds 1 to freq. I suspect this might be possible with groupby, but I'm not sure and can't figure it out.
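A minimal sketch of exactly that row-by-row idea, built from the example data above (brute force, so only sensible for small frames; count_interactions is a hypothetical helper name):
import pandas as pd

df = pd.DataFrame({
    'richtingen': ['Marketing, Sales', 'Marketing, Sales', 'Finance',
                   'Marketing, Engineering', 'IT'],
    'id': [1110, 1110, 220, 1110, 3300],
})

# one set of categories per row, so overlap is a plain set intersection
cats = df['richtingen'].str.split(', ').apply(set)

def count_interactions(i):
    # rows with the same id whose category set shares at least one entry
    same_id = df.index[df['id'] == df.at[i, 'id']]
    return sum(1 for j in same_id if cats[j] & cats[i])

df['freq'] = [count_interactions(i) for i in df.index]
print(df)
#                richtingen    id  freq
# 0        Marketing, Sales  1110     3
# 1        Marketing, Sales  1110     3
# 2                 Finance   220     1
# 3  Marketing, Engineering  1110     3
# 4                      IT  3300     1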
(Solution suggested by Jezrael below):
If you need to count unique categories per id: first split, create a MultiIndex Series with stack, and last use SeriesGroupBy.nunique with map for a new column of the original DataFrame.
I think this solution is perhaps close to what I need, but at the moment it counts the total number of unique categories (not the number of interactions with those categories). For example, the output at index 2 here is 2, while it should be 1 (as the user only interacted with the categories once).
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance, Accounting 220 2
3 Marketing, Engineering 1110 3
4 IT 3300 1
Hope I made myself clear and that someone knows how to fix this! In total there will be around 13 categories, always in one cell, divided by commas.
For msr_003:
id richtingen freq_x freq_y
0 220 Finance, IT 0 2
1 1110 Finance, IT 1 2
2 1110 Marketing, Sales 2 4
3 1110 Marketing, Sales 3 4
4 220 Marketing 4 1
5 220 Finance 5 2
6 1110 Marketing, Sales 6 4
7 3300 IT 7 1
8 1110 Marketing, IT 8 4
If you need to count unique categories per id: first split, create a MultiIndex Series with stack, and last use SeriesGroupBy.nunique with map for a new column of the original DataFrame:
s = (df.set_index('id')['richtingen']
.str.split(', ',expand=True)
.stack()
.groupby(level=0)
.nunique())
print (s)
id
220 1
1110 3
3300 1
dtype: int64
df['freq'] = df['id'].map(s)
print (df)
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
Detail:
print (df.set_index('id')['richtingen'].str.split(', ',expand=True).stack())
id
1110 0 Marketing
1 Sales
0 Marketing
1 Sales
220 0 Finance
1110 0 Marketing
1 Engineering
3300 0 IT
dtype: object
I just modified your code as below.
count_unique = pd.DataFrame({
    'richtingen': ['Finance, IT', 'Finance, IT', 'Marketing, Sales', 'Marketing, Sales',
                   'Marketing', 'Finance', 'Marketing, Sales', 'IT', 'Marketing, IT'],
    'id': [220, 1110, 1110, 1110, 220, 220, 1110, 3300, 1110],
})
count_unique['freq'] = list(range(0, len(count_unique)))
grp = (count_unique.groupby(['richtingen', 'id'])
       .agg({'freq': 'count'})
       .reset_index(level=[0, 1]))
pd.merge(count_unique, grp, on=('richtingen', 'id'), how='left')
I am not that into pandas, but I think you may have some luck by adding 13 new columns based on richtingen, each containing one category or none. You can use DataFrame.apply or a similar function to compute the values when creating the columns.
Then you can take it from there by OR-ing stuff...
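A hedged sketch of that indicator-column idea, using str.get_dummies instead of 13 hand-built columns (assuming the example df from the question; the OR across categories becomes a boolean matrix product):
import pandas as pd

df = pd.DataFrame({
    'richtingen': ['Marketing, Sales', 'Marketing, Sales', 'Finance',
                   'Marketing, Engineering', 'IT'],
    'id': [1110, 1110, 220, 1110, 3300],
})

ind = df['richtingen'].str.get_dummies(sep=', ')       # one 0/1 column per category
overlap = (ind.values @ ind.values.T) > 0              # row pairs sharing >= 1 category
same_id = df['id'].values[:, None] == df['id'].values  # row pairs sharing the same id
df['freq'] = (overlap & same_id).sum(axis=1)           # 3, 3, 1, 3, 1 as required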