I have a large DataFrame looking like this
name Country ...
1 Paul Germany
2 Paul Germany
3 George Italy
3 George Italy
3 George Italy
...
N John USA
I'm looking for the occurence of each element of the name column, such has
name Country Count
1 Paul Germany 2000
2 George Italy 500
...
N John USA 40000
Any idea what is the most optimal way to do it ?
Because this is quite long
df['count'] = df.groupby(['name'])['name'].transform(pd.Series.value_counts)
you can do it like this:
df.groupby(['name', 'Country']).size()
example:
import pandas as pd
df = pd.DataFrame.from_dict({'name' : ['paul', 'paul', 'George', 'George', 'George'],
'Country': ['Germany', 'Italy','Germany','Italy','Italy']})
df
output:
Country name
0 Germany paul
1 Italy paul
2 Germany George
3 Italy George
4 Italy George
Group by and get count:
df.groupby(['name', 'Country']).size()
output:
name Country
George Germany 1
Italy 2
paul Germany 1
Italy 1
If you just want to the counts with respect to the name column, you don't need to use groupby, you can just use select the name column from the DataFrame (which returns a Series object) and call value_counts() on it directly:
df['name'].value_counts()
Related
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way..use value_counts
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
Let's say I have this dataframe :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
I want to group on the "Country" columns and to keep the records with the most frequent records in the groupby object.
The expected result would be :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
I already tried extracting the prefix, then getting the mode of the prefix on the dataframe and merging rows with this mode, but I feel that a more direct and more efficient solution exists.
Here is the working sample code below for reproducible results :
df = pd.DataFrame({
"Country": ["Spain","Spain","Spain","Spain","USA","USA","USA","USA"],
"City": ["m1_name","m1_location","m1_size","m2_location","m1_name","m2_name","m3_size","m3_location"]
})
df['prefix'] = df['City'].str[1]
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
df = df.merge(modes, how="right", left_on=['Country','prefix'], right_on=['Country',"modes"])
df = df.drop(['modes','prefix'], axis = 1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
You can try groupby and apply to filter group rows
out = (df.assign(prefix=df['City'].str.split('_').str[0])
.groupby('Country')
.apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
.reset_index(drop=True)
.drop('prefix',axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
Use:
In [575]: df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
In [589]: idx = df.groupby('Country')['Prefix_count'].transform(max) == df['Prefix_count']
In [593]: df[idx].drop('Prefix_count', 1)
Out[593]:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
An interesting fact about the solutions proposed below is that Mayank's one is way faster. I ran it on 1000 rows on my data and got :
Mayank's solution : 0.020 seconds
Ynjxsjmh's solution : 0.402 seconds
My (OP) solution : 0.122 seconds
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population by calculating a sum of every city for each country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this won't work, could I have some advise on the problem?
Sounds like you're looking for groupby
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
Appending this Series to the DataFrame. (Arguably discouraged though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
based on #Vaishali's comment, a one-liner
df['Country population'] = df.groupby([ 'country']).transform('sum')['city_population']
I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewableis a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continentand % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now, if I index for a value that the category value is > 0 this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
So, here is my dataframe
import pandas as pd
cols = ['Name','Country','Income']
vals = [['Steve','USA',40000],['Matt','UK',40000],['John','USA',40000],['Martin','France',40000],]
x = pd.DataFrame(vals,columns=cols)
I have another list:
europe = ['UK','France']
I want to create a new column 'Continent' if x.Country is in europe
You need numpy.where with condition with isin:
x['Continent'] = np.where(x['Country'].isin(europe), 'Europe', 'Not Europe')
print (x)
Name Country Income Continent
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe
Or you can using isin directly
x['New Column']='Not Europe'
x.loc[x.Country.isin(europe),'New Column']='Europe'
Out[612]:
Name Country Income New Column
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe