I would like to apply different "aggfunc" logics to a pandas pivot table. Let's suppose that I have the DataFrame below.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Country': ['Italy', 'Italy', 'Italy', 'Germany', 'Germany', 'Germany', 'France', 'France'],
                    'City': ['Rome', 'Rome', 'Florence', 'Berlin', 'Munich', 'Koln', 'Paris', 'Paris'],
                    'Numbers': [100, 200, 300, 400, 500, 600, 700, 800]})
I would like to calculate the sum of "Numbers" per City and the mean of "Numbers" per Country. I should get the output below.
I have to use a pandas pivot table, but if you have a better solution, feel free to suggest that as well.
Would you be able to help me out?
Country  City      SUM   MEAN
France   Paris     1500  750
Germany  Berlin    400   500
Germany  Köln      600   500
Germany  Munich    500   500
Italy    Florence  300   200
Italy    Rome      300   200
I have tried using the following but it obviously does not work.
pd.pivot_table(df1, values = 'Numbers', index=['Country', 'City'], aggfunc=[np.sum, np.mean])
Use GroupBy.transform:
new_df = (
    df1.assign(
        SUM=df1.groupby('City', sort=False)['Numbers'].transform('sum'),
        MEAN=df1.groupby('Country', sort=False)['Numbers'].transform('mean')
    )
    .drop_duplicates(['Country', 'City'])
    .drop('Numbers', axis=1)
)
print(new_df)
   Country      City   SUM   MEAN
0    Italy      Rome   300  200.0
2    Italy  Florence   300  200.0
3  Germany    Berlin   400  500.0
4  Germany    Munich   500  500.0
5  Germany      Koln   600  500.0
6   France     Paris  1500  750.0
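Since the question asks for a pivot table specifically, here is a minimal sketch of one way to keep pd.pivot_table in the picture. The attempt with aggfunc=[np.sum, np.mean] computes both aggregates over the same (Country, City) groups, so the mean ends up per city rather than per country. The sketch below takes the per-city sum from pivot_table and merges in a per-country mean computed separately; the names city_sum, country_mean and result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Germany', 'Germany', 'Germany', 'France', 'France'],
    'City': ['Rome', 'Rome', 'Florence', 'Berlin', 'Munich', 'Koln', 'Paris', 'Paris'],
    'Numbers': [100, 200, 300, 400, 500, 600, 700, 800],
})

# Sum of Numbers for every (Country, City) pair, computed with pivot_table.
city_sum = (
    pd.pivot_table(df1, values='Numbers', index=['Country', 'City'], aggfunc='sum')
    .rename(columns={'Numbers': 'SUM'})
    .reset_index()
)

# Mean of Numbers per Country, computed over the original rows.
country_mean = (
    df1.groupby('Country')['Numbers']
    .mean()
    .rename('MEAN')
    .reset_index()
)

# Attach the country-level mean to every city row of that country.
result = city_sum.merge(country_mean, on='Country', how='left')
print(result)
```

The rows come out sorted by Country and City because pivot_table sorts its index, which matches the ordering of the expected output in the question.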
Related
Have a dataframe where I need to check, group by and sum all the data
I have used a regex function to find and group all the rows whose names start with the respective country.
Suppose I have a dataset
Countries 31-12-17 1-1-18 2-1-18 3-1-18 Sum
India-Basic 1200 1100 800 900 4000
Sweden-Basic 1500 1300 700 1500 5000
Norway-Basic 800 400 900 900 3000
India-Exp 600 1400 300 200 2500
Sweden-Exp 1800 400 600 700 3500
Norway-Exp 1300 1600 1100 1500 4500
Expected Output :
Countries Sum
India 6500
Sweden 8500
Norway 7500
For a regex solution, use Series.str.extract and aggregate with sum:
df1 = (df.groupby(df['Countries'].str.extract('(.*)-', expand=False), sort=False)['Sum']
.sum()
.reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
Alternatively, split Countries by - and select the first element of each list with str[0]:
df1 = (df.groupby(df['Countries'].str.split('-').str[0], sort=False)['Sum']
.sum()
.reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
This could work. Note that I only kept the columns that are relevant:
(df.filter(['Countries','Sum'])
.assign(Countries = lambda x: x.Countries.str.split('-').str.get(0))
.groupby('Countries')
.agg('sum')
)
Sum
Countries
India 6500
Norway 7500
Sweden 8500
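For anyone who wants to run the approaches above end to end, here is a small self-contained sketch. It rebuilds only the Countries and Sum columns from the question and checks that the regex key and the split key give the same totals. The last part uses a made-up label, 'Guinea-Bissau-Basic', which is not in the original data, to show where the two keys could diverge when the country name itself contains a hyphen.

```python
import pandas as pd

df = pd.DataFrame({
    'Countries': ['India-Basic', 'Sweden-Basic', 'Norway-Basic',
                  'India-Exp', 'Sweden-Exp', 'Norway-Exp'],
    'Sum': [4000, 5000, 3000, 2500, 3500, 4500],
})

# Two ways to derive the grouping key from the label.
key_regex = df['Countries'].str.extract('(.*)-', expand=False)
key_split = df['Countries'].str.split('-').str[0]

by_regex = df.groupby(key_regex, sort=False)['Sum'].sum().reset_index()
by_split = df.groupby(key_split, sort=False)['Sum'].sum().reset_index()
print(by_regex)

# Hypothetical label with a hyphen inside the country name (not in the question's data).
label = pd.Series(['Guinea-Bissau-Basic'])
greedy = label.str.extract('(.*)-', expand=False)   # greedy: everything before the LAST hyphen
first = label.str.split('-').str[0]                 # everything before the FIRST hyphen
last_cut = label.str.rsplit('-', n=1).str[0]        # split once from the right
print(greedy[0], first[0], last_cut[0])
```

If such names can occur, splitting once from the right with str.rsplit('-', n=1) keeps the behaviour in line with the greedy regex.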
I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the columns Country and City from both dataframes, and when they match (for example, US Washington appears in both dataframes), take the Pop value and divide it by Density to get a new column, area, in dfB.
For the first row, the result would be dfB['area km2'] = 100.
I have tried np.where() but it is not working. Any hints on how to achieve this?
Using index matching and div:
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.assign(ratio=dfA.Pop.div(dfB.set_index(match_on)['Density (pop/km2)']))
                     Pop  ratio
Country City
US      Washington  1000  100.0
        Texas       5000  100.0
CH      Geneva       500  100.0
        Zurich       500  100.0
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0
You can also use merge like this:
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB
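Here is a self-contained sketch of the merge approach that rebuilds both frames from the question, so it can be run as-is. The validate='one_to_one' argument is an optional safety check I added on the assumption that each (Country, City) pair appears at most once in each frame; pandas raises a MergeError if that does not hold. The column name 'area km2' follows the question.

```python
import pandas as pd

dfA = pd.DataFrame({
    'Country': ['US', 'US', 'CH', 'CH'],
    'City': ['Washington', 'Texas', 'Geneva', 'Zurich'],
    'Pop': [1000, 5000, 500, 500],
})
dfB = pd.DataFrame({
    'Country': ['US', 'US', 'CH', 'CH'],
    'City': ['Washington', 'Texas', 'Geneva', 'Zurich'],
    'Density (pop/km2)': [10, 50, 5, 5],
})

# Left merge keeps every row of dfB; rows without a match in dfA would get NaN for Pop.
merged = dfB.merge(dfA, on=['Country', 'City'], how='left', validate='one_to_one')

# Area = population / density.
merged['area km2'] = merged['Pop'] / merged['Density (pop/km2)']
print(merged)
```

Compared with np.where, the merge does the row matching itself, so the two frames do not need to be in the same order.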
Let's say I have a DataFrame that looks like this:
Bank Name House This Wk
Barc Germany 100
Barc UK 300
Barc UK 500
JPM Japan 200
JPM NYC 100
BOA LA 900
BOA LA 50
BOA LA 50
DB Italy 45
I would like to group-by Bank Name, while outputting the largest House Value as well as the total value...
For example, using the example above would result in:
Bank Name Total House This Wk
Barc 900 UK 500
JPM 300 Japan 200
BOA 1000 LA 900
DB 45 Italy 45
Essentially, it is grouping the Total by Bank Name, but also outputting the largest contributor, House, to the total and the amount contributed is This Wk.
How can I go about doing this?
In [121]: df.groupby('Bank Name', group_keys=False) \
...: .apply(lambda x: x.nlargest(1, 'This Wk').assign(Total=x['This Wk'].sum())) \
...: [['Bank Name','Total','House','This Wk']]
...:
Out[121]:
Bank Name Total House This Wk
5 BOA 1000 LA 900
2 Barc 900 UK 500
8 DB 45 Italy 45
3 JPM 300 Japan 200
You can consider df.groupby with a list of functions passed to DataFrameGroupBy.agg:
In [732]: out = df.groupby('Bank Name')['This Wk'].agg(['sum', 'idxmax', 'max'])\
.rename(columns={'sum' : 'Total', 'idxmax' : 'House', 'max' : 'This Wk'})\
.reset_index()
In [734]: out['House'] = df.loc[out['House'], 'House'].values; out
Out[734]:
Bank Name Total House This Wk
0 BOA 1000 LA 900
1 Barc 900 UK 500
2 DB 45 Italy 45
3 JPM 300 Japan 200
Another way using apply would be
In [17]: (df.groupby('Bank Name', sort=False)
.apply(lambda x: pd.Series(
[x['This Wk'].sum(),
x.loc[x['This Wk'].idxmax(), 'House'],
x['This Wk'].max()],
index=['Total', 'House', 'This Wk']))
.reset_index())
Out[17]:
Bank Name Total House This Wk
0 Barc 900 UK 500
1 JPM 300 Japan 200
2 BOA 1000 LA 900
3 DB 45 Italy 45
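As a further variation, here is a sketch that avoids apply altogether: it picks the row with the largest This Wk per bank via idxmax and attaches the group total with transform. The frame is rebuilt from the question's sample; the names top and result are only illustrative, and ties for the maximum are resolved by idxmax returning the first occurrence.

```python
import pandas as pd

df = pd.DataFrame({
    'Bank Name': ['Barc', 'Barc', 'Barc', 'JPM', 'JPM', 'BOA', 'BOA', 'BOA', 'DB'],
    'House': ['Germany', 'UK', 'UK', 'Japan', 'NYC', 'LA', 'LA', 'LA', 'Italy'],
    'This Wk': [100, 300, 500, 200, 100, 900, 50, 50, 45],
})

# Row label of the largest 'This Wk' within each bank, in order of first appearance.
top = df.loc[df.groupby('Bank Name', sort=False)['This Wk'].idxmax()]

# Group total broadcast to every row; assign aligns it to the selected rows by index.
top = top.assign(Total=df.groupby('Bank Name')['This Wk'].transform('sum'))

result = top[['Bank Name', 'Total', 'House', 'This Wk']].reset_index(drop=True)
print(result)
```

On large frames this tends to be quicker than a Python-level lambda per group, though that is worth measuring on your own data.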
I have a large DataFrame looking like this
name Country ...
1 Paul Germany
2 Paul Germany
3 George Italy
3 George Italy
3 George Italy
...
N John USA
I'm looking for the number of occurrences of each element of the name column, such as
name Country Count
1 Paul Germany 2000
2 George Italy 500
...
N John USA 40000
Any idea what the most efficient way to do it is?
Because this takes quite a long time:
df['count'] = df.groupby(['name'])['name'].transform(pd.Series.value_counts)
You can do it like this:
df.groupby(['name', 'Country']).size()
Example:
import pandas as pd
df = pd.DataFrame.from_dict({'name' : ['paul', 'paul', 'George', 'George', 'George'],
'Country': ['Germany', 'Italy','Germany','Italy','Italy']})
df
output:
Country name
0 Germany paul
1 Italy paul
2 Germany George
3 Italy George
4 Italy George
Group by and get count:
df.groupby(['name', 'Country']).size()
output:
name Country
George Germany 1
Italy 2
paul Germany 1
Italy 1
If you just want the counts with respect to the name column, you don't need groupby. You can select the name column from the DataFrame (which returns a Series object) and call value_counts() on it directly:
df['name'].value_counts()
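To get the table shape from the question (one row per name and Country plus a Count column), a minimal sketch could look like the one below. It uses a tiny made-up sample in place of the large frame. The second part shows a per-row count column as a possible replacement for transform(pd.Series.value_counts); note that 'count' skips missing values, which I am assuming is acceptable here.

```python
import pandas as pd

# Small stand-in for the large DataFrame in the question.
df = pd.DataFrame({
    'name': ['Paul', 'Paul', 'George', 'George', 'George'],
    'Country': ['Germany', 'Germany', 'Italy', 'Italy', 'Italy'],
})

# One row per (name, Country) with the number of occurrences as a regular column.
counts = (
    df.groupby(['name', 'Country'], sort=False)
    .size()
    .reset_index(name='Count')
)
print(counts)

# Per-row count of the name, using the built-in 'count' aggregation.
df['count'] = df.groupby('name')['name'].transform('count')
print(df)
```

Built-in aggregation names such as 'count' are usually much faster than passing a Python function to transform, but timing it on the real data is the only way to be sure.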
My sample data looks like this:
data=[['Australia',100],['France',200],['Germany',300],['America',400]]
What I expect is a dataframe like this:
volume
Australia 100
France 200
Germany 300
America 400
And I've tried the following:
pd.DataFrame(data,columns=['Country','Volume'])
Country Volume
0 Australia 100
1 France 200
2 Germany 300
3 America 400
pd.DataFrame.from_items()
However, I still can't get the expected result.
Is there a way to get the expected pandas dataframe structure?
Thanks in advance for your help.
You can call set_index on the resulting dataframe:
In [2]:
data=[['Australia',100],['France',200],['Germany',300],['America',400]]
pd.DataFrame(data,columns=['Country','Volume']).set_index('Country')
Out[2]:
Volume
Country
Australia 100
France 200
Germany 300
America 400
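If you would rather build the index up front instead of calling set_index afterwards, here is one possible sketch. It unzips the list of pairs and passes the country names as a named index. As far as I know, pd.DataFrame.from_items was deprecated and then removed in pandas 1.0, so it is not an option on current versions.

```python
import pandas as pd

data = [['Australia', 100], ['France', 200], ['Germany', 300], ['America', 400]]

# Split the pairs into two parallel sequences: names and values.
countries, volumes = zip(*data)

# Use the country names directly as a named index.
df = pd.DataFrame(
    {'Volume': list(volumes)},
    index=pd.Index(countries, name='Country'),
)
print(df)
```

The result has the same structure as the set_index version: a single Volume column indexed by Country, in the original row order.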