I have this pandas df
Area Element Code Element Code Item YCode Year Value Type
39 India 5312 Area harvested 125 Cassava 2000 2000 27179.0 food
40 India 5312 Area harvested 125 Cassava 2001 2001 27794.0 food
41 India 5312 Area harvested 125 Cassava 2002 2002 21408.0 food
42 India 5312 Area harvested 125 Cassava 2003 2003 36061.0 food
43 India 5312 Area harvested 125 Cassava 2004 2004 59585.0 food
... ... ... ... ... ... ... ... ... ...
1071 India 5510 Production 567 Watermelons 2014 2014 229267.0 food
1072 India 5510 Production 567 Watermelons 2015 2015 270686.0 food
1073 India 5510 Production 567 Watermelons 2016 2016 258691.0 food
1074 India 5510 Production 567 Watermelons 2017 2017 243203.0 food
1075 India 5510 Production 567 Watermelons 2018 2018 239896.0 food
I want to get a new column that contains the sum of Value over every Item (Cassava, Watermelons, ...) for the corresponding year.
I.e. if the year is 2001, the sum of the Value of every crop for 2001, then the same for the next year.
I would be grateful if anyone gives me an idea.
You can simply do:
df['new_col'] = df.groupby('Year')['Value'].transform('sum')
(note the grouping column is Year and the summed column is Value, matching your actual column names). That should get you what you need.
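As a quick check, here is the same idea on a tiny hand-made frame (the values are made up) using the question's column names Year and Value:

```python
import pandas as pd

# Tiny made-up sample mirroring the question's columns
df = pd.DataFrame({
    'Item': ['Cassava', 'Watermelons', 'Cassava', 'Watermelons'],
    'Year': [2000, 2000, 2001, 2001],
    'Value': [27179.0, 1000.0, 27794.0, 2000.0],
})

# For each row, sum 'Value' over all rows sharing the same 'Year'
df['year_total'] = df.groupby('Year')['Value'].transform('sum')
print(df)
```

Unlike a plain groupby sum, transform keeps the original row count, so the yearly total lands on every row of that year.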
My dataframe contains the number of matches for given fixtures, but only for home matches for a given team (i.e. the number of matches for Argentina-Uruguay is 97, but for Uruguay-Argentina this number is 80). In short, I want to sum both numbers of home matches for given teams, so that I have the total number of matches between the teams concerned. The dataframe's top 30 rows look like this:
most_often = mc.groupby(["home_team", "away_team"]).size().reset_index(name="how_many").sort_values(by=['how_many'], ascending = False)
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
As a result, I mean something like this:
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example; I want to apply it to every team in the dataframe.
What should I do?
OK, you can follow the steps below.
Here is the initial df:
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
First, create a sorted list of the two teams; it will be the key for the aggregation.
df1['sorted_list_team'] = list(zip(df1['home_team'],df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Next, convert this list to a tuple so it is hashable and can be used as a groupby key.
def converter(team_list):
    return (*team_list, )

df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
Do the aggregation to sum the 'how_many' values into another dataframe, which I call 'df_sum'.
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Then merge with 'df1' to attach the summed values. The column 'how_many' exists in both dfs, so pandas renames the one coming from df_sum to 'how_many_y'.
df1 = pd.merge(df1,df_sum[['sorted_list_team','how_many']], on='sorted_list_team',how='left').drop_duplicates()
As a final step, select only the columns you need from the result df.
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
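A shorter sketch of the same idea that skips the merge entirely: build an order-insensitive key from the two team names and let transform('sum') broadcast each pair's total back onto every original row. The mini-frame below is a hypothetical subset of the data:

```python
import pandas as pd

# Hypothetical subset of the match-count frame
df1 = pd.DataFrame({
    'home_team': ['Argentina', 'Uruguay', 'Austria', 'Hungary'],
    'away_team': ['Uruguay', 'Argentina', 'Hungary', 'Austria'],
    'how_many': [97, 80, 69, 68],
})

# Order-insensitive key: sorted tuple of the two teams,
# so (Argentina, Uruguay) and (Uruguay, Argentina) group together
key = df1[['home_team', 'away_team']].apply(lambda r: tuple(sorted(r)), axis=1)

# transform('sum') keeps both row orders and attaches the pair total to each
df1['total'] = df1.groupby(key)['how_many'].transform('sum')
print(df1)
```

Because transform preserves the row count, both directions of each fixture keep their own row, matching the desired output in the question.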
I found a relatively straightforward approach that hopefully does what you want, though it is slightly different from your desired output. Your output has what looks like repetitive information: we no longer care about home-vs-away, just the game counts, so let's get rid of that distinction (if we can...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just sum how_many where that new column matches:
df['teams'] = pd.Series(map('-'.join,np.sort(df[['home_team','away_team']],axis=1)))
# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df[['how_many','teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
I have an existing DataFrame which is grouped by job title and by year. I want to create a nested bar graph in Bokeh from this, but I am confused about what to put in order to plot it properly.
The dataframe:
size
fromJobtitle year
CEO 2000 236
2001 479
2002 4
Director 2000 42
2001 609
2002 188
Employee 1998 23
1999 365
2000 2393
2001 5806
2002 817
In House Lawyer 2000 5
2001 54
Manager 1999 8
2000 979
2001 2173
2002 141
Managing Director 1998 2
1999 14
2000 130
2001 199
2002 11
President 1999 31
2000 202
2001 558
2002 198
Trader 1999 5
2000 336
2001 494
2002 61
Unknown 1999 591
2000 2960
2001 3959
2002 673
Vice President 1999 49
2000 2040
2001 3836
2002 370
An example output is:
I assume you have a DataFrame df with three columns fromJobtitle, year, size. If you have a MultiIndex, reset the index. To use
FactorRange from bokeh, we need a list of tuples of two strings (this is important, floats won't work) like
[('CEO', '2000'), ('CEO', '2001'), ('CEO', '2002'), ...]
and so on.
This can be done with
df['x'] = df[['fromJobtitle', 'year']].apply(lambda x: (x[0],str(x[1])), axis=1)
And that is the heavy part; bokeh does the rest for you.
from bokeh.plotting import show, figure, output_notebook
from bokeh.models import FactorRange
output_notebook()
p = figure(
    x_range=FactorRange(*list(df["x"])),
    width=1400,
)
p.vbar(
    x="x",
    top="size",
    width=0.9,
    source=df,
)
show(p)
This is the generated figure
My dataframe looks like this: 3 columns. All I want to do is write a FUNCTION that takes the first two columns as inputs and returns the corresponding value of the third column (GHG intensity). I want to be able to input any property name and year and get the corresponding GHG intensity value. I cannot stress enough that this has to be written as a function using def. Please help!
Property Name Data Year \
467 GALLERY 37 2018
477 Navy Pier, Inc. 2016
1057 GALLERY 37 2015
1491 Navy Pier, Inc. 2015
1576 GALLERY 37 2016
2469 The Chicago Theatre 2016
3581 Navy Pier, Inc. 2014
4060 Ida Noyes Hall 2015
4231 Chicago Cultural Center 2015
4501 GALLERY 37 2017
5303 Harpo Studios 2015
5450 The Chicago Theatre 2015
5556 Chicago Cultural Center 2016
6275 MARTIN LUTHER KING COMMUNITY CENTER 2015
6409 MARTIN LUTHER KING COMMUNITY CENTER 2018
6665 Ida Noyes Hall 2017
7621 Ida Noyes Hall 2018
7668 MARTIN LUTHER KING COMMUNITY CENTER 2017
7792 The Chicago Theatre 2018
7819 Ida Noyes Hall 2016
8664 MARTIN LUTHER KING COMMUNITY CENTER 2016
8701 The Chicago Theatre 2017
9575 Chicago Cultural Center 2017
10066 Chicago Cultural Center 2018
GHG Intensity (kg CO2e/sq ft)
467 7.50
477 22.50
1057 8.30
1491 23.30
1576 7.40
2469 4.50
3581 17.68
4060 11.20
4231 13.70
4501 7.90
5303 18.70
5450 NaN
5556 10.30
6275 14.10
6409 12.70
6665 8.30
7621 8.40
7668 12.10
7792 4.40
7819 10.20
8664 12.90
8701 4.40
9575 9.30
10066 7.50
Here is an example, with a different data frame to test:
import pandas as pd

df = pd.DataFrame(data={'x': [3, 5], 'y': [4, 12]})

def func(df, arg1, arg2, arg3):
    '''arg1 and arg2 are input columns; arg3 is the output column.'''
    df = df.copy()
    df[arg3] = df[arg1] ** 2 + df[arg2] ** 2
    return df
Results are:
print(func(df, 'x', 'y', 'z'))
x y z
0 3 4 25
1 5 12 169
You can try this code:
def GHG_Intensity(PropertyName, Year):
    Intensity = df[(df['Property Name']==PropertyName) & (df['Data Year']==Year)]['GHG Intensity (kg CO2e/sq ft)'].to_list()
    return Intensity[0] if len(Intensity) else 'GHG Intensity Not Available'

print(GHG_Intensity('Navy Pier, Inc.', 2016))
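An alternative sketch of the same lookup, assuming the column names from the question: index the frame by the two key columns once, then answer each query with .loc. The mini-frame below copies two rows from the question's data:

```python
import pandas as pd

# Two rows copied from the question's table
df = pd.DataFrame({
    'Property Name': ['GALLERY 37', 'Navy Pier, Inc.'],
    'Data Year': [2018, 2016],
    'GHG Intensity (kg CO2e/sq ft)': [7.50, 22.50],
})

# Build a MultiIndex-ed Series once, so each lookup is a direct .loc access
lookup = df.set_index(['Property Name', 'Data Year'])['GHG Intensity (kg CO2e/sq ft)']

def ghg_intensity(name, year):
    try:
        return lookup.loc[(name, year)]
    except KeyError:
        return 'GHG Intensity Not Available'

print(ghg_intensity('Navy Pier, Inc.', 2016))
```

This avoids scanning the whole frame with a boolean mask on every call, which matters if you query many (name, year) pairs.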
Here is my problem:
You will find below a Pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty doing it (I spent 3 hours on this and haven't found any solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest in each subgroup, and then select the two best scores, but they have to be from different countries.
(For example, if the two best are from the same country, then we select the highest score with a country different from the first.)
This is the DataFrame :
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The Result I am looking for is something like this :
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code that I started:
df = pd.DataFrame(
{'Date':["2012","2012","2012","2012","2012","2012","2013","2013","2013","2013","2013","2013","2014","2014","2014","2014","2014","2014"],
'Name': ["Paul", "Silvia","David","Alphone", "Giovanni", "Britney","Paul", "Silvia","David","Alphone", "Giovanni", "Britney","Paul", "Silvia","David","Alphone", "Giovanni", "Britney"],
'Score': [65, 81, 80, 46, 82, 53,32,59,92,68,23,78,46,87,89,76,53,90],
"Country":["France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK"]})
df = df.set_index('Name').groupby('Date')[["Score","Country"]].apply(lambda _df: _df.sort_values(["Score"], ascending=False))
And this is what I have:
But as you can see, for example in 2012 the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year
2. Select only two best scores (and the countries have to be different).
I would be really thankful, because I really don't know how to do it.
If somebody has ideas on this, please share them :)
PS: please don't hesitate to tell me if it wasn't clear enough
Use DataFrame.sort_values first by the two columns, then remove duplicates of Date and Country with DataFrame.drop_duplicates, and last select the top rows per group with GroupBy.head:
df1 = (df.sort_values(['Date','Score'], ascending=[True, False])
         .drop_duplicates(['Date','Country'])
         .groupby('Date')
         .head(2))
print (df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
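To see why the drop_duplicates step matters, here is the same pipeline run on just the 2012 rows from the question, where the two highest scores are both from Italy:

```python
import pandas as pd

# The 2012 slice from the question: the top two scores are both Italian
df = pd.DataFrame({
    'Date': ['2012'] * 6,
    'Name': ['Paul', 'Silvia', 'David', 'Alphonse', 'Giovanni', 'Britney'],
    'Score': [65, 81, 80, 46, 82, 53],
    'Country': ['France', 'Italy', 'UK', 'France', 'Italy', 'UK'],
})

# Sort best-first, keep only the best row per (Date, Country),
# then take the top two rows per Date
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date')
         .head(2))
print(df1)
```

Silvia (81, Italy) is dropped by drop_duplicates because Giovanni (82, Italy) comes first after the sort, so head(2) correctly returns Giovanni and David from two different countries.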
I am trying to figure out how could I plot this data:
column 1 ['genres']: These are the value counts for all the genres in the table
Drama 2453
Comedy 2319
Action 1590
Horror 915
Adventure 586
Thriller 491
Documentary 432
Animation 403
Crime 380
Fantasy 272
Science Fiction 214
Romance 186
Family 144
Mystery 125
Music 100
TV Movie 78
War 59
History 44
Western 42
Foreign 9
Name: genres, dtype: int64
column 2 ['release_year']: These are the value counts for all the release years for different kind of genres
2014 699
2013 656
2015 627
2012 584
2011 540
2009 531
2008 495
2010 487
2007 438
2006 408
2005 363
2004 307
2003 281
2002 266
2001 241
2000 226
1999 224
1998 210
1996 203
1997 192
1994 184
1993 178
1995 174
1988 145
1989 136
1992 133
1991 133
1990 132
1987 125
1986 121
1985 109
1984 105
1981 82
1982 81
1983 80
1980 78
1978 65
1979 57
1977 57
1971 55
1973 55
1976 47
1974 46
1966 46
1975 44
1964 42
1970 40
1967 40
1972 40
1968 39
1965 35
1963 34
1962 32
1960 32
1969 31
1961 31
Name: release_year, dtype: int64
I need to answer questions like: What genre is most popular from year to year? And so on.
What kind of plots can be used, and what is the best way to do this, since there would be a lot of bins in a single chart?
Is seaborn better for plotting such variables?
Should I divide the year data into two decades (1900s and 2000s)?
Sample of the table:
id popularity runtime genres vote_count vote_average release_year
0 135397 32.985763 124 Action 5562 6.5 2015
1 76341 28.419936 120 Action 6185 7.1 1995
2 262500 13.112507 119 Adventure 2480 6.3 2015
3 140607 11.173104 136 Thriller 5292 7.5 2013
4 168259 9.335014 137 Action 2947 7.3 2005
You could do something like this:
Plotting histogram using seaborn for a dataframe
Personally I prefer seaborn for this kind of plot because it's easier, but you can use matplotlib too.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# sample data
samples = 300
ids = range(samples)
gind = np.random.randint(0, 4, samples)
years = np.random.randint(1990, 2000, samples)
# create sample dataframe
gkeys = {1: 'Drama', 2: 'Comedy', 3: 'Action', 4: 'Adventure', 0: 'Thriller'}
df = pd.DataFrame(zip(ids, gind, years),
                  columns=['ID', 'Genre', 'Year'])
df['Genre'] = df['Genre'].replace(gkeys)
# count the year groups
res = df.groupby(['Year', 'Genre']).count()
res = res.reset_index()
# only the max values
# res_ind = res.groupby(['Year']).idxmax()
# res = res.loc[res_ind['ID'].tolist()]
# viz
sns.set(style="white")
g = sns.catplot(x='Year',
                y='ID',
                hue='Genre',
                data=res,
                kind='bar',
                ci=None,
                )
g.set_axis_labels("Year", "Count")
plt.show()
If these are too many bins for one plot, just split it up.
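One way to answer "most popular genre per year" directly, before any plotting, is a year-by-genre count table plus idxmax; the few sample rows below are made up. The same counts table can also be fed to sns.heatmap for a compact overview instead of many bars.

```python
import pandas as pd

# Made-up sample rows shaped like the question's table
df = pd.DataFrame({
    'genres': ['Action', 'Action', 'Drama', 'Drama', 'Drama', 'Comedy'],
    'release_year': [2015, 2015, 2015, 2013, 2013, 2013],
})

# Rows = year, columns = genre, cells = number of movies
counts = pd.crosstab(df['release_year'], df['genres'])

# The column with the highest count in each row = most popular genre that year
top_genre = counts.idxmax(axis=1)
print(top_genre)
```

This reduces the "too many bins" problem to one label per year, which is easy to tabulate or annotate on a heatmap.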