I have a dataframe in the following format:
course_id year month student_id
'Design' 2016 1 a123
'Design' 2016 1 a124
'Design' 2016 2 a125
'Design' 2016 3 a126
'Marketing' 2016 1 b123
'Marketing' 2016 2 b124
'Marketing' 2016 3 b125
'Marketing' 2016 3 b126
'Marketing' 2016 3 b127
'Marketing' 2016 4 b128
How can I calculate the growth of every course in every month, i.e. get a table in the following format?
Year Month 'Design' 'Marketing'
2016 1 2 1
2016 2 1 1
2016 3 1 3
2016 4 0 1
You can use the pivot_table function:
df.pivot_table(index=['year', 'month'], columns='course_id', values='student_id', aggfunc=len).fillna(0).reset_index()
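As a minimal, self-contained sketch of the same idea, the snippet below rebuilds the sample data from the question and counts students per course and month. Passing aggfunc='count' and fill_value=0 is a variation on the call above rather than part of the original answer, and the variable name result is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'course_id': ['Design'] * 4 + ['Marketing'] * 6,
    'year': [2016] * 10,
    'month': [1, 1, 2, 3, 1, 2, 3, 3, 3, 4],
    'student_id': ['a123', 'a124', 'a125', 'a126',
                   'b123', 'b124', 'b125', 'b126', 'b127', 'b128'],
})

# Count students per (year, month) row and course column.
# fill_value=0 (an assumed variation) fills missing combinations with 0.
result = (
    df.pivot_table(index=['year', 'month'], columns='course_id',
                   values='student_id', aggfunc='count', fill_value=0)
      .reset_index()
)
print(result)
```

Using fill_value inside pivot_table avoids the separate fillna(0) step, which would otherwise tend to leave the counts as floats.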
I have this function written in Python. I want it to show the maximum value only once.
Here's the code:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    print(df)
And of course the output is this
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999 366999
2 2012 361986 366999
3 2013 329461 366999
4 2014 355464 366999
5 2015 344998 366999
6 2016 274317 366999
7 2017 200916 366999
8 2018 217246 366999
9 2019 119830 366999
10 2020 66640 366999
Since every row has the same value, I want the output to look like this:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
What should I change or add to my code?
You can use shift to generate a mask that can be used to replace duplicate consecutive values:
df.loc[df['Max Prod'] == df['Max Prod'].shift(1), 'Max Prod'] = ''
Output:
>>> df
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
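The shift comparison can also be tried on a small, self-contained frame. This is only a sketch: the three rows are taken from the question, and it uses Series.where instead of the .loc assignment above, on the assumption that this avoids writing a string into an integer column in place (which newer pandas versions may warn about or reject).

```python
import pandas as pd

# Three rows taken from the question's data.
df = pd.DataFrame({
    'Year': [2010, 2011, 2012],
    'Production (Ton)': [339491, 366999, 361986],
})

# Broadcast the maximum to every row, as in the original function.
max_prod = pd.Series(df['Production (Ton)'].max(), index=df.index)

# Keep a value only where it differs from the previous row;
# everywhere else use an empty string. The result has object dtype.
df['Max Prod'] = max_prod.where(max_prod != max_prod.shift(1), '')
print(df)
```

Note that the column then holds mixed types, so this is suited to display rather than further arithmetic.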
You could also have the function as:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = ''
    df.iloc[0, -1] = df['Production (Ton)'].max()
    print(df)
Given what you have now:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    # Keep only the first occurrence; the dropped rows become NaN on assignment
    df['Max Prod'] = df['Max Prod'].drop_duplicates()
    df = df.fillna('')
    print(df)
Output:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
I am trying to create a fiscal quarter column in pandas. If the fiscal year-end is March, I could do this with the following code:
df['fiscalquarter'] = df['Date'].dt.to_period('Q-MAR')
However, instead of 'MAR', I want to use the column in my dataframe that has the month abbreviation. My column values are JAN, FEB, MAR, etc. How do I tell pandas to look at the value in that column (fiscalmonthabr) to create the quarter?
Output of print(df.head()):
ID fiscalmonth Date DateSubYear fiscalmonthabr
5021 1 2 2001-03-29 2001 FEB
5780 2 2 2001-04-03 2001 FEB
7024 3 2 2001-05-02 2001 FEB
7307 4 2 2001-05-11 2001 FEB
8076 4 2 2001-06-14 2001 FEB
I would like to see:
ID fiscalquater
5021 1 2002Q1
5780 2 2002Q1
7024 3 2002Q1
7307 4 2002Q1
8076 4 2002Q2
You can use:
fiscal_quarter = lambda x: x['Date'].to_period(x['fiscalmonthabr'])
df['fiscalquater'] = (
    df.assign(fiscalmonthabr='Q-' + df['fiscalmonthabr'],
              Date=pd.to_datetime(df['Date']))
      .apply(fiscal_quarter, axis=1)
)
print(df)
# Output
ID fiscalmonth Date DateSubYear fiscalmonthabr fiscalquater
5021 1 2 2001-03-29 2001 FEB 2002Q1
5780 2 2 2001-04-03 2001 FEB 2002Q1
7024 3 2 2001-05-02 2001 FEB 2002Q1
7307 4 2 2001-05-11 2001 FEB 2002Q1
8076 4 2 2001-06-14 2001 FEB 2002Q2
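A row-wise apply can be slow on large frames. One possible alternative, sketched below under the assumption that there are only a handful of distinct fiscal year-end months, is to convert each group of rows sharing the same abbreviation in one vectorized call. The sample rows (including the MAR row) and the choice to store the result as strings are illustrative, not from the original answer.

```python
import pandas as pd

# Illustrative data: two rows with a FEB year-end, one with a MAR year-end.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2001-03-29', '2001-06-14', '2001-03-29']),
    'fiscalmonthabr': ['FEB', 'FEB', 'MAR'],
})

# Convert each group with its own quarterly frequency (e.g. 'Q-FEB').
# Casting to str keeps a single plain dtype after concatenation.
parts = [
    group['Date'].dt.to_period('Q-' + abbr).astype(str)
    for abbr, group in df.groupby('fiscalmonthabr')
]

# pd.concat preserves the original index, so the assignment aligns row by row.
df['fiscalquater'] = pd.concat(parts)
print(df)
```

The column name fiscalquater is kept as spelled in the answer above so the two versions are interchangeable.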
I am using this data frame, loaded from Excel:
I'd like to show the total sales per year.
Year Sales
2021 7
2018 6
2018 787
2018 935
2018 1 059
2018 5
2018 72
2018 2
2018 3
2019 218
2019 256
2020 2
2018 4
2021 8
2019 14
2020 3
2018 3
2018 1
2020 34
I'm using this:
df.groupby(['Year'])['Sales'].agg('sum')
And the result:
2018.0 67879351 05957223431
2019.0 21825614
2020.0 2334
2021.0 78
Do you know why I don't get the sum of the values?
Thanks
The 'Sales' column has dtype object (strings), so sum() concatenates the values instead of adding them. Strip the whitespace and convert the column to numeric first:
df['Sales'] = pd.to_numeric(df['Sales'].replace(r"\s+", '', regex=True), errors='coerce')
# Alternative: df['Sales'].replace(r"\s+", '', regex=True).astype(float)
Now calculate the sum:
out = df.groupby(['Year'])['Sales'].sum()
Output of out:
Year
2018 2877
2019 488
2020 39
2021 15
Name: Sales, dtype: int64
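The effect of the conversion is easy to see on a tiny example. The sketch below uses a few values from the question stored as strings, which is an assumption about how the Excel data was read; it also includes a non-breaking space as a thousands separator, a common cause of this problem that \s matches as well.

```python
import pandas as pd

# A few values from the question, stored as strings (assumed).
df = pd.DataFrame({
    'Year': [2018, 2018, 2018, 2019],
    'Sales': ['787', '1 059', '1\xa0059', '218'],
})

# Summing strings concatenates them instead of adding.
broken = df.groupby('Year')['Sales'].sum()

# Remove all whitespace, then convert; unparseable values become NaN.
df['Sales'] = pd.to_numeric(
    df['Sales'].str.replace(r'\s+', '', regex=True), errors='coerce'
)
out = df.groupby('Year')['Sales'].sum()
print(broken)
print(out)
```

With errors='coerce', any value that still cannot be parsed is silently turned into NaN, so it is worth checking df['Sales'].isna().sum() after the conversion.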
I have a dataframe:
import pandas as pd
import numpy as np
ycap = [2015, 2016, 2017]
df = pd.DataFrame({'a': np.repeat(ycap, 5),
'b': np.random.randn(15)})
a b
0 2015 0.436967
1 2015 -0.539453
2 2015 -0.450282
3 2015 0.907723
4 2015 -2.279188
5 2016 1.468736
6 2016 -0.169522
7 2016 0.003501
8 2016 0.182321
9 2016 0.647310
10 2017 0.679443
11 2017 -0.154405
12 2017 -0.197271
13 2017 -0.153552
14 2017 0.518803
I would like to add column c, that would look like following:
a b c
0 2015 -0.826946 2014
1 2015 0.275072 2013
2 2015 0.735353 2012
3 2015 1.391345 2011
4 2015 0.389524 2010
5 2016 -0.944750 2015
6 2016 -1.192546 2014
7 2016 -0.247521 2013
8 2016 0.521094 2012
9 2016 0.273950 2011
10 2017 -1.199278 2016
11 2017 0.839705 2015
12 2017 0.075951 2014
13 2017 0.663696 2013
14 2017 0.398995 2012
I tried to achieve this with the following, but it subtracts 1 from every row; the amount subtracted needs to increase within each group. How could I do it? Thanks
gp = df.groupby('a')
df['c'] = gp['a'].apply(lambda x: x-1)
Subtract the Series created by cumcount from column a, then subtract 1:
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print (df)
a b c
0 2015 0.285832 2014
1 2015 -0.223318 2013
2 2015 0.620920 2012
3 2015 -0.891164 2011
4 2015 -0.719840 2010
5 2016 -0.106774 2015
6 2016 -1.230357 2014
7 2016 0.747803 2013
8 2016 -0.002320 2012
9 2016 0.062715 2011
10 2017 0.805035 2016
11 2017 -0.385647 2015
12 2017 -0.457458 2014
13 2017 -1.589365 2013
14 2017 0.013825 2012
Detail:
print (df.groupby('a').cumcount())
0 0
1 1
2 2
3 3
4 4
5 0
6 1
7 2
8 3
9 4
10 0
11 1
12 2
13 3
14 4
dtype: int64
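Because the question's data is random, a deterministic sketch may make the arithmetic easier to follow. The frame below is a smaller, illustrative version of the original (two years, three rows each) with column b omitted.

```python
import numpy as np
import pandas as pd

# Smaller illustrative frame: two groups of three rows.
df = pd.DataFrame({'a': np.repeat([2015, 2016], 3)})

# cumcount numbers the rows within each group starting at 0,
# so subtracting it plus 1 steps back one more year per row.
position = df.groupby('a').cumcount()
df['c'] = df['a'] - position - 1
print(df)
```

For the first group the positions are 0, 1, 2, which gives 2014, 2013, 2012; the count restarts at 0 for the next group.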
You can also do it this way, using transform:
In [8]: df['c'] = df.groupby('a')['a'].transform(lambda x: x-np.arange(1, len(x)+1))
In [9]: df
Out[9]:
a b c
0 2015 0.436967 2014
1 2015 -0.539453 2013
2 2015 -0.450282 2012
3 2015 0.907723 2011
4 2015 -2.279188 2010
5 2016 1.468736 2015
6 2016 -0.169522 2014
7 2016 0.003501 2013
8 2016 0.182321 2012
9 2016 0.647310 2011
10 2017 0.679443 2016
11 2017 -0.154405 2015
12 2017 -0.197271 2014
13 2017 -0.153552 2013
14 2017 0.518803 2012
I would like to know how I can add a year-over-year growth rate to the following data in pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
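For reference, pct_change computes each value divided by the previous one, minus 1. The sketch below checks this on the first four rows of the data; the manual column and the percentage column are illustrative additions, not part of the original answer.

```python
import pandas as pd

# First four rows of the question's data.
df = pd.DataFrame({
    'Date': [2001, 2002, 2003, 2004],
    'Total Managed Expenditure': [503.2, 529.9, 559.8, 593.2],
})

col = df['Total Managed Expenditure']
df['Growth Rate'] = col.pct_change()

# Equivalent manual formula: current / previous - 1.
manual = col / col.shift(1) - 1

# Optional: express the rate as a percentage rounded to two decimals.
df['Growth Rate (%)'] = (df['Growth Rate'] * 100).round(2)
print(df)
```

The first row is NaN because there is no previous year to compare against.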