Pandas : Groupby sum values

I am using the data frame below, which comes from Excel, and I'd like to show the total sales per year:
Year Sales
2021 7
2018 6
2018 787
2018 935
2018 1 059
2018 5
2018 72
2018 2
2018 3
2019 218
2019 256
2020 2
2018 4
2021 8
2019 14
2020 3
2018 3
2018 1
2020 34
I'm using this:
df.groupby(['Year'])['Sales'].agg('sum')
And this is the result:
2018.0 67879351 05957223431
2019.0 21825614
2020.0 2334
2021.0 78
Do you know why I don't get the sum of the values?
Thanks

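The 'Sales' column has dtype object (strings), so sum() concatenates the values instead of adding them. Strip the whitespace (some values, such as "1 059", use a space as thousands separator) and convert the column to numeric:
df['Sales'] = pd.to_numeric(df['Sales'].replace(r"\s+", '', regex=True), errors='coerce')
# alternative: df['Sales'].replace(r"\s+", '', regex=True).astype(float)
Now calculate the sum():
out = df.groupby(['Year'])['Sales'].sum()
Output of out: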
Year
2018 2877
2019 488
2020 39
2021 15
Name: Sales, dtype: int64
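To see both the symptom and the fix side by side, here is a minimal sketch on a handful of rows from the question. The variable names concatenated, cleaned and totals are only illustrative, and it assumes the Excel column was read in as strings.

```python
import pandas as pd

# A few rows from the question; "1 059" uses a space as thousands separator.
df = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2019, 2019],
    "Sales": ["787", "935", "1 059", "218", "256"],
})

# With dtype object, sum() joins the strings within each group.
concatenated = df.groupby("Year")["Sales"].sum()

# Remove all whitespace, then convert; unparseable entries would become NaN.
cleaned = df["Sales"].str.replace(r"\s+", "", regex=True)
df["Sales"] = pd.to_numeric(cleaned, errors="coerce")

# Now the groupby adds numbers.
totals = df.groupby("Year")["Sales"].sum()
print(totals)
```

Checking df.dtypes before aggregating is usually the quickest way to spot this kind of problem.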

Related

Melt a Pandas Dataframe with multiple columns [duplicate]

I wanted to know if there's a way to melt a DataFrame with multiple column names.
I have this Pandas Data Frame:
Edad 2000 2001 2002 2003 ... 2017 2018 2019 2020
...
[15-25] 126675 158246 171958 188389 ... 78707 70246 65661 52209
(25-35] 65823 85059 92841 95394 ... 88479 157492 149862 122067
(35-45] 37474 48605 54593 56279 ... 65870 65798 64587 51502
(45-55] 20624 22067 25860 27601 ... 39476 40725 40566 33979
(55-65] 30240 9047 10500 10972 ... 20135 21095 21173 17242
And would like to have something like this:
Edad Year Value
[15-25] 2000 126675
[15-25] 2001 158246
[15-25] 2002 171958
[15-25] 2003 188389
I've used melt before, but I always pointed it at an existing value column. This time the values are spread across the year columns, and I'm having a hard time figuring out how to address them.

You can use melt with groupby and sort like this:
df.melt(id_vars='Edad', var_name='Year').groupby(['Edad','Year']).agg({'value':'first'}).reset_index().sort_values(by=['Edad','Year'], ascending=[False,True])
Desired results:
Edad Year value
32 [15-25] 2000 126675
33 [15-25] 2001 158246
34 [15-25] 2002 171958
35 [15-25] 2003 188389
36 [15-25] 2017 78707
37 [15-25] 2018 70246
38 [15-25] 2019 65661
39 [15-25] 2020 52209
24 (55-65] 2000 30240
25 (55-65] 2001 9047
26 (55-65] 2002 10500
27 (55-65] 2003 10972
28 (55-65] 2017 20135
29 (55-65] 2018 21095
30 (55-65] 2019 21173
31 (55-65] 2020 17242
16 (45-55] 2000 20624
17 (45-55] 2001 22067
18 (45-55] 2002 25860
19 (45-55] 2003 27601
20 (45-55] 2017 39476
21 (45-55] 2018 40725
22 (45-55] 2019 40566
23 (45-55] 2020 33979
8 (35-45] 2000 37474
9 (35-45] 2001 48605
10 (35-45] 2002 54593
11 (35-45] 2003 56279
12 (35-45] 2017 65870
13 (35-45] 2018 65798
14 (35-45] 2019 64587
15 (35-45] 2020 51502
0 (25-35] 2000 65823
1 (25-35] 2001 85059
2 (25-35] 2002 92841
3 (25-35] 2003 95394
4 (25-35] 2017 88479
5 (25-35] 2018 157492
6 (25-35] 2019 149862
7 (25-35] 2020 122067
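Since each (Edad, Year) pair occurs only once, the groupby step is probably not needed. A shorter sketch, assuming a two-column excerpt of the table (the frame name wide and the value_name "Value" are illustrative choices):

```python
import pandas as pd

# Small excerpt of the wide table: one column per year.
wide = pd.DataFrame({
    "Edad": ["[15-25]", "(25-35]"],
    "2000": [126675, 65823],
    "2001": [158246, 85059],
})

# Every column not listed in id_vars is unpivoted into (Year, Value) pairs.
long = wide.melt(id_vars="Edad", var_name="Year", value_name="Value")

# Same ordering as the answer: Edad descending, Year ascending.
long = long.sort_values(["Edad", "Year"], ascending=[False, True], ignore_index=True)
print(long)
```

When value_vars is omitted, melt uses all remaining columns, which is why no value column has to be named.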

How can I combine year and month variables in a pandas dataframe in Python?

I have 2 integer variables in a pandas dataframe: months and years. I want to combine them into one variable like 2021-1. The rows match one-to-one, so there is no alignment problem. The new variable must be a time series. How can I do that?
For example my dataframe seems like this:
import pandas as pd
a = [2015,2015,2015,2015,2015,2016,2016,2016,2016]
b = [1,2,3,4,5,1,2,3,4]
c = pd.DataFrame(a , columns=["Year"])
d = pd.DataFrame(b , columns = ["Month"])
e = pd.concat([c,d] , axis = 1)
e.head()

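There are multiple ways to do this; here are two examples (note that the "%-m" format code, which drops the leading zero, is platform-specific and not available on Windows):
import datetime as dt
import pandas as pd
r = pd.date_range("1-jan-2018", freq="M", periods=24)
df = pd.DataFrame({"year":r.year, "month":r.month})
# ymstr joins the two columns as strings; ymdt builds a datetime and formats it
df.assign(ymstr=df.astype({"year":"string","month":"string"}).apply("-".join, axis=1),
    ymdt=df.apply(lambda r: dt.datetime(r["year"], r["month"], 1).strftime("%Y-%-m"), axis=1))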
year month ymstr ymdt
0 2018 1 2018-1 2018-1
1 2018 2 2018-2 2018-2
2 2018 3 2018-3 2018-3
3 2018 4 2018-4 2018-4
4 2018 5 2018-5 2018-5
5 2018 6 2018-6 2018-6
6 2018 7 2018-7 2018-7
7 2018 8 2018-8 2018-8
8 2018 9 2018-9 2018-9
9 2018 10 2018-10 2018-10
10 2018 11 2018-11 2018-11
11 2018 12 2018-12 2018-12
12 2019 1 2019-1 2019-1
13 2019 2 2019-2 2019-2
14 2019 3 2019-3 2019-3
15 2019 4 2019-4 2019-4
16 2019 5 2019-5 2019-5
17 2019 6 2019-6 2019-6
18 2019 7 2019-7 2019-7
19 2019 8 2019-8 2019-8
20 2019 9 2019-9 2019-9
21 2019 10 2019-10 2019-10
22 2019 11 2019-11 2019-11
23 2019 12 2019-12 2019-12

I found it.
from datetime import datetime
e['NewTime'] = e.apply(lambda row: datetime.strptime(f"{int(row.Year)}-{int(row.Month)}", '%Y-%m'), axis=1)
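A row-wise apply works, but the same column can also be built without a Python-level loop. The sketch below assumes the Year and Month columns from the question. The helper frame parts and the extra YearMonth column are illustrative additions, not part of the original answers.

```python
import pandas as pd

e = pd.DataFrame({"Year": [2015, 2015, 2016], "Month": [1, 2, 1]})

# pd.to_datetime can assemble dates from columns named year, month and day.
parts = pd.DataFrame({"year": e["Year"], "month": e["Month"], "day": 1})
e["NewTime"] = pd.to_datetime(parts)

# Optional: a monthly Period column, if only year and month matter.
e["YearMonth"] = e["NewTime"].dt.to_period("M")
print(e)
```

The datetime column is pinned to the first day of each month; the Period column avoids that arbitrary day when only the month matters.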

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would just be to create a new column that returns what the 2019 percentage change is for each corresponding "tdoy" (Trading Day of Year) using df.loc, and if I could figure this much out I could then create yet another column to do the simple difference between that year/day's percentage change to 2019's respective value. Below is what I try to use (and I've tried other variations) to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.

First step is to import the csv properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Unnamed: 0 Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all years to be compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Skip the year filter used above and create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Create a loop going through the years/columns and create a new field for each year comparing to 2019.
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1
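The loop over the year columns can also be written as one vectorized expression. This is a sketch rather than the answerer's method: it uses a few closing prices from the thread, and the names wide and pct are hypothetical.

```python
import pandas as pd

# Closing prices for two trading days in three years.
df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018, 2019, 2019],
    "tdoy": [2, 3, 2, 3, 2, 3],
    "last": [34.57, 34.98, 33.38, 33.76, 27.90, 28.18],
})

# One row per trading day, one column per year.
wide = df.pivot_table(values="last", index="tdoy", columns="year")

# rsub computes (2019 column - each column), aligned on tdoy; div scales by each year.
pct = wide.rsub(wide[2019], axis=0).div(wide).add_suffix("_pct_change")
print(pct)
```

The result goes into a separate frame, so the original price columns in wide stay untouched.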

I also came up with my own answer, more along the lines of what I was originally trying to accomplish. This is the DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then map these keys with the tdoy column in the original DataFrame to create a column titled 2019 that has the corresponding 2019 percentage change value for that trading day
df['2019'] = df['tdoy'].map(perc19)
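and then create a vs2019 column where I take the difference of 2019 vs. perc and square it (the values shown are consistent with the difference being divided by the 2019 value before squaring), yielding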
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
From here I can groupby in various ways and further calculate to find most similar trending percentage changes vs. the year I am comparing against (2019).
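The code for the vs2019 column is not shown in the post. The sketch below reproduces the mapping step with a Series instead of a dict, and assumes the formula is ((2019 - perc) / 2019) squared, which is inferred from the displayed values rather than stated by the author.

```python
import pandas as pd

# Trading day 2 for three years, with rounded perc values from the table.
df = pd.DataFrame({
    "year": [2016, 2017, 2019],
    "tdoy": [2, 2, 2],
    "perc": [-0.020295, 0.004358, 0.012704],
})

# Lookup of the 2019 percentage change, keyed by trading day of year.
ref = df.loc[df["year"] == 2019].set_index("tdoy")["perc"]

# Attach the 2019 value for the matching trading day to every row.
df["2019"] = df["tdoy"].map(ref)

# Assumed formula: squared difference relative to the 2019 value.
df["vs2019"] = ((df["2019"] - df["perc"]) / df["2019"]) ** 2
print(df)
```

Mapping with a Series keyed by tdoy behaves like the zipped dictionary, so the lookup still works when trading days fall on different calendar dates.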

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9

Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
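For reference, pct_change() is the current value divided by the previous one, minus 1. A short sketch on the first three rows shows the equivalence and a percentage version. The column name "Growth Rate (%)" and the rounding to two decimals are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [2001, 2002, 2003],
    "Total Managed Expenditure": [503.2, 529.9, 559.8],
})
col = df["Total Managed Expenditure"]

# Manual equivalent of pct_change(): current / previous - 1.
manual = col / col.shift(1) - 1

# Growth rate expressed in percent; the first row has no previous year, so it is NaN.
df["Growth Rate (%)"] = (col.pct_change() * 100).round(2)
print(df)
```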

Pandas: define a seasonal year from July 1 - June 30 instead of Jan 1 - Dec 31

I have seasonal snow data which I want to group by snow year (July 1, 1954 - June 30, 1955) rather than having one winter's data split over two years (January 1, 1954 - December 31, 1954 and January 1, 1955 - Dec 31, 1955.)
example data
I modified the code from this question:
Using pandas to select specific seasons from a dataframe whose values are over a defined threshold (thanks Pad)
def get_season(row):
    if row['date'].month <= 7:
        return row['date'].year
    else:
        return row['date'].year + 1
df['Seasonal_Year'] = df.apply(get_season, axis=1)
results of method call
Is there a better way to do this than I have done?

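I think so, with numpy.where:
import numpy as np
years = df['date'].dt.year
# same threshold as in the question; use `< 7` if July should already count toward the next snow year
df['Seasonal_Year'] = np.where(df['date'].dt.month <= 7, years, years + 1)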

You can use pd.offsets.MonthBegin.
Consider the dataframe of dates df:
df = pd.DataFrame(dict(Date=pd.date_range('2010-01-30', periods=24, freq='M')))
We can offset the Date and grab the year
df.assign(Season=(df.Date - pd.offsets.MonthBegin(7)).dt.year + 1)
Date Season
0 2010-01-31 2010
1 2010-02-28 2010
2 2010-03-31 2010
3 2010-04-30 2010
4 2010-05-31 2010
5 2010-06-30 2010
6 2010-07-31 2011
7 2010-08-31 2011
8 2010-09-30 2011
9 2010-10-31 2011
10 2010-11-30 2011
11 2010-12-31 2011
12 2011-01-31 2011
13 2011-02-28 2011
14 2011-03-31 2011
15 2011-04-30 2011
16 2011-05-31 2011
17 2011-06-30 2011
18 2011-07-31 2012
19 2011-08-31 2012
20 2011-09-30 2012
21 2011-10-31 2012
22 2011-11-30 2012
23 2011-12-31 2012
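Another vectorized variant is a plain month comparison. The sketch assumes the snow year starts on July 1, as stated in the question body, so July through December are assigned to the following year. The sample dates and the name starts_new are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2010-06-30", "2010-07-01", "2010-12-31", "2011-06-30"])
})

# True for July through December, which belong to the next snow year.
starts_new = df["date"].dt.month >= 7

# Adding the boolean as 0/1 shifts those rows to the following year.
df["Seasonal_Year"] = df["date"].dt.year + starts_new.astype(int)
print(df)
```

Because the comparison uses the month alone, dates on the first of a month are treated the same as any other day in that month.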
