I have a dataframe with fields ['Transaction Description', 'Transaction Date', 'Debit Amount']. I want to sum the Debit Amount column by month and by Transaction Description to see how much I am spending on different things each month.
I tried
df.groupby(['Transaction Description', 'Transaction Date'])['Debit Amount'].sum().sort_values(ascending=False).head(180)
which gives me a sum for each Transaction Description by day.
Alternatively,
df.groupby(['Transaction Description'])['Debit Amount'].sum().sort_values(ascending=False).head(180)
gives me what I want but for the whole dataframe, not split by month.
I would like the output to have the months in order and for each one show the total amount spent on each Transaction Description, sorted from largest to smallest. This is so I can look at a given month and see what I have been spending my money on.
Here is a sample from the dataframe
{'Transaction Date': {0: Timestamp('2022-05-04 00:00:00'),
1: Timestamp('2022-05-04 00:00:00'),
2: Timestamp('2022-04-04 00:00:00'),
3: Timestamp('2022-04-04 00:00:00'),
4: Timestamp('2022-04-04 00:00:00'),
5: Timestamp('2022-04-04 00:00:00'),
6: Timestamp('2022-04-04 00:00:00'),
7: Timestamp('2022-04-04 00:00:00'),
8: Timestamp('2022-04-04 00:00:00'),
9: Timestamp('2022-01-04 00:00:00')},
'Transaction Description': {0: 'School',
1: 'Cleaner',
2: 'Taxi',
3: 'shop',
4: 'MOBILE',
5: 'Restaurant',
6: 'Restaurant',
7: 'shop',
8: 'Taxi',
9: 'shop'},
'Debit Amount': {0: 15.0,
1: 26.0,
2: 48.48,
3: 9.18,
4: 7.0,
5: 10.05,
6: 9.1,
7: 2.14,
8: 16.0,
9: 11.68}}
In this case I would like the output to be something like:
2022-01  shop        11.68
2022-04  Taxi        64.48
         Restaurant  19.15
         shop        11.32
         MOBILE       7.00
2022-05  Cleaner     26.00
         School      15.00
Use pd.Grouper. I assume your Transaction Date is of datetime type:
df.groupby([pd.Grouper(key="Transaction Date", freq="MS"), "Transaction Description"]).sum()
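For completeness, here is a minimal sketch of that approach against the sample above (it assumes pandas is imported as pd, that Transaction Date is, or is first converted to, a datetime column, and it selects the Debit Amount column explicitly):
import pandas as pd

# ensure the column is datetime, then group by calendar month (freq='MS'
# labels each group with the month-start timestamp, e.g. 2022-04-01)
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])

monthly = (
    df.groupby([pd.Grouper(key='Transaction Date', freq='MS'),
                'Transaction Description'])['Debit Amount']
      .sum()
)
print(monthly)
Sorting each month's totals from largest to smallest is covered in the follow-up question below.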
Try this out. If you had shared an MRE, I would have validated it and shared the result set.
# create a ym column and use that in your groupby
(df.assign(ym=pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m'))
   .groupby(['ym', 'Transaction Description'])['Debit Amount'].sum()
)
ym Transaction Description
2022-01 shop 11.68
2022-04 MOBILE 7.00
Restaurant 19.15
Taxi 64.48
shop 11.32
2022-05 Cleaner 26.00
School 15.00
Name: Debit Amount, dtype: float64
OR
(df.assign(ym=pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m'))
   .groupby(['ym', 'Transaction Description'], as_index=False)['Debit Amount'].sum()
)
ym Transaction Description Debit Amount
0 2022-01 shop 11.68
1 2022-04 MOBILE 7.00
2 2022-04 Restaurant 19.15
3 2022-04 Taxi 64.48
4 2022-04 shop 11.32
5 2022-05 Cleaner 26.00
6 2022-05 School 15.00
Related follow-up question:
I have a dataframe with my finances in it. As a sample see:
{'Transaction Date': {0: Timestamp('2022-05-04 00:00:00'),
1: Timestamp('2022-05-04 00:00:00'),
2: Timestamp('2022-04-04 00:00:00'),
3: Timestamp('2022-04-04 00:00:00'),
4: Timestamp('2022-04-04 00:00:00'),
5: Timestamp('2022-04-04 00:00:00'),
6: Timestamp('2022-04-04 00:00:00'),
7: Timestamp('2022-04-04 00:00:00'),
8: Timestamp('2022-04-04 00:00:00'),
9: Timestamp('2022-01-04 00:00:00')},
'Transaction Description': {0: 'School',
1: 'Cleaner',
2: 'Taxi',
3: 'shop',
4: 'MOBILE',
5: 'Restaurant',
6: 'Restaurant',
7: 'shop',
8: 'Taxi',
9: 'shop'},
'Debit Amount': {0: 15.0,
1: 26.0,
2: 48.48,
3: 9.18,
4: 7.0,
5: 10.05,
6: 9.1,
7: 2.14,
8: 16.0,
9: 11.68}}
I can print a summary for each month with:
reportseries = df.assign(ym=pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m')).groupby(['ym','Transaction Description' ] )['Debit Amount'].sum()
print(reportseries)
This gives me:
ym Transaction Description
2022-01 shop 11.68
2022-04 MOBILE 7.00
Restaurant 19.15
Taxi 64.48
shop 11.32
2022-05 Cleaner 26.00
School 15.00
Name: Debit Amount, dtype: float64
How can I sort each group so I get this instead?
ym Transaction Description
2022-01 shop 11.68
2022-04 Taxi 64.48
Restaurant 19.15
shop 11.32
MOBILE 7.00
2022-05 Cleaner 26.00
School 15.00
Name: Debit Amount, dtype: float64
Let's reset the index, sort the values by Debit Amount in descending order within each ym, and use the resulting positional index with iloc to reorder the elements:
# after reset_index, .index holds the original row positions in the new sorted order
reportseries.iloc[reportseries.reset_index().sort_values(['ym', 'Debit Amount'], ascending=[True, False]).index]
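A related per-group option: sort each month's slice of the Series directly (a sketch against the reportseries built above; group_keys=False avoids duplicating the ym level in the result):
reportseries.groupby(level='ym', group_keys=False).apply(
    lambda s: s.sort_values(ascending=False)
)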
Alternatively, you can restructure your existing code to get the result directly:
cols = ['ym', 'Transaction Description']
ym = pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m')
reportseries = (
    df.assign(ym=ym)
      .groupby(cols, as_index=False)['Debit Amount'].sum()
      .sort_values(['ym', 'Debit Amount'], ascending=[True, False])
      .set_index(cols)['Debit Amount']
)
Result
ym Transaction Description
2022-01 shop 11.68
2022-04 Taxi 64.48
Restaurant 19.15
shop 11.32
MOBILE 7.00
2022-05 Cleaner 26.00
School 15.00
Name: Debit Amount, dtype: float64
I have the following dataframe:
Client_Id Date Age_Group Gender
0 579427 2020-02-01 Under 65 Female
1 579464 2020-02-01 Under 65 Female
2 579440 2020-02-01 Under 65 Male
3 579470 2020-02-01 75 - 79 Female
4 579489 2020-02-01 75 - 79 Female
5 579424 2020-02-01 75 - 79 Male
6 579492 2020-02-01 75 - 79 Male
7 579552 2020-02-01 75 - 79 Male
8 579439 2020-02-01 80 - 84 Male
9 579445 2020-03-01 80 - 84 Female
10 579496 2020-03-01 80 - 84 Female
11 579569 2020-03-01 80 - 84 Male
12 579610 2020-03-01 80 - 84 Male
13 579450 2020-03-01 80 - 84 Female
14 579423 2020-03-01 85 and over Female
15 579428 2020-03-01 85 and over Male
I am trying to resample and get a time series of the count of Client_Id and the value counts of Gender and Age_Group.
For example, I can get value_counts of Gender:
df.set_index('Date').resample('D')['Gender'].value_counts()
Date Gender
2020-02-01 Male 5
Female 4
2020-03-01 Female 4
Male 3
I can also get value_counts for Age_Group.
And I can get number of clients per day:
df.set_index('Date').resample('D')['Client_Id'].count()
Date
2020-01-02 9
2020-01-03 7
However, I would like all of the outputs to be in one dataframe, with the results of the value counts as their own columns.
I have managed to do it like this, using unstack and merge. However, the code is VERY ugly. I also have more columns to process, and I would prefer not to have such a long chain of merges:
(df.set_index('Date').resample('D')['Client_Id'].count().to_frame()
   .merge(df.set_index('Date').resample('D')['Gender'].value_counts().unstack(),
          left_index=True, right_index=True)
   .merge(df.set_index('Date').resample('D')['Age_Group'].value_counts().unstack(),
          left_index=True, right_index=True))
Is there an easier / more tidy / built in way to do this?
My dataframe as a dict:
{'Client_Id': {0: 579427,
1: 579464,
2: 579440,
3: 579470,
4: 579489,
5: 579424,
6: 579492,
7: 579552,
8: 579439,
9: 579445,
10: 579496,
11: 579569,
12: 579610,
13: 579450,
14: 579423,
15: 579428},
'Date': {0: Timestamp('2020-01-02 00:00:00'),
1: Timestamp('2020-01-02 00:00:00'),
2: Timestamp('2020-01-02 00:00:00'),
3: Timestamp('2020-01-02 00:00:00'),
4: Timestamp('2020-01-02 00:00:00'),
5: Timestamp('2020-01-02 00:00:00'),
6: Timestamp('2020-01-02 00:00:00'),
7: Timestamp('2020-01-02 00:00:00'),
8: Timestamp('2020-01-02 00:00:00'),
9: Timestamp('2020-01-03 00:00:00'),
10: Timestamp('2020-01-03 00:00:00'),
11: Timestamp('2020-01-03 00:00:00'),
12: Timestamp('2020-01-03 00:00:00'),
13: Timestamp('2020-01-03 00:00:00'),
14: Timestamp('2020-01-03 00:00:00'),
15: Timestamp('2020-01-03 00:00:00')},
'Age_Group': {0: 'Under 65',
1: 'Under 65',
2: 'Under 65',
3: '75 - 79',
4: '75 - 79',
5: '75 - 79',
6: '75 - 79',
7: '75 - 79',
8: '80 - 84',
9: '80 - 84',
10: '80 - 84',
11: '80 - 84',
12: '80 - 84',
13: '80 - 84',
14: '85 and over',
15: '85 and over'},
'Gender': {0: 'Female ',
1: 'Female ',
2: 'Male ',
3: 'Female ',
4: 'Female ',
5: 'Male ',
6: 'Male ',
7: 'Male ',
8: 'Male ',
9: 'Female ',
10: 'Female ',
11: 'Male ',
12: 'Male ',
13: 'Female ',
14: 'Female ',
15: 'Male '}}
Use Series.unstack so that df1 has a DatetimeIndex (with the Gender values as columns); then it is possible to use concat:
df1 = df.set_index('Date').resample('D')['Gender'].value_counts().unstack()
df2 = df.set_index('Date').resample('D')['Client_Id'].count()
df = pd.concat([df1, df2], axis=1)
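If there are more categorical columns to add, one possible way to avoid a long merge chain is to build the pieces in a loop and concatenate once. This is only a sketch assuming the same df as above and pandas imported as pd; the Client_Count name and the list of category columns are illustrative:
resampled = df.set_index('Date').resample('D')

parts = [resampled['Client_Id'].count().rename('Client_Count')]
for col in ['Gender', 'Age_Group']:   # add further category columns here
    parts.append(resampled[col].value_counts().unstack())

out = pd.concat(parts, axis=1)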
I'm trying to dynamically calculate the mean of all float64 columns. The end user should be able to filter on any of the category columns against a chart and get the various means for each instance. To achieve this, I've written the Python script below using dask and its groupby function.
However...
When performing the groupby on the columns below, the object columns vanish from the output CSV file as a result of the aggregation and mean calculation on the float64 columns. In this example, I'm using dask to read the dataframe (pandas is not an option due to high memory usage) and save the output file as CSV.
The dtypes of the input CSV columns are:
Dtype Columns
String (e.g. 2018-P01/02) Time Period
Integer Journey Code
Object Journey Name
Object Route
Object Route Name
Object Area
Object Area Name
Object Fuel Type
Float64 Fuel Load
Float64 Mileage
Float64 Odometer Reading
Float64 Delay Minutes
My code for reading/saving the CSV and performing the mean calculation is:
import numpy as np
import dask.dataframe as dd
import pandas as pd
filename = "H:\\testing3.csv"
data = dd.read_csv("H:\\testing2.csv")
cols=['Time Period','Journey Code','Journey Name','Route',
'Route Name','Area','Area Name','Fuel Type']
x = data.groupby(cols).agg(['mean'])
x.to_csv(filename, index=False)
An example of the original dataset is:
Time Period Journey Code Route Route Name Area Area Name
2018-P01 261803 High France-Germany WE West
2018-P01-02 325429 High France-Germany EA Eastern
2018-P01-02 359343 High France-Germany WCS West Coast South
2018-P01-02 359343 High France-Germany WE West
2018-P01-03 370697 High France-Germany WE West
2018-P01-04 392535 High France-Germany EA Eastern
2018-P01-04 394752 High France-Germany WCS West Coast South
2018-P01-05 408713 High France-Germany WE West
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 165 6 14567.1 2
Diesel 210 12 98765.8 0
Diesel 210 5 23406.2 0
Diesel 130 10 54418.8 0
Diesel 152.5 37 58838.35 2
Diesel 142 140 63257.9 37.1194012
Diesel 131.5 120 67677.45 0
Diesel 121 13 72097 1.25
Why are the object columns vanishing from the resulting CSV file, and how can I produce a result like the one below?
Desired output (example for lines 2 and 3): no average for the first row, but each subsequent float64 value should contain the average of the current against the previous values. I'm splitting each instance separately to get the dynamic result, but any ideas are welcome.
Time Period   Journey Code  Route  Route Name      Area  Area Name
2018-P01-02   325429        High   France-Germany  EA    Eastern
...
2018-P01-02   359343        High   France-Germany  WCS   West Coast South

Fuel Type  Fuel Load  Mileage  Odometer Reading  Delay Minutes
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
...
Diesel     170        8.5      23406.2           NaN
Edit: added a sample dataset in the format of df.head(10).to_dict():
{'Time Period': {0: '2018-P01', 1: '2018-P01-02', 2: '2018-P01-02', 3: '2018-P01-02', 4: '2018-P01-03', 5: '2018-P01-04', 6: '2018-P01-04', 7: '2018-P01-05', 8: '2018-P01-06', 9: '2018-P01-07'}, 'Odometer Reading': {0: 14567.1, 1: 98765.8, 2: 23406.2, 3: 54418.8, 4: 58838.35, 5: 63257.9, 6: 67677.45, 7: 72097.0, 8: 89221.0, 9: 89221.0}, 'Journey Code': {0: 261803, 1: 325429, 2: 359343, 3: 359343, 4: 370697, 5: 392535, 6: 394752, 7: 408713, 8: 408714, 9: 408715}, 'Fuel Type': {0: 'Diesel', 1: 'Diesel', 2: 'Diesel', 3: 'Diesel', 4: 'Diesel', 5: 'Diesel', 6: 'Diesel', 7: 'Diesel', 8: 'Diesel', 9: 'Diesel'}, 'Route Name': {0: 'France-Germany', 1: 'France-Germany', 2: 'France-Germany', 3: 'France-Germany', 4: 'France-Germany', 5: 'France-Germany', 6: 'France-Germany', 7: 'France-Germany', 8: 'France-Germany', 9: 'France-Germany'}, 'Area': {0: 'WE', 1: 'EA', 2: 'WCS', 3: 'WE', 4: 'WE', 5: 'EA', 6: 'WCS', 7: 'WE', 8: 'WE', 9: 'WE'}, 'Route': {0: 'High', 1: 'High', 2: 'High', 3: 'High', 4: 'High', 5: 'High', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}, 'Fuel Load': {0: 165.0, 1: 210.0, 2: 170.0, 3: 130.0, 4: 152.5, 5: 142.0, 6: 131.5, 7: 121.0, 8: 121.0, 9: 121.0}, 'Delay Minutes': {0: 2.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 2.0, 5: 37.119401200000006, 6: 0.0, 7: 1.25, 8: 2.56, 9: 2.56}, 'Mileage': {0: 6.0, 1: 12.0, 2: 8.5, 3: 10.0, 4: 37.0, 5: 140.0, 6: 120.0, 7: 13.0, 8: 13.0, 9: 13.0}, 'Area Name': {0: 'West', 1: 'Eastern', 2: 'West Coast South', 3: 'West', 4: 'West', 5: 'Eastern', 6: 'West Coast South', 7: 'West', 8: 'West', 9: 'West'}}
I have a dataframe and I would like to find the percentage distribution of the values in a column within a group.
An example of a group is df.groupby(['race', 'tyre', 'stint']).get_group(("Australian Grand Prix", "Super soft", 1))
I would like to find out the percentage distribution of the 'total diff' values across the rows of the group.
Here is the dataframe in dictionary format. There will be many other groups, but the df below only shows the first group.
{'driverRef': {0: 'vettel',
1: 'raikkonen',
2: 'rosberg',
4: 'hamilton',
6: 'ricciardo',
7: 'alonso',
14: 'haryanto'},
'race': {0: 'Australian Grand Prix',
1: 'Australian Grand Prix',
2: 'Australian Grand Prix',
4: 'Australian Grand Prix',
6: 'Australian Grand Prix',
7: 'Australian Grand Prix',
14: 'Australian Grand Prix'},
'stint': {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0, 6: 1.0, 7: 1.0, 14: 1.0},
'total diff': {0: 125147.50728499777,
1: 281292.0366694695,
2: 166278.41312954266,
4: 64044.234019635056,
6: 648383.28046950256,
7: 400675.77449897071,
14: 2846411.2560531585},
'tyre': {0: u'Super soft',
1: u'Super soft',
2: u'Super soft',
4: u'Super soft',
6: u'Super soft',
7: u'Super soft',
14: u'Super soft'}}
If I understand correctly what you need, this might help:
# total per (race, tyre, stint) group
sums = df.groupby(['race', 'tyre', 'stint'])['total diff'].sum()
# align each row with its group total, then divide to get the share
df = df.set_index(['race', 'tyre', 'stint']).assign(pct=sums).reset_index()
df['pct'] = df['total diff'] / df['pct']
# race tyre stint driverRef total diff pct
# 0 Australian Grand Prix Super soft 1.0 vettel 1.251475e+05 0.027613
# 1 Australian Grand Prix Super soft 1.0 raikkonen 2.812920e+05 0.062065
# 2 Australian Grand Prix Super soft 1.0 rosberg 1.662784e+05 0.036688
# 3 Australian Grand Prix Super soft 1.0 hamilton 6.404423e+04 0.014131
# 4 Australian Grand Prix Super soft 1.0 ricciardo 6.483833e+05 0.143060
# 5 Australian Grand Prix Super soft 1.0 alonso 4.006758e+05 0.088406
# 6 Australian Grand Prix Super soft 1.0 haryanto 2.846411e+06 0.628037
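As a side note, the same per-group share can be computed in one step with groupby/transform, which avoids the temporary index round-trip (a sketch against the same df):
df['pct'] = df['total diff'] / df.groupby(['race', 'tyre', 'stint'])['total diff'].transform('sum')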
Starting from a pandas dataframe with a multi-dimensional column heading structure such as the one created below, is there a way I can transform the Area Names and Area Codes headings so that they span each level (i.e. so that single Area Names and Area Codes labels span the multiple column heading rows)?
If so, how could I then run a query on the column to just return the row corresponding to a particular value (e.g. an Area Code of E06000047), or the Low and Very High values for ENGLAND in 2012/13?
I wonder if it would be easier to define a row index based on either Area Code or Area Names, or a two-column row index ['Area Code', 'Area Names']. If so, how can I do this from the current table? set_index seems to balk at this with the current structure.
Code fragment to create the above:
import pandas as pd
df= pd.DataFrame({('2011/12*', 'High', '7-8'): {3: 49.83,
5: 50.01,
7: 48.09,
8: 43.58,
9: 44.19},
('2011/12*', 'Low', '0-4'): {3: 6.51, 5: 6.53, 7: 6.49, 8: 6.41, 9: 6.12},
('2011/12*', 'Medium', '5-6'): {3: 17.44,
5: 17.59,
7: 18.11,
8: 19.23,
9: 20.01},
('2011/12*', 'Very High', '9-10'): {3: 26.22,
5: 25.87,
7: 27.32,
8: 30.78,
9: 29.68},
('2012/13*', 'High', '7-8'): {3: 51.16,
5: 51.35,
7: 48.47,
8: 44.67,
9: 49.39},
('2012/13*', 'Low', '0-4'): {3: 5.71, 5: 5.74, 7: 6.73, 8: 8.42, 9: 6.51},
('2012/13*', 'Medium', '5-6'): {3: 17.1,
5: 17.29,
7: 18.46,
8: 20.23,
9: 15.81},
('2012/13*', 'Very High', '9-10'): {3: 26.03,
5: 25.62,
7: 26.34,
8: 26.68,
9: 28.3},
('Area Codes', 'Area Codes', 'Area Codes'): {3: 'K02000001',
5: 'E92000001',
7: 'E12000001',
8: 'E06000047',
9: 'E06000005'},
('Area Names', 'Area Names', 'Area Names'): {3: 'UNITED KINGDOM',
5: 'ENGLAND',
7: 'NORTH EAST',
8: 'County Durham',
9: 'Darlington'}})
I think you need set_index with tuples to set the MultiIndex:
df.set_index([('Area Codes','Area Codes','Area Codes'),
('Area Names','Area Names','Area Names')], inplace=True)
df.index.names = ['Area Codes','Area Names']
print (df)
2011/12* 2012/13* \
High Low Medium Very High High Low
7-8 0-4 5-6 9-10 7-8 0-4
Area Codes Area Names
K02000001 UNITED KINGDOM 49.83 6.51 17.44 26.22 51.16 5.71
E92000001 ENGLAND 50.01 6.53 17.59 25.87 51.35 5.74
E12000001 NORTH EAST 48.09 6.49 18.11 27.32 48.47 6.73
E06000047 County Durham 43.58 6.41 19.23 30.78 44.67 8.42
E06000005 Darlington 44.19 6.12 20.01 29.68 49.39 6.51
Medium Very High
5-6 9-10
Area Codes Area Names
K02000001 UNITED KINGDOM 17.10 26.03
E92000001 ENGLAND 17.29 25.62
E12000001 NORTH EAST 18.46 26.34
E06000047 County Durham 20.23 26.68
E06000005 Darlington 15.81 28.30
Then you need sort_index, because otherwise slicing raises:
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
Last, use selection with slicers:
idx = pd.IndexSlice
print (df.loc[idx['E06000047',:], :])
2011/12* 2012/13* \
High Low Medium Very High High Low
7-8 0-4 5-6 9-10 7-8 0-4
Area Codes Area Names
E06000047 County Durham 43.58 6.41 19.23 30.78 44.67 8.42
Medium Very High
5-6 9-10
Area Codes Area Names
E06000047 County Durham 20.23 26.68
print (df.loc[idx[:,'ENGLAND'], idx['2012/13*',['Low','Very High']]])
2012/13*
Low Very High
0-4 9-10
Area Codes Area Names
E92000001 ENGLAND 5.74 25.62
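For a single label, DataFrame.xs can be a slightly shorter alternative to IndexSlice (a sketch assuming the sorted df built above):
# select the ENGLAND row, then the Low and Very High columns under 2012/13*
print (df.xs('ENGLAND', level='Area Names')['2012/13*'][['Low', 'Very High']])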