Group the data by month - python

I have a dataset and I want to group the days by month. A sample of my dataset:
Date Price
2020-01-02 23245
2020-01-03 23245
2020-01-04 23245
2020-01-05 23245
I want this:
Date Price
2020-01 252525
2020-02 4525224
2020-03 2424552
2020-04 4552525
So I want to sum by month, dropping the day.

Make sure df has the correct dtypes first; then you can group by year and month:
df["Date"] = pd.to_datetime(df.Date)
df["Price"] = df.Price.astype(float)
df["year"] = df.Date.dt.year
df["month"] = df.Date.dt.month
df.groupby(["year", "month"], as_index=False)["Price"].sum()
output:
year month Price
0 2020 1 92980
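An equivalent, more compact route (a sketch, assuming the same `Date`/`Price` columns as above) is to group directly on a monthly period, so no helper columns are needed:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"],
    "Price": [23245, 23245, 23245, 23245],
})
df["Date"] = pd.to_datetime(df["Date"])

# dt.to_period("M") collapses each date to its year-month, e.g. 2020-01
out = df.groupby(df["Date"].dt.to_period("M"))["Price"].sum().reset_index()
print(out)
```

This prints a single row for 2020-01 with Price 92980, matching the year/month result above.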

Let's assume you have a dataframe with one record per day, where the price is a random number between 1 and 100. You can then group by month and get the sum of the prices:
import pandas as pd
import random

df = pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-12-30', freq='D'),
                   'Price': [random.randint(1, 100) for _ in range(365)]})
df['Month'] = df.Date.dt.strftime('%Y-%m')
print(df)
print(df.groupby('Month')['Price'].sum().reset_index())
Here's the output:
Date Price Month
0 2020-01-01 13 2020-01
1 2020-01-02 40 2020-01
2 2020-01-03 61 2020-01
3 2020-01-04 86 2020-01
4 2020-01-05 100 2020-01
.. ... ... ...
360 2020-12-26 80 2020-12
361 2020-12-27 82 2020-12
362 2020-12-28 13 2020-12
363 2020-12-29 10 2020-12
364 2020-12-30 58 2020-12
[365 rows x 3 columns]
Month Price
0 2020-01 1622
1 2020-02 1244
2 2020-03 1564
3 2020-04 1335
4 2020-05 1625
5 2020-06 1545
6 2020-07 1406
7 2020-08 1891
8 2020-09 1625
9 2020-10 1625
10 2020-11 1309
11 2020-12 1327
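If the dates can serve as the index, `resample` gives the same monthly sums; a sketch on the same kind of random frame ('MS' labels each bucket with the first day of its month):

```python
import pandas as pd
import random

random.seed(1)  # seeded only so the run is reproducible
df = pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-12-30', freq='D'),
                   'Price': [random.randint(1, 100) for _ in range(365)]})

# resample needs a DatetimeIndex; "MS" buckets by calendar month
monthly = df.set_index('Date')['Price'].resample('MS').sum()
print(monthly)
```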

Split rows based on different rows and columns

I would really appreciate your help on this.
I have a table with products, dates, and amounts. This is what the initial table looks like:
Product ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
The goal is to split each row into one row per remaining month of its year:
If it's the first year, split from the go-live month to the end of the year.
If the year is the end year, split only up to the month given in endyear. The revenue must be divided by the number of monthly rows, since the revenue in the first table covers the whole period.
All years in between are divided into 12 rows, one per month, with the revenue split accordingly.
Product ID goliveyear endyear Year Month Revenue
1 2020-10 2022-02 2020 10 90/3=30
1 2020-10 2022-02 2020 11 30
1 2020-10 2022-02 2020 12 30
1 2020-10 2022-02 2021 01 140/12 =11.67
1 2020-10 2022-02 2021 02 11.67
1 2020-10 2022-02 2021 03 11.67
1 2020-10 2022-02 2021 04 11.67
... ... ... ... ... ...
1 2020-10 2022-02 2022 01 60/2 = 30
1 2020-10 2022-02 2022 02 30
Thank you so much, everyone.
Quite a few steps.
Start by setting up the df
from io import StringIO
import pandas as pd
from datetime import datetime,timedelta
df = pd.read_csv(StringIO(
"""
Product_ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
"""), delim_whitespace=True)
df['goliveyear'] = pd.to_datetime(df['goliveyear'])
df['endyear'] = pd.to_datetime(df['endyear'])
df
Then add year_start, year_end, period_start, period_end columns
df['ys'] = df['goliveyear'].dt.year + df.groupby('Product_ID').cumcount()
df['ye'] = df['ys'] + 1
df['ys'] = pd.to_datetime(df['ys'], format = '%Y')
df['ye'] = pd.to_datetime(df['ye'], format = '%Y')+ timedelta(days=-1)
df['ps'] = df[['goliveyear','ys']].max(axis=1)
df['pe'] = df[['endyear','ye']].min(axis=1)
produces
Product_ID goliveyear endyear Revenue ys ye ps pe
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- -------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00
Then add the months, as lists at first:
df['months'] = df.apply(lambda r: [d.month for d in pd.date_range(r['ps'], r['pe'], freq='MS')], axis=1)
output:
Product_ID goliveyear endyear Revenue ys ye ps pe months
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- ------------------- ---------------------------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00 [10, 11, 12]
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00 [1, 2]
Then we explode months, do the required revenue calculation, and drop the unneeded columns:
df = df.explode('months')
df['Revenue'] = df['Revenue'] / df.groupby(['Product_ID','ys'])['months'].transform('count')
df = df.drop(columns = ['goliveyear','endyear','ye','ps','pe'])
df['ys'] = df['ys'].dt.year
to get
Product_ID Revenue ys months
-- ------------ --------- ---- --------
0 1 30 2020 10
0 1 30 2020 11
0 1 30 2020 12
1 1 11.6667 2021 1
1 1 11.6667 2021 2
1 1 11.6667 2021 3
1 1 11.6667 2021 4
1 1 11.6667 2021 5
1 1 11.6667 2021 6
1 1 11.6667 2021 7
1 1 11.6667 2021 8
1 1 11.6667 2021 9
1 1 11.6667 2021 10
1 1 11.6667 2021 11
1 1 11.6667 2021 12
2 1 30 2022 1
2 1 30 2022 2
Try this:
import pandas as pd
from io import StringIO
s = """
Product ID,goliveyear,endyear,Revenue
1,2020-10,2022-02,90
1,2020-10,2022-02,140
1,2020-10,2022-02,60"""
df = pd.read_csv(StringIO(s))
# generate the list of months between these two dates
df['rng'] = df.apply(lambda x: pd.date_range(x['goliveyear'], x['endyear'],
                                             freq='MS'), axis=1)
# explode the dataframe by months list
df_exploded = df.explode('rng')
df_exploded['Year'] = df_exploded['rng'].dt.year
df_exploded['Month'] = df_exploded['rng'].dt.month
# the (index, year) pairs used to filter rows
filter_year = list(zip(df.index, df_exploded.Year.unique()))
# used columns
use_cols = ['Product ID', 'goliveyear', 'endyear', 'Revenue', 'Month']
# filter rows
df_filter = (df_exploded.set_index([df_exploded.index, df_exploded.Year])
             .loc[filter_year, use_cols]
             .reset_index()
             .drop(columns='level_0'))
# calculate the average Revenue
result = df_filter.set_index(['Year', "Month"]).assign(
Revenue=(df_filter.groupby(['Year', 'Month'])['Revenue'].sum() /
df_filter.groupby('Year')['Month'].count())
).reset_index()
result
Output
Year Month Product ID goliveyear endyear Revenue
0 2020 10 1 2020-10 2022-02 30.000000
1 2020 11 1 2020-10 2022-02 30.000000
2 2020 12 1 2020-10 2022-02 30.000000
3 2021 1 1 2020-10 2022-02 11.666667
4 2021 2 1 2020-10 2022-02 11.666667
5 2021 3 1 2020-10 2022-02 11.666667
6 2021 4 1 2020-10 2022-02 11.666667
7 2021 5 1 2020-10 2022-02 11.666667
8 2021 6 1 2020-10 2022-02 11.666667
9 2021 7 1 2020-10 2022-02 11.666667
10 2021 8 1 2020-10 2022-02 11.666667
11 2021 9 1 2020-10 2022-02 11.666667
12 2021 10 1 2020-10 2022-02 11.666667
13 2021 11 1 2020-10 2022-02 11.666667
14 2021 12 1 2020-10 2022-02 11.666667
15 2022 1 1 2020-10 2022-02 30.000000
16 2022 2 1 2020-10 2022-02 30.000000

Group by month from a particular date python

I have a DataFrame of account statement that contains date, debit and credit.
Lets just say salary gets deposited every 20th of the month.
I want to group the date column from the 20th of each month to the 20th of the next, to find the sum of debits and credits, e.g. 20th Jan to 20th Feb, and so on.
date_parsed Debit Credit
0 2020-05-02 775.0 0.0
1 2020-04-30 209.0 0.0
2 2020-04-24 5000.0 0.0
3 2020-04-24 25000.0 0.0
... ... ... ...
79 2020-04-20 750.0 0.0
80 2020-04-15 5000.0 0.0
81 2020-04-13 0.0 2283.0
82 2020-04-09 0.0 6468.0
83 2020-04-03 0.0 1000.0
I am not sure, but perhaps pd.offsets can be used with groupby.
You could add an extra month column that rounds up or down based on the day of the month. Then it's just a groupby and sum. E.g. month 2020-06 would include dates between 2020-05-20 and 2020-06-19.
import pandas as pd
import numpy as np
df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'], 'Credit': [1,2,3,4], 'Debit': [5,6,7,8]})
df['date'] = pd.to_datetime(df.date_parsed)
df['month'] = np.where(df.date.dt.day < 20, df.date.dt.to_period('M'), (df.date + pd.DateOffset(months=1)).dt.to_period('M'))
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())
Input:
date_parsed Credit Debit
0 2020-05-02 1 5
1 2020-05-03 2 6
2 2020-05-20 3 7
3 2020-05-22 4 8
Result:
month Credit Debit
0 2020-05 3 11
1 2020-06 7 15
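The asker's `pd.offsets` hunch can be made to work as well: shift every date back 19 days so that the 20th lines up with the 1st, take a plain monthly period, and add one. Unlike a fixed forward shift, this handles short and long months alike (a sketch on the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'],
                   'Credit': [1, 2, 3, 4], 'Debit': [5, 6, 7, 8]})
df['date'] = pd.to_datetime(df.date_parsed)

# days 1-19 shift back into the previous month's period; adding 1 then
# labels each 20th-to-19th window with the later month, as in the answer above
df['month'] = (df['date'] - pd.Timedelta(days=19)).dt.to_period('M') + 1
out = df.groupby('month')[['Credit', 'Debit']].sum().reset_index()
print(out)
```

This reproduces the same two bins as the np.where version: 2020-05 with Credit 3 / Debit 11, and 2020-06 with Credit 7 / Debit 15.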

How can I group dates into pandas

Datos
2015-01-01 58
2015-01-02 42
2015-01-03 41
2015-01-04 13
2015-01-05 6
... ...
2020-06-18 49
2020-06-19 41
2020-06-20 23
2020-06-21 39
2020-06-22 22
2000 rows × 1 columns
I have this df, made up of a column whose data represents the average temperature of each day over an interval of years. I would like to know how to get the maximum for each day of the year (taking a 365-day year) and obtain a df similar to this:
Datos
1 40
2 50
3 46
4 8
5 26
... ...
361 39
362 23
363 23
364 37
365 25
365 rows × 1 columns
Forgive my ignorance and thank you very much for the help.
You can do this:
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(by=pd.Grouper(key='Date', freq='D')).max().reset_index()
df['Day'] = df['Date'].dt.dayofyear
print(df)
Date Temp Day
0 2015-01-01 58.0 1
1 2015-01-02 42.0 2
2 2015-01-03 41.0 3
3 2015-01-04 13.0 4
4 2015-01-05 6.0 5
... ... ... ...
1995 2020-06-18 49.0 170
1996 2020-06-19 41.0 171
1997 2020-06-20 23.0 172
1998 2020-06-21 39.0 173
1999 2020-06-22 22.0 174
Make a new column (assuming the dates are the DatetimeIndex of df):
df["day of year"] = df.index.dayofyear
Then
df.groupby("day of year").max()
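Putting the two answers together: once a day-of-year column exists, grouping on it with max gives one row per day across all years. A minimal sketch, assuming the dates live in a `Date` column and the temperatures in `Datos`:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2015-01-01', '2015-01-02', '2016-01-01', '2016-01-02'],
                   'Datos': [58, 42, 60, 40]})
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.dayofyear

# one row per day of year, keeping the maximum across all years
result = df.groupby('Day')['Datos'].max()
print(result)
```

With this toy data, day 1 keeps 60 (the larger of 58 and 60) and day 2 keeps 42.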

Monthly aggregated values, pandas dataframe

A sample CSV data in which the first column is a time stamp (date + time):
2018-01-01 10:00:00,23,43
2018-01-02 11:00:00,34,35
2018-01-05 12:00:00,25,4
2018-01-10 15:00:00,22,96
2018-01-01 18:00:00,24,53
2018-03-01 10:00:00,94,98
2018-04-20 10:00:00,90,9
2018-04-10 10:00:00,45,51
2018-01-01 10:00:00,74,44
2018-12-01 10:00:00,76,87
2018-11-01 10:00:00,76,87
2018-12-12 10:00:00,87,90
I already wrote some code to do the monthly aggregation task while waiting for someone to give me suggestions.
Thanks #moys, anyway!
import pandas as pd

df1 = pd.read_csv('Sample.txt', header=None, names=['Timestamp', 'Value 1', 'Value 2'])
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['Monthly'] = df1['Timestamp'].dt.to_period('M')
grouper = pd.Grouper(key='Monthly')
df2 = df1.groupby(grouper)[['Value 1', 'Value 2']].sum().reset_index()
The output is:
Monthly Value 1 Value 2
0 2018-01 202 275
1 2018-03 94 98
2 2018-04 135 60
3 2018-12 163 177
4 2018-11 76 87
What if a dataset has more columns? How can I modify my code so it automatically works on a dataset with more columns?
2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43
2018-02-10 15:00:00,22,96,24
2018-05-01 18:00:00,24,53,98
2018-02-01 10:00:00,94,98,32
2018-02-20 10:00:00,90,9,24
2018-07-10 10:00:00,45,51,32
2018-01-01 10:00:00,74,44,34
2018-12-04 10:00:00,76,87,53
2018-12-02 10:00:00,76,87,21
2018-12-12 10:00:00,87,90,98
You can do something like below
df.groupby(pd.to_datetime(df['date']).dt.month).sum().reset_index()
Output (here, the 'date' column is the month number):
date val1 val2
0 1 202 275
1 3 94 98
2 4 135 60
3 11 76 87
4 12 163 177
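To handle an arbitrary number of value columns, one option (a sketch, with column labels generated automatically rather than hard-coded) is to read with `header=None` and let `sum()` aggregate every remaining column:

```python
from io import StringIO
import pandas as pd

csv_data = """2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43"""

# column 0 is the timestamp; every other column is a value column
df = pd.read_csv(StringIO(csv_data), header=None)
df[0] = pd.to_datetime(df[0])
df['Monthly'] = df[0].dt.to_period('M')

# sum() aggregates all remaining numeric columns, however many there are
out = df.drop(columns=0).groupby('Monthly').sum().reset_index()
print(out)
```

Because no column list is named, the same code works unchanged whether the file has two value columns or ten.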

How to copy paste values from another dataset conditional on a column

I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507. 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the two datasets on Group_Id, then copy only the date part of Timestamp from df2 into a new column in df1, based on the corresponding Group_Id; name the column day1.
Then I want to add 6 more columns next to day1, named day2, ..., day7, holding the next six days after day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need a merge here, then build one row of seven dates per record (note periods=7, one date for each of day1 through day7):
df1 = df1.merge(df2, how='left')
s = pd.DataFrame([pd.date_range(x, periods=7, freq='D') for x in df1.Timestamp], index=df1.index)
s.columns += 1
df1.join(s.add_prefix('day'))
Another approach here: it basically just merges the dfs, grabs the date from the timestamp, and makes six more columns, adding a day each time:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp']) #only necessary if not already timestamp
df3['day1'] = df3['Timestamp'].dt.date
for i in range(1, 7):
    df3['day'+str(i+1)] = df3['day1'] + pd.Timedelta(i, unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
Note that I copied your data frame into a csv and only had the 5 entries, so the index is not the same as in your example (i.e. 100, 101).
You can delete the Timestamp column if it's not needed.
