This is my dataframe:
ID Date Value Final Value
0 9560 12/15/2021 30 5.0
1 9560 07/3/2021 25 5.0
2 9560 03/03/2021 20 20.0
3 9712 08/20/2021 15 5.0
4 9712 12/31/2021 10 10.0
5 9920 04/11/2021 5 5.0
Here I need to create another column, 'Round Date', derived from the 'Date' column: if the day of the month is greater than 15, the date should be rounded up to the first day of the next month; otherwise it should be rounded down to the first day of its own month. The expected output is given below.
ID Date Value Final Value Round Date
0 9560 12/15/2021 30 5.0 12/01/2021
1 9560 07/3/2021 25 5.0 07/01/2021
2 9560 03/03/2021 20 20.0 03/01/2021
3 9712 08/20/2021 15 5.0 09/01/2021
4 9712 12/31/2021 10 10.0 01/01/2022
5 9920 04/11/2021 5 5.0 04/01/2021
The solution is composed of a few elements:
Create a function that rounds a single date. Although an anonymous lambda could be used, a named function is better because we can unit-test it to check that it behaves the way we expect (see the spot checks after the code below).
Convert the 'Date' column to a datetime type, using the built-in pandas function pd.to_datetime().
Lastly, to create the 'Round Date' column, call apply() on the 'Date' column and pass it our rounding function.
Here is the code to round the Date column in pandas dataframe:
from dateutil import relativedelta
import pandas as pd

# function to round a date to the start of a month
def round_date(date):
    # check if the day is greater than 15
    if date.day > 15:
        # move to the next month
        date += relativedelta.relativedelta(months=1)
    # snap to the start of the month
    date = date.replace(day=1)
    return date

# read data
df = pd.read_csv('your-path')
# convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# apply the rounding function to the column
df['Round Date'] = df['Date'].apply(round_date)
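Because round_date is a named function, it can be spot-checked directly. Here is a minimal sketch using plain asserts (the expected values are taken from the example output above; a real test suite would use unittest or pytest):

from datetime import datetime

# day <= 15 rounds down to the 1st of the same month
assert round_date(datetime(2021, 12, 15)) == datetime(2021, 12, 1)
# day > 15 rounds up to the 1st of the next month
assert round_date(datetime(2021, 8, 20)) == datetime(2021, 9, 1)
# rounding December 31st rolls the year over
assert round_date(datetime(2021, 12, 31)) == datetime(2022, 1, 1)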
Input:
ID Date Value Final Value
0 9560 12/15/2021 30 5.0
1 9560 07/03/2021 25 5.0
2 9560 03/03/2021 20 20.0
3 9712 08/20/2021 15 5.0
4 9712 12/31/2021 10 10.0
5 9920 04/11/2021 5 5.0
Output:
ID Date Value Final Value Round Date
0 9560 12/15/2021 30 5.0 12/01/2021
1 9560 07/03/2021 25 5.0 07/01/2021
2 9560 03/03/2021 20 20.0 03/01/2021
3 9712 08/20/2021 15 5.0 09/01/2021
4 9712 12/31/2021 10 10.0 01/01/2022
5 9920 04/11/2021 5 5.0 04/01/2021
Here is a solution using apply() and a lambda:
Convert the values in the Date column to datetime using pd.to_datetime().
Use apply() with a lambda that sets the day to 1, and additionally moves to the next month when the day is greater than 15.
Code:
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({'Date': ['2/1/2021', '01/31/2021', '12/31/2021', '2021-12-01']})
df.Date = pd.to_datetime(df.Date)
# round up to the next month's start when the day > 15, otherwise down to this month's start
df['Round Date'] = df.Date.apply(lambda x: x.replace(day=1) + relativedelta(months=1)
                                 if x.day > 15
                                 else x.replace(day=1))
Input:
Date
0 2/1/2021
1 01/31/2021
2 12/31/2021
3 2021-12-01
Output:
Date Round Date
0 2021-02-01 2021-02-01
1 2021-01-31 2021-02-01
2 2021-12-31 2022-01-01
3 2021-12-01 2021-12-01
Using numpy's where:

import pandas as pd
import numpy as np

df["Date"] = pd.to_datetime(df["Date"])
df["date.rounded"] = np.where(
    df.Date.dt.day > 15,
    # day > 15: roll forward to the next month's start
    df.Date + pd.offsets.MonthBegin(0),
    # day <= 15: roll forward to the month's end, then back to its start
    df.Date + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)
)
This yields:
Date date.rounded
0 2021-12-15 2021-12-01
1 2021-07-03 2021-07-01
2 2021-03-03 2021-03-01
3 2021-08-20 2021-09-01
4 2021-12-31 2022-01-01
5 2021-04-11 2021-04-01
I have a dataframe (mydf) with dates for each group in monthly frequency like below:
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
I want to fill in the missing months for each group, from each Id's own start date up to the maximum date in the Dt column, while filling in 0 for the Sales column. So each group starts at its own start date, but all groups end at the same end date.
So for e.g. Id=A will start from 2020-10-01 and go all the way to 2021-06-01, and the value for the filled dates will be 0.
So the output will be
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
2021-01-01 A 0
2021-02-01 A 0
2021-03-01 A 0
2021-04-01 A 0
2021-05-01 A 0
2021-06-01 A 0
I have tried reindex, but I want to derive the date ranges from the groups instead of hard-coding them.
My code is :
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()
Let's try:
Get the minimum Dt per group using groupby.min.
Add a column called max to the aggregated mins, holding the maximum of Dt across the whole frame (Series.max).
Create an individual date_range per group based on the min and max values.
Series.explode into rows to get a DataFrame that represents the new index.
Create a MultiIndex.from_frame to reindex the DataFrame with.
reindex with midx and set fill_value=0.
# Get min per group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from the whole frame
dates['max'] = mydf['Dt'].max()
# Create a MultiIndex with a separate date range per group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)
mydf:
Dt Id Sales
0 2020-10-01 A 47
1 2020-11-01 A 67
2 2020-12-01 A 46
3 2021-01-01 A 0
4 2021-02-01 A 0
5 2021-03-01 A 0
6 2021-04-01 A 0
7 2021-05-01 A 0
8 2021-06-01 A 0
9 2021-03-01 B 2
10 2021-04-01 B 42
11 2021-05-01 B 20
12 2021-06-01 B 4
DataFrame:

import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
An alternative using pd.MultiIndex with a list comprehension:

s = pd.MultiIndex.from_tuples([[x, d]
                               for x, y in mydf.groupby("Id")["Dt"]
                               for d in pd.date_range(min(y), max(mydf["Dt"]), freq="MS")],
                              names=["Id", "Dt"])
print(mydf.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
Here is a different approach:

from itertools import product

# compute the min-max date range
date_range = pd.date_range(*mydf['Dt'].agg(['min', 'max']), freq='MS', name='Dt')
# make a MultiIndex per group, keeping only dates above each group's min date
idx = pd.MultiIndex.from_tuples([e for Id, Dt_min in mydf.groupby('Id')['Dt'].min().items()
                                 for e in product(date_range[date_range > Dt_min], [Id])])
# concatenate the original dataframe and the missing indexes
mydf = mydf.set_index(['Dt', 'Id'])
mydf = pd.concat([mydf,
                  mydf.reindex(idx.difference(mydf.index)).fillna(0)]
                 ).sort_index(level=1).reset_index()
mydf
output:
Dt Id Sales
0 2020-10-01 A 47.0
1 2020-11-01 A 67.0
2 2020-12-01 A 46.0
3 2021-01-01 A 0.0
4 2021-02-01 A 0.0
5 2021-03-01 A 0.0
6 2021-04-01 A 0.0
7 2021-05-01 A 0.0
8 2021-06-01 A 0.0
9 2021-03-01 B 2.0
10 2021-04-01 B 42.0
11 2021-05-01 B 20.0
12 2021-06-01 B 4.0
We can use the complete function from pyjanitor to expose the missing values.
Convert Dt to datetime:

df['Dt'] = pd.to_datetime(df['Dt'])

Create a mapping of Dt to new values via pd.date_range, with the frequency set to month start (MS):

max_time = df.Dt.max()
new_values = {"Dt": lambda df: pd.date_range(df.min(), max_time, freq='1MS')}

# pip install pyjanitor
import janitor
import pandas as pd

df.complete([new_values], by='Id').fillna(0)
Id Dt Sales
0 A 2020-10-01 47.0
1 A 2020-11-01 67.0
2 A 2020-12-01 46.0
3 A 2021-01-01 0.0
4 A 2021-02-01 0.0
5 A 2021-03-01 0.0
6 A 2021-04-01 0.0
7 A 2021-05-01 0.0
8 A 2021-06-01 0.0
9 B 2021-03-01 2.0
10 B 2021-04-01 42.0
11 B 2021-05-01 20.0
12 B 2021-06-01 4.0
Sticking to pandas only, we can combine apply with groupby and reindex; thankfully, Dt is unique within each group, so we can safely reindex:

(df
 .set_index('Dt')
 .groupby('Id')
 .apply(lambda df: df.reindex(pd.date_range(df.index.min(),
                                            max_time,
                                            freq='1MS'),
                              fill_value=0))
 .drop(columns='Id')
 .rename_axis(['Id', 'Dt'])
 .reset_index())
Id Dt Sales
0 A 2020-10-01 47
1 A 2020-11-01 67
2 A 2020-12-01 46
3 A 2021-01-01 0
4 A 2021-02-01 0
5 A 2021-03-01 0
6 A 2021-04-01 0
7 A 2021-05-01 0
8 A 2021-06-01 0
9 B 2021-03-01 2
10 B 2021-04-01 42
11 B 2021-05-01 20
12 B 2021-06-01 4
I have a data frame:
ID Date Volume
1 2019Q1 9
1 2020Q2 11
2 2019Q3 39
2 2020Q4 23
I want to convert this yyyy-Qn format to datetime.
I have used a dictionary to map the corresponding dates to the quarters, but I need more generalized code for instances where the year changes.
Expected output:
ID Date Volume
1 2019-03 9
1 2020-06 11
2 2019-09 39
2 2020-12 23
Let's use pd.PeriodIndex:
df['Date_new'] = pd.PeriodIndex(df['Date'], freq='Q').strftime('%Y-%m')
Output:
ID Date Volume Date_new
0 1 2019Q1 9 2019-03
1 1 2020Q2 11 2020-06
2 2 2019Q3 39 2019-09
3 2 2020Q4 23 2020-12
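Note that strftime returns strings. If an actual datetime is wanted (one reading of "convert to datetime" in the question), a possible variant, sketched here under the assumption that the end-of-quarter month is desired as in the expected output, converts each quarter to its last month and then to a timestamp:

# 2019Q1 -> monthly Period 2019-03 -> Timestamp 2019-03-01
df['Date_new'] = pd.PeriodIndex(df['Date'], freq='Q').asfreq('M', how='E').to_timestamp()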
Here's a simple solution, though not as efficient (which shouldn't be a problem unless your dataset is very large):
Convert the date column to datetime using to_datetime(); pandas parses "2019Q1" as the first month of the quarter.
Then add 2 months to each date, because you want the month to be the end-of-quarter month.
df = pd.DataFrame({'date': ["2019Q1" ,"2019Q3", "2019Q2", "2020Q4"], 'volume': [1,2,3, 4]})
df['datetime'] = pd.to_datetime(df['date'])
df['datetime'] = df['datetime'] + pd.DateOffset(months=2)
Output:
date volume datetime
0 2019Q1 1 2019-03-01
1 2019Q3 2 2019-09-01
2 2019Q2 3 2019-06-01
3 2020Q4 4 2020-12-01
I have a DataFrame of account statement that contains date, debit and credit.
Let's just say salary gets deposited on every 20th of the month.
I want to group the date column from the 20th of each month to the 20th of the next month, to find the sum of debits and credits; e.g., 20th Jan to 20th Feb, and so on.
date_parsed Debit Credit
0 2020-05-02 775.0 0.0
1 2020-04-30 209.0 0.0
2 2020-04-24 5000.0 0.0
3 2020-04-24 25000.0 0.0
... ... ... ...
79 2020-04-20 750.0 0.0
80 2020-04-15 5000.0 0.0
81 2020-04-13 0.0 2283.0
82 2020-04-09 0.0 6468.0
83 2020-04-03 0.0 1000.0
I am not sure, but perhaps pd.offsets can be used with groupby.
You could add an extra month column that truncates up or down based on the day of the month; then it's just groupby and sum. E.g. month 2020-06 would include dates between 2020-05-20 and 2020-06-19.
import pandas as pd
import numpy as np

df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'],
                   'Credit': [1, 2, 3, 4],
                   'Debit': [5, 6, 7, 8]})
df['date'] = pd.to_datetime(df.date_parsed)
# before the 20th -> this month; from the 20th on -> next month
df['month'] = np.where(df.date.dt.day < 20,
                       df.date.dt.to_period('M'),
                       (df.date + pd.DateOffset(months=1)).dt.to_period('M'))
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())
Input:
date_parsed Credit Debit
0 2020-05-02 1 5
1 2020-05-03 2 6
2 2020-05-20 3 7
3 2020-05-22 4 8
Result:
month Credit Debit
0 2020-05 3 11
1 2020-06 7 15
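Applied to the question's frame, the same idea would look like this (a sketch, assuming date_parsed has already been converted to datetime with pd.to_datetime):

# assign each row to its 20th-to-20th "salary month", then aggregate
df['month'] = np.where(df.date_parsed.dt.day < 20,
                       df.date_parsed.dt.to_period('M'),
                       (df.date_parsed + pd.DateOffset(months=1)).dt.to_period('M'))
out = df.groupby('month')[['Debit', 'Credit']].sum().reset_index()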
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT", Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3); and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use groupby with transform to build the group key 'Time Elapsed', then just groupby it and take the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
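The group keys above are timedeltas. To get them as minutes (0.0, 5.0, 10.0) as in the desired output, one possible tweak, sketched here, is to divide by a one-minute Timedelta before grouping:

# express the elapsed time as float minutes, then group
df['Time Elapsed'] = df['Time Elapsed'] / pd.Timedelta(minutes=1)
df.groupby('Time Elapsed').Value.mean()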
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential counter column (per this SO post):
the counter is computed within (1) each year, (2) then each month, and then (3) each day
since the data are in multiples of 5 minutes, scale the counter by 5 so it holds minutes elapsed rather than a plain run of increasing integers (cumcount starts at 0, which matches timepoint zero)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount()
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 0
1 1 2018-01-01 15:05:00 150 1 1 2018 5
1 1 2018-01-01 15:10:00 125 1 1 2018 10
2 2 2018-02-02 13:15:00 105 2 2 2018 0
2 2 2018-02-02 13:20:00 110 2 2 2018 5
2 2 2018-02-02 13:25:00 90 2 2 2018 10
3 3 2019-03-03 05:05:00 180 3 3 2019 0
3 3 2019-03-03 05:10:00 190 3 3 2019 5
3 3 2019-03-03 05:15:00 185 3 3 2019 10
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
0 128.333333
5 150.000000
10 133.333333
Name: Value, dtype: float64
I have a pandas dataframe that looks like this:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
I would like to groupby on KEY and sum on VALUE but only on continuous periods of time. For instance in the above example I would like to get:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-01 5.0
2 B 2017-01-01 2017-02-10 6.0
There are two groups for A since there is a gap in the time periods.
I would like to avoid for loops since the dataframe has tens of millions of rows.
Create a helper Series by comparing the shifted START column per group, and use it for the groupby: a row whose END equals the next row's START (within its KEY group) is labelled with that shared boundary date, and combine_first fills the remaining rows with their own START.
s = df.loc[df.groupby('KEY')['START'].shift(-1) == df['END'], 'END']
s = s.combine_first(df['START'])
print(s)
0 2017-01-01
1 2017-01-23
2 2017-01-23
3 2017-02-02
4 2017-02-02
Name: END, dtype: datetime64[ns]
df = df.groupby(['KEY', s], as_index=False).agg({'START':'first','END':'last','VALUE':'sum'})
print(df)
KEY VALUE START END
0 A 2.1 2017-01-01 2017-01-16
1 A 5.0 2017-01-28 2017-03-01
2 B 6.0 2017-01-01 2017-02-10
The answer from jezrael works like a charm if there are only two consecutive rows to aggregate. In the new example below, it would not aggregate the last three rows for KEY = A.
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
5 A 2017-03-01 2017-03-23 1.0
The following solution (slight modification of jezrael's solution) enables to aggregate all rows that should be aggregated:
df = df.sort_values(by='START')
# mark the last row of each continuous run: its END does not match the next START
idx = df.groupby('KEY')['START'].shift(-1) != df['END']
# label run-ending rows with their START, then backfill the label through each run
df['DATE'] = df.loc[idx, 'START']
df['DATE'] = df.groupby('KEY').DATE.bfill()
df = (df.groupby(['KEY', 'DATE'], as_index=False)
        .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
        .drop(['DATE'], axis=1))
Which gives:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-23 6.0
2 B 2017-01-01 2017-02-10 6.0
Thanks @jezrael for the elegant approach!