Add missing datetime columns to grouped dataframe - python

Is it possible to add the missing date columns from the created date_range to the grouped dataframe df without a for loop, filling the missing values with zeros? date_range has 7 dates, while df has 4 date columns. So how do I add the 3 missing columns to df?
import pandas as pd
from datetime import datetime

start = datetime(2018, 6, 4)
end = datetime(2018, 6, 10)
date_range = pd.date_range(start=start, end=end, freq='D')

DatetimeIndex(['2018-06-04', '2018-06-05', '2018-06-06', '2018-06-07',
               '2018-06-08', '2018-06-09', '2018-06-10'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({
    'date':
        ['2018-06-07', '2018-06-10', '2018-06-09', '2018-06-09',
         '2018-06-08', '2018-06-09', '2018-06-08', '2018-06-10',
         '2018-06-10', '2018-06-10'],
    'name':
        ['sogan', 'lyam', 'alex', 'alex',
         'kovar', 'kovar', 'kovar', 'yamo', 'yamo', 'yamo']
})
df['date'] = pd.to_datetime(df['date'])
df = (df
      .groupby(['name', 'date'])[['date']]
      .count()
      .unstack(fill_value=0)
      )
df
           date
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex                   0                   0                   2                   0
kovar                  0                   2                   1                   0
lyam                   0                   0                   0                   1
sogan                  1                   0                   0                   0
yamo                   0                   0                   0                   3

I would pivot the table so the date columns become rows, then use pandas' .asfreq function:
DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
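A minimal sketch of that idea (assuming df here is the raw frame defined above, before grouping; note that asfreq only fills gaps between the first and last observed dates, so the days before 2018-06-07 would still need a reindex):

# build a date-indexed count table, then conform it to daily frequency
counts = df.groupby(['date', 'name']).size().unstack(fill_value=0)
counts = counts.asfreq('D', fill_value=0)  # fills skipped days with 0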

Thanks to Sina Shabani for the clue about making the date columns rows. In this situation, setting the date as the index and using .reindex turned out to be more suitable:
df = (df.groupby(['date', 'name'])['name']
        .size()
        .reset_index(name='count')
        .pivot(index='date', columns='name', values='count')
        .fillna(0))
df
name alex kovar lyam sogan yamo
date
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.index = pd.DatetimeIndex(df.index)
df = (df.reindex(pd.date_range(start, freq='D', periods=7), fill_value=0)
        .sort_index())
df
name alex kovar lyam sogan yamo
2018-06-04 0.0 0.0 0.0 0.0 0.0
2018-06-05 0.0 0.0 0.0 0.0 0.0
2018-06-06 0.0 0.0 0.0 0.0 0.0
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.T

      2018-06-04  2018-06-05  2018-06-06  2018-06-07  2018-06-08  2018-06-09  2018-06-10
name
alex         0.0         0.0         0.0         0.0         0.0         2.0         0.0
kovar        0.0         0.0         0.0         0.0         2.0         1.0         0.0
lyam         0.0         0.0         0.0         0.0         0.0         0.0         1.0
sogan        0.0         0.0         0.0         1.0         0.0         0.0         0.0
yamo         0.0         0.0         0.0         0.0         0.0         0.0         3.0
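For reference, a compact sketch that keeps the dates as columns, as the question originally asked (reusing the raw df and the date_range built at the top):

# count (name, date) pairs, then reindex the columns against the full range
out = pd.crosstab(df['name'], df['date']).reindex(columns=date_range, fill_value=0)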

Related

How can I fill in missing dates in a dataframe?

I have a dataframe df_counts that contains the number of events that happen on a given day.
My goal is to fill in all the missing dates, and assign them a count of 0.
date count
0 2012-03-14 8
1 2012-03-19 1
2 2012-04-07 3
3 2012-04-10 1
4 2012-04-19 5
Desired output:
date count
0 2012-03-14 8
1 2012-03-15 0
2 2012-03-16 0
3 2012-03-17 0
4 2012-03-18 0
5 2012-03-19 1
6 2012-03-20 0
7 2012-03-21 0
8 2012-03-22 0
9 2012-03-23 0
...
Links I've read through:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
https://datatofish.com/pandas-dataframe-to-series/
Add missing dates to pandas dataframe
What I've tried:
idx = pd.date_range('2012-03-06', '2022-12-05')
s = df_counts.sequeeze
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value = 0)
s.head()
Output of what I've tried:
AttributeError: 'DataFrame' object has no attribute 'sequeeze'
You can achieve this using the asfreq() function.
# convert date column to datetime
df.date = pd.to_datetime(df.date)
# set date column as index, drop the original index
df = df.set_index("date", drop=True)
# daily freq
df = df.asfreq("D", fill_value=0)
Some notes on your code:
squeeze() is a method, not an attribute (note also the sequeeze typo), which is why you get an AttributeError there.
You can use the date column directly as the index via the set_index() function.
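For completeness, a sketch of the reindex route from the question with the typo fixed (assuming df_counts has the date and count columns shown above):

idx = pd.date_range('2012-03-06', '2022-12-05')
s = df_counts.set_index('date')['count']  # a Series, like squeeze() would return
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
s.head()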
You can use resample:
new_df = df.set_index('date').resample('1D').agg('first').fillna(0)
Output:
'''
date count
2012-03-14 8.0
2012-03-15 0.0
2012-03-16 0.0
2012-03-17 0.0
2012-03-18 0.0
2012-03-19 1.0
2012-03-20 0.0
2012-03-21 0.0
2012-03-22 0.0
2012-03-23 0.0
2012-03-24 0.0
2012-03-25 0.0
2012-03-26 0.0
2012-03-27 0.0
2012-03-28 0.0
2012-03-29 0.0
2012-03-30 0.0
2012-03-31 0.0
2012-04-01 0.0
2012-04-02 0.0
2012-04-03 0.0
2012-04-04 0.0
2012-04-05 0.0
2012-04-06 0.0
2012-04-07 3.0
2012-04-08 0.0
2012-04-09 0.0
2012-04-10 1.0
2012-04-11 0.0
2012-04-12 0.0
2012-04-13 0.0
2012-04-14 0.0
2012-04-15 0.0
2012-04-16 0.0
2012-04-17 0.0
2012-04-18 0.0
2012-04-19 5.0
'''
My favorite is fillna; you can use it on the whole df or on a single column:
df["column"].fillna(0)

using date range to convert month end data to weekly data on Pandas

I have a dataframe that looks like the following. It is month-end data.
date , value , expectation
31/01/2020, 34, 40
28/02/2020, 35, 38
31/03/2020, 40, 44
What I need:
date , value , expectation
07/01/2020, 0, 0
14/01/2020, 0, 0
21/01/2020, 0, 0
28/01/2020, 0, 0
04/02/2020, 34, 40
11/02/2020, 0, 0
18/02/2020, 0, 0
25/02/2020, 0, 0
04/03/2020, 35, 38
Basically, I am trying to convert the month-end data to weekly data. The twist is that the exact month-end date may not match the weekly date range, so it falls onto the following week-ending date (e.g., 04/02/2020 for 31/01/2020). The other week-ending dates are filled with 0. It sounds messy, but this is what I have tried:
import pandas as pd

df = pd.read_csv('file.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W')
empty = pd.DataFrame(index=dtr)
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
The code works but I do not get the exact expected output. Any help is appreciated.
Use merge_asof, which matches each weekly date to the most recent month-end at or before it, within the 7-day tolerance:

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W')
empty = pd.DataFrame(index=dtr)
df = pd.merge_asof(empty,
                   df,
                   left_index=True,
                   right_index=True,
                   tolerance=pd.Timedelta(7, 'd')).fillna(0)
print(df)
value expectation
2020-01-05 0.0 0.0
2020-01-12 0.0 0.0
2020-01-19 0.0 0.0
2020-01-26 0.0 0.0
2020-02-02 34.0 40.0
2020-02-09 0.0 0.0
2020-02-16 0.0 0.0
2020-02-23 0.0 0.0
2020-03-01 35.0 38.0
2020-03-08 0.0 0.0
2020-03-15 0.0 0.0
2020-03-22 0.0 0.0
2020-03-29 0.0 0.0
If you also need to change the start of the weeks, e.g. to Tuesdays, change freq in date_range:

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W-TUE')
empty = pd.DataFrame(index=dtr)
df = pd.merge_asof(empty,
                   df,
                   left_index=True,
                   right_index=True,
                   tolerance=pd.Timedelta(7, 'd')).fillna(0)
print(df)
value expectation
2020-01-07 0.0 0.0
2020-01-14 0.0 0.0
2020-01-21 0.0 0.0
2020-01-28 0.0 0.0
2020-02-04 34.0 40.0
2020-02-11 0.0 0.0
2020-02-18 0.0 0.0
2020-02-25 0.0 0.0
2020-03-03 35.0 38.0
2020-03-10 0.0 0.0
2020-03-17 0.0 0.0
2020-03-24 0.0 0.0
2020-03-31 40.0 44.0
The piece of code below should give you the desired result:

for end_date in df["date"]:
    days_diff = end_date - pd.date_range(end=end_date, freq='W', periods=5)[-1]
    print(pd.date_range(end='2020-03-31', freq='W', periods=5) + days_diff)

python how to find the number of days in each month between two date columns

I have two date columns, Start Date and End Date, and I want to find the year and the number of days in each month between those two dates. I can find the year, but I have no idea how to find the number of days in each month. I am not sure if it is feasible to get this output.
import pandas as pd

df = {'Id': ['1', '2', '3', '4', '5'],
      'Item': ['A', 'B', 'C', 'D', 'E'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
      'EndDate': ['2019-12-30', '2019-12-31', '2019-03-30', '2019-11-30', '2019-06-10']
      }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])
Expected output: (posted as an image in the original question)
I came up with a solution using pd.date_range and resample. You need to convert both the StartDate and EndDate columns to datetime dtype:
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

def days_of_month(x):
    s = pd.date_range(*x, freq='D').to_series()
    return s.resample('M').count().rename(lambda x: x.month)

df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)
Out[1036]:
1 2 3 4 5 6 7 8 9 10 11 12
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 31.0
2 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0 30.0 31.0 30.0 0.0
4 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0
Finally, join back to the original dataframe:
df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])
Out[1042]:
StartDate EndDate Year 1 2 3 4 5 6 7 8 \
0 2019-12-10 2019-12-30 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2019-12-01 2019-12-31 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2019-01-01 2019-03-30 2019 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0
3 2019-05-10 2019-11-30 2019 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0
4 2019-03-10 2019-06-10 2019 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0
9 10 11 12
0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 31.0
2 0.0 0.0 0.0 0.0
3 30.0 31.0 30.0 0.0
4 0.0 0.0 0.0 0.0
Solution
You can employ a combination of pandas and numpy vectorization to achieve this, as follows. The custom function is provided below for ease of use. Since it uses vectorization, it should be fairly fast.
Note: the assumptions used here, based on the sample data:
The date range spans only one year.
Both start and end dates fall in the same year.
If you have data from different years, you would need to apply this to each year's data separately. Also, if the start and end dates fall in different years, you will have to adapt this method. Since the problem presented here does not state that requirement, I leave this implementation as a guide for anyone interested in applying it to a multi-year dataset.
If you would like to try out this solution in a jupyter notebook environment, you can access it on github; the notebook has a Google Colaboratory link as well, so you can also open it directly in Google Colab.
# Updated DataFrame
df = process_dataframe(df) # custom function
display(df.head())
Dummy Data and Custom Function
Tested with pandas==0.25.3 and numpy==1.17.4 in the Google Colab environment.
import numpy as np
import pandas as pd
# from pandas.tseries.offsets import MonthEnd
from IPython.display import display

# Dummy Data
df = {'Id': ['1', '2', '3', '4', '5'],
      'Item': ['A', 'B', 'C', 'D', 'E'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
      'EndDate': ['2019-12-30', '2019-12-31', '2019-03-30', '2019-11-30', '2019-06-10']
      }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])

# Function for Processing the DataFrame
def process_dataframe(df):
    """Returns the updated dataframe."""
    df.StartDate = pd.to_datetime(df.StartDate)
    df.EndDate = pd.to_datetime(df.EndDate)
    month_ends = pd.date_range(start='2019-01', freq='M', periods=12)
    month_headers = month_ends.month_name().str.upper().str[:3].tolist()
    month_days = month_ends.day.to_numpy()
    month_nums = np.arange(12) + 1
    # Evaluate expressions once to avoid evaluating them multiple times
    start_date_month_num = df.StartDate.dt.month.to_numpy().reshape(-1, 1)
    end_date_month_num = df.EndDate.dt.month.to_numpy().reshape(-1, 1)
    # start_month_days = pd.to_datetime(df.StartDate, format="%Y%m") + MonthEnd(1) - df.StartDate
    # start_month_days.dt.days.to_numpy()
    # Number of days not in the end month
    end_month_days_excluded = month_days[df.EndDate.dt.month.to_numpy() - 1] \
                              - df.EndDate.dt.day.to_numpy()
    # Determine the months that fall within the start and end dates
    # (inclusive of start and end months), and add all days for the
    # relevant months
    result = ((start_date_month_num <= month_nums) &
              (end_date_month_num >= month_nums)).astype(int) \
             * month_days.reshape(1, -1)
    # Subtract the number of days not in the starting month
    result = result - (start_date_month_num == month_nums).astype(int) \
             * (df.StartDate.dt.day.to_numpy() - 1).reshape(-1, 1)
    # Subtract the number of days not in the ending month
    result = result - (end_date_month_num == month_nums).astype(int) \
             * end_month_days_excluded.reshape(-1, 1)
    return pd.merge(df, pd.DataFrame(result, columns=month_headers),
                    left_index=True, right_index=True)
Original DataFrame:

# Original DataFrame
display(df.head())

Output: (the resulting table was posted as an image in the original answer)

Pandas dataframe net present value vectorization (function vectorization)

I have the following dataframe:
import numpy as np
import pandas as pd
dates = pd.date_range(start='2014-01-01', end='2018-01-01', freq='Y')
df = pd.DataFrame(5 * np.eye(4), index=dates, columns=['Var1', 'Var2', 'Var3', 'Var4'])
print(df)
Var1 Var2 Var3 Var4
2014-12-31 5.0 0.0 0.0 0.0
2015-12-31 0.0 5.0 0.0 0.0
2016-12-31 0.0 0.0 5.0 0.0
2017-12-31 0.0 0.0 0.0 5.0
I would like to compute the NPV of each variable for the years 2014 and 2015, over a 3-year horizon.
Right now I know how to obtain the present value for one variable and one row at a time:
Var1_2014 = df.loc['2014':'2016','Var1'].tolist()
NPV_Var1_2014 = np.npv(0.7,[0]+Var1_2014)
However, I do not know how to vectorize the function to compute the entire column directly. I would like to obtain something like this:
Var1 Var2 Var3 Var4 Var1_NPV
2014-12-31 5.0 0.0 0.0 0.0 a
2015-12-31 0.0 5.0 0.0 0.0 b
2016-12-31 0.0 0.0 5.0 0.0 NaN
2017-12-31 0.0 0.0 0.0 5.0 NaN
where I could say something like df['Var1_NPV']= npv('Var1',duration=3years,discount_rate=0.7)
Any idea on how I could vectorize that function efficiently?
Many thanks,
I found a solution with apply and DateOffset:
def give_npv(date, df, var, wacc):
    date2 = date + pd.DateOffset(years=2)
    data = df.loc[date:date2, var].tolist()
    NPV_var = np.npv(wacc, [0] + data)
    return NPV_var

df['index2'] = df.index
df['test'] = df.apply(lambda x: give_npv(x['index2'], df, 'Var2', 0.07), axis=1)
print(df)
Var1 Var2 Var3 Var4 index2 test
2014-12-31 5.0 0.0 0.0 0.0 2014-12-31 4.367194
2015-12-31 0.0 5.0 0.0 0.0 2015-12-31 4.672897
2016-12-31 0.0 0.0 5.0 0.0 2016-12-31 0.000000
2017-12-31 0.0 0.0 0.0 5.0 2017-12-31 0.000000
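A side note: np.npv was deprecated and later removed from NumPy; the function now lives in the separate numpy-financial package. On recent versions the helper above would look roughly like this (a sketch, assuming numpy_financial is installed):

import numpy_financial as npf

def give_npv(date, df, var, wacc):
    # discount a 3-year window starting at `date`
    date2 = date + pd.DateOffset(years=2)
    data = df.loc[date:date2, var].tolist()
    return npf.npv(wacc, [0] + data)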

Pandas resample timeseries in to 24hours

I have the data like this:
OwnerUserId Score
CreationDate
2015-01-01 00:16:46.963 1491895.0 0.0
2015-01-01 00:23:35.983 1491895.0 1.0
2015-01-01 00:30:55.683 1491895.0 1.0
2015-01-01 01:10:43.830 2141635.0 0.0
2015-01-01 01:11:08.927 1491895.0 1.0
2015-01-01 01:12:34.273 3297613.0 1.0
..........
This is a whole year of data with different users' scores, and I hope to get data like:
OwnerUserId 1491895.0 1491895.0 1491895.0 2141635.0 1491895.0
00:00 0.0 3.0 0.0 3.0 5.8
00:01 5.0 3.0 0.0 3.0 5.8
00:02 3.0 33.0 20.0 3.0 5.8
......
23:40 12.0 33.0 10.0 3.0 5.8
23:41 32.0 33.0 20.0 3.0 5.8
23:42 12.0 13.0 10.0 3.0 5.8
Each element of the dataframe is the score (mean or sum). I have tried the following:

pd.pivot_table(data_series.reset_index(), index=['CreationDate'],
               columns=['OwnerUserId'],
               fill_value=0).resample('W').sum()['Score']
I get the result shown in the image.
I think you need:
# remove [] and pass values= to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
                    index='CreationDate',
                    columns='OwnerUserId',
                    values='Score',
                    fill_value=0)
# truncate seconds and convert to a TimedeltaIndex
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))
# or round to minutes
# df.index = pd.to_timedelta(df.index.round('T').strftime('%H:%M:%S'))
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:16:00 0 0 0
00:23:00 1 0 0
00:30:00 1 0 0
01:10:00 0 0 0
01:11:00 1 0 0
01:12:00 0 0 1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
# resample by minutes and aggregate sum; to add the missing rows, use reindex
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:00:00 0.0 0.0 0.0
00:01:00 0.0 0.0 0.0
00:02:00 0.0 0.0 0.0
00:03:00 0.0 0.0 0.0
00:04:00 0.0 0.0 0.0
00:05:00 0.0 0.0 0.0
00:06:00 0.0 0.0 0.0
...
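A side note: on recent pandas versions the 'T' alias is deprecated in favor of 'min', so the resample/reindex step above would be written roughly as (a sketch under that assumption):

idx = pd.timedelta_range('00:00:00', '23:59:00', freq='min')
df = df.resample('min').sum().fillna(0).reindex(idx, fill_value=0)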
