How to transform many_days to daily in Python?

I have the following dataframe:
import pandas as pd
dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
                   'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
                   'reg': ['A', 'B', 'A'],
                   'measure': [100, 200, 1000]})
I would like to create a new column, called 'date', which will hold each date from start_date through end_date, and also a new column measure_daily, which will be the measure spread equally across those dates.
So basically, I would like to expand dt in terms of rows.
So I would like the final df to look like:
dt_f = pd.DataFrame({'date': ['2019-05-20', '2019-05-21', '2019-05-22', '2019-05-23', '2019-05-21',
                              '2019-05-22', '2019-05-23', '2019-05-24', '2019-05-21', '2019-05-22'],
                     'reg': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
                     'measure_daily': [25, 25, 25, 25, 50, 50, 50, 50, 500, 500]})
Is there an efficient way to do this in Python?

TL;DR
just give me the solution:
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])  # resample below needs a DatetimeIndex
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill() for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
melt.assign(measure=melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
Breakdown:
First we melt your start and end dates into the same column and parse them as datetimes:
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])
  reg  measure  key        date
0   A      100    0  2019-05-20
1   B      200    1  2019-05-21
2   A     1000    2  2019-05-21
3   A      100    0  2019-05-23
4   B      200    1  2019-05-24
5   A     1000    2  2019-05-22
Then we resample on a daily basis, while applying groupby to keep the different reg values in their own groups.
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill() for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
        date reg  measure  key
0 2019-05-20   A    100.0  0.0
1 2019-05-21   A    100.0  0.0
2 2019-05-22   A    100.0  0.0
3 2019-05-23   A    100.0  0.0
4 2019-05-21   B    200.0  1.0
5 2019-05-22   B    200.0  1.0
6 2019-05-23   B    200.0  1.0
7 2019-05-24   B    200.0  1.0
8 2019-05-21   A   1000.0  2.0
9 2019-05-22   A   1000.0  2.0
Finally we spread out the measure column over the size of each group with assign:
melt.assign(measure=melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
        date reg  measure
0 2019-05-20   A     25.0
1 2019-05-21   A     25.0
2 2019-05-22   A     25.0
3 2019-05-23   A     25.0
4 2019-05-21   B     50.0
5 2019-05-22   B     50.0
6 2019-05-23   B     50.0
7 2019-05-24   B     50.0
8 2019-05-21   A    500.0
9 2019-05-22   A    500.0
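As an aside, on pandas 0.25+ the same row expansion can be written more compactly with one date_range per row plus DataFrame.explode. A minimal sketch, assuming the dt frame from the question:
import pandas as pd

dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
                   'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
                   'reg': ['A', 'B', 'A'],
                   'measure': [100, 200, 1000]})

# Build one date_range per row, then explode it into one row per day.
out = dt.assign(date=[pd.date_range(s, e) for s, e in zip(dt['start_date'], dt['end_date'])])
out = out.explode('date')

# explode keeps the original row index, so grouping on it gives each range's length in days.
out['measure_daily'] = out['measure'] / out.groupby(level=0)['date'].transform('size')
print(out[['date', 'reg', 'measure_daily']].reset_index(drop=True))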

Related

Pandas fill missing dates and values simultaneously for each group

I have a dataframe (mydf) with dates for each group in monthly frequency like below:
        Dt Id  Sales
2021-03-01  B      2
2021-04-01  B     42
2021-05-01  B     20
2021-06-01  B      4
2020-10-01  A     47
2020-11-01  A     67
2020-12-01  A     46
I want to fill Dt for each group up to the maximum date in the Dt column, starting from each Id's own first date, while simultaneously filling in 0 for the Sales column. So each group starts at its own start date but ends at the same end date.
So, for example, Id=A will start from 2020-10-01 and go all the way to 2021-06-01, and the value for the filled dates will be 0.
So the output will be:
        Dt Id  Sales
2021-03-01  B      2
2021-04-01  B     42
2021-05-01  B     20
2021-06-01  B      4
2020-10-01  A     47
2020-11-01  A     67
2020-12-01  A     46
2021-01-01  A      0
2021-02-01  A      0
2021-03-01  A      0
2021-04-01  A      0
2021-05-01  A      0
2021-06-01  A      0
I have tried reindex, but instead of adding the date range manually I want to use the dates in the groups.
My code is:
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()
Let's try:
Get the minimum value per group using groupby.min.
Add a new column to the aggregated mins, called max, which stores the maximum value from the frame using Series.max on Dt.
Create an individual date_range per group based on the min and max values.
Series.explode into rows to get a DataFrame that represents the new index.
Create a MultiIndex.from_frame to reindex the DataFrame with.
reindex with midx and set fill_value=0.
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()
# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)
mydf:
            Dt Id  Sales
0   2020-10-01  A     47
1   2020-11-01  A     67
2   2020-12-01  A     46
3   2021-01-01  A      0
4   2021-02-01  A      0
5   2021-03-01  A      0
6   2021-04-01  A      0
7   2021-05-01  A      0
8   2021-06-01  A      0
9   2021-03-01  B      2
10  2021-04-01  B     42
11  2021-05-01  B     20
12  2021-06-01  B      4
DataFrame:
import pandas as pd
mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
An alternative using pd.MultiIndex with a list comprehension:
s = pd.MultiIndex.from_tuples([[x, d]
                               for x, y in mydf.groupby("Id")["Dt"]
                               for d in pd.date_range(min(y), max(mydf["Dt"]), freq="MS")],
                              names=["Id", "Dt"])
print(mydf.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
Here is a different approach:
from itertools import product
# compute the min-max date range
date_range = pd.date_range(*mydf['Dt'].agg(['min', 'max']), freq='MS', name='Dt')
# make a MultiIndex per group, keeping only dates above each group's min date
idx = pd.MultiIndex.from_tuples([e for Id, Dt_min in mydf.groupby('Id')['Dt'].min().items()
                                 for e in list(product(date_range[date_range > Dt_min], [Id]))])
# concatenate the original dataframe and the missing indexes
mydf = mydf.set_index(['Dt', 'Id'])
mydf = pd.concat([mydf,
                  mydf.reindex(idx.difference(mydf.index)).fillna(0)]
                 ).sort_index(level=1).reset_index()
mydf
output:
            Dt Id  Sales
0   2020-10-01  A   47.0
1   2020-11-01  A   67.0
2   2020-12-01  A   46.0
3   2021-01-01  A    0.0
4   2021-02-01  A    0.0
5   2021-03-01  A    0.0
6   2021-04-01  A    0.0
7   2021-05-01  A    0.0
8   2021-06-01  A    0.0
9   2021-03-01  B    2.0
10  2021-04-01  B   42.0
11  2021-05-01  B   20.0
12  2021-06-01  B    4.0
We can use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import janitor
import pandas as pd
Convert Dt to datetime:
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
Create a mapping of Dt to new values via pd.date_range, with the frequency set to month start (MS):
max_time = mydf.Dt.max()
new_values = {"Dt": lambda df: pd.date_range(df.min(), max_time, freq='1MS')}
mydf.complete([new_values], by='Id').fillna(0)
   Id         Dt  Sales
0   A 2020-10-01   47.0
1   A 2020-11-01   67.0
2   A 2020-12-01   46.0
3   A 2021-01-01    0.0
4   A 2021-02-01    0.0
5   A 2021-03-01    0.0
6   A 2021-04-01    0.0
7   A 2021-05-01    0.0
8   A 2021-06-01    0.0
9   B 2021-03-01    2.0
10  B 2021-04-01   42.0
11  B 2021-05-01   20.0
12  B 2021-06-01    4.0
Sticking to Pandas only, we can combine apply with groupby and reindex; thankfully, Dt is unique per Id, so we can safely reindex:
(mydf
 .set_index('Dt')
 .groupby('Id')
 .apply(lambda df: df.reindex(pd.date_range(df.index.min(),
                                            max_time,
                                            freq='1MS'),
                              fill_value=0))
 .drop(columns='Id')
 .rename_axis(['Id', 'Dt'])
 .reset_index())
   Id         Dt  Sales
0   A 2020-10-01     47
1   A 2020-11-01     67
2   A 2020-12-01     46
3   A 2021-01-01      0
4   A 2021-02-01      0
5   A 2021-03-01      0
6   A 2021-04-01      0
7   A 2021-05-01      0
8   A 2021-06-01      0
9   B 2021-03-01      2
10  B 2021-04-01     42
11  B 2021-05-01     20
12  B 2021-06-01      4

Calculate delta between two columns and two following rows for different groups

Are there any vector operations for improving runtime?
I found no other way besides for loops.
Sample DataFrame:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})
  ID start_date  start_value  end_value
0  1     01-Jan           12       20.0
1  1     05-Jan           15       17.0
2  1     08-Jan            1        6.0
3  2     05-Jan            3       19.0
4  2     06-Jan            2       13.5
5  2     10-Jan            6        9.0
I've tried:
import pandas as pd
df_original  # contains data
data_frame_diff = pd.DataFrame()
for ID in df_original['ID'].unique():
    tmp_frame = df_original.loc[df_original['ID'] == ID]
    tmp_start_value = 0
    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['start_value']
        tmp_start_value = row['end_value']
        row['last_delta'] = last_delta
        data_frame_diff = data_frame_diff.append(row, True)
Expected Result:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'last_delta': [0, 5, 16, 0, 17, 7.5]})
  ID start_date  last_delta
0  1     01-Jan         0.0
1  1     05-Jan         5.0
2  1     08-Jan        16.0
3  2     05-Jan         0.0
4  2     06-Jan        17.0
5  2     10-Jan         7.5
I want to calculate, for each user ID, the delta between the end_value of one timestamp and the start_value of the following timestamp.
Is there a way to improve runtime of this code?
Use DataFrame.groupby on ID and shift the column end_value, then use Series.sub to subtract start_value from it; finally, use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)
Result:
print(df1)
  ID start_date  last_delta
0  1     01-Jan         0.0
1  1     05-Jan         5.0
2  1     08-Jan        16.0
3  2     05-Jan         0.0
4  2     06-Jan        17.0
5  2     10-Jan         7.5
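For reference, here is a self-contained version of the same approach that runs as-is (a sketch using the sample frame from the question):
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})

# Previous row's end_value within each ID minus this row's start_value;
# the first row of each group has no predecessor, so its NaN becomes 0.
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
print(df[['ID', 'start_date']].assign(last_delta=s))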
It's a bit difficult to follow from your description what you need, but you might find this helpful:
import pandas as pd
df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H")})
        .reset_index()
        .rename(columns={'index': 'ID'}))
df['t2'] = df['t1'] + pd.Timedelta(value=10, unit="H")
df['delta_t1_t2'] = df['t2'] - df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()
print(df)
It results in
   ID                  t1                  t2 delta_t1_t2 delta_to_previous_t1
0   0 2020-01-01 00:00:00 2020-01-01 10:00:00    10:00:00                  NaT
1   1 2020-01-01 01:00:00 2020-01-01 11:00:00    10:00:00             01:00:00
2   2 2020-01-01 02:00:00 2020-01-01 12:00:00    10:00:00             01:00:00
3   3 2020-01-01 03:00:00 2020-01-01 13:00:00    10:00:00             01:00:00

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and the max of that group, and output a dataframe with the last row per date in each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04', '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
           dt  sub_id  amount
0  2016-01-01       1    10.0
1  2016-01-03       1    30.0
2  2016-01-04       1    40.0
3  2016-01-01       2    78.0
4  2016-01-01       2    80.0
5  2016-01-04       2    82.0
Output I need:
           dt  sub_id  amount
0  2016-01-01       1    10.0
1  2016-01-02       1    10.0
2  2016-01-03       1    30.0
3  2016-01-04       1    40.0
4  2016-01-01       2    80.0
5  2016-01-02       2    80.0
6  2016-01-03       2    80.0
7  2016-01-04       2    82.0
We are grouping by dt and sub_id. As you can see, in sub_id=1, a row was added for 2016-01-02 and amount was imputed at 10.0, as the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2016-01-02 and 2016-01-03, and amount is 80.0, as that was the last row before those dates. The first row for 2016-01-01 was also deleted, because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()
          dt  amount  sub_id
0 2016-01-01      10       1
1 2016-01-02      10       1
2 2016-01-03      30       1
3 2016-01-04      40       1
4 2016-01-01      80       2
5 2016-01-02      80       2
6 2016-01-03      80       2
7 2016-01-04      82       2
By using resample with groupby
x.dt = pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x: x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
                dt  amount  sub_id
sub_id
1       2016-01-01    10.0     1.0
1       2016-01-02    10.0     1.0
1       2016-01-03    30.0     1.0
1       2016-01-04    40.0     1.0
2       2016-01-01    80.0     2.0
2       2016-01-02    80.0     2.0
2       2016-01-03    80.0     2.0
2       2016-01-04    82.0     2.0
use asfreq & groupby
first convert dt to datetime & get rid of duplicates
then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts
finally reset_index on amount column as there's a duplicate sub_id column as well as index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
    ['dt', 'sub_id'], keep='last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
   sub_id         dt  amount
0       1 2016-01-01    10.0
1       1 2016-01-02    10.0
2       1 2016-01-03    30.0
3       1 2016-01-04    40.0
4       2 2016-01-01    80.0
5       2 2016-01-02    80.0
6       2 2016-01-03    80.0
7       2 2016-01-04    82.0
The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
          dt  sub_id  amount
0 2016-01-01       1    10.0
2 2016-01-02       1    10.0
4 2016-01-03       1    30.0
6 2016-01-04       1    40.0
1 2016-01-01       2    80.0
3 2016-01-02       2    80.0
5 2016-01-03       2    80.0
7 2016-01-04       2    82.0

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date', 'id', 'value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where each value is assigned to the corresponding date row if it exists, and NaN is assigned if it doesn't. Many thanks!
assuming you have the following DF:
In [120]: df
Out[120]:
         date  id  value
0  2001-01-01   1     10
1  2001-01-01   2     11
2  2001-01-01   3     12
3  2001-01-02   3     20
4  2001-01-03   1     20
5  2001-01-04   2     30
you can use pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id             1     2     3
date
2001-01-01  10.0  11.0  12.0
2001-01-02   NaN   NaN  20.0
2001-01-03  20.0   NaN   NaN
2001-01-04   NaN  30.0   NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id           1   2   3
date
2001-01-01  10  11  12
2001-01-02   0   0  20
2001-01-03  20   0   0
2001-01-04   0  30   0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=5),
                   'id': [4, 5, 6, 4, 5],
                   'value': [7, 8, 9, 1, 2]})
print (df)
        date  id  value
0 2017-01-01   4      7
1 2017-01-02   5      8
2 2017-01-03   6      9
3 2017-01-04   4      1
4 2017-01-05   5      2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id            4    5    6
date
2017-01-01  7.0  NaN  NaN
2017-01-02  NaN  8.0  NaN
2017-01-03  NaN  NaN  9.0
2017-01-04  1.0  NaN  NaN
2017-01-05  NaN  2.0  NaN
but if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean or sum with groupby or pivot_table:
df = pd.DataFrame({'date': ['2017-01-01', '2017-01-02',
                            '2017-01-03', '2017-01-05', '2017-01-05'],
                   'id': [4, 5, 6, 4, 4],
                   'value': [7, 8, 9, 1, 2]})
df.date = pd.to_datetime(df.date)
print (df)
        date  id  value
0 2017-01-01   4      7
1 2017-01-02   5      8
2 2017-01-03   6      9
3 2017-01-05   4      1  <- duplicated 2017-01-05 4
4 2017-01-05   4      2  <- duplicated 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same result as the groupby approach, only slower on a big df)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id            4    5    6
date
2017-01-01  7.0  NaN  NaN
2017-01-02  NaN  8.0  NaN
2017-01-03  NaN  NaN  9.0
2017-01-05  1.5  NaN  NaN  <- 1.5 is the mean (1 + 2)/2

Dates fill in for date range and fillna

I am querying my database to show records from the past week. I am then aggregating the data and transposing it in python and pandas into a DataFrame.
In this table I am attempting to show what occurred on each day in the past week; however, on some days no events occur. In these cases, the date is missing altogether. I am looking for an approach to append the dates that are not present (but are part of the date range specified in the query) so that I can then fillna with any value I wish for the other missing columns.
In some trials I have the data set into a pandas DataFrame where the dates are the index, and in others the dates are a column. I am preferably looking to have the dates as the top index, so: group by name, stack purchase and send_back, and the dates are the 'columns'.
Here is an example of how the dataframe looks now and what I am looking for:
Dates set in the query: 01.08.2016 - 08.08.2016. The dataframe looks like so:
        dates     name  purchase  send_back
0  01.08.2016  Michael       120          0
1  02.08.2016    Sarah       100         40
2  04.08.2016    Sarah        55          0
3  05.08.2016  Michael        80         20
4  07.08.2016    Sarah       130          0
After:
        dates     name  purchase  send_back
0  01.08.2016  Michael       120          0
1  02.08.2016    Sarah       100         40
2  03.08.2016        -         0          0
3  04.08.2016    Sarah        55          0
4  05.08.2016  Michael        80         20
5  06.08.2016        -         0          0
6  07.08.2016    Sarah       130          0
7  08.08.2016    Sarah         0         35
8  08.08.2016  Michael        20          0
Printing df.index gives:
Index([u'dates', u'name', u'purchase', u'send_back'], dtype='object')
RangeIndex(start=0, stop=1, step=1)
I appreciate any guidance.
assuming you have the following DF:
In [93]: df
Out[93]:
               name  purchase  send_back
dates
2016-08-01  Michael       120          0
2016-08-02    Sarah       100         40
2016-08-04    Sarah        55          0
2016-08-05  Michael        80         20
2016-08-07    Sarah       130          0
you can resample and replace:
In [94]: df.resample('D').first().replace({'name': {np.nan: '-'}}).fillna(0)
Out[94]:
               name  purchase  send_back
dates
2016-08-01  Michael     120.0        0.0
2016-08-02    Sarah     100.0       40.0
2016-08-03        -       0.0        0.0
2016-08-04    Sarah      55.0        0.0
2016-08-05  Michael      80.0       20.0
2016-08-06        -       0.0        0.0
2016-08-07    Sarah     130.0        0.0
Your index is of object type and you must convert it to datetime format.
from datetime import datetime

# Converting the object dates to datetime
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
# Setting the index column
df.set_index(['dates'], inplace=True)
# Choosing a date range extending from the first date to the last, with daily frequency
new_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
# Making the required modifications (.ix is deprecated; use .iloc instead)
df.iloc[:, 0] = df.iloc[:, 0].fillna('-')
df.iloc[:, 1:] = df.iloc[:, 1:].fillna(0)
print (df)
               name  purchase  send_back
dates
2016-08-01  Michael     120.0        0.0
2016-08-02    Sarah     100.0       40.0
2016-08-03        -       0.0        0.0
2016-08-04    Sarah      55.0        0.0
2016-08-05  Michael      80.0       20.0
2016-08-06        -       0.0        0.0
2016-08-07    Sarah     130.0        0.0
Let's suppose you have data for a single day (as mentioned in the comments section) and you would like to fill the other days of the week with null values:
Data Setup:
df = pd.DataFrame({'dates': ['01.08.2016'], 'name': ['Michael'],
                   'purchase': [120], 'send_back': [0]})
print (df)
        dates     name  purchase  send_back
0  01.08.2016  Michael       120          0
Operations:
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
df.set_index(['dates'], inplace=True)
# Setting periods as 7 to account for the end of the week
new_index = pd.date_range(start=df.index[0], periods=7, freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
print (df)
               name  purchase  send_back
dates
2016-08-01  Michael     120.0        0.0
2016-08-02      NaN       NaN        NaN
2016-08-03      NaN       NaN        NaN
2016-08-04      NaN       NaN        NaN
2016-08-05      NaN       NaN        NaN
2016-08-06      NaN       NaN        NaN
2016-08-07      NaN       NaN        NaN
In case you want to fill the null values with 0's, you could do:
df.fillna(0, inplace=True)
print (df)
               name  purchase  send_back
dates
2016-08-01  Michael     120.0        0.0
2016-08-02        0       0.0        0.0
2016-08-03        0       0.0        0.0
2016-08-04        0       0.0        0.0
2016-08-05        0       0.0        0.0
2016-08-06        0       0.0        0.0
2016-08-07        0       0.0        0.0
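Note that the snippets above keep a single row per day; if, as the question mentions ('group by name'), each name should instead keep its own filled date range, one option is a MultiIndex reindex. A sketch, assuming the parsed frame from the answers above and the queried window 01.08.2016 - 08.08.2016:
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['01.08.2016', '02.08.2016', '04.08.2016',
                                            '05.08.2016', '07.08.2016'], format='%d.%m.%Y'),
                   'name': ['Michael', 'Sarah', 'Sarah', 'Michael', 'Sarah'],
                   'purchase': [120, 100, 55, 80, 130],
                   'send_back': [0, 40, 0, 20, 0]})

# One (name, date) pair for every name and every day in the queried window.
full_range = pd.date_range('2016-08-01', '2016-08-08', freq='D', name='dates')
idx = pd.MultiIndex.from_product([df['name'].unique(), full_range], names=['name', 'dates'])

# Pairs that never occurred get 0 in both numeric columns.
filled = df.set_index(['name', 'dates']).reindex(idx, fill_value=0).reset_index()
print(filled)
This yields one row per name per day (16 rows here), which differs slightly from the single '-' placeholder rows shown in the question, but keeps each name's series complete.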
